diff mbox series

btrfs: tree-checker: add type and sequence check for inline backrefs

Message ID 23fbab97bd9dbce7869e858cb59d96a7238db57e.1698105469.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: tree-checker: add type and sequence check for inline backrefs | expand

Commit Message

Qu Wenruo Oct. 24, 2023, 2:11 a.m. UTC
[BUG]
There is a bug report that ntfs2btrfs had a bug that it can lead to
transaction abort and the filesystem flips to read-only.

[CAUSE]
For inline backref items, kernel has a strict requirement for their
ordered, they must follow the following rules:

- All btrfs_extent_inline_ref::type should be in an ascending order

- Within the same type, the items should follow a descending order by
  their sequence number

  For EXTENT_DATA_REF type, the sequence number is result from
  hash_extent_data_ref().
  For other types, their sequence numbers are
  btrfs_extent_inline_ref::offset.

Thus if there is any code not following above rules, the resulted
inline backrefs can prevent the kernel to locate the needed inline
backref and lead to transaction abort.

[FIX]
Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
ability to detect such problems.

For kernel, let's be more noisy and be more specific about the order, so
that the next time kernel hits such problem we would reject it in the
first place, without leading to transaction abort.

Link: https://github.com/kdave/btrfs-progs/pull/622
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/tree-checker.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

Comments

Josef Bacik Oct. 24, 2023, 1:48 p.m. UTC | #1
On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> [BUG]
> There is a bug report that ntfs2btrfs had a bug that it can lead to
> transaction abort and the filesystem flips to read-only.
> 
> [CAUSE]
> For inline backref items, kernel has a strict requirement for their
> ordered, they must follow the following rules:
> 
> - All btrfs_extent_inline_ref::type should be in an ascending order
> 
> - Within the same type, the items should follow a descending order by
>   their sequence number
> 
>   For EXTENT_DATA_REF type, the sequence number is result from
>   hash_extent_data_ref().
>   For other types, their sequence numbers are
>   btrfs_extent_inline_ref::offset.
> 
> Thus if there is any code not following above rules, the resulted
> inline backrefs can prevent the kernel to locate the needed inline
> backref and lead to transaction abort.
> 
> [FIX]
> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> ability to detect such problems.
> 
> For kernel, let's be more noisy and be more specific about the order, so
> that the next time kernel hits such problem we would reject it in the
> first place, without leading to transaction abort.
> 
> Link: https://github.com/kdave/btrfs-progs/pull/622
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Does fsck catch this?  If not can you update it so it does?  Thanks,

Josef
David Sterba Oct. 24, 2023, 7:53 p.m. UTC | #2
On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> [BUG]
> There is a bug report that ntfs2btrfs had a bug that it can lead to
> transaction abort and the filesystem flips to read-only.
> 
> [CAUSE]
> For inline backref items, kernel has a strict requirement for their
> ordered, they must follow the following rules:
> 
> - All btrfs_extent_inline_ref::type should be in an ascending order
> 
> - Within the same type, the items should follow a descending order by
>   their sequence number
> 
>   For EXTENT_DATA_REF type, the sequence number is result from
>   hash_extent_data_ref().
>   For other types, their sequence numbers are
>   btrfs_extent_inline_ref::offset.
> 
> Thus if there is any code not following above rules, the resulted
> inline backrefs can prevent the kernel to locate the needed inline
> backref and lead to transaction abort.
> 
> [FIX]
> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> ability to detect such problems.
> 
> For kernel, let's be more noisy and be more specific about the order, so
> that the next time kernel hits such problem we would reject it in the
> first place, without leading to transaction abort.
> 
> Link: https://github.com/kdave/btrfs-progs/pull/622
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Added to misc-next, thanks.
Qu Wenruo Oct. 24, 2023, 8:36 p.m. UTC | #3
On 2023/10/25 00:18, Josef Bacik wrote:
> On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
>> [BUG]
>> There is a bug report that ntfs2btrfs had a bug that it can lead to
>> transaction abort and the filesystem flips to read-only.
>>
>> [CAUSE]
>> For inline backref items, kernel has a strict requirement for their
>> ordered, they must follow the following rules:
>>
>> - All btrfs_extent_inline_ref::type should be in an ascending order
>>
>> - Within the same type, the items should follow a descending order by
>>    their sequence number
>>
>>    For EXTENT_DATA_REF type, the sequence number is result from
>>    hash_extent_data_ref().
>>    For other types, their sequence numbers are
>>    btrfs_extent_inline_ref::offset.
>>
>> Thus if there is any code not following above rules, the resulted
>> inline backrefs can prevent the kernel to locate the needed inline
>> backref and lead to transaction abort.
>>
>> [FIX]
>> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
>> ability to detect such problems.
>>
>> For kernel, let's be more noisy and be more specific about the order, so
>> that the next time kernel hits such problem we would reject it in the
>> first place, without leading to transaction abort.
>>
>> Link: https://github.com/kdave/btrfs-progs/pull/622
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> Does fsck catch this?  If not can you update it so it does?  Thanks,

Fsck is already catching that, and a progs selftest image is also added.

Thanks,
Qu

>
> Josef
Boris Burkov Nov. 2, 2023, 7:07 p.m. UTC | #4
On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> [BUG]
> There is a bug report that ntfs2btrfs had a bug that it can lead to
> transaction abort and the filesystem flips to read-only.
> 
> [CAUSE]
> For inline backref items, kernel has a strict requirement for their
> ordered, they must follow the following rules:
> 
> - All btrfs_extent_inline_ref::type should be in an ascending order
> 
> - Within the same type, the items should follow a descending order by
>   their sequence number
> 
>   For EXTENT_DATA_REF type, the sequence number is result from
>   hash_extent_data_ref().
>   For other types, their sequence numbers are
>   btrfs_extent_inline_ref::offset.
> 
> Thus if there is any code not following above rules, the resulted
> inline backrefs can prevent the kernel to locate the needed inline
> backref and lead to transaction abort.
> 
> [FIX]
> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> ability to detect such problems.
> 
> For kernel, let's be more noisy and be more specific about the order, so
> that the next time kernel hits such problem we would reject it in the
> first place, without leading to transaction abort.
> 
> Link: https://github.com/kdave/btrfs-progs/pull/622
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/tree-checker.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index a416cbea75d1..981ad301d29d 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -31,6 +31,7 @@
>  #include "inode-item.h"
>  #include "dir-item.h"
>  #include "raid-stripe-tree.h"
> +#include "extent-tree.h"
>  
>  /*
>   * Error message should follow the following format:
> @@ -1276,6 +1277,8 @@ static int check_extent_item(struct extent_buffer *leaf,
>  	unsigned long ptr;	/* Current pointer inside inline refs */
>  	unsigned long end;	/* Extent item end */
>  	const u32 item_size = btrfs_item_size(leaf, slot);
> +	u8 last_type = 0;
> +	u64 last_seq = U64_MAX;
>  	u64 flags;
>  	u64 generation;
>  	u64 total_refs;		/* Total refs in btrfs_extent_item */
> @@ -1322,6 +1325,17 @@ static int check_extent_item(struct extent_buffer *leaf,
>  	 *    2.2) Ref type specific data
>  	 *         Either using btrfs_extent_inline_ref::offset, or specific
>  	 *         data structure.
> +	 *    All above inline items should follow the order:
> +	 *
> +	 *    - All btrfs_extent_inline_ref::type should be in an ascending
> +	 *      order
> +	 *
> +	 *    - Within the same type, the items should follow a descending
> +	 *      order by their sequence number
> +	 *      The sequence number is determined by:
> +	 *      * btrfs_extent_inline_ref::offset for all types  other than
> +	 *        EXTENT_DATA_REF
> +	 *      * hash_extent_data_ref() for EXTENT_DATA_REF
>  	 */
>  	if (unlikely(item_size < sizeof(*ei))) {
>  		extent_err(leaf, slot,
> @@ -1403,6 +1417,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>  		struct btrfs_extent_inline_ref *iref;
>  		struct btrfs_extent_data_ref *dref;
>  		struct btrfs_shared_data_ref *sref;
> +		u64 seq;
>  		u64 dref_offset;
>  		u64 inline_offset;
>  		u8 inline_type;
> @@ -1416,6 +1431,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>  		iref = (struct btrfs_extent_inline_ref *)ptr;
>  		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
>  		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
> +		seq = inline_offset;
>  		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
>  			extent_err(leaf, slot,
>  "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
> @@ -1446,6 +1462,10 @@ static int check_extent_item(struct extent_buffer *leaf,
>  		case BTRFS_EXTENT_DATA_REF_KEY:
>  			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
>  			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
> +			seq = hash_extent_data_ref(
> +					btrfs_extent_data_ref_root(leaf, dref),
> +					btrfs_extent_data_ref_objectid(leaf, dref),
> +					btrfs_extent_data_ref_offset(leaf, dref));
>  			if (unlikely(!IS_ALIGNED(dref_offset,
>  						 fs_info->sectorsize))) {
>  				extent_err(leaf, slot,
> @@ -1475,6 +1495,24 @@ static int check_extent_item(struct extent_buffer *leaf,
>  				   inline_type);
>  			return -EUCLEAN;
>  		}
> +		if (inline_type < last_type) {
> +			extent_err(leaf, slot,
> +				   "inline ref out-of-order: has type %u, prev type %u",
> +				   inline_type, last_type);
> +			return -EUCLEAN;
> +		}
> +		/* Type changed, allow the sequence starts from U64_MAX again. */
> +		if (inline_type > last_type)
> +			last_seq = U64_MAX;
> +		if (seq > last_seq) {
> +			extent_err(leaf, slot,
> +"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
> +				   inline_type, inline_offset, seq,
> +				   last_type, last_seq);
> +			return -EUCLEAN;
> +		}
> +		last_type = inline_type;
> +		last_seq = seq;
>  		ptr += btrfs_extent_inline_ref_size(inline_type);
>  	}
>  	/* No padding is allowed */
> -- 
> 2.42.0
> 

I believe this breaks simple quotas EXTENT_OWNER_REF_KEY items which
have type 188 but come in before the other inline items.

For a repro, btrfs/301 (available in the master fstests branch) fails
with the patch but passes without it.
Qu Wenruo Nov. 2, 2023, 8:17 p.m. UTC | #5
On 2023/11/3 05:37, Boris Burkov wrote:
> On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
>> [BUG]
>> There is a bug report that ntfs2btrfs had a bug that it can lead to
>> transaction abort and the filesystem flips to read-only.
>>
>> [CAUSE]
>> For inline backref items, kernel has a strict requirement for their
>> ordered, they must follow the following rules:
>>
>> - All btrfs_extent_inline_ref::type should be in an ascending order
>>
>> - Within the same type, the items should follow a descending order by
>>    their sequence number
>>
>>    For EXTENT_DATA_REF type, the sequence number is result from
>>    hash_extent_data_ref().
>>    For other types, their sequence numbers are
>>    btrfs_extent_inline_ref::offset.
>>
>> Thus if there is any code not following above rules, the resulted
>> inline backrefs can prevent the kernel to locate the needed inline
>> backref and lead to transaction abort.
>>
>> [FIX]
>> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
>> ability to detect such problems.
>>
>> For kernel, let's be more noisy and be more specific about the order, so
>> that the next time kernel hits such problem we would reject it in the
>> first place, without leading to transaction abort.
>>
>> Link: https://github.com/kdave/btrfs-progs/pull/622
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/tree-checker.c | 38 ++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 38 insertions(+)
>>
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index a416cbea75d1..981ad301d29d 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -31,6 +31,7 @@
>>   #include "inode-item.h"
>>   #include "dir-item.h"
>>   #include "raid-stripe-tree.h"
>> +#include "extent-tree.h"
>>
>>   /*
>>    * Error message should follow the following format:
>> @@ -1276,6 +1277,8 @@ static int check_extent_item(struct extent_buffer *leaf,
>>   	unsigned long ptr;	/* Current pointer inside inline refs */
>>   	unsigned long end;	/* Extent item end */
>>   	const u32 item_size = btrfs_item_size(leaf, slot);
>> +	u8 last_type = 0;
>> +	u64 last_seq = U64_MAX;
>>   	u64 flags;
>>   	u64 generation;
>>   	u64 total_refs;		/* Total refs in btrfs_extent_item */
>> @@ -1322,6 +1325,17 @@ static int check_extent_item(struct extent_buffer *leaf,
>>   	 *    2.2) Ref type specific data
>>   	 *         Either using btrfs_extent_inline_ref::offset, or specific
>>   	 *         data structure.
>> +	 *    All above inline items should follow the order:
>> +	 *
>> +	 *    - All btrfs_extent_inline_ref::type should be in an ascending
>> +	 *      order
>> +	 *
>> +	 *    - Within the same type, the items should follow a descending
>> +	 *      order by their sequence number
>> +	 *      The sequence number is determined by:
>> +	 *      * btrfs_extent_inline_ref::offset for all types  other than
>> +	 *        EXTENT_DATA_REF
>> +	 *      * hash_extent_data_ref() for EXTENT_DATA_REF
>>   	 */
>>   	if (unlikely(item_size < sizeof(*ei))) {
>>   		extent_err(leaf, slot,
>> @@ -1403,6 +1417,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>>   		struct btrfs_extent_inline_ref *iref;
>>   		struct btrfs_extent_data_ref *dref;
>>   		struct btrfs_shared_data_ref *sref;
>> +		u64 seq;
>>   		u64 dref_offset;
>>   		u64 inline_offset;
>>   		u8 inline_type;
>> @@ -1416,6 +1431,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>>   		iref = (struct btrfs_extent_inline_ref *)ptr;
>>   		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
>>   		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
>> +		seq = inline_offset;
>>   		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
>>   			extent_err(leaf, slot,
>>   "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
>> @@ -1446,6 +1462,10 @@ static int check_extent_item(struct extent_buffer *leaf,
>>   		case BTRFS_EXTENT_DATA_REF_KEY:
>>   			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
>>   			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
>> +			seq = hash_extent_data_ref(
>> +					btrfs_extent_data_ref_root(leaf, dref),
>> +					btrfs_extent_data_ref_objectid(leaf, dref),
>> +					btrfs_extent_data_ref_offset(leaf, dref));
>>   			if (unlikely(!IS_ALIGNED(dref_offset,
>>   						 fs_info->sectorsize))) {
>>   				extent_err(leaf, slot,
>> @@ -1475,6 +1495,24 @@ static int check_extent_item(struct extent_buffer *leaf,
>>   				   inline_type);
>>   			return -EUCLEAN;
>>   		}
>> +		if (inline_type < last_type) {
>> +			extent_err(leaf, slot,
>> +				   "inline ref out-of-order: has type %u, prev type %u",
>> +				   inline_type, last_type);
>> +			return -EUCLEAN;
>> +		}
>> +		/* Type changed, allow the sequence starts from U64_MAX again. */
>> +		if (inline_type > last_type)
>> +			last_seq = U64_MAX;
>> +		if (seq > last_seq) {
>> +			extent_err(leaf, slot,
>> +"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
>> +				   inline_type, inline_offset, seq,
>> +				   last_type, last_seq);
>> +			return -EUCLEAN;
>> +		}
>> +		last_type = inline_type;
>> +		last_seq = seq;
>>   		ptr += btrfs_extent_inline_ref_size(inline_type);
>>   	}
>>   	/* No padding is allowed */
>> --
>> 2.42.0
>>
>
> I believe this breaks simple quotas EXTENT_OWNER_REF_KEY items which
> have type 188 but come in before the other inline items.

Does it mean EXTENT_OWNER_REF_KEY is another odd ball, which doesn't
follow the existing type/inline ref order?

If so, can we fix it in the kernel first?

Thanks,
Qu
>
> For a repro, btrfs/301 (available in the master fstests branch) fails
> with the patch but passes without it.
David Sterba Nov. 2, 2023, 8:19 p.m. UTC | #6
On Fri, Nov 03, 2023 at 06:47:59AM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/3 05:37, Boris Burkov wrote:
> > On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> >> [BUG]
> >> There is a bug report that ntfs2btrfs had a bug that it can lead to
> >> transaction abort and the filesystem flips to read-only.
> >>
> >> [CAUSE]
> >> For inline backref items, kernel has a strict requirement for their
> >> ordered, they must follow the following rules:
> >>
> >> - All btrfs_extent_inline_ref::type should be in an ascending order
> >>
> >> - Within the same type, the items should follow a descending order by
> >>    their sequence number
> >>
> >>    For EXTENT_DATA_REF type, the sequence number is result from
> >>    hash_extent_data_ref().
> >>    For other types, their sequence numbers are
> >>    btrfs_extent_inline_ref::offset.
> >>
> >> Thus if there is any code not following above rules, the resulted
> >> inline backrefs can prevent the kernel to locate the needed inline
> >> backref and lead to transaction abort.
> >>
> >> [FIX]
> >> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> >> ability to detect such problems.
> >>
> >> For kernel, let's be more noisy and be more specific about the order, so
> >> that the next time kernel hits such problem we would reject it in the
> >> first place, without leading to transaction abort.
> >>
> >> Link: https://github.com/kdave/btrfs-progs/pull/622
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>   fs/btrfs/tree-checker.c | 38 ++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 38 insertions(+)
> >>
> >> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> >> index a416cbea75d1..981ad301d29d 100644
> >> --- a/fs/btrfs/tree-checker.c
> >> +++ b/fs/btrfs/tree-checker.c
> >> @@ -31,6 +31,7 @@
> >>   #include "inode-item.h"
> >>   #include "dir-item.h"
> >>   #include "raid-stripe-tree.h"
> >> +#include "extent-tree.h"
> >>
> >>   /*
> >>    * Error message should follow the following format:
> >> @@ -1276,6 +1277,8 @@ static int check_extent_item(struct extent_buffer *leaf,
> >>   	unsigned long ptr;	/* Current pointer inside inline refs */
> >>   	unsigned long end;	/* Extent item end */
> >>   	const u32 item_size = btrfs_item_size(leaf, slot);
> >> +	u8 last_type = 0;
> >> +	u64 last_seq = U64_MAX;
> >>   	u64 flags;
> >>   	u64 generation;
> >>   	u64 total_refs;		/* Total refs in btrfs_extent_item */
> >> @@ -1322,6 +1325,17 @@ static int check_extent_item(struct extent_buffer *leaf,
> >>   	 *    2.2) Ref type specific data
> >>   	 *         Either using btrfs_extent_inline_ref::offset, or specific
> >>   	 *         data structure.
> >> +	 *    All above inline items should follow the order:
> >> +	 *
> >> +	 *    - All btrfs_extent_inline_ref::type should be in an ascending
> >> +	 *      order
> >> +	 *
> >> +	 *    - Within the same type, the items should follow a descending
> >> +	 *      order by their sequence number
> >> +	 *      The sequence number is determined by:
> >> +	 *      * btrfs_extent_inline_ref::offset for all types  other than
> >> +	 *        EXTENT_DATA_REF
> >> +	 *      * hash_extent_data_ref() for EXTENT_DATA_REF
> >>   	 */
> >>   	if (unlikely(item_size < sizeof(*ei))) {
> >>   		extent_err(leaf, slot,
> >> @@ -1403,6 +1417,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> >>   		struct btrfs_extent_inline_ref *iref;
> >>   		struct btrfs_extent_data_ref *dref;
> >>   		struct btrfs_shared_data_ref *sref;
> >> +		u64 seq;
> >>   		u64 dref_offset;
> >>   		u64 inline_offset;
> >>   		u8 inline_type;
> >> @@ -1416,6 +1431,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> >>   		iref = (struct btrfs_extent_inline_ref *)ptr;
> >>   		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
> >>   		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
> >> +		seq = inline_offset;
> >>   		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
> >>   			extent_err(leaf, slot,
> >>   "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
> >> @@ -1446,6 +1462,10 @@ static int check_extent_item(struct extent_buffer *leaf,
> >>   		case BTRFS_EXTENT_DATA_REF_KEY:
> >>   			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
> >>   			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
> >> +			seq = hash_extent_data_ref(
> >> +					btrfs_extent_data_ref_root(leaf, dref),
> >> +					btrfs_extent_data_ref_objectid(leaf, dref),
> >> +					btrfs_extent_data_ref_offset(leaf, dref));
> >>   			if (unlikely(!IS_ALIGNED(dref_offset,
> >>   						 fs_info->sectorsize))) {
> >>   				extent_err(leaf, slot,
> >> @@ -1475,6 +1495,24 @@ static int check_extent_item(struct extent_buffer *leaf,
> >>   				   inline_type);
> >>   			return -EUCLEAN;
> >>   		}
> >> +		if (inline_type < last_type) {
> >> +			extent_err(leaf, slot,
> >> +				   "inline ref out-of-order: has type %u, prev type %u",
> >> +				   inline_type, last_type);
> >> +			return -EUCLEAN;
> >> +		}
> >> +		/* Type changed, allow the sequence starts from U64_MAX again. */
> >> +		if (inline_type > last_type)
> >> +			last_seq = U64_MAX;
> >> +		if (seq > last_seq) {
> >> +			extent_err(leaf, slot,
> >> +"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
> >> +				   inline_type, inline_offset, seq,
> >> +				   last_type, last_seq);
> >> +			return -EUCLEAN;
> >> +		}
> >> +		last_type = inline_type;
> >> +		last_seq = seq;
> >>   		ptr += btrfs_extent_inline_ref_size(inline_type);
> >>   	}
> >>   	/* No padding is allowed */
> >> --
> >> 2.42.0
> >>
> >
> > I believe this breaks simple quotas EXTENT_OWNER_REF_KEY items which
> > have type 188 but come in before the other inline items.
> 
> Does it mean EXTENT_OWNER_REF_KEY is another odd ball, which doesn't
> follow the existing type/inline ref order?
> 
> If so, can we fix it in the kernel first?

It's still time to do such changes.
Boris Burkov Nov. 2, 2023, 8:35 p.m. UTC | #7
On Fri, Nov 03, 2023 at 06:47:59AM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/3 05:37, Boris Burkov wrote:
> > On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> > > [BUG]
> > > There is a bug report that ntfs2btrfs had a bug that it can lead to
> > > transaction abort and the filesystem flips to read-only.
> > > 
> > > [CAUSE]
> > > For inline backref items, kernel has a strict requirement for their
> > > ordered, they must follow the following rules:
> > > 
> > > - All btrfs_extent_inline_ref::type should be in an ascending order
> > > 
> > > - Within the same type, the items should follow a descending order by
> > >    their sequence number
> > > 
> > >    For EXTENT_DATA_REF type, the sequence number is result from
> > >    hash_extent_data_ref().
> > >    For other types, their sequence numbers are
> > >    btrfs_extent_inline_ref::offset.
> > > 
> > > Thus if there is any code not following above rules, the resulted
> > > inline backrefs can prevent the kernel to locate the needed inline
> > > backref and lead to transaction abort.
> > > 
> > > [FIX]
> > > Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> > > ability to detect such problems.
> > > 
> > > For kernel, let's be more noisy and be more specific about the order, so
> > > that the next time kernel hits such problem we would reject it in the
> > > first place, without leading to transaction abort.
> > > 
> > > Link: https://github.com/kdave/btrfs-progs/pull/622
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > ---
> > >   fs/btrfs/tree-checker.c | 38 ++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 38 insertions(+)
> > > 
> > > diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> > > index a416cbea75d1..981ad301d29d 100644
> > > --- a/fs/btrfs/tree-checker.c
> > > +++ b/fs/btrfs/tree-checker.c
> > > @@ -31,6 +31,7 @@
> > >   #include "inode-item.h"
> > >   #include "dir-item.h"
> > >   #include "raid-stripe-tree.h"
> > > +#include "extent-tree.h"
> > > 
> > >   /*
> > >    * Error message should follow the following format:
> > > @@ -1276,6 +1277,8 @@ static int check_extent_item(struct extent_buffer *leaf,
> > >   	unsigned long ptr;	/* Current pointer inside inline refs */
> > >   	unsigned long end;	/* Extent item end */
> > >   	const u32 item_size = btrfs_item_size(leaf, slot);
> > > +	u8 last_type = 0;
> > > +	u64 last_seq = U64_MAX;
> > >   	u64 flags;
> > >   	u64 generation;
> > >   	u64 total_refs;		/* Total refs in btrfs_extent_item */
> > > @@ -1322,6 +1325,17 @@ static int check_extent_item(struct extent_buffer *leaf,
> > >   	 *    2.2) Ref type specific data
> > >   	 *         Either using btrfs_extent_inline_ref::offset, or specific
> > >   	 *         data structure.
> > > +	 *    All above inline items should follow the order:
> > > +	 *
> > > +	 *    - All btrfs_extent_inline_ref::type should be in an ascending
> > > +	 *      order
> > > +	 *
> > > +	 *    - Within the same type, the items should follow a descending
> > > +	 *      order by their sequence number
> > > +	 *      The sequence number is determined by:
> > > +	 *      * btrfs_extent_inline_ref::offset for all types  other than
> > > +	 *        EXTENT_DATA_REF
> > > +	 *      * hash_extent_data_ref() for EXTENT_DATA_REF
> > >   	 */
> > >   	if (unlikely(item_size < sizeof(*ei))) {
> > >   		extent_err(leaf, slot,
> > > @@ -1403,6 +1417,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> > >   		struct btrfs_extent_inline_ref *iref;
> > >   		struct btrfs_extent_data_ref *dref;
> > >   		struct btrfs_shared_data_ref *sref;
> > > +		u64 seq;
> > >   		u64 dref_offset;
> > >   		u64 inline_offset;
> > >   		u8 inline_type;
> > > @@ -1416,6 +1431,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> > >   		iref = (struct btrfs_extent_inline_ref *)ptr;
> > >   		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
> > >   		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
> > > +		seq = inline_offset;
> > >   		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
> > >   			extent_err(leaf, slot,
> > >   "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
> > > @@ -1446,6 +1462,10 @@ static int check_extent_item(struct extent_buffer *leaf,
> > >   		case BTRFS_EXTENT_DATA_REF_KEY:
> > >   			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
> > >   			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
> > > +			seq = hash_extent_data_ref(
> > > +					btrfs_extent_data_ref_root(leaf, dref),
> > > +					btrfs_extent_data_ref_objectid(leaf, dref),
> > > +					btrfs_extent_data_ref_offset(leaf, dref));
> > >   			if (unlikely(!IS_ALIGNED(dref_offset,
> > >   						 fs_info->sectorsize))) {
> > >   				extent_err(leaf, slot,
> > > @@ -1475,6 +1495,24 @@ static int check_extent_item(struct extent_buffer *leaf,
> > >   				   inline_type);
> > >   			return -EUCLEAN;
> > >   		}
> > > +		if (inline_type < last_type) {
> > > +			extent_err(leaf, slot,
> > > +				   "inline ref out-of-order: has type %u, prev type %u",
> > > +				   inline_type, last_type);
> > > +			return -EUCLEAN;
> > > +		}
> > > +		/* Type changed, allow the sequence starts from U64_MAX again. */
> > > +		if (inline_type > last_type)
> > > +			last_seq = U64_MAX;
> > > +		if (seq > last_seq) {
> > > +			extent_err(leaf, slot,
> > > +"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
> > > +				   inline_type, inline_offset, seq,
> > > +				   last_type, last_seq);
> > > +			return -EUCLEAN;
> > > +		}
> > > +		last_type = inline_type;
> > > +		last_seq = seq;
> > >   		ptr += btrfs_extent_inline_ref_size(inline_type);
> > >   	}
> > >   	/* No padding is allowed */
> > > --
> > > 2.42.0
> > > 
> > 
> > I believe this breaks simple quotas EXTENT_OWNER_REF_KEY items which
> > have type 188 but come in before the other inline items.
> 
> Does it mean EXTENT_OWNER_REF_KEY is another odd ball, which doesn't
> follow the existing type/inline ref order?

Yes, I suppose it does. I didn't see that invariant documented anywhere,
so I apologize for breaking it. It does seem like a valuable invariant
to keep the inline items sorted.

If it's possible at this stage to change the type number to be 170 or
something, I think that would fix it, and would be a much less intrusive
change than pushing the owner ref item to the back of the inline refs,
which would complicate parsing a lot more, IMO.

OTOH, in general, the parsing has to have special cases for the owner ref
inline item since it is per extent, not per ref, so I don't see why it
couldn't just skip it here too.

e.g., this works to fix it:

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 50fdc69fdddf..62150419c6d4 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1496,6 +1496,9 @@ static int check_extent_item(struct extent_buffer *leaf,
 				   inline_type);
 			return -EUCLEAN;
 		}
+
+		if (last_type == BTRFS_EXTENT_OWNER_REF_KEY)
+			goto next;
 		if (inline_type < last_type) {
 			extent_err(leaf, slot,
 				   "inline ref out-of-order: has type %u, prev type %u",
@@ -1512,6 +1515,7 @@ static int check_extent_item(struct extent_buffer *leaf,
 				   last_type, last_seq);
 			return -EUCLEAN;
 		}
+next:
 		last_type = inline_type;
 		last_seq = seq;
 		ptr += btrfs_extent_inline_ref_size(inline_type);

> 
> If so, can we fix it in the kernel first?
> 
> Thanks,
> Qu
> > 
> > For a repro, btrfs/301 (available in the master fstests branch) fails
> > with the patch but passes without it.
Josef Bacik Nov. 2, 2023, 8:36 p.m. UTC | #8
On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> [BUG]
> There is a bug report that ntfs2btrfs had a bug that it can lead to
> transaction abort and the filesystem flips to read-only.
> 
> [CAUSE]
> For inline backref items, kernel has a strict requirement for their
> ordered, they must follow the following rules:
> 
> - All btrfs_extent_inline_ref::type should be in an ascending order
> 
> - Within the same type, the items should follow a descending order by
>   their sequence number
> 
>   For EXTENT_DATA_REF type, the sequence number is result from
>   hash_extent_data_ref().
>   For other types, their sequence numbers are
>   btrfs_extent_inline_ref::offset.
> 
> Thus if there is any code not following above rules, the resulted
> inline backrefs can prevent the kernel to locate the needed inline
> backref and lead to transaction abort.
> 
> [FIX]
> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> ability to detect such problems.
> 
> For kernel, let's be more noisy and be more specific about the order, so
> that the next time kernel hits such problem we would reject it in the
> first place, without leading to transaction abort.
> 
> Link: https://github.com/kdave/btrfs-progs/pull/622
> Signed-off-by: Qu Wenruo <wqu@suse.com>

This broke squotas and I didn't notice it until I was running the CI for my
mount api changes.

Lets try to use the CI for most things, even if you send it at the same time you
submit a job, it'll keep this sort of thing from happening.  Thanks,

Josef
Qu Wenruo Nov. 2, 2023, 8:49 p.m. UTC | #9
On 2023/11/3 07:06, Josef Bacik wrote:
> On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
>> [BUG]
>> There is a bug report that ntfs2btrfs had a bug that it can lead to
>> transaction abort and the filesystem flips to read-only.
>>
>> [CAUSE]
>> For inline backref items, kernel has a strict requirement for their
>> ordered, they must follow the following rules:
>>
>> - All btrfs_extent_inline_ref::type should be in an ascending order
>>
>> - Within the same type, the items should follow a descending order by
>>    their sequence number
>>
>>    For EXTENT_DATA_REF type, the sequence number is result from
>>    hash_extent_data_ref().
>>    For other types, their sequence numbers are
>>    btrfs_extent_inline_ref::offset.
>>
>> Thus if there is any code not following above rules, the resulted
>> inline backrefs can prevent the kernel to locate the needed inline
>> backref and lead to transaction abort.
>>
>> [FIX]
>> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
>> ability to detect such problems.
>>
>> For kernel, let's be more noisy and be more specific about the order, so
>> that the next time kernel hits such problem we would reject it in the
>> first place, without leading to transaction abort.
>>
>> Link: https://github.com/kdave/btrfs-progs/pull/622
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> This broke squotas and I didn't notice it until I was running the CI for my
> mount api changes.
>
> Lets try to use the CI for most things, even if you send it at the same time you
> submit a job, it'll keep this sort of thing from happening.  Thanks,

My bad, didn't utilize CI at all.

Any quick guides/docs on the CI system?

Thanks,
Qu

>
> Josef
Qu Wenruo Nov. 2, 2023, 9:05 p.m. UTC | #10
On 2023/11/3 07:05, Boris Burkov wrote:
> On Fri, Nov 03, 2023 at 06:47:59AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2023/11/3 05:37, Boris Burkov wrote:
>>> On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
>>>> [BUG]
>>>> There is a bug report that ntfs2btrfs had a bug that it can lead to
>>>> transaction abort and the filesystem flips to read-only.
>>>>
>>>> [CAUSE]
>>>> For inline backref items, kernel has a strict requirement for their
>>>> ordered, they must follow the following rules:
>>>>
>>>> - All btrfs_extent_inline_ref::type should be in an ascending order
>>>>
>>>> - Within the same type, the items should follow a descending order by
>>>>     their sequence number
>>>>
>>>>     For EXTENT_DATA_REF type, the sequence number is result from
>>>>     hash_extent_data_ref().
>>>>     For other types, their sequence numbers are
>>>>     btrfs_extent_inline_ref::offset.
>>>>
>>>> Thus if there is any code not following above rules, the resulted
>>>> inline backrefs can prevent the kernel to locate the needed inline
>>>> backref and lead to transaction abort.
>>>>
>>>> [FIX]
>>>> Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
>>>> ability to detect such problems.
>>>>
>>>> For kernel, let's be more noisy and be more specific about the order, so
>>>> that the next time kernel hits such problem we would reject it in the
>>>> first place, without leading to transaction abort.
>>>>
>>>> Link: https://github.com/kdave/btrfs-progs/pull/622
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>    fs/btrfs/tree-checker.c | 38 ++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>>> index a416cbea75d1..981ad301d29d 100644
>>>> --- a/fs/btrfs/tree-checker.c
>>>> +++ b/fs/btrfs/tree-checker.c
>>>> @@ -31,6 +31,7 @@
>>>>    #include "inode-item.h"
>>>>    #include "dir-item.h"
>>>>    #include "raid-stripe-tree.h"
>>>> +#include "extent-tree.h"
>>>>
>>>>    /*
>>>>     * Error message should follow the following format:
>>>> @@ -1276,6 +1277,8 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>    	unsigned long ptr;	/* Current pointer inside inline refs */
>>>>    	unsigned long end;	/* Extent item end */
>>>>    	const u32 item_size = btrfs_item_size(leaf, slot);
>>>> +	u8 last_type = 0;
>>>> +	u64 last_seq = U64_MAX;
>>>>    	u64 flags;
>>>>    	u64 generation;
>>>>    	u64 total_refs;		/* Total refs in btrfs_extent_item */
>>>> @@ -1322,6 +1325,17 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>    	 *    2.2) Ref type specific data
>>>>    	 *         Either using btrfs_extent_inline_ref::offset, or specific
>>>>    	 *         data structure.
>>>> +	 *    All above inline items should follow the order:
>>>> +	 *
>>>> +	 *    - All btrfs_extent_inline_ref::type should be in an ascending
>>>> +	 *      order
>>>> +	 *
>>>> +	 *    - Within the same type, the items should follow a descending
>>>> +	 *      order by their sequence number
>>>> +	 *      The sequence number is determined by:
>>>> +	 *      * btrfs_extent_inline_ref::offset for all types  other than
>>>> +	 *        EXTENT_DATA_REF
>>>> +	 *      * hash_extent_data_ref() for EXTENT_DATA_REF
>>>>    	 */
>>>>    	if (unlikely(item_size < sizeof(*ei))) {
>>>>    		extent_err(leaf, slot,
>>>> @@ -1403,6 +1417,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>    		struct btrfs_extent_inline_ref *iref;
>>>>    		struct btrfs_extent_data_ref *dref;
>>>>    		struct btrfs_shared_data_ref *sref;
>>>> +		u64 seq;
>>>>    		u64 dref_offset;
>>>>    		u64 inline_offset;
>>>>    		u8 inline_type;
>>>> @@ -1416,6 +1431,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>    		iref = (struct btrfs_extent_inline_ref *)ptr;
>>>>    		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
>>>>    		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
>>>> +		seq = inline_offset;
>>>>    		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
>>>>    			extent_err(leaf, slot,
>>>>    "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
>>>> @@ -1446,6 +1462,10 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>    		case BTRFS_EXTENT_DATA_REF_KEY:
>>>>    			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
>>>>    			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
>>>> +			seq = hash_extent_data_ref(
>>>> +					btrfs_extent_data_ref_root(leaf, dref),
>>>> +					btrfs_extent_data_ref_objectid(leaf, dref),
>>>> +					btrfs_extent_data_ref_offset(leaf, dref));
>>>>    			if (unlikely(!IS_ALIGNED(dref_offset,
>>>>    						 fs_info->sectorsize))) {
>>>>    				extent_err(leaf, slot,
>>>> @@ -1475,6 +1495,24 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>    				   inline_type);
>>>>    			return -EUCLEAN;
>>>>    		}
>>>> +		if (inline_type < last_type) {
>>>> +			extent_err(leaf, slot,
>>>> +				   "inline ref out-of-order: has type %u, prev type %u",
>>>> +				   inline_type, last_type);
>>>> +			return -EUCLEAN;
>>>> +		}
>>>> +		/* Type changed, allow the sequence starts from U64_MAX again. */
>>>> +		if (inline_type > last_type)
>>>> +			last_seq = U64_MAX;
>>>> +		if (seq > last_seq) {
>>>> +			extent_err(leaf, slot,
>>>> +"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
>>>> +				   inline_type, inline_offset, seq,
>>>> +				   last_type, last_seq);
>>>> +			return -EUCLEAN;
>>>> +		}
>>>> +		last_type = inline_type;
>>>> +		last_seq = seq;
>>>>    		ptr += btrfs_extent_inline_ref_size(inline_type);
>>>>    	}
>>>>    	/* No padding is allowed */
>>>> --
>>>> 2.42.0
>>>>
>>>
>>> I believe this breaks simple quotas EXTENT_OWNER_REF_KEY items which
>>> have type 188 but come in before the other inline items.
>>
>> Does it mean EXTENT_OWNER_REF_KEY is another odd ball, which doesn't
>> follow the existing type/inline ref order?
>
> Yes, I suppose it does. I didn't see that invariant documented anywhere,
> so I apologize for breaking it. It does seem like a valuable invariant
> to keep the inline items sorted.

Not a big deal, in fact we all hit the same situation, and that's why
we're adding tree-checker, kinda like a slightly better documented
on-disk format.

>
> If it's possible at this stage to change the type number to be 170 or
> something, I think that would fix it, and would be a much less intrusive
> change than pushing the owner ref item to the back of the inline refs,
> which would complicate parsing a lot more, IMO.

I believe this is much better solution.

Especially we have EXPERIMENTAL flags for btrfs-progs, thus it should
give us quite some period to tweak the on-disk format.

Although even I submitted a tree-checker for this, I'm not a fan of the
existing weird ordering thing for inlined backrefs at all.

I'd prefer to have no order at all (it's not that bad, since the
backrefs are already inlined, but can still bring small perf drop), or
fully ordered (type and seq follow the same ascending or descending order).

But this needs a on-disk format change, and even we introduced such
change it would still take a long time to deprecate the existing order.

So I'm afraid we would stick to the existing order.

>
> OTOH, in general, the parsing has to have special cases for the owner ref
> inline item since it is per extent, not per ref, so I don't see why it
> couldn't just skip it here too.

Yeah, it's not hard to do in the code, but for the sake of consistency,
I'd really prefer it to follow some existing order, even if the existing
one is already weird...



Another thing to mention is, that EXTENT_OWNER_REF seems to be inlined
only, has no keyed version.

I have already come upon a rare corner case:

   What if an EXTENT_ITEM is already so large that it can not add a new
   inline ref?

I guess this can only be a problem for converting, as for now squota can
only be enabled at mkfs time, thus the new EXTENT_OWNER_REF can always
be inlined.

Thanks,
Qu

>
> e.g., this works to fix it:
>
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 50fdc69fdddf..62150419c6d4 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -1496,6 +1496,9 @@ static int check_extent_item(struct extent_buffer *leaf,
>   				   inline_type);
>   			return -EUCLEAN;
>   		}
> +
> +		if (last_type == BTRFS_EXTENT_OWNER_REF_KEY)
> +			goto next;
>   		if (inline_type < last_type) {
>   			extent_err(leaf, slot,
>   				   "inline ref out-of-order: has type %u, prev type %u",
> @@ -1512,6 +1515,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>   				   last_type, last_seq);
>   			return -EUCLEAN;
>   		}
> +next:
>   		last_type = inline_type;
>   		last_seq = seq;
>   		ptr += btrfs_extent_inline_ref_size(inline_type);
>
>>
>> If so, can we fix it in the kernel first?
>>
>> Thanks,
>> Qu
>>>
>>> For a repro, btrfs/301 (available in the master fstests branch) fails
>>> with the patch but passes without it.
Boris Burkov Nov. 2, 2023, 9:34 p.m. UTC | #11
On Fri, Nov 03, 2023 at 07:35:48AM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/3 07:05, Boris Burkov wrote:
> > On Fri, Nov 03, 2023 at 06:47:59AM +1030, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2023/11/3 05:37, Boris Burkov wrote:
> > > > On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> > > > > [BUG]
> > > > > There is a bug report that ntfs2btrfs had a bug that it can lead to
> > > > > transaction abort and the filesystem flips to read-only.
> > > > > 
> > > > > [CAUSE]
> > > > > For inline backref items, kernel has a strict requirement for their
> > > > > ordered, they must follow the following rules:
> > > > > 
> > > > > - All btrfs_extent_inline_ref::type should be in an ascending order
> > > > > 
> > > > > - Within the same type, the items should follow a descending order by
> > > > >     their sequence number
> > > > > 
> > > > >     For EXTENT_DATA_REF type, the sequence number is result from
> > > > >     hash_extent_data_ref().
> > > > >     For other types, their sequence numbers are
> > > > >     btrfs_extent_inline_ref::offset.
> > > > > 
> > > > > Thus if there is any code not following above rules, the resulted
> > > > > inline backrefs can prevent the kernel to locate the needed inline
> > > > > backref and lead to transaction abort.
> > > > > 
> > > > > [FIX]
> > > > > Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> > > > > ability to detect such problems.
> > > > > 
> > > > > For kernel, let's be more noisy and be more specific about the order, so
> > > > > that the next time kernel hits such problem we would reject it in the
> > > > > first place, without leading to transaction abort.
> > > > > 
> > > > > Link: https://github.com/kdave/btrfs-progs/pull/622
> > > > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > > > > ---
> > > > >    fs/btrfs/tree-checker.c | 38 ++++++++++++++++++++++++++++++++++++++
> > > > >    1 file changed, 38 insertions(+)
> > > > > 
> > > > > diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> > > > > index a416cbea75d1..981ad301d29d 100644
> > > > > --- a/fs/btrfs/tree-checker.c
> > > > > +++ b/fs/btrfs/tree-checker.c
> > > > > @@ -31,6 +31,7 @@
> > > > >    #include "inode-item.h"
> > > > >    #include "dir-item.h"
> > > > >    #include "raid-stripe-tree.h"
> > > > > +#include "extent-tree.h"
> > > > > 
> > > > >    /*
> > > > >     * Error message should follow the following format:
> > > > > @@ -1276,6 +1277,8 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > > >    	unsigned long ptr;	/* Current pointer inside inline refs */
> > > > >    	unsigned long end;	/* Extent item end */
> > > > >    	const u32 item_size = btrfs_item_size(leaf, slot);
> > > > > +	u8 last_type = 0;
> > > > > +	u64 last_seq = U64_MAX;
> > > > >    	u64 flags;
> > > > >    	u64 generation;
> > > > >    	u64 total_refs;		/* Total refs in btrfs_extent_item */
> > > > > @@ -1322,6 +1325,17 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > > >    	 *    2.2) Ref type specific data
> > > > >    	 *         Either using btrfs_extent_inline_ref::offset, or specific
> > > > >    	 *         data structure.
> > > > > +	 *    All above inline items should follow the order:
> > > > > +	 *
> > > > > +	 *    - All btrfs_extent_inline_ref::type should be in an ascending
> > > > > +	 *      order
> > > > > +	 *
> > > > > +	 *    - Within the same type, the items should follow a descending
> > > > > +	 *      order by their sequence number
> > > > > +	 *      The sequence number is determined by:
> > > > > +	 *      * btrfs_extent_inline_ref::offset for all types  other than
> > > > > +	 *        EXTENT_DATA_REF
> > > > > +	 *      * hash_extent_data_ref() for EXTENT_DATA_REF
> > > > >    	 */
> > > > >    	if (unlikely(item_size < sizeof(*ei))) {
> > > > >    		extent_err(leaf, slot,
> > > > > @@ -1403,6 +1417,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > > >    		struct btrfs_extent_inline_ref *iref;
> > > > >    		struct btrfs_extent_data_ref *dref;
> > > > >    		struct btrfs_shared_data_ref *sref;
> > > > > +		u64 seq;
> > > > >    		u64 dref_offset;
> > > > >    		u64 inline_offset;
> > > > >    		u8 inline_type;
> > > > > @@ -1416,6 +1431,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > > >    		iref = (struct btrfs_extent_inline_ref *)ptr;
> > > > >    		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
> > > > >    		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
> > > > > +		seq = inline_offset;
> > > > >    		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
> > > > >    			extent_err(leaf, slot,
> > > > >    "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
> > > > > @@ -1446,6 +1462,10 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > > >    		case BTRFS_EXTENT_DATA_REF_KEY:
> > > > >    			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
> > > > >    			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
> > > > > +			seq = hash_extent_data_ref(
> > > > > +					btrfs_extent_data_ref_root(leaf, dref),
> > > > > +					btrfs_extent_data_ref_objectid(leaf, dref),
> > > > > +					btrfs_extent_data_ref_offset(leaf, dref));
> > > > >    			if (unlikely(!IS_ALIGNED(dref_offset,
> > > > >    						 fs_info->sectorsize))) {
> > > > >    				extent_err(leaf, slot,
> > > > > @@ -1475,6 +1495,24 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > > >    				   inline_type);
> > > > >    			return -EUCLEAN;
> > > > >    		}
> > > > > +		if (inline_type < last_type) {
> > > > > +			extent_err(leaf, slot,
> > > > > +				   "inline ref out-of-order: has type %u, prev type %u",
> > > > > +				   inline_type, last_type);
> > > > > +			return -EUCLEAN;
> > > > > +		}
> > > > > +		/* Type changed, allow the sequence starts from U64_MAX again. */
> > > > > +		if (inline_type > last_type)
> > > > > +			last_seq = U64_MAX;
> > > > > +		if (seq > last_seq) {
> > > > > +			extent_err(leaf, slot,
> > > > > +"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
> > > > > +				   inline_type, inline_offset, seq,
> > > > > +				   last_type, last_seq);
> > > > > +			return -EUCLEAN;
> > > > > +		}
> > > > > +		last_type = inline_type;
> > > > > +		last_seq = seq;
> > > > >    		ptr += btrfs_extent_inline_ref_size(inline_type);
> > > > >    	}
> > > > >    	/* No padding is allowed */
> > > > > --
> > > > > 2.42.0
> > > > > 
> > > > 
> > > > I believe this breaks simple quotas EXTENT_OWNER_REF_KEY items which
> > > > have type 188 but come in before the other inline items.
> > > 
> > > Does it mean EXTENT_OWNER_REF_KEY is another odd ball, which doesn't
> > > follow the existing type/inline ref order?
> > 
> > Yes, I suppose it does. I didn't see that invariant documented anywhere,
> > so I apologize for breaking it. It does seem like a valuable invariant
> > to keep the inline items sorted.
> 
> Not a big deal, in fact we all hit the same situation, and that's why
> we're adding tree-checker, kinda like a slightly better documented
> on-disk format.

+1, Love "documenting" via tree-checker :)

> 
> > 
> > If it's possible at this stage to change the type number to be 170 or
> > something, I think that would fix it, and would be a much less intrusive
> > change than pushing the owner ref item to the back of the inline refs,
> > which would complicate parsing a lot more, IMO.
> 
> I believe this is much better solution.

Agreed. I hope it's possible! Dave, can you weigh in on whether this
exact change can be done at this point? You suggested upthread that it
should be, but I just want to be sure.

> 
> Especially we have EXPERIMENTAL flags for btrfs-progs, thus it should
> give us quite some period to tweak the on-disk format.
> 
> Although even I submitted a tree-checker for this, I'm not a fan of the
> existing weird ordering thing for inlined backrefs at all.
> 
> I'd prefer to have no order at all (it's not that bad, since the
> backrefs are already inlined, but can still bring small perf drop), or
> fully ordered (type and seq follow the same ascending or descending order).
> 
> But this needs a on-disk format change, and even we introduced such
> change it would still take a long time to deprecate the existing order.
> 
> So I'm afraid we would stick to the existing order.

+1

> 
> > 
> > OTOH, in general, the parsing has to have special cases for the owner ref
> > inline item since it is per extent, not per ref, so I don't see why it
> > couldn't just skip it here too.
> 
> Yeah, it's not hard to do in the code, but for the sake of consistency,
> I'd really prefer it to follow some existing order, even if the existing
> one is already weird...
> 

"Existing order is owner ref, then weird hash/type order for the rest" :P

> 
> 
> Another thing to mention is, that EXTENT_OWNER_REF seems to be inlined
> only, has no keyed version.
> 
> I have already come upon a rare corner case:
> 
>   What if an EXTENT_ITEM is already so large that it can not add a new
>   inline ref?
> 
> I guess this can only be a problem for converting, as for now squota can
> only be enabled at mkfs time, thus the new EXTENT_OWNER_REF can always
> be inlined.

It's a good point, but the answer is slightly subtler. You can enable
squota on a live fs, it just doesn't/can't do the O(extents)
conversion for existing extents. The owner ref is created for all *new*
extent items, but when creating an extent item, you dictate the item
size which leaves enough room for the inline refs (including the owner
ref).

> 
> Thanks,
> Qu
> 
> > 
> > e.g., this works to fix it:
> > 
> > diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> > index 50fdc69fdddf..62150419c6d4 100644
> > --- a/fs/btrfs/tree-checker.c
> > +++ b/fs/btrfs/tree-checker.c
> > @@ -1496,6 +1496,9 @@ static int check_extent_item(struct extent_buffer *leaf,
> >   				   inline_type);
> >   			return -EUCLEAN;
> >   		}
> > +
> > +		if (last_type == BTRFS_EXTENT_OWNER_REF_KEY)
> > +			goto next;
> >   		if (inline_type < last_type) {
> >   			extent_err(leaf, slot,
> >   				   "inline ref out-of-order: has type %u, prev type %u",
> > @@ -1512,6 +1515,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> >   				   last_type, last_seq);
> >   			return -EUCLEAN;
> >   		}
> > +next:
> >   		last_type = inline_type;
> >   		last_seq = seq;
> >   		ptr += btrfs_extent_inline_ref_size(inline_type);
> > 
> > > 
> > > If so, can we fix it in the kernel first?
> > > 
> > > Thanks,
> > > Qu
> > > > 
> > > > For a repro, btrfs/301 (available in the master fstests branch) fails
> > > > with the patch but passes without it.
Qu Wenruo Nov. 2, 2023, 9:39 p.m. UTC | #12
On 2023/11/3 08:04, Boris Burkov wrote:
[...]
>>
>>
>> Another thing to mention is, that EXTENT_OWNER_REF seems to be inlined
>> only, has no keyed version.
>>
>> I have already come upon a rare corner case:
>>
>>    What if an EXTENT_ITEM is already so large that it can not add a new
>>    inline ref?
>>
>> I guess this can only be a problem for converting, as for now squota can
>> only be enabled at mkfs time, thus the new EXTENT_OWNER_REF can always
>> be inlined.
>
> It's a good point, but the answer is slightly subtler. You can enable
> squota on a live fs, it just doesn't/can't do the O(extents)
> conversion for existing extents. The owner ref is created for all *new*
> extent items, but when creating an extent item, you dictate the item
> size which leaves enough room for the inline refs (including the owner
> ref).

Oh, that's way safer.

Although I guess we may want btrfs-progs conversion support for an
unmounted btrfs in the future, in that case we may need to consider the
keyed version of EXTENT_OWNER_REF, to be able to handle an existing
large EXTENT_ITEM.

And of course, for the sake of consistency (since all the existing 4
types all can be inlined or keyed).

Thanks,
Qu

>
>>
>> Thanks,
>> Qu
>>
>>>
>>> e.g., this works to fix it:
>>>
>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>> index 50fdc69fdddf..62150419c6d4 100644
>>> --- a/fs/btrfs/tree-checker.c
>>> +++ b/fs/btrfs/tree-checker.c
>>> @@ -1496,6 +1496,9 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>    				   inline_type);
>>>    			return -EUCLEAN;
>>>    		}
>>> +
>>> +		if (last_type == BTRFS_EXTENT_OWNER_REF_KEY)
>>> +			goto next;
>>>    		if (inline_type < last_type) {
>>>    			extent_err(leaf, slot,
>>>    				   "inline ref out-of-order: has type %u, prev type %u",
>>> @@ -1512,6 +1515,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>    				   last_type, last_seq);
>>>    			return -EUCLEAN;
>>>    		}
>>> +next:
>>>    		last_type = inline_type;
>>>    		last_seq = seq;
>>>    		ptr += btrfs_extent_inline_ref_size(inline_type);
>>>
>>>>
>>>> If so, can we fix it in the kernel first?
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> For a repro, btrfs/301 (available in the master fstests branch) fails
>>>>> with the patch but passes without it.
Boris Burkov Nov. 2, 2023, 9:46 p.m. UTC | #13
On Fri, Nov 03, 2023 at 08:09:10AM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/3 08:04, Boris Burkov wrote:
> [...]
> > > 
> > > 
> > > Another thing to mention is, that EXTENT_OWNER_REF seems to be inlined
> > > only, has no keyed version.
> > > 
> > > I have already come upon a rare corner case:
> > > 
> > >    What if an EXTENT_ITEM is already so large that it can not add a new
> > >    inline ref?
> > > 
> > > I guess this can only be a problem for converting, as for now squota can
> > > only be enabled at mkfs time, thus the new EXTENT_OWNER_REF can always
> > > be inlined.
> > 
> > It's a good point, but the answer is slightly subtler. You can enable
> > squota on a live fs, it just doesn't/can't do the O(extents)
> > conversion for existing extents. The owner ref is created for all *new*
> > extent items, but when creating an extent item, you dictate the item
> > size which leaves enough room for the inline refs (including the owner
> > ref).
> 
> Oh, that's way safer.
> 
> Although I guess we may want btrfs-progs conversion support for an
> unmounted btrfs in the future, in that case we may need to consider the
> keyed version of EXTENT_OWNER_REF, to be able to handle an existing
> large EXTENT_ITEM.

I would love that, except I couldn't think of a way to do it while
keeping the semantics of simple quotas, well.. simple :)

We just don't know what subvol an extent is from post-hoc, so any
conversion would be guessing. Reflinks are really opaque from this
perspective (and the reason for needing this unpleasant format changing
item at all).

I suppose "pick the first ref for conversion" could be reasonable and
useful for some (all?) users.

Current btrfstune for squotas just turns it on for the next mount
without converting extents.

> 
> And of course, for the sake of consistency (since all the existing 4
> types all can be inlined or keyed).

True.

> 
> Thanks,
> Qu
> 
> > 
> > > 
> > > Thanks,
> > > Qu
> > > 
> > > > 
> > > > e.g., this works to fix it:
> > > > 
> > > > diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> > > > index 50fdc69fdddf..62150419c6d4 100644
> > > > --- a/fs/btrfs/tree-checker.c
> > > > +++ b/fs/btrfs/tree-checker.c
> > > > @@ -1496,6 +1496,9 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > >    				   inline_type);
> > > >    			return -EUCLEAN;
> > > >    		}
> > > > +
> > > > +		if (last_type == BTRFS_EXTENT_OWNER_REF_KEY)
> > > > +			goto next;
> > > >    		if (inline_type < last_type) {
> > > >    			extent_err(leaf, slot,
> > > >    				   "inline ref out-of-order: has type %u, prev type %u",
> > > > @@ -1512,6 +1515,7 @@ static int check_extent_item(struct extent_buffer *leaf,
> > > >    				   last_type, last_seq);
> > > >    			return -EUCLEAN;
> > > >    		}
> > > > +next:
> > > >    		last_type = inline_type;
> > > >    		last_seq = seq;
> > > >    		ptr += btrfs_extent_inline_ref_size(inline_type);
> > > > 
> > > > > 
> > > > > If so, can we fix it in the kernel first?
> > > > > 
> > > > > Thanks,
> > > > > Qu
> > > > > > 
> > > > > > For a repro, btrfs/301 (available in the master fstests branch) fails
> > > > > > with the patch but passes without it.
Qu Wenruo Nov. 2, 2023, 9:47 p.m. UTC | #14
On 2023/11/3 08:16, Boris Burkov wrote:
> On Fri, Nov 03, 2023 at 08:09:10AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2023/11/3 08:04, Boris Burkov wrote:
>> [...]
>>>>
>>>>
>>>> Another thing to mention is, that EXTENT_OWNER_REF seems to be inlined
>>>> only, has no keyed version.
>>>>
>>>> I have already come upon a rare corner case:
>>>>
>>>>     What if an EXTENT_ITEM is already so large that it can not add a new
>>>>     inline ref?
>>>>
>>>> I guess this can only be a problem for converting, as for now squota can
>>>> only be enabled at mkfs time, thus the new EXTENT_OWNER_REF can always
>>>> be inlined.
>>>
>>> It's a good point, but the answer is slightly subtler. You can enable
>>> squota on a live fs, it just doesn't/can't do the O(extents)
>>> conversion for existing extents. The owner ref is created for all *new*
>>> extent items, but when creating an extent item, you dictate the item
>>> size which leaves enough room for the inline refs (including the owner
>>> ref).
>>
>> Oh, that's way safer.
>>
>> Although I guess we may want btrfs-progs conversion support for an
>> unmounted btrfs in the future, in that case we may need to consider the
>> keyed version of EXTENT_OWNER_REF, to be able to handle an existing
>> large EXTENT_ITEM.
>
> I would love that, except I couldn't think of a way to do it while
> keeping the semantics of simple quotas, well.. simple :)
>
> We just don't know what subvol an extent is from post-hoc, so any
> conversion would be guessing. Reflinks are really opaque from this
> perspective (and the reason for needing this unpleasant format changing
> item at all).

Oh, indeed I didn't take this into consideration.

And it is indeed making simple quote complex, just forget this idea.

Thanks,
Qu
>
> I suppose "pick the first ref for conversion" could be reasonable and
> useful for some (all?) users.
>
> Current btrfstune for squotas just turns it on for the next mount
> without converting extents.
>
>>
>> And of course, for the sake of consistency (since all the existing 4
>> types all can be inlined or keyed).
>
> True.
>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> e.g., this works to fix it:
>>>>>
>>>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>>>> index 50fdc69fdddf..62150419c6d4 100644
>>>>> --- a/fs/btrfs/tree-checker.c
>>>>> +++ b/fs/btrfs/tree-checker.c
>>>>> @@ -1496,6 +1496,9 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>>     				   inline_type);
>>>>>     			return -EUCLEAN;
>>>>>     		}
>>>>> +
>>>>> +		if (last_type == BTRFS_EXTENT_OWNER_REF_KEY)
>>>>> +			goto next;
>>>>>     		if (inline_type < last_type) {
>>>>>     			extent_err(leaf, slot,
>>>>>     				   "inline ref out-of-order: has type %u, prev type %u",
>>>>> @@ -1512,6 +1515,7 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>>>     				   last_type, last_seq);
>>>>>     			return -EUCLEAN;
>>>>>     		}
>>>>> +next:
>>>>>     		last_type = inline_type;
>>>>>     		last_seq = seq;
>>>>>     		ptr += btrfs_extent_inline_ref_size(inline_type);
>>>>>
>>>>>>
>>>>>> If so, can we fix it in the kernel first?
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>>
>>>>>>> For a repro, btrfs/301 (available in the master fstests branch) fails
>>>>>>> with the patch but passes without it.
Josef Bacik Nov. 3, 2023, 1:45 p.m. UTC | #15
On Fri, Nov 03, 2023 at 07:19:41AM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/3 07:06, Josef Bacik wrote:
> > On Tue, Oct 24, 2023 at 12:41:11PM +1030, Qu Wenruo wrote:
> > > [BUG]
> > > There is a bug report that ntfs2btrfs had a bug that it can lead to
> > > transaction abort and the filesystem flips to read-only.
> > > 
> > > [CAUSE]
> > > For inline backref items, kernel has a strict requirement for their
> > > ordered, they must follow the following rules:
> > > 
> > > - All btrfs_extent_inline_ref::type should be in an ascending order
> > > 
> > > - Within the same type, the items should follow a descending order by
> > >    their sequence number
> > > 
> > >    For EXTENT_DATA_REF type, the sequence number is result from
> > >    hash_extent_data_ref().
> > >    For other types, their sequence numbers are
> > >    btrfs_extent_inline_ref::offset.
> > > 
> > > Thus if there is any code not following above rules, the resulted
> > > inline backrefs can prevent the kernel to locate the needed inline
> > > backref and lead to transaction abort.
> > > 
> > > [FIX]
> > > Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
> > > ability to detect such problems.
> > > 
> > > For kernel, let's be more noisy and be more specific about the order, so
> > > that the next time kernel hits such problem we would reject it in the
> > > first place, without leading to transaction abort.
> > > 
> > > Link: https://github.com/kdave/btrfs-progs/pull/622
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > 
> > This broke squotas and I didn't notice it until I was running the CI for my
> > mount api changes.
> > 
> > Lets try to use the CI for most things, even if you send it at the same time you
> > submit a job, it'll keep this sort of thing from happening.  Thanks,
> 
> My bad, didn't utilize CI at all.
> 
> Any quick guides/docs on the CI system?
> 

I'll put something in the developers docs, but just push your branch to github,
and submit a PR against btrfs/linux, the 'ci' branch, and it'll automatically
kick it off for you.  Thanks,

Josef
David Sterba Nov. 3, 2023, 3:04 p.m. UTC | #16
On Thu, Nov 02, 2023 at 02:34:30PM -0700, Boris Burkov wrote:
> On Fri, Nov 03, 2023 at 07:35:48AM +1030, Qu Wenruo wrote:
> > > If it's possible at this stage to change the type number to be 170 or
> > > something, I think that would fix it, and would be a much less intrusive
> > > change than pushing the owner ref item to the back of the inline refs,
> > > which would complicate parsing a lot more, IMO.
> > 
> > I believe this is much better solution.
> 
> Agreed. I hope it's possible! Dave, can you weigh in on whether this
> exact change can be done at this point? You suggested upthread that it
> should be, but I just want to be sure.

Yes, the key number can be changed, we'll only need to synchronize
kernel and progs, which may take a few days. Once this is in kernel I'll
do a 6.6.x bugfix release short after.

The exact key number should be between

BTRFS_METADATA_ITEM_KEY 169

and

BTRFS_TREE_BLOCK_REF_KEY        176

so I wouldn't pick 170 as it leaves no gap. Also add a comment that the
key must be before the other _REF_KEY so the refs can be sorted.
diff mbox series

Patch

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index a416cbea75d1..981ad301d29d 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -31,6 +31,7 @@ 
 #include "inode-item.h"
 #include "dir-item.h"
 #include "raid-stripe-tree.h"
+#include "extent-tree.h"
 
 /*
  * Error message should follow the following format:
@@ -1276,6 +1277,8 @@  static int check_extent_item(struct extent_buffer *leaf,
 	unsigned long ptr;	/* Current pointer inside inline refs */
 	unsigned long end;	/* Extent item end */
 	const u32 item_size = btrfs_item_size(leaf, slot);
+	u8 last_type = 0;
+	u64 last_seq = U64_MAX;
 	u64 flags;
 	u64 generation;
 	u64 total_refs;		/* Total refs in btrfs_extent_item */
@@ -1322,6 +1325,17 @@  static int check_extent_item(struct extent_buffer *leaf,
 	 *    2.2) Ref type specific data
 	 *         Either using btrfs_extent_inline_ref::offset, or specific
 	 *         data structure.
+	 *    All above inline items should follow the order:
+	 *
+	 *    - All btrfs_extent_inline_ref::type should be in an ascending
+	 *      order
+	 *
+	 *    - Within the same type, the items should follow a descending
+	 *      order by their sequence number
+	 *      The sequence number is determined by:
+	 *      * btrfs_extent_inline_ref::offset for all types  other than
+	 *        EXTENT_DATA_REF
+	 *      * hash_extent_data_ref() for EXTENT_DATA_REF
 	 */
 	if (unlikely(item_size < sizeof(*ei))) {
 		extent_err(leaf, slot,
@@ -1403,6 +1417,7 @@  static int check_extent_item(struct extent_buffer *leaf,
 		struct btrfs_extent_inline_ref *iref;
 		struct btrfs_extent_data_ref *dref;
 		struct btrfs_shared_data_ref *sref;
+		u64 seq;
 		u64 dref_offset;
 		u64 inline_offset;
 		u8 inline_type;
@@ -1416,6 +1431,7 @@  static int check_extent_item(struct extent_buffer *leaf,
 		iref = (struct btrfs_extent_inline_ref *)ptr;
 		inline_type = btrfs_extent_inline_ref_type(leaf, iref);
 		inline_offset = btrfs_extent_inline_ref_offset(leaf, iref);
+		seq = inline_offset;
 		if (unlikely(ptr + btrfs_extent_inline_ref_size(inline_type) > end)) {
 			extent_err(leaf, slot,
 "inline ref item overflows extent item, ptr %lu iref size %u end %lu",
@@ -1446,6 +1462,10 @@  static int check_extent_item(struct extent_buffer *leaf,
 		case BTRFS_EXTENT_DATA_REF_KEY:
 			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
 			dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
+			seq = hash_extent_data_ref(
+					btrfs_extent_data_ref_root(leaf, dref),
+					btrfs_extent_data_ref_objectid(leaf, dref),
+					btrfs_extent_data_ref_offset(leaf, dref));
 			if (unlikely(!IS_ALIGNED(dref_offset,
 						 fs_info->sectorsize))) {
 				extent_err(leaf, slot,
@@ -1475,6 +1495,24 @@  static int check_extent_item(struct extent_buffer *leaf,
 				   inline_type);
 			return -EUCLEAN;
 		}
+		if (inline_type < last_type) {
+			extent_err(leaf, slot,
+				   "inline ref out-of-order: has type %u, prev type %u",
+				   inline_type, last_type);
+			return -EUCLEAN;
+		}
+		/* Type changed, allow the sequence starts from U64_MAX again. */
+		if (inline_type > last_type)
+			last_seq = U64_MAX;
+		if (seq > last_seq) {
+			extent_err(leaf, slot,
+"inline ref out-of-order: has type %u offset %llu seq 0x%llx, prev type %u seq 0x%llx",
+				   inline_type, inline_offset, seq,
+				   last_type, last_seq);
+			return -EUCLEAN;
+		}
+		last_type = inline_type;
+		last_seq = seq;
 		ptr += btrfs_extent_inline_ref_size(inline_type);
 	}
 	/* No padding is allowed */