
btrfs: add extra ending condition for indirect data backref resolution

Message ID 1578044681-25562-1-git-send-email-ethanwu@synology.com (mailing list archive)
State New, archived

Commit Message

ethanwu Jan. 3, 2020, 9:44 a.m. UTC
Btrfs has two types of data backrefs.
For the BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
exact block number. Therefore, we need to call resolve_indirect_refs,
which uses btrfs_search_slot to locate the leaf block. After that,
we need to walk through the leaves to search for the EXTENT_DATA items
whose disk bytenr matches the extent item (add_all_parents).

The only conditions under which we stop searching are:
1. We find a different object id, or the type is not EXTENT_DATA
2. We've already got all the refs we want (total_refs)
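
Roughly, that walk looks like the following (a condensed sketch of
add_all_parents() in fs/btrfs/backref.c, not the exact code; error handling
and the extent_item_pos checks are left out):

    while (!ret && count < total_refs) {    /* stop condition 2 */
        eb = path->nodes[0];
        if (path->slots[0] >= btrfs_header_nritems(eb)) {
            ret = btrfs_next_old_leaf(root, path, time_seq);
            continue;
        }
        btrfs_item_key_to_cpu(eb, &key, path->slots[0]);

        /* stop condition 1: different inode or not an EXTENT_DATA item */
        if (key.objectid != key_for_search->objectid ||
            key.type != BTRFS_EXTENT_DATA_KEY)
            break;

        fi = btrfs_item_ptr(eb, path->slots[0],
                            struct btrfs_file_extent_item);
        disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
        if (disk_byte == wanted_disk_byte) {
            /* record eb->start as a parent of this ref */
            count++;
        }
        ret = btrfs_next_old_item(root, path, time_seq);
    }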

Take the following EXTENT_ITEM as an example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
    extent refs 24 gen 7302 flags DATA
    extent data backref root 257 objectid 260 offset 65536 count 5 #backref entry 1
    extent data backref root 258 objectid 265 offset 0 count 9 #backref entry 2
    shared data backref parent 394985472 count 10 #backref entry 3

If we want to search for backref entry 1, total_refs here would be 24 rather
than the entry's own count of 5.

The reason to use 24 is that some EXTENT_DATA items in block 394985472 of
backref entry 3 also point to EXTENT_ITEM 40831553536. If that block also
belongs to root 257 and lies between the 5 items of backref entry 1, and we
used total_refs = 5, we would end up missing some refs from backref entry 1.

But using total_refs=24 is not accurate either. We'll never find the extent
data keys of backref entry 2, since we searched root 257, not 258. We'll never
reach block 394985472 either, if that block is not a leaf in root 257.
As a result, the loop keeps going until we reach the end of that inode.

Since we're searching for the parent blocks of backref entry 1, we're 100%
sure we'll never find any EXTENT_DATA beyond file offset 65536 + 4194304 =
4259840 that matches this entry. If there's any EXTENT_DATA beyond this range
using this extent item, its backref must be stored in a different backref
entry, and that EXTENT_DATA will be handled when we process that entry.

Fix this by breaking out of the loop once we reach offset + (size of the
EXTENT_ITEM).

btrfs send uses backrefs to search for clone candidates.
Without this patch, performance drops when running the following script.
The script creates a 10G file whose extents are all 64K in size.
It then generates a shared backref for each data extent, and those backrefs
cannot be found when doing btrfs_resolve_indirect_refs.

item 87 key (11843469312 EXTENT_ITEM 65536) itemoff 10475 itemsize 66
    refs 3 gen 74 flags DATA
    extent data backref root 256 objectid 260 offset 10289152 count 2
    # This shared backref couldn't be found when resolving
    # indirect ref from snapshot of sub 256
    shared data backref parent 2303049728 count 1

btrfs subvolume create /volume1/sub1
for i in `seq 1 163840`; do dd if=/dev/zero of=/volume1/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct 2>/dev/null; done
btrfs subvolume snapshot /volume1/sub1 /volume1/sub2
for i in `seq 1 163840`; do dd if=/dev/zero of=/volume1/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct 2>/dev/null; done
btrfs subvolume snapshot -r /volume1/sub1 /volume1/snap1
time btrfs send /volume1/snap1 | btrfs receive /volume2

without this patch
real 69m48.124s
user 0m50.199s
sys  70m15.600s

with this patch
real 1m31.498s
user 0m35.858s
sys  2m55.544s

Signed-off-by: ethanwu <ethanwu@synology.com>
---
 fs/btrfs/backref.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

Comments

Qu Wenruo Jan. 3, 2020, 10:15 a.m. UTC | #1
On 2020/1/3 5:44 PM, ethanwu wrote:
> Btrfs has two types of data backref.
> For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
> exact block number. Therefore, we need to call resolve_indirect_refs
> which uses btrfs_search_slot to locate the leaf block. After that,
> we need to walk through the leafs to search for the EXTENT_DATA items
> that have disk bytenr matching the extent item(add_all_parents).
> 
> The only conditions we'll stop searching are
> 1. We find different object id or type is not EXTENT_DATA
> 2. We've already got all the refs we want(total_refs)
> 
> Take the following EXTENT_ITEM as example:
> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
>     extent refs 24 gen 7302 flags DATA
>     extent data backref root 257 objectid 260 offset 65536 count 5 #backref entry 1
>     extent data backref root 258 objectid 265 offset 0 count 9 #backref entry 2
>     shared data backref parent 394985472 count 10 #backref entry 3
> 
> If we want to search for backref entry 1, total_refs here would be 24 rather
> than its count 5.
> 
> The reason to use 24 is because some EXTENT_DATA in backref entry 3 block
> 394985472 also points to EXTENT_ITEM 40831553536, if this block also belongs to
> root 257 and lies between these 5 items of backref entry 1,
> and we use total_refs = 5, we'll end up missing some refs from backref
> entry 1.

Indeed looks like a problem.

> 
> But using total_refs=24 is not accurate. We'll never find extent data keys in
> backref entry 2, since we searched root 257 not 258. We'll never reach block
> 394985472 either if this block is not a leaf in root 257.
> As a result, the loop keeps on going until we reach the end of that inode.
> 
> Since we're searching for parent block of this backref entry 1,
> we're 100% sure we'll never find any EXTENT_DATA beyond (65536 + 4194304) that
> matching this entry.

The backref offset has always been a bug-prone member, so I'd like to
double-check this.

What if the backref offset already underflows?
Like this:
  item 10 key (13631488 EXTENT_ITEM 1048576) itemoff 15860 itemsize 111
       refs 3 gen 6 flags DATA
       extent data backref root FS_TREE objectid 259 offset 18446744073709547520 count 1 <<<
       extent data backref root FS_TREE objectid 257 offset 0 count 1
       extent data backref root FS_TREE objectid 258 offset 4096 count 1


Since the backref offset is not the file offset, but rather
file offset - file_extent_item::offset, it can be a super large number for
reflinked extents.
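
As a concrete sketch of such an underflow (hypothetical numbers, only to show
the u64 wraparound, matching the dump above):

    u64 file_pos = 0;            /* key.offset of the EXTENT_DATA item */
    u64 extent_offset = 4096;    /* file_extent_item::offset */
    u64 backref_offset = file_pos - extent_offset;
    /* backref_offset == 18446744073709547520, i.e. (u64)-4096 */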


The current kernel handles this with a very ugly but working hack: resetting
key_for_search.offset to 0 in add_prelim_ref() if it detects such a case.

Then this would screw up your check, causing an unexpected early exit.

I guess we have to find a new method to solve the problem then.

Thanks,
Qu

> If there's any EXTENT_DATA with offset beyond this range
> using this extent item, its backref must be stored at different backref entry.
> That EXTENT_DATA will be handled when we process that backref entry.
> 
> Fix this by breaking from loop if we reach offset + (size of EXTENT_ITEM).
> 
> btrfs send use backref to search for clone candidate.
> Without this patch, performance drops when running following script.
> This script creates a 10G file with all of its extent size 64K.
> Then it generates shared backref for each data extent, and
> those backrefs could not be found when doing btrfs_resolve_indirect_refs.
> 
> item 87 key (11843469312 EXTENT_ITEM 65536) itemoff 10475 itemsize 66
>     refs 3 gen 74 flags DATA
>     extent data backref root 256 objectid 260 offset 10289152 count 2
>     # This shared backref couldn't be found when resolving
>     # indirect ref from snapshot of sub 256
>     shared data backref parent 2303049728 count 1
> 
> btrfs subvolume create /volume1/sub1
> for i in `seq 1 163840`; do dd if=/dev/zero of=/volume1/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct 2>/dev/null; done
> btrfs subvolume snapshot /volume1/sub1 /volume1/sub2
> for i in `seq 1 163840`; do dd if=/dev/zero of=/volume1/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct 2>/dev/null; done
> btrfs subvolume snapshot -r /volume1/sub1 /volume1/snap1
> time btrfs send /volume1/snap1 | btrfs receive /volume2
> 
> without this patch
> real 69m48.124s
> user 0m50.199s
> sys  70m15.600s
> 
> with this patch
> real 1m31.498s
> user 0m35.858s
> sys  2m55.544s
> 
> Signed-off-by: ethanwu <ethanwu@synology.com>
> ---
>  fs/btrfs/backref.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index e5d8531..ae64995 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -412,7 +412,7 @@ static int add_indirect_ref(const struct btrfs_fs_info *fs_info,
>  static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
>  			   struct ulist *parents, struct prelim_ref *ref,
>  			   int level, u64 time_seq, const u64 *extent_item_pos,
> -			   u64 total_refs, bool ignore_offset)
> +			   u64 total_refs, bool ignore_offset, u64 num_bytes)
>  {
>  	int ret = 0;
>  	int slot;
> @@ -458,6 +458,9 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
>  		fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
>  		disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
>  
> +		if (key_for_search->type == BTRFS_EXTENT_DATA_KEY &&
> +		    key.offset >= key_for_search->offset + num_bytes)
> +		       break;
>  		if (disk_byte == wanted_disk_byte) {
>  			eie = NULL;
>  			old = NULL;
> @@ -504,7 +507,7 @@ static int resolve_indirect_ref(struct btrfs_fs_info *fs_info,
>  				struct btrfs_path *path, u64 time_seq,
>  				struct prelim_ref *ref, struct ulist *parents,
>  				const u64 *extent_item_pos, u64 total_refs,
> -				bool ignore_offset)
> +				bool ignore_offset, u64 num_bytes)
>  {
>  	struct btrfs_root *root;
>  	struct btrfs_key root_key;
> @@ -575,7 +578,8 @@ static int resolve_indirect_ref(struct btrfs_fs_info *fs_info,
>  	}
>  
>  	ret = add_all_parents(root, path, parents, ref, level, time_seq,
> -			      extent_item_pos, total_refs, ignore_offset);
> +			      extent_item_pos, total_refs, ignore_offset,
> +			      num_bytes);
>  out:
>  	path->lowest_level = 0;
>  	btrfs_release_path(path);
> @@ -610,7 +614,8 @@ static int resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>  				 struct btrfs_path *path, u64 time_seq,
>  				 struct preftrees *preftrees,
>  				 const u64 *extent_item_pos, u64 total_refs,
> -				 struct share_check *sc, bool ignore_offset)
> +				 struct share_check *sc, bool ignore_offset,
> +				 u64 num_bytes)
>  {
>  	int err;
>  	int ret = 0;
> @@ -655,7 +660,7 @@ static int resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>  		}
>  		err = resolve_indirect_ref(fs_info, path, time_seq, ref,
>  					   parents, extent_item_pos,
> -					   total_refs, ignore_offset);
> +					   total_refs, ignore_offset, num_bytes);
>  		/*
>  		 * we can only tolerate ENOENT,otherwise,we should catch error
>  		 * and return directly.
> @@ -1127,6 +1132,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>  	struct extent_inode_elem *eie = NULL;
>  	/* total of both direct AND indirect refs! */
>  	u64 total_refs = 0;
> +	u64 num_bytes = SZ_256M;
>  	struct preftrees preftrees = {
>  		.direct = PREFTREE_INIT,
>  		.indirect = PREFTREE_INIT,
> @@ -1194,6 +1200,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>  				goto again;
>  			}
>  			spin_unlock(&delayed_refs->lock);
> +			num_bytes = head->num_bytes;
>  			ret = add_delayed_refs(fs_info, head, time_seq,
>  					       &preftrees, &total_refs, sc);
>  			mutex_unlock(&head->mutex);
> @@ -1215,6 +1222,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>  		if (key.objectid == bytenr &&
>  		    (key.type == BTRFS_EXTENT_ITEM_KEY ||
>  		     key.type == BTRFS_METADATA_ITEM_KEY)) {
> +			num_bytes = key.offset;
>  			ret = add_inline_refs(fs_info, path, bytenr,
>  					      &info_level, &preftrees,
>  					      &total_refs, sc);
> @@ -1236,7 +1244,8 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>  	WARN_ON(!RB_EMPTY_ROOT(&preftrees.indirect_missing_keys.root.rb_root));
>  
>  	ret = resolve_indirect_refs(fs_info, path, time_seq, &preftrees,
> -				    extent_item_pos, total_refs, sc, ignore_offset);
> +				    extent_item_pos, total_refs, sc, ignore_offset,
> +				    num_bytes);
>  	if (ret)
>  		goto out;
>  
>
ethanwu Jan. 3, 2020, 11:37 a.m. UTC | #2
Qu Wenruo wrote on 2020-01-03 18:15:
> On 2020/1/3 5:44 PM, ethanwu wrote:
>> Btrfs has two types of data backref.
>> For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
>> exact block number. Therefore, we need to call resolve_indirect_refs
>> which uses btrfs_search_slot to locate the leaf block. After that,
>> we need to walk through the leafs to search for the EXTENT_DATA items
>> that have disk bytenr matching the extent item(add_all_parents).
>> 
>> The only conditions we'll stop searching are
>> 1. We find different object id or type is not EXTENT_DATA
>> 2. We've already got all the refs we want(total_refs)
>> 
>> Take the following EXTENT_ITEM as example:
>> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 
>> 95
>>     extent refs 24 gen 7302 flags DATA
>>     extent data backref root 257 objectid 260 offset 65536 count 5 
>> #backref entry 1
>>     extent data backref root 258 objectid 265 offset 0 count 9 
>> #backref entry 2
>>     shared data backref parent 394985472 count 10 #backref entry 3
>> 
>> If we want to search for backref entry 1, total_refs here would be 24 
>> rather
>> than its count 5.
>> 
>> The reason to use 24 is because some EXTENT_DATA in backref entry 3 
>> block
>> 394985472 also points to EXTENT_ITEM 40831553536, if this block also 
>> belongs to
>> root 257 and lies between these 5 items of backref entry 1,
>> and we use total_refs = 5, we'll end up missing some refs from backref
>> entry 1.
> 
> Indeed looks like a problem.
> 
>> 
>> But using total_refs=24 is not accurate. We'll never find extent data 
>> keys in
>> backref entry 2, since we searched root 257 not 258. We'll never reach 
>> block
>> 394985472 either if this block is not a leaf in root 257.
>> As a result, the loop keeps on going until we reach the end of that 
>> inode.
>> 
>> Since we're searching for parent block of this backref entry 1,
>> we're 100% sure we'll never find any EXTENT_DATA beyond (65536 + 
>> 4194304) that
>> matching this entry.
> 
> Backref offset is always a bug-prone member, thus I hope to double 
> check
> on this.
> 
> What if the backref offset already underflows?
> Like this:
>   item 10 key (13631488 EXTENT_ITEM 1048576) itemoff 15860 itemsize 111
>        refs 3 gen 6 flags DATA
>        extent data backref root FS_TREE objectid 259 offset
> 18446744073709547520 count 1 <<<
>        extent data backref root FS_TREE objectid 257 offset 0 count 1
>        extent data backref root FS_TREE objectid 258 offset 4096 count 
> 1
> 
> 
> Since backref offset is not file offset, but file_extent_item::offset -
> file_offset, it can be a super large number for reflinked extents.
> 
> 
> Current kernel handles this by a very ugly but working hack: resetting
> key_for_search.offset to 0 in add_prelim_ref() if it detects such case.
> 
> Then this would screw up your check, causing unexpected early exit.

Thanks for the reminder.
I think in this case the check won't fail. Even if we revert that working
hack in the future, it still works, as long as the calculation is done in
u64 (the addition simply wraps around).

(u64) 18446744073709547520 = (s64) -4096

Suppose this very large offset is equal to X
            The next line is the original file view.
            [                                 ]
            ^           ^                     ^     ......    ^
            0           (u64)X + num_bytes    EOF             X
       [----oooooooooooo]  Original range to check. - part is
       ^                   the very large offset where no file extents
       X in terms of s64   exist, so actually the range [0,X+num_bytes)
            [oooooooooooooooo]  range to check after hack X=>0
                             ^
                             0 + num_bytes

With my patch, applying this hack only makes my check condition looser,
causing a larger range to be checked (represented by o) compared to no hack.

The only way I think this check could fail would be:
a file at offset 2^64 - 4096 uses offset 0 of a 1MB data extent, so
key_for_search->offset + num_bytes = 2^64 - 4096 + 1048576 = 1044480 (mod 2^64).
Therefore, when iterating through the leaves, we'd break early at
offset 1044480 and leave the EXTENT_DATA key at 2^64 - 4096 behind.
But AFAIK, a file of that size is not allowed in btrfs.
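
For completeness, the wraparound arithmetic of that hypothetical case as C
(made-up values, not from a real dump):

    u64 offset = 18446744073709547520ULL;    /* 2^64 - 4096 */
    u64 num_bytes = 1048576;                 /* 1MB data extent */
    u64 end = offset + num_bytes;            /* wraps around to 1044480 */
    /* any EXTENT_DATA key.offset >= 1044480 would hit the early break */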

Thanks,
ethanwu
> 
> I guess we have to find a new method to solve the problem then.
> 
> Thanks,
> Qu
> 
>> If there's any EXTENT_DATA with offset beyond this range
>> using this extent item, its backref must be stored at different 
>> backref entry.
>> That EXTENT_DATA will be handled when we process that backref entry.
>> 
>> Fix this by breaking from loop if we reach offset + (size of 
>> EXTENT_ITEM).
>> 
>> btrfs send use backref to search for clone candidate.
>> Without this patch, performance drops when running following script.
>> This script creates a 10G file with all of its extent size 64K.
>> Then it generates shared backref for each data extent, and
>> those backrefs could not be found when doing 
>> btrfs_resolve_indirect_refs.
>> 
>> item 87 key (11843469312 EXTENT_ITEM 65536) itemoff 10475 itemsize 66
>>     refs 3 gen 74 flags DATA
>>     extent data backref root 256 objectid 260 offset 10289152 count 2
>>     # This shared backref couldn't be found when resolving
>>     # indirect ref from snapshot of sub 256
>>     shared data backref parent 2303049728 count 1
>> 
>> btrfs subvolume create /volume1/sub1
>> for i in `seq 1 163840`; do dd if=/dev/zero of=/volume1/sub1/file 
>> bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct 2>/dev/null; 
>> done
>> btrfs subvolume snapshot /volume1/sub1 /volume1/sub2
>> for i in `seq 1 163840`; do dd if=/dev/zero of=/volume1/sub1/file 
>> bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct 
>> 2>/dev/null; done
>> btrfs subvolume snapshot -r /volume1/sub1 /volume1/snap1
>> time btrfs send /volume1/snap1 | btrfs receive /volume2
>> 
>> without this patch
>> real 69m48.124s
>> user 0m50.199s
>> sys  70m15.600s
>> 
>> with this patch
>> real 1m31.498s
>> user 0m35.858s
>> sys  2m55.544s
>> 
>> Signed-off-by: ethanwu <ethanwu@synology.com>
>> ---
>>  fs/btrfs/backref.c | 21 +++++++++++++++------
>>  1 file changed, 15 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
>> index e5d8531..ae64995 100644
>> --- a/fs/btrfs/backref.c
>> +++ b/fs/btrfs/backref.c
>> @@ -412,7 +412,7 @@ static int add_indirect_ref(const struct 
>> btrfs_fs_info *fs_info,
>>  static int add_all_parents(struct btrfs_root *root, struct btrfs_path 
>> *path,
>>  			   struct ulist *parents, struct prelim_ref *ref,
>>  			   int level, u64 time_seq, const u64 *extent_item_pos,
>> -			   u64 total_refs, bool ignore_offset)
>> +			   u64 total_refs, bool ignore_offset, u64 num_bytes)
>>  {
>>  	int ret = 0;
>>  	int slot;
>> @@ -458,6 +458,9 @@ static int add_all_parents(struct btrfs_root 
>> *root, struct btrfs_path *path,
>>  		fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
>>  		disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
>> 
>> +		if (key_for_search->type == BTRFS_EXTENT_DATA_KEY &&
>> +		    key.offset >= key_for_search->offset + num_bytes)
>> +		       break;
>>  		if (disk_byte == wanted_disk_byte) {
>>  			eie = NULL;
>>  			old = NULL;
>> @@ -504,7 +507,7 @@ static int resolve_indirect_ref(struct 
>> btrfs_fs_info *fs_info,
>>  				struct btrfs_path *path, u64 time_seq,
>>  				struct prelim_ref *ref, struct ulist *parents,
>>  				const u64 *extent_item_pos, u64 total_refs,
>> -				bool ignore_offset)
>> +				bool ignore_offset, u64 num_bytes)
>>  {
>>  	struct btrfs_root *root;
>>  	struct btrfs_key root_key;
>> @@ -575,7 +578,8 @@ static int resolve_indirect_ref(struct 
>> btrfs_fs_info *fs_info,
>>  	}
>> 
>>  	ret = add_all_parents(root, path, parents, ref, level, time_seq,
>> -			      extent_item_pos, total_refs, ignore_offset);
>> +			      extent_item_pos, total_refs, ignore_offset,
>> +			      num_bytes);
>>  out:
>>  	path->lowest_level = 0;
>>  	btrfs_release_path(path);
>> @@ -610,7 +614,8 @@ static int resolve_indirect_refs(struct 
>> btrfs_fs_info *fs_info,
>>  				 struct btrfs_path *path, u64 time_seq,
>>  				 struct preftrees *preftrees,
>>  				 const u64 *extent_item_pos, u64 total_refs,
>> -				 struct share_check *sc, bool ignore_offset)
>> +				 struct share_check *sc, bool ignore_offset,
>> +				 u64 num_bytes)
>>  {
>>  	int err;
>>  	int ret = 0;
>> @@ -655,7 +660,7 @@ static int resolve_indirect_refs(struct 
>> btrfs_fs_info *fs_info,
>>  		}
>>  		err = resolve_indirect_ref(fs_info, path, time_seq, ref,
>>  					   parents, extent_item_pos,
>> -					   total_refs, ignore_offset);
>> +					   total_refs, ignore_offset, num_bytes);
>>  		/*
>>  		 * we can only tolerate ENOENT,otherwise,we should catch error
>>  		 * and return directly.
>> @@ -1127,6 +1132,7 @@ static int find_parent_nodes(struct 
>> btrfs_trans_handle *trans,
>>  	struct extent_inode_elem *eie = NULL;
>>  	/* total of both direct AND indirect refs! */
>>  	u64 total_refs = 0;
>> +	u64 num_bytes = SZ_256M;
>>  	struct preftrees preftrees = {
>>  		.direct = PREFTREE_INIT,
>>  		.indirect = PREFTREE_INIT,
>> @@ -1194,6 +1200,7 @@ static int find_parent_nodes(struct 
>> btrfs_trans_handle *trans,
>>  				goto again;
>>  			}
>>  			spin_unlock(&delayed_refs->lock);
>> +			num_bytes = head->num_bytes;
>>  			ret = add_delayed_refs(fs_info, head, time_seq,
>>  					       &preftrees, &total_refs, sc);
>>  			mutex_unlock(&head->mutex);
>> @@ -1215,6 +1222,7 @@ static int find_parent_nodes(struct 
>> btrfs_trans_handle *trans,
>>  		if (key.objectid == bytenr &&
>>  		    (key.type == BTRFS_EXTENT_ITEM_KEY ||
>>  		     key.type == BTRFS_METADATA_ITEM_KEY)) {
>> +			num_bytes = key.offset;
>>  			ret = add_inline_refs(fs_info, path, bytenr,
>>  					      &info_level, &preftrees,
>>  					      &total_refs, sc);
>> @@ -1236,7 +1244,8 @@ static int find_parent_nodes(struct 
>> btrfs_trans_handle *trans,
>>  
>> 	WARN_ON(!RB_EMPTY_ROOT(&preftrees.indirect_missing_keys.root.rb_root));
>> 
>>  	ret = resolve_indirect_refs(fs_info, path, time_seq, &preftrees,
>> -				    extent_item_pos, total_refs, sc, ignore_offset);
>> +				    extent_item_pos, total_refs, sc, ignore_offset,
>> +				    num_bytes);
>>  	if (ret)
>>  		goto out;
>> 
>>
Qu Wenruo Jan. 3, 2020, 12:32 p.m. UTC | #3
On 2020/1/3 7:37 PM, ethanwu wrote:
> Qu Wenruo wrote on 2020-01-03 18:15:
>> On 2020/1/3 5:44 PM, ethanwu wrote:
[snip]
>> What if the backref offset already underflows?
>> Like this:
>>   item 10 key (13631488 EXTENT_ITEM 1048576) itemoff 15860 itemsize 111
>>        refs 3 gen 6 flags DATA
>>        extent data backref root FS_TREE objectid 259 offset
>> 18446744073709547520 count 1 <<<
>>        extent data backref root FS_TREE objectid 257 offset 0 count 1
>>        extent data backref root FS_TREE objectid 258 offset 4096 count 1
>>
>>
>> Since backref offset is not file offset, but file_extent_item::offset -
>> file_offset, it can be a super large number for reflinked extents.
>>
>>
>> Current kernel handles this by a very ugly but working hack: resetting
>> key_for_search.offset to 0 in add_prelim_ref() if it detects such case.
>>
>> Then this would screw up your check, causing unexpected early exit.
> 
> Thanks for the reminder.
> I think in this case the check won't fail. Even when we revert the
> working hack
> in the future, it still works unless we use u64 to do the calculation.
> 
> (u64) 18446744073709547520 = (s64) -4096
> 
> Suppose this very large offset is equal to X
>            The next line is the original file view.
>            [                                 ]
>            ^           ^                     ^     ......    ^
>            0           (u64)X + num_bytes    EOF             X
>       [----oooooooooooo]  Original range to check. - part is
>       ^                   the very large offset where no file extents
>       X in terms of s64   exist, so actually the range [0,X+num_bytes)
>            [oooooooooooooooo]  range to check after hack X=>0
>                             ^
>                             0 + num_bytes
> 
> With my patch, applying this hack will only make my check condition looser.
> Causing more range to be checked (represented by o) compared to no hack.

Ah, you're right.

Since file_extent_item::offset can never be larger than the extent size, the
backref offset can only be in the range (file_pos - extent_size, file_pos].

If we get a negative backref offset, it means file_pos -
file_extent_item::offset < 0, which means file_pos < file_extent_item::offset.
And since file_extent_item::offset < extent_size, file_pos < extent_size.

Thus even with the current hack the check still works, as we search from
file_pos 0 and end at file_pos extent_size, which covers the file_extent_item.

Great explanation and patch.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> 
> The only way I think this check would fail would be:
> File at offset 2^64 - 4096 uses offset 0 of a 1MB data extent,
> key_for_search->offset + num_bytes = 2^64 - 4096 + 1048576 = 1044480
> Therefore, when iterating through the leafs, we'll break early at
> offset 1044480, leave the EXTENT_DATA key @2^64 - 4096 behind.
> But AFAIK, file of that size is not allowed in btrfs.
> 
> Thanks,
> ethanwu
>>
>> I guess we have to find a new method to solve the problem then.
>>
>> Thanks,
>> Qu
>>
[...]
>>> +        if (key_for_search->type == BTRFS_EXTENT_DATA_KEY &&
>>> +            key.offset >= key_for_search->offset + num_bytes)
>>> +               break;
>>>          if (disk_byte == wanted_disk_byte) {
Josef Bacik Jan. 3, 2020, 4:31 p.m. UTC | #4
On 1/3/20 4:44 AM, ethanwu wrote:
> Btrfs has two types of data backref.
> For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
> exact block number. Therefore, we need to call resolve_indirect_refs
> which uses btrfs_search_slot to locate the leaf block. After that,
> we need to walk through the leafs to search for the EXTENT_DATA items
> that have disk bytenr matching the extent item(add_all_parents).
> 
> The only conditions we'll stop searching are
> 1. We find different object id or type is not EXTENT_DATA
> 2. We've already got all the refs we want(total_refs)
> 
> Take the following EXTENT_ITEM as example:
> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
>      extent refs 24 gen 7302 flags DATA
>      extent data backref root 257 objectid 260 offset 65536 count 5 #backref entry 1
>      extent data backref root 258 objectid 265 offset 0 count 9 #backref entry 2
>      shared data backref parent 394985472 count 10 #backref entry 3
> 
> If we want to search for backref entry 1, total_refs here would be 24 rather
> than its count 5.
> 
> The reason to use 24 is because some EXTENT_DATA in backref entry 3 block
> 394985472 also points to EXTENT_ITEM 40831553536, if this block also belongs to
> root 257 and lies between these 5 items of backref entry 1,
> and we use total_refs = 5, we'll end up missing some refs from backref
> entry 1.
> 

This seems like the crux of the problem here.  The backref stuff is just blindly 
looking for counts, without keeping track of which counts matter.  So for full 
refs we should only be looking down paths where generation > the snapshot 
generation.  And then for the shared refs it should be anything that comes from 
that shared block.  That would be the proper way to fix the problem, not put 
some arbitrary limit on how far into the inode we can search.

That's not to say what you are doing here is wrong; we really won't have
anything past the given extent size, so we can definitely break out earlier.
But what I worry about is: say 394985472 _was_ in between the leaves while
searching down for backref entry #1; we'd end up with duplicate entries and
not catch some of the other entries.  This feels like we need to fix the
backref logic to know whether it's looking for direct refs, and thus only go
down paths with generation > snapshot generation, or shared refs, and thus
only count things that directly point to the parent block.  Thanks,

Josef
ethanwu Jan. 6, 2020, 3:45 a.m. UTC | #5
Josef Bacik wrote on 2020-01-04 00:31:
> On 1/3/20 4:44 AM, ethanwu wrote:
>> Btrfs has two types of data backref.
>> For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
>> exact block number. Therefore, we need to call resolve_indirect_refs
>> which uses btrfs_search_slot to locate the leaf block. After that,
>> we need to walk through the leafs to search for the EXTENT_DATA items
>> that have disk bytenr matching the extent item(add_all_parents).
>> 
>> The only conditions we'll stop searching are
>> 1. We find different object id or type is not EXTENT_DATA
>> 2. We've already got all the refs we want(total_refs)
>> 
>> Take the following EXTENT_ITEM as example:
>> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 
>> 95
>>      extent refs 24 gen 7302 flags DATA
>>      extent data backref root 257 objectid 260 offset 65536 count 5 
>> #backref entry 1
>>      extent data backref root 258 objectid 265 offset 0 count 9 
>> #backref entry 2
>>      shared data backref parent 394985472 count 10 #backref entry 3
>> 
>> If we want to search for backref entry 1, total_refs here would be 24 
>> rather
>> than its count 5.
>> 
>> The reason to use 24 is because some EXTENT_DATA in backref entry 3 
>> block
>> 394985472 also points to EXTENT_ITEM 40831553536, if this block also 
>> belongs to
>> root 257 and lies between these 5 items of backref entry 1,
>> and we use total_refs = 5, we'll end up missing some refs from backref
>> entry 1.
>> 
> 
> This seems like the crux of the problem here.  The backref stuff is
> just blindly looking for counts, without keeping track of which counts
> matter.  So for full refs we should only be looking down paths where
> generation > the snapshot generation.  And then for the shared refs it
> should be anything that comes from that shared block.  That would be
> the proper way to fix the problem, not put some arbitrary limit on how
> far into the inode we can search.
> 

I am not sure generation can be used to skip blocks for full (indirect)
backrefs.

For example:
create a data extent in subvol 257 at generation 10.
At generation 11, take a snapshot (suppose the snapshot id is 258) of
subvol 257.

When we send snapshot 258, all the tree blocks it searches come from
subvol 257. Since a snapshot only copies the root node from its source,
none of the tree blocks in subvol 257 has generation (all <= 10) > snapshot
generation (11).

Or am I missing something?

> That's not to say what you are doing here is wrong, we really won't
> have anything past the given extent size so we can definitely break
> out earlier.  But what I worry about is say 394985472 _was_ in between
> the leaves while searching down for backref entry #1, we'd end up with
> duplicate entries and not catch some of the other entries.  This feels

This patch doesn't adjust total_refs. Is there any example where this
patch would break the backref walking?

> like we need to fix the backref logic to know if it's looking for
> direct refs, and thus only go down paths with generation > snapshot
> generation, or shared refs and thus only count things that directly
> point to the parent block.  Thanks,
> 

OK, I agree, my patch doesn't solve the original problem:
when resolving indirect refs, we could take entries that don't belong to
the backref entry we are searching for right now.

If this needs to be fixed, I think it could be done in the following way:

item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
    extent refs 24 gen 7302 flags DATA
    shared data backref parent 394985472 count 10 #backref entry 1
    extent data backref root 257 objectid 260 offset 1048576 count 3 #backref entry 2
    extent data backref root 256 objectid 260 offset 65536 count 6 #backref entry 3
    extent data backref root 257 objectid 260 offset 65536 count 5 #backref entry 4

When searching for entry 4, the EXTENT_DATA entries that match the
EXTENT_ITEM bytenr will be in one of the following situations:

1. A shared block that just happens to be part of root 257. For every leaf
   we run into, check its bytenr to see if it is a shared data backref
   entry, and if so skip it. We may need an extra list or rb tree to store
   this information.
2. Same subvol and inode, but a different offset. Right now in
   add_all_parents we only check whether the bytenr matches. Add an extra
   check that the backref offset is also the same (here backref entry 4:
   65536 != entry 2: 1048576).
3. This might happen if subvol 257 is a snapshot of subvol 256: check the
   leaf owner, and if it is not 257, skip it.
4. None of the above; it belongs to backref entry 4, which is what we want,
   so add it!

In this way we only count the entries that matter, and total_refs could be
changed from the total refs of that extent item to the count of the
individual entry, so we could break out of the loop as soon as possible.
A rough sketch of such filtering is below.
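
Something along these lines (hypothetical and untested; is_shared_leaf() and
shared_parents are made-up helpers backed by the extra list/rbtree from
point 1):

    /* inside the add_all_parents() walk, for each EXTENT_DATA item */
    eb = path->nodes[0];

    /* 3. leaf not owned by the root we are resolving: skip it */
    if (btrfs_header_owner(eb) != ref->root_id)
        goto next;

    /* 1. leaf reachable only through a shared data backref: skip it */
    if (is_shared_leaf(shared_parents, eb->start))
        goto next;

    /* 2. and 4. only accept the exact (disk bytenr, backref offset) match */
    if (disk_byte == wanted_disk_byte &&
        key.offset - btrfs_file_extent_offset(eb, fi) ==
        ref->key_for_search.offset)
        count++;    /* this EXTENT_DATA really belongs to this backref */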

Will this look better?

thanks,
ethanwu

> Josef
Josef Bacik Jan. 6, 2020, 4:05 p.m. UTC | #6
On 1/5/20 10:45 PM, ethanwu wrote:
> Josef Bacik wrote on 2020-01-04 00:31:
>> On 1/3/20 4:44 AM, ethanwu wrote:
>>> Btrfs has two types of data backref.
>>> For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
>>> exact block number. Therefore, we need to call resolve_indirect_refs
>>> which uses btrfs_search_slot to locate the leaf block. After that,
>>> we need to walk through the leafs to search for the EXTENT_DATA items
>>> that have disk bytenr matching the extent item(add_all_parents).
>>>
>>> The only conditions we'll stop searching are
>>> 1. We find different object id or type is not EXTENT_DATA
>>> 2. We've already got all the refs we want(total_refs)
>>>
>>> Take the following EXTENT_ITEM as example:
>>> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
>>>      extent refs 24 gen 7302 flags DATA
>>>      extent data backref root 257 objectid 260 offset 65536 count 5 #backref 
>>> entry 1
>>>      extent data backref root 258 objectid 265 offset 0 count 9 #backref entry 2
>>>      shared data backref parent 394985472 count 10 #backref entry 3
>>>
>>> If we want to search for backref entry 1, total_refs here would be 24 rather
>>> than its count 5.
>>>
>>> The reason to use 24 is because some EXTENT_DATA in backref entry 3 block
>>> 394985472 also points to EXTENT_ITEM 40831553536, if this block also belongs to
>>> root 257 and lies between these 5 items of backref entry 1,
>>> and we use total_refs = 5, we'll end up missing some refs from backref
>>> entry 1.
>>>
>>
>> This seems like the crux of the problem here.  The backref stuff is
>> just blindly looking for counts, without keeping track of which counts
>> matter.  So for full refs we should only be looking down paths where
>> generation > the snapshot generation.  And then for the shared refs it
>> should be anything that comes from that shared block.  That would be
>> the proper way to fix the problem, not put some arbitrary limit on how
>> far into the inode we can search.
>>
> 
> I am not sure if generation could be used to skip blocks for full(indirect) 
> backref.
> 
> For exmple:
> create a data extent in subvol id 257 at generation 10
> At generation 11, take snapshot(suppose the snapshot id is 258) from subvol 257.
> 
> When we send snapshot 258, all the tree blocks it searches comes from subvol 257,
> since snapshot only copy root node from its source,
> none of tree blocks in subvol 257 has generation(all <= 10) > snapshot 
> generation(11)
> 
> Or do I miss something?

Nope, I was saying it wrong, sorry about that.  What I should have said is:
for "backref entry 1" we should _only_ walk down paths that belong to root
257, and then for root 258 we should _only_ walk down paths that belong to
258, and then we do our normal dance for indirect refs.

> 
>> That's not to say what you are doing here is wrong, we really won't
>> have anything past the given extent size so we can definitely break
>> out earlier.  But what I worry about is say 394985472 _was_ in between
>> the leaves while searching down for backref entry #1, we'd end up with
>> duplicate entries and not catch some of the other entries.  This feels
> 
> This patch doesn't adjust the total_refs. Is there any example that
> this patch will ruin the backref walking?

No I'm talking about a general failure of the current code, your patch doesn't 
make it better or worse.

> 
>> like we need to fix the backref logic to know if it's looking for
>> direct refs, and thus only go down paths with generation > snapshot
>> generation, or shared refs and thus only count things that directly
>> point to the parent block.  Thanks,
>>
> 
> Ok, I agree, my patch doesn't solve the original problem:
> When resolving indirect refs, we could take entries that don't
> belong to the backref entry we are searching for right now.
> 
> If this need to be fixed, I think it could be done by the following way
> 
> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
>          extent refs 24 gen 7302 flags DATA
>          shared data backref parent 394985472 count 10 #backref entry 1
>          extent data backref root 257 objectid 260 offset 1048576 count 3 
> #backref entry 2
>          extent data backref root 256 objectid 260 offset 65536 count 6 #backref 
> entry 3
>          extent data backref root 257 objectid 260 offset 65536 count 5 #backref 
> entry 4
> 
> When searching for entry 4, the EXTENT_DATA entries that match the EXTENT_ITEM 
> bytenr
> will be in one of the following situations:
> 
> 1. shared block that just happens to be part of root 257. For every leaf we run 
> into,
>     check its bytenr to see if it is a shared data backref entry, if so skip it.
>     We may need an extra list or rb tree to store this information.

We don't need to worry about this case, because if we have a normal ref then
the whole path down to that bytenr belongs wholly to that root.  The full
backref will only be in paths that were not touched by the referencing root.

> 2. same subvol, inode but different offset. Right now in add_all_parents, we only
>     check if bytenr matches. Adding extra check to see if backref offset is the 
> same
>     (here backref entry 1: 65536 != entry 3: 1048576)

Yeah we definitely need to do this because clone can change the offset and point 
at the same bytenr, so we for sure want to only match on matching offsets.

> 3. This might happen if subvol 257 is a snapshot from subvol 256, check leaf 
> owner, if
>     not 257 skip it.

Yup, this is the "only walk down paths owned by the root" thing.

> 4. None of the above, it's type 4 backref entry, this is what we want, add it!
> 
> In this way, we only count entries that matter, and total_refs could be
> changed from total refs of that extent item to number of each entry.
> Then, we could break from loop as soon as possible.
> 
> Will this look better?
>

Yup this is what I want, because then we are for sure always getting exactly the 
right refs.  I would do

1) Make a path->only_this_root (or something better named) that _only_ walks
down paths that are owned by the original root (a rough sketch follows this
list).  We use this for non-full backref walking (backref entries 2, 3, and 4
in the example above).  If we have a normal extent backref that references a
real root, then we know for sure that if we walk down to that bytenr from
that root, the whole path will be owned by that root.

2) Match on the offset in the extent data ref.  We don't want to find unrelated 
clones, we want exactly the right match.

3) Keep track of how many refs for that backref.  For the backref entry #4 
example we'd use the above strategies and break as soon as we found 5 entries.

4) Take a good hard look at the indirect backref resolution code.  I feel like 
it's probably mostly ok, we just fix the above things for normal backref lookups 
and it'll just work, so we'll definitely only find references that come from 
that indirect backref.  But I haven't looked too closely so I could be wrong.
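
As a very rough illustration of point 1 (only_this_root is the hypothetical
field named above; this only shows the leaf-level form of the check, not real
kernel code):

    /* in the add_all_parents() leaf walk: a leaf that was never COWed
     * into this root cannot carry a normal (non-shared) ref from it,
     * so skip the whole leaf
     */
    if (path->only_this_root &&
        btrfs_header_owner(path->nodes[0]) != root->root_key.objectid) {
        ret = btrfs_next_old_leaf(root, path, time_seq);
        continue;
    }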

I think we're on the same page now, hopefully it's not too much extra work.  But 
it will be by far more robust and reliable.  Thanks,

Josef
ethanwu Jan. 17, 2020, 10:44 a.m. UTC | #7
Josef Bacik wrote on 2020-01-07 00:05:
> On 1/5/20 10:45 PM, ethanwu wrote:
>> Josef Bacik wrote on 2020-01-04 00:31:
>>> On 1/3/20 4:44 AM, ethanwu wrote:
>>>> Btrfs has two types of data backref.
>>>> For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
>>>> exact block number. Therefore, we need to call resolve_indirect_refs
>>>> which uses btrfs_search_slot to locate the leaf block. After that,
>>>> we need to walk through the leafs to search for the EXTENT_DATA 
>>>> items
>>>> that have disk bytenr matching the extent item(add_all_parents).
>>>> 
>>>> The only conditions we'll stop searching are
>>>> 1. We find different object id or type is not EXTENT_DATA
>>>> 2. We've already got all the refs we want(total_refs)
>>>> 
>>>> Take the following EXTENT_ITEM as example:
>>>> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 
>>>> 95
>>>>      extent refs 24 gen 7302 flags DATA
>>>>      extent data backref root 257 objectid 260 offset 65536 count 5 
>>>> #backref entry 1
>>>>      extent data backref root 258 objectid 265 offset 0 count 9 
>>>> #backref entry 2
>>>>      shared data backref parent 394985472 count 10 #backref entry 3
>>>> 
>>>> If we want to search for backref entry 1, total_refs here would be 
>>>> 24 rather
>>>> than its count 5.
>>>> 
>>>> The reason to use 24 is because some EXTENT_DATA in backref entry 3 
>>>> block
>>>> 394985472 also points to EXTENT_ITEM 40831553536, if this block also 
>>>> belongs to
>>>> root 257 and lies between these 5 items of backref entry 1,
>>>> and we use total_refs = 5, we'll end up missing some refs from 
>>>> backref
>>>> entry 1.
>>>> 
>>> 
>>> This seems like the crux of the problem here.  The backref stuff is
>>> just blindly looking for counts, without keeping track of which 
>>> counts
>>> matter.  So for full refs we should only be looking down paths where
>>> generation > the snapshot generation.  And then for the shared refs 
>>> it
>>> should be anything that comes from that shared block.  That would be
>>> the proper way to fix the problem, not put some arbitrary limit on 
>>> how
>>> far into the inode we can search.
>>> 
>> 
>> I am not sure if generation could be used to skip blocks for 
>> full(indirect) backref.
>> 
>> For exmple:
>> create a data extent in subvol id 257 at generation 10
>> At generation 11, take snapshot(suppose the snapshot id is 258) from 
>> subvol 257.
>> 
>> When we send snapshot 258, all the tree blocks it searches comes from 
>> subvol 257,
>> since snapshot only copy root node from its source,
>> none of tree blocks in subvol 257 has generation(all <= 10) > snapshot 
>> generation(11)
>> 
>> Or do I miss something?
> 
> Nope I was saying it wrong, sorry about that.  What I should say is
> for "backref entry 1" we should _only_ walk down paths that belong to
> root 257, and then for root 258 we _only_ walk down paths that belong
> to 258, and then we do our normal dance for indirect refs.
> 
>> 
>>> That's not to say what you are doing here is wrong, we really won't
>>> have anything past the given extent size so we can definitely break
>>> out earlier.  But what I worry about is say 394985472 _was_ in 
>>> between
>>> the leaves while searching down for backref entry #1, we'd end up 
>>> with
>>> duplicate entries and not catch some of the other entries.  This 
>>> feels
>> 
>> This patch doesn't adjust the total_refs. Is there any example that
>> this patch will ruin the backref walking?
> 
> No I'm talking about a general failure of the current code, your patch
> doesn't make it better or worse.
> 
>> 
>>> like we need to fix the backref logic to know if it's looking for
>>> direct refs, and thus only go down paths with generation > snapshot
>>> generation, or shared refs and thus only count things that directly
>>> point to the parent block.  Thanks,
>>> 
>> 
>> Ok, I agree, my patch doesn't solve the original problem:
>> When resolving indirect refs, we could take entries that don't
>> belong to the backref entry we are searching for right now.
>> 
>> If this need to be fixed, I think it could be done by the following 
>> way
>> 
>> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
>>          extent refs 24 gen 7302 flags DATA
>>          shared data backref parent 394985472 count 10 #backref entry 
>> 1
>>          extent data backref root 257 objectid 260 offset 1048576 
>> count 3 #backref entry 2
>>          extent data backref root 256 objectid 260 offset 65536 count 
>> 6 #backref entry 3
>>          extent data backref root 257 objectid 260 offset 65536 count 
>> 5 #backref entry 4
>> 
>> When searching for entry 4, the EXTENT_DATA entries that match the 
>> EXTENT_ITEM bytenr
>> will be in one of the following situations:
>> 
>> 1. shared block that just happens to be part of root 257. For every 
>> leaf we run into,
>>     check its bytenr to see if it is a shared data backref entry, if 
>> so skip it.
>>     We may need an extra list or rb tree to store this information.
> 
> We don't need to worry about this case, because if we have a normal
> ref then the whole path down to that bytenr belongs wholey to that
> root.  The full backref will only be in paths that were not touched by
> the referencing root.
> 

Thank you for the review.
I didn't fully understand the way shared data backrefs are used in btrfs, so
it took me a while to check the backref code and do some experiments.
One place shared backrefs are used is balance: all the items used by the
relocation tree use shared backrefs.

After running balance, if an EXTENT_ITEM is moved, all the back references
of that newly-relocated EXTENT_ITEM become shared, and the block owner is
exactly the original root.

We can then produce a normal reference by just COWing a tree block while
leaving some of the shared backrefs unchanged (dd a 128MB extent and COW
every other 4K block, so these items span many leaves and COWing one block
leaves the other shared backrefs untouched).

In the end, we have both types of back reference from the same root, and yet
the owner of all these blocks is that same root.

Therefore, I think this condition is still needed.

>> 2. same subvol, inode but different offset. Right now in 
>> add_all_parents, we only
>>     check if bytenr matches. Adding extra check to see if backref 
>> offset is the same
>>     (here backref entry 1: 65536 != entry 3: 1048576)
> 
> Yeah we definitely need to do this because clone can change the offset
> and point at the same bytenr, so we for sure want to only match on
> matching offsets.
> 
>> 3. This might happen if subvol 257 is a snapshot from subvol 256, 
>> check leaf owner, if
>>     not 257 skip it.
> 
> Yup, this is the "only walk down paths owned by the root" thing.
> 
>> 4. None of the above, it's type 4 backref entry, this is what we want, 
>> add it!
>> 
>> In this way, we only count entries that matter, and total_refs could 
>> be
>> changed from total refs of that extent item to number of each entry.
>> Then, we could break from loop as soon as possible.
>> 
>> Will this look better?
>> 
> 
> Yup this is what I want, because then we are for sure always getting
> exactly the right refs.  I would do
> 
> 1) Make a path->only_this_root (or something better named) that _only_
> walks down paths that are owned by the original root.  We use this for
> non-full backref walking (backref entry 2, 3, and 4 in the example
> above).  If we have a normal extent backref that references a real
> root then we know for sure if we walk down to that bytenr from that
> root the whole path will be owned by that root.
> 
> 2) Match on the offset in the extent data ref.  We don't want to find
> unrelated clones, we want exactly the right match.
> 
> 3) Keep track of how many refs for that backref.  For the backref
> entry #4 example we'd use the above strategies and break as soon as we
> found 5 entries.
> 
> 4) Take a good hard look at the indirect backref resolution code.  I
> feel like it's probably mostly ok, we just fix the above things for
> normal backref lookups and it'll just work, so we'll definitely only
> find references that come from that indirect backref.  But I haven't
> looked too closely so I could be wrong.
> 
> I think we're on the same page now, hopefully it's not too much extra
> work.  But it will be by far more robust and reliable.  Thanks,
> 

Thanks for the steps. I've been looking at the backref code, and so far I
haven't found anything that might break the idea. I'll start working on it
and do some tests.

Thanks,
ethanwu

> Josef
Josef Bacik Jan. 17, 2020, 2:21 p.m. UTC | #8
On 1/17/20 5:44 AM, ethanwu wrote:
> Josef Bacik wrote on 2020-01-07 00:05:
>> On 1/5/20 10:45 PM, ethanwu wrote:
>>> Josef Bacik wrote on 2020-01-04 00:31:
>>>> On 1/3/20 4:44 AM, ethanwu wrote:
>>>>> Btrfs has two types of data backref.
>>>>> For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
>>>>> exact block number. Therefore, we need to call resolve_indirect_refs
>>>>> which uses btrfs_search_slot to locate the leaf block. After that,
>>>>> we need to walk through the leafs to search for the EXTENT_DATA items
>>>>> that have disk bytenr matching the extent item(add_all_parents).
>>>>>
>>>>> The only conditions we'll stop searching are
>>>>> 1. We find different object id or type is not EXTENT_DATA
>>>>> 2. We've already got all the refs we want(total_refs)
>>>>>
>>>>> Take the following EXTENT_ITEM as example:
>>>>> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
>>>>>      extent refs 24 gen 7302 flags DATA
>>>>>      extent data backref root 257 objectid 260 offset 65536 count 5 
>>>>> #backref entry 1
>>>>>      extent data backref root 258 objectid 265 offset 0 count 9 #backref 
>>>>> entry 2
>>>>>      shared data backref parent 394985472 count 10 #backref entry 3
>>>>>
>>>>> If we want to search for backref entry 1, total_refs here would be 24 rather
>>>>> than its count 5.
>>>>>
>>>>> The reason to use 24 is because some EXTENT_DATA in backref entry 3 block
>>>>> 394985472 also points to EXTENT_ITEM 40831553536, if this block also 
>>>>> belongs to
>>>>> root 257 and lies between these 5 items of backref entry 1,
>>>>> and we use total_refs = 5, we'll end up missing some refs from backref
>>>>> entry 1.
>>>>>
>>>>
>>>> This seems like the crux of the problem here.  The backref stuff is
>>>> just blindly looking for counts, without keeping track of which counts
>>>> matter.  So for full refs we should only be looking down paths where
>>>> generation > the snapshot generation.  And then for the shared refs it
>>>> should be anything that comes from that shared block.  That would be
>>>> the proper way to fix the problem, not put some arbitrary limit on how
>>>> far into the inode we can search.
>>>>
>>>
>>> I am not sure if generation could be used to skip blocks for full(indirect) 
>>> backref.
>>>
>>> For exmple:
>>> create a data extent in subvol id 257 at generation 10
>>> At generation 11, take snapshot(suppose the snapshot id is 258) from subvol 257.
>>>
>>> When we send snapshot 258, all the tree blocks it searches comes from subvol 
>>> 257,
>>> since snapshot only copy root node from its source,
>>> none of tree blocks in subvol 257 has generation(all <= 10) > snapshot 
>>> generation(11)
>>>
>>> Or do I miss something?
>>
>> Nope I was saying it wrong, sorry about that.  What I should say is
>> for "backref entry 1" we should _only_ walk down paths that belong to
>> root 257, and then for root 258 we _only_ walk down paths that belong
>> to 258, and then we do our normal dance for indirect refs.
>>
>>>
>>>> That's not to say what you are doing here is wrong, we really won't
>>>> have anything past the given extent size so we can definitely break
>>>> out earlier.  But what I worry about is say 394985472 _was_ in between
>>>> the leaves while searching down for backref entry #1, we'd end up with
>>>> duplicate entries and not catch some of the other entries.  This feels
>>>
>>> This patch doesn't adjust the total_refs. Is there any example that
>>> this patch will ruin the backref walking?
>>
>> No I'm talking about a general failure of the current code, your patch
>> doesn't make it better or worse.
>>
>>>
>>>> like we need to fix the backref logic to know if it's looking for
>>>> direct refs, and thus only go down paths with generation > snapshot
>>>> generation, or shared refs and thus only count things that directly
>>>> point to the parent block.  Thanks,
>>>>
>>>
>>> Ok, I agree, my patch doesn't solve the original problem:
>>> When resolving indirect refs, we could take entries that don't
>>> belong to the backref entry we are searching for right now.
>>>
>>> If this need to be fixed, I think it could be done by the following way
>>>
>>> item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
>>>          extent refs 24 gen 7302 flags DATA
>>>          shared data backref parent 394985472 count 10 #backref entry 1
>>>          extent data backref root 257 objectid 260 offset 1048576 count 3 
>>> #backref entry 2
>>>          extent data backref root 256 objectid 260 offset 65536 count 6 
>>> #backref entry 3
>>>          extent data backref root 257 objectid 260 offset 65536 count 5 
>>> #backref entry 4
>>>
>>> When searching for entry 4, the EXTENT_DATA entries that match the 
>>> EXTENT_ITEM bytenr
>>> will be in one of the following situations:
>>>
>>> 1. shared block that just happens to be part of root 257. For every leaf we 
>>> run into,
>>>     check its bytenr to see if it is a shared data backref entry, if so skip it.
>>>     We may need an extra list or rb tree to store this information.
>>
>> We don't need to worry about this case, because if we have a normal
>> ref then the whole path down to that bytenr belongs wholey to that
>> root.  The full backref will only be in paths that were not touched by
>> the referencing root.
>>
> 
> Thank you for the review,
> I don't fully understand the way shared data backref is used in btrfs.
> It took me a while to check the backref code and do some experiment.
> One way shared backref will be used is balance, all the items used
> by relocation tree use shared backref.
> 
> After running balance, if any EXTENT_ITEM is moved during balance,
> all the back reference of that newly-located EXTENT_ITEM will become
> shared, and the block owner is exactly the original root.
> 
> We could then produce normal reference by just COWIng the tree block,
> and leaving some of the shared backrefs unchanged,(dd an 128MB extent
> and cow every other 4K blocks, so these items span across many
> leafs and COWing one block leaves the other shared backrefs untouched)
> 
> In the end, we have two types of back reference from the same root,
> and yet the owner of all these blocks are the same root.
> 
> Therefore, I think this condition is still needed.

Yes, you can definitely have a shared ref pointing back to a root that you
have a real ref for, but my point is that you treat this separately.  If you
have a shared block, you _won't_ have a real ref for the items in that leaf
from the same root.  They should be exclusive of any real reference you have.
Thanks,

Josef

Patch

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index e5d8531..ae64995 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -412,7 +412,7 @@  static int add_indirect_ref(const struct btrfs_fs_info *fs_info,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 			   struct ulist *parents, struct prelim_ref *ref,
 			   int level, u64 time_seq, const u64 *extent_item_pos,
-			   u64 total_refs, bool ignore_offset)
+			   u64 total_refs, bool ignore_offset, u64 num_bytes)
 {
 	int ret = 0;
 	int slot;
@@ -458,6 +458,9 @@  static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 		fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
 		disk_byte = btrfs_file_extent_disk_bytenr(eb, fi);
 
+		if (key_for_search->type == BTRFS_EXTENT_DATA_KEY &&
+		    key.offset >= key_for_search->offset + num_bytes)
+		       break;
 		if (disk_byte == wanted_disk_byte) {
 			eie = NULL;
 			old = NULL;
@@ -504,7 +507,7 @@  static int resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 				struct btrfs_path *path, u64 time_seq,
 				struct prelim_ref *ref, struct ulist *parents,
 				const u64 *extent_item_pos, u64 total_refs,
-				bool ignore_offset)
+				bool ignore_offset, u64 num_bytes)
 {
 	struct btrfs_root *root;
 	struct btrfs_key root_key;
@@ -575,7 +578,8 @@  static int resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 	}
 
 	ret = add_all_parents(root, path, parents, ref, level, time_seq,
-			      extent_item_pos, total_refs, ignore_offset);
+			      extent_item_pos, total_refs, ignore_offset,
+			      num_bytes);
 out:
 	path->lowest_level = 0;
 	btrfs_release_path(path);
@@ -610,7 +614,8 @@  static int resolve_indirect_refs(struct btrfs_fs_info *fs_info,
 				 struct btrfs_path *path, u64 time_seq,
 				 struct preftrees *preftrees,
 				 const u64 *extent_item_pos, u64 total_refs,
-				 struct share_check *sc, bool ignore_offset)
+				 struct share_check *sc, bool ignore_offset,
+				 u64 num_bytes)
 {
 	int err;
 	int ret = 0;
@@ -655,7 +660,7 @@  static int resolve_indirect_refs(struct btrfs_fs_info *fs_info,
 		}
 		err = resolve_indirect_ref(fs_info, path, time_seq, ref,
 					   parents, extent_item_pos,
-					   total_refs, ignore_offset);
+					   total_refs, ignore_offset, num_bytes);
 		/*
 		 * we can only tolerate ENOENT,otherwise,we should catch error
 		 * and return directly.
@@ -1127,6 +1132,7 @@  static int find_parent_nodes(struct btrfs_trans_handle *trans,
 	struct extent_inode_elem *eie = NULL;
 	/* total of both direct AND indirect refs! */
 	u64 total_refs = 0;
+	u64 num_bytes = SZ_256M;
 	struct preftrees preftrees = {
 		.direct = PREFTREE_INIT,
 		.indirect = PREFTREE_INIT,
@@ -1194,6 +1200,7 @@  static int find_parent_nodes(struct btrfs_trans_handle *trans,
 				goto again;
 			}
 			spin_unlock(&delayed_refs->lock);
+			num_bytes = head->num_bytes;
 			ret = add_delayed_refs(fs_info, head, time_seq,
 					       &preftrees, &total_refs, sc);
 			mutex_unlock(&head->mutex);
@@ -1215,6 +1222,7 @@  static int find_parent_nodes(struct btrfs_trans_handle *trans,
 		if (key.objectid == bytenr &&
 		    (key.type == BTRFS_EXTENT_ITEM_KEY ||
 		     key.type == BTRFS_METADATA_ITEM_KEY)) {
+			num_bytes = key.offset;
 			ret = add_inline_refs(fs_info, path, bytenr,
 					      &info_level, &preftrees,
 					      &total_refs, sc);
@@ -1236,7 +1244,8 @@  static int find_parent_nodes(struct btrfs_trans_handle *trans,
 	WARN_ON(!RB_EMPTY_ROOT(&preftrees.indirect_missing_keys.root.rb_root));
 
 	ret = resolve_indirect_refs(fs_info, path, time_seq, &preftrees,
-				    extent_item_pos, total_refs, sc, ignore_offset);
+				    extent_item_pos, total_refs, sc, ignore_offset,
+				    num_bytes);
 	if (ret)
 		goto out;