diff mbox series

[2/2] btrfs: send: fix a bug that sending a link command on existing file path

Message ID 20220706130903.1661-3-bingjingc@synology.com (mailing list archive)
State New, archived
Headers show
Series btrfs: send: fix a bug that sending a link command on existing file path | expand

Commit Message

bingjingc July 6, 2022, 1:09 p.m. UTC
From: BingJing Chang <bingjingc@synology.com>

btrfs_ioctl_send processes recorded btrfs_keys in a defined order. (First,
we process a btrfs_key with a samller objectid. If btrfs_keys have the same
objectid, then we compare their types and offsets accordingly.) However,
reference paths for an inode can be stored in either BTRFS_INODE_REF_KEY
btrfs_keys or BTRFS_INODE_EXTREF_KEY btrfs_keys. And due to the limitation
of the helper function - iterate_inode_ref, we can only iterate the entries
of ONE btrfs_inode_ref or btrfs_inode_extref. That is, there must be a bug
in processing the same reference paths, which are stored in different ways.

Here is an exmple that btrfs_ioctl_send will send a link command on an
existing file path:

  $ btrfs subvolume create test

  # create a file and 2000 hard links to the same inode
  $ dd if=/dev/zero of=test/1M bs=1M count=1
  $ for i in {1..2000}; do link test/1M test/$i ; done

  # take a snapshot for parent snapshot
  $ btrfs sub snap -r test snap1

  # remove 2000 hard links and re-create the last 1000 links
  $ for i in {1..2000}; do rm test/$i; done;
  $ for i in {1001..2000}; do link test/1M test/$i; done

  # take another one for sned snapshot
  $ btrfs sub snap -r test snap2

  $ mkdir tmp
  $ btrfs send -e snap2 -p snap1 | btrfs receive tmp/
  At subvol snap2
  link 1238 -> 1M
  ERROR: link 1238 -> 1M failed: File exists

In the parent snapshot snap1, reference paths 1 - 1237 are stored in a
INODE_REF item and the rest ones are stored in other INODE_EXTREF items.
But in the send snapshot snap2, all reference paths can be stored within a
INODE_REF item. During the send process, btrfs_ioctl_send will process the
INODE_REF item first. Then it found that reference paths 1 - 1000 were
gone in the send snapshot, so it recorded them in sctx->deleted_refs for
unlink. And it found that reference paths 1238 - 2000 were new paths in
the send snapshot, so it recorded them in sctx->new_refs for link. Since
we do not load all contents of its INODE_EXTREF items to make comparison,
afterwards, btrfs_ioctl_send may make a mistake sending a link command on
an existing file path.

To fix the bug, we can either remove the duplicated items both in
sctx->new_refs and sctx->deleted_refs before generating link/unlink
commands or prevent them from being added into list. Both of them require
efficient data structures like C++ sets to look up duplicated items. And
we also need to take two scenarios into consideration. One is the most
common case that one inode has only one reference path. The other is the
worst case that there are ten thousands of hard links of an inode.
(BTRFS_LINK_MAX is 65536) So we'd like to introduce rbtree to store the
computing references. (The tree depth of the worst cases is just 16. And
it takes less memory to store few entries than hash sets.) And in order
not to occupy too much moemory, we also introduce
__record_new_ref_if_needed() and __record_deleted_ref_if_needed() for
changed_ref() to check and remove the duplications early.

Reviewed-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
---
 fs/btrfs/send.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 156 insertions(+), 4 deletions(-)

Comments

Filipe Manana July 6, 2022, 4:03 p.m. UTC | #1
On Wed, Jul 06, 2022 at 09:09:03PM +0800, bingjingc wrote:
> From: BingJing Chang <bingjingc@synology.com>
> 
> btrfs_ioctl_send processes recorded btrfs_keys in a defined order. (First,
> we process a btrfs_key with a samller objectid. If btrfs_keys have the same
> objectid, then we compare their types and offsets accordingly.) 

That's a very convoluted way to simply say that send processes keys in the order
they are found in the btrees, from the leftmost to the rightmost.

> However,
> reference paths for an inode can be stored in either BTRFS_INODE_REF_KEY
> btrfs_keys or BTRFS_INODE_EXTREF_KEY btrfs_keys. And due to the limitation
> of the helper function - iterate_inode_ref, we can only iterate the entries
> of ONE btrfs_inode_ref or btrfs_inode_extref. That is, there must be a bug
> in processing the same reference paths, which are stored in different ways.

When you say "there must be a bug", you want to say "there is a bug", and then
explain what the bug is.

However the bug has nothing to do with the order of how keys are processed.

What happens is that when we are processing an inode we go over all references
and add all the new names to the "new_refs" list and add all the deleted names
to the "deleted_refs" list.

Then in the end, when we finish processing the inode, we iterate over all the
new names in the "new_refs" list and send link commands for those names. After
that we go over all the names in the "deleted_refs" list and send unlink commands
for those names.

The problem in this case, is that we have names duplicated in both lists.
So we try to create them before we remove them, therefore the receiver gets an
-EEXIST error when trying the link operations.

> 
> Here is an exmple that btrfs_ioctl_send will send a link command on an

exmple -> example

> existing file path:
> 
>   $ btrfs subvolume create test
> 
>   # create a file and 2000 hard links to the same inode
>   $ dd if=/dev/zero of=test/1M bs=1M count=1

touch test/1M

The data is completely irrelevant here, all we care about are the
hard links of the file. Making the reproducer the minimum necessary
to trigger a bug, makes it much less distracting and easier to grasp.

>   $ for i in {1..2000}; do link test/1M test/$i ; done
> 
>   # take a snapshot for parent snapshot
>   $ btrfs sub snap -r test snap1
> 
>   # remove 2000 hard links and re-create the last 1000 links
>   $ for i in {1..2000}; do rm test/$i; done;
>   $ for i in {1001..2000}; do link test/1M test/$i; done
> 
>   # take another one for sned snapshot

sned -> send

>   $ btrfs sub snap -r test snap2
> 
>   $ mkdir tmp
>   $ btrfs send -e snap2 -p snap1 | btrfs receive tmp/

The -e is not necessary.

>   At subvol snap2
>   link 1238 -> 1M
>   ERROR: link 1238 -> 1M failed: File exists
> 
> In the parent snapshot snap1, reference paths 1 - 1237 are stored in a
> INODE_REF item and the rest ones are stored in other INODE_EXTREF items.
> But in the send snapshot snap2, all reference paths can be stored within a
> INODE_REF item. During the send process, btrfs_ioctl_send will process the
> INODE_REF item first. Then it found that reference paths 1 - 1000 were
> gone in the send snapshot, so it recorded them in sctx->deleted_refs for
> unlink. And it found that reference paths 1238 - 2000 were new paths in
> the send snapshot, so it recorded them in sctx->new_refs for link. Since
> we do not load all contents of its INODE_EXTREF items to make comparison,
> afterwards, btrfs_ioctl_send may make a mistake sending a link command on
> an existing file path.

So this explanation is not correct, it's neither about the names being stored
in an INODE_REF or INODE_EXTREF item nor about processing one item before or
after the other. As mentioned before, it's because we always process the
"new_refs" list before the "deleted_refs" list, and in this case we have the
same names (paths) in both lists.

> 
> To fix the bug, we can either remove the duplicated items both in
> sctx->new_refs and sctx->deleted_refs before generating link/unlink
> commands or prevent them from being added into list. Both of them require
> efficient data structures like C++ sets to look up duplicated items.

There's a much more obvious alternative, which is processing first the
"deleted_refs" list before the "new_refs" list.

It's inefficient because we do a bunch of unlinks followed by links for
the same paths. On the other hand, it does not require maintaining two
new rbtrees and allocating memory for records in these rbtrees.

Plus this type of scenario should be very rare. It requires having at least
hundreds of hard links in an inode in order to trigger the creation of extended
references, and then removing and recreating a bunch of those hard links in the
send snapshot. How common is that?

Is this a case you got into in some user workload, or was it found by reading
and inspecting the code?

I would like to have in the changelog this alternative mentioned, and if it's
not good, then explain why it is not, and why we must follow a different solution.

It's probably not easy, because at process_recorded_refs() when we unlink we
need to know if any ancestor was orphanized before, which we figure out when
we iterate over the "new_refs" list. So it would likely need some reshuffling
in the logic of how we do things there.

> And
> we also need to take two scenarios into consideration. One is the most
> common case that one inode has only one reference path. The other is the
> worst case that there are ten thousands of hard links of an inode.
> (BTRFS_LINK_MAX is 65536) So we'd like to introduce rbtree to store the
> computing references. (The tree depth of the worst cases is just 16. And

computing -> computed

> it takes less memory to store few entries than hash sets.) And in order
> not to occupy too much moemory, we also introduce

moemory -> memory

> __record_new_ref_if_needed() and __record_deleted_ref_if_needed() for
> changed_ref() to check and remove the duplications early.

Also, the subject:

"btrfs: send: fix a bug that sending a link command on existing file path"

sounds a bit awkward, mostly because of the "that".
Something like the following would be more concise:

"btrfs: send: fix sending link commands for existing file paths"

> 
> Reviewed-by: Robbie Ko <robbieko@synology.com>
> Signed-off-by: BingJing Chang <bingjingc@synology.com>
> ---
>  fs/btrfs/send.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 156 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 420a86720aa2..23ae631ef23b 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -234,6 +234,9 @@ struct send_ctx {
>  	 * Indexed by the inode number of the directory to be deleted.
>  	 */
>  	struct rb_root orphan_dirs;
> +
> +	struct rb_root rbtree_new_refs;
> +	struct rb_root rbtree_deleted_refs;
>  };
>  
>  struct pending_dir_move {
> @@ -2747,6 +2750,8 @@ struct recorded_ref {
>  	u64 dir;
>  	u64 dir_gen;
>  	int name_len;
> +	struct rb_node node;
> +	struct rb_root *root;
>  };
>  
>  static struct recorded_ref *recorded_ref_alloc(void)
> @@ -2756,6 +2761,7 @@ static struct recorded_ref *recorded_ref_alloc(void)
>  	ref = kzalloc(sizeof(*ref), GFP_KERNEL);
>  	if (!ref)
>  		return NULL;
> +	RB_CLEAR_NODE(&ref->node);
>  	INIT_LIST_HEAD(&ref->list);
>  	return ref;
>  }
> @@ -2764,6 +2770,8 @@ static void recorded_ref_free(struct recorded_ref *ref)
>  {
>  	if (!ref)
>  		return;
> +	if (!RB_EMPTY_NODE(&ref->node))
> +		rb_erase(&ref->node, ref->root);
>  	list_del(&ref->list);
>  	fs_path_free(ref->full_path);
>  	kfree(ref);
> @@ -4373,12 +4381,152 @@ static int __record_deleted_ref(int num, u64 dir, int index,
>  			  &sctx->deleted_refs);
>  }
>  
> +static int rbtree_ref_comp(const void *k, const struct rb_node *node)
> +{
> +	const struct recorded_ref *data = k;
> +	const struct recorded_ref *ref = rb_entry(node, struct recorded_ref, node);
> +	int result;
> +
> +	if (data->dir > ref->dir)
> +		return 1;
> +	if (data->dir < ref->dir)
> +		return -1;
> +	if (data->dir_gen > ref->dir_gen)
> +		return 1;
> +	if (data->dir_gen < ref->dir_gen)
> +		return -1;
> +	if (data->name_len > ref->name_len)
> +		return 1;
> +	if (data->name_len < ref->name_len)
> +		return -1;
> +	result = strcmp(data->name, ref->name);
> +	if (result > 0)
> +		return 1;
> +	if (result < 0)
> +		return -1;
> +	return 0;
> +}
> +
> +static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent)
> +{
> +	const struct recorded_ref *entry = rb_entry(node,
> +						    struct recorded_ref,
> +						    node);
> +
> +	return rbtree_ref_comp(entry, parent) < 0;
> +}
> +
> +static int record_ref2(struct rb_root *root, struct list_head *refs,

This is a terrible function name.

If we end up with this solution, please rename it to something more clear
like record_ref_in_tree() for example. Almost anything other than an existing
function name followed by a number is a better choice.

This bug is a very good finding, and has been around since forever.

Thanks.

> +			     struct fs_path *name, u64 dir, u64 dir_gen,
> +			     struct send_ctx *sctx)
> +{
> +	int ret = 0;
> +	struct fs_path *path = NULL;
> +	struct recorded_ref *ref = NULL;
> +
> +	path = fs_path_alloc();
> +	if (!path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ref = recorded_ref_alloc();
> +	if (!ref) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = get_cur_path(sctx, dir, dir_gen, path);
> +	if (ret < 0)
> +		goto out;
> +	ret = fs_path_add_path(path, name);
> +	if (ret < 0)
> +		goto out;
> +
> +	ref->dir = dir;
> +	ref->dir_gen = dir_gen;
> +	set_ref_path(ref, path);
> +	list_add_tail(&ref->list, refs);
> +	rb_add(&ref->node, root, rbtree_ref_less);
> +	ref->root = root;
> +out:
> +	if (ret) {
> +		if (path && (!ref || !ref->full_path))
> +			fs_path_free(path);
> +		recorded_ref_free(ref);
> +	}
> +	return ret;
> +}
> +
> +static int __record_new_ref_if_needed(int num, u64 dir, int index,
> +				      struct fs_path *name, void *ctx)
> +{
> +	int ret = 0;
> +	struct send_ctx *sctx = ctx;
> +	struct rb_node *node = NULL;
> +	struct recorded_ref data;
> +	struct recorded_ref *ref;
> +	u64 dir_gen;
> +
> +	ret = get_inode_info(sctx->send_root, dir, NULL, &dir_gen, NULL,
> +			     NULL, NULL, NULL);
> +	if (ret < 0)
> +		goto out;
> +
> +	data.dir = dir;
> +	data.dir_gen = dir_gen;
> +	set_ref_path(&data, name);
> +	node = rb_find(&data, &sctx->rbtree_deleted_refs, rbtree_ref_comp);
> +	if (node) {
> +		ref = rb_entry(node, struct recorded_ref, node);
> +		recorded_ref_free(ref);
> +	} else {
> +		ret = record_ref2(&sctx->rbtree_new_refs, &sctx->new_refs,
> +				  name, dir, dir_gen, sctx);
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int __record_deleted_ref_if_needed(int num, u64 dir, int index,
> +			    struct fs_path *name,
> +			    void *ctx)
> +{
> +	int ret = 0;
> +	struct send_ctx *sctx = ctx;
> +	struct rb_node *node = NULL;
> +	struct recorded_ref data;
> +	struct recorded_ref *ref;
> +	u64 dir_gen;
> +
> +	ret = get_inode_info(sctx->parent_root, dir, NULL, &dir_gen, NULL,
> +			     NULL, NULL, NULL);
> +	if (ret < 0)
> +		goto out;
> +
> +	data.dir = dir;
> +	data.dir_gen = dir_gen;
> +	set_ref_path(&data, name);
> +	node = rb_find(&data, &sctx->rbtree_new_refs, rbtree_ref_comp);
> +	if (node) {
> +		ref = rb_entry(node, struct recorded_ref, node);
> +		recorded_ref_free(ref);
> +	} else {
> +		ret = record_ref2(&sctx->rbtree_deleted_refs,
> +				  &sctx->deleted_refs, name, dir, dir_gen,
> +				  sctx);
> +	}
> +out:
> +	return ret;
> +}
> +
>  static int record_new_ref(struct send_ctx *sctx)
>  {
>  	int ret;
>  
>  	ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
> -				sctx->cmp_key, 0, __record_new_ref, sctx);
> +				sctx->cmp_key, 0, __record_new_ref_if_needed,
> +				sctx);
>  	if (ret < 0)
>  		goto out;
>  	ret = 0;
> @@ -4392,7 +4540,8 @@ static int record_deleted_ref(struct send_ctx *sctx)
>  	int ret;
>  
>  	ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
> -				sctx->cmp_key, 0, __record_deleted_ref, sctx);
> +				sctx->cmp_key, 0,
> +				__record_deleted_ref_if_needed, sctx);
>  	if (ret < 0)
>  		goto out;
>  	ret = 0;
> @@ -4475,7 +4624,7 @@ static int __record_changed_new_ref(int num, u64 dir, int index,
>  	ret = find_iref(sctx->parent_root, sctx->right_path,
>  			sctx->cmp_key, dir, dir_gen, name);
>  	if (ret == -ENOENT)
> -		ret = __record_new_ref(num, dir, index, name, sctx);
> +		ret = __record_new_ref_if_needed(num, dir, index, name, sctx);
>  	else if (ret > 0)
>  		ret = 0;
>  
> @@ -4498,7 +4647,8 @@ static int __record_changed_deleted_ref(int num, u64 dir, int index,
>  	ret = find_iref(sctx->send_root, sctx->left_path, sctx->cmp_key,
>  			dir, dir_gen, name);
>  	if (ret == -ENOENT)
> -		ret = __record_deleted_ref(num, dir, index, name, sctx);
> +		ret = __record_deleted_ref_if_needed(num, dir, index, name,
> +						     sctx);
>  	else if (ret > 0)
>  		ret = 0;
>  
> @@ -7576,6 +7726,8 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
>  	sctx->pending_dir_moves = RB_ROOT;
>  	sctx->waiting_dir_moves = RB_ROOT;
>  	sctx->orphan_dirs = RB_ROOT;
> +	sctx->rbtree_new_refs = RB_ROOT;
> +	sctx->rbtree_deleted_refs = RB_ROOT;
>  
>  	sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
>  				     arg->clone_sources_count + 1,
> -- 
> 2.37.0
>
bingjing chang July 10, 2022, 3:07 p.m. UTC | #2
Hi Filipe,

Thank you for the review and comments. It really helps.
I will submit a patch v2 to fix the problems.

Filipe Manana <fdmanana@kernel.org> 於 2022年7月7日 週四 凌晨12:04寫道:
>
> On Wed, Jul 06, 2022 at 09:09:03PM +0800, bingjingc wrote:
> > From: BingJing Chang <bingjingc@synology.com>
> >
> > btrfs_ioctl_send processes recorded btrfs_keys in a defined order. (First,
> > we process a btrfs_key with a samller objectid. If btrfs_keys have the same
> > objectid, then we compare their types and offsets accordingly.)
>
> That's a very convoluted way to simply say that send processes keys in the order
> they are found in the btrees, from the leftmost to the rightmost.
>
> > However,
> > reference paths for an inode can be stored in either BTRFS_INODE_REF_KEY
> > btrfs_keys or BTRFS_INODE_EXTREF_KEY btrfs_keys. And due to the limitation
> > of the helper function - iterate_inode_ref, we can only iterate the entries
> > of ONE btrfs_inode_ref or btrfs_inode_extref. That is, there must be a bug
> > in processing the same reference paths, which are stored in different ways.
>
> When you say "there must be a bug", you want to say "there is a bug", and then
> explain what the bug is.
>
> However the bug has nothing to do with the order of how keys are processed.
>
> What happens is that when we are processing an inode we go over all references
> and add all the new names to the "new_refs" list and add all the deleted names
> to the "deleted_refs" list.
>
> Then in the end, when we finish processing the inode, we iterate over all the
> new names in the "new_refs" list and send link commands for those names. After
> that we go over all the names in the "deleted_refs" list and send unlink commands
> for those names.
>
> The problem in this case, is that we have names duplicated in both lists.
> So we try to create them before we remove them, therefore the receiver gets an
> -EEXIST error when trying the link operations.
>

Yes, the problem happens when we have names duplicated in both lists.
And with the current logics in process_recorded_refs(), we will issue
links on existing files.

I will try to make the description clear in patch v2.

> >
> > Here is an exmple that btrfs_ioctl_send will send a link command on an
>
> exmple -> example
>
> > existing file path:
> >
> >   $ btrfs subvolume create test
> >
> >   # create a file and 2000 hard links to the same inode
> >   $ dd if=/dev/zero of=test/1M bs=1M count=1
>
> touch test/1M
>
> The data is completely irrelevant here, all we care about are the
> hard links of the file. Making the reproducer the minimum necessary
> to trigger a bug, makes it much less distracting and easier to grasp.
>
> >   $ for i in {1..2000}; do link test/1M test/$i ; done
> >
> >   # take a snapshot for parent snapshot
> >   $ btrfs sub snap -r test snap1
> >
> >   # remove 2000 hard links and re-create the last 1000 links
> >   $ for i in {1..2000}; do rm test/$i; done;
> >   $ for i in {1001..2000}; do link test/1M test/$i; done
> >
> >   # take another one for sned snapshot
>
> sned -> send
>
> >   $ btrfs sub snap -r test snap2
> >
> >   $ mkdir tmp
> >   $ btrfs send -e snap2 -p snap1 | btrfs receive tmp/
>
> The -e is not necessary.
>
> >   At subvol snap2
> >   link 1238 -> 1M
> >   ERROR: link 1238 -> 1M failed: File exists
> >
> > In the parent snapshot snap1, reference paths 1 - 1237 are stored in a
> > INODE_REF item and the rest ones are stored in other INODE_EXTREF items.
> > But in the send snapshot snap2, all reference paths can be stored within a
> > INODE_REF item. During the send process, btrfs_ioctl_send will process the
> > INODE_REF item first. Then it found that reference paths 1 - 1000 were
> > gone in the send snapshot, so it recorded them in sctx->deleted_refs for
> > unlink. And it found that reference paths 1238 - 2000 were new paths in
> > the send snapshot, so it recorded them in sctx->new_refs for link. Since
> > we do not load all contents of its INODE_EXTREF items to make comparison,
> > afterwards, btrfs_ioctl_send may make a mistake sending a link command on
> > an existing file path.
>
> So this explanation is not correct, it's neither about the names being stored
> in an INODE_REF or INODE_EXTREF item nor about processing one item before or
> after the other. As mentioned before, it's because we always process the
> "new_refs" list before the "deleted_refs" list, and in this case we have the
> same names (paths) in both lists.
>

Sorry, I didn't make this paragraph easy to understand.
Here, I want to address that there is already a function find_iref(), which
can be used to check duplications. But it has limitations.

For this example, we delete 2000 files and recreate 1000 files. Not all of
the 2000 file paths are added to the "deleted_refs" list. And not all of
the 1000 re-created file paths are added to the "new_refs" list. With
find_iref(), when processing inode references, {1001..1237} are not added
to any lists because they appear in both inode references. Afterwards,
two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000} + {1238..2000}
Therefore, there are duplicated items {1238..2000} in both lists. It results
in link failure.

> >
> > To fix the bug, we can either remove the duplicated items both in
> > sctx->new_refs and sctx->deleted_refs before generating link/unlink
> > commands or prevent them from being added into list. Both of them require
> > efficient data structures like C++ sets to look up duplicated items.
>
> There's a much more obvious alternative, which is processing first the
> "deleted_refs" list before the "new_refs" list.
>
> It's inefficient because we do a bunch of unlinks followed by links for
> the same paths. On the other hand, it does not require maintaining two
> new rbtrees and allocating memory for records in these rbtrees.
>

Thank you for mentioning this. I will describe the intuition idea and the
difficulties. I was unable to reshuffle the processing order. If someone can
do this, we are glad to pick that solution.

> Plus this type of scenario should be very rare. It requires having at least
> hundreds of hard links in an inode in order to trigger the creation of extended
> references, and then removing and recreating a bunch of those hard links in the
> send snapshot. How common is that?
>
> Is this a case you got into in some user workload, or was it found by reading
> and inspecting the code?
>

It's not common. But it happened in the real world. Some version backup
applications seem to use large hard links for some purposes.

> I would like to have in the changelog this alternative mentioned, and if it's
> not good, then explain why it is not, and why we must follow a different solution.
>

> It's probably not easy, because at process_recorded_refs() when we unlink we
> need to know if any ancestor was orphanized before, which we figure out when
> we iterate over the "new_refs" list. So it would likely need some reshuffling
> in the logic of how we do things there.
>
> > And
> > we also need to take two scenarios into consideration. One is the most
> > common case that one inode has only one reference path. The other is the
> > worst case that there are ten thousands of hard links of an inode.
> > (BTRFS_LINK_MAX is 65536) So we'd like to introduce rbtree to store the
> > computing references. (The tree depth of the worst cases is just 16. And
>
> computing -> computed
>
> > it takes less memory to store few entries than hash sets.) And in order
> > not to occupy too much moemory, we also introduce
>
> moemory -> memory
>
> > __record_new_ref_if_needed() and __record_deleted_ref_if_needed() for
> > changed_ref() to check and remove the duplications early.
>
> Also, the subject:
>
> "btrfs: send: fix a bug that sending a link command on existing file path"
>
> sounds a bit awkward, mostly because of the "that".
> Something like the following would be more concise:
>
> "btrfs: send: fix sending link commands for existing file paths"
>
> >
> > Reviewed-by: Robbie Ko <robbieko@synology.com>
> > Signed-off-by: BingJing Chang <bingjingc@synology.com>
> > ---
> >  fs/btrfs/send.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 156 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > index 420a86720aa2..23ae631ef23b 100644
> > --- a/fs/btrfs/send.c
> > +++ b/fs/btrfs/send.c
> > @@ -234,6 +234,9 @@ struct send_ctx {
> >        * Indexed by the inode number of the directory to be deleted.
> >        */
> >       struct rb_root orphan_dirs;
> > +
> > +     struct rb_root rbtree_new_refs;
> > +     struct rb_root rbtree_deleted_refs;
> >  };
> >
> >  struct pending_dir_move {
> > @@ -2747,6 +2750,8 @@ struct recorded_ref {
> >       u64 dir;
> >       u64 dir_gen;
> >       int name_len;
> > +     struct rb_node node;
> > +     struct rb_root *root;
> >  };
> >
> >  static struct recorded_ref *recorded_ref_alloc(void)
> > @@ -2756,6 +2761,7 @@ static struct recorded_ref *recorded_ref_alloc(void)
> >       ref = kzalloc(sizeof(*ref), GFP_KERNEL);
> >       if (!ref)
> >               return NULL;
> > +     RB_CLEAR_NODE(&ref->node);
> >       INIT_LIST_HEAD(&ref->list);
> >       return ref;
> >  }
> > @@ -2764,6 +2770,8 @@ static void recorded_ref_free(struct recorded_ref *ref)
> >  {
> >       if (!ref)
> >               return;
> > +     if (!RB_EMPTY_NODE(&ref->node))
> > +             rb_erase(&ref->node, ref->root);
> >       list_del(&ref->list);
> >       fs_path_free(ref->full_path);
> >       kfree(ref);
> > @@ -4373,12 +4381,152 @@ static int __record_deleted_ref(int num, u64 dir, int index,
> >                         &sctx->deleted_refs);
> >  }
> >
> > +static int rbtree_ref_comp(const void *k, const struct rb_node *node)
> > +{
> > +     const struct recorded_ref *data = k;
> > +     const struct recorded_ref *ref = rb_entry(node, struct recorded_ref, node);
> > +     int result;
> > +
> > +     if (data->dir > ref->dir)
> > +             return 1;
> > +     if (data->dir < ref->dir)
> > +             return -1;
> > +     if (data->dir_gen > ref->dir_gen)
> > +             return 1;
> > +     if (data->dir_gen < ref->dir_gen)
> > +             return -1;
> > +     if (data->name_len > ref->name_len)
> > +             return 1;
> > +     if (data->name_len < ref->name_len)
> > +             return -1;
> > +     result = strcmp(data->name, ref->name);
> > +     if (result > 0)
> > +             return 1;
> > +     if (result < 0)
> > +             return -1;
> > +     return 0;
> > +}
> > +
> > +static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent)
> > +{
> > +     const struct recorded_ref *entry = rb_entry(node,
> > +                                                 struct recorded_ref,
> > +                                                 node);
> > +
> > +     return rbtree_ref_comp(entry, parent) < 0;
> > +}
> > +
> > +static int record_ref2(struct rb_root *root, struct list_head *refs,
>
> This is a terrible function name.
>
> If we end up with this solution, please rename it to something more clear
> like record_ref_in_tree() for example. Almost anything other than an existing
> function name followed by a number is a better choice.
>

Thank you for the naming.

> This bug is a very good finding, and has been around since forever.
>
> Thanks.
>
> > +                          struct fs_path *name, u64 dir, u64 dir_gen,
> > +                          struct send_ctx *sctx)
> > +{
> > +     int ret = 0;
> > +     struct fs_path *path = NULL;
> > +     struct recorded_ref *ref = NULL;
> > +
> > +     path = fs_path_alloc();
> > +     if (!path) {
> > +             ret = -ENOMEM;
> > +             goto out;
> > +     }
> > +
> > +     ref = recorded_ref_alloc();
> > +     if (!ref) {
> > +             ret = -ENOMEM;
> > +             goto out;
> > +     }
> > +
> > +     ret = get_cur_path(sctx, dir, dir_gen, path);
> > +     if (ret < 0)
> > +             goto out;
> > +     ret = fs_path_add_path(path, name);
> > +     if (ret < 0)
> > +             goto out;
> > +
> > +     ref->dir = dir;
> > +     ref->dir_gen = dir_gen;
> > +     set_ref_path(ref, path);
> > +     list_add_tail(&ref->list, refs);
> > +     rb_add(&ref->node, root, rbtree_ref_less);
> > +     ref->root = root;
> > +out:
> > +     if (ret) {
> > +             if (path && (!ref || !ref->full_path))
> > +                     fs_path_free(path);
> > +             recorded_ref_free(ref);
> > +     }
> > +     return ret;
> > +}
> > +
> > +static int __record_new_ref_if_needed(int num, u64 dir, int index,
> > +                                   struct fs_path *name, void *ctx)
> > +{
> > +     int ret = 0;
> > +     struct send_ctx *sctx = ctx;
> > +     struct rb_node *node = NULL;
> > +     struct recorded_ref data;
> > +     struct recorded_ref *ref;
> > +     u64 dir_gen;
> > +
> > +     ret = get_inode_info(sctx->send_root, dir, NULL, &dir_gen, NULL,
> > +                          NULL, NULL, NULL);
> > +     if (ret < 0)
> > +             goto out;
> > +
> > +     data.dir = dir;
> > +     data.dir_gen = dir_gen;
> > +     set_ref_path(&data, name);
> > +     node = rb_find(&data, &sctx->rbtree_deleted_refs, rbtree_ref_comp);
> > +     if (node) {
> > +             ref = rb_entry(node, struct recorded_ref, node);
> > +             recorded_ref_free(ref);
> > +     } else {
> > +             ret = record_ref2(&sctx->rbtree_new_refs, &sctx->new_refs,
> > +                               name, dir, dir_gen, sctx);
> > +     }
> > +out:
> > +     return ret;
> > +}
> > +
> > +static int __record_deleted_ref_if_needed(int num, u64 dir, int index,
> > +                         struct fs_path *name,
> > +                         void *ctx)
> > +{
> > +     int ret = 0;
> > +     struct send_ctx *sctx = ctx;
> > +     struct rb_node *node = NULL;
> > +     struct recorded_ref data;
> > +     struct recorded_ref *ref;
> > +     u64 dir_gen;
> > +
> > +     ret = get_inode_info(sctx->parent_root, dir, NULL, &dir_gen, NULL,
> > +                          NULL, NULL, NULL);
> > +     if (ret < 0)
> > +             goto out;
> > +
> > +     data.dir = dir;
> > +     data.dir_gen = dir_gen;
> > +     set_ref_path(&data, name);
> > +     node = rb_find(&data, &sctx->rbtree_new_refs, rbtree_ref_comp);
> > +     if (node) {
> > +             ref = rb_entry(node, struct recorded_ref, node);
> > +             recorded_ref_free(ref);
> > +     } else {
> > +             ret = record_ref2(&sctx->rbtree_deleted_refs,
> > +                               &sctx->deleted_refs, name, dir, dir_gen,
> > +                               sctx);
> > +     }
> > +out:
> > +     return ret;
> > +}
> > +
> >  static int record_new_ref(struct send_ctx *sctx)
> >  {
> >       int ret;
> >
> >       ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
> > -                             sctx->cmp_key, 0, __record_new_ref, sctx);
> > +                             sctx->cmp_key, 0, __record_new_ref_if_needed,
> > +                             sctx);
> >       if (ret < 0)
> >               goto out;
> >       ret = 0;
> > @@ -4392,7 +4540,8 @@ static int record_deleted_ref(struct send_ctx *sctx)
> >       int ret;
> >
> >       ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
> > -                             sctx->cmp_key, 0, __record_deleted_ref, sctx);
> > +                             sctx->cmp_key, 0,
> > +                             __record_deleted_ref_if_needed, sctx);
> >       if (ret < 0)
> >               goto out;
> >       ret = 0;
> > @@ -4475,7 +4624,7 @@ static int __record_changed_new_ref(int num, u64 dir, int index,
> >       ret = find_iref(sctx->parent_root, sctx->right_path,
> >                       sctx->cmp_key, dir, dir_gen, name);
> >       if (ret == -ENOENT)
> > -             ret = __record_new_ref(num, dir, index, name, sctx);
> > +             ret = __record_new_ref_if_needed(num, dir, index, name, sctx);
> >       else if (ret > 0)
> >               ret = 0;
> >
> > @@ -4498,7 +4647,8 @@ static int __record_changed_deleted_ref(int num, u64 dir, int index,
> >       ret = find_iref(sctx->send_root, sctx->left_path, sctx->cmp_key,
> >                       dir, dir_gen, name);
> >       if (ret == -ENOENT)
> > -             ret = __record_deleted_ref(num, dir, index, name, sctx);
> > +             ret = __record_deleted_ref_if_needed(num, dir, index, name,
> > +                                                  sctx);
> >       else if (ret > 0)
> >               ret = 0;
> >
> > @@ -7576,6 +7726,8 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
> >       sctx->pending_dir_moves = RB_ROOT;
> >       sctx->waiting_dir_moves = RB_ROOT;
> >       sctx->orphan_dirs = RB_ROOT;
> > +     sctx->rbtree_new_refs = RB_ROOT;
> > +     sctx->rbtree_deleted_refs = RB_ROOT;
> >
> >       sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
> >                                    arg->clone_sources_count + 1,
> > --
> > 2.37.0
> >
Filipe Manana July 11, 2022, 2:10 p.m. UTC | #3
On Sun, Jul 10, 2022 at 11:07:10PM +0800, bingjing chang wrote:
> Hi Filipe,
> 
> Thank you for the review and comments. It really helps.
> I will submit a patch v2 to fix the problems.
> 
> Filipe Manana <fdmanana@kernel.org> 於 2022年7月7日 週四 凌晨12:04寫道:
> >
> > On Wed, Jul 06, 2022 at 09:09:03PM +0800, bingjingc wrote:
> > > From: BingJing Chang <bingjingc@synology.com>
> > >
> > > btrfs_ioctl_send processes recorded btrfs_keys in a defined order. (First,
> > > we process a btrfs_key with a samller objectid. If btrfs_keys have the same
> > > objectid, then we compare their types and offsets accordingly.)
> >
> > That's a very convoluted way to simply say that send processes keys in the order
> > they are found in the btrees, from the leftmost to the rightmost.
> >
> > > However,
> > > reference paths for an inode can be stored in either BTRFS_INODE_REF_KEY
> > > btrfs_keys or BTRFS_INODE_EXTREF_KEY btrfs_keys. And due to the limitation
> > > of the helper function - iterate_inode_ref, we can only iterate the entries
> > > of ONE btrfs_inode_ref or btrfs_inode_extref. That is, there must be a bug
> > > in processing the same reference paths, which are stored in different ways.
> >
> > When you say "there must be a bug", you want to say "there is a bug", and then
> > explain what the bug is.
> >
> > However the bug has nothing to do with the order of how keys are processed.
> >
> > What happens is that when we are processing an inode we go over all references
> > and add all the new names to the "new_refs" list and add all the deleted names
> > to the "deleted_refs" list.
> >
> > Then in the end, when we finish processing the inode, we iterate over all the
> > new names in the "new_refs" list and send link commands for those names. After
> > that we go over all the names in the "deleted_refs" list and send unlink commands
> > for those names.
> >
> > The problem in this case, is that we have names duplicated in both lists.
> > So we try to create them before we remove them, therefore the receiver gets an
> > -EEXIST error when trying the link operations.
> >
> 
> Yes, the problem happens when we have names duplicated in both lists.
> And with the current logics in process_recorded_refs(), we will issue
> links on existing files.
> 
> I will try to make the description clear in patch v2.
> 
> > >
> > > Here is an exmple that btrfs_ioctl_send will send a link command on an
> >
> > exmple -> example
> >
> > > existing file path:
> > >
> > >   $ btrfs subvolume create test
> > >
> > >   # create a file and 2000 hard links to the same inode
> > >   $ dd if=/dev/zero of=test/1M bs=1M count=1
> >
> > touch test/1M
> >
> > The data is completely irrelevant here, all we care about are the
> > hard links of the file. Making the reproducer the minimum necessary
> > to trigger a bug, makes it much less distracting and easier to grasp.
> >
> > >   $ for i in {1..2000}; do link test/1M test/$i ; done
> > >
> > >   # take a snapshot for parent snapshot
> > >   $ btrfs sub snap -r test snap1
> > >
> > >   # remove 2000 hard links and re-create the last 1000 links
> > >   $ for i in {1..2000}; do rm test/$i; done;
> > >   $ for i in {1001..2000}; do link test/1M test/$i; done
> > >
> > >   # take another one for sned snapshot
> >
> > sned -> send
> >
> > >   $ btrfs sub snap -r test snap2
> > >
> > >   $ mkdir tmp
> > >   $ btrfs send -e snap2 -p snap1 | btrfs receive tmp/
> >
> > The -e is not necessary.
> >
> > >   At subvol snap2
> > >   link 1238 -> 1M
> > >   ERROR: link 1238 -> 1M failed: File exists
> > >
> > > In the parent snapshot snap1, reference paths 1 - 1237 are stored in a
> > > INODE_REF item and the rest ones are stored in other INODE_EXTREF items.
> > > But in the send snapshot snap2, all reference paths can be stored within a
> > > INODE_REF item. During the send process, btrfs_ioctl_send will process the
> > > INODE_REF item first. Then it found that reference paths 1 - 1000 were
> > > gone in the send snapshot, so it recorded them in sctx->deleted_refs for
> > > unlink. And it found that reference paths 1238 - 2000 were new paths in
> > > the send snapshot, so it recorded them in sctx->new_refs for link. Since
> > > we do not load all contents of its INODE_EXTREF items to make comparison,
> > > afterwards, btrfs_ioctl_send may make a mistake sending a link command on
> > > an existing file path.
> >
> > So this explanation is not correct, it's neither about the names being stored
> > in an INODE_REF or INODE_EXTREF item nor about processing one item before or
> > after the other. As mentioned before, it's because we always process the
> > "new_refs" list before the "deleted_refs" list, and in this case we have the
> > same names (paths) in both lists.
> >
> 
> Sorry, I didn't make this paragraph easy to understand.
> Here, I want to address that there is already a function find_iref(), which
> can be used to check duplications. But it has limitations.

Yes, I got it. Its problem is that it works only for the names inside a single
ref (or extref) item.

Your solution basically is a version of find_iref() that works across all the
ref/extref items of an inode.

> 
> For this example, we delete 2000 files and recreate 1000 files. Not all of
> the 2000 file paths are added to the "deleted_refs" list. And not all of
> the 1000 re-created file paths are added to the "new_refs" list. With
> find_iref(), when processing inode references, {1001..1237} are not added
> to any lists because they appear in both inode references. Afterwards,
> two lists contain items as below:
> "new_refs" list: {1238..2000}
> "deleted_refs" list: {1..1000} + {1238..2000}
> Therefore, there are duplicated items {1238..2000} in both lists. It results
> in link failure.

Yep. Many names got moved from an extref item into a regular ref item, which
is what leads to the duplicate names in both lists.

> 
> > >
> > > To fix the bug, we can either remove the duplicated items both in
> > > sctx->new_refs and sctx->deleted_refs before generating link/unlink
> > > commands or prevent them from being added into list. Both of them require
> > > efficient data structures like C++ sets to look up duplicated items.
> >
> > There's a much more obvious alternative, which is processing first the
> > "deleted_refs" list before the "new_refs" list.
> >
> > It's inefficient because we do a bunch of unlinks followed by links for
> > the same paths. On the other hand, it does not require maintaining two
> > new rbtrees and allocating memory for records in these rbtrees.
> >
> 
> Thank you for mentioning this. I will describe the intuition idea and the
> difficulties. I was unable to reshuffle the processing order. If someone can
> do this, we are glad to pick that solution.

I just tried that today, but it's harder than I though, I couldn't find a way
for it to work without breaking other scenarios (mostly to get orphanization
right and the need to recompute paths after orphanization).

So I think we can go with your solution.
I would just like to see two changes:

1) An updated changelog as mentioned before;

2) Change the code to only use the new infrastructure to manage refs.
   The code you added is a parallel infrastructure that is more robust and
   deals with that case of names being moved from an extref item to a ref
   item. Having only one infrastructure makes things much easier to maintain
   and a lot less code, and it also allows to rename record_ref2() to simply
   record_ref(), as the old record_ref() is now gone.
   I made those changes in the following patch, which you can apply on top
   of your patch, or if you prefer I can send it separately.

What do you think?

The cleanup patch is the following, and it passes fstests and some light
stress testing:

From: Filipe Manana <fdmanana@suse.com>
Date: Mon, 11 Jul 2022 14:55:26 +0100
Subject: [PATCH] btrfs: send: always use the rbtree based inode ref management
 infrastructure

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/send.c | 203 +++++-------------------------------------------
 1 file changed, 20 insertions(+), 183 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 37e3ba5c8586..797e2fb3b26b 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -2189,7 +2189,7 @@ static int __get_cur_name_and_parent(struct send_ctx *sctx,
 	/*
 	 * If the inode is not existent yet, add the orphan name and return 1.
 	 * This should only happen for the parent dir that we determine in
-	 * __record_new_ref
+	 * __record_new_ref_if_needed().
 	 */
 	ret = is_inode_existent(sctx, ino, gen);
 	if (ret < 0)
@@ -2790,27 +2790,6 @@ static void set_ref_path(struct recorded_ref *ref, struct fs_path *path)
 	ref->name_len = ref->full_path->end - ref->name;
 }
 
-/*
- * We need to process new refs before deleted refs, but compare_tree gives us
- * everything mixed. So we first record all refs and later process them.
- * This function is a helper to record one ref.
- */
-static int __record_ref(struct list_head *head, u64 dir,
-		      u64 dir_gen, struct fs_path *path)
-{
-	struct recorded_ref *ref;
-
-	ref = recorded_ref_alloc();
-	if (!ref)
-		return -ENOMEM;
-
-	ref->dir = dir;
-	ref->dir_gen = dir_gen;
-	set_ref_path(ref, path);
-	list_add_tail(&ref->list, head);
-	return 0;
-}
-
 static int dup_ref(struct recorded_ref *ref, struct list_head *list)
 {
 	struct recorded_ref *new;
@@ -4337,56 +4316,6 @@ static int process_recorded_refs(struct send_ctx *sctx, int *pending_move)
 	return ret;
 }
 
-static int record_ref(struct btrfs_root *root, u64 dir, struct fs_path *name,
-		      void *ctx, struct list_head *refs)
-{
-	int ret = 0;
-	struct send_ctx *sctx = ctx;
-	struct fs_path *p;
-	u64 gen;
-
-	p = fs_path_alloc();
-	if (!p)
-		return -ENOMEM;
-
-	ret = get_inode_info(root, dir, NULL, &gen, NULL, NULL,
-			NULL, NULL);
-	if (ret < 0)
-		goto out;
-
-	ret = get_cur_path(sctx, dir, gen, p);
-	if (ret < 0)
-		goto out;
-	ret = fs_path_add_path(p, name);
-	if (ret < 0)
-		goto out;
-
-	ret = __record_ref(refs, dir, gen, p);
-
-out:
-	if (ret)
-		fs_path_free(p);
-	return ret;
-}
-
-static int __record_new_ref(int num, u64 dir, int index,
-			    struct fs_path *name,
-			    void *ctx)
-{
-	struct send_ctx *sctx = ctx;
-	return record_ref(sctx->send_root, dir, name, ctx, &sctx->new_refs);
-}
-
-
-static int __record_deleted_ref(int num, u64 dir, int index,
-				struct fs_path *name,
-				void *ctx)
-{
-	struct send_ctx *sctx = ctx;
-	return record_ref(sctx->parent_root, dir, name, ctx,
-			  &sctx->deleted_refs);
-}
-
 static int rbtree_ref_comp(const void *k, const struct rb_node *node)
 {
 	const struct recorded_ref *data = k;
@@ -4422,9 +4351,9 @@ static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent)
 	return rbtree_ref_comp(entry, parent) < 0;
 }
 
-static int record_ref2(struct rb_root *root, struct list_head *refs,
-			     struct fs_path *name, u64 dir, u64 dir_gen,
-			     struct send_ctx *sctx)
+static int record_ref(struct rb_root *root, struct list_head *refs,
+		      struct fs_path *name, u64 dir, u64 dir_gen,
+		      struct send_ctx *sctx)
 {
 	int ret = 0;
 	struct fs_path *path = NULL;
@@ -4487,8 +4416,8 @@ static int __record_new_ref_if_needed(int num, u64 dir, int index,
 		ref = rb_entry(node, struct recorded_ref, node);
 		recorded_ref_free(ref);
 	} else {
-		ret = record_ref2(&sctx->rbtree_new_refs, &sctx->new_refs,
-				  name, dir, dir_gen, sctx);
+		ret = record_ref(&sctx->rbtree_new_refs, &sctx->new_refs,
+				 name, dir, dir_gen, sctx);
 	}
 out:
 	return ret;
@@ -4518,9 +4447,9 @@ static int __record_deleted_ref_if_needed(int num, u64 dir, int index,
 		ref = rb_entry(node, struct recorded_ref, node);
 		recorded_ref_free(ref);
 	} else {
-		ret = record_ref2(&sctx->rbtree_deleted_refs,
-				  &sctx->deleted_refs, name, dir, dir_gen,
-				  sctx);
+		ret = record_ref(&sctx->rbtree_deleted_refs,
+				 &sctx->deleted_refs, name, dir, dir_gen,
+				 sctx);
 	}
 out:
 	return ret;
@@ -4564,113 +4493,16 @@ struct find_ref_ctx {
 	int found_idx;
 };
 
-static int __find_iref(int num, u64 dir, int index,
-		       struct fs_path *name,
-		       void *ctx_)
-{
-	struct find_ref_ctx *ctx = ctx_;
-	u64 dir_gen;
-	int ret;
-
-	if (dir == ctx->dir && fs_path_len(name) == fs_path_len(ctx->name) &&
-	    strncmp(name->start, ctx->name->start, fs_path_len(name)) == 0) {
-		/*
-		 * To avoid doing extra lookups we'll only do this if everything
-		 * else matches.
-		 */
-		ret = get_inode_info(ctx->root, dir, NULL, &dir_gen, NULL,
-				     NULL, NULL, NULL);
-		if (ret)
-			return ret;
-		if (dir_gen != ctx->dir_gen)
-			return 0;
-		ctx->found_idx = num;
-		return 1;
-	}
-	return 0;
-}
-
-static int find_iref(struct btrfs_root *root,
-		     struct btrfs_path *path,
-		     struct btrfs_key *key,
-		     u64 dir, u64 dir_gen, struct fs_path *name)
-{
-	int ret;
-	struct find_ref_ctx ctx;
-
-	ctx.dir = dir;
-	ctx.name = name;
-	ctx.dir_gen = dir_gen;
-	ctx.found_idx = -1;
-	ctx.root = root;
-
-	ret = iterate_inode_ref(root, path, key, 0, __find_iref, &ctx);
-	if (ret < 0)
-		return ret;
-
-	if (ctx.found_idx == -1)
-		return -ENOENT;
-
-	return ctx.found_idx;
-}
-
-static int __record_changed_new_ref(int num, u64 dir, int index,
-				    struct fs_path *name,
-				    void *ctx)
-{
-	u64 dir_gen;
-	int ret;
-	struct send_ctx *sctx = ctx;
-
-	ret = get_inode_info(sctx->send_root, dir, NULL, &dir_gen, NULL,
-			     NULL, NULL, NULL);
-	if (ret)
-		return ret;
-
-	ret = find_iref(sctx->parent_root, sctx->right_path,
-			sctx->cmp_key, dir, dir_gen, name);
-	if (ret == -ENOENT)
-		ret = __record_new_ref_if_needed(num, dir, index, name, sctx);
-	else if (ret > 0)
-		ret = 0;
-
-	return ret;
-}
-
-static int __record_changed_deleted_ref(int num, u64 dir, int index,
-					struct fs_path *name,
-					void *ctx)
-{
-	u64 dir_gen;
-	int ret;
-	struct send_ctx *sctx = ctx;
-
-	ret = get_inode_info(sctx->parent_root, dir, NULL, &dir_gen, NULL,
-			     NULL, NULL, NULL);
-	if (ret)
-		return ret;
-
-	ret = find_iref(sctx->send_root, sctx->left_path, sctx->cmp_key,
-			dir, dir_gen, name);
-	if (ret == -ENOENT)
-		ret = __record_deleted_ref_if_needed(num, dir, index, name,
-						     sctx);
-	else if (ret > 0)
-		ret = 0;
-
-	return ret;
-}
-
 static int record_changed_ref(struct send_ctx *sctx)
 {
 	int ret = 0;
 
 	ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
-			sctx->cmp_key, 0, __record_changed_new_ref, sctx);
+			sctx->cmp_key, 0, __record_new_ref_if_needed, sctx);
 	if (ret < 0)
 		goto out;
 	ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
-			sctx->cmp_key, 0, __record_changed_deleted_ref, sctx);
+			sctx->cmp_key, 0, __record_deleted_ref_if_needed, sctx);
 	if (ret < 0)
 		goto out;
 	ret = 0;
@@ -4701,10 +4533,10 @@ static int process_all_refs(struct send_ctx *sctx,
 
 	if (cmd == BTRFS_COMPARE_TREE_NEW) {
 		root = sctx->send_root;
-		cb = __record_new_ref;
+		cb = __record_new_ref_if_needed;
 	} else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
 		root = sctx->parent_root;
-		cb = __record_deleted_ref;
+		cb = __record_deleted_ref_if_needed;
 	} else {
 		btrfs_err(sctx->send_root->fs_info,
 				"Wrong command %d in process_all_refs", cmd);
@@ -6544,8 +6376,13 @@ static int record_parent_ref(int num, u64 dir, int index, struct fs_path *name,
 {
 	struct parent_paths_ctx *ppctx = ctx;
 
-	return record_ref(ppctx->sctx->parent_root, dir, name, ppctx->sctx,
-			  ppctx->refs);
+	/*
+	 * Pass 0 as the generation for the directory, we don't care about it
+	 * here as we have no new references to add, we just want to delete all
+	 * references for an inode.
+	 */
+	return record_ref(&ppctx->sctx->rbtree_deleted_refs, ppctx->refs, name,
+			  dir, 0, ppctx->sctx);
 }
 
 /*
bingjing chang July 12, 2022, 1:37 a.m. UTC | #4
Filipe Manana <fdmanana@kernel.org> 於 2022年7月11日 週一 晚上10:10寫道:
>
> On Sun, Jul 10, 2022 at 11:07:10PM +0800, bingjing chang wrote:
> > Hi Filipe,
> >
> > Thank you for the review and comments. It really helps.
> > I will submit a patch v2 to fix the problems.
> >
> > Filipe Manana <fdmanana@kernel.org> 於 2022年7月7日 週四 凌晨12:04寫道:
> > >
> > > On Wed, Jul 06, 2022 at 09:09:03PM +0800, bingjingc wrote:
> > > > From: BingJing Chang <bingjingc@synology.com>
> > > >
> > > > btrfs_ioctl_send processes recorded btrfs_keys in a defined order. (First,
> > > > we process a btrfs_key with a samller objectid. If btrfs_keys have the same
> > > > objectid, then we compare their types and offsets accordingly.)
> > >
> > > That's a very convoluted way to simply say that send processes keys in the order
> > > they are found in the btrees, from the leftmost to the rightmost.
> > >
> > > > However,
> > > > reference paths for an inode can be stored in either BTRFS_INODE_REF_KEY
> > > > btrfs_keys or BTRFS_INODE_EXTREF_KEY btrfs_keys. And due to the limitation
> > > > of the helper function - iterate_inode_ref, we can only iterate the entries
> > > > of ONE btrfs_inode_ref or btrfs_inode_extref. That is, there must be a bug
> > > > in processing the same reference paths, which are stored in different ways.
> > >
> > > When you say "there must be a bug", you want to say "there is a bug", and then
> > > explain what the bug is.
> > >
> > > However the bug has nothing to do with the order of how keys are processed.
> > >
> > > What happens is that when we are processing an inode we go over all references
> > > and add all the new names to the "new_refs" list and add all the deleted names
> > > to the "deleted_refs" list.
> > >
> > > Then in the end, when we finish processing the inode, we iterate over all the
> > > new names in the "new_refs" list and send link commands for those names. After
> > > that we go over all the names in the "deleted_refs" list and send unlink commands
> > > for those names.
> > >
> > > The problem in this case, is that we have names duplicated in both lists.
> > > So we try to create them before we remove them, therefore the receiver gets an
> > > -EEXIST error when trying the link operations.
> > >
> >
> > Yes, the problem happens when we have names duplicated in both lists.
> > And with the current logics in process_recorded_refs(), we will issue
> > links on existing files.
> >
> > I will try to make the description clear in patch v2.
> >
> > > >
> > > > Here is an exmple that btrfs_ioctl_send will send a link command on an
> > >
> > > exmple -> example
> > >
> > > > existing file path:
> > > >
> > > >   $ btrfs subvolume create test
> > > >
> > > >   # create a file and 2000 hard links to the same inode
> > > >   $ dd if=/dev/zero of=test/1M bs=1M count=1
> > >
> > > touch test/1M
> > >
> > > The data is completely irrelevant here, all we care about are the
> > > hard links of the file. Making the reproducer the minimum necessary
> > > to trigger a bug, makes it much less distracting and easier to grasp.
> > >
> > > >   $ for i in {1..2000}; do link test/1M test/$i ; done
> > > >
> > > >   # take a snapshot for parent snapshot
> > > >   $ btrfs sub snap -r test snap1
> > > >
> > > >   # remove 2000 hard links and re-create the last 1000 links
> > > >   $ for i in {1..2000}; do rm test/$i; done;
> > > >   $ for i in {1001..2000}; do link test/1M test/$i; done
> > > >
> > > >   # take another one for sned snapshot
> > >
> > > sned -> send
> > >
> > > >   $ btrfs sub snap -r test snap2
> > > >
> > > >   $ mkdir tmp
> > > >   $ btrfs send -e snap2 -p snap1 | btrfs receive tmp/
> > >
> > > The -e is not necessary.
> > >
> > > >   At subvol snap2
> > > >   link 1238 -> 1M
> > > >   ERROR: link 1238 -> 1M failed: File exists
> > > >
> > > > In the parent snapshot snap1, reference paths 1 - 1237 are stored in a
> > > > INODE_REF item and the rest ones are stored in other INODE_EXTREF items.
> > > > But in the send snapshot snap2, all reference paths can be stored within a
> > > > INODE_REF item. During the send process, btrfs_ioctl_send will process the
> > > > INODE_REF item first. Then it found that reference paths 1 - 1000 were
> > > > gone in the send snapshot, so it recorded them in sctx->deleted_refs for
> > > > unlink. And it found that reference paths 1238 - 2000 were new paths in
> > > > the send snapshot, so it recorded them in sctx->new_refs for link. Since
> > > > we do not load all contents of its INODE_EXTREF items to make comparison,
> > > > afterwards, btrfs_ioctl_send may make a mistake sending a link command on
> > > > an existing file path.
> > >
> > > So this explanation is not correct, it's neither about the names being stored
> > > in an INODE_REF or INODE_EXTREF item nor about processing one item before or
> > > after the other. As mentioned before, it's because we always process the
> > > "new_refs" list before the "deleted_refs" list, and in this case we have the
> > > same names (paths) in both lists.
> > >
> >
> > Sorry, I didn't make this paragraph easy to understand.
> > Here, I want to address that there is already a function find_iref(), which
> > can be used to check duplications. But it has limitations.
>
> Yes, I got it. Its problem is that it works only for the names inside a single
> ref (or extref) item.
>
> Your solution basically is a version of find_iref() that works across all the
> ref/extref items of an inode.
>

Yes, that is. Thank you for your understanding.

> >
> > For this example, we delete 2000 files and recreate 1000 files. Not all of
> > the 2000 file paths are added to the "deleted_refs" list. And not all of
> > the 1000 re-created file paths are added to the "new_refs" list. With
> > find_iref(), when processing inode references, {1001..1237} are not added
> > to any lists because they appear in both inode references. Afterwards,
> > two lists contain items as below:
> > "new_refs" list: {1238..2000}
> > "deleted_refs" list: {1..1000} + {1238..2000}
> > Therefore, there are duplicated items {1238..2000} in both lists. It results
> > in link failure.
>
> Yep. Many names got moved from an extref item into a regular ref item, which
> is what leads to the duplicate names in both lists.
>
> >
> > > >
> > > > To fix the bug, we can either remove the duplicated items both in
> > > > sctx->new_refs and sctx->deleted_refs before generating link/unlink
> > > > commands or prevent them from being added into list. Both of them require
> > > > efficient data structures like C++ sets to look up duplicated items.
> > >
> > > There's a much more obvious alternative, which is processing first the
> > > "deleted_refs" list before the "new_refs" list.
> > >
> > > It's inefficient because we do a bunch of unlinks followed by links for
> > > the same paths. On the other hand, it does not require maintaining two
> > > new rbtrees and allocating memory for records in these rbtrees.
> > >
> >
> > Thank you for mentioning this. I will describe the intuition idea and the
> > difficulties. I was unable to reshuffle the processing order. If someone can
> > do this, we are glad to pick that solution.
>
> I just tried that today, but it's harder than I though, I couldn't find a way
> for it to work without breaking other scenarios (mostly to get orphanization
> right and the need to recompute paths after orphanization).
>
> So I think we can go with your solution.
> I would just like to see two changes:
>
> 1) An updated changelog as mentioned before;
>
> 2) Change the code to only use the new infrastructure to manage refs.
>    The code you added is a parallel infrastructure that is more robust and
>    deals with that case of names being moved from an extref item to a ref
>    item. Having only one infrastructure makes things much easier to maintain
>    and a lot less code, and it also allows to rename record_ref2() to simply
>    record_ref(), as the old record_ref() is now gone.
>    I made those changes in the following patch, which you can apply on top
>    of your patch, or if you prefer I can send it separately.
>
> What do you think?
>
I just revised my patch. I'd like to ask you for a favor to submit a
separate one. Thanks a lot.

> The cleanup patch is the following, and it passes fstests and some light
> stress testing:
>
> From: Filipe Manana <fdmanana@suse.com>
> Date: Mon, 11 Jul 2022 14:55:26 +0100
> Subject: [PATCH] btrfs: send: always use the rbtree based inode ref management
>  infrastructure
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/send.c | 203 +++++-------------------------------------------
>  1 file changed, 20 insertions(+), 183 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 37e3ba5c8586..797e2fb3b26b 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -2189,7 +2189,7 @@ static int __get_cur_name_and_parent(struct send_ctx *sctx,
>         /*
>          * If the inode is not existent yet, add the orphan name and return 1.
>          * This should only happen for the parent dir that we determine in
> -        * __record_new_ref
> +        * __record_new_ref_if_needed().
>          */
>         ret = is_inode_existent(sctx, ino, gen);
>         if (ret < 0)
> @@ -2790,27 +2790,6 @@ static void set_ref_path(struct recorded_ref *ref, struct fs_path *path)
>         ref->name_len = ref->full_path->end - ref->name;
>  }
>
> -/*
> - * We need to process new refs before deleted refs, but compare_tree gives us
> - * everything mixed. So we first record all refs and later process them.
> - * This function is a helper to record one ref.
> - */
> -static int __record_ref(struct list_head *head, u64 dir,
> -                     u64 dir_gen, struct fs_path *path)
> -{
> -       struct recorded_ref *ref;
> -
> -       ref = recorded_ref_alloc();
> -       if (!ref)
> -               return -ENOMEM;
> -
> -       ref->dir = dir;
> -       ref->dir_gen = dir_gen;
> -       set_ref_path(ref, path);
> -       list_add_tail(&ref->list, head);
> -       return 0;
> -}
> -
>  static int dup_ref(struct recorded_ref *ref, struct list_head *list)
>  {
>         struct recorded_ref *new;
> @@ -4337,56 +4316,6 @@ static int process_recorded_refs(struct send_ctx *sctx, int *pending_move)
>         return ret;
>  }
>
> -static int record_ref(struct btrfs_root *root, u64 dir, struct fs_path *name,
> -                     void *ctx, struct list_head *refs)
> -{
> -       int ret = 0;
> -       struct send_ctx *sctx = ctx;
> -       struct fs_path *p;
> -       u64 gen;
> -
> -       p = fs_path_alloc();
> -       if (!p)
> -               return -ENOMEM;
> -
> -       ret = get_inode_info(root, dir, NULL, &gen, NULL, NULL,
> -                       NULL, NULL);
> -       if (ret < 0)
> -               goto out;
> -
> -       ret = get_cur_path(sctx, dir, gen, p);
> -       if (ret < 0)
> -               goto out;
> -       ret = fs_path_add_path(p, name);
> -       if (ret < 0)
> -               goto out;
> -
> -       ret = __record_ref(refs, dir, gen, p);
> -
> -out:
> -       if (ret)
> -               fs_path_free(p);
> -       return ret;
> -}
> -
> -static int __record_new_ref(int num, u64 dir, int index,
> -                           struct fs_path *name,
> -                           void *ctx)
> -{
> -       struct send_ctx *sctx = ctx;
> -       return record_ref(sctx->send_root, dir, name, ctx, &sctx->new_refs);
> -}
> -
> -
> -static int __record_deleted_ref(int num, u64 dir, int index,
> -                               struct fs_path *name,
> -                               void *ctx)
> -{
> -       struct send_ctx *sctx = ctx;
> -       return record_ref(sctx->parent_root, dir, name, ctx,
> -                         &sctx->deleted_refs);
> -}
> -
>  static int rbtree_ref_comp(const void *k, const struct rb_node *node)
>  {
>         const struct recorded_ref *data = k;
> @@ -4422,9 +4351,9 @@ static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent)
>         return rbtree_ref_comp(entry, parent) < 0;
>  }
>
> -static int record_ref2(struct rb_root *root, struct list_head *refs,
> -                            struct fs_path *name, u64 dir, u64 dir_gen,
> -                            struct send_ctx *sctx)
> +static int record_ref(struct rb_root *root, struct list_head *refs,
> +                     struct fs_path *name, u64 dir, u64 dir_gen,
> +                     struct send_ctx *sctx)
>  {
>         int ret = 0;
>         struct fs_path *path = NULL;
> @@ -4487,8 +4416,8 @@ static int __record_new_ref_if_needed(int num, u64 dir, int index,
>                 ref = rb_entry(node, struct recorded_ref, node);
>                 recorded_ref_free(ref);
>         } else {
> -               ret = record_ref2(&sctx->rbtree_new_refs, &sctx->new_refs,
> -                                 name, dir, dir_gen, sctx);
> +               ret = record_ref(&sctx->rbtree_new_refs, &sctx->new_refs,
> +                                name, dir, dir_gen, sctx);
>         }
>  out:
>         return ret;
> @@ -4518,9 +4447,9 @@ static int __record_deleted_ref_if_needed(int num, u64 dir, int index,
>                 ref = rb_entry(node, struct recorded_ref, node);
>                 recorded_ref_free(ref);
>         } else {
> -               ret = record_ref2(&sctx->rbtree_deleted_refs,
> -                                 &sctx->deleted_refs, name, dir, dir_gen,
> -                                 sctx);
> +               ret = record_ref(&sctx->rbtree_deleted_refs,
> +                                &sctx->deleted_refs, name, dir, dir_gen,
> +                                sctx);
>         }
>  out:
>         return ret;
> @@ -4564,113 +4493,16 @@ struct find_ref_ctx {
>         int found_idx;
>  };
>
> -static int __find_iref(int num, u64 dir, int index,
> -                      struct fs_path *name,
> -                      void *ctx_)
> -{
> -       struct find_ref_ctx *ctx = ctx_;
> -       u64 dir_gen;
> -       int ret;
> -
> -       if (dir == ctx->dir && fs_path_len(name) == fs_path_len(ctx->name) &&
> -           strncmp(name->start, ctx->name->start, fs_path_len(name)) == 0) {
> -               /*
> -                * To avoid doing extra lookups we'll only do this if everything
> -                * else matches.
> -                */
> -               ret = get_inode_info(ctx->root, dir, NULL, &dir_gen, NULL,
> -                                    NULL, NULL, NULL);
> -               if (ret)
> -                       return ret;
> -               if (dir_gen != ctx->dir_gen)
> -                       return 0;
> -               ctx->found_idx = num;
> -               return 1;
> -       }
> -       return 0;
> -}
> -
> -static int find_iref(struct btrfs_root *root,
> -                    struct btrfs_path *path,
> -                    struct btrfs_key *key,
> -                    u64 dir, u64 dir_gen, struct fs_path *name)
> -{
> -       int ret;
> -       struct find_ref_ctx ctx;
> -
> -       ctx.dir = dir;
> -       ctx.name = name;
> -       ctx.dir_gen = dir_gen;
> -       ctx.found_idx = -1;
> -       ctx.root = root;
> -
> -       ret = iterate_inode_ref(root, path, key, 0, __find_iref, &ctx);
> -       if (ret < 0)
> -               return ret;
> -
> -       if (ctx.found_idx == -1)
> -               return -ENOENT;
> -
> -       return ctx.found_idx;
> -}
> -
> -static int __record_changed_new_ref(int num, u64 dir, int index,
> -                                   struct fs_path *name,
> -                                   void *ctx)
> -{
> -       u64 dir_gen;
> -       int ret;
> -       struct send_ctx *sctx = ctx;
> -
> -       ret = get_inode_info(sctx->send_root, dir, NULL, &dir_gen, NULL,
> -                            NULL, NULL, NULL);
> -       if (ret)
> -               return ret;
> -
> -       ret = find_iref(sctx->parent_root, sctx->right_path,
> -                       sctx->cmp_key, dir, dir_gen, name);
> -       if (ret == -ENOENT)
> -               ret = __record_new_ref_if_needed(num, dir, index, name, sctx);
> -       else if (ret > 0)
> -               ret = 0;
> -
> -       return ret;
> -}
> -
> -static int __record_changed_deleted_ref(int num, u64 dir, int index,
> -                                       struct fs_path *name,
> -                                       void *ctx)
> -{
> -       u64 dir_gen;
> -       int ret;
> -       struct send_ctx *sctx = ctx;
> -
> -       ret = get_inode_info(sctx->parent_root, dir, NULL, &dir_gen, NULL,
> -                            NULL, NULL, NULL);
> -       if (ret)
> -               return ret;
> -
> -       ret = find_iref(sctx->send_root, sctx->left_path, sctx->cmp_key,
> -                       dir, dir_gen, name);
> -       if (ret == -ENOENT)
> -               ret = __record_deleted_ref_if_needed(num, dir, index, name,
> -                                                    sctx);
> -       else if (ret > 0)
> -               ret = 0;
> -
> -       return ret;
> -}
> -
>  static int record_changed_ref(struct send_ctx *sctx)
>  {
>         int ret = 0;
>
>         ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
> -                       sctx->cmp_key, 0, __record_changed_new_ref, sctx);
> +                       sctx->cmp_key, 0, __record_new_ref_if_needed, sctx);
>         if (ret < 0)
>                 goto out;
>         ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
> -                       sctx->cmp_key, 0, __record_changed_deleted_ref, sctx);
> +                       sctx->cmp_key, 0, __record_deleted_ref_if_needed, sctx);
>         if (ret < 0)
>                 goto out;
>         ret = 0;
> @@ -4701,10 +4533,10 @@ static int process_all_refs(struct send_ctx *sctx,
>
>         if (cmd == BTRFS_COMPARE_TREE_NEW) {
>                 root = sctx->send_root;
> -               cb = __record_new_ref;
> +               cb = __record_new_ref_if_needed;
>         } else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
>                 root = sctx->parent_root;
> -               cb = __record_deleted_ref;
> +               cb = __record_deleted_ref_if_needed;
>         } else {
>                 btrfs_err(sctx->send_root->fs_info,
>                                 "Wrong command %d in process_all_refs", cmd);
> @@ -6544,8 +6376,13 @@ static int record_parent_ref(int num, u64 dir, int index, struct fs_path *name,
>  {
>         struct parent_paths_ctx *ppctx = ctx;
>
> -       return record_ref(ppctx->sctx->parent_root, dir, name, ppctx->sctx,
> -                         ppctx->refs);
> +       /*
> +        * Pass 0 as the generation for the directory, we don't care about it
> +        * here as we have no new references to add, we just want to delete all
> +        * references for an inode.
> +        */
> +       return record_ref(&ppctx->sctx->rbtree_deleted_refs, ppctx->refs, name,
> +                         dir, 0, ppctx->sctx);
>  }
>
>  /*
> --
> 2.35.1
>
>
> Thanks.
>
>
> >
> > > Plus this type of scenario should be very rare. It requires having at least
> > > hundreds of hard links in an inode in order to trigger the creation of extended
> > > references, and then removing and recreating a bunch of those hard links in the
> > > send snapshot. How common is that?
> > >
> > > Is this a case you got into in some user workload, or was it found by reading
> > > and inspecting the code?
> > >
> >
> > It's not common. But it happened in the real world. Some version backup
> > applications seem to use large hard links for some purposes.
> >
> > > I would like to have in the changelog this alternative mentioned, and if it's
> > > not good, then explain why it is not, and why we must follow a different solution.
> > >
> >
> > > It's probably not easy, because at process_recorded_refs() when we unlink we
> > > need to know if any ancestor was orphanized before, which we figure out when
> > > we iterate over the "new_refs" list. So it would likely need some reshuffling
> > > in the logic of how we do things there.
> > >
> > > > And
> > > > we also need to take two scenarios into consideration. One is the most
> > > > common case that one inode has only one reference path. The other is the
> > > > worst case that there are ten thousands of hard links of an inode.
> > > > (BTRFS_LINK_MAX is 65536) So we'd like to introduce rbtree to store the
> > > > computing references. (The tree depth of the worst cases is just 16. And
> > >
> > > computing -> computed
> > >
> > > > it takes less memory to store few entries than hash sets.) And in order
> > > > not to occupy too much moemory, we also introduce
> > >
> > > moemory -> memory
> > >
> > > > __record_new_ref_if_needed() and __record_deleted_ref_if_needed() for
> > > > changed_ref() to check and remove the duplications early.
> > >
> > > Also, the subject:
> > >
> > > "btrfs: send: fix a bug that sending a link command on existing file path"
> > >
> > > sounds a bit awkward, mostly because of the "that".
> > > Something like the following would be more concise:
> > >
> > > "btrfs: send: fix sending link commands for existing file paths"
> > >
> > > >
> > > > Reviewed-by: Robbie Ko <robbieko@synology.com>
> > > > Signed-off-by: BingJing Chang <bingjingc@synology.com>
> > > > ---
> > > >  fs/btrfs/send.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++--
> > > >  1 file changed, 156 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > > > index 420a86720aa2..23ae631ef23b 100644
> > > > --- a/fs/btrfs/send.c
> > > > +++ b/fs/btrfs/send.c
> > > > @@ -234,6 +234,9 @@ struct send_ctx {
> > > >        * Indexed by the inode number of the directory to be deleted.
> > > >        */
> > > >       struct rb_root orphan_dirs;
> > > > +
> > > > +     struct rb_root rbtree_new_refs;
> > > > +     struct rb_root rbtree_deleted_refs;
> > > >  };
> > > >
> > > >  struct pending_dir_move {
> > > > @@ -2747,6 +2750,8 @@ struct recorded_ref {
> > > >       u64 dir;
> > > >       u64 dir_gen;
> > > >       int name_len;
> > > > +     struct rb_node node;
> > > > +     struct rb_root *root;
> > > >  };
> > > >
> > > >  static struct recorded_ref *recorded_ref_alloc(void)
> > > > @@ -2756,6 +2761,7 @@ static struct recorded_ref *recorded_ref_alloc(void)
> > > >       ref = kzalloc(sizeof(*ref), GFP_KERNEL);
> > > >       if (!ref)
> > > >               return NULL;
> > > > +     RB_CLEAR_NODE(&ref->node);
> > > >       INIT_LIST_HEAD(&ref->list);
> > > >       return ref;
> > > >  }
> > > > @@ -2764,6 +2770,8 @@ static void recorded_ref_free(struct recorded_ref *ref)
> > > >  {
> > > >       if (!ref)
> > > >               return;
> > > > +     if (!RB_EMPTY_NODE(&ref->node))
> > > > +             rb_erase(&ref->node, ref->root);
> > > >       list_del(&ref->list);
> > > >       fs_path_free(ref->full_path);
> > > >       kfree(ref);
> > > > @@ -4373,12 +4381,152 @@ static int __record_deleted_ref(int num, u64 dir, int index,
> > > >                         &sctx->deleted_refs);
> > > >  }
> > > >
> > > > +static int rbtree_ref_comp(const void *k, const struct rb_node *node)
> > > > +{
> > > > +     const struct recorded_ref *data = k;
> > > > +     const struct recorded_ref *ref = rb_entry(node, struct recorded_ref, node);
> > > > +     int result;
> > > > +
> > > > +     if (data->dir > ref->dir)
> > > > +             return 1;
> > > > +     if (data->dir < ref->dir)
> > > > +             return -1;
> > > > +     if (data->dir_gen > ref->dir_gen)
> > > > +             return 1;
> > > > +     if (data->dir_gen < ref->dir_gen)
> > > > +             return -1;
> > > > +     if (data->name_len > ref->name_len)
> > > > +             return 1;
> > > > +     if (data->name_len < ref->name_len)
> > > > +             return -1;
> > > > +     result = strcmp(data->name, ref->name);
> > > > +     if (result > 0)
> > > > +             return 1;
> > > > +     if (result < 0)
> > > > +             return -1;
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent)
> > > > +{
> > > > +     const struct recorded_ref *entry = rb_entry(node,
> > > > +                                                 struct recorded_ref,
> > > > +                                                 node);
> > > > +
> > > > +     return rbtree_ref_comp(entry, parent) < 0;
> > > > +}
> > > > +
> > > > +static int record_ref2(struct rb_root *root, struct list_head *refs,
> > >
> > > This is a terrible function name.
> > >
> > > If we end up with this solution, please rename it to something more clear
> > > like record_ref_in_tree() for example. Almost anything other than an existing
> > > function name followed by a number is a better choice.
> > >
> >
> > Thank you for the naming.
> >
> > > This bug is a very good finding, and has been around since forever.
> > >
> > > Thanks.
> > >
> > > > +                          struct fs_path *name, u64 dir, u64 dir_gen,
> > > > +                          struct send_ctx *sctx)
> > > > +{
> > > > +     int ret = 0;
> > > > +     struct fs_path *path = NULL;
> > > > +     struct recorded_ref *ref = NULL;
> > > > +
> > > > +     path = fs_path_alloc();
> > > > +     if (!path) {
> > > > +             ret = -ENOMEM;
> > > > +             goto out;
> > > > +     }
> > > > +
> > > > +     ref = recorded_ref_alloc();
> > > > +     if (!ref) {
> > > > +             ret = -ENOMEM;
> > > > +             goto out;
> > > > +     }
> > > > +
> > > > +     ret = get_cur_path(sctx, dir, dir_gen, path);
> > > > +     if (ret < 0)
> > > > +             goto out;
> > > > +     ret = fs_path_add_path(path, name);
> > > > +     if (ret < 0)
> > > > +             goto out;
> > > > +
> > > > +     ref->dir = dir;
> > > > +     ref->dir_gen = dir_gen;
> > > > +     set_ref_path(ref, path);
> > > > +     list_add_tail(&ref->list, refs);
> > > > +     rb_add(&ref->node, root, rbtree_ref_less);
> > > > +     ref->root = root;
> > > > +out:
> > > > +     if (ret) {
> > > > +             if (path && (!ref || !ref->full_path))
> > > > +                     fs_path_free(path);
> > > > +             recorded_ref_free(ref);
> > > > +     }
> > > > +     return ret;
> > > > +}
> > > > +
> > > > +static int __record_new_ref_if_needed(int num, u64 dir, int index,
> > > > +                                   struct fs_path *name, void *ctx)
> > > > +{
> > > > +     int ret = 0;
> > > > +     struct send_ctx *sctx = ctx;
> > > > +     struct rb_node *node = NULL;
> > > > +     struct recorded_ref data;
> > > > +     struct recorded_ref *ref;
> > > > +     u64 dir_gen;
> > > > +
> > > > +     ret = get_inode_info(sctx->send_root, dir, NULL, &dir_gen, NULL,
> > > > +                          NULL, NULL, NULL);
> > > > +     if (ret < 0)
> > > > +             goto out;
> > > > +
> > > > +     data.dir = dir;
> > > > +     data.dir_gen = dir_gen;
> > > > +     set_ref_path(&data, name);
> > > > +     node = rb_find(&data, &sctx->rbtree_deleted_refs, rbtree_ref_comp);
> > > > +     if (node) {
> > > > +             ref = rb_entry(node, struct recorded_ref, node);
> > > > +             recorded_ref_free(ref);
> > > > +     } else {
> > > > +             ret = record_ref2(&sctx->rbtree_new_refs, &sctx->new_refs,
> > > > +                               name, dir, dir_gen, sctx);
> > > > +     }
> > > > +out:
> > > > +     return ret;
> > > > +}
> > > > +
> > > > +static int __record_deleted_ref_if_needed(int num, u64 dir, int index,
> > > > +                         struct fs_path *name,
> > > > +                         void *ctx)
> > > > +{
> > > > +     int ret = 0;
> > > > +     struct send_ctx *sctx = ctx;
> > > > +     struct rb_node *node = NULL;
> > > > +     struct recorded_ref data;
> > > > +     struct recorded_ref *ref;
> > > > +     u64 dir_gen;
> > > > +
> > > > +     ret = get_inode_info(sctx->parent_root, dir, NULL, &dir_gen, NULL,
> > > > +                          NULL, NULL, NULL);
> > > > +     if (ret < 0)
> > > > +             goto out;
> > > > +
> > > > +     data.dir = dir;
> > > > +     data.dir_gen = dir_gen;
> > > > +     set_ref_path(&data, name);
> > > > +     node = rb_find(&data, &sctx->rbtree_new_refs, rbtree_ref_comp);
> > > > +     if (node) {
> > > > +             ref = rb_entry(node, struct recorded_ref, node);
> > > > +             recorded_ref_free(ref);
> > > > +     } else {
> > > > +             ret = record_ref2(&sctx->rbtree_deleted_refs,
> > > > +                               &sctx->deleted_refs, name, dir, dir_gen,
> > > > +                               sctx);
> > > > +     }
> > > > +out:
> > > > +     return ret;
> > > > +}
> > > > +
> > > >  static int record_new_ref(struct send_ctx *sctx)
> > > >  {
> > > >       int ret;
> > > >
> > > >       ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
> > > > -                             sctx->cmp_key, 0, __record_new_ref, sctx);
> > > > +                             sctx->cmp_key, 0, __record_new_ref_if_needed,
> > > > +                             sctx);
> > > >       if (ret < 0)
> > > >               goto out;
> > > >       ret = 0;
> > > > @@ -4392,7 +4540,8 @@ static int record_deleted_ref(struct send_ctx *sctx)
> > > >       int ret;
> > > >
> > > >       ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
> > > > -                             sctx->cmp_key, 0, __record_deleted_ref, sctx);
> > > > +                             sctx->cmp_key, 0,
> > > > +                             __record_deleted_ref_if_needed, sctx);
> > > >       if (ret < 0)
> > > >               goto out;
> > > >       ret = 0;
> > > > @@ -4475,7 +4624,7 @@ static int __record_changed_new_ref(int num, u64 dir, int index,
> > > >       ret = find_iref(sctx->parent_root, sctx->right_path,
> > > >                       sctx->cmp_key, dir, dir_gen, name);
> > > >       if (ret == -ENOENT)
> > > > -             ret = __record_new_ref(num, dir, index, name, sctx);
> > > > +             ret = __record_new_ref_if_needed(num, dir, index, name, sctx);
> > > >       else if (ret > 0)
> > > >               ret = 0;
> > > >
> > > > @@ -4498,7 +4647,8 @@ static int __record_changed_deleted_ref(int num, u64 dir, int index,
> > > >       ret = find_iref(sctx->send_root, sctx->left_path, sctx->cmp_key,
> > > >                       dir, dir_gen, name);
> > > >       if (ret == -ENOENT)
> > > > -             ret = __record_deleted_ref(num, dir, index, name, sctx);
> > > > +             ret = __record_deleted_ref_if_needed(num, dir, index, name,
> > > > +                                                  sctx);
> > > >       else if (ret > 0)
> > > >               ret = 0;
> > > >
> > > > @@ -7576,6 +7726,8 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
> > > >       sctx->pending_dir_moves = RB_ROOT;
> > > >       sctx->waiting_dir_moves = RB_ROOT;
> > > >       sctx->orphan_dirs = RB_ROOT;
> > > > +     sctx->rbtree_new_refs = RB_ROOT;
> > > > +     sctx->rbtree_deleted_refs = RB_ROOT;
> > > >
> > > >       sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
> > > >                                    arg->clone_sources_count + 1,
> > > > --
> > > > 2.37.0
> > > >
diff mbox series

Patch

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 420a86720aa2..23ae631ef23b 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -234,6 +234,9 @@  struct send_ctx {
 	 * Indexed by the inode number of the directory to be deleted.
 	 */
 	struct rb_root orphan_dirs;
+
+	struct rb_root rbtree_new_refs;
+	struct rb_root rbtree_deleted_refs;
 };
 
 struct pending_dir_move {
@@ -2747,6 +2750,8 @@  struct recorded_ref {
 	u64 dir;
 	u64 dir_gen;
 	int name_len;
+	struct rb_node node;
+	struct rb_root *root;
 };
 
 static struct recorded_ref *recorded_ref_alloc(void)
@@ -2756,6 +2761,7 @@  static struct recorded_ref *recorded_ref_alloc(void)
 	ref = kzalloc(sizeof(*ref), GFP_KERNEL);
 	if (!ref)
 		return NULL;
+	RB_CLEAR_NODE(&ref->node);
 	INIT_LIST_HEAD(&ref->list);
 	return ref;
 }
@@ -2764,6 +2770,8 @@  static void recorded_ref_free(struct recorded_ref *ref)
 {
 	if (!ref)
 		return;
+	if (!RB_EMPTY_NODE(&ref->node))
+		rb_erase(&ref->node, ref->root);
 	list_del(&ref->list);
 	fs_path_free(ref->full_path);
 	kfree(ref);
@@ -4373,12 +4381,152 @@  static int __record_deleted_ref(int num, u64 dir, int index,
 			  &sctx->deleted_refs);
 }
 
+static int rbtree_ref_comp(const void *k, const struct rb_node *node)
+{
+	const struct recorded_ref *data = k;
+	const struct recorded_ref *ref = rb_entry(node, struct recorded_ref, node);
+	int result;
+
+	if (data->dir > ref->dir)
+		return 1;
+	if (data->dir < ref->dir)
+		return -1;
+	if (data->dir_gen > ref->dir_gen)
+		return 1;
+	if (data->dir_gen < ref->dir_gen)
+		return -1;
+	if (data->name_len > ref->name_len)
+		return 1;
+	if (data->name_len < ref->name_len)
+		return -1;
+	result = strcmp(data->name, ref->name);
+	if (result > 0)
+		return 1;
+	if (result < 0)
+		return -1;
+	return 0;
+}
+
+static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent)
+{
+	const struct recorded_ref *entry = rb_entry(node,
+						    struct recorded_ref,
+						    node);
+
+	return rbtree_ref_comp(entry, parent) < 0;
+}
+
+static int record_ref2(struct rb_root *root, struct list_head *refs,
+			     struct fs_path *name, u64 dir, u64 dir_gen,
+			     struct send_ctx *sctx)
+{
+	int ret = 0;
+	struct fs_path *path = NULL;
+	struct recorded_ref *ref = NULL;
+
+	path = fs_path_alloc();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ref = recorded_ref_alloc();
+	if (!ref) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = get_cur_path(sctx, dir, dir_gen, path);
+	if (ret < 0)
+		goto out;
+	ret = fs_path_add_path(path, name);
+	if (ret < 0)
+		goto out;
+
+	ref->dir = dir;
+	ref->dir_gen = dir_gen;
+	set_ref_path(ref, path);
+	list_add_tail(&ref->list, refs);
+	rb_add(&ref->node, root, rbtree_ref_less);
+	ref->root = root;
+out:
+	if (ret) {
+		if (path && (!ref || !ref->full_path))
+			fs_path_free(path);
+		recorded_ref_free(ref);
+	}
+	return ret;
+}
+
+static int __record_new_ref_if_needed(int num, u64 dir, int index,
+				      struct fs_path *name, void *ctx)
+{
+	int ret = 0;
+	struct send_ctx *sctx = ctx;
+	struct rb_node *node = NULL;
+	struct recorded_ref data;
+	struct recorded_ref *ref;
+	u64 dir_gen;
+
+	ret = get_inode_info(sctx->send_root, dir, NULL, &dir_gen, NULL,
+			     NULL, NULL, NULL);
+	if (ret < 0)
+		goto out;
+
+	data.dir = dir;
+	data.dir_gen = dir_gen;
+	set_ref_path(&data, name);
+	node = rb_find(&data, &sctx->rbtree_deleted_refs, rbtree_ref_comp);
+	if (node) {
+		ref = rb_entry(node, struct recorded_ref, node);
+		recorded_ref_free(ref);
+	} else {
+		ret = record_ref2(&sctx->rbtree_new_refs, &sctx->new_refs,
+				  name, dir, dir_gen, sctx);
+	}
+out:
+	return ret;
+}
+
+static int __record_deleted_ref_if_needed(int num, u64 dir, int index,
+			    struct fs_path *name,
+			    void *ctx)
+{
+	int ret = 0;
+	struct send_ctx *sctx = ctx;
+	struct rb_node *node = NULL;
+	struct recorded_ref data;
+	struct recorded_ref *ref;
+	u64 dir_gen;
+
+	ret = get_inode_info(sctx->parent_root, dir, NULL, &dir_gen, NULL,
+			     NULL, NULL, NULL);
+	if (ret < 0)
+		goto out;
+
+	data.dir = dir;
+	data.dir_gen = dir_gen;
+	set_ref_path(&data, name);
+	node = rb_find(&data, &sctx->rbtree_new_refs, rbtree_ref_comp);
+	if (node) {
+		ref = rb_entry(node, struct recorded_ref, node);
+		recorded_ref_free(ref);
+	} else {
+		ret = record_ref2(&sctx->rbtree_deleted_refs,
+				  &sctx->deleted_refs, name, dir, dir_gen,
+				  sctx);
+	}
+out:
+	return ret;
+}
+
 static int record_new_ref(struct send_ctx *sctx)
 {
 	int ret;
 
 	ret = iterate_inode_ref(sctx->send_root, sctx->left_path,
-				sctx->cmp_key, 0, __record_new_ref, sctx);
+				sctx->cmp_key, 0, __record_new_ref_if_needed,
+				sctx);
 	if (ret < 0)
 		goto out;
 	ret = 0;
@@ -4392,7 +4540,8 @@  static int record_deleted_ref(struct send_ctx *sctx)
 	int ret;
 
 	ret = iterate_inode_ref(sctx->parent_root, sctx->right_path,
-				sctx->cmp_key, 0, __record_deleted_ref, sctx);
+				sctx->cmp_key, 0,
+				__record_deleted_ref_if_needed, sctx);
 	if (ret < 0)
 		goto out;
 	ret = 0;
@@ -4475,7 +4624,7 @@  static int __record_changed_new_ref(int num, u64 dir, int index,
 	ret = find_iref(sctx->parent_root, sctx->right_path,
 			sctx->cmp_key, dir, dir_gen, name);
 	if (ret == -ENOENT)
-		ret = __record_new_ref(num, dir, index, name, sctx);
+		ret = __record_new_ref_if_needed(num, dir, index, name, sctx);
 	else if (ret > 0)
 		ret = 0;
 
@@ -4498,7 +4647,8 @@  static int __record_changed_deleted_ref(int num, u64 dir, int index,
 	ret = find_iref(sctx->send_root, sctx->left_path, sctx->cmp_key,
 			dir, dir_gen, name);
 	if (ret == -ENOENT)
-		ret = __record_deleted_ref(num, dir, index, name, sctx);
+		ret = __record_deleted_ref_if_needed(num, dir, index, name,
+						     sctx);
 	else if (ret > 0)
 		ret = 0;
 
@@ -7576,6 +7726,8 @@  long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
 	sctx->pending_dir_moves = RB_ROOT;
 	sctx->waiting_dir_moves = RB_ROOT;
 	sctx->orphan_dirs = RB_ROOT;
+	sctx->rbtree_new_refs = RB_ROOT;
+	sctx->rbtree_deleted_refs = RB_ROOT;
 
 	sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots),
 				     arg->clone_sources_count + 1,