diff mbox series

[3/3] merge-ort: fix issue with dual rename and add/add conflict

Message ID bb2badccb71d76efe0e47431246376b1e7016b05.1655871652.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Fix dual rename into each other plus conflicting adds | expand

Commit Message

Elijah Newren June 22, 2022, 4:20 a.m. UTC
From: Elijah Newren <newren@gmail.com>

There is code in both merge-recursive and merge-ort for avoiding doubly
transitive renames (i.e. one side renames directory A/ -> B/, and the
other side renames directory B/ -> C/), because this combination would
otherwise make a mess for new files added to A/ on the first side and
wondering which directory they end up in -- especially if there were
even more renames such as the first side renaming C/ -> D/.  In such
cases, it just turns "off" directory rename detection for the higher
order transitive cases.

The testcases added in t6423 a couple commits ago are slightly different
but similar in principle.  They involve a similar case of paired
renaming but instead of A/ -> B/ and B/ -> C/, the second side renames
a leading directory of B/ to C/.  And both sides add a new file
somewhere under the directory that the other side will rename.  While
the new files added start within different directories and thus could
logically end up within different directories, it is weird for a file
on one side to end up where the other one started and not move along
with it.  So, let's just turn off directory rename detection in this
case as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                         | 4 ++++
 t/t6423-merge-rename-directories.sh | 4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

Comments

Jonathan Tan June 27, 2022, 6:47 p.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> The testcases added in t6423 a couple commits ago are slightly different
> but similar in principle.  They involve a similar case of paired
> renaming but instead of A/ -> B/ and B/ -> C/, the second side renames
> a leading directory of B/ to C/.  

Hmm...one side moved sub1 -> sub3 and the other moved sub2 from the root
to under sub1, right? So it's more like A/ -> B/ and C/ -> A/C/.

> And both sides add a new file
> somewhere under the directory that the other side will rename.  

Rename or move, I think.

> While
> the new files added start within different directories and thus could
> logically end up within different directories, it is weird for a file
> on one side to end up where the other one started and not move along
> with it.  So, let's just turn off directory rename detection in this
> case as well.

Makes sense.

> diff --git a/merge-ort.c b/merge-ort.c
> index fa6667de18c..5bcb9a4980b 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -2292,9 +2292,13 @@ static char *check_for_directory_rename(struct merge_options *opt,
>  	struct strmap_entry *rename_info;
>  	struct strmap_entry *otherinfo = NULL;
>  	const char *new_dir;
> +	int other_side = 3 - side_index;
>  
> +	/* Cases where there is no new path, so we return NULL */

What do you mean by "no new path"?

>  	if (strmap_empty(dir_renames))
>  		return new_path;
> +	if (strmap_get(&collisions[other_side], path))
> +		return new_path;

So as far as I understand, collisions here, for a given side, is a map.
The map's keys are all the filenames of added and renamed files (I'm
assuming that's what 'A' and 'R' are) from that side after any directory
moves on the other side are applied. So, for example, if we add "foo/a"
on side A and rename "foo/" to "bar/" on side B, then side A's collision
map would have a key "bar/a". So I'm not sure if "collision" is the
right name here, but the existing code already uses it so I'll leave it
be.

It makes sense that this situation (of side A having "bar/a" because
side B renamed "foo/" to "bar/", and at the same time, side B adds its
own "bar/a") is weird, as stated in the commit message, so I don't mind
disabling checking for directory rename in this case. However, in
theory, I don't see how disabling this check would fix anything, since
the bug seems to be about two different files on different sides being
renamed to the same filename through some convoluted means. (Unless this
is the only convoluted means to do it, and disabling it means that the
bug wouldn't occur.)
Elijah Newren June 30, 2022, 12:05 a.m. UTC | #2
On Mon, Jun 27, 2022 at 11:47 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > The testcases added in t6423 a couple commits ago are slightly different
> > but similar in principle.  They involve a similar case of paired
> > renaming but instead of A/ -> B/ and B/ -> C/, the second side renames
> > a leading directory of B/ to C/.
>
> Hmm...one side moved sub1 -> sub3 and the other moved sub2 from the root
> to under sub1, right? So it's more like A/ -> B/ and C/ -> A/C/.

Hmm, maybe I should have been more explicit in my mapping:
   A = sub2
   B = sub1/sub2
   leading directory of B = sub1
   C = sub3

> > And both sides add a new file
> > somewhere under the directory that the other side will rename.
>
> Rename or move, I think.

I'm confused; I don't understand what this comment means.  Renames
tend to be created using "git mv", before or after making content
changes, so to me a file being renamed or a file being moved to a
different location are synonymous.  What distinction are you making
between renames and moves?

> > While
> > the new files added start within different directories and thus could
> > logically end up within different directories, it is weird for a file
> > on one side to end up where the other one started and not move along
> > with it.  So, let's just turn off directory rename detection in this
> > case as well.
>
> Makes sense.
>
> > diff --git a/merge-ort.c b/merge-ort.c
> > index fa6667de18c..5bcb9a4980b 100644
> > --- a/merge-ort.c
> > +++ b/merge-ort.c
> > @@ -2292,9 +2292,13 @@ static char *check_for_directory_rename(struct merge_options *opt,
> >       struct strmap_entry *rename_info;
> >       struct strmap_entry *otherinfo = NULL;
> >       const char *new_dir;
> > +     int other_side = 3 - side_index;
> >
> > +     /* Cases where there is no new path, so we return NULL */
>
> What do you mean by "no new path"?

Hmm, perhaps I should change this to:

/* Cases where we don't have or don't want a directory rename for this
path, so we return NULL */

The purpose of this function is to check whether a given path would be
modified by directory renaming to get a new path.  So, "no new path"
means we don't have an applicable directory rename or don't want to
use it.

> >       if (strmap_empty(dir_renames))
> >               return new_path;
> > +     if (strmap_get(&collisions[other_side], path))
> > +             return new_path;
>
> So as far as I understand, collisions here, for a given side, is a map.
> The map's keys are all the filenames of added and renamed files (I'm
> assuming that's what 'A' and 'R' are) from that side after any directory
> moves on the other side are applied. So, for example, if we add "foo/a"
> on side A and rename "foo/" to "bar/" on side B, then side A's collision
> map would have a key "bar/a". So I'm not sure if "collision" is the
> right name here, but the existing code already uses it so I'll leave it
> be.

Let's take your example a bit further, to discuss the original kind of
usecase that "collisions" was written for.  In addition to adding
"foo/a" on side A, we also add "foo2/a" and "foo3/a".  And then in
addition to renaming "foo/" to "bar/" on side B, we also rename
"foo2/" to "bar/" and "foo3/" to "bar/", thus merging all of foo/,
foo2/, and foo3/ into a single directory named bar/.  If the files in
foo/, foo2/, and foo3/ on side B were all unique, you can see how
there'd be no problem merging these directories together.  The problem
only comes at merge time when you attempt to apply the directory
renames from side B to the new files on side A.  That's when you get
collisions, in this case three files that would be named bar/a.

Checking for such collisions was the purpose of the "collisions"
metadata, so I think the name is fitting.  The only problem is that
we're reusing that data now for a slightly different purpose, though I
think it's still "collision-y".

> It makes sense that this situation (of side A having "bar/a" because
> side B renamed "foo/" to "bar/", and at the same time, side B adds its
> own "bar/a") is weird, as stated in the commit message, so I don't mind
> disabling checking for directory rename in this case.  However, in
> theory, I don't see how disabling this check would fix anything, since
> the bug seems to be about two different files on different sides being
> renamed to the same filename through some convoluted means. (Unless this
> is the only convoluted means to do it, and disabling it means that the
> bug wouldn't occur.)

Hmm...let me see if I can explain this a different way.

The short version of the issue here is that if a directory rename
wants to rename NEWFILE -> ALTERNATE_NEWFILE, but there is another
directory rename on the other side of history that wants to rename
ANOTHER_FILE -> NEWFILE, then we run into problems and have to turn
off the NEWFILE -> ALTERNATE_NEWFILE.  This check here is all about
this case.

To see why this is the problematic setup...

The primary data structure in merge-ort is opt->priv->paths, a strmap
which maps: (path involved in the merge) -> (conflict_info).
(Technically, it could have a merged_info instead of a conflict_info
if the file was trivially resolvable early on but since merged_info is
the first entry in a conflict_info, logically it can still be thought
of as a conflict_info just with less data.).  Now a conflict_info
stores information about the OIDs and modes for all three sides of the
merge (i.e. both sides of the merge and the base).  Whenever a rename
is processed, we have to update this map, because the rename makes the
conflict_info now apply to a different path.  In the simple cases, the
conflict_info from the source path needs to be merged with the
conflict_info for the target path, and the source path's conflict_info
needs to be marked as null (literally setting the .is_null field to
1).  rename/rename and such can make this a bit more complicated.

Normally, directory renames would actually be a simpler case for this
merging of conflict_info structs rather than a more complicated case.
There cannot be a directory rename if the directory exists on both
sides, so we don't need to worry about there already being a file on
the other side whose conflict_info we need to merge with the source
conflict_info.  So, the code just added an assertion that there wasn't
one.  The problem is, that _another_ directory rename for the other
side of history can move a file into the way of where our directory
rename wants to operate on.  Let's jump into the example testcase I
added to make this more concrete:

   #   Commit O: sub1/file,                 sub2/other
   #   Commit A: sub3/file,                 sub2/{other, new_add_add_file_1}
   #   Commit B: sub1/{file, newfile}, sub1/sub2/{other, new_add_add_file_2}
   #
   #   In words:
   #     A: sub1/ -> sub3/, add sub2/new_add_add_file_1
   #     B: sub2/ -> sub1/sub2, add sub1/newfile, add
sub1/sub2/new_add_add_file_2

Here, the sub2/sub1/sub2/ rename on sideB will rename A's
sub2/new_add_add_file to sub1/sub2/new_add_add_file, which is at the
same location as B's sub1/sub2/new_add_add_file (and which A's sub1/
-> sub3/ directory rename would normally operate on).

Given our opt->priv->paths data structure, if we wanted to let both
directory renames take effect, then the order would matter:

* If the sub1/ -> sub3/ directory rename is applied first, then:
  * B's sub1/sub2/new_add_add_file gets renamed to sub3/sub2/new_add_add_file
  * sub1/sub2/new_add_add_file is marked as .is_null=1
  * A's sub2/new_add_add_file gets renamed to
sub1/sub2/new_add_add_file (which was already marked as null)

This set of steps seems to trigger the "error: cache entry has null
sha1" fatal error I mentioned earlier.  In contrast, if we take the
other order:

* If the sub2/ -> sub1/sub2/ rename is applied first, then:
  * A's sub2/new_add_add_file gets renamed to sub1/sub2/new_add_add_file
  * the conflict_info for the two sub1/sub2/new_add_add_file's are now merged
  * the sub1/ -> sub3/ directory rename is applied to move this file
to sub3/sub2/new_add_add_file

This second order may not sound so bad.  And, in fact, you can get
this behavior simply by relaxing (or commenting out) the assertion
Glen reported hitting.  However, that results in making the merge have
a fatal error when you reverse its direction (i.e. when swapping HEAD
and MERGE_HEAD), and seems somewhat confusing to me given that A's
sub2/new_add_add_file was renamed twice, going against the general
"avoid-mutiply-transitive-renames" rule employed elsewhere in
directory rename detection.

To avoid this order dependence, and the weird multiple-renaming of a
single path, we just want to turn off directory renames when the
source of a directory rename on one side is the target of a directory
rename on the other.  That's what this patch does.


Does that help?  Or did I make it more confusing?
Jonathan Tan July 6, 2022, 5:25 p.m. UTC | #3
Sorry for getting back to this so late. I don't have any issues with the
patches, but just to close the loop:

Elijah Newren <newren@gmail.com> writes:
> On Mon, Jun 27, 2022 at 11:47 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > > The testcases added in t6423 a couple commits ago are slightly different
> > > but similar in principle.  They involve a similar case of paired
> > > renaming but instead of A/ -> B/ and B/ -> C/, the second side renames
> > > a leading directory of B/ to C/.
> >
> > Hmm...one side moved sub1 -> sub3 and the other moved sub2 from the root
> > to under sub1, right? So it's more like A/ -> B/ and C/ -> A/C/.
> 
> Hmm, maybe I should have been more explicit in my mapping:
>    A = sub2
>    B = sub1/sub2
>    leading directory of B = sub1
>    C = sub3

Substituting into A/ -> B/ and "a leading directory of B/ to C/", we
get:

  sub2 -> sub1/sub2 and sub1 -> sub3

which is indeed what is happening. I see, thanks.

> > > And both sides add a new file
> > > somewhere under the directory that the other side will rename.
> >
> > Rename or move, I think.
> 
> I'm confused; I don't understand what this comment means.  Renames
> tend to be created using "git mv", before or after making content
> changes, so to me a file being renamed or a file being moved to a
> different location are synonymous.  What distinction are you making
> between renames and moves?

I was thinking that a rename means that the directory still has the same
parent directory, whereas a move means that the directory keeps its
basename but has a different parent directory. Maybe it's just the way I
think about things, but the important thing here is that a directory was
moved so that its parent directory is a directory that is different and
that already exists, and I think that this meaning is lost when we say
"rename". But it might just be me.

> Hmm, perhaps I should change this to:
> 
> /* Cases where we don't have or don't want a directory rename for this
> path, so we return NULL */
> 
> The purpose of this function is to check whether a given path would be
> modified by directory renaming to get a new path.  So, "no new path"
> means we don't have an applicable directory rename or don't want to
> use it.

I see - the new text makes sense.

> > >       if (strmap_empty(dir_renames))
> > >               return new_path;
> > > +     if (strmap_get(&collisions[other_side], path))
> > > +             return new_path;
> >
> > So as far as I understand, collisions here, for a given side, is a map.
> > The map's keys are all the filenames of added and renamed files (I'm
> > assuming that's what 'A' and 'R' are) from that side after any directory
> > moves on the other side are applied. So, for example, if we add "foo/a"
> > on side A and rename "foo/" to "bar/" on side B, then side A's collision
> > map would have a key "bar/a". So I'm not sure if "collision" is the
> > right name here, but the existing code already uses it so I'll leave it
> > be.
> 
> Let's take your example a bit further, to discuss the original kind of
> usecase that "collisions" was written for.  In addition to adding
> "foo/a" on side A, we also add "foo2/a" and "foo3/a".  And then in
> addition to renaming "foo/" to "bar/" on side B, we also rename
> "foo2/" to "bar/" and "foo3/" to "bar/", thus merging all of foo/,
> foo2/, and foo3/ into a single directory named bar/.  If the files in
> foo/, foo2/, and foo3/ on side B were all unique, you can see how
> there'd be no problem merging these directories together.  The problem
> only comes at merge time when you attempt to apply the directory
> renames from side B to the new files on side A.  That's when you get
> collisions, in this case three files that would be named bar/a.
> 
> Checking for such collisions was the purpose of the "collisions"
> metadata, so I think the name is fitting.  The only problem is that
> we're reusing that data now for a slightly different purpose, though I
> think it's still "collision-y".

That makes sense, thanks.

> > It makes sense that this situation (of side A having "bar/a" because
> > side B renamed "foo/" to "bar/", and at the same time, side B adds its
> > own "bar/a") is weird, as stated in the commit message, so I don't mind
> > disabling checking for directory rename in this case.  However, in
> > theory, I don't see how disabling this check would fix anything, since
> > the bug seems to be about two different files on different sides being
> > renamed to the same filename through some convoluted means. (Unless this
> > is the only convoluted means to do it, and disabling it means that the
> > bug wouldn't occur.)

[snip]

> Hmm...let me see if I can explain this a different way.
> 
> The short version of the issue here is that if a directory rename
> wants to rename NEWFILE -> ALTERNATE_NEWFILE, but there is another
> directory rename on the other side of history that wants to rename
> ANOTHER_FILE -> NEWFILE, then we run into problems and have to turn
> off the NEWFILE -> ALTERNATE_NEWFILE.  This check here is all about
> this case.
> 
> To see why this is the problematic setup...
> 
> Now a conflict_info
> stores information about the OIDs and modes for all three sides of the
> merge (i.e. both sides of the merge and the base).  Whenever a rename
> is processed, we have to update this map, because the rename makes the
> conflict_info now apply to a different path.  In the simple cases, the
> conflict_info from the source path needs to be merged with the
> conflict_info for the target path, and the source path's conflict_info
> needs to be marked as null (literally setting the .is_null field to
> 1).  rename/rename and such can make this a bit more complicated.

Ah, I think I was missing this part. The intention is that processing
one side in this way (and thus modifying its conflict_info) would not
affect the processing of any other sides, but there is a bug that makes
it not so. (And the intention is not, say, when processing all sides,
to write the output in a new data structure so that the result is the
same no matter the order.)

So I think the answer to my question is: no, we do not want to be able
to rename two different files on different sides to the same filename
through any convoluted means, and if that happens, it's a bug.

[snip illustration of what happens in either merge order]

> To avoid this order dependence, and the weird multiple-renaming of a
> single path, we just want to turn off directory renames when the
> source of a directory rename on one side is the target of a directory
> rename on the other.  That's what this patch does.
> 
> 
> Does that help?  Or did I make it more confusing?

I think that helped, thanks.
diff mbox series

Patch

diff --git a/merge-ort.c b/merge-ort.c
index fa6667de18c..5bcb9a4980b 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2292,9 +2292,13 @@  static char *check_for_directory_rename(struct merge_options *opt,
 	struct strmap_entry *rename_info;
 	struct strmap_entry *otherinfo = NULL;
 	const char *new_dir;
+	int other_side = 3 - side_index;
 
+	/* Cases where there is no new path, so we return NULL */
 	if (strmap_empty(dir_renames))
 		return new_path;
+	if (strmap_get(&collisions[other_side], path))
+		return new_path;
 	rename_info = check_dir_renamed(path, dir_renames);
 	if (!rename_info)
 		return new_path;
diff --git a/t/t6423-merge-rename-directories.sh b/t/t6423-merge-rename-directories.sh
index 296c04f8046..4286ae987c4 100755
--- a/t/t6423-merge-rename-directories.sh
+++ b/t/t6423-merge-rename-directories.sh
@@ -5245,7 +5245,7 @@  test_setup_12l () {
 	)
 }
 
-test_expect_merge_algorithm failure failure '12l (B into A): Rename into each other + add/add conflict' '
+test_expect_merge_algorithm failure success '12l (B into A): Rename into each other + add/add conflict' '
 	test_setup_12l BintoA &&
 	(
 		cd 12l_BintoA &&
@@ -5273,7 +5273,7 @@  test_expect_merge_algorithm failure failure '12l (B into A): Rename into each ot
 	)
 '
 
-test_expect_merge_algorithm failure failure '12l (A into B): Rename into each other + add/add conflict' '
+test_expect_merge_algorithm failure success '12l (A into B): Rename into each other + add/add conflict' '
 	test_setup_12l AintoB &&
 	(
 		cd 12l_AintoB &&