diff mbox series

[v2,01/20] merge-ort: setup basic internal data structures

Message ID 20201102204344.342633-2-newren@gmail.com (mailing list archive)
State New, archived
Headers show
Series fundamentals of merge-ort implementation | expand

Commit Message

Elijah Newren Nov. 2, 2020, 8:43 p.m. UTC
Set up some basic internal data structures.  The only carry-over from
merge-recursive.c is call_depth, though needed_rename_limit will be
added later.

The central piece of data will definitely be the strmap "paths", which
will map every relevant pathname under consideration to either a
merged_info or a conflict_info.  ("unmerged" is a strmap that is a
subset of "paths".)

merged_info contains all relevant information for a non-conflicted
entry.  conflict_info contains a merged_info, plus any additional
information about a conflict such as the higher orders stages involved
and the names of the paths those came from (handy once renames get
involved).  If an entry remains conflicted, the merged_info portion of a
conflict_info will later be filled with whatever version of the file
should be placed in the working directory (e.g. an as-merged-as-possible
variation that contains conflict markers).

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

Comments

Jonathan Tan Nov. 6, 2020, 10:05 p.m. UTC | #1
I'm not very familiar with the merge machinery, but I'll attempt a
review of at least the first 10 patches.

> merged_info contains all relevant information for a non-conflicted
> entry.  conflict_info contains a merged_info, plus any additional
> information about a conflict such as the higher orders stages involved
> and the names of the paths those came from (handy once renames get
> involved).  If an entry remains conflicted, the merged_info portion of a
> conflict_info will later be filled with whatever version of the file
> should be placed in the working directory (e.g. an as-merged-as-possible
> variation that contains conflict markers).

I think that this information should be in the .c file.

> diff --git a/merge-ort.c b/merge-ort.c
> index b487901d3e..9d5ea0930d 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -17,6 +17,46 @@
>  #include "cache.h"
>  #include "merge-ort.h"
>  
> +#include "strmap.h"
> +
> +struct merge_options_internal {
> +	struct strmap paths;    /* maps path -> (merged|conflict)_info */
> +	struct strmap unmerged; /* maps path -> conflict_info */
> +	const char *current_dir_name;
> +	int call_depth;
> +};

Maybe document if the path is from the root of the directory or just the
filename as it appears in a tree object?

I would have expected "unmerged" to be a "strset", but I guess it's more
convenient for it to be a map itself. Maybe just document it as "all
mappings in paths wherein the value is a struct conflict_info".

There seems to be 2 ways of referring to something that we couldn't
merge - "conflicted" (or "having a conflict") and "unmerged". Should we
stick to one of them?

Also, looking ahead, I see that current_dir_name is used as a temporary
variable in the recursive calls to collect_merge_info_callback(). I
would prefer if current_dir_name went in the cbdata to that function
instead, but if that's not possible, maybe document here that
current_dir_name is only used in collect_merge_info_callback(), and
temporarily at that.

> +struct version_info {
> +	struct object_id oid;
> +	unsigned short mode;
> +};

OK.

> +struct merged_info {
> +	struct version_info result;
> +	unsigned is_null:1;
> +	unsigned clean:1;
> +	size_t basename_offset;
> +	 /*
> +	  * Containing directory name.  Note that we assume directory_name is
> +	  * constructed such that
> +	  *    strcmp(dir1_name, dir2_name) == 0 iff dir1_name == dir2_name,
> +	  * i.e. string equality is equivalent to pointer equality.  For this
> +	  * to hold, we have to be careful setting directory_name.
> +	  */
> +	const char *directory_name;
> +};

I'm not sure how most of the fields in this struct are to be used, but
perhaps that will be clearer once I read the subsequent code.

> +struct conflict_info {
> +	struct merged_info merged;
> +	struct version_info stages[3];
> +	const char *pathnames[3];

Why would these be different across stages? (Rename detection?)

> +	unsigned df_conflict:1;

OK.

> +	unsigned path_conflict:1;

This doesn't seem to be assigned anywhere in this patch set?

> +	unsigned filemask:3;
> +	unsigned dirmask:3;

I wonder if this needs to be documented that the least significant bit
corresponds to stages[0], and so forth.

> +	unsigned match_mask:3;

I think this can be derived by just looking at the stages array? Maybe
document as:

  Optimization to track which stages match. Either 0 or at least 2 bits
  are set; if at least 2 bits are set, their corresponding stages match.

> +};
Elijah Newren Nov. 6, 2020, 10:45 p.m. UTC | #2
On Fri, Nov 6, 2020 at 2:05 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> I'm not very familiar with the merge machinery, but I'll attempt a
> review of at least the first 10 patches.

Thanks for taking a look!

> > merged_info contains all relevant information for a non-conflicted
> > entry.  conflict_info contains a merged_info, plus any additional
> > information about a conflict such as the higher orders stages involved
> > and the names of the paths those came from (handy once renames get
> > involved).  If an entry remains conflicted, the merged_info portion of a
> > conflict_info will later be filled with whatever version of the file
> > should be placed in the working directory (e.g. an as-merged-as-possible
> > variation that contains conflict markers).
>
> I think that this information should be in the .c file.

Okay.

> > diff --git a/merge-ort.c b/merge-ort.c
> > index b487901d3e..9d5ea0930d 100644
> > --- a/merge-ort.c
> > +++ b/merge-ort.c
> > @@ -17,6 +17,46 @@
> >  #include "cache.h"
> >  #include "merge-ort.h"
> >
> > +#include "strmap.h"
> > +
> > +struct merge_options_internal {
> > +     struct strmap paths;    /* maps path -> (merged|conflict)_info */
> > +     struct strmap unmerged; /* maps path -> conflict_info */
> > +     const char *current_dir_name;
> > +     int call_depth;
> > +};
>
> Maybe document if the path is from the root of the directory or just the
> filename as it appears in a tree object?

Yeah, full relative path from toplevel.  I'll try to add some
documentation to that effect.

> I would have expected "unmerged" to be a "strset", but I guess it's more
> convenient for it to be a map itself. Maybe just document it as "all
> mappings in paths wherein the value is a struct conflict_info".

Makes sense.  And yeah, it's not a strset just because of the simple
optimization to avoid needing to do another lookup in paths afterwards
to get the actual conflict_info; the only time it is used is as a loop
over the still-unmerged entries to try to three-way merge them and
such.

> There seems to be 2 ways of referring to something that we couldn't
> merge - "conflicted" (or "having a conflict") and "unmerged". Should we
> stick to one of them?

Uhm...perhaps, but it feels like I'm going to miss some while looking
over it.  Also, there are some semantic differences at play.  Since
the whole algorithm is divided around multiple stages --
collect_merge_info(), detect_and_process_renames(), process_entries(),
as of a given early stage we might just know that we couldn't merge it
*yet*.  For example, both sides modified the entry, or one side
modified and the other side is missing ("did they delete it or rename
it?").  After rename detection and three-way content merge, something
that had not been automatically mergeable as of an earlier step might
end up being so.  But we need names for stuff in the interim state.
But it's also possible for us to know at an early state that thing are
definitely going to be a conflict regardless of what later stages do
(e.g. both sides rename a path, but rename it differently).

> Also, looking ahead, I see that current_dir_name is used as a temporary
> variable in the recursive calls to collect_merge_info_callback(). I
> would prefer if current_dir_name went in the cbdata to that function
> instead, but if that's not possible, maybe document here that
> current_dir_name is only used in collect_merge_info_callback(), and
> temporarily at that.

Yeah, not possible.  collect_merge_info_callback() has to be of
traverse_callback_t type (from tree-walk.h), which provides no extra
parameters for extra callback data.  I can add a documentation
comment.

> > +struct version_info {
> > +     struct object_id oid;
> > +     unsigned short mode;
> > +};
>
> OK.
>
> > +struct merged_info {
> > +     struct version_info result;
> > +     unsigned is_null:1;
> > +     unsigned clean:1;
> > +     size_t basename_offset;
> > +      /*
> > +       * Containing directory name.  Note that we assume directory_name is
> > +       * constructed such that
> > +       *    strcmp(dir1_name, dir2_name) == 0 iff dir1_name == dir2_name,
> > +       * i.e. string equality is equivalent to pointer equality.  For this
> > +       * to hold, we have to be careful setting directory_name.
> > +       */
> > +     const char *directory_name;
> > +};
>
> I'm not sure how most of the fields in this struct are to be used, but
> perhaps that will be clearer once I read the subsequent code.
>
> > +struct conflict_info {
> > +     struct merged_info merged;
> > +     struct version_info stages[3];
> > +     const char *pathnames[3];
>
> Why would these be different across stages? (Rename detection?)

Yes, as noted in the portion of the commit message that you said you
wanted in the .c file.

>
> > +     unsigned df_conflict:1;
>
> OK.
>
> > +     unsigned path_conflict:1;
>
> This doesn't seem to be assigned anywhere in this patch set?

Oh, right, I could drop it for now and add it back later when it is
used.  It's basically non-content conflict other than the specialized
D/F conflict.  So, things like rename/delete, moved by directory
rename, rename/rename(1to2), and rename/add/delete.  I could have
potentially lumped it in with df_conflict or made df_conflict a
subset, but df_conflict is different enough from the others that it
got a special flag.

But yeah, since it's all rename-related stuff and this patchset
doesn't have any rename handling yet, I probably should have left it
out until I added that stuff later.

> > +     unsigned filemask:3;
> > +     unsigned dirmask:3;
>
> I wonder if this needs to be documented that the least significant bit
> corresponds to stages[0], and so forth.

Maybe I should just put a comment to look at tree-walk.h?  The struct
traverse_info has a "fn" member with a big comment above it describing
mask & dirmask; filemask is just mask & ~dirmask.

> > +     unsigned match_mask:3;
>
> I think this can be derived by just looking at the stages array? Maybe
> document as:
>
>   Optimization to track which stages match. Either 0 or at least 2 bits
>   are set; if at least 2 bits are set, their corresponding stages match.

Yep, I only wanted to compute the match_mask once (I always got
annoyed in merge-recursive.c that we were re-comparing what had
already been compared and computed within unpack_trees()).  I like
your suggested comment; will add it.
Jonathan Tan Nov. 9, 2020, 8:55 p.m. UTC | #3
> > There seems to be 2 ways of referring to something that we couldn't
> > merge - "conflicted" (or "having a conflict") and "unmerged". Should we
> > stick to one of them?
> 
> Uhm...perhaps, but it feels like I'm going to miss some while looking
> over it.  Also, there are some semantic differences at play.  Since
> the whole algorithm is divided around multiple stages --
> collect_merge_info(), detect_and_process_renames(), process_entries(),
> as of a given early stage we might just know that we couldn't merge it
> *yet*.  For example, both sides modified the entry, or one side
> modified and the other side is missing ("did they delete it or rename
> it?").  After rename detection and three-way content merge, something
> that had not been automatically mergeable as of an earlier step might
> end up being so.  But we need names for stuff in the interim state.
> But it's also possible for us to know at an early state that thing are
> definitely going to be a conflict regardless of what later stages do
> (e.g. both sides rename a path, but rename it differently).

In that case, maybe "possibly conflicted" and "unconflicted"? Or maybe
someone else will come up with a better name.

> > Also, looking ahead, I see that current_dir_name is used as a temporary
> > variable in the recursive calls to collect_merge_info_callback(). I
> > would prefer if current_dir_name went in the cbdata to that function
> > instead, but if that's not possible, maybe document here that
> > current_dir_name is only used in collect_merge_info_callback(), and
> > temporarily at that.
> 
> Yeah, not possible.  collect_merge_info_callback() has to be of
> traverse_callback_t type (from tree-walk.h), which provides no extra
> parameters for extra callback data.  I can add a documentation
> comment.

But traverse_callback_t provides a data field, which you use to store a
struct merge_options in patch 6 - in theory, you could instead create
another struct that contains a pointer to struct merge_options and store
that in data instead. That does sound like it will make the code more
complicated, though, so maybe current_dir_name here is the way.

> > I wonder if this needs to be documented that the least significant bit
> > corresponds to stages[0], and so forth.
> 
> Maybe I should just put a comment to look at tree-walk.h?  The struct
> traverse_info has a "fn" member with a big comment above it describing
> mask & dirmask; filemask is just mask & ~dirmask.

That makes sense. I understood a lot more once I saw that it was
iterating over a variable number of trees - hence the arrays.
diff mbox series

Patch

diff --git a/merge-ort.c b/merge-ort.c
index b487901d3e..9d5ea0930d 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -17,6 +17,46 @@ 
 #include "cache.h"
 #include "merge-ort.h"
 
+#include "strmap.h"
+
+struct merge_options_internal {
+	struct strmap paths;    /* maps path -> (merged|conflict)_info */
+	struct strmap unmerged; /* maps path -> conflict_info */
+	const char *current_dir_name;
+	int call_depth;
+};
+
+struct version_info {
+	struct object_id oid;
+	unsigned short mode;
+};
+
+struct merged_info {
+	struct version_info result;
+	unsigned is_null:1;
+	unsigned clean:1;
+	size_t basename_offset;
+	 /*
+	  * Containing directory name.  Note that we assume directory_name is
+	  * constructed such that
+	  *    strcmp(dir1_name, dir2_name) == 0 iff dir1_name == dir2_name,
+	  * i.e. string equality is equivalent to pointer equality.  For this
+	  * to hold, we have to be careful setting directory_name.
+	  */
+	const char *directory_name;
+};
+
+struct conflict_info {
+	struct merged_info merged;
+	struct version_info stages[3];
+	const char *pathnames[3];
+	unsigned df_conflict:1;
+	unsigned path_conflict:1;
+	unsigned filemask:3;
+	unsigned dirmask:3;
+	unsigned match_mask:3;
+};
+
 void merge_switch_to_result(struct merge_options *opt,
 			    struct tree *head,
 			    struct merge_result *result,