mbox series

[00/20] fundamentals of merge-ort implementation

Message ID pull.923.git.git.1606635803.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series fundamentals of merge-ort implementation | expand

Message

Philippe Blain via GitGitGadget Nov. 29, 2020, 7:43 a.m. UTC
This is actually v3 of this series; but v2 depended on two topics that
hadn't graduated yet so I couldn't easily use gitgitgadget and get its
testing. Now that the topics have graduated, I have rebased on master. You
can see v2 and comments on it over here: 
https://lore.kernel.org/git/20201102204344.342633-1-newren@gmail.com/

The goal of this series is to show the new design and structure behind
merge-ort, particularly the bits that are completely different to how
merge-recursive operates. There are still multiple important codepaths that
die with a "Not yet implemented" message, so the new merge algorithm is
still not very usable. However, it can handle very trivial rebases or
cherry-picks at the end of the series, and the number of test failures when
run under GIT_TEST_MERGE_ALGORITHM=ort drops from 2281 down to 1453.

At a high level, merge-ort avoids unpack_trees() and the index, instead
using traverse_trees() and its own data structure. After it is done
processing each path, it writes a tree. Only after it has created a new tree
will it touch the working copy or the index. It does so by using a simple
checkout-like step to switch from head to the newly created tree. If there
are conflicted entries, it touches up the index after the checkout-like step
to record those higher order stages.

In the series:

 * Patch 1 adds some basic data structures.
 * Patch 2 documents the high-level steps.
 * Patches 3-5 are some simple setup.
 * Patches 6-10 collect data from the traverse_trees() operation.
 * Patches 11-15 process the individual paths and create a tree.
 * Patches 16-19 handle checkout-and-then-write-higher-order-stages.
 * Patch 20 frees data from the merge_options_internal data structure

Changes since v2 (Thanks to Stolee and Jonathan Tan for their excellent and
detailed reviews!):

 * Add thorough code comments to data structures, to try to answer multiple
   questions the reviewers brought up.
 * Add comments in various areas of the code that were a bit tricky.
 * As per Stolee's request/suggestion, restructured accesses to
   conflict_info* to make the code document that we
 * As per Jonathan's suggestion, restructured tree writing to have
   "END_TREE" markers for directories -- namely, the directory itself.
 * Removed path_conflict field (will add it back in a later series when it
   is actually used)
 * Improved requested commit messages
 * Settled on "conflicted" instead of mix of "conflicted" and "unmerged"
 * Fixed various typos, missing words, etc.
 * Fixed various spaces around operators, missing blank lines, etc.
 * Some other small tweaks I probably forgot or overlooked while typing up
   this summary.

Things for reviewers to concentrate on:

 * Patch 1: I added lots of comments describing the data structures, based
   on reviewer questions/comments. Are they clear?
 * Patch 9: Stolee was worried about the allocation of merged_info OR
   conflict_info and whether we were safely accessing the fields. Since this
   is our primary and largest data structure and many times most entries
   only need a smaller merged_info, I really do want the space savings of
   not always allocating the larger type. I typically access as a
   merged_info* now, and added some accessor macros to document why each
   access as a conflict_info* is okay. Does this seem like a good solution
   to the concern? 
 * Patch 15: Jonathan felt the previous version of this patch was hard to
   follow. I've restructured so that we process the directories in
   opt->priv->paths directly; you can kind of view them as non-synthetic
   end-of-tree markers. They may not stand out as such, though (since they
   aren't synthetic with special names or handling), so I've added some
   pretty big comments to explain things. Does this address concerns?
 * Patches 16-20: these have not yet been reviewed, though these are easier
   patches to review than many of the first 15! A quick guide: * Patches 16,
      18, and 20 are very straightforward; patches 17 and 19 are the ones
      that would benefit more from review.
    * Patch 17 is
      basically the twoway_merge subset of merge_working_tree() from
      builtin/checkout.c. Find that bit of code and it's a direct
      comparison.
    * Patch 19
      amounts to "how do I remove stage 0 entries in the index and replace
      them with 1-3 higher order stages?".

Elijah Newren (20):
  merge-ort: setup basic internal data structures
  merge-ort: add some high-level algorithm structure
  merge-ort: port merge_start() from merge-recursive
  merge-ort: use histogram diff
  merge-ort: add an err() function similar to one from merge-recursive
  merge-ort: implement a very basic collect_merge_info()
  merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  merge-ort: compute a few more useful fields for collect_merge_info
  merge-ort: record stage and auxiliary info for every path
  merge-ort: avoid recursing into identical trees
  merge-ort: add a preliminary simple process_entries() implementation
  merge-ort: have process_entries operate in a defined order
  merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  merge-ort: step 2 of tree writing -- function to create tree object
  merge-ort: step 3 of tree writing -- handling subdirectories as we go
  merge-ort: basic outline for merge_switch_to_result()
  merge-ort: add implementation of checkout()
  tree: enable cmp_cache_name_compare() to be used elsewhere
  merge-ort: add implementation of record_conflicted_index_entries()
  merge-ort: free data structures in merge_finalize()

 merge-ort.c | 1207 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 tree.c      |    2 +-
 tree.h      |    2 +
 3 files changed, 1207 insertions(+), 4 deletions(-)


base-commit: e67fbf927dfdf13d0b21dc6ea15dc3c7ef448ea0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-923%2Fnewren%2Fort-basics-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-923/newren/ort-basics-v1
Pull-Request: https://github.com/git/git/pull/923

Comments

Elijah Newren Nov. 29, 2020, 7:47 a.m. UTC | #1
On Sat, Nov 28, 2020 at 11:43 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is actually v3 of this series; but v2 depended on two topics that
> hadn't graduated yet so I couldn't easily use gitgitgadget and get its
> testing. Now that the topics have graduated, I have rebased on master. You
> can see v2 and comments on it over here:
> https://lore.kernel.org/git/20201102204344.342633-1-newren@gmail.com/

Oops, I forgot to CC my two great reviewers of previous rounds.
Fixing that now...
Philippe Blain via GitGitGadget Dec. 4, 2020, 8:47 p.m. UTC | #2
This is actually v4 of this series (the first two rounds depended on topics
that hadn't graduated yet, so I hadn't yet used gitgitgadget for submitting
it). As a reminder, if you need to see the first two rounds before I started
submitting this series with gitgitgadget, you can see them over here: 
https://lore.kernel.org/git/20201102204344.342633-1-newren@gmail.com/

Changes since v3:

 * Made the small tweaks suggested by Ævar
 * Fixed an embarrassing tree ordering bug in commit 13; base_name_compare()
   != strcmp() is important.

(Tree ordering bug found due to the fact that merge-ort, including many
patches not yet submitted to this list, is in live use at $DAYJOB.)

Elijah Newren (20):
  merge-ort: setup basic internal data structures
  merge-ort: add some high-level algorithm structure
  merge-ort: port merge_start() from merge-recursive
  merge-ort: use histogram diff
  merge-ort: add an err() function similar to one from merge-recursive
  merge-ort: implement a very basic collect_merge_info()
  merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  merge-ort: compute a few more useful fields for collect_merge_info
  merge-ort: record stage and auxiliary info for every path
  merge-ort: avoid recursing into identical trees
  merge-ort: add a preliminary simple process_entries() implementation
  merge-ort: have process_entries operate in a defined order
  merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  merge-ort: step 2 of tree writing -- function to create tree object
  merge-ort: step 3 of tree writing -- handling subdirectories as we go
  merge-ort: basic outline for merge_switch_to_result()
  merge-ort: add implementation of checkout()
  tree: enable cmp_cache_name_compare() to be used elsewhere
  merge-ort: add implementation of record_conflicted_index_entries()
  merge-ort: free data structures in merge_finalize()

 merge-ort.c | 1221 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 tree.c      |    2 +-
 tree.h      |    2 +
 3 files changed, 1221 insertions(+), 4 deletions(-)


base-commit: e67fbf927dfdf13d0b21dc6ea15dc3c7ef448ea0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-923%2Fnewren%2Fort-basics-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-923/newren/ort-basics-v2
Pull-Request: https://github.com/git/git/pull/923

Range-diff vs v1:

  1:  2568ec92c6 =  1:  2568ec92c6 merge-ort: setup basic internal data structures
  2:  3a063865c3 !  2:  b658536f59 merge-ort: add some high-level algorithm structure
     @@ merge-ort.c: struct conflict_info {
      +			      struct tree *side1,
      +			      struct tree *side2)
      +{
     ++	/* TODO: Implement this using traverse_trees() */
      +	die("Not yet implemented.");
      +}
      +
  3:  5615f0eecb =  3:  acb40f5c16 merge-ort: port merge_start() from merge-recursive
  4:  564b072ac1 =  4:  22fecf6ccd merge-ort: use histogram diff
  5:  91516799e4 !  5:  6c4c0c15b3 merge-ort: add an err() function similar to one from merge-recursive
     @@ merge-ort.c: struct conflict_info {
       			      struct tree *side1,
       			      struct tree *side2)
       {
     -+	/* TODO: Implement this using traverse_trees() */
     +-	/* TODO: Implement this using traverse_trees() */
       	die("Not yet implemented.");
       }
       
     @@ merge-ort.c: static void merge_ort_nonrecursive_internal(struct merge_options *o
       
      -	collect_merge_info(opt, merge_base, side1, side2);
      +	if (collect_merge_info(opt, merge_base, side1, side2) != 0) {
     ++		/*
     ++		 * TRANSLATORS: The %s arguments are: 1) tree hash of a merge
     ++		 * base, and 2-3) the trees for the two trees we're merging.
     ++		 */
      +		err(opt, _("collecting merge info failed for trees %s, %s, %s"),
      +		    oid_to_hex(&merge_base->object.oid),
      +		    oid_to_hex(&side1->object.oid),
  6:  ab743967aa !  6:  27268ef8a3 merge-ort: implement a very basic collect_merge_info()
     @@ merge-ort.c: static int err(struct merge_options *opt, const char *err, ...)
       			      struct tree *side1,
       			      struct tree *side2)
       {
     --	/* TODO: Implement this using traverse_trees() */
      -	die("Not yet implemented.");
      +	int ret;
      +	struct tree_desc t[3];
  7:  bff758c5dd =  7:  c6e5621c21 merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  8:  61b3d66fdc =  8:  93fd69fa3c merge-ort: compute a few more useful fields for collect_merge_info
  9:  4e4298fa70 =  9:  decff4b375 merge-ort: record stage and auxiliary info for every path
 10:  3ec087eb68 = 10:  86c661fe1e merge-ort: avoid recursing into identical trees
 11:  0c89cee34e = 11:  aa3b13ffd8 merge-ort: add a preliminary simple process_entries() implementation
 12:  605cbc19d2 = 12:  b54306fd0e merge-ort: have process_entries operate in a defined order
 13:  242c3cab13 = 13:  8ee8561d7a merge-ort: step 1 of tree writing -- record basenames, modes, and oids
 14:  33a5d23c85 ! 14:  6ff56824c3 merge-ort: step 2 of tree writing -- function to create tree object
     @@ merge-ort.c: struct directory_versions {
       	struct string_list versions;
       };
       
     ++static int tree_entry_order(const void *a_, const void *b_)
     ++{
     ++	const struct string_list_item *a = a_;
     ++	const struct string_list_item *b = b_;
     ++
     ++	const struct merged_info *ami = a->util;
     ++	const struct merged_info *bmi = b->util;
     ++	return base_name_compare(a->string, strlen(a->string), ami->result.mode,
     ++				 b->string, strlen(b->string), bmi->result.mode);
     ++}
     ++
      +static void write_tree(struct object_id *result_oid,
      +		       struct string_list *versions,
      +		       unsigned int offset,
     @@ merge-ort.c: struct directory_versions {
      +	 */
      +	relevant_entries.items = versions->items + offset;
      +	relevant_entries.nr = versions->nr - offset;
     -+	string_list_sort(&relevant_entries);
     ++	QSORT(relevant_entries.items, relevant_entries.nr, tree_entry_order);
      +
      +	/* Pre-allocate some space in buf */
      +	extra = hash_size + 8; /* 8: 6 for mode, 1 for space, 1 for NUL char */
 15:  29615c366f ! 15:  da4fe90049 merge-ort: step 3 of tree writing -- handling subdirectories as we go
     @@ merge-ort.c: static int string_list_df_name_compare(const char *one, const char
      +	unsigned last_directory_len;
       };
       
     - static void write_tree(struct object_id *result_oid,
     + static int tree_entry_order(const void *a_, const void *b_)
      @@ merge-ort.c: static void record_entry_for_tree(struct directory_versions *dir_metadata,
       			   basename)->util = &mi->result;
       }
 16:  da54fa454a = 16:  8e90d211c5 merge-ort: basic outline for merge_switch_to_result()
 17:  68307f1b67 = 17:  61fada146c merge-ort: add implementation of checkout()
 18:  a3cd563621 = 18:  f5a13a0b08 tree: enable cmp_cache_name_compare() to be used elsewhere
 19:  56b162c609 ! 19:  4efac38116 merge-ort: add implementation of record_conflicted_index_entries()
     @@ merge-ort.c: static int record_conflicted_index_entries(struct merge_options *op
      +		pos = index_name_pos(index, path, strlen(path));
      +		SWAP(index->cache_nr, original_cache_nr);
      +		if (pos < 0) {
     -+			if (ci->filemask == 1)
     -+				cache_tree_invalidate_path(index, path);
     -+			else
     ++			if (ci->filemask != 1)
      +				BUG("Conflicted %s but nothing in basic working tree or index; this shouldn't happen", path);
     ++			cache_tree_invalidate_path(index, path);
      +		} else {
      +			ce = index->cache[pos];
      +
 20:  a4f722a46e = 20:  fbeb527d67 merge-ort: free data structures in merge_finalize()