diff mbox series

[v2,5/5] merge-ort: add prefetching for content merges

Message ID 317bcc7f56cb718a8be625838576f33ce788c3ef.1623796907.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Series Optimization batch 13: partial clone optimizations for merge-ort

Commit Message

Elijah Newren June 15, 2021, 10:41 p.m. UTC
From: Elijah Newren <newren@gmail.com>

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)
introduced batching of fetching missing blobs, so that the diff
machinery would have one fetch subprocess grab N blobs instead of N
processes each grabbing 1.

However, the diff machinery is not the only thing in a merge that needs
to work on blobs.  The 3-way content merges need them as well.  Rather
than download all the blobs one at a time, prefetch all the blobs needed
for regular content merges.

This does not cover all possible paths in merge-ort that might need to
download blobs.  Others include:
  - The blob_unchanged() calls to avoid modify/delete conflicts (when
    blob renormalization results in an "unchanged" file)
  - Preliminary content merges needed for rename/add and
    rename/rename(2to1) style conflicts.  (Both of these types of
    conflicts can result in nested conflict markers from the need to do
    two levels of content merging; the first happens before our new
    prefetch_for_content_merges() function.)

The first of these wouldn't be an extreme amount of work to support, and
even the second could theoretically be supported in batching, but all of
these cases seem unusual to me, and this is a minor performance
optimization anyway; in the worst case we only get some of the fetches
batched and have a few additional one-off fetches.  So for now, just
handle the regular 3-way content merges in our prefetching.

For the testcase from the previous commit, the number of downloaded
objects remains at 63, but this drops the number of fetches needed from
32 down to 20, a sizeable reduction.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                    | 50 ++++++++++++++++++++++++++++++++++
 t/t6421-merge-partial-clone.sh |  2 +-
 2 files changed, 51 insertions(+), 1 deletion(-)

Comments

Junio C Hamano June 17, 2021, 5:04 a.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +		/* Ignore clean entries */
> +		if (ci->merged.clean)
> +			continue;
> +
> +		/* Ignore entries that don't need a content merge */
> +		if (ci->match_mask || ci->filemask < 6 ||
> +		    !S_ISREG(ci->stages[1].mode) ||
> +		    !S_ISREG(ci->stages[2].mode) ||
> +		    oideq(&ci->stages[1].oid, &ci->stages[2].oid))
> +			continue;
> +
> +		/* Also don't need content merge if base matches either side */
> +		if (ci->filemask == 7 &&
> +		    S_ISREG(ci->stages[0].mode) &&
> +		    (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
> +		     oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
> +			continue;

Even though this is unlikely to change, it is unsatisfactory that we
reproduce the knowledge on the situations when a merge will
trivially resolve and when it will need to go content level.

One obvious way to solve it would be to fold this logic into the
main code that actually merges a list of "ci"s by making it a two
pass process (the first pass does essentially the same as this new
function, the second pass does the tree-level merge where the above
says "continue", fills mmfiles with the loop below, and calls into
ll_merge() after the loop to merge), but the logic duplication is
not too big and it may not be worth such a code churn.

> +		for (i = 0; i < 3; i++) {
> +			unsigned side_mask = (1 << i);
> +			struct version_info *vi = &ci->stages[i];
> +
> +			if ((ci->filemask & side_mask) &&
> +			    S_ISREG(vi->mode) &&
> +			    oid_object_info_extended(opt->repo, &vi->oid, NULL,
> +						     OBJECT_INFO_FOR_PREFETCH))
> +				oid_array_append(&to_fetch, &vi->oid);
> +		}
> +	}
> +
> +	promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
> +	oid_array_clear(&to_fetch);
> +}
> +
Elijah Newren June 22, 2021, 8:02 a.m. UTC | #2
On Wed, Jun 16, 2021 at 10:04 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +             /* Ignore clean entries */
> > +             if (ci->merged.clean)
> > +                     continue;
> > +
> > +             /* Ignore entries that don't need a content merge */
> > +             if (ci->match_mask || ci->filemask < 6 ||
> > +                 !S_ISREG(ci->stages[1].mode) ||
> > +                 !S_ISREG(ci->stages[2].mode) ||
> > +                 oideq(&ci->stages[1].oid, &ci->stages[2].oid))
> > +                     continue;
> > +
> > +             /* Also don't need content merge if base matches either side */
> > +             if (ci->filemask == 7 &&
> > +                 S_ISREG(ci->stages[0].mode) &&
> > +                 (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
> > +                  oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
> > +                     continue;
>
> Even though this is unlikely to change, it is unsatisfactory that we
> reproduce the knowledge on the situations when a merge will
> trivially resolve and when it will need to go content level.

I agree, it's not the nicest.

> One obvious way to solve it would be to fold this logic into the
> main code that actually merges a list of "ci"s by making it a two
> pass process (the first pass does essentially the same as this new
> function, the second pass does the tree-level merge where the above
> says "continue", fills mmfiles with the loop below, and calls into
> ll_merge() after the loop to merge), but the logic duplication is
> not too big and it may not be worth such a code churn.

I'm worried even more about the resulting complexity than the code
churn.  The two-pass model, which I considered, would require special
casing so many of the branches of process_entry() that it feels like
it'd be increasing code complexity more than introducing a function
with a few duplicated checks.  process_entry() was already a function
that Stolee reported as coming across as pretty complex to him in
earlier rounds of review, but that seems to just be intrinsic based on
the number of special cases: handling anything from entries with D/F
conflicts, to different file types, to match_mask being precomputed,
to recursive vs. normal cases, to modify/delete, to normalization, to
added on one side, to deleted on both sides, to three-way content
merges.  The three-way content merges are just one of 9-ish different
branches, and are the only one that we're prefetching for.  It just
seems easier and cleaner overall to add these three checks to pick off
the cases that will end up going through the three-way content merges.
I've looked at it again a couple times over the past few days based on
your comment, but I still can't see a way to restructure it that feels
cleaner than what I've currently got.

Also, it may be worth noting here that if these checks fell out of
date with process_entry() in some manner, it still would not affect
the correctness of the code.  At worst, it'd only affect whether
enough or too many objects are prefetched.  If too many, then some
extra objects would be downloaded, and if too few, then we'd end up
fetching additional objects one-by-one on demand later.

So I'm going to agree with the not-worth-it portion of your final
sentence and leave this out of the next roll.

Patch

diff --git a/merge-ort.c b/merge-ort.c
index cfa751053b01..e3a5dfc7b312 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -29,6 +29,7 @@ 
 #include "entry.h"
 #include "ll-merge.h"
 #include "object-store.h"
+#include "promisor-remote.h"
 #include "revision.h"
 #include "strmap.h"
 #include "submodule.h"
@@ -3485,6 +3486,54 @@  static void process_entry(struct merge_options *opt,
 	record_entry_for_tree(dir_metadata, path, &ci->merged);
 }
 
+static void prefetch_for_content_merges(struct merge_options *opt,
+					struct string_list *plist)
+{
+	struct string_list_item *e;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
+
+	if (opt->repo != the_repository || !has_promisor_remote())
+		return;
+
+	for (e = &plist->items[plist->nr-1]; e >= plist->items; --e) {
+		/* char *path = e->string; */
+		struct conflict_info *ci = e->util;
+		int i;
+
+		/* Ignore clean entries */
+		if (ci->merged.clean)
+			continue;
+
+		/* Ignore entries that don't need a content merge */
+		if (ci->match_mask || ci->filemask < 6 ||
+		    !S_ISREG(ci->stages[1].mode) ||
+		    !S_ISREG(ci->stages[2].mode) ||
+		    oideq(&ci->stages[1].oid, &ci->stages[2].oid))
+			continue;
+
+		/* Also don't need content merge if base matches either side */
+		if (ci->filemask == 7 &&
+		    S_ISREG(ci->stages[0].mode) &&
+		    (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
+		     oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
+			continue;
+
+		for (i = 0; i < 3; i++) {
+			unsigned side_mask = (1 << i);
+			struct version_info *vi = &ci->stages[i];
+
+			if ((ci->filemask & side_mask) &&
+			    S_ISREG(vi->mode) &&
+			    oid_object_info_extended(opt->repo, &vi->oid, NULL,
+						     OBJECT_INFO_FOR_PREFETCH))
+				oid_array_append(&to_fetch, &vi->oid);
+		}
+	}
+
+	promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+	oid_array_clear(&to_fetch);
+}
+
 static void process_entries(struct merge_options *opt,
 			    struct object_id *result_oid)
 {
@@ -3531,6 +3580,7 @@  static void process_entries(struct merge_options *opt,
 	 * the way when it is time to process the file at the same path).
 	 */
 	trace2_region_enter("merge", "processing", opt->repo);
+	prefetch_for_content_merges(opt, &plist);
 	for (entry = &plist.items[plist.nr-1]; entry >= plist.items; --entry) {
 		char *path = entry->string;
 		/*
diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
index a011f8d27867..26964aa56256 100755
--- a/t/t6421-merge-partial-clone.sh
+++ b/t/t6421-merge-partial-clone.sh
@@ -396,7 +396,7 @@  test_expect_merge_algorithm failure success 'Objects downloaded when a directory
 #
 #   Summary: 4 fetches (1 for 6 objects, 1 for 8, 1 for 3, 1 for 2)
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded with lots of renames and modifications' '
+test_expect_merge_algorithm failure success 'Objects downloaded with lots of renames and modifications' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-many &&
 	(