diff mbox series

[6/6] fetch: avoid second connectivity check if we already have all objects

Message ID 646ac90e62aab4e4aec595d6848b60233bbe8c77.1629452412.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series Speed up mirror-fetches with many refs | expand

Commit Message

Patrick Steinhardt Aug. 20, 2021, 10:08 a.m. UTC
When fetching refs, we are doing two connectivity checks:

    - The first one in `fetch_refs()` is done such that we can
      short-circuit the case where we already have all objects
      referenced by the updated set of refs.

    - The second one in `store_updated_refs()` does a sanity check that
      we have all objects after we have fetched the packfile.

We always execute both connectivity checks, but this is wasteful in case
the first connectivity check already notices that we have all objects
locally available.

Refactor the code to do both connectivity checks in `fetch_refs()`,
which allows us to easily skip the second connectivity check if we
already have all objects available. This refactoring is safe to do given
that we always call `fetch_refs()` followed by `consume_refs()`, which
is the only caller of `store_updated_refs()`.

This gives us a nice speedup when doing a mirror-fetch in a repository
with about 2.3M refs where the fetching repo already has all objects:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     31.232 s ±  0.082 s    [User: 27.901 s, System: 5.178 s]
      Range (min … max):   31.118 s … 31.301 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     26.616 s ±  0.100 s    [User: 23.675 s, System: 4.752 s]
      Range (min … max):   26.544 s … 26.788 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.17 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

Comments

Derrick Stolee Aug. 20, 2021, 2:47 p.m. UTC | #1
On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> When fetching refs, we are doing two connectivity checks:
> 
>     - The first one in `fetch_refs()` is done such that we can
>       short-circuit the case where we already have all objects
>       referenced by the updated set of refs.
> 
>     - The second one in `store_updated_refs()` does a sanity check that
>       we have all objects after we have fetched the packfile.
> 
> We always execute both connectivity checks, but this is wasteful in case
> the first connectivity check already notices that we have all objects
> locally available.
> 
> Refactor the code to do both connectivity checks in `fetch_refs()`,
> which allows us to easily skip the second connectivity check if we
> already have all objects available. This refactoring is safe to do given
> that we always call `fetch_refs()` followed by `consume_refs()`, which
> is the only caller of `store_updated_refs()`.

Should we try to make it more clear that fetch_refs() must be followed
by consume_refs() via a comment above the fetch_refs(), or possibly even
its call sites?

Thanks,
-Stolee
Patrick Steinhardt Aug. 23, 2021, 6:52 a.m. UTC | #2
On Fri, Aug 20, 2021 at 10:47:11AM -0400, Derrick Stolee wrote:
> On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> > When fetching refs, we are doing two connectivity checks:
> > 
> >     - The first one in `fetch_refs()` is done such that we can
> >       short-circuit the case where we already have all objects
> >       referenced by the updated set of refs.
> > 
> >     - The second one in `store_updated_refs()` does a sanity check that
> >       we have all objects after we have fetched the packfile.
> > 
> > We always execute both connectivity checks, but this is wasteful in case
> > the first connectivity check already notices that we have all objects
> > locally available.
> > 
> > Refactor the code to do both connectivity checks in `fetch_refs()`,
> > which allows us to easily skip the second connectivity check if we
> > already have all objects available. This refactoring is safe to do given
> > that we always call `fetch_refs()` followed by `consume_refs()`, which
> > is the only caller of `store_updated_refs()`.
> 
> Should we try to make it more clear that fetch_refs() must be followed
> by consume_refs() via a comment above the fetch_refs(), or possibly even
> its call sites?

I wasn't quite happy with this outcome, either. How about we instead
merge both functions into `fetch_and_consume_refs()`? Both are quite
short, and that makes sure we always call them together to make the
requirement explicit.

I'll add another patch to do this refactoring.

Patrick
diff mbox series

Patch

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 20fcfe0f45..088a8af13b 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1068,7 +1068,7 @@  N_("It took %.2f seconds to check forced updates. You can use\n"
    " to avoid this check.\n");
 
 static int store_updated_refs(const char *raw_url, const char *remote_name,
-			      int connectivity_checked, struct ref *ref_map)
+			      struct ref *ref_map)
 {
 	struct fetch_head fetch_head;
 	struct commit *commit;
@@ -1090,16 +1090,6 @@  static int store_updated_refs(const char *raw_url, const char *remote_name,
 	else
 		url = xstrdup("foreign");
 
-	if (!connectivity_checked) {
-		struct check_connected_options opt = CHECK_CONNECTED_INIT;
-
-		rm = ref_map;
-		if (check_connected(iterate_ref_map, &rm, &opt)) {
-			rc = error(_("%s did not send all necessary objects\n"), url);
-			goto abort;
-		}
-	}
-
 	if (atomic_fetch) {
 		transaction = ref_transaction_begin(&err);
 		if (!transaction) {
@@ -1302,6 +1292,18 @@  static int fetch_refs(struct transport *transport, struct ref *ref_map)
 		return ret;
 	}
 
+	/*
+	 * If the transport didn't yet check for us, we need to verify
+	 * ourselves that we have obtained all missing objects now.
+	 */
+	if (!transport->smart_options || !transport->smart_options->connectivity_checked) {
+		if (check_connected(iterate_ref_map, &ref_map, NULL)) {
+			ret = error(_("remote did not send all necessary objects\n"));
+			transport_unlock_pack(transport);
+			return ret;
+		}
+	}
+
 	/*
 	 * Keep the new pack's ".keep" file around to allow the caller
 	 * time to update refs to reference the new objects.
@@ -1312,13 +1314,10 @@  static int fetch_refs(struct transport *transport, struct ref *ref_map)
 /* Update local refs based on the ref values fetched from a remote */
 static int consume_refs(struct transport *transport, struct ref *ref_map)
 {
-	int connectivity_checked = transport->smart_options
-		? transport->smart_options->connectivity_checked : 0;
 	int ret;
 	trace2_region_enter("fetch", "consume_refs", the_repository);
 	ret = store_updated_refs(transport->url,
 				 transport->remote->name,
-				 connectivity_checked,
 				 ref_map);
 	transport_unlock_pack(transport);
 	trace2_region_leave("fetch", "consume_refs", the_repository);