diff mbox series

diff: batch fetching of missing blobs

Message ID 20190326220906.111879-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series diff: batch fetching of missing blobs | expand

Commit Message

Jonathan Tan March 26, 2019, 10:09 p.m. UTC
When running a command like "git show" or "git diff" in a partial clone,
batch all missing blobs to be fetched as one request.

This is similar to c0c578b33c ("unpack-trees: batch fetching of missing
blobs", 2017-12-08), but for another command.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Here's an improvement for those having partial clones.

I couldn't find a good place to place the test (a place that checks how

Comments

SZEDER Gábor March 27, 2019, 10:10 a.m. UTC | #1
On Tue, Mar 26, 2019 at 03:09:06PM -0700, Jonathan Tan wrote:
> diff --git a/t/t4067-diff-partial-clone.sh b/t/t4067-diff-partial-clone.sh
> new file mode 100755
> index 0000000000..38f03be114
> --- /dev/null
> +++ b/t/t4067-diff-partial-clone.sh
> @@ -0,0 +1,76 @@
> +#!/bin/sh
> +
> +test_description='behavior of diff when reading objects in a partial clone'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'git show batches blobs' '
> +	test_create_repo server &&
> +	echo a >server/a &&
> +	echo b >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	# Ensure that there is exactly 1 negotiation by checking that there is
> +	# only 1 "done" line sent. ("done" marks the end of negotiation.)
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client show HEAD &&
> +	grep "git> done" trace >done_lines &&
> +	test_line_count = 1 done_lines
> +'
> +
> +test_expect_success 'diff batches blobs' '
> +	rm -rf server client trace &&

Please use 'test_when_finished' in the previous test to clean up after
itself.

> +
> +	test_create_repo server &&
> +	echo a >server/a &&
> +	echo b >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&
> +	echo c >server/c &&
> +	echo d >server/d &&
> +	git -C server add c d &&
> +	git -C server commit -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	# Ensure that there is exactly 1 negotiation by checking that there is
> +	# only 1 "done" line sent. ("done" marks the end of negotiation.)
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD &&
> +	grep "git> done" trace >done_lines &&
> +	test_line_count = 1 done_lines
> +'
> +
> +test_expect_success 'diff skips same-OID blobs' '
> +	rm -rf server client trace &&

Likewise.

> +	test_create_repo server &&
> +	echo a >server/a &&
> +	echo b >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&
> +	echo another-a >server/a &&
> +	git -C server add a &&
> +	git -C server commit -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	echo a | git hash-object --stdin >hash-old-a &&
> +	echo another-a | git hash-object --stdin >hash-new-a &&
> +	echo b | git hash-object --stdin >hash-b &&
> +
> +	# Ensure that only a and another-a are fetched.
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD &&
> +	grep "want $(cat hash-old-a)" trace &&
> +	grep "want $(cat hash-new-a)" trace &&
> +	! grep "want $(cat hash-b)" trace
> +'
> +
> +test_done
> -- 
> 2.21.0.155.ge902e9bcae.dirty
>
Johannes Schindelin March 27, 2019, 10:02 p.m. UTC | #2
Hi Jonathan,

On Tue, 26 Mar 2019, Jonathan Tan wrote:

> When running a command like "git show" or "git diff" in a partial clone,
> batch all missing blobs to be fetched as one request.
>
> This is similar to c0c578b33c ("unpack-trees: batch fetching of missing
> blobs", 2017-12-08), but for another command.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Here's an improvement for those having partial clones.
>
> I couldn't find a good place to place the test (a place that checks how
> diff interfaces with the object store would be ideal), so I created a
> new one. Let me know if there's a better place to put it.

My 2c: that's fine.

> diff --git a/diff.c b/diff.c
> index ec5c095199..0e08d05b14 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -25,6 +25,7 @@
>  #include "packfile.h"
>  #include "parse-options.h"
>  #include "help.h"
> +#include "fetch-object.h"
>
>  #ifdef NO_FAST_WORKING_DIRECTORY
>  #define FAST_WORKING_DIRECTORY 0
> @@ -6067,6 +6068,32 @@ static void diff_flush_patch_all_file_pairs(struct diff_options *o)
>  	if (o->color_moved)
>  		o->emitted_symbols = &esm;
>
> +	if (repository_format_partial_clone) {
> +		/*
> +		 * Prefetch the diff pairs that are about to be flushed.
> +		 */
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +		int fetch_if_missing_store = fetch_if_missing;
> +
> +		fetch_if_missing = 0;
> +		for (i = 0; i < q->nr; i++) {
> +			struct diff_filepair *p = q->queue[i];
> +			if (!check_pair_status(p))
> +				continue;
> +			if (p->one && p->one->oid_valid &&
> +			    !has_object_file(&p->one->oid))
> +				oid_array_append(&to_fetch, &p->one->oid);
> +			if (p->two && p->two->oid_valid &&
> +			    !has_object_file(&p->two->oid))
> +				oid_array_append(&to_fetch, &p->two->oid);
> +		}
> +		if (to_fetch.nr)
> +			fetch_objects(repository_format_partial_clone,
> +				      to_fetch.oid, to_fetch.nr);
> +		fetch_if_missing = fetch_if_missing_store;

I am slightly uneasy about the fact that this is totally *not*
multi-thread safe. If only we had a way to pass the `fetch_if_missing`
information as a parameter to `has_object_file()`...

Do you think that would be easy to do? If so, I think now would be as fine
an opportunity as ever to introduce that.

Otherwise the patch looks just fine to me.

Thanks,
Dscho

> +		oid_array_clear(&to_fetch);
> +	}
> +
>  	for (i = 0; i < q->nr; i++) {
>  		struct diff_filepair *p = q->queue[i];
>  		if (check_pair_status(p))
> diff --git a/t/t4067-diff-partial-clone.sh b/t/t4067-diff-partial-clone.sh
> new file mode 100755
> index 0000000000..38f03be114
> --- /dev/null
> +++ b/t/t4067-diff-partial-clone.sh
> @@ -0,0 +1,76 @@
> +#!/bin/sh
> +
> +test_description='behavior of diff when reading objects in a partial clone'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'git show batches blobs' '
> +	test_create_repo server &&
> +	echo a >server/a &&
> +	echo b >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	# Ensure that there is exactly 1 negotiation by checking that there is
> +	# only 1 "done" line sent. ("done" marks the end of negotiation.)
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client show HEAD &&
> +	grep "git> done" trace >done_lines &&
> +	test_line_count = 1 done_lines
> +'
> +
> +test_expect_success 'diff batches blobs' '
> +	rm -rf server client trace &&
> +
> +	test_create_repo server &&
> +	echo a >server/a &&
> +	echo b >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&
> +	echo c >server/c &&
> +	echo d >server/d &&
> +	git -C server add c d &&
> +	git -C server commit -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	# Ensure that there is exactly 1 negotiation by checking that there is
> +	# only 1 "done" line sent. ("done" marks the end of negotiation.)
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD &&
> +	grep "git> done" trace >done_lines &&
> +	test_line_count = 1 done_lines
> +'
> +
> +test_expect_success 'diff skips same-OID blobs' '
> +	rm -rf server client trace &&
> +
> +	test_create_repo server &&
> +	echo a >server/a &&
> +	echo b >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&
> +	echo another-a >server/a &&
> +	git -C server add a &&
> +	git -C server commit -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	echo a | git hash-object --stdin >hash-old-a &&
> +	echo another-a | git hash-object --stdin >hash-new-a &&
> +	echo b | git hash-object --stdin >hash-b &&
> +
> +	# Ensure that only a and another-a are fetched.
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD &&
> +	grep "want $(cat hash-old-a)" trace &&
> +	grep "want $(cat hash-new-a)" trace &&
> +	! grep "want $(cat hash-b)" trace
> +'
> +
> +test_done
> --
> 2.21.0.155.ge902e9bcae.dirty
>
>
Jeff King March 28, 2019, 6:52 a.m. UTC | #3
On Tue, Mar 26, 2019 at 03:09:06PM -0700, Jonathan Tan wrote:

> When running a command like "git show" or "git diff" in a partial clone,
> batch all missing blobs to be fetched as one request.
> 
> This is similar to c0c578b33c ("unpack-trees: batch fetching of missing
> blobs", 2017-12-08), but for another command.

Sounds like a good idea, and this should make some cases much better
without making any cases worse. Two observations about how we might do
even better, though:

> @@ -6067,6 +6068,32 @@ static void diff_flush_patch_all_file_pairs(struct diff_options *o)
> [...]

At this stage we're looking at actually diffing the contents themselves.
But we'd also potentially need the blobs during rename and break
detection. It's not always the same set of blobs (e.g., unless you've
cranked up the copy-detection flags, renames only look at added/deleted
entries). We could have each phase do its own bulk fetch, which
worst-case gives us probably three fetches. But I wonder if we could
figure out a complete plausible set immediately after the tree diff.

> +	if (repository_format_partial_clone) {
> +		/*
> +		 * Prefetch the diff pairs that are about to be flushed.
> +		 */
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +		int fetch_if_missing_store = fetch_if_missing;
> +
> +		fetch_if_missing = 0;
> +		for (i = 0; i < q->nr; i++) {
> +			struct diff_filepair *p = q->queue[i];
> +			if (!check_pair_status(p))
> +				continue;
> +			if (p->one && p->one->oid_valid &&
> +			    !has_object_file(&p->one->oid))
> +				oid_array_append(&to_fetch, &p->one->oid);
> +			if (p->two && p->two->oid_valid &&
> +			    !has_object_file(&p->two->oid))
> +				oid_array_append(&to_fetch, &p->two->oid);
> +		}

These has_object_file() calls may repeatedly re-scan the pack directory,
once per call.  Since it's likely that some may be missing, that may be
a noticeable amount of wasted work for a big diff (still way less than
individually fetching each missing object, so it's a net win, but read
on).

If you use the QUICK flag, that avoids the re-scans, but we may miss
erroneously say "we don't have it" if we race with a repack. For that,
we can either:

  1. Just ignore it. It's relatively rare, and the worst case is that we
     re-fetch an object.

  2. Do a series of QUICK checks, followed by a single
     reprepare_packed_git() if we had any missing, and then another
     series of QUICK checks. Then worst-case we have a single re-scan.

Something like:

  int object_is_missing_cb(const struct object_id *oid, void *data)
  {
	return !has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
  }
  ...

  /* collect all of the possible blobs we need */
  for (i = 0; i < q->nr; i++) {
	...
	oid_array_append(&to_fetch, &p->one->oid);
	oid_array_append(&to_fetch, &p->two->oid);
  }

  /* drop any we already have */
  oid_array_filter(&to_fetch, object_is_missing_cb, NULL);

  /* any missing ones might actually be a race; try again */
  if (to_fetch.nr) {
	  reprepare_packed_git(the_repository);
	  oid_array_filter(&to_fetch, object_is_missing_cb, NULL);
  }

  /* and now anything we have left is definitely not here */
  if (to_fetch.nr)
	fetch_objects(..., to_fetch.oid, to_fetch.nr).

One thing I noticed while writing this: we don't seem to do any
de-duplication of the list (in yours or mine), and it doesn't look like
fetch_objects() does either. I wonder if an oidset would be a better
data structure.

-Peff
diff mbox series

Patch

diff interfaces with the object store would be ideal), so I created a
new one. Let me know if there's a better place to put it.
---
 diff.c                        | 27 +++++++++++++
 t/t4067-diff-partial-clone.sh | 76 +++++++++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+)
 create mode 100755 t/t4067-diff-partial-clone.sh

diff --git a/diff.c b/diff.c
index ec5c095199..0e08d05b14 100644
--- a/diff.c
+++ b/diff.c
@@ -25,6 +25,7 @@ 
 #include "packfile.h"
 #include "parse-options.h"
 #include "help.h"
+#include "fetch-object.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -6067,6 +6068,32 @@  static void diff_flush_patch_all_file_pairs(struct diff_options *o)
 	if (o->color_moved)
 		o->emitted_symbols = &esm;
 
+	if (repository_format_partial_clone) {
+		/*
+		 * Prefetch the diff pairs that are about to be flushed.
+		 */
+		struct oid_array to_fetch = OID_ARRAY_INIT;
+		int fetch_if_missing_store = fetch_if_missing;
+
+		fetch_if_missing = 0;
+		for (i = 0; i < q->nr; i++) {
+			struct diff_filepair *p = q->queue[i];
+			if (!check_pair_status(p))
+				continue;
+			if (p->one && p->one->oid_valid &&
+			    !has_object_file(&p->one->oid))
+				oid_array_append(&to_fetch, &p->one->oid);
+			if (p->two && p->two->oid_valid &&
+			    !has_object_file(&p->two->oid))
+				oid_array_append(&to_fetch, &p->two->oid);
+		}
+		if (to_fetch.nr)
+			fetch_objects(repository_format_partial_clone,
+				      to_fetch.oid, to_fetch.nr);
+		fetch_if_missing = fetch_if_missing_store;
+		oid_array_clear(&to_fetch);
+	}
+
 	for (i = 0; i < q->nr; i++) {
 		struct diff_filepair *p = q->queue[i];
 		if (check_pair_status(p))
diff --git a/t/t4067-diff-partial-clone.sh b/t/t4067-diff-partial-clone.sh
new file mode 100755
index 0000000000..38f03be114
--- /dev/null
+++ b/t/t4067-diff-partial-clone.sh
@@ -0,0 +1,76 @@ 
+#!/bin/sh
+
+test_description='behavior of diff when reading objects in a partial clone'
+
+. ./test-lib.sh
+
+test_expect_success 'git show batches blobs' '
+	test_create_repo server &&
+	echo a >server/a &&
+	echo b >server/b &&
+	git -C server add a b &&
+	git -C server commit -m x &&
+
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
+
+	# Ensure that there is exactly 1 negotiation by checking that there is
+	# only 1 "done" line sent. ("done" marks the end of negotiation.)
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client show HEAD &&
+	grep "git> done" trace >done_lines &&
+	test_line_count = 1 done_lines
+'
+
+test_expect_success 'diff batches blobs' '
+	rm -rf server client trace &&
+
+	test_create_repo server &&
+	echo a >server/a &&
+	echo b >server/b &&
+	git -C server add a b &&
+	git -C server commit -m x &&
+	echo c >server/c &&
+	echo d >server/d &&
+	git -C server add c d &&
+	git -C server commit -m x &&
+
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
+
+	# Ensure that there is exactly 1 negotiation by checking that there is
+	# only 1 "done" line sent. ("done" marks the end of negotiation.)
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD &&
+	grep "git> done" trace >done_lines &&
+	test_line_count = 1 done_lines
+'
+
+test_expect_success 'diff skips same-OID blobs' '
+	rm -rf server client trace &&
+
+	test_create_repo server &&
+	echo a >server/a &&
+	echo b >server/b &&
+	git -C server add a b &&
+	git -C server commit -m x &&
+	echo another-a >server/a &&
+	git -C server add a &&
+	git -C server commit -m x &&
+
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
+
+	echo a | git hash-object --stdin >hash-old-a &&
+	echo another-a | git hash-object --stdin >hash-new-a &&
+	echo b | git hash-object --stdin >hash-b &&
+
+	# Ensure that only a and another-a are fetched.
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD &&
+	grep "want $(cat hash-old-a)" trace &&
+	grep "want $(cat hash-new-a)" trace &&
+	! grep "want $(cat hash-b)" trace
+'
+
+test_done