diff mbox series

unpack-trees: allow missing CE_SKIP_WORKTREE objs

Message ID 20181008214816.42856-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series unpack-trees: allow missing CE_SKIP_WORKTREE objs | expand

Commit Message

Jonathan Tan Oct. 8, 2018, 9:48 p.m. UTC
Whenever a sparse checkout occurs, the existence of all blobs in the
index is verified, whether or not they are included or excluded by the
.git/info/sparse-checkout specification. This degrades performance,
significantly in the case of a partial clone, because a lazy fetch
occurs whenever the existence of a missing blob is checked.

At the point of invoking cache_tree_update() in unpack_trees(),
CE_SKIP_WORKTREE is already set on all excluded blobs
(mark_new_skip_worktree() is called twice to set CE_NEW_SKIP_WORKTREE,
then apply_sparse_checkout() is called which copies over
CE_NEW_SKIP_WORKTREE to CE_SKIP_WORKTREE). cache_tree_update() can use
this information to know which blobs are excluded, and thus skip the
checking of these.

Because cache_tree_update() is used from multiple places, this new
behavior is guarded by a new flag WRITE_TREE_SKIP_WORKTREE_MISSING_OK.
Implement this new flag, and teach unpack_trees() to invoke
cache_tree_update() with this new flag.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 cache-tree.c                     |  6 +++++-
 cache-tree.h                     |  4 ++++
 t/t1090-sparse-checkout-scope.sh | 33 ++++++++++++++++++++++++++++++++
 unpack-trees.c                   |  1 +
 4 files changed, 43 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Oct. 9, 2018, 9:27 a.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> Because cache_tree_update() is used from multiple places, this new
> behavior is guarded by a new flag WRITE_TREE_SKIP_WORKTREE_MISSING_OK.

The name of the new flag is mouthful, but we know we do not need to
materialize these blobs (exactly because the skip-worktree bit is
set, so we do not need to know what's in these blobs) and it is OK
for these to be missing (to put it differently, we do not care if
they exist or not---hence we short-circuit the otherwise required
call to has_object_file()), iow, the name of the mode is "A missing
object with skip-worktree bit set is OK", which makes sense to me.

>  		if (is_null_oid(oid) ||
> -		    (mode != S_IFGITLINK && !missing_ok && !has_object_file(oid))) {
> +		    (mode != S_IFGITLINK && !missing_ok &&
> +		     !(skip_worktree_missing_ok && ce_skip_worktree(ce)) &&
> +		     !has_object_file(oid))) {

For a non-gitlink entry, if the caller does not say "any missing
object is OK", we normally check has_object_file().  But now
has_object_file() call happens only when ...

Hmph, isn't this new condition somewhat wrong?  We do not want to
skip it for entries without skip-worktree bit set.  We only do not
care if we are operating in skip-worktree-missing-ok mode and the
bit is set on ce.  IOW:

	if ((skip_worktree_missing_ok && ce_skip_worktree(ce)) ||
            has_object_file(oid))
		/* then we are happy */

but the whole thing above tries to catch problematic case, so I'd
need to negate that, i.e.

	if ( ... &&
	    !((skip_worktree_missing_ok && ce_skip_worktree(ce)) ||
             has_object_file(oid)))
		/* we are in trouble */

and pushing the negation down, we get

	if ( ... &&
	    (!(skip_worktree_missing_ok && ce_skip_worktree(ce)) &&
             !has_object_file(oid)))
		/* we are in trouble */

OK.  The logic in the patch is correct.  I however feel that it is a
bit too dense to make sense of.

Thanks, will queue.
Junio C Hamano Oct. 9, 2018, 9:30 a.m. UTC | #2
Jonathan Tan <jonathantanmy@google.com> writes:

> @@ -1635,6 +1635,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>  				o->result.cache_tree = cache_tree();
>  			if (!cache_tree_fully_valid(o->result.cache_tree))
>  				cache_tree_update(&o->result,
> +						  WRITE_TREE_SKIP_WORKTREE_MISSING_OK |
>  						  WRITE_TREE_SILENT |
>  						  WRITE_TREE_REPAIR);
>  		}

Hmmmm.  Should this be passing the bit unconditionally?  Shouldn't
it be set only when we are doing lazy clone?  A non-lazy clone that
merely uses sparse checkout has nowhere else to turn to if it loses
a blob object that currently is not necessary to complete a checkout,
unlike a repository with promisor.
Ben Peart Oct. 9, 2018, 2:48 p.m. UTC | #3
On 10/8/2018 5:48 PM, Jonathan Tan wrote:
> Whenever a sparse checkout occurs, the existence of all blobs in the
> index is verified, whether or not they are included or excluded by the
> .git/info/sparse-checkout specification. This degrades performance,
> significantly in the case of a partial clone, because a lazy fetch
> occurs whenever the existence of a missing blob is checked.
> 
> At the point of invoking cache_tree_update() in unpack_trees(),
> CE_SKIP_WORKTREE is already set on all excluded blobs
> (mark_new_skip_worktree() is called twice to set CE_NEW_SKIP_WORKTREE,
> then apply_sparse_checkout() is called which copies over
> CE_NEW_SKIP_WORKTREE to CE_SKIP_WORKTREE). cache_tree_update() can use
> this information to know which blobs are excluded, and thus skip the
> checking of these.
> 
> Because cache_tree_update() is used from multiple places, this new
> behavior is guarded by a new flag WRITE_TREE_SKIP_WORKTREE_MISSING_OK.
> Implement this new flag, and teach unpack_trees() to invoke
> cache_tree_update() with this new flag.
> 

I wonder if preventing the download of all missing blobs should be 
limited to only the checkout command.  When you looked at the other 
places that call cache_tree_update(), does it make sense that they 
trigger the download of all the missing blobs?  For example, I suspect 
you would not want them all downloaded just to do a merge-recursive.

In full disclosure, we implemented this a couple of years ago [1] for 
GVFS and opted to do the logic a little differently.  We found that we 
didn't want to trigger the download of all missing blobs in 
cache_tree_update() for _any_ command that was executing in a partially 
cloned repo.  This is safe because if any individual blob is actually 
needed, it will get downloaded "on demand" already.

[1] https://github.com/Microsoft/git/commit/ec865500d98


> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>   cache-tree.c                     |  6 +++++-
>   cache-tree.h                     |  4 ++++
>   t/t1090-sparse-checkout-scope.sh | 33 ++++++++++++++++++++++++++++++++
>   unpack-trees.c                   |  1 +
>   4 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/cache-tree.c b/cache-tree.c
> index 5ce51468f0..340caf2d34 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -246,6 +246,8 @@ static int update_one(struct cache_tree *it,
>   	int missing_ok = flags & WRITE_TREE_MISSING_OK;
>   	int dryrun = flags & WRITE_TREE_DRY_RUN;
>   	int repair = flags & WRITE_TREE_REPAIR;
> +	int skip_worktree_missing_ok =
> +		flags & WRITE_TREE_SKIP_WORKTREE_MISSING_OK;
>   	int to_invalidate = 0;
>   	int i;
>   
> @@ -356,7 +358,9 @@ static int update_one(struct cache_tree *it,
>   		}
>   
>   		if (is_null_oid(oid) ||
> -		    (mode != S_IFGITLINK && !missing_ok && !has_object_file(oid))) {
> +		    (mode != S_IFGITLINK && !missing_ok &&
> +		     !(skip_worktree_missing_ok && ce_skip_worktree(ce)) &&
> +		     !has_object_file(oid))) {
>   			strbuf_release(&buffer);
>   			if (expected_missing)
>   				return -1;
> diff --git a/cache-tree.h b/cache-tree.h
> index 0ab6784ffe..655d487619 100644
> --- a/cache-tree.h
> +++ b/cache-tree.h
> @@ -40,6 +40,10 @@ void cache_tree_verify(struct index_state *);
>   #define WRITE_TREE_DRY_RUN 4
>   #define WRITE_TREE_SILENT 8
>   #define WRITE_TREE_REPAIR 16
> +/*
> + * Do not check for the presence of cache entry objects with CE_SKIP_WORKTREE.
> + */
> +#define WRITE_TREE_SKIP_WORKTREE_MISSING_OK 32
>   
>   /* error return codes */
>   #define WRITE_TREE_UNREADABLE_INDEX (-1)
> diff --git a/t/t1090-sparse-checkout-scope.sh b/t/t1090-sparse-checkout-scope.sh
> index 25d7c700f6..090b7fc3d3 100755
> --- a/t/t1090-sparse-checkout-scope.sh
> +++ b/t/t1090-sparse-checkout-scope.sh
> @@ -63,4 +63,37 @@ test_expect_success 'return to full checkout of master' '
>   	test "$(cat b)" = "modified"
>   '
>   
> +test_expect_success 'in partial clone, sparse checkout only fetches needed blobs' '
> +	test_create_repo server &&
> +	git clone "file://$(pwd)/server" client &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	echo a >server/a &&
> +	echo bb >server/b &&
> +	mkdir server/c &&
> +	echo ccc >server/c/c &&
> +	git -C server add a b c/c &&
> +	git -C server commit -m message &&
> +
> +	test_config -C client core.sparsecheckout 1 &&
> +	test_config -C client extensions.partialclone origin &&
> +	echo "!/*" >client/.git/info/sparse-checkout &&
> +	echo "/a" >>client/.git/info/sparse-checkout &&
> +	git -C client fetch --filter=blob:none origin &&
> +	git -C client checkout FETCH_HEAD &&
> +
> +	git -C client rev-list HEAD \
> +		--quiet --objects --missing=print >unsorted_actual &&
> +	(
> +		printf "?" &&
> +		git hash-object server/b &&
> +		printf "?" &&
> +		git hash-object server/c/c
> +	) >unsorted_expect &&
> +	sort unsorted_actual >actual &&
> +	sort unsorted_expect >expect &&
> +	test_cmp expect actual
> +'
> +
>   test_done
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 51bfac6aa0..39e0e7a6c7 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1635,6 +1635,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>   				o->result.cache_tree = cache_tree();
>   			if (!cache_tree_fully_valid(o->result.cache_tree))
>   				cache_tree_update(&o->result,
> +						  WRITE_TREE_SKIP_WORKTREE_MISSING_OK |
>   						  WRITE_TREE_SILENT |
>   						  WRITE_TREE_REPAIR);
>   		}
>
Ben Peart Oct. 9, 2018, 2:49 p.m. UTC | #4
On 10/9/2018 5:30 AM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>> @@ -1635,6 +1635,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>>   				o->result.cache_tree = cache_tree();
>>   			if (!cache_tree_fully_valid(o->result.cache_tree))
>>   				cache_tree_update(&o->result,
>> +						  WRITE_TREE_SKIP_WORKTREE_MISSING_OK |
>>   						  WRITE_TREE_SILENT |
>>   						  WRITE_TREE_REPAIR);
>>   		}
> 
> Hmmmm.  Should this be passing the bit unconditionally?  Shouldn't
> it be set only when we are doing lazy clone?  A non-lazy clone that
> merely uses sparse checkout has nowhere else to turn to if it loses
> a blob object that currently is not necessary to complete a checkout,
> unlike a repository with promisor.
> 

I agree.  I believe this logic should only be triggered when running in 
a partial clone repo. Otherwise, we're potentially changing the behavior 
of 'normal' repos unnecessarily.
diff mbox series

Patch

diff --git a/cache-tree.c b/cache-tree.c
index 5ce51468f0..340caf2d34 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -246,6 +246,8 @@  static int update_one(struct cache_tree *it,
 	int missing_ok = flags & WRITE_TREE_MISSING_OK;
 	int dryrun = flags & WRITE_TREE_DRY_RUN;
 	int repair = flags & WRITE_TREE_REPAIR;
+	int skip_worktree_missing_ok =
+		flags & WRITE_TREE_SKIP_WORKTREE_MISSING_OK;
 	int to_invalidate = 0;
 	int i;
 
@@ -356,7 +358,9 @@  static int update_one(struct cache_tree *it,
 		}
 
 		if (is_null_oid(oid) ||
-		    (mode != S_IFGITLINK && !missing_ok && !has_object_file(oid))) {
+		    (mode != S_IFGITLINK && !missing_ok &&
+		     !(skip_worktree_missing_ok && ce_skip_worktree(ce)) &&
+		     !has_object_file(oid))) {
 			strbuf_release(&buffer);
 			if (expected_missing)
 				return -1;
diff --git a/cache-tree.h b/cache-tree.h
index 0ab6784ffe..655d487619 100644
--- a/cache-tree.h
+++ b/cache-tree.h
@@ -40,6 +40,10 @@  void cache_tree_verify(struct index_state *);
 #define WRITE_TREE_DRY_RUN 4
 #define WRITE_TREE_SILENT 8
 #define WRITE_TREE_REPAIR 16
+/*
+ * Do not check for the presence of cache entry objects with CE_SKIP_WORKTREE.
+ */
+#define WRITE_TREE_SKIP_WORKTREE_MISSING_OK 32
 
 /* error return codes */
 #define WRITE_TREE_UNREADABLE_INDEX (-1)
diff --git a/t/t1090-sparse-checkout-scope.sh b/t/t1090-sparse-checkout-scope.sh
index 25d7c700f6..090b7fc3d3 100755
--- a/t/t1090-sparse-checkout-scope.sh
+++ b/t/t1090-sparse-checkout-scope.sh
@@ -63,4 +63,37 @@  test_expect_success 'return to full checkout of master' '
 	test "$(cat b)" = "modified"
 '
 
+test_expect_success 'in partial clone, sparse checkout only fetches needed blobs' '
+	test_create_repo server &&
+	git clone "file://$(pwd)/server" client &&
+
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	echo a >server/a &&
+	echo bb >server/b &&
+	mkdir server/c &&
+	echo ccc >server/c/c &&
+	git -C server add a b c/c &&
+	git -C server commit -m message &&
+
+	test_config -C client core.sparsecheckout 1 &&
+	test_config -C client extensions.partialclone origin &&
+	echo "!/*" >client/.git/info/sparse-checkout &&
+	echo "/a" >>client/.git/info/sparse-checkout &&
+	git -C client fetch --filter=blob:none origin &&
+	git -C client checkout FETCH_HEAD &&
+
+	git -C client rev-list HEAD \
+		--quiet --objects --missing=print >unsorted_actual &&
+	(
+		printf "?" &&
+		git hash-object server/b &&
+		printf "?" &&
+		git hash-object server/c/c
+	) >unsorted_expect &&
+	sort unsorted_actual >actual &&
+	sort unsorted_expect >expect &&
+	test_cmp expect actual
+'
+
 test_done
diff --git a/unpack-trees.c b/unpack-trees.c
index 51bfac6aa0..39e0e7a6c7 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1635,6 +1635,7 @@  int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 				o->result.cache_tree = cache_tree();
 			if (!cache_tree_fully_valid(o->result.cache_tree))
 				cache_tree_update(&o->result,
+						  WRITE_TREE_SKIP_WORKTREE_MISSING_OK |
 						  WRITE_TREE_SILENT |
 						  WRITE_TREE_REPAIR);
 		}