cache-tree: do not lazy-fetch merge tree

Message ID 20190903194247.217964-1-jonathantanmy@google.com (mailing list archive)
State New, archived

Commit Message

Jonathan Tan Sept. 3, 2019, 7:42 p.m. UTC
When cherry-picking (for example), new trees may be constructed. During
this process, Git checks whether these trees exist. However, in a
partial clone, this causes a lazy fetch to occur, which is both
unnecessary (because Git has already constructed this tree as part of
the cherry-picking process) and likely to fail (because the remote
probably doesn't have this tree).

Do not lazy fetch in this situation.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Another partial clone bug.

This raises the issue that failed fetches are currently fatal - if they
weren't fatal, this cherry-pick would have worked (except with some
delay as the fetch is attempted, and with a warning message about the
fetch failing). My personal inclination right now is to leave things as
they are (fatal failed fetches), but I'm open to other opinions.
---
 cache-tree.c             |  2 +-
 t/t0410-partial-clone.sh | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

Comments

Derrick Stolee Sept. 4, 2019, 1:37 a.m. UTC | #1
On 9/3/2019 3:42 PM, Jonathan Tan wrote:
> When cherry-picking (for example), new trees may be constructed. During
> this process, Git checks whether these trees exist. However, in a
> partial clone, this causes a lazy fetch to occur, which is both
> unnecessary (because Git has already constructed this tree as part of
> the cherry-picking process) and likely to fail (because the remote
> probably doesn't have this tree).

If we have constructed the object already, then why do we not see it
and avoid fetching it? This must be a slightly strange timing issue
with objects being flushed to disk or added to the object cache.

One approach is to find all of these has_object_file() calls that should
really be made with OBJECT_INFO_SKIP_FETCH_OBJECT. Another would be to
find out why has_object_file() isn't seeing the object we constructed.

> Do not lazy fetch in this situation.
I agree that the patch has this effect.
 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Another partial clone bug.
> 
> This raises the issue that failed fetches are currently fatal - if they
> weren't fatal, this cherry-pick would have worked (except with some
> delay as the fetch is attempted, and with a warning message about the
> fetch failing). My personal inclination right now is to leave things as
> they are (fatal failed fetches), but I'm open to other opinions.
> ---
>  cache-tree.c             |  2 +-
>  t/t0410-partial-clone.sh | 14 ++++++++++++++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/cache-tree.c b/cache-tree.c
> index c22161f987..9e596893bc 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -407,7 +407,7 @@ static int update_one(struct cache_tree *it,
>  	if (repair) {
>  		struct object_id oid;
>  		hash_object_file(buffer.buf, buffer.len, tree_type, &oid);
> -		if (has_object_file(&oid))
> +		if (has_object_file_with_flags(&oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
>  			oidcpy(&it->oid, &oid);
>  		else
>  			to_invalidate = 1;
> diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
> index 6415063980..3e434b6a81 100755
> --- a/t/t0410-partial-clone.sh
> +++ b/t/t0410-partial-clone.sh
> @@ -492,6 +492,20 @@ test_expect_success 'gc stops traversal when a missing but promised object is re
>  	! grep "$TREE_HASH" out
>  '
>  
> +test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
> +	rm -rf repo &&
> +	test_create_repo repo &&
> +	test_commit -C repo base &&
> +	test_commit -C repo side1 &&
> +	git -C repo checkout base &&
> +	test_commit -C repo side2 &&
> +
> +	git -C repo config core.repositoryformatversion 1 &&
> +	git -C repo config extensions.partialclone "arbitrary string" &&
> +
> +	git -C repo cherry-pick side1
> +'
> +

I appreciate this test!

Thanks,
-Stolee

Jonathan Tan Sept. 4, 2019, 10:35 p.m. UTC | #2
> On 9/3/2019 3:42 PM, Jonathan Tan wrote:
> > When cherry-picking (for example), new trees may be constructed. During
> > this process, Git checks whether these trees exist. However, in a
> > partial clone, this causes a lazy fetch to occur, which is both
> > unnecessary (because Git has already constructed this tree as part of
> > the cherry-picking process) and likely to fail (because the remote
> > probably doesn't have this tree).
> 
> If we have constructed the object already, then why do we not see it
> and avoid fetching it? This must be a slightly strange timing issue
> with objects being flushed to disk or added to the object cache.

Thanks for taking a look at this! The answer is that I wasn't precise
when I said "already constructed" - I meant that it is in a struct
strbuf. It hasn't been written to disk yet, so has_object_file() does
not see it.

> One approach is to find all of these has_object_file() calls that should
> really be made with OBJECT_INFO_SKIP_FETCH_OBJECT. Another would be to
> find out why has_object_file() isn't seeing the object we constructed.

By the former, do you mean that we should look at the other
has_object_file() calls? I looked at the others in cache-tree.c and I
think the one in this patch is the only one that is called on an OID
generated from hash_object_file(). (And I answered the latter in the
above paragraph.)

To avoid confusion, maybe this commit message is better:

When cherry-picking (for example), new trees may be constructed. During
this process, Git constructs the new tree in a struct strbuf, computes
the OID of the new tree, and checks if the new OID already exists on
disk. However, in a partial clone, the disk check causes a lazy fetch to
occur, which is both unnecessary (because we have the tree in the struct
strbuf) and likely to fail (because the remote probably doesn't have
this tree).
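
As a rough illustration of the check described above, here is a toy
model of an existence lookup with and without a skip-fetch flag. It is
a sketch under stated assumptions, not Git's code: every toy_* name is
hypothetical, the "store" is just an array of OID strings, and a lazy
fetch is only counted rather than performed (in real Git a failed lazy
fetch is fatal).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, loosely modeled on OBJECT_INFO_SKIP_FETCH_OBJECT. */
enum { TOY_SKIP_FETCH = 1 << 0 };

/* Hypothetical stand-in for the local object store of a repository. */
struct toy_store {
	const char **local_oids; /* OIDs of objects already on disk */
	size_t nr_local;
	bool is_partial_clone;
	int fetch_attempts;      /* number of simulated lazy fetches */
};

static bool toy_has_local(const struct toy_store *s, const char *oid)
{
	for (size_t i = 0; i < s->nr_local; i++)
		if (!strcmp(s->local_oids[i], oid))
			return true;
	return false;
}

/*
 * Return true if the object is available locally.  On a miss in a
 * partial clone, a lazy fetch is attempted unless TOY_SKIP_FETCH is
 * set.  This model assumes the remote never has the object, so the
 * attempt is counted and the lookup still reports "missing".
 */
static bool toy_has_object(struct toy_store *s, const char *oid,
			   unsigned flags)
{
	assert(s && oid);
	if (toy_has_local(s, oid))
		return true;
	if (s->is_partial_clone && !(flags & TOY_SKIP_FETCH))
		s->fetch_attempts++;
	return false;
}
```

Looking up an OID that was only computed in memory increments
fetch_attempts when flags is 0, and leaves it untouched when
TOY_SKIP_FETCH is passed; the return value is the same either way.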

Junio C Hamano Sept. 4, 2019, 11:35 p.m. UTC | #3
Jonathan Tan <jonathantanmy@google.com> writes:

> When cherry-picking (for example), new trees may be constructed. During
> this process, Git constructs the new tree in a struct strbuf, computes
> the OID of the new tree, and checks if the new OID already exists on
> disk. However, in a partial clone, the disk check causes a lazy fetch to
> occur, which is both unnecessary (because we have the tree in the struct
> strbuf) and likely to fail (because the remote probably doesn't have
> this tree).

FWIW, this logic dates back to aecf567c ("cache-tree: create/update
cache-tree on checkout", 2014-07-05), when "checkout" learned to
perform opportunistic revalidation of cache-tree data structure,
without writing into the object store.  Suppose that, in a lazily
fetched checkout, we locally create a blob that happens to match an
original we did not fetch, in a directory this piece of code is
hashing.  The resulting hash _may_ name a tree that the other side
has but that we did not fetch, so taking the "to_invalidate = 1" side
would make the resulting cache-tree less optimal.  But the design
choice being made here is to take that hit in order to avoid network
cost, and as long as that is documented properly (iow, "probably
doesn't have" is not an issue; even if they have it, you do not want
to fetch and make the cache-tree entry valid), it is OK.
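
The trade-off described above can be sketched as a small decision
function. This is a toy model under stated assumptions, not Git's
implementation: toy_outcome and toy_revalidate are hypothetical names,
and a fetch is represented only by a request counter.

```c
#include <stdbool.h>

/* Hypothetical result of revalidating one cache-tree entry. */
struct toy_outcome {
	bool entry_valid;     /* entry can be reused without recomputation */
	int network_requests; /* simulated round trips to the remote */
};

/*
 * Decide the fate of one cache-tree entry whose tree OID was just
 * computed in memory.  `fetch_allowed` models the behavior before the
 * patch (no skip-fetch flag); `remote_has` models whether the remote
 * could actually supply the tree.
 */
static struct toy_outcome toy_revalidate(bool have_locally, bool remote_has,
					 bool fetch_allowed)
{
	struct toy_outcome out = { false, 0 };

	if (have_locally) {
		out.entry_valid = true;
		return out;
	}
	if (fetch_allowed) {
		out.network_requests = 1;
		/* In real Git a failed lazy fetch is fatal, not merely invalid. */
		out.entry_valid = remote_has;
	}
	/* With fetching skipped, the entry is simply invalidated. */
	return out;
}
```

With fetch_allowed set to false, the entry is invalidated even when
remote_has is true: the cache-tree ends up less optimal, but no network
request is made.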

Patch

diff --git a/cache-tree.c b/cache-tree.c
index c22161f987..9e596893bc 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -407,7 +407,7 @@  static int update_one(struct cache_tree *it,
 	if (repair) {
 		struct object_id oid;
 		hash_object_file(buffer.buf, buffer.len, tree_type, &oid);
-		if (has_object_file(&oid))
+		if (has_object_file_with_flags(&oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
 			oidcpy(&it->oid, &oid);
 		else
 			to_invalidate = 1;
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 6415063980..3e434b6a81 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -492,6 +492,20 @@  test_expect_success 'gc stops traversal when a missing but promised object is re
 	! grep "$TREE_HASH" out
 '
 
+test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo base &&
+	test_commit -C repo side1 &&
+	git -C repo checkout base &&
+	test_commit -C repo side2 &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+
+	git -C repo cherry-pick side1
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd