[v2] cache-tree: do not lazy-fetch merge tree

Message ID 20190909190130.146613-1-jonathantanmy@google.com

Commit Message

Jonathan Tan Sept. 9, 2019, 7:01 p.m. UTC
When cherry-picking (for example), new trees may be constructed. During
this process, Git constructs the new tree in a struct strbuf, computes
the OID of the new tree, and checks if the new OID already exists on
disk. However, in a partial clone, the disk check causes a lazy fetch to
occur, which is both unnecessary (because we have the tree in the struct
strbuf) and likely to fail (because the remote probably doesn't have
this tree).

Do not lazy fetch in this situation.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
As requested in What's Cooking [1], here's a patch with an updated
commit message. Otherwise, the patch is exactly the same.

[1] https://public-inbox.org/git/xmqqd0gcm2zm.fsf@gitster-ct.c.googlers.com/
---
 cache-tree.c             |  2 +-
 t/t0410-partial-clone.sh | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Sept. 9, 2019, 7:55 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> When cherry-picking (for example), new trees may be constructed. During
> this process, Git constructs the new tree in a struct strbuf, computes
> the OID of the new tree, and checks if the new OID already exists on
> disk. However, in a partial clone, the disk check causes a lazy fetch to
> occur, which is both unnecessary (because we have the tree in the struct
> strbuf) and likely to fail (because the remote probably doesn't have
> this tree).

I somehow smell that the above misses the point of the check in the
first place, though.  The reason why we are computing the tree
object's name and seeing if we have it locally on disk is to decide
if we want to record it in the cache tree, *without* writing the
tree out to our object store, no?

It is not just unnecessary but also against the goal of the codepath
to lazily download it, even if the tree is available remotely.  And
it is irrelevant that there are cases the remote does not have
it---we have no need to mention that we must be prepared to see the
lazy fetch fail.  Even when they do have one, we do not want to
fetch it and write to our object store.

Isn't that what is going on?  I thought I dug up the original that
introduced the has_object_file() call to this codepath to make sure
we understand why we make the check (and I expected the person who
is proposing this change to do the same and record the finding in
the proposed log message).

I am running out of time today, and will revisit later this week
(I'll be down for at least two days starting tomorrow, by the way).

Thanks.
Junio C Hamano Sept. 9, 2019, 9:05 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Isn't that what is going on?  I thought I dug up the original that
> introduced the has_object_file() call to this codepath to make sure
> we understand why we make the check (and I expected the person who
> is proposing this change to do the same and record the finding in
> the proposed log message).
>
> I am running out of time today, and will revisit later this week
> (I'll be down for at least two days starting tomorrow, by the way).

Here is what I came up with.

    The cache-tree data structure is used to speed up the comparison
    between the HEAD and the index, and when the index is updated by
    a cherry-pick (for example), a tree object that would represent
    the paths in the index in a directory is constructed in-core, to
    see if such a tree object exists already in the object store.

    When the lazy-fetch mechanism was introduced, we converted this
    "does the tree exist?" check into an "if it does not, and if we
    lazily cloned, see if the remote has it" call by mistake.  Since
    the whole point of this check is to repair the cache-tree by
    recording an already existing tree object opportunistically, we
    shouldn't even try to fetch one from the remote.

    Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only
    check for existence in the local object store without triggering the
    lazy fetch mechanism.
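
To illustrate the distinction described above, here is a minimal, self-contained sketch. It is a toy model and not Git's actual code: `struct toy_store`, `toy_has_object()` and `TOY_SKIP_FETCH` are hypothetical names that only mimic the role `OBJECT_INFO_SKIP_FETCH_OBJECT` plays in the existence check, and the "lazy fetch" is simulated by a counter plus a lookup in a second list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, modeled on the role of OBJECT_INFO_SKIP_FETCH_OBJECT. */
#define TOY_SKIP_FETCH 1u

/* Toy object store: object names held locally and by the promisor remote. */
struct toy_store {
	const char **local;
	size_t nr_local;
	const char **remote;
	size_t nr_remote;
	unsigned fetches;	/* number of simulated lazy fetches */
};

/* Linear search for a name in a list of names. */
static int toy_contains(const char **names, size_t nr, const char *oid)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(names[i], oid))
			return 1;
	return 0;
}

/*
 * Return 1 if the object is considered available, 0 otherwise.
 * Without TOY_SKIP_FETCH, a local miss triggers a simulated lazy fetch;
 * with it, only the local store is consulted and the remote is never asked.
 */
static int toy_has_object(struct toy_store *s, const char *oid, unsigned flags)
{
	assert(s && oid);

	if (toy_contains(s->local, s->nr_local, oid))
		return 1;
	if (flags & TOY_SKIP_FETCH)
		return 0;	/* local-only check: a miss is just a miss */

	s->fetches++;		/* the network round trip we want to avoid */
	return toy_contains(s->remote, s->nr_remote, oid);
}
```

In this model, the "repair" path would call `toy_has_object(&store, oid, TOY_SKIP_FETCH)` and simply invalidate the cache-tree entry on a miss, leaving `store.fetches` at zero. The patch in this thread does the real equivalent by calling has_object_file_with_flags() with OBJECT_INFO_SKIP_FETCH_OBJECT.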
Jeff King Sept. 9, 2019, 10:21 p.m. UTC | #3
On Mon, Sep 09, 2019 at 02:05:53PM -0700, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Isn't that what is going on?  I thought I dug up the original that
> > introduced the has_object_file() call to this codepath to make sure
> > we understand why we make the check (and I expected the person who
> > is proposing this change to do the same and record the finding in
> > the proposed log message).
> >
> > I am running out of time today, and will revisit later this week
> > (I'll be down for at least two days starting tomorrow, by the way).
> 
> Here is what I came up with.
> 
>     The cache-tree data structure is used to speed up the comparison
>     between the HEAD and the index, and when the index is updated by
>     a cherry-pick (for example), a tree object that would represent
>     the paths in the index in a directory is constructed in-core, to
>     see if such a tree object exists already in the object store.
> 
>     When the lazy-fetch mechanism was introduced, we converted this
>     "does the tree exist?" check into an "if it does not, and if we
>     lazily cloned, see if the remote has it" call by mistake.  Since
>     the whole point of this check is to repair the cache-tree by
>     recording an already existing tree object opportunistically, we
>     shouldn't even try to fetch one from the remote.
> 
>     Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only
>     check for existence in the local object store without triggering the
>     lazy fetch mechanism.

As a third-party observer, that explanation makes sense to me.

I wondered also if this means we should be using OBJECT_INFO_QUICK.
I.e., do we expect to see a "miss" here often, forcing us to re-scan the
packed directory?

Reading dd0c34c46b (cache-tree: protect against "git prune".,
2006-04-24), I think the answer is "no".

-Peff
Junio C Hamano Sept. 10, 2019, 1:09 a.m. UTC | #4
Jeff King <peff@peff.net> writes:

> I wondered also if this means we should be using OBJECT_INFO_QUICK.
> I.e., do we expect to see a "miss" here often, forcing us to re-scan the
> packed directory?

As a performance optimization hack, it is OK if we did not notice
that the tree object, which corresponds to what is currently
prepared for a directory in the index, does exist in the object
store.  It is not worth rescanning the packs to "protect" against
races, I think, in the "repair" codepath.

When the user actually wants to write the index out as a tree, we
would write it out as a loose object (or omit doing so if we know
there are already copies), but because it is not a crime to create a
duplicate loose object when we already have a packed copy, I do not
think we need to rescan in that context, either.  But I do not think
the codepath Jonathan's patch touches is about that operation.
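
The trade-off discussed here (accept a possible false negative rather than re-scan the packs on a miss) can be sketched with another toy model. Everything below is hypothetical and only illustrates the idea behind OBJECT_INFO_QUICK as described in this thread: `struct toy_packs`, `toy_has_packed()` and `TOY_QUICK` are invented names, and the "re-scan" is simulated by swapping in a second list that stands for the packs currently on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, modeled on the role of OBJECT_INFO_QUICK. */
#define TOY_QUICK 1u

struct toy_packs {
	const char **known;	/* objects in packs we have already scanned */
	size_t nr_known;
	const char **on_disk;	/* objects in packs currently on disk */
	size_t nr_on_disk;
	unsigned rescans;	/* number of simulated pack re-scans */
};

/* Linear search for a name in a list of names. */
static int toy_lookup(const char **names, size_t nr, const char *oid)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(names[i], oid))
			return 1;
	return 0;
}

/*
 * Return 1 if the object is found, 0 otherwise.
 * With TOY_QUICK, a miss in the already-scanned packs is final, so an
 * object that arrived in a newer pack may go unnoticed. Without it, a
 * miss triggers one simulated re-scan before giving up.
 */
static int toy_has_packed(struct toy_packs *p, const char *oid, unsigned flags)
{
	assert(p && oid);

	if (toy_lookup(p->known, p->nr_known, oid))
		return 1;
	if (flags & TOY_QUICK)
		return 0;	/* tolerate a possible false negative */

	/* Simulated re-scan: adopt the on-disk view and look once more. */
	p->rescans++;
	p->known = p->on_disk;
	p->nr_known = p->nr_on_disk;
	return toy_lookup(p->known, p->nr_known, oid);
}
```

For a purely opportunistic check such as the cache-tree repair, a false negative in this model only costs a missed optimization, which is the reasoning given above for not paying for the re-scan.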
SZEDER Gábor Sept. 10, 2019, 12:49 p.m. UTC | #5
On Mon, Sep 09, 2019 at 12:01:30PM -0700, Jonathan Tan wrote:
> diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
> index 6415063980..3e434b6a81 100755
> --- a/t/t0410-partial-clone.sh
> +++ b/t/t0410-partial-clone.sh
> @@ -492,6 +492,20 @@ test_expect_success 'gc stops traversal when a missing but promised object is re
>  	! grep "$TREE_HASH" out
>  '
>  
> +test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
> +	rm -rf repo &&
> +	test_create_repo repo &&
> +	test_commit -C repo base &&
> +	test_commit -C repo side1 &&
> +	git -C repo checkout base &&
> +	test_commit -C repo side2 &&
> +
> +	git -C repo config core.repositoryformatversion 1 &&
> +	git -C repo config extensions.partialclone "arbitrary string" &&
> +
> +	git -C repo cherry-pick side1
> +'
> +

Sidenote, just curious: did you originally intend to add this test
before the test script sources 'lib-httpd.sh', or were you about to
append it at the end as usual, but then noticed the warning comment
telling you not to do so?

>  . "$TEST_DIRECTORY"/lib-httpd.sh
>  start_httpd
Jonathan Tan Sept. 10, 2019, 6:15 p.m. UTC | #6
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Isn't that what is going on?  I thought I dug up the original that
> > introduced the has_object_file() call to this codepath to make sure
> > we understand why we make the check (and I expected the person who
> > is proposing this change to do the same and record the finding in
> > the proposed log message).
> >
> > I am running out of time today, and will revisit later this week
> > (I'll be down for at least two days starting tomorrow, by the way).
> 
> Here is what I came up with.
> 
>     The cache-tree data structure is used to speed up the comparison
>     between the HEAD and the index, and when the index is updated by
>     a cherry-pick (for example), a tree object that would represent
>     the paths in the index in a directory is constructed in-core, to
>     see if such a tree object exists already in the object store.
> 
>     When the lazy-fetch mechanism was introduced, we converted this
>     "does the tree exist?" check into an "if it does not, and if we
>     lazily cloned, see if the remote has it" call by mistake.  Since
>     the whole point of this check is to repair the cache-tree by
>     recording an already existing tree object opportunistically, we
>     shouldn't even try to fetch one from the remote.
> 
>     Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only
>     check for existence in the local object store without triggering the
>     lazy fetch mechanism.

This commit message looks good to me. Thanks for writing the commit
message - I thought that the justification in the commit message I wrote
would be sufficient, but it makes sense to look into why the check was
done.
Jonathan Tan Sept. 10, 2019, 6:19 p.m. UTC | #7
> Sidenote, just curious: did you originally intend to add this test
> before the test script sources 'lib-httpd.sh', or were you about to
> append it at the end as usual, but then noticed the warning comment
> telling you not to do so?

Honestly, I don't remember. I do try to put tests near similar tests, so
I might have seen that we had HTTP tests at the bottom and non-HTTP
tests at the top, but in this case, I don't remember if I had that
thought before putting this test where it is now.

Patch

diff --git a/cache-tree.c b/cache-tree.c
index c22161f987..9e596893bc 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -407,7 +407,7 @@  static int update_one(struct cache_tree *it,
 	if (repair) {
 		struct object_id oid;
 		hash_object_file(buffer.buf, buffer.len, tree_type, &oid);
-		if (has_object_file(&oid))
+		if (has_object_file_with_flags(&oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
 			oidcpy(&it->oid, &oid);
 		else
 			to_invalidate = 1;
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 6415063980..3e434b6a81 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -492,6 +492,20 @@  test_expect_success 'gc stops traversal when a missing but promised object is re
 	! grep "$TREE_HASH" out
 '
 
+test_expect_success 'do not fetch when checking existence of tree we construct ourselves' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo base &&
+	test_commit -C repo side1 &&
+	git -C repo checkout base &&
+	test_commit -C repo side2 &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+
+	git -C repo cherry-pick side1
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd