diff mbox series

[v3] index-pack: remove fetch_if_missing=0

Message ID 20230313181518.6322-1-five231003@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v3] index-pack: remove fetch_if_missing=0 | expand

Commit Message

Kousik Sanagavarapu March 13, 2023, 6:15 p.m. UTC
A collision test is triggered in sha1_object(), whenever there is an
object file in our repo. If our repo is a partial clone, then checking
for this file existence does not lazy-fetch the object (if the object
is missing and if there are one or more promisor remotes) when
fetch_if_missing is set to 0.

This global was added as a temporary measure to suppress the fetching
of missing objects [1] and can be removed once the remaining commands:
 - fetch-pack
 - fsck
 - pack-objects
 - prune
 - rev-list
can handle lazy-fetching without fetch_if_missing.

Hence, use has_object() to check for the existence of an object, which
has the default behavior of not lazy-fetching in a partial clone. It is
worth mentioning that this is the only place where there is potential for
lazy-fetching and all other cases [2] are properly handled, making it safe
to remove this global here.

[1] See 8b4c0103a9 (sha1_file: support lazily fetching missing objects,
		   2017-12-08)
[2] These cases are:
    - When we check objects, but we return with 0 early if the object
      doesn't exist.
    - We prefetch delta bases in a partial clone, if we don't have them
      (as the comment outlines).
    - There are some cases where we fsck objects, but lazy-fetching is
      already handled in fsck.

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---
 builtin/index-pack.c     | 11 +----------
 t/t5616-partial-clone.sh | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 10 deletions(-)

Comments

Junio C Hamano March 13, 2023, 7:17 p.m. UTC | #1
Kousik Sanagavarapu <five231003@gmail.com> writes:

> A collision test is triggered in sha1_object(), whenever there is an
> object file in our repo. If our repo is a partial clone, then checking
> for this file existence does not lazy-fetch the object (if the object
> is missing and if there are one or more promisor remotes) when
> fetch_if_missing is set to 0.
> ...
> Hence, use has_object() to check for the existence of an object, which
> has the default behavior of not lazy-fetching in a partial clone. It is
> worth mentioning that this is the only place where there is potential for
> lazy-fetching and all other cases [2] are properly handled, making it safe
> to remove this global here.
>
> [1] See 8b4c0103a9 (sha1_file: support lazily fetching missing objects,
> 		   2017-12-08)

Thanks for this reference.

The way I read the "lazy fetching is by default not suppressed, and
this is a temporary measure" described in the log message is quite
opposite from where this patch wants to go, though.  

I think the commit envisioned the world where all the accesses to
the object layer are aware of the characteristics of a lazy clone
(e.g. has_object() reporting "we do not have that object locally
(yet)" is not an immediate sign of a repository corruption) and when
to lazily fetch and when to tolerate locally missing objects is
controlled more tightly, and to reach that world one step at a time,
introduced the global, so that for now everybody lazily fetches, but
the commands individual patches concentrates to "fix" can turn off
the "by default all missing objects are lazily fetched" so that they
can either allow certain objects to be locally missing, or fetch
them from promisor remotes when they need the contents of such
objects.  By fixing each commands one by one, eventually we would be
able to wean ourselves away from this "by default everything is
lazily fetched" global---in other words, in the ideal endgame, the
fetch_if_missing should be set to 0 everywhere.

So, if this patch was made in reaction to the "it was a temporary
measure" in 8b4c0103 (sha1_file: support lazily fetching missing
objects, 2017-12-08), I think it goes in the completely opposite
direction.  If the patch shared the cause with 8b4c0103 and wanted
to help realize the ideal world, it instead should have left this
command who can already work with fetch_if_missing=0 alone and fixed
somebody else who still depends on fetch_if_missing=1, I think.

Now it is a separate issue to argue if "everybody knows exactly when
to trigger lazy fetching and fetch_if_missing is set to false
everywhere" is really the ideal endgame.  I do not think "Future
patches will update some commands to either tolerate missing objects
or be more efficient in fetching them." proposed by the commit from
late 2017 has seen that much advance recently.

But for commands that need to deal with many missing objects,
enumerating the objects that are missing locally and need fetching
first and then requesting them in a batch should be vastly more
efficient than the default lazy fetch logic that lets the caller
request a single object, realize it is missing locally, make a
connection to fetch that single object and disconnect.  So I have to
suspect that ...

> This global was added as a temporary measure to suppress the fetching
> of missing objects [1] and can be removed once the remaining commands:
>  - fetch-pack
>  - fsck
>  - pack-objects
>  - prune
>  - rev-list
> can handle lazy-fetching without fetch_if_missing.

... this "can handle" may be a misguided direction to go in.  They
were taught not to lazy fetch because blindly lazy fetching was bad,
weren't they?
Junio C Hamano March 13, 2023, 7:18 p.m. UTC | #2
Kousik Sanagavarapu <five231003@gmail.com> writes:

> A collision test is triggered in sha1_object(), whenever there is an
> object file in our repo. If our repo is a partial clone, then checking
> for this file existence does not lazy-fetch the object (if the object
> is missing and if there are one or more promisor remotes) when
> fetch_if_missing is set to 0.
> ...
> Hence, use has_object() to check for the existence of an object, which
> has the default behavior of not lazy-fetching in a partial clone. It is
> worth mentioning that this is the only place where there is potential for
> lazy-fetching and all other cases [2] are properly handled, making it safe
> to remove this global here.
>
> [1] See 8b4c0103a9 (sha1_file: support lazily fetching missing objects,
> 		   2017-12-08)

Thanks for the reference.

The way I read the "lazy fetching is by default not suppressed, and
this is a temporary measure" described in the log message is quite
opposite from where this patch wants to go, though.  

I think the commit envisioned the world where all the accesses to
the object layer are aware of the characteristics of a lazy clone
(e.g. has_object() reporting "we do not have that object locally
(yet)" is not an immediate sign of a repository corruption) and when
to lazily fetch and when to tolerate locally missing objects is
controlled more tightly, and to reach that world one step at a time,
introduced the global, so that for now everybody lazily fetches, but
the commands individual patches concentrates to "fix" can turn off
the "by default all missing objects are lazily fetched" so that they
can either allow certain objects to be locally missing, or fetch
them from promisor remotes when they need the contents of such
objects.  By fixing each commands one by one, eventually we would be
able to wean ourselves away from this "by default everything is
lazily fetched" global---in other words, in the ideal endgame, the
fetch_if_missing should be set to 0 everywhere.

So, if this patch was made in reaction to the "it was a temporary
measure" in 8b4c0103 (sha1_file: support lazily fetching missing
objects, 2017-12-08), I think it goes in the completely opposite
direction.  If the patch shared the cause with 8b4c0103 and wanted
to help realize the ideal world, it instead should have left this
command who can already work with fetch_if_missing=0 alone and fixed
somebody else who still depends on fetch_if_missing=1, I think.

Now it is a separate issue to argue if "everybody knows exactly when
to trigger lazy fetching and fetch_if_missing is set to false
everywhere" is really the ideal endgame.  I do not think "Future
patches will update some commands to either tolerate missing objects
or be more efficient in fetching them." proposed by the commit from
late 2017 has seen that much advance recently.

But for commands that need to deal with many missing objects,
enumerating the objects that are missing locally and need fetching
first and then requesting them in a batch should be vastly more
efficient than the default lazy fetch logic that lets the caller
request a single object, realize it is missing locally, make a
connection to fetch that single object and disconnect.  So I have to
suspect that ...

> This global was added as a temporary measure to suppress the fetching
> of missing objects [1] and can be removed once the remaining commands:
>  - fetch-pack
>  - fsck
>  - pack-objects
>  - prune
>  - rev-list
> can handle lazy-fetching without fetch_if_missing.

... this "can handle" may be a misguided direction to go in.  They
were taught not to lazy fetch because blindly lazy fetching was bad,
weren't they?
diff mbox series

Patch

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 6648f2daef..8c0f36a49e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -800,8 +800,7 @@  static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed =
-			has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
+		collision_test_needed = has_object(the_repository, oid, 0);
 		read_unlock();
 	}
 
@@ -1728,14 +1727,6 @@  int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	int report_end_of_input = 0;
 	int hash_algo = 0;
 
-	/*
-	 * index-pack never needs to fetch missing objects except when
-	 * REF_DELTA bases are missing (which are explicitly handled). It only
-	 * accesses the repo to do hash collision checks and to check which
-	 * REF_DELTA bases need to be fetched.
-	 */
-	fetch_if_missing = 0;
-
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage(index_pack_usage);
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index f519d2a87a..fdb34a0b50 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -644,6 +644,34 @@  test_expect_success 'repack does not loosen promisor objects' '
 	grep "loosen_unused_packed_objects/loosened:0" trace
 '
 
+test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collsions' '
+	rm -rf server client another-remote &&
+
+	git init server &&
+	echo "line" >server/file &&
+	git -C server add file &&
+	git -C server commit -am "file" &&
+	git -C server config --local uploadpack.allowFilter 1 &&
+	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
+
+	git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
+	git -C client config extensions.partialClone 1 &&
+	git -C client config remote.origin.promisor 1 &&
+
+	git clone "file://$(pwd)/server" another-remote &&
+
+	echo "new line" >server/new-file &&
+	git -C server add new-file &&
+	git -C server commit -am "new-file" &&
+
+	git -C another-remote pull &&
+
+	# Try to fetch so that "client" will have to do a collision-check.
+	# This should, however, not fetch "new-file" because "client" is a
+	# partial clone.
+	git -C client fetch "file://$(pwd)/another-remote" main
+'
+
 test_expect_success 'lazy-fetch in submodule succeeds' '
 	# setup
 	test_config_global protocol.file.allow always &&