diff mbox series

[v4] index-pack: remove fetch_if_missing=0

Message ID 20230317175601.4250-1-five231003@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v4] index-pack: remove fetch_if_missing=0 | expand

Commit Message

Kousik Sanagavarapu March 17, 2023, 5:56 p.m. UTC
A collision test is triggered in sha1_object(), whenever there is an
object file in our repo. If our repo is a partial clone, then checking
for this file existence does not lazy-fetch the object (if the object
is missing and if there are one or more promisor remotes) when
fetch_if_missing is set to 0.

Though this global lets us control lazy-fetching in regions of code,
it prevents multi-threading [1].

Hence, use has_object() to check for the existence of an object, which
has the default behavior of not lazy-fetching in a partial clone, even
when fetch_if_missing is set to 1. It is worth mentioning that this is
the only place where there is potential for lazy-fetching and all other
cases [2] are properly handled, making it safe to remove this global
here.

[1] See https://lore.kernel.org/git/xmqqv9sdeeif.fsf@gitster-ct.c.googlers.com/
    and the discussion that follows from it.

[2] These cases are:
    - When we check objects, but we return with 0 early if the object
      doesn't exist.
    - We prefetch delta bases in a partial clone, if we don't have them
      (as the comment outlines).
    - There are some cases where we fsck objects, but lazy-fetching is
      already handled in fsck.

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---

Changes since v3:
- Changed the commit message to give a stronger reason as to
  why we should reduce the use of this global and in this
  process remove it from builtin/index-pack.c

- Also made an addition to the test which I overlooked last time
  and without which the does not make sense

 builtin/index-pack.c     | 11 +----------
 t/t5616-partial-clone.sh | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 10 deletions(-)

Comments

Junio C Hamano March 17, 2023, 10:58 p.m. UTC | #1
Kousik Sanagavarapu <five231003@gmail.com> writes:

> A collision test is triggered in sha1_object(), whenever there is an
> object file in our repo. If our repo is a partial clone, then checking
> for this file existence does not lazy-fetch the object (if the object
> is missing and if there are one or more promisor remotes) when
> fetch_if_missing is set to 0.
>
> Though this global lets us control lazy-fetching in regions of code,
> it prevents multi-threading [1].

Sorry, but I really do not see the point.

We already have read_lock/read_unlock to prevent multiple threads
from stomping on the in-core object database structure either way.

If somebody needs to dynamically change the value of fetch_if_missing
after the program started and spawned multiple threads, yes, the update
to the single variable would become a problem point in multi-threading.

But that is not what we are doing, and you already discovered that
this was done as "a temporary measure" to selectively let some
programs use 0 and others use 1 for lazy-fetching, at a very early
part of these programs.

If we are to reduce this global, perhaps we should teach more
codepaths not to lazy fetch by default.  Once everybody gets
converted like so, then index-pack can lose the assignment of 0 to
the variable, as the global variable would be initialized to 0 and
nobody will flip it to 1 to "temporarily opt into lazy fetching by
default until it gets fixed".  At that point, we can lose the global
variable.

So "we want to reduce the use of this global" is not a good reason
to do this change at all, without a convincing argument that says
why everybody should do automatic lazy fetching of objects.  If
everybody should avoid doing automatic lazy fetching, a good first
step to reduce the use of this global is not to touch index-pack
that has already been fixed not to do so, no?
Kousik Sanagavarapu March 19, 2023, 6:17 a.m. UTC | #2
On 18/03/23 04:28, Junio C Hamano wrote:

> Kousik Sanagavarapu <five231003@gmail.com> writes:
>
>> A collision test is triggered in sha1_object(), whenever there is an
>> object file in our repo. If our repo is a partial clone, then checking
>> for this file existence does not lazy-fetch the object (if the object
>> is missing and if there are one or more promisor remotes) when
>> fetch_if_missing is set to 0.
>>
>> Though this global lets us control lazy-fetching in regions of code,
>> it prevents multi-threading [1].
> Sorry, but I really do not see the point.
>
> We already have read_lock/read_unlock to prevent multiple threads
> from stomping on the in-core object database structure either way.
>
> If somebody needs to dynamically change the value of fetch_if_missing
> after the program started and spawned multiple threads, yes, the update
> to the single variable would become a problem point in multi-threading.
>
> But that is not what we are doing, and you already discovered that
> this was done as "a temporary measure" to selectively let some
> programs use 0 and others use 1 for lazy-fetching, at a very early
> part of these programs.
>
> If we are to reduce this global, perhaps we should teach more
> codepaths not to lazy fetch by default.  Once everybody gets
> converted like so, then index-pack can lose the assignment of 0 to
> the variable, as the global variable would be initialized to 0 and
> nobody will flip it to 1 to "temporarily opt into lazy fetching by
> default until it gets fixed".  At that point, we can lose the global
> variable.
>
> So "we want to reduce the use of this global" is not a good reason
> to do this change at all, without a convincing argument that says
> why everybody should do automatic lazy fetching of objects.  If
> everybody should avoid doing automatic lazy fetching, a good first
> step to reduce the use of this global is not to touch index-pack
> that has already been fixed not to do so, no?

Thanks for the review.


Also, thanks for pointing out the direction of work in this area.

Really helpful.
diff mbox series

Patch

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 6648f2daef..8c0f36a49e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -800,8 +800,7 @@  static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed =
-			has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
+		collision_test_needed = has_object(the_repository, oid, 0);
 		read_unlock();
 	}
 
@@ -1728,14 +1727,6 @@  int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	int report_end_of_input = 0;
 	int hash_algo = 0;
 
-	/*
-	 * index-pack never needs to fetch missing objects except when
-	 * REF_DELTA bases are missing (which are explicitly handled). It only
-	 * accesses the repo to do hash collision checks and to check which
-	 * REF_DELTA bases need to be fetched.
-	 */
-	fetch_if_missing = 0;
-
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage(index_pack_usage);
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index f519d2a87a..41fa7130f1 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -644,6 +644,39 @@  test_expect_success 'repack does not loosen promisor objects' '
 	grep "loosen_unused_packed_objects/loosened:0" trace
 '
 
+test_expect_success 'index-pack does not lazy-fetch
+		     when checking for sha1 collsions' '
+	rm -rf server client another-remote &&
+
+	git init server &&
+	echo "line" >server/file &&
+	git -C server add file &&
+	git -C server commit -am "file" &&
+	git -C server config --local uploadpack.allowFilter 1 &&
+	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
+
+	git clone --no-checkout --filter=blob:none \
+				"file://$(pwd)/server" client &&
+	git -C client config extensions.partialClone 1 &&
+	git -C client config remote.origin.promisor 1 &&
+
+	git clone "file://$(pwd)/server" another-remote &&
+
+	echo "new line" >server/new-file &&
+	git -C server add new-file &&
+	git -C server commit -am "new-file" &&
+	HASH=$(git -C server hash-object new-file) &&
+
+	git -C another-remote pull &&
+
+	# Try to fetch so that "client" will have to do a collision-check.
+	# This should, however, not fetch "new-file" because "client" is a
+	# partial clone.
+	GIT_TRACE_PACKET=git -C client fetch \
+				"file://$(pwd)/another-remote" main &&
+	! grep "want $HASH" trace
+'
+
 test_expect_success 'lazy-fetch in submodule succeeds' '
 	# setup
 	test_config_global protocol.file.allow always &&