diff mbox series

[2/2] repack: avoid loosening promisor pack objects in partial clones

Message ID 20210414191403.4387-3-rafaeloliveira.cs@gmail.com (mailing list archive)
State Superseded
Headers show
Series prevent `repack` to unpack and delete promisor objects | expand

Commit Message

Rafael Silva April 14, 2021, 7:14 p.m. UTC
When `-A` and `-d` are used together, besides packing all objects (-A)
and removing redundant packs (-d), it also unpack all unreachable
objects and deletes them by calling `git pruned-packed`. For a partial
clone, that contains unreferenced objects, this results in unpacking
all "promisor" objects and deleting them right after, which
unnecessarily increases the `repack` execution time and disk usage
during the unpacking of the objects.

For instance, a partially cloned repository that filters all the blob
objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
blobs into the filesystem that, depending on the repo size, makes
nearly impossible to repack the operation before running out of disk.

For a partial clone, `git repack` calls `git pack-objects` twice: (1)
for handle the "promisor" objects and (2) for performing the repack
with --exclude-promisor-objects option, that results in unpacking and
deleting of the objects. Given that we actually should keep the
promisor objects, let's teach `repack` to tell `pack-objects` to
--keep the old "promisor" pack file.

The --keep-pack option takes only a packfile name, but we concatenate
both the path and the name in a single string. Instead, let's split
them into separate string in order to easily pass the packfile name
later.

Additionally, add a new perf test to evaluate the performance
impact made by this changes (tested on git.git):

    Test            HEAD^                 HEAD
    ------------------------------------------------------------
    5600.5: gc      137.67(42.48+93.64)   8.08(6.91+1.45) -94.1%

In this particular script, the improvement is big because every
object in the newly-cloned partial repository is a promisor object.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
---
 builtin/repack.c              | 9 +++++++--
 t/perf/p5600-partial-clone.sh | 4 ++++
 t/t5616-partial-clone.sh      | 9 +++++++++
 3 files changed, 20 insertions(+), 2 deletions(-)

Comments

Jonathan Tan April 15, 2021, 1:04 a.m. UTC | #1
> When `-A` and `-d` are used together, besides packing all objects (-A)
> and removing redundant packs (-d), it also unpack all unreachable
> objects and deletes them by calling `git pruned-packed`.

I still think of these objects as not unreachable, even though I know
that pack-objects calls them that (the argument is called
--unpack-unreachable). So I would say "it also loosens all objects that
were previously packed but did not go into the new pack", but perhaps
this is OK too.

> For a partial
> clone, that contains unreferenced objects, this results in unpacking
> all "promisor" objects and deleting them right after, which
> unnecessarily increases the `repack` execution time and disk usage
> during the unpacking of the objects.

I think that the commit message also needs to explain that we're
deleting the promisor objects immediately because they happen to be in a
promisor pack. So perhaps this whole part could be written as follows:

  When "git repack -Ad" is run in a partial clone, "pack-objects" is
  invoked twice: once to repack all promisor objects, and once to repack
  all non-promisor objects. The latter "pack-objects" invocation is with
  --exclude-promisor-objects and --unpack-unreachable, which loosens all
  unused objects. Unfortunately, this includes promisor objects.

  Because the "-d" argument to "git repack" subsequently deletes all
  loose objects also in packs, these just-loosened promisor objects will
  be immediately deleted. But this extra disk churn is unnecessary in
  the first place.

> For instance, a partially cloned repository that filters all the blob
> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
> blobs into the filesystem that, depending on the repo size, makes
> nearly impossible to repack the operation before running out of disk.
> 
> For a partial clone, `git repack` calls `git pack-objects` twice: (1)
> for handle the "promisor" objects and (2) for performing the repack
> with --exclude-promisor-objects option, that results in unpacking and
> deleting of the objects. Given that we actually should keep the
> promisor objects, let's teach `repack` to tell `pack-objects` to
> --keep the old "promisor" pack file.

It's not this call (2) that results in any deleting of the objects, but
the later call to prune_packed_objects(). Also, promisor objects are
kept regardless of what we pass to "pack-objects" here (the keeping is
done separately). Maybe write (continuation from my suggestion above):

  In order to avoid this extra disk churn, pass the names of the
  promisor packfiles as "--keep-pack" arguments to this
  second invocation of "pack-objects". This informs "pack-objects" that
  the promisor objects are already in a safe packfile and, therefore, do
  not need to be loosened.

> @@ -533,7 +533,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	}
>  
>  	packdir = mkpathdup("%s/pack", get_object_directory());
> -	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
>  
>  	sigchain_push_common(remove_pack_on_signal);
>  

Normally I would be concerned that packtmp_name is not freed, but in
this case, it's a static variable (same as packtmp).

> @@ -576,6 +577,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			for_each_string_list_item(item, &names) {
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     packtmp_name, item->string);
> +			}

Git style is to not have braces for single-statement loops.

> +test_expect_success 'repack does not loose all objects' '
> +	rm -rf client &&
> +	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
> +	test_when_finished "rm -rf client" &&
> +	git -C client repack -A -l -d --no-prune-packed &&
> +	git -C client count-objects -v >object-count &&
> +	grep "^prune-packable: 0" object-count
> +'

s/loose all objects/loosen promisor objects/

Also, add a comment describing why we have "--no-prune-packed" there
(probably something about not pruning any loose objects that are already
in packs, so that we can verify that no redundant loose objects are
being created in the first place).
Junio C Hamano April 15, 2021, 3:51 a.m. UTC | #2
Jonathan Tan <jonathantanmy@google.com> writes:

>> When `-A` and `-d` are used together, besides packing all objects (-A)
>> and removing redundant packs (-d), it also unpack all unreachable
>> objects and deletes them by calling `git pruned-packed`.
>
> I still think of these objects as not unreachable, even though I know
> that pack-objects calls them that (the argument is called
> --unpack-unreachable). So I would say "it also loosens all objects that
> were previously packed but did not go into the new pack", but perhaps
> this is OK too.

Hmph, that is puzzling.  I understand that the operation about

 (1) finding all the objects that are still reachable and send them
     into a newly created pack, and

 (2) among the objects that were previously in the packs, eject
     those that weren't made into the new pack with the previous
     point.

Where did I get it wrong?  If all the reachable ones are dealt with
with the first point, what is leftover is not reachable, no?
Jeff King April 15, 2021, 9:03 a.m. UTC | #3
On Wed, Apr 14, 2021 at 08:51:02PM -0700, Junio C Hamano wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> >> When `-A` and `-d` are used together, besides packing all objects (-A)
> >> and removing redundant packs (-d), it also unpack all unreachable
> >> objects and deletes them by calling `git pruned-packed`.
> >
> > I still think of these objects as not unreachable, even though I know
> > that pack-objects calls them that (the argument is called
> > --unpack-unreachable). So I would say "it also loosens all objects that
> > were previously packed but did not go into the new pack", but perhaps
> > this is OK too.
> 
> Hmph, that is puzzling.  I understand that the operation about
> 
>  (1) finding all the objects that are still reachable and send them
>      into a newly created pack, and
> 
>  (2) among the objects that were previously in the packs, eject
>      those that weren't made into the new pack with the previous
>      point.
> 
> Where did I get it wrong?  If all the reachable ones are dealt with
> with the first point, what is leftover is not reachable, no?

Right. I think your understanding is correct, and the commit message is
a bit confused. Normally after we eject loose objects, they'd stay there
(a follow-up git-gc may run git-prune and delete them, though if they
were recent enough not to just drop completely during the repack, then
git-prune would likewise leave them be). Talking about prune-packed here
is misleading, because it usually has nothing to do with these objects.

What makes the partial-clone situation under discussion interesting is
that the objects _are_ reachable. They are excluded from the new pack
because we put them in a separate promisor pack. But we erroneously turn
them loose, rather than realizing that they were excluded for a
different reason.

So the fundamental bug is that we turn them loose at all. What makes the
bug trickier to see is that when we run prune-packed afterwards, we then
clean up the evidence of the bug (so it looks more like a performance
problem than a correctness one).

-Peff
Jeff King April 15, 2021, 9:05 a.m. UTC | #4
On Wed, Apr 14, 2021 at 06:04:54PM -0700, Jonathan Tan wrote:

> > @@ -576,6 +577,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  		repack_promisor_objects(&po_args, &names);
> >  
> >  		if (existing_packs.nr && delete_redundant) {
> > +			for_each_string_list_item(item, &names) {
> > +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> > +					     packtmp_name, item->string);
> > +			}
> 
> Git style is to not have braces for single-statement loops.

It is, though given that for_each_string_list_item() is a weird macro
instead of a regular for-loop, IMHO it makes things more obvious to have
the braces.

(All the rest of your comments seemed quite good to me).

-Peff
Junio C Hamano April 15, 2021, 6:06 p.m. UTC | #5
Rafael Silva <rafaeloliveira.cs@gmail.com> writes:

> For instance, a partially cloned repository that filters all the blob
> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
> blobs into the filesystem that, depending on the repo size, makes
> nearly impossible to repack the operation before running out of disk.

Could you clarify this paragraph a bit more?  It is unclear why the
repository has "all the blob objects" that it can loosen with repack
in the first place, if it was cloned without any blob.  Do you mean
that repack does not stop at the promisor pack boundary and instead
lazily download blobs "on-demand", which ends up as loose objects?

Thanks.
Rafael Silva April 18, 2021, 7:12 a.m. UTC | #6
Jonathan Tan <jonathantanmy@google.com> writes:

>> For a partial
>> clone, that contains unreferenced objects, this results in unpacking
>> all "promisor" objects and deleting them right after, which
>> unnecessarily increases the `repack` execution time and disk usage
>> during the unpacking of the objects.
>
> I think that the commit message also needs to explain that we're
> deleting the promisor objects immediately because they happen to be in a
> promisor pack. So perhaps this whole part could be written as follows:
>
>   When "git repack -Ad" is run in a partial clone, "pack-objects" is
>   invoked twice: once to repack all promisor objects, and once to repack
>   all non-promisor objects. The latter "pack-objects" invocation is with
>   --exclude-promisor-objects and --unpack-unreachable, which loosens all
>   unused objects. Unfortunately, this includes promisor objects.
>
>   Because the "-d" argument to "git repack" subsequently deletes all
>   loose objects also in packs, these just-loosened promisor objects will
>   be immediately deleted. But this extra disk churn is unnecessary in
>   the first place.
>

Thanks for suggesting this message, obviously it's much better than
mine specially because removes the confusion that I made when writing
the message about the `pack-objects` and `prune-packed`.

I'll update patch's message on the next revision.

>> For instance, a partially cloned repository that filters all the blob
>> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
>> blobs into the filesystem that, depending on the repo size, makes
>> nearly impossible to repack the operation before running out of disk.
>> 
>> For a partial clone, `git repack` calls `git pack-objects` twice: (1)
>> for handle the "promisor" objects and (2) for performing the repack
>> with --exclude-promisor-objects option, that results in unpacking and
>> deleting of the objects. Given that we actually should keep the
>> promisor objects, let's teach `repack` to tell `pack-objects` to
>> --keep the old "promisor" pack file.
>
> It's not this call (2) that results in any deleting of the objects, but
> the later call to prune_packed_objects(). Also, promisor objects are
> kept regardless of what we pass to "pack-objects" here (the keeping is
> done separately). Maybe write (continuation from my suggestion above):
>
>   In order to avoid this extra disk churn, pass the names of the
>   promisor packfiles as "--keep-pack" arguments to this
>   second invocation of "pack-objects". This informs "pack-objects" that
>   the promisor objects are already in a safe packfile and, therefore, do
>   not need to be loosened.
>

You're right, the second call only loosens the object and the deleting
comes after it, which makes my message misleading. 

The suggested paragraph explain better the situation, thanks for
suggesting it and will replace mine in the next revision.

>> @@ -533,7 +533,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  	}
>>  
>>  	packdir = mkpathdup("%s/pack", get_object_directory());
>> -	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
>> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
>> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
>>  
>>  	sigchain_push_common(remove_pack_on_signal);
>>  
>
> Normally I would be concerned that packtmp_name is not freed, but in
> this case, it's a static variable (same as packtmp).
>

Indeed.

>> @@ -576,6 +577,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  		repack_promisor_objects(&po_args, &names);
>>  
>>  		if (existing_packs.nr && delete_redundant) {
>> +			for_each_string_list_item(item, &names) {
>> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
>> +					     packtmp_name, item->string);
>> +			}
>
> Git style is to not have braces for single-statement loops.
>

I preferred to keep the braces simply because the
for_each_string_list_item() is macro instead of a normal for-loop
statement. 

>> +test_expect_success 'repack does not loose all objects' '
>> +	rm -rf client &&
>> +	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
>> +	test_when_finished "rm -rf client" &&
>> +	git -C client repack -A -l -d --no-prune-packed &&
>> +	git -C client count-objects -v >object-count &&
>> +	grep "^prune-packable: 0" object-count
>> +'
>
> s/loose all objects/loosen promisor objects/
>
> Also, add a comment describing why we have "--no-prune-packed" there
> (probably something about not pruning any loose objects that are already
> in packs, so that we can verify that no redundant loose objects are
> being created in the first place).

Thanks for the reviewing and helping clarify the patch intentions. I'll
address this comments on the next revision.
Rafael Silva April 18, 2021, 8:40 a.m. UTC | #7
Junio C Hamano <gitster@pobox.com> writes:

> Rafael Silva <rafaeloliveira.cs@gmail.com> writes:
>
>> For instance, a partially cloned repository that filters all the blob
>> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
>> blobs into the filesystem that, depending on the repo size, makes
>> nearly impossible to repack the operation before running out of disk.
>
> Could you clarify this paragraph a bit more?  It is unclear why the
> repository has "all the blob objects" that it can loosen with repack
> in the first place, if it was cloned without any blob.  Do you mean
> that repack does not stop at the promisor pack boundary and instead
> lazily download blobs "on-demand", which ends up as loose objects?
>
> Thanks.

I should have written that we "unpack (or better turn loose) the 
promisor objects" not the `blob` objects. Instead, I should have
written something like:

    ... ends up loosening all promisor objects, on this case all
    the `trees` and `commits` objects, into the filesystem ...

Apologise for the confusion and this misleading message. I'll clarify
this in the v2.
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index 6baaeb979c..0ecd76b79c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -20,7 +20,7 @@  static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
-static char *packdir, *packtmp;
+static char *packdir, *packtmp_name, *packtmp;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -533,7 +533,8 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	packdir = mkpathdup("%s/pack", get_object_directory());
-	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
+	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
+	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
 
 	sigchain_push_common(remove_pack_on_signal);
 
@@ -576,6 +577,10 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
+			for_each_string_list_item(item, &names) {
+				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+					     packtmp_name, item->string);
+			}
 			if (unpack_unreachable) {
 				strvec_pushf(&cmd.args,
 					     "--unpack-unreachable=%s",
diff --git a/t/perf/p5600-partial-clone.sh b/t/perf/p5600-partial-clone.sh
index ca785a3341..a965f2c4d6 100755
--- a/t/perf/p5600-partial-clone.sh
+++ b/t/perf/p5600-partial-clone.sh
@@ -35,4 +35,8 @@  test_perf 'count non-promisor commits' '
 	git -C bare.git rev-list --all --count --exclude-promisor-objects
 '
 
+test_perf 'gc' '
+	git -C bare.git gc
+'
+
 test_done
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 5cb415386e..de77822735 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -548,6 +548,15 @@  test_expect_success 'fetch from a partial clone, protocol v2' '
 	grep "version 2" trace
 '
 
+test_expect_success 'repack does not loose all objects' '
+	rm -rf client &&
+	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
+	test_when_finished "rm -rf client" &&
+	git -C client repack -A -l -d --no-prune-packed &&
+	git -C client count-objects -v >object-count &&
+	grep "^prune-packable: 0" object-count
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd