diff mbox series

[03/10] builtin/gc.c: ignore cruft packs with `--keep-largest-pack`

Message ID 796df920ad6af0ee9101a0f3f80edbc793987336.1681764848.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series gc: enable cruft packs by default | expand

Commit Message

Taylor Blau April 17, 2023, 8:54 p.m. UTC
When cruft packs were implemented, we never adjusted the code for `git
gc`'s `--keep-largest-pack` and `gc.bigPackThreshold` to ignore cruft
packs. This option and configuration option share a common
implementation, but including cruft packs is wrong in both cases:

  - Running `git gc --keep-largest-pack` in a repository where the
    largest pack is the cruft pack itself will make it impossible for
    `git gc` to prune objects, since the cruft pack itself is kept.

  - The same is true for `gc.bigPackThreshold`, if the size of the cruft
    pack exceeds the limit set by the caller.

Ignore cruft packs in the common implementation for both of these
options, and add a pair of tests to prevent any future regressions here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt | 10 ++++-----
 Documentation/git-gc.txt    |  7 +++---
 builtin/gc.c                |  2 +-
 t/t6500-gc.sh               | 43 +++++++++++++++++++++++++++++++++++++
 4 files changed, 53 insertions(+), 9 deletions(-)

Comments

Junio C Hamano April 17, 2023, 10:54 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> When cruft packs were implemented, we never adjusted the code for `git
> gc`'s `--keep-largest-pack` and `gc.bigPackThreshold` to ignore cruft
> packs. This option and configuration option share a common
> implementation, but including cruft packs is wrong in both cases:
>
>   - Running `git gc --keep-largest-pack` in a repository where the
>     largest pack is the cruft pack itself will make it impossible for
>     `git gc` to prune objects, since the cruft pack itself is kept.

Makes sense.  We want to keep the largest pack that is actually in
use, and we want to consolidate other non-cruft packs into one.

>   - The same is true for `gc.bigPackThreshold`, if the size of the cruft
>     pack exceeds the limit set by the caller.

This is not as cut-and-dried clear as the previous one.  "This pack
is so large that it is not worth rewriting it only to expunge a
handful of objects that are no longer reachable from it" is the main
motivation to use this configuration, but doesn't some part of the
same reasoning apply equally to a large cruft pack?  But let's
assume that the configuration is totally irrelevant to cruft packs
and read on.

>  --keep-largest-pack::
> -	All packs except the largest pack and those marked with a
> -	`.keep` files are consolidated into a single pack. When this
> -	option is used, `gc.bigPackThreshold` is ignored.
> +	All packs except the largest pack, any packs marked with a
> +	`.keep` file, and any cruft pack(s) are consolidated into a
> +	single pack. When this option is used, `gc.bigPackThreshold` is
> +	ignored.

"except the largest pack" -> "except the largest, non-cruft pack"
Taylor Blau April 17, 2023, 11:03 p.m. UTC | #2
On Mon, Apr 17, 2023 at 03:54:35PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >   - The same is true for `gc.bigPackThreshold`, if the size of the cruft
> >     pack exceeds the limit set by the caller.
>
> This is not as cut-and-dried clear as the previous one.  "This pack
> is so large that it is not worth rewriting it only to expunge a
> handful of objects that are no longer reachable from it" is the main
> motivation to use this configuration, but doesn't some part of the
> same reasoning apply equally to a large cruft pack?  But let's
> assume that the configuration is totally irrelevant to cruft packs
> and read on.

This is an inherent design trade-off. I imagine that callers who want to
avoid rewriting their (large) cruft packs would prefer to generate a new
cruft pack on top with just the recently accumulated unreachable
objects.

That kind of works, except if you need to prune objects that are packed
in an earlier cruft pack. If you have `gc.bigPackThreshold`, there is no
way to do this if you need to expire objects that are in cruft packs
above that threshold.

A user may find themselves frustrated when trying to `git gc --prune`
some sensitive object(s) from their repository doesn't appear to work,
only to discover that `gc.bigPackThreshold` is set somewhere in their
configuration.

Writing (largely) the same cruft pack to expunge a few objects isn't
ideal, but it is better than the status quo. And if you have so many
unreachable objects that this is a concern, it is probably time to prune
anyway.

It is possible that in the future we could support writing multiple
cruft packs (we already handle the presence of multiple cruft packs
fine, just don't expose an easy way for the user to write >1 of them).
And at that point we would be able to relax this patch a bit and allow
`gc.bigPackThreshold` to cover cruft packs, too. But in the meantime,
the benefit of avoiding loose object explosions outweighs the possible
drawbacks here, IMHO.

> >  --keep-largest-pack::
> > -	All packs except the largest pack and those marked with a
> > -	`.keep` files are consolidated into a single pack. When this
> > -	option is used, `gc.bigPackThreshold` is ignored.
> > +	All packs except the largest pack, any packs marked with a
> > +	`.keep` file, and any cruft pack(s) are consolidated into a
> > +	single pack. When this option is used, `gc.bigPackThreshold` is
> > +	ignored.
>
> "except the largest pack" -> "except the largest, non-cruft pack"

Indeed, good eyes.

Thanks,
Taylor
Jeff King April 18, 2023, 10:39 a.m. UTC | #3
On Mon, Apr 17, 2023 at 07:03:08PM -0400, Taylor Blau wrote:

> On Mon, Apr 17, 2023 at 03:54:35PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > >   - The same is true for `gc.bigPackThreshold`, if the size of the cruft
> > >     pack exceeds the limit set by the caller.
> >
> > This is not as cut-and-dried clear as the previous one.  "This pack
> > is so large that it is not worth rewriting it only to expunge a
> > handful of objects that are no longer reachable from it" is the main
> > motivation to use this configuration, but doesn't some part of the
> > same reasoning apply equally to a large cruft pack?  But let's
> > assume that the configuration is totally irrelevant to cruft packs
> > and read on.
> 
> This is an inherent design trade-off. I imagine that callers who want to
> avoid rewriting their (large) cruft packs would prefer to generate a new
> cruft pack on top with just the recently accumulated unreachable
> objects.
> 
> That kind of works, except if you need to prune objects that are packed
> in an earlier cruft pack. If you have `gc.bigPackThreshold`, there is no
> way to do this if you need to expire objects that are in cruft packs
> above that threshold.
> 
> A user may find themselves frustrated when trying to `git gc --prune`
> some sensitive object(s) from their repository doesn't appear to work,
> only to discover that `gc.bigPackThreshold` is set somewhere in their
> configuration.
> 
> Writing (largely) the same cruft pack to expunge a few objects isn't
> ideal, but it is better than the status quo. And if you have so many
> unreachable objects that this is a concern, it is probably time to prune
> anyway.

Yeah, what your patch does makes sense to me as a default behavior. In a
pre-cruft-pack world, those objects would all be left alone by
gc.bigPackThreshol (because they're loose), and the essence of
cruft-packs is creating a parallel universe where those ejected-to-loose
objects just happen to be stored in a more efficient format.

> It is possible that in the future we could support writing multiple
> cruft packs (we already handle the presence of multiple cruft packs
> fine, just don't expose an easy way for the user to write >1 of them).
> And at that point we would be able to relax this patch a bit and allow
> `gc.bigPackThreshold` to cover cruft packs, too. But in the meantime,
> the benefit of avoiding loose object explosions outweighs the possible
> drawbacks here, IMHO.

I wondered if that interface might be an option to say "hey, I have a
gigantic cruft file I want to carry forward, please leave it alone".

But if you have a giant cruft pack that is making your "git gc" too
slow, it will eventually age out on its own. And if you're impatient,
then "git gc --prune=now" is probably the right tool.

And If you really did want to keep rolling it forward for some reason,
then I'd think marking it with ".keep" would be the best thing (and
maybe even dropping the mtimes file? I'm not sure a how a kept-cruft
pack does or should behave).

-Peff
Derrick Stolee April 18, 2023, 2:54 p.m. UTC | #4
On 4/18/2023 6:39 AM, Jeff King wrote:
> On Mon, Apr 17, 2023 at 07:03:08PM -0400, Taylor Blau wrote:

I agree with the prior discussion that gc.bigPackThreshold is
currently misbehaving and stopping it from caring about cruft packs
is the best way to fix that behavior in this series.

>> It is possible that in the future we could support writing multiple
>> cruft packs (we already handle the presence of multiple cruft packs
>> fine, just don't expose an easy way for the user to write >1 of them).
>> And at that point we would be able to relax this patch a bit and allow
>> `gc.bigPackThreshold` to cover cruft packs, too. But in the meantime,
>> the benefit of avoiding loose object explosions outweighs the possible
>> drawbacks here, IMHO.
> 
> I wondered if that interface might be an option to say "hey, I have a
> gigantic cruft file I want to carry forward, please leave it alone".
> 
> But if you have a giant cruft pack that is making your "git gc" too
> slow, it will eventually age out on its own. And if you're impatient,
> then "git gc --prune=now" is probably the right tool.
> 
> And If you really did want to keep rolling it forward for some reason,
> then I'd think marking it with ".keep" would be the best thing (and
> maybe even dropping the mtimes file? I'm not sure a how a kept-cruft
> pack does or should behave).

Generally, it's probably a good idea to (later) create a separate knob
for "don't rewrite the objects in a 'big' cruft pack unless you need
to". For situations where cruft objects are being collected and not
regularly pruned, this helps avoid repacking all unreachable objects
into a giant single pack, even though only a small number of objects
were discovered unreachable this time.

The important times where we'd want to consider a 'big' cruft pack
for rewrite are:

 1. Some objects in the cruft pack are being pruned.
 2. Some objects in the cruft pack need updated mtimes.

However, in the typical case that we are adding new cruft objects
and not changing the mtimes of existing unreachable objects, we could
create a sensible limit on the size of a cruft pack to be rewritten
during normal maintenance.

My personal preference would be something between 2GB and 10GB, which
seems like a decent balance between "size of cruft pack" and "number of
cruft packs" for most repositories. Since none of the objects are
reachable, we don't really care about them having good deltas for things
like fetches and clones. The benefit of reducing the time spent in 'git
repack --cruft' outweighs the slight disk space savings by having a
single cruft pack, in my opinion.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 38fea076a2..8d5353e9e0 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -43,11 +43,11 @@  gc.autoDetach::
 	if the system supports it. Default is true.
 
 gc.bigPackThreshold::
-	If non-zero, all packs larger than this limit are kept when
-	`git gc` is run. This is very similar to `--keep-largest-pack`
-	except that all packs that meet the threshold are kept, not
-	just the largest pack. Defaults to zero. Common unit suffixes of
-	'k', 'm', or 'g' are supported.
+	If non-zero, all non-cruft packs larger than this limit are kept
+	when `git gc` is run. This is very similar to
+	`--keep-largest-pack` except that all non-cruft packs that meet
+	the threshold are kept, not just the largest pack. Defaults to
+	zero. Common unit suffixes of 'k', 'm', or 'g' are supported.
 +
 Note that if the number of kept packs is more than gc.autoPackLimit,
 this configuration variable is ignored, all packs except the base pack
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index a65c9aa62d..2427478314 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -77,9 +77,10 @@  be performed as well.
 	instance running on this repository.
 
 --keep-largest-pack::
-	All packs except the largest pack and those marked with a
-	`.keep` files are consolidated into a single pack. When this
-	option is used, `gc.bigPackThreshold` is ignored.
+	All packs except the largest pack, any packs marked with a
+	`.keep` file, and any cruft pack(s) are consolidated into a
+	single pack. When this option is used, `gc.bigPackThreshold` is
+	ignored.
 
 AGGRESSIVE
 ----------
diff --git a/builtin/gc.c b/builtin/gc.c
index edd98d35a5..53ef137e1d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -219,7 +219,7 @@  static struct packed_git *find_base_packs(struct string_list *packs,
 	struct packed_git *p, *base = NULL;
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
-		if (!p->pack_local)
+		if (!p->pack_local || p->is_cruft)
 			continue;
 		if (limit) {
 			if (p->pack_size >= limit)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index d9acb63951..df6f2e6e52 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -298,6 +298,49 @@  test_expect_success 'feature.experimental=false avoids cruft packs by default' '
 	)
 '
 
+test_expect_success '--keep-largest-pack ignores cruft packs' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		# Generate a pack for reachable objects (of which there
+		# are 3), and one for unreachable objects (of which
+		# there are 6).
+		prepare_cruft_history &&
+		git gc --cruft &&
+
+		mtimes="$(find .git/objects/pack -type f -name "pack-*.mtimes")" &&
+		sz="$(test_file_size "${mtimes%.mtimes}.pack")" &&
+
+		# Ensure that the cruft pack gets removed (due to
+		# `--prune=now`) despite it being the largest pack.
+		git -c gc.bigPackThreshold=$sz gc --cruft --prune=now &&
+
+		assert_no_cruft_packs
+	)
+'
+
+test_expect_success 'gc.bigPackThreshold ignores cruft packs' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		# Generate a pack for reachable objects (of which there
+		# are 3), and one for unreachable objects (of which
+		# there are 6).
+		prepare_cruft_history &&
+		git gc --cruft &&
+
+		# Ensure that the cruft pack gets removed (due to
+		# `--prune=now`) despite it being the largest pack.
+		git gc --cruft --prune=now --keep-largest-pack &&
+
+		assert_no_cruft_packs
+	)
+'
+
 run_and_wait_for_auto_gc () {
 	# We read stdout from gc for the side effect of waiting until the
 	# background gc process exits, closing its fd 9.  Furthermore, the