diff mbox series

[v4,4/6] pack-objects: generate cruft packs at most one object over threshold

Message ID f2ca92245ada74825806b50f786aab312275fd85.1741648467.git.me@ttaylorr.com (mailing list archive)
State New
Headers show
Series pack-objects: freshen objects with multi-cruft packs | expand

Commit Message

Taylor Blau March 11, 2025, 12:21 a.m. UTC
When generating multiple cruft packs with 'git repack --max-cruft-size',
we use 'git pack-objects --cruft --max-pack-size' (with many other
elided options), filling in the '--max-pack-size' value with whatever
was provided via the '--max-cruft-size' flag.

This causes us to generate a pack that is smaller than the specified
threshold. This poses a problem since we will never be able to generate
a cruft pack that crosses the threshold. In effect, this means that we
will try and repack its contents over and over again.

Instead, change the meaning of '--max-pack-size' in pack-objects when
combined with '--cruft'. When put together, '--max-pack-size' allows the
pack to grow larger than the specified threshold, but only by one
additional object.

This allows cruft packs to become just a little bit larger than the
threshold, allowing cruft packs to accumulate past the set threshold and
avoid being repacked in the future until a pruning GC takes place.

Noticed-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.adoc      |  4 ++
 Documentation/git-pack-objects.adoc |  4 ++
 builtin/pack-objects.c              | 32 +++++++++++++-
 t/t5329-pack-objects-cruft.sh       | 67 +++++++++++++++++++++++++++++
 4 files changed, 105 insertions(+), 2 deletions(-)

Comments

Junio C Hamano March 11, 2025, 9:59 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> When generating multiple cruft packs with 'git repack --max-cruft-size',
> we use 'git pack-objects --cruft --max-pack-size' (with many other
> elided options), filling in the '--max-pack-size' value with whatever
> was provided via the '--max-cruft-size' flag.
>
> This causes us to generate a pack that is smaller than the specified
> threshold. This poses a problem since we will never be able to generate
> a cruft pack that crosses the threshold.

So far I see absolutely *NO* problem described in the above.  The
user said "I want to chop them into 200MB pieces but do not exceed
the threshold" and the system honored that wish.

> In effect, this means that we
> will try and repack its contents over and over again.

The end effect however may be problematic, but isn't it due to the
way when to repack is determined?  You see 199MB piece of cruft pack
plus some other cruft data.  You have generated no new cruft and no
existing cruft expired out, but you do not know these facts until
you try to repack.  Because 200MB is the limit, you include the
199MB one as part of the ones to be recombined into the new cruft
pack because 199MB is smaller than 200MB and you do not know that
the reason why it is 199MB is because the earlier repack operation
found all remaining cruft material to be larger than 1MB; if there
were a 0.5MB cruft, it may have made it closer to 200MB.

So would it be feasible to remember how 199MB cruft pack is lying in
the object store (i.e. earlier we packed as much as possible), and
add a logic that says "if there is nothing to expire out of this
one, do not attempt to repack---this is fine as-is"?

> Instead, change the meaning of '--max-pack-size' in pack-objects when
> combined with '--cruft'. When put together, '--max-pack-size' allows the
> pack to grow larger than the specified threshold, but only by one
> additional object.

I do not think that would work well.  You have no control over the
size of that one additional object---it may weigh more than 100MB,
combining your 199MB cruft pack with something else to make it ~300MB
cruft.  In other words, "just a little bit larger" sounds like a
wishful thinking handwaving.
diff mbox series

Patch

diff --git a/Documentation/config/pack.adoc b/Documentation/config/pack.adoc
index da527377fa..0a90931b93 100644
--- a/Documentation/config/pack.adoc
+++ b/Documentation/config/pack.adoc
@@ -119,6 +119,10 @@  sizes (e.g., removable media that cannot store the whole repository),
 you are likely better off creating a single large packfile and splitting
 it using a generic multi-volume archive tool (e.g., Unix `split`).
 +
+When generating cruft packs with `git pack-objects`, this option has a
+slightly different interpretation than above; see the documentation for
+`--max-pack-size` option in linkgit:git-pack-objects[1].
++
 The minimum size allowed is limited to 1 MiB. The default is unlimited.
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
index 7f69ae4855..aee467c496 100644
--- a/Documentation/git-pack-objects.adoc
+++ b/Documentation/git-pack-objects.adoc
@@ -161,6 +161,10 @@  depth is 4095.
 	`pack.packSizeLimit` is set. Note that this option may result in
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
++
+When used with `--cruft`, the output packfile(s) may be as large or
+larger than the configured maximum size. The pack will exceed the
+specified maximum by no more than one object.
 
 --honor-pack-keep::
 	This flag causes an object already in a local pack that
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b16126..f701b4c9ec 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -692,11 +692,39 @@  static off_t write_object(struct hashfile *f,
 	off_t len;
 	int usable_delta, to_reuse;
 
+	if (cruft && pack_size_limit && pack_size_limit <= write_offset) {
+		/*
+		 * When writing a cruft pack with a limited size,
+		 * perform the --max-pack-size check *before* writing
+		 * the object.
+		 *
+		 * When we have not yet reached the size limit, this
+		 * combined with the fact that we act as if there is no
+		 * limit when writing objects via write_object() allows
+		 * us to grow one object *past* the specified limit.
+		 *
+		 * This is important for generating cruft packs with a
+		 * --max-pack-size so we can generate packs that are
+		 * just over the threshold to avoid repacking them in
+		 * the future.
+		 */
+		return 0;
+	}
+
 	if (!pack_to_stdout)
 		crc32_begin(f);
 
-	/* apply size limit if limited packsize and not first object */
-	if (!pack_size_limit || !nr_written)
+	/*
+	 * Apply size limit when one is provided, with the following
+	 * exceptions:
+	 *
+	 * - We are writing the first object.
+	 *
+	 * - We are writing a cruft pack with a size limit. The check
+	 *   above covers this case while letting the pack grow at most
+	 *   one object beyond the limit.
+	 */
+	if (!pack_size_limit || !nr_written || cruft)
 		limit = 0;
 	else if (pack_size_limit <= write_offset)
 		/*
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 60dac8312d..9cbc21a65d 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -3,6 +3,7 @@ 
 test_description='cruft pack related pack-objects tests'
 
 . ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-cruft.sh
 
 objdir=.git/objects
 packdir=$objdir/pack
@@ -695,4 +696,70 @@  test_expect_success 'additional cruft blobs via gc.recentObjectsHook' '
 	)
 '
 
+test_expect_success 'cruft pack generation beyond --max-pack-size' '
+	test_when_finished "rm -fr repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		# Disable pack compression to ensure the pack size is
+		# predictable.
+		git config pack.compression 0 &&
+
+		sz=524288 && # 0.5 MiB
+		foo="$(generate_random_blob foo $sz)" &&
+		bar="$(generate_random_blob bar $sz)" &&
+		baz="$(generate_random_blob baz $sz)" &&
+		quux="$(generate_random_blob quux $sz)" &&
+
+		printf "%s\n" "$foo" "$bar" >A.objects &&
+		printf "%s\n" "$baz" "$quux" >B.objects &&
+
+		A="$(git pack-objects $packdir/pack <A.objects)" &&
+		B="$(git pack-objects $packdir/pack <B.objects)" &&
+
+		git prune-packed &&
+
+		sz=1572864 && # 1.5 MiB
+		printf -- "-%s\n" "pack-$A.pack" "pack-$B.pack" >C.in &&
+		git pack-objects --cruft --max-pack-size=$sz $packdir/pack \
+			<C.in >C.out &&
+
+		test_line_count = 2 C.out &&
+		C_large="$(head -n 1 C.out)" &&
+		C_small="$(tail -n 1 C.out)" &&
+
+		# Swap $C_large and $C_small if necessary.
+		if test "$(test_file_size $packdir/pack-$C_large.idx)" -lt \
+			"$(test_file_size $packdir/pack-$C_small.idx)"
+		then
+			tmp="$C_large" &&
+			C_large="$C_small" &&
+			C_small="$tmp"
+		fi &&
+
+		# Ensure the large pack is no smaller than the threshold
+		# such that it does not get repacked in subsequent runs
+		# with the same --max-pack-size setting.
+		test $(test_file_size $packdir/pack-$C_large.pack) -ge $sz &&
+
+		{
+			git show-index <"$packdir/pack-$C_large.idx" &&
+			git show-index <"$packdir/pack-$C_small.idx"
+		} >actual.raw &&
+		printf "%s\n" "$foo" "$bar" "$baz" "$quux" >expect.raw &&
+
+		sort <expect.raw >expect &&
+		cut -d " " -f 2 actual.raw | sort >actual &&
+
+		# Ensure that all of the objects are present in the two
+		# cruft packs we just generated.
+		#
+		# Note that the contents of "actual" are not
+		# de-duplicated. This is intentional to ensure we avoid
+		# packing the same object twice (once in each pack).
+		test_cmp expect actual
+	)
+'
+
 test_done