diff mbox series

[v2] builtin/repack.c: avoid making cruft packs preferred

Message ID 035393935108d02aaf8927189b05102f4f74f340.1696370003.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series [v2] builtin/repack.c: avoid making cruft packs preferred | expand

Commit Message

Taylor Blau Oct. 3, 2023, 9:54 p.m. UTC
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.

But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.

In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.

If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.

Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
A small re-roll of the original patch, removing a few paragraphs from
the proposed log message that are irrelevant and impossible to produce
in today's git.git.

For more details, see <ZRyNHRf+tQwV+T6P@nand.local>.

Range-diff against v1:
1:  19d9aae08e ! 1:  0353939351 builtin/repack.c: avoid making cruft packs preferred
    @@ Commit message
         have to repack the entire repository during a `--geometric` repack, and
         the cruft pack would have to sort ahead of the pack we just created.

    -    Note that this behavior is usually just a performance regression. But
    -    it's possible it could be a correctness issue.
    -
    -    Suppose an object was duplicated among the cruft and non-cruft pack. The
    -    MIDX will pick the one from the pack with the lowest mtime, which will
    -    always be the cruft one. But if the non-cruft pack happens to sort
    -    earlier in lexical order, we'll treat that one as preferred, but not all
    -    duplicates will be resolved in favor of that pack.
    -
    -    So if we happened to have an object which appears in both packs
    -    (e.g., due to a cruft object being freshened, causing it to appear
    -    loose, and then repacking it via the `--geometric` repack) it's possible
    -    the duplicate would be picked from the non-preferred pack.
    -
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

      ## builtin/repack.c ##

 builtin/repack.c        | 47 ++++++++++++++++++++++++++++++++++++++++-
 t/t7704-repack-cruft.sh | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 1 deletion(-)

--
2.42.0.341.g9d86323b73.dirty

Comments

Jeff King Oct. 4, 2023, 1:13 p.m. UTC | #1
On Tue, Oct 03, 2023 at 05:54:19PM -0400, Taylor Blau wrote:

> A small re-roll of the original patch, removing a few paragraphs from
> the proposed log message that are irrelevant and impossible to produce
> in today's git.git.
> 
> For more details, see <ZRyNHRf+tQwV+T6P@nand.local>.

Thanks, this looks good to me, and the explanation in the linked message
makes sense.

-Peff
Patrick Steinhardt Oct. 5, 2023, 11:14 a.m. UTC | #2
On Tue, Oct 03, 2023 at 05:54:19PM -0400, Taylor Blau wrote:
[snip]
> @@ -801,6 +814,38 @@ static int write_midx_included_packs(struct string_list *include,
>  	if (preferred)
>  		strvec_pushf(&cmd.args, "--preferred-pack=%s",
>  			     pack_basename(preferred));
> +	else if (names->nr) {
> +		/* The largest pack was repacked, meaning that either
> +		 * one or two packs exist depending on whether the
> +		 * repository has a cruft pack or not.

Nit: this comment will grow stale soonish once your patch series lands
that introduces a maximum packfile size for cruft packs as there can be
arbitrarily many cruft packs from thereon.

> +		 * Select the non-cruft one as preferred to encourage
> +		 * pack-reuse among packs containing reachable objects
> +		 * over unreachable ones.
> +		 *
> +		 * (Note we could write multiple packs here if
> +		 * `--max-pack-size` was given, but any one of them
> +		 * will suffice, so pick the first one.)
> +		 */

Well, okay, you kind of acknowledge this here.

The rest of this patch series looks good to me and makes sense. I don't
really think that this comment here is worth a reroll.

Patrick
Eric Sunshine Oct. 5, 2023, 8:27 p.m. UTC | #3
On Thu, Oct 5, 2023 at 10:46 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Tue, Oct 03, 2023 at 05:54:19PM -0400, Taylor Blau wrote:
> > +             /* The largest pack was repacked, meaning that either
> > +              * one or two packs exist depending on whether the
> > +              * repository has a cruft pack or not.
>
> Nit: this comment will grow stale soonish once your patch series lands
> that introduces a maximum packfile size for cruft packs as there can be
> arbitrarily many cruft packs from thereon.
>
> The rest of this patch series looks good to me and makes sense. I don't
> really think that this comment here is worth a reroll.

If you do happen to reroll for some reason, though, perhaps take care
of the minor style violation, as well. Use:

    /*
     * Multi-line comment
     * style.
     */

rather than:

    /* Multi-line comment
     * style.
     */
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index 04770b15fe..a1a893d952 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -355,6 +355,18 @@  static struct generated_pack_data *populate_pack_exts(const char *name)
 	return data;
 }

+static int has_pack_ext(const struct generated_pack_data *data,
+			const char *ext)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		if (strcmp(exts[i].name, ext))
+			continue;
+		return !!data->tempfiles[i];
+	}
+	BUG("unknown pack extension: '%s'", ext);
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -772,6 +784,7 @@  static void midx_included_packs(struct string_list *include,

 static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
+				     struct string_list *names,
 				     const char *refs_snapshot,
 				     int show_progress, int write_bitmaps)
 {
@@ -801,6 +814,38 @@  static int write_midx_included_packs(struct string_list *include,
 	if (preferred)
 		strvec_pushf(&cmd.args, "--preferred-pack=%s",
 			     pack_basename(preferred));
+	else if (names->nr) {
+		/* The largest pack was repacked, meaning that either
+		 * one or two packs exist depending on whether the
+		 * repository has a cruft pack or not.
+		 *
+		 * Select the non-cruft one as preferred to encourage
+		 * pack-reuse among packs containing reachable objects
+		 * over unreachable ones.
+		 *
+		 * (Note we could write multiple packs here if
+		 * `--max-pack-size` was given, but any one of them
+		 * will suffice, so pick the first one.)
+		 */
+		for_each_string_list_item(item, names) {
+			struct generated_pack_data *data = item->util;
+			if (has_pack_ext(data, ".mtimes"))
+				continue;
+
+			strvec_pushf(&cmd.args, "--preferred-pack=pack-%s.pack",
+				     item->string);
+			break;
+		}
+	} else {
+		/*
+		 * No packs were kept, and no packs were written. The
+		 * only thing remaining are .keep packs (unless
+		 * --pack-kept-objects was given).
+		 *
+		 * Set the `--preferred-pack` arbitrarily here.
+		 */
+		;
+	}

 	if (refs_snapshot)
 		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
@@ -1360,7 +1405,7 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		struct string_list include = STRING_LIST_INIT_NODUP;
 		midx_included_packs(&include, &existing, &names, &geometry);

-		ret = write_midx_included_packs(&include, &geometry,
+		ret = write_midx_included_packs(&include, &geometry, &names,
 						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
 						show_progress, write_bitmaps > 0);

diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index dc86ca8269..be3735dff0 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -372,4 +372,43 @@  test_expect_success '--max-cruft-size ignores non-local packs' '
 	)
 '

+test_expect_success 'reachable packs are preferred over cruft ones' '
+	repo="cruft-preferred-packs" &&
+	git init "$repo" &&
+	(
+		cd "$repo" &&
+
+		# This test needs to exercise careful control over when a MIDX
+		# is and is not written. Unset the corresponding TEST variable
+		# accordingly.
+		sane_unset GIT_TEST_MULTI_PACK_INDEX &&
+
+		test_commit base &&
+		test_commit --no-tag cruft &&
+
+		non_cruft="$(echo base | git pack-objects --revs $packdir/pack)" &&
+		# Write a cruft pack which both (a) sorts ahead of the non-cruft
+		# pack in lexical order, and (b) has an older mtime to appease
+		# the MIDX preferred pack selection routine.
+		cruft="$(echo pack-$non_cruft.pack | git pack-objects --cruft $packdir/pack-A)" &&
+		test-tool chmtime -1000 $packdir/pack-A-$cruft.pack &&
+
+		test_commit other &&
+		git repack -d &&
+
+		git repack --geometric 2 -d --write-midx --write-bitmap-index &&
+
+		# After repacking, there are two packs left: one reachable one
+		# (which is the result of combining both of the existing two
+		# non-cruft packs), and one cruft pack.
+		find .git/objects/pack -type f -name "*.pack" >packs &&
+		test_line_count = 2 packs &&
+
+		# Make sure that the pack we just wrote is marked as preferred,
+		# not the cruft one.
+		pack="$(test-tool read-midx --preferred-pack $objdir)" &&
+		test_path_is_missing "$packdir/$(basename "$pack" ".idx").mtimes"
+	)
+'
+
 test_done