diff mbox series

builtin/repack.c: avoid making cruft packs preferred

Message ID 19d9aae08eab05c6b5dda4c2090236b1c3f62998.1696349955.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series builtin/repack.c: avoid making cruft packs preferred | expand

Commit Message

Taylor Blau Oct. 3, 2023, 4:27 p.m. UTC
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.

But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.

In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.

If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.

Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.

Note that this behavior is usually just a performance regression. But
it's possible it could be a correctness issue.

Suppose an object was duplicated among the cruft and non-cruft pack. The
MIDX will pick the one from the pack with the lowest mtime, which will
always be the cruft one. But if the non-cruft pack happens to sort
earlier in lexical order, we'll treat that one as preferred, but not all
duplicates will be resolved in favor of that pack.

So if we happened to have an object which appears in both packs
(e.g., due to a cruft object being freshened, causing it to appear
loose, and then repacking it via the `--geometric` repack) it's possible
the duplicate would be picked from the non-preferred pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
I've had this sitting in my patch queue for a while now. It's a
non-critical performance fix that avoids the repack/MIDX machinery from
ever choosing a cruft pack as preferred when writing a MIDX bitmap
without a given --preferred-pack.

There is no correctness issue here, but choosing a pack with few/no
reachable objects means that our pack reuse mechanism will rarely kick
in, resulting in performance degradation.

 builtin/repack.c        | 47 ++++++++++++++++++++++++++++++++++++++++-
 t/t7704-repack-cruft.sh | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 1 deletion(-)

--
2.42.0.311.ga7f7a0e966.dirty

Comments

Taylor Blau Oct. 3, 2023, 4:30 p.m. UTC | #1
On Tue, Oct 03, 2023 at 12:27:51PM -0400, Taylor Blau wrote:
> I've had this sitting in my patch queue for a while now. It's a
> non-critical performance fix that avoids the repack/MIDX machinery from
> ever choosing a cruft pack as preferred when writing a MIDX bitmap
> without a given --preferred-pack.
>
> There is no correctness issue here, but choosing a pack with few/no
> reachable objects means that our pack reuse mechanism will rarely kick
> in, resulting in performance degradation.
>
>  builtin/repack.c        | 47 ++++++++++++++++++++++++++++++++++++++++-
>  t/t7704-repack-cruft.sh | 39 ++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+), 1 deletion(-)

Oops, I should have mentioned that this is meant to be applied on top of
'tb/multi-cruft-pack' to reduce the conflict resolution burden. Sorry
about that.

Thanks,
Taylor
Jeff King Oct. 3, 2023, 8:16 p.m. UTC | #2
On Tue, Oct 03, 2023 at 12:27:51PM -0400, Taylor Blau wrote:

> Note that this behavior is usually just a performance regression. But
> it's possible it could be a correctness issue.
> 
> Suppose an object was duplicated among the cruft and non-cruft pack. The
> MIDX will pick the one from the pack with the lowest mtime, which will
> always be the cruft one. But if the non-cruft pack happens to sort
> earlier in lexical order, we'll treat that one as preferred, but not all
> duplicates will be resolved in favor of that pack.
> 
> So if we happened to have an object which appears in both packs
> (e.g., due to a cruft object being freshened, causing it to appear
> loose, and then repacking it via the `--geometric` repack) it's possible
> the duplicate would be picked from the non-preferred pack.

I'm not sure I understand how that is a correctness issue. The contents
of the object are the same in either pack. Or do you mean that the
pack-reuse code in pack-objects.c may get confused and try to use the
wrong pack/offset when sending the object out? I would think it would
always be coming from the preferred pack for that code (so the outcome
is just that we fail to do the pack-reuse optimization very well, but we
don't generate a wrong answer).

Other than that, the explanation and patch make perfect sense to me, and
the patch looks good.

-Peff
Junio C Hamano Oct. 3, 2023, 8:20 p.m. UTC | #3
Taylor Blau <me@ttaylorr.com> writes:

> On Tue, Oct 03, 2023 at 12:27:51PM -0400, Taylor Blau wrote:
>> I've had this sitting in my patch queue for a while now. It's a
>> non-critical performance fix that avoids the repack/MIDX machinery from
>> ever choosing a cruft pack as preferred when writing a MIDX bitmap
>> without a given --preferred-pack.
>>
>> There is no correctness issue here, but choosing a pack with few/no
>> reachable objects means that our pack reuse mechanism will rarely kick
>> in, resulting in performance degradation.
>>
>>  builtin/repack.c        | 47 ++++++++++++++++++++++++++++++++++++++++-
>>  t/t7704-repack-cruft.sh | 39 ++++++++++++++++++++++++++++++++++
>>  2 files changed, 85 insertions(+), 1 deletion(-)
>
> Oops, I should have mentioned that this is meant to be applied on top of
> 'tb/multi-cruft-pack' to reduce the conflict resolution burden. Sorry
> about that.

Sorry, but I do not follow.  tb/multi-cruft-pack was merged to
'master' at c0b5d46d (Documentation/gitformat-pack.txt: drop mixed
version section, 2023-08-28) but back then t7704 did not exist.  Do
you mean the other topic in-flight from you about max-cruft-size?
Taylor Blau Oct. 3, 2023, 9:39 p.m. UTC | #4
On Tue, Oct 03, 2023 at 01:20:24PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Tue, Oct 03, 2023 at 12:27:51PM -0400, Taylor Blau wrote:
> >> I've had this sitting in my patch queue for a while now. It's a
> >> non-critical performance fix that avoids the repack/MIDX machinery from
> >> ever choosing a cruft pack as preferred when writing a MIDX bitmap
> >> without a given --preferred-pack.
> >>
> >> There is no correctness issue here, but choosing a pack with few/no
> >> reachable objects means that our pack reuse mechanism will rarely kick
> >> in, resulting in performance degradation.
> >>
> >>  builtin/repack.c        | 47 ++++++++++++++++++++++++++++++++++++++++-
> >>  t/t7704-repack-cruft.sh | 39 ++++++++++++++++++++++++++++++++++
> >>  2 files changed, 85 insertions(+), 1 deletion(-)
> >
> > Oops, I should have mentioned that this is meant to be applied on top of
> > 'tb/multi-cruft-pack' to reduce the conflict resolution burden. Sorry
> > about that.
>
> Sorry, but I do not follow.  tb/multi-cruft-pack was merged to
> 'master' at c0b5d46d (Documentation/gitformat-pack.txt: drop mixed
> version section, 2023-08-28) but back then t7704 did not exist.  Do
> you mean the other topic in-flight from you about max-cruft-size?

My mistake -- this should be applied on top of the patches at this
series:

    https://lore.kernel.org/git/cover.1696293862.git.me@ttaylorr.com/

which is the other topic that you're referring to. I ran "git
for-each-ref 'refs/remotes/origin/tb/*' | grep cruft" and mistakenly
grabbed 'tb/multi-cruft-pack' thinking that that was that series and not
the one already merged to 'master'.

But it looks like the series linked above hasn't yet grown a topic
branch in gitster/git yet, hence my mistake. Sorry about that!

Thanks,
Taylor
Taylor Blau Oct. 3, 2023, 9:52 p.m. UTC | #5
On Tue, Oct 03, 2023 at 04:16:17PM -0400, Jeff King wrote:
> On Tue, Oct 03, 2023 at 12:27:51PM -0400, Taylor Blau wrote:
>
> > Note that this behavior is usually just a performance regression. But
> > it's possible it could be a correctness issue.
> >
> > Suppose an object was duplicated among the cruft and non-cruft pack. The
> > MIDX will pick the one from the pack with the lowest mtime, which will
> > always be the cruft one. But if the non-cruft pack happens to sort
> > earlier in lexical order, we'll treat that one as preferred, but not all
> > duplicates will be resolved in favor of that pack.
> >
> > So if we happened to have an object which appears in both packs
> > (e.g., due to a cruft object being freshened, causing it to appear
> > loose, and then repacking it via the `--geometric` repack) it's possible
> > the duplicate would be picked from the non-preferred pack.
>
> I'm not sure I understand how that is a correctness issue. The contents
> of the object are the same in either pack. Or do you mean that the
> pack-reuse code in pack-objects.c may get confused and try to use the
> wrong pack/offset when sending the object out? I would think it would
> always be coming from the preferred pack for that code (so the outcome
> is just that we fail to do the pack-reuse optimization very well, but we
> don't generate a wrong answer).

Admittedly I'm not 100% sure of my interpretation here either, since I
wrote most of this patch's log message nearly two years ago ;-).

I think the issue was as follows:

  - you have an object duplicated among some cruft and non-cruft pack
  - the cruft pack happens to have an older mtime than the non-cruft one
  - the repack reused no non-cruft packs, which (before this patch)
    would let the MIDX avoid picking a preferred pack
  - if the non-cruft pack happens to sort ahead of the cruft one in
    lexical order, it'll appear in the first few bits of the bitmap's
    ordering, causing us to treat it as if it were the preferred pack
  - ...but the MIDX will resolve duplicate objects in favor of the
    oldest pack when neither are preferred, causing us to pick a
    duplicate from the non-"preferred" pack.

(The last point is what causes the pack-bitmap reading code to get
confused, since it expects and makes assumption based on resolving ties
in favor of the preferred pack.)

That feels right to me, but admittedly my memory is pretty hazy on those
bitmap bugs. However, I don't think that this is an issue anymore, since
this patch was written before we had 177c0d6e63 (midx: infer preferred
pack when not given one, 2021-08-31) in GitHub's private fork, and the
patch message here incorrectly refers to statements about GitHub's fork,
not git.git.

But since we have 177c0d6e63 in git.git, this paragraph can and should
be dropped as finding ourselves in that situation is impossible, since
we would infer a preferred pack when not given one, and resolve
duplicates in favor of it.

That makes this purely a performance issue.

> Other than that, the explanation and patch make perfect sense to me, and
> the patch looks good.

Thanks!

> -Peff

Thanks,
Taylor
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index 04770b15fe..a1a893d952 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -355,6 +355,18 @@  static struct generated_pack_data *populate_pack_exts(const char *name)
 	return data;
 }

+static int has_pack_ext(const struct generated_pack_data *data,
+			const char *ext)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		if (strcmp(exts[i].name, ext))
+			continue;
+		return !!data->tempfiles[i];
+	}
+	BUG("unknown pack extension: '%s'", ext);
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -772,6 +784,7 @@  static void midx_included_packs(struct string_list *include,

 static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
+				     struct string_list *names,
 				     const char *refs_snapshot,
 				     int show_progress, int write_bitmaps)
 {
@@ -801,6 +814,38 @@  static int write_midx_included_packs(struct string_list *include,
 	if (preferred)
 		strvec_pushf(&cmd.args, "--preferred-pack=%s",
 			     pack_basename(preferred));
+	else if (names->nr) {
+		/* The largest pack was repacked, meaning that either
+		 * one or two packs exist depending on whether the
+		 * repository has a cruft pack or not.
+		 *
+		 * Select the non-cruft one as preferred to encourage
+		 * pack-reuse among packs containing reachable objects
+		 * over unreachable ones.
+		 *
+		 * (Note we could write multiple packs here if
+		 * `--max-pack-size` was given, but any one of them
+		 * will suffice, so pick the first one.)
+		 */
+		for_each_string_list_item(item, names) {
+			struct generated_pack_data *data = item->util;
+			if (has_pack_ext(data, ".mtimes"))
+				continue;
+
+			strvec_pushf(&cmd.args, "--preferred-pack=pack-%s.pack",
+				     item->string);
+			break;
+		}
+	} else {
+		/*
+		 * No packs were kept, and no packs were written. The
+		 * only thing remaining are .keep packs (unless
+		 * --pack-kept-objects was given).
+		 *
+		 * Set the `--preferred-pack` arbitrarily here.
+		 */
+		;
+	}

 	if (refs_snapshot)
 		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
@@ -1360,7 +1405,7 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		struct string_list include = STRING_LIST_INIT_NODUP;
 		midx_included_packs(&include, &existing, &names, &geometry);

-		ret = write_midx_included_packs(&include, &geometry,
+		ret = write_midx_included_packs(&include, &geometry, &names,
 						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
 						show_progress, write_bitmaps > 0);

diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index dc86ca8269..be3735dff0 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -372,4 +372,43 @@  test_expect_success '--max-cruft-size ignores non-local packs' '
 	)
 '

+test_expect_success 'reachable packs are preferred over cruft ones' '
+	repo="cruft-preferred-packs" &&
+	git init "$repo" &&
+	(
+		cd "$repo" &&
+
+		# This test needs to exercise careful control over when a MIDX
+		# is and is not written. Unset the corresponding TEST variable
+		# accordingly.
+		sane_unset GIT_TEST_MULTI_PACK_INDEX &&
+
+		test_commit base &&
+		test_commit --no-tag cruft &&
+
+		non_cruft="$(echo base | git pack-objects --revs $packdir/pack)" &&
+		# Write a cruft pack which both (a) sorts ahead of the non-cruft
+		# pack in lexical order, and (b) has an older mtime to appease
+		# the MIDX preferred pack selection routine.
+		cruft="$(echo pack-$non_cruft.pack | git pack-objects --cruft $packdir/pack-A)" &&
+		test-tool chmtime -1000 $packdir/pack-A-$cruft.pack &&
+
+		test_commit other &&
+		git repack -d &&
+
+		git repack --geometric 2 -d --write-midx --write-bitmap-index &&
+
+		# After repacking, there are two packs left: one reachable one
+		# (which is the result of combining both of the existing two
+		# non-cruft packs), and one cruft pack.
+		find .git/objects/pack -type f -name "*.pack" >packs &&
+		test_line_count = 2 packs &&
+
+		# Make sure that the pack we just wrote is marked as preferred,
+		# not the cruft one.
+		pack="$(test-tool read-midx --preferred-pack $objdir)" &&
+		test_path_is_missing "$packdir/$(basename "$pack" ".idx").mtimes"
+	)
+'
+
 test_done