diff mbox series

[7/8] builtin/repack.c: make largest pack preferred

Message ID a790ee5ac6c03d6832599e77f84c352f577d6287.1631331139.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series repack: introduce `--write-midx` | expand

Commit Message

Taylor Blau Sept. 11, 2021, 3:32 a.m. UTC
When repacking into a geometric series and writing a multi-pack bitmap,
it is beneficial to have the largest resulting pack be the preferred
object source in the bitmap's MIDX, since selecting the large packs can
lead to fewer broken delta chains and better compression.

Teach 'git repack' to identify this pack and pass it to the MIDX write
machinery in order to mark it as preferred.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  4 ++++
 builtin/repack.c             | 23 ++++++++++++++++++++++-
 pack-bitmap.c                |  2 +-
 pack-bitmap.h                |  1 +
 t/helper/test-read-midx.c    | 25 ++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  | 23 +++++++++++++++++++++++
 6 files changed, 75 insertions(+), 3 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Sept. 11, 2021, 10:17 a.m. UTC | #1
On Fri, Sep 10 2021, Taylor Blau wrote:

> +	if (geometry) {
> +		struct packed_git *largest = get_largest_active_pack(geometry);
> +		if (largest)
> +			strvec_pushf(&cmd.args, "--preferred-pack=%s",
> +				     pack_basename(largest));
> +		else
> +			/*
> +			 * The largest pack was repacked, meaning only one pack
> +			 * exists (and tautologically, it is the largest).
> +			 */
> +			;
> +	}

Probably one of those cases where an assignment within an "if" is the
better of two evils, despite the CodingGuidelines warning against it
(but not forbidding it). The added get_largest_active_pack() could also
learn to punt on a NULL argument, in which case we wouldn't need an
assignment):

	struct packed_git *largest;
	[...]
        /* If ...[...] */
	if (geometry && (largest = get_largest_active_pack(geometry)))
		strvec_push(...);

I punted on re-phrasing the comment, because the current code invites
the reader to start reading this block, then we see that we may do
stuff, and then the comment applies to what we *didn't* do.

> -		usage("read-midx [--show-objects|--checksum] <object-dir>");
> +		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");

Just an aside, but I'm surprised to see this use the older
non-parse_options() usage API, which we're generally moving away
from. usage_msg_opt() is generally better.

> +		git config core.multiPackIndex true &&

I remember a similar pattern from another series, does this test pass
without the config being set? I didn't check...
Taylor Blau Sept. 11, 2021, 4:35 p.m. UTC | #2
On Sat, Sep 11, 2021 at 12:17:37PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Sep 10 2021, Taylor Blau wrote:
>
> > +	if (geometry) {
> > +		struct packed_git *largest = get_largest_active_pack(geometry);
> > +		if (largest)
> > +			strvec_pushf(&cmd.args, "--preferred-pack=%s",
> > +				     pack_basename(largest));
> > +		else
> > +			/*
> > +			 * The largest pack was repacked, meaning only one pack
> > +			 * exists (and tautologically, it is the largest).
> > +			 */
> > +			;
> > +	}
>
> Probably one of those cases where an assignment within an "if" is the
> better of two evils, despite the CodingGuidelines warning against it
> (but not forbidding it). The added get_largest_active_pack() could also
> learn to punt on a NULL argument, in which case we wouldn't need an
> assignment):
>
> 	struct packed_git *largest;
> 	[...]
>         /* If ...[...] */
> 	if (geometry && (largest = get_largest_active_pack(geometry)))
> 		strvec_push(...);
>
> I punted on re-phrasing the comment, because the current code invites
> the reader to start reading this block, then we see that we may do
> stuff, and then the comment applies to what we *didn't* do.

Of the two, I vastly prefer the form that doesn't require assignment
inside the conditional expression.

That comment could use an extra bit about how we don't handle
incremental repacks (since we don't bother looking at the size after
words in the case of `git repack -d` (which is neither all-into-one nor
geometric). Fixed both, thanks.

> > -		usage("read-midx [--show-objects|--checksum] <object-dir>");
> > +		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
>
> Just an aside, but I'm surprised to see this use the older
> non-parse_options() usage API, which we're generally moving away
> from. usage_msg_opt() is generally better.

Opportunity for future cleanup. I didn't see any point to change it in
the middle of an otherwise unrelated series.

> > +		git config core.multiPackIndex true &&
>
> I remember a similar pattern from another series, does this test pass
> without the config being set? I didn't check...

It does! This test was written long before 18e449f86b (midx: enable
core.multiPackIndex by default, 2020-09-25), but since 18e449f86b that
line is no longer necessary. Thanks for noticing!

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 0f2d235ca5..7183fb498f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -190,6 +190,10 @@  this "roll-up", without respect to their reachability. This is subject
 to change in the future. This option (implying a drastically different
 repack mode) is not guaranteed to work with all other combinations of
 option to `git repack`.
++
+When writing a multi-pack bitmap, `git repack` selects the largest resulting
+pack as the preferred pack for object selection by the MIDX (see
+linkgit:git-multi-pack-index[1]).
 
 -m::
 --write-midx::
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4292ab37..e958caff8b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -422,6 +422,13 @@  static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	geometry->split = split;
 }
 
+static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry)
+{
+	if (geometry->split == geometry->pack_nr)
+		return NULL;
+	return geometry->pack[geometry->pack_nr - 1];
+}
+
 static void clear_pack_geometry(struct pack_geometry *geometry)
 {
 	if (!geometry)
@@ -469,6 +476,7 @@  static void midx_included_packs(struct string_list *include,
 }
 
 static int write_midx_included_packs(struct string_list *include,
+				     struct pack_geometry *geometry,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -490,6 +498,19 @@  static int write_midx_included_packs(struct string_list *include,
 	if (write_bitmaps)
 		strvec_push(&cmd.args, "--bitmap");
 
+	if (geometry) {
+		struct packed_git *largest = get_largest_active_pack(geometry);
+		if (largest)
+			strvec_pushf(&cmd.args, "--preferred-pack=%s",
+				     pack_basename(largest));
+		else
+			/*
+			 * The largest pack was repacked, meaning only one pack
+			 * exists (and tautologically, it is the largest).
+			 */
+			;
+	}
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -776,7 +797,7 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		midx_included_packs(&include, &existing_packs,
 				    &existing_kept_packs, &names, geometry);
 
-		ret = write_midx_included_packs(&include,
+		ret = write_midx_included_packs(&include, geometry,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8504110a4d..67be9be9a6 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1418,7 +1418,7 @@  static int try_partial_reuse(struct packed_git *pack,
 	return 0;
 }
 
-static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
 {
 	struct multi_pack_index *m = bitmap_git->midx;
 	if (!m)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 469090bad2..7d407c5a4c 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -55,6 +55,7 @@  int test_bitmap_commits(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 struct list_objects_filter_options *filter,
 					 int filter_provided_objects);
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
 				       uint32_t *entries,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index cb0d27049a..0038559129 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -3,6 +3,7 @@ 
 #include "midx.h"
 #include "repository.h"
 #include "object-store.h"
+#include "pack-bitmap.h"
 
 static int read_midx_file(const char *object_dir, int show_objects)
 {
@@ -72,14 +73,36 @@  static int read_midx_checksum(const char *object_dir)
 	return 0;
 }
 
+static int read_midx_preferred_pack(const char *object_dir)
+{
+	struct multi_pack_index *midx = NULL;
+	struct bitmap_index *bitmap = NULL;
+
+	setup_git_directory();
+
+	midx = load_multi_pack_index(object_dir, 1);
+	if (!midx)
+		return 1;
+
+	bitmap = prepare_bitmap_git(the_repository);
+	if (!(bitmap && bitmap_is_midx(bitmap)))
+		return 1;
+
+
+	printf("%s\n", midx->pack_names[midx_preferred_pack(bitmap)]);
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects|--checksum] <object-dir>");
+		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
 	else if (!strcmp(argv[1], "--checksum"))
 		return read_midx_checksum(argv[2]);
+	else if (!strcmp(argv[1], "--preferred-pack"))
+		return read_midx_preferred_pack(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 5ccaa440e0..7bdeffa111 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -180,4 +180,27 @@  test_expect_success '--geometric ignores kept packs' '
 	)
 '
 
+test_expect_success '--geometric chooses largest MIDX preferred pack' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+		git config core.multiPackIndex true &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		ls $objdir/pack/pack-*.idx >before &&
+		test_commit_bulk --start=4 4 && # 12 objects
+		ls $objdir/pack/pack-*.idx >after &&
+
+		git repack --geometric 2 -dbm &&
+
+		comm -3 before after | xargs -n 1 basename >expect &&
+		test-tool read-midx --preferred-pack $objdir >actual &&
+
+		test_cmp expect actual
+	)
+'
+
 test_done