diff mbox series

[11/11] midx-write.c: use `--stdin-packs` when repacking

Message ID 736be63234baf7fc6df8259d9bb7298858b2bc74.1711387439.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series midx: split MIDX writing routines into midx-write.c, cleanup | expand

Commit Message

Taylor Blau March 25, 2024, 5:24 p.m. UTC
When constructing a new pack `git multi-pack-index repack` provides a
list of objects which is the union of objects in all MIDX'd packs which
were "included" in the repack.

Though correct, this typically yields a poorly structured pack, since
providing the objects list over stdin does not give pack-objects a
chance to discover the namehash values for each object, leading to
sub-optimal delta selection.

We can use `--stdin-packs` instead, which has a couple of benefits:

  - it does a supplemental walk over objects in the supplied list of
    packs to discover their namehash, leading to higher-quality delta
    selection

  - it requires us to list far less data over stdin; instead of listing
    each object in the resulting pack, we need only list the
    constituent packs from which those objects were selected in the MIDX

Of course, this comes at a slight cost: though we save time on listing
packs versus objects over stdin[^1] (around ~650 milliseconds), we add a
non-trivial amount of time walking over the given objects in order to
find better deltas.

In general, this is likely to more closely match the user's expectations
(i.e. that packs generated via `git multi-pack-index repack` are written
with high-quality deltas). But if not, we can always introduce a new
option in pack-objects to disable the supplemental object walk, which
would yield a pure CPU-time savings, at the cost of the on-disk size of
the resulting pack.

[^1]: In a patched version of Git that doesn't perform the supplemental
  object walk in `pack-objects --stdin-packs`, we save around ~650ms
  (from 5.968 to 5.325 seconds) when running `git multi-pack-index
  repack --batch-size=0` on git.git with all objects packed, and all
  packs in a MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx-write.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Comments

Jeff King March 27, 2024, 8:37 a.m. UTC | #1
On Mon, Mar 25, 2024 at 01:24:50PM -0400, Taylor Blau wrote:

> Though correct, this typically yields a poorly structured pack, since
> providing the objects list over stdin does not give pack-objects a
> chance to discover the namehash values for each object, leading to
> sub-optimal delta selection.
> 
> We can use `--stdin-packs` instead, which has a couple of benefits:
> 
>   - it does a supplemental walk over objects in the supplied list of
>     packs to discover their namehash, leading to higher-quality delta
>     selection
> 
>   - it requires us to list far less data over stdin; instead of listing
>     each object in the resulting pack, we need only list the
>     constituent packs from which those objects were selected in the MIDX

Yeah, using --stdin-packs makes much more sense to me. I have to admit
that I do not really see the point "multi-pack-index repack" in the
first place. You'd presumably be better off with geometric repacking,
and I think "repack --geometric --write-midx" will do things in the
right order (new pack, then write midx, then delete redundant packs).

I guess "multi-pack-index repack" came before geometric repacking, and
is maybe redundant-ish now? If that's the case, then I guess I don't
care that much about optimizing its packs. ;) But we can't really delete
it without a deprecation period, so making it more sensible in the
meantime is reasonable to me.

-Peff
diff mbox series

Patch

diff --git a/midx-write.c b/midx-write.c
index 4f1d649aa6..d341b9c628 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1474,7 +1474,8 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
 	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);
 
-	strvec_push(&cmd.args, "pack-objects");
+	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", "--non-empty",
+		     NULL);
 
 	if (delta_base_offset)
 		strvec_push(&cmd.args, "--delta-base-offset");
@@ -1498,16 +1499,15 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	}
 
 	cmd_in = xfdopen(cmd.in, "w");
-
-	for (i = 0; i < m->num_objects; i++) {
-		struct object_id oid;
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
-
-		if (!include_pack[pack_int_id])
+	for (i = 0; i < m->num_packs; i++) {
+		struct packed_git *p = m->packs[i];
+		if (!p)
 			continue;
 
-		nth_midxed_object_oid(&oid, m, i);
-		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
+		if (include_pack[i])
+			fprintf(cmd_in, "%s\n", pack_basename(p));
+		else
+			fprintf(cmd_in, "^%s\n", pack_basename(p));
 	}
 	fclose(cmd_in);