diff mbox series

[v2,4/4] midx-write.c: use `--stdin-packs` when repacking

Message ID b5d6ba5802aef6ddf1542f1b0efffe43c22436ba.1712006190.git.me@ttaylorr.com (mailing list archive)
State Accepted
Commit b7d6f23a17110d597d58f4a8e1b34b7a72c43fe1
Headers show
Series midx: split MIDX writing routines into midx-write.c, cleanup | expand

Commit Message

Taylor Blau April 1, 2024, 9:16 p.m. UTC
When constructing a new pack `git multi-pack-index repack` provides a
list of objects which is the union of objects in all MIDX'd packs which
were "included" in the repack.

Though correct, this typically yields a poorly structured pack, since
providing the objects list over stdin does not give pack-objects a
chance to discover the namehash values for each object, leading to
sub-optimal delta selection.

We can use `--stdin-packs` instead, which has a couple of benefits:

  - it does a supplemental walk over objects in the supplied list of
    packs to discover their namehash, leading to higher-quality delta
    selection

  - it requires us to list far less data over stdin; instead of listing
    each object in the resulting pack, we need only list the
    constituent packs from which those objects were selected in the MIDX

Of course, this comes at a slight cost: though we save time on listing
packs versus objects over stdin[^1] (around ~650 milliseconds), we add a
non-trivial amount of time walking over the given objects in order to
find better deltas.

In general, this is likely to more closely match the user's expectations
(i.e. that packs generated via `git multi-pack-index repack` are written
with high-quality deltas). But if not, we can always introduce a new
option in pack-objects to disable the supplemental object walk, which
would yield a pure CPU-time savings, at the cost of the on-disk size of
the resulting pack.

[^1]: In a patched version of Git that doesn't perform the supplemental
  object walk in `pack-objects --stdin-packs`, we save around ~650ms
  (from 5.968 to 5.325 seconds) when running `git multi-pack-index
  repack --batch-size=0` on git.git with all objects packed, and all
  packs in a MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx-write.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Comments

Junio C Hamano April 1, 2024, 9:45 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> When constructing a new pack `git multi-pack-index repack` provides a
> list of objects which is the union of objects in all MIDX'd packs which
> were "included" in the repack.
>
> Though correct, this typically yields a poorly structured pack, since
> providing the objects list over stdin does not give pack-objects a
> chance to discover the namehash values for each object, leading to
> sub-optimal delta selection.
>
> We can use `--stdin-packs` instead, which has a couple of benefits:
>
>   - it does a supplemental walk over objects in the supplied list of
>     packs to discover their namehash, leading to higher-quality delta
>     selection
>
>   - it requires us to list far less data over stdin; instead of listing
>     each object in the resulting pack, we need only list the
>     constituent packs from which those objects were selected in the MIDX
>
> Of course, this comes at a slight cost: though we save time on listing
> packs versus objects over stdin[^1] (around ~650 milliseconds), we add a
> non-trivial amount of time walking over the given objects in order to
> find better deltas.
>
> In general, this is likely to more closely match the user's expectations
> (i.e. that packs generated via `git multi-pack-index repack` are written
> with high-quality deltas). But if not, we can always introduce a new
> option in pack-objects to disable the supplemental object walk, which
> would yield a pure CPU-time savings, at the cost of the on-disk size of
> the resulting pack.
>
> [^1]: In a patched version of Git that doesn't perform the supplemental
>   object walk in `pack-objects --stdin-packs`, we save around ~650ms
>   (from 5.968 to 5.325 seconds) when running `git multi-pack-index
>   repack --batch-size=0` on git.git with all objects packed, and all
>   packs in a MIDX.

There are some measures in the mind of readers' who have read the
explanation so far.

 - So, this gives us a resulting pack with better delta selection.
   How much better would it get in a sample repository?  10%?  40%?

 - Of course, the better delta selection comes with cost.  How much
   more time do we spend?  20%?  150%?

 - As we do not enumerate all the object names, we save some time.
   Around 0.65 seconds in a sample repository.

I think among the three, the first two are more interesting numbers,
no?

I wonder if we can leverage the trick that reuses existing packdata
when we stream packs to feed the "git fetch" clients---we rely on
the fact that existing packs are tightly packed with good delta
selection, and using bitmap stream contiguous section(s) as much as
possible without disturbing the existing delta chain.  Wouldn't the
"we have many packs, let's repack them into one" workload benefit
the same way?

> -	strvec_push(&cmd.args, "pack-objects");
> +	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", "--non-empty",
> +		     NULL);
>  
>  	strvec_pushf(&cmd.args, "%s/pack/pack", object_dir);
>  
> @@ -1498,16 +1499,15 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
>  	}
>  
>  	cmd_in = xfdopen(cmd.in, "w");
> -
> -	for (i = 0; i < m->num_objects; i++) {
> -		struct object_id oid;
> -		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
> -
> -		if (!include_pack[pack_int_id])
> +	for (i = 0; i < m->num_packs; i++) {
> +		struct packed_git *p = m->packs[i];
> +		if (!p)
>  			continue;
>  
> -		nth_midxed_object_oid(&oid, m, i);
> -		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
> +		if (include_pack[i])
> +			fprintf(cmd_in, "%s\n", pack_basename(p));
> +		else
> +			fprintf(cmd_in, "^%s\n", pack_basename(p));
>  	}
>  	fclose(cmd_in);

Looking very straight-forward.  Will queue.

Thanks.
Patrick Steinhardt April 2, 2024, 11:47 a.m. UTC | #2
On Mon, Apr 01, 2024 at 05:16:44PM -0400, Taylor Blau wrote:
[snip]
> @@ -1498,16 +1499,15 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
>  	}
>  
>  	cmd_in = xfdopen(cmd.in, "w");
> -
> -	for (i = 0; i < m->num_objects; i++) {
> -		struct object_id oid;
> -		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
> -
> -		if (!include_pack[pack_int_id])
> +	for (i = 0; i < m->num_packs; i++) {
> +		struct packed_git *p = m->packs[i];
> +		if (!p)
>  			continue;
>  
> -		nth_midxed_object_oid(&oid, m, i);
> -		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
> +		if (include_pack[i])
> +			fprintf(cmd_in, "%s\n", pack_basename(p));
> +		else
> +			fprintf(cmd_in, "^%s\n", pack_basename(p));
>  	}

The patch looks straight-forward to me overall, but this last part
confused me a bit. Why do we explicitly exclude those packs instead of
merely skipping over them? Is this so that we don't repack objects which
are part of both included and not-included packs?

I was mostly wondering whether it can ever happen here that we decide to
not include a pack because all of its objects are already contained in
other packs. In that case, explicitly excluding it would lead to repo
corruption if we deleted such a not-included pack.

Patrick
diff mbox series

Patch

diff --git a/midx-write.c b/midx-write.c
index 960cc46250..65e69d2de7 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -1474,7 +1474,8 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
 	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);
 
-	strvec_push(&cmd.args, "pack-objects");
+	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", "--non-empty",
+		     NULL);
 
 	strvec_pushf(&cmd.args, "%s/pack/pack", object_dir);
 
@@ -1498,16 +1499,15 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	}
 
 	cmd_in = xfdopen(cmd.in, "w");
-
-	for (i = 0; i < m->num_objects; i++) {
-		struct object_id oid;
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
-
-		if (!include_pack[pack_int_id])
+	for (i = 0; i < m->num_packs; i++) {
+		struct packed_git *p = m->packs[i];
+		if (!p)
 			continue;
 
-		nth_midxed_object_oid(&oid, m, i);
-		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
+		if (include_pack[i])
+			fprintf(cmd_in, "%s\n", pack_basename(p));
+		else
+			fprintf(cmd_in, "^%s\n", pack_basename(p));
 	}
 	fclose(cmd_in);