diff mbox series

[v2,2/3] midx.c: use `pack-objects --stdin-packs` when repacking

Message ID 4218d9e08aba629d8f64b5a999f60d12e5d8785b.1663706401.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series midx: use `--stdin-packs` to implement `repack` | expand

Commit Message

Taylor Blau Sept. 20, 2022, 8:40 p.m. UTC
The `git multi-pack-index repack` command aggregates all objects in
certain packs contained in the MIDX, largely dictated by the value of
its `--batch-size` parameter.

To create the new pack, the MIDX machinery spawns a `pack-objects`
sub-process, which it feeds the object IDs in each of the pack(s) that
it wants to aggregate.

This implementation, which dates back to ce1e4a105b (midx: implement
midx_repack(), 2019-06-10), predates the `--stdin-packs` mode in
`pack-objects`. This mode (which was introduced a couple of years later
in 339bce27f4 (builtin/pack-objects.c: add '--stdin-packs' option,
2021-02-22)) allows `pack-objects` callers to specify the contents of
the output pack by indicating which packs to aggregate.

This patch replaces the pre-`--stdin-packs` invocation (where each
object is given to `pack-objects` one by one) with the more modern
`--stdin-packs` option.

This allows us to avoid some CPU cycles serializing and deserializing
every object ID in all of the packs we're aggregating. It also avoids us
having to send a potentially large amount of data down to
`pack-objects`.

But more importantly, it generates slightly higher quality (read: more
tightly compressed) packs, because of the reachability traversal that
`--stdin-packs` does after the fact in order to gather namehash values
which seed the delta selection process.

In practice, this seems to add a slight amount of overhead (on the order
of a few seconds for git.git broken up into ~100 packs), in exchange for
a modest reduction (on the order of ~3.5%) in the resulting pack size.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

Comments

Derrick Stolee Sept. 23, 2022, 6:23 p.m. UTC | #1
On 9/20/2022 4:40 PM, Taylor Blau wrote:

> +	for (i = 0; i < m->num_packs; i++) {
> +		struct repack_info *info = &pack_info[i];
>  
> -		if (!include_pack[pack_int_id])
> -			continue;

...

> +		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);

Outside of how the object set is provided (a list of objects or a list of
packs) it seems this is equivalent. The only difference is that we are
writing a line for _every_ pack in the multi-pack-index, but we preface
the line with "^" to say "not this pack".

Do you know if there is any reason to do this explicitly? Does this
change the set of objects in any way (perhaps by not including
duplicates that are tracked in those other packs)?

Personally, I would have kept the "if (...) continue;" pattern, so maybe
you had a concrete idea why this approach is better.

Thanks,
-Stolee
Taylor Blau Sept. 23, 2022, 6:28 p.m. UTC | #2
On Fri, Sep 23, 2022 at 02:23:30PM -0400, Derrick Stolee wrote:
> Do you know if there is any reason to do this explicitly? Does this
> change the set of objects in any way (perhaps by not including
> duplicates that are tracked in those other packs)?

Yes. The "^" lines become excluded packs from the perspective of the
follow-on reachability traversal to discover the namehashes. So as soon
as we hit an object contained in one of the packs marked as excluded,
we'll halt the traversal along that direction, since we know that we're
not going to pick up any objects in those packs.

So you could omit them, and you'd get the same resulting pack, but it
would take longer to generate since we wouldn't be stopping the
follow-on traversal as early as possible.

Thanks,
Taylor
Derrick Stolee Sept. 23, 2022, 7:05 p.m. UTC | #3
On 9/23/2022 2:28 PM, Taylor Blau wrote:
> On Fri, Sep 23, 2022 at 02:23:30PM -0400, Derrick Stolee wrote:
>> Do you know if there is any reason to do this explicitly? Does this
>> change the set of objects in any way (perhaps by not including
>> duplicates that are tracked in those other packs)?
> 
> Yes. The "^" lines become excluded packs from the perspective of the
> follow-on reachability traversal to discover the namehashes. So as soon
> as we hit an object contained in one of the packs marked as excluded,
> we'll halt the traversal along that direction, since we know that we're
> not going to pick up any objects in those packs.
> 
> So you could omit them, and you'd get the same resulting pack, but it
> would take longer to generate since we wouldn't be stopping the
> follow-on traversal as early as possible.

Excellent. Thanks for explaining that!

-Stolee
diff mbox series

Patch

diff --git a/midx.c b/midx.c
index db0a70c5af..5fbf964bba 100644
--- a/midx.c
+++ b/midx.c
@@ -1955,6 +1955,7 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
+	struct strbuf scratch = STRBUF_INIT;
 	struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir);
 	struct repack_info *pack_info = NULL;
 
@@ -1996,7 +1997,7 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
 	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);
 
-	strvec_push(&cmd.args, "pack-objects");
+	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", NULL);
 
 	strbuf_addstr(&base_name, object_dir);
 	strbuf_addstr(&base_name, "/pack/pack");
@@ -2025,17 +2026,19 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 
 	cmd_in = xfdopen(cmd.in, "w");
 
-	for (i = 0; i < m->num_objects; i++) {
-		struct object_id oid;
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+	for (i = 0; i < m->num_packs; i++) {
+		struct repack_info *info = &pack_info[i];
 
-		if (!include_pack[pack_int_id])
-			continue;
+		strbuf_reset(&scratch);
 
-		nth_midxed_object_oid(&oid, m, i);
-		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
+		strbuf_addstr(&scratch, m->pack_names[info->pack_int_id]);
+		strbuf_strip_suffix(&scratch, ".idx");
+		strbuf_addstr(&scratch, ".pack");
+
+		fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf);
 	}
 	fclose(cmd_in);
+	strbuf_release(&scratch);
 
 	if (finish_command(&cmd)) {
 		error(_("could not finish pack-objects"));