Message ID | cover.1744661167.git.me@ttaylorr.com (mailing list archive) |
---|---|
Headers | show |
Series | repack: avoid MIDX'ing cruft pack(s) where possible | expand |
On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > Here is a non-RFC version of my series to explore creating MIDXs while > repacking that don't include the cruft pack. > > The core idea behind this approach is to ensure that packs generated via > geometric repacking traverse through objects that appear in packs which > are neither included nor excluded. This phrasing feels confusing -- what does it mean for packs to be neither included nor excluded? Maybe: "The core idea behind this approach is to allow some (most) of the objects in a pack to be excluded, while still including some subset of objects from that pack as part of the repack. In particular, we include the objects in that pack which are reachable from the other objects we repack. This is different from our current handling which either entirely includes or entirely excludes all objects from a given pack." > Then if some commit (for example) in > a pack reaches some once-unreachable object stored in a cruft pack, the > pack generated via geometric repacking will pick up and write a copy of > that object during its traversal. > > If you repack consistently using this strategy, you can guarantee that > the union of geometrically-repacked packs are closed under reachability > without having to keep track of any cruft pack(s) in the MIDX. Also, if you do a single non-geometric repack with this strategy, you are also closed under reachability, right? Is that the suggested transition plan for those that want to use this...first do a non-geometric repack, and then ensure that subsequent geometric repacks are done with this strategy?
On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > With '--unpacked', pack-objects adds loose objects (which don't appear > in any of the excluded packs from '--stdin-packs') to the output pack > without considering them as reachability tips for the name-hash > traversal. > > This was an oversight in the original implementation of '--stdin-packs', > since the code which enumerates and adds loose objects to the output > pack (`add_unreachable_loose_objects()`) did not have access to the > 'rev_info' struct found in `read_packs_list_from_stdin()`. > > Excluding unpacked objects from that traversal doesn't effect the s/effect/affect/ ? > correctness of the resulting pack, but it does make it harder to > discover good deltas for loose objects. > > Now that the 'rev_info' struct is declared outside of > `read_packs_list_from_stdin()`, we can pass it to > `add_objects_in_unpacked_packs()` and add any loose objects as tips to > the above-mentioned traversal, in theory producing slightly tighter > packs as a result. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > builtin/pack-objects.c | 20 ++++++++++++-------- > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 1689cddd3a..2aa12da4af 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -3642,7 +3642,7 @@ static void read_packs_list_from_stdin(struct rev_info *revs) > string_list_clear(&exclude_packs, 0); > } > > -static void add_unreachable_loose_objects(void); > +static void add_unreachable_loose_objects(struct rev_info *revs); > > static void read_stdin_packs(int rev_list_unpacked) > { > @@ -3669,7 +3669,7 @@ static void read_stdin_packs(int rev_list_unpacked) > ignore_packed_keep_in_core = 1; > read_packs_list_from_stdin(&revs); > if (rev_list_unpacked) > - add_unreachable_loose_objects(); > + add_unreachable_loose_objects(&revs); > > if (prepare_revision_walk(&revs)) > die(_("revision walk setup failed")); > @@ -3788,7 +3788,7 @@ static void enumerate_cruft_objects(void) > _("Enumerating cruft objects"), 0); > > add_objects_in_unpacked_packs(); > - add_unreachable_loose_objects(); > + add_unreachable_loose_objects(NULL); > > stop_progress(&progress_state); > } > @@ -4066,8 +4066,9 @@ static void add_objects_in_unpacked_packs(void) > } > > static int add_loose_object(const struct object_id *oid, const char *path, > - void *data UNUSED) > + void *data) > { > + struct rev_info *revs = data; > enum object_type type = oid_object_info(the_repository, oid, NULL); > > if (type < 0) { > @@ -4088,6 +4089,10 @@ static int add_loose_object(const struct object_id *oid, const char *path, > } else { > add_object_entry(oid, type, "", 0); > } > + > + if (revs && type == OBJ_COMMIT) > + add_pending_oid(revs, NULL, oid, 0); > + > return 0; > } > > @@ -4096,11 +4101,10 @@ static int add_loose_object(const struct object_id *oid, const char *path, > * add_object_entry will weed out duplicates, so we just add every > * loose object we find. > */ > -static void add_unreachable_loose_objects(void) > +static void add_unreachable_loose_objects(struct rev_info *revs) > { > for_each_loose_file_in_objdir(repo_get_object_directory(the_repository), > - add_loose_object, > - NULL, NULL, NULL); > + add_loose_object, NULL, NULL, revs); > } > > static int has_sha1_pack_kept_or_nonlocal(const struct object_id *oid) > @@ -4356,7 +4360,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av) > if (keep_unreachable) > add_objects_in_unpacked_packs(); > if (pack_loose_unreachable) > - add_unreachable_loose_objects(); > + add_unreachable_loose_objects(NULL); > if (unpack_unreachable) > loosen_unused_packed_objects(); > > -- > 2.49.0.229.gc267761125.dirty Should this patch have some tests demonstrating the difference in which objects are included?
On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > When invoked with '--stdin-packs', pack-objects will generate a pack > which contains the objects found in the "included" packs, less any > objects from "excluded" packs. > > Packs that exist in the repository but weren't specified as either > included or excluded are in practice treated like the latter, at least > in the sense that pack-objects won't include objects from those packs. > This behavior forces us to include any cruft pack(s) in a repository's > multi-pack index for the reasons described in ddee3703b3 > (builtin/repack.c: add cruft packs to MIDX during geometric repack, > 2022-05-20). > > The full details are in ddee3703b3, but the gist is if you > have a once-unreachable object in a cruft pack which later becomes > reachable via one or more commits in a pack generated with > '--stdin-packs', you *have* to include that object in the MIDX via the > copy in the cruft pack, otherwise we cannot generate reachability > bitmaps for any commits which reach that object. > > This prepares us for new repacking behavior which will "resurrect" > objects found in cruft or otherwise unspecified packs when generating > new packs. In the context of geometric repacking, this may be used to > maintain a sequence of geometrically-repacked packs, the union of which > is closed under reachability, even in the case described earlier. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > Documentation/git-pack-objects.adoc | 8 ++- > builtin/pack-objects.c | 89 +++++++++++++++++------- > t/t5331-pack-objects-stdin.sh | 101 ++++++++++++++++++++++++++++ > 3 files changed, 171 insertions(+), 27 deletions(-) > > diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc > index 7f69ae4855..c894582799 100644 > --- a/Documentation/git-pack-objects.adoc > +++ b/Documentation/git-pack-objects.adoc > @@ -87,13 +87,19 @@ base-name:: > reference was included in the resulting packfile. This > can be useful to send new tags to native Git clients. > > ---stdin-packs:: > +--stdin-packs[=<mode>]:: > Read the basenames of packfiles (e.g., `pack-1234abcd.pack`) > from the standard input, instead of object names or revision > arguments. The resulting pack contains all objects listed in the > included packs (those not beginning with `^`), excluding any > objects listed in the excluded packs (beginning with `^`). > + > +When `mode` is "follow", pack objects which are reachable from objects > +in the included packs, but appear in packs that are not listed. > +Reachable objects which appear in excluded packs are not packed. Useful > +for resurrecting once-cruft objects to generate packs which are closed > +under reachability up to the excluded packs. Maybe: When `mode` is "follow", objects from packs not listed on stdin receive special treatment. Objects within unlisted packs will be included if those objects (1) are reachable from the included packs, and (2) are not also found in any of the excluded packs. This mode is useful for resurrecting once-cruft objects to generate packs which are closed under reachability up to the boundary set by the excluded packs. > ++ > Incompatible with `--revs`, or options that imply `--revs` (such as > `--all`), with the exception of `--unpacked`, which is compatible. > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 2aa12da4af..6406f4a5b1 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -272,6 +272,12 @@ static struct oidmap configured_exclusions; > static struct oidset excluded_by_config; > static int name_hash_version = -1; > > +enum stdin_packs_mode { > + STDIN_PACKS_MODE_NONE, > + STDIN_PACKS_MODE_STANDARD, > + STDIN_PACKS_MODE_FOLLOW, > +}; > + > /** > * Check whether the name_hash_version chosen by user input is appropriate, > * and also validate whether it is compatible with other features. > @@ -3511,32 +3517,43 @@ static int add_object_entry_from_pack(const struct object_id *oid, > return 0; > } > > -static void show_commit_pack_hint(struct commit *commit UNUSED, > - void *data UNUSED) > -{ > - /* nothing to do; commits don't have a namehash */ > -} > - > static void show_object_pack_hint(struct object *object, const char *name, > - void *data UNUSED) > + void *data) > { > - struct object_entry *oe = packlist_find(&to_pack, &object->oid); > - if (!oe) > + enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data; > + if (mode == STDIN_PACKS_MODE_FOLLOW) { > + add_object_entry(&object->oid, object->type, name, 0); > + } else { > + struct object_entry *oe = packlist_find(&to_pack, &object->oid); > + if (!oe) > + return; > + > + /* > + * Our 'to_pack' list was constructed by iterating all > + * objects packed in included packs, and so doesn't > + * have a non-zero hash field that you would typically > + * pick up during a reachability traversal. > + * > + * Make a best-effort attempt to fill in the ->hash > + * and ->no_try_delta here using a now in order to > + * perhaps improve the delta selection process. > + */ I know you just moved this paragraph from below...but it doesn't parse for me. "using a now in order to perhaps"? What does that mean? > + oe->hash = pack_name_hash_fn(name); > + oe->no_try_delta = name && no_try_delta(name); > + > + stdin_packs_hints_nr++; > + } > +} > + > +static void show_commit_pack_hint(struct commit *commit, void *data) > +{ > + enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data; > + if (mode == STDIN_PACKS_MODE_FOLLOW) { > + show_object_pack_hint((struct object *)commit, "", data); > return; > + } > + /* nothing to do; commits don't have a namehash */ > > - /* > - * Our 'to_pack' list was constructed by iterating all objects packed in > - * included packs, and so doesn't have a non-zero hash field that you > - * would typically pick up during a reachability traversal. > - * > - * Make a best-effort attempt to fill in the ->hash and ->no_try_delta > - * here using a now in order to perhaps improve the delta selection > - * process. > - */ > - oe->hash = pack_name_hash_fn(name); > - oe->no_try_delta = name && no_try_delta(name); > - > - stdin_packs_hints_nr++; > } It might be worth swapping the order of functions as a preparatory patch (both here and when you've done it elsewhere in this series), just because it'll make the diff so much easier to read when we can see the changes to the function without have to also deal with the order swapping (since order swapping looks like a large deletion and large addition of one of the two functions). > static int pack_mtime_cmp(const void *_a, const void *_b) > @@ -3644,7 +3661,7 @@ static void read_packs_list_from_stdin(struct rev_info *revs) > > static void add_unreachable_loose_objects(struct rev_info *revs); > > -static void read_stdin_packs(int rev_list_unpacked) > +static void read_stdin_packs(enum stdin_packs_mode mode, int rev_list_unpacked) > { > struct rev_info revs; > > @@ -3676,7 +3693,7 @@ static void read_stdin_packs(int rev_list_unpacked) > traverse_commit_list(&revs, > show_commit_pack_hint, > show_object_pack_hint, > - NULL); > + &mode); > > trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found", > stdin_packs_found_nr); > @@ -4467,6 +4484,23 @@ static int is_not_in_promisor_pack(struct commit *commit, void *data) { > return is_not_in_promisor_pack_obj((struct object *) commit, data); > } > > +static int parse_stdin_packs_mode(const struct option *opt, const char *arg, > + int unset) > +{ > + enum stdin_packs_mode *mode = opt->value; > + > + if (unset) > + *mode = STDIN_PACKS_MODE_NONE; > + else if (!arg || !*arg) > + *mode = STDIN_PACKS_MODE_STANDARD; I don't understand why you have both a None mode and a Standard mode, especially since the implementation seems to only care about whether or not the Follow mode has been set. Shouldn't these both be setting mode to the same value? > + else if (!strcmp(arg, "follow")) > + *mode = STDIN_PACKS_MODE_FOLLOW; > + else > + die(_("invalid value for '%s': '%s'"), opt->long_name, arg); > + > + return 0; > +} > + > int cmd_pack_objects(int argc, > const char **argv, > const char *prefix, > @@ -4478,7 +4512,7 @@ int cmd_pack_objects(int argc, > struct strvec rp = STRVEC_INIT; > int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0; > int rev_list_index = 0; > - int stdin_packs = 0; > + enum stdin_packs_mode stdin_packs = STDIN_PACKS_MODE_NONE; > struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; > struct list_objects_filter_options filter_options = > LIST_OBJECTS_FILTER_INIT; > @@ -4533,6 +4567,9 @@ int cmd_pack_objects(int argc, > OPT_SET_INT_F(0, "indexed-objects", &rev_list_index, > N_("include objects referred to by the index"), > 1, PARSE_OPT_NONEG), > + OPT_CALLBACK_F(0, "stdin-packs", &stdin_packs, N_("mode"), > + N_("read packs from stdin"), > + PARSE_OPT_OPTARG, parse_stdin_packs_mode), > OPT_BOOL(0, "stdin-packs", &stdin_packs, > N_("read packs from stdin")), > OPT_BOOL(0, "stdout", &pack_to_stdout, > @@ -4788,7 +4825,7 @@ int cmd_pack_objects(int argc, > progress_state = start_progress(the_repository, > _("Enumerating objects"), 0); > if (stdin_packs) { > - read_stdin_packs(rev_list_unpacked); > + read_stdin_packs(stdin_packs, rev_list_unpacked); > } else if (cruft) { > read_cruft_objects(); > } else if (!use_internal_rev_list) { > diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh > index 4f5e2733a2..f97d2d1b71 100755 > --- a/t/t5331-pack-objects-stdin.sh > +++ b/t/t5331-pack-objects-stdin.sh > @@ -236,4 +236,105 @@ test_expect_success 'pack-objects --stdin with packfiles from main and alternate > test_cmp expected-objects actual-objects > ' > > +packdir=.git/objects/pack > + > +objects_in_packs () { > + for p in "$@" > + do > + git show-index <"$packdir/pack-$p.idx" || return 1 > + done >objects.raw && > + > + cut -d' ' -f2 objects.raw | sort && > + rm -f objects.raw > +} > + > +test_expect_success 'setup for --stdin-packs=follow' ' > + git init stdin-packs--follow && > + ( > + cd stdin-packs--follow && > + > + for c in A B C D > + do > + test_commit "$c" || return 1 > + done && > + > + A="$(echo A | git pack-objects --revs $packdir/pack)" && > + B="$(echo A..B | git pack-objects --revs $packdir/pack)" && > + C="$(echo B..C | git pack-objects --revs $packdir/pack)" && > + D="$(echo C..D | git pack-objects --revs $packdir/pack)" && > + > + git prune-packed > + ) > +' > + > +test_expect_success '--stdin-packs=follow walks into unknown packs' ' > + test_when_finished "rm -fr repo" && > + > + git init repo && > + ( > + cd repo && > + > + for c in A B C D > + do > + test_commit "$c" || return 1 > + done && > + > + A="$(echo A | git pack-objects --revs $packdir/pack)" && > + B="$(echo A..B | git pack-objects --revs $packdir/pack)" && > + C="$(echo B..C | git pack-objects --revs $packdir/pack)" && > + D="$(echo C..D | git pack-objects --revs $packdir/pack)" && > + > + git prune-packed && > + > + cat >in <<-EOF && > + pack-$B.pack > + ^pack-$C.pack > + pack-$D.pack > + EOF > + > + # With just --stdin-packs, pack "A" is unknown to us, so > + # only objects from packs "B" and "D" are included in > + # the output pack. > + P=$(git pack-objects --stdin-packs $packdir/pack <in) && > + objects_in_packs $B $D >expect && > + objects_in_packs $P >actual && > + test_cmp expect actual && > + > + # But with --stdin-packs=follow, objects from both > + # included packs reach objects from the unknown pack, so > + # objects from pack "A" is included in the output pack > + # in addition to the above. > + P=$(git pack-objects --stdin-packs=follow $packdir/pack <in) && > + objects_in_packs $A $B $D >expect && > + objects_in_packs $P >actual && > + test_cmp expect actual && > + > + test_commit E && > + # And with --unpacked, we will pick up objects from unknown > + # packs that are reachable from loose objects. Loose object E > + # reaches objects in pack A, but there are three excluded packs > + # in between. > + # > + # The resulting pack should include objects reachable from E > + # that are not present in packs B, C, or D, along with those > + # present in pack A. > + cat >in <<-EOF && > + ^pack-$B.pack > + ^pack-$C.pack > + ^pack-$D.pack > + EOF > + > + P=$(git pack-objects --stdin-packs=follow --unpacked \ > + $packdir/pack <in) && > + > + { > + objects_in_packs $A && > + git rev-list --objects --no-object-names D..E > + }>expect.raw && > + sort expect.raw >expect && > + objects_in_packs $P >actual && > + test_cmp expect actual > + ) > +' > + > test_done > -- > 2.49.0.229.gc267761125.dirty I like the tests -- normal --stdin-packs, then --stdin-packs=follow, then --stdin-packs=follow + --unpacked. However, would it be worthwhile to create commit E immediately after creating the packs? Currently, the third test shows us that unpacked objects are included when --unpacked is passed. But the tests don't let us know whether that flag is necessary, i.e. whether unpacked objects will just be included anyway. If you move the creation of commit E as a loose object immediately after the pack creation and before all the tests, then these same tests demonstrate this additional bit of information.
On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > In ddee3703b3 (builtin/repack.c: add cruft packs to MIDX during > geometric repack, 2022-05-20), repack began adding cruft pack(s) to the > MIDX with '--write-midx' to ensure that the resulting MIDX was always > closed under reachability in order to generate reachability bitmaps. > > Suppose you have a once-unreachable object packed in a cruft pack, which > later on becomes reachable from one or more objects in a geometrically > repacked pack. That once-unreachable object *won't* appear in the new > pack, since the cruft pack was specified as neither included nor > excluded to 'pack-objects --stdin-packs'. I believe you are talking about the state before your series (i.e., this is carrying on from the previous paragraph), but it reads as though you are talking about the state after the first seven patches of this series. Some kind of connection wording to clarify would really help here. > If the bitmap selection > process picks one or more commits which reach the once-unreachable > objects, commit ddee3703b3 ensures that the MIDX will be closed under > reachability. Without it, we would fail to generate a MIDX bitmap. After reading this part, I had to go back and re-read and figure out what point in time everything was referring to. > ddee3703b3 alludes to the fact that this is sub-optimal by saying > > [...] it's desirable to avoid including cruft packs in the MIDX > because it causes the MIDX to store a bunch of objects which are > likely to get thrown away. > > , which is true, but hides an even larger problem. If repositories > rarely prune their unreachable objects and/or have many of them, the > MIDX must keep track of a large number of objects which bloats the MIDX > and slows down object lookup. > > This is doubly unfortunate because the vast majority of objects in cruft > pack(s) are unlikely to be read, but object reads that go through the > MIDX have to search through them anyway. "have to search through them"? That could be read to suggest those individual objects are read, rather than just traversed over. Maybe "...unlikely to be read, so the enlarged MIDX is for mostly tracking known-to-likely-be-irrelevant objects", or something like that? > This patch causes geometrically-repacked packs to contain a copy of any > once-unreachable object(s) with 'git pack-objects --stdin-packs=follow', > allowing us to avoid including any cruft packs in the MIDX. This is > because a sequence of geometrically-repacked packs that were all > generated with '--stdin-packs=follow' are guaranteed to have their union > be closed under reachability. > > Note that you cannot guarantee that a collection of packs is closed > under reachability if not all of them were generated with following as maybe: ...with "follow" as above. "follow" or "following" feels like it needs quotes so the reader understands its meant as the name of a mode, rather than a verb in the sentence. > above. One tell-tale sign that not all geometrically-repacked packs in > the MIDX were generated with following is to see if there is a pack in same here with "following"...and below. > the existing MIDX that is not going to be somehow represented (either > verbatim or as part of a geometric rollup) in the new MIDX. > > If there is, then starting to generate packs with following during > geometric repacking won't work, since it's open to the same race as > described above. > > But if you're starting from scratch (e.g., building the first MIDX after > an all-into-one '--cruft' repack), then you can guarantee that the union > of subsequently generated packs from geometric repacking *is* closed > under reachability. > > Detect when this is the case and avoid including cruft packs in the MIDX > where possible. The existing behavior remains the default, and the new > behavior is available with the config 'repack.midxMustIncludeCruft' set > to 'false'. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > Documentation/config/repack.adoc | 7 ++ > builtin/repack.c | 162 +++++++++++++++++++++++++++---- > t/t7704-repack-cruft.sh | 90 +++++++++++++++++ > 3 files changed, 241 insertions(+), 18 deletions(-) > > diff --git a/Documentation/config/repack.adoc b/Documentation/config/repack.adoc > index c79af6d7b8..e9e78dcb19 100644 > --- a/Documentation/config/repack.adoc > +++ b/Documentation/config/repack.adoc > @@ -39,3 +39,10 @@ repack.cruftThreads:: > a cruft pack and the respective parameters are not given over > the command line. See similarly named `pack.*` configuration > variables for defaults and meaning. > + > +repack.midxMustContainCruft:: > + When set to true, linkgit:git-repack[1] will unconditionally include > + cruft pack(s), if any, in the multi-pack index when invoked with > + `--write-midx`. When false, cruft packs are only included in the MIDX > + when necessary (e.g., because they might be required to form a > + reachability closure with MIDX bitmaps). Defaults to true. > diff --git a/builtin/repack.c b/builtin/repack.c > index f3330ade7b..ee43a4f4c1 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -39,6 +39,7 @@ static int write_bitmaps = -1; > static int use_delta_islands; > static int run_update_server_info = 1; > static char *packdir, *packtmp_name, *packtmp; > +static int midx_must_contain_cruft = 1; > > static const char *const git_repack_usage[] = { > N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n" > @@ -107,6 +108,10 @@ static int repack_config(const char *var, const char *value, > free(cruft_po_args->threads); > return git_config_string(&cruft_po_args->threads, var, value); > } > + if (!strcmp(var, "repack.midxmustcontaincruft")) { > + midx_must_contain_cruft = git_config_bool(var, value); > + return 0; > + } > return git_default_config(var, value, ctx, cb); > } > > @@ -687,6 +692,77 @@ static void free_pack_geometry(struct pack_geometry *geometry) > free(geometry->pack); > } > > +static int midx_has_unknown_packs(char **midx_pack_names, > + size_t midx_pack_names_nr, > + struct string_list *include, > + struct pack_geometry *geometry, > + struct existing_packs *existing) > +{ > + size_t i; > + > + string_list_sort(include); > + > + for (i = 0; i < midx_pack_names_nr; i++) { > + const char *pack_name = midx_pack_names[i]; > + > + /* > + * Determine whether or not each MIDX'd pack from the existing > + * MIDX (if any) is represented in the new MIDX. For each pack > + * in the MIDX, it must either be: > + * > + * - In the "include" list of packs to be included in the new > + * MIDX. Note this function is called before the include > + * list is populated with any cruft pack(s). > + * > + * - Below the geometric split line (if using pack geometry), > + * indicating that the pack won't be included in the new > + * MIDX, but its contents were rolled up as part of the > + * geometric repack. > + * > + * - In the existing non-kept packs list (if not using pack > + * geometry), and marked as non-deleted. > + */ > + if (string_list_has_string(include, pack_name)) { > + continue; > + } else if (geometry) { > + struct strbuf buf = STRBUF_INIT; > + uint32_t j; > + > + for (j = 0; j < geometry->split; j++) { > + strbuf_reset(&buf); > + strbuf_addstr(&buf, pack_basename(geometry->pack[j])); > + strbuf_strip_suffix(&buf, ".pack"); > + strbuf_addstr(&buf, ".idx"); > + > + if (!strcmp(pack_name, buf.buf)) { > + strbuf_release(&buf); > + break; > + } > + } > + > + strbuf_release(&buf); > + > + if (j < geometry->split) > + continue; > + } else { > + struct string_list_item *item; > + > + item = string_list_lookup(&existing->non_kept_packs, > + pack_name); > + if (item && !pack_is_marked_for_deletion(item)) > + continue; > + } > + > + /* > + * If we got to this point, the MIDX includes some pack that we > + * don't know about. > + */ > + return 1; > + } > + > + return 0; > +} > + > struct midx_snapshot_ref_data { > struct tempfile *f; > struct oidset seen; > @@ -755,6 +831,8 @@ static void midx_snapshot_refs(struct tempfile *f) > > static void midx_included_packs(struct string_list *include, > struct existing_packs *existing, > + char **midx_pack_names, > + size_t midx_pack_names_nr, > struct string_list *names, > struct pack_geometry *geometry) > { > @@ -808,26 +886,55 @@ static void midx_included_packs(struct string_list *include, > } > } > > - for_each_string_list_item(item, &existing->cruft_packs) { > + if (midx_must_contain_cruft || > + midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr, > + include, geometry, existing)) { > /* > - * When doing a --geometric repack, there is no need to check > - * for deleted packs, since we're by definition not doing an > - * ALL_INTO_ONE repack (hence no packs will be deleted). > - * Otherwise we must check for and exclude any packs which are > - * enqueued for deletion. > + * If there are one or more unknown pack(s) present (see > + * midx_has_unknown_packs() for what makes a pack > + * "unknown") in the MIDX before the repack, keep them > + * as they may be required to form a reachability > + * closure if the MIDX is bitmapped. > * > - * So we could omit the conditional below in the --geometric > - * case, but doing so is unnecessary since no packs are marked > - * as pending deletion (since we only call > - * `mark_packs_for_deletion()` when doing an all-into-one > - * repack). > + * For example, a cruft pack can be required to form a > + * reachability closure if the MIDX is bitmapped and one > + * or more of its selected commits reaches a once-cruft > + * object that was later made reachable. The antecedent of "its" is unclear here; just spell it out to reduce how much thinking the reader needs to do? > */ > - if (pack_is_marked_for_deletion(item)) > - continue; > + for_each_string_list_item(item, &existing->cruft_packs) { > + /* > + * When doing a --geometric repack, there is no > + * need to check for deleted packs, since we're > + * by definition not doing an ALL_INTO_ONE > + * repack (hence no packs will be deleted). > + * Otherwise we must check for and exclude any > + * packs which are enqueued for deletion. > + * > + * So we could omit the conditional below in the > + * --geometric case, but doing so is unnecessary > + * since no packs are marked as pending > + * deletion (since we only call > + * `mark_packs_for_deletion()` when doing an > + * all-into-one repack). > + */ > + if (pack_is_marked_for_deletion(item)) > + continue; > > - strbuf_reset(&buf); > - strbuf_addf(&buf, "%s.idx", item->string); > - string_list_insert(include, buf.buf); > + strbuf_reset(&buf); > + strbuf_addf(&buf, "%s.idx", item->string); > + string_list_insert(include, buf.buf); > + } > + } else { > + /* > + * Modern versions of Git will write new copies of > + * once-cruft objects when doing a --geometric repack. "Modern versions of Git" -> "Modern versions of Git with the appropriate config setting" ? > + * > + * If the MIDX has no cruft pack, new packs written > + * during a --geometric repack will not rely on the > + * cruft pack to form a reachability closure, so we can > + * avoid including them in the MIDX in that case. > + */ > + ; > } > > strbuf_release(&buf); > @@ -1142,6 +1249,8 @@ int cmd_repack(int argc, > struct tempfile *refs_snapshot = NULL; > int i, ext, ret; > int show_progress; > + char **midx_pack_names = NULL; > + size_t midx_pack_names_nr = 0; > > /* variables to be filled by option parsing */ > int delete_redundant = 0; > @@ -1356,7 +1465,10 @@ int cmd_repack(int argc, > !(pack_everything & PACK_CRUFT)) > strvec_push(&cmd.args, "--pack-loose-unreachable"); > } else if (geometry.split_factor) { > - strvec_push(&cmd.args, "--stdin-packs"); > + if (midx_must_contain_cruft) > + strvec_push(&cmd.args, "--stdin-packs"); > + else > + strvec_push(&cmd.args, "--stdin-packs=follow"); > strvec_push(&cmd.args, "--unpacked"); > } else { > strvec_push(&cmd.args, "--unpacked"); > @@ -1478,6 +1590,16 @@ int cmd_repack(int argc, > > string_list_sort(&names); > > + if (get_local_multi_pack_index(the_repository)) { > + uint32_t i; > + struct multi_pack_index *m = > + get_local_multi_pack_index(the_repository); > + > + ALLOC_ARRAY(midx_pack_names, m->num_packs); > + for (i = 0; i < m->num_packs; i++) > + midx_pack_names[midx_pack_names_nr++] = xstrdup(m->pack_names[i]); > + } > + > close_object_store(the_repository->objects); > > /* > @@ -1519,7 +1641,8 @@ int cmd_repack(int argc, > > if (write_midx) { > struct string_list include = STRING_LIST_INIT_DUP; > - midx_included_packs(&include, &existing, &names, &geometry); > + midx_included_packs(&include, &existing, midx_pack_names, > + midx_pack_names_nr, &names, &geometry); > > ret = write_midx_included_packs(&include, &geometry, &names, > refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL, > @@ -1570,6 +1693,9 @@ int cmd_repack(int argc, > string_list_clear(&names, 1); > existing_packs_release(&existing); > free_pack_geometry(&geometry); > + for (size_t i = 0; i < midx_pack_names_nr; i++) > + free(midx_pack_names[i]); > + free(midx_pack_names); > pack_objects_args_release(&po_args); > pack_objects_args_release(&cruft_po_args); > > diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh > index 8aebfb45f5..2b0a55f8fd 100755 > --- a/t/t7704-repack-cruft.sh > +++ b/t/t7704-repack-cruft.sh > @@ -724,4 +724,94 @@ test_expect_success 'cruft repack respects --quiet' ' > ) > ' > > +setup_cruft_exclude_tests() { > + git init "$1" && > + ( > + cd "$1" && > + > + git config repack.midxMustContainCruft false && > + > + test_commit one && > + > + test_commit --no-tag two && > + two="$(git rev-parse HEAD)" && > + test_commit --no-tag three && > + three="$(git rev-parse HEAD)" && > + git reset --hard one && > + git reflog expire --all --expire=all && > + > + GIT_TEST_MULTI_PACK_INDEX=0 git repack --cruft -d && > + > + git merge $two && > + test_commit four > + ) > +} > + > +test_expect_success 'repack --write-midx excludes cruft where possible' ' > + setup_cruft_exclude_tests exclude-cruft-when-possible && > + ( > + cd exclude-cruft-when-possible && > + > + GIT_TEST_MULTI_PACK_INDEX=0 \ > + git repack -d --geometric=2 --write-midx --write-bitmap-index && > + > + test-tool read-midx --show-objects $objdir >midx && > + cruft="$(ls $packdir/*.mtimes)" && > + test_grep ! "$(basename "$cruft" .mtimes).idx" midx && > + > + git rev-list --all --objects --no-object-names >reachable.raw && > + sort reachable.raw >reachable.objects && > + awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && > + > + test_cmp reachable.objects midx.objects > + ) > +' > + > +test_expect_success 'repack --write-midx includes cruft when instructed' ' > + setup_cruft_exclude_tests exclude-cruft-when-instructed && > + ( > + cd exclude-cruft-when-instructed && > + > + GIT_TEST_MULTI_PACK_INDEX=0 \ > + git -c repack.midxMustContainCruft=true repack \ > + -d --geometric=2 --write-midx --write-bitmap-index && > + > + test-tool read-midx --show-objects $objdir >midx && > + cruft="$(ls $packdir/*.mtimes)" && > + test_grep "$(basename "$cruft" .mtimes).idx" midx && > + > + git cat-file --batch-check="%(objectname)" --batch-all-objects \ > + >all.objects && > + awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && > + > + test_cmp all.objects midx.objects > + ) > +' > + > +test_expect_success 'repack --write-midx includes cruft when necessary' ' > + setup_cruft_exclude_tests exclude-cruft-when-necessary && > + ( > + cd exclude-cruft-when-necessary && > + > + test_path_is_file $(ls $packdir/pack-*.mtimes) && > + ls $packdir/pack-*.idx | sort >packs.all && > + grep -o "pack-.*\.idx$" packs.all >in && > + > + git multi-pack-index write --stdin-packs --bitmap <in && > + > + test_commit five && > + GIT_TEST_MULTI_PACK_INDEX=0 \ > + git repack -d --geometric=2 --write-midx --write-bitmap-index && > + > + test-tool read-midx --show-objects $objdir >midx && > + awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && > + git cat-file --batch-all-objects --batch-check="%(objectname)" \ > + >expect.objects && > + test_cmp expect.objects midx.objects && > + > + grep "^pack-" midx >midx.packs && > + test_line_count = "$(($(wc -l <packs.all) + 1))" midx.packs > + ) > +' > + > test_done > -- > 2.49.0.229.gc267761125.dirty
On Mon, Apr 14, 2025 at 08:10:51PM -0700, Elijah Newren wrote: > On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > > > With '--unpacked', pack-objects adds loose objects (which don't appear > > in any of the excluded packs from '--stdin-packs') to the output pack > > without considering them as reachability tips for the name-hash > > traversal. > > > > This was an oversight in the original implementation of '--stdin-packs', > > since the code which enumerates and adds loose objects to the output > > pack (`add_unreachable_loose_objects()`) did not have access to the > > 'rev_info' struct found in `read_packs_list_from_stdin()`. > > > > Excluding unpacked objects from that traversal doesn't effect the > > s/effect/affect/ ? Oops, yes. > Should this patch have some tests demonstrating the difference in > which objects are included? No; this patch doesn't actually change the set of objects we include with '--stdin-packs' in conjunction with '--unpacked', it just alters their name-hash values in an attempt to produce better deltas. I don't think we have any tests that check this traversal in the packed or unpacked case, though we could probably add some. It's not obvious how we'd test that the traversal actually produced better/different deltas, but we could at least check that it happened with the trace2 identifier "pack-objects/stdin_packs_hints". I think it's probably worth doing at some point, though I don't think I see it as especially urgent, unless you feel strongly otherwise. Thanks, Taylor
On Mon, Apr 14, 2025 at 08:11:08PM -0700, Elijah Newren wrote: > > diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc > > index 7f69ae4855..c894582799 100644 > > --- a/Documentation/git-pack-objects.adoc > > +++ b/Documentation/git-pack-objects.adoc > > @@ -87,13 +87,19 @@ base-name:: > > reference was included in the resulting packfile. This > > can be useful to send new tags to native Git clients. > > > > ---stdin-packs:: > > +--stdin-packs[=<mode>]:: > > Read the basenames of packfiles (e.g., `pack-1234abcd.pack`) > > from the standard input, instead of object names or revision > > arguments. The resulting pack contains all objects listed in the > > included packs (those not beginning with `^`), excluding any > > objects listed in the excluded packs (beginning with `^`). > > + > > +When `mode` is "follow", pack objects which are reachable from objects > > +in the included packs, but appear in packs that are not listed. > > +Reachable objects which appear in excluded packs are not packed. Useful > > +for resurrecting once-cruft objects to generate packs which are closed > > +under reachability up to the excluded packs. > > Maybe: > > When `mode` is "follow", objects from packs not listed on stdin > receive special treatment. Objects within unlisted packs will be > included if those objects (1) are reachable from the included packs, > and (2) are not also found in any of the excluded packs. This mode is > useful for resurrecting once-cruft objects to generate packs which are > closed under reachability up to the boundary set by the excluded > packs. I like it. I went with your version with some minor rewording and tweaks on top. > > + /* > > + * Our 'to_pack' list was constructed by iterating all > > + * objects packed in included packs, and so doesn't > > + * have a non-zero hash field that you would typically > > + * pick up during a reachability traversal. > > + * > > + * Make a best-effort attempt to fill in the ->hash > > + * and ->no_try_delta here using a now in order to > > + * perhaps improve the delta selection process. > > + */ > > I know you just moved this paragraph from below...but it doesn't parse > for me. "using a now in order to perhaps"? What does that mean? Yeah, this is just bogus, and was so before this patch. The rewording is minor enough (just dropping "using a now") that I think we can just squash it in with the movement in this patch. > > + oe->hash = pack_name_hash_fn(name); > > + oe->no_try_delta = name && no_try_delta(name); > > + > > + stdin_packs_hints_nr++; > > + } > > +} > > + > > +static void show_commit_pack_hint(struct commit *commit, void *data) > > +{ > > + enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data; > > + if (mode == STDIN_PACKS_MODE_FOLLOW) { > > + show_object_pack_hint((struct object *)commit, "", data); > > return; > > + } > > + /* nothing to do; commits don't have a namehash */ > > > > - /* > > - * Our 'to_pack' list was constructed by iterating all objects packed in > > - * included packs, and so doesn't have a non-zero hash field that you > > - * would typically pick up during a reachability traversal. > > - * > > - * Make a best-effort attempt to fill in the ->hash and ->no_try_delta > > - * here using a now in order to perhaps improve the delta selection > > - * process. > > - */ > > - oe->hash = pack_name_hash_fn(name); > > - oe->no_try_delta = name && no_try_delta(name); > > - > > - stdin_packs_hints_nr++; > > } > > It might be worth swapping the order of functions as a preparatory > patch (both here and when you've done it elsewhere in this series), > just because it'll make the diff so much easier to read when we can > see the changes to the function without have to also deal with the > order swapping (since order swapping looks like a large deletion and > large addition of one of the two functions). Fair enough. > > @@ -4467,6 +4484,23 @@ static int is_not_in_promisor_pack(struct commit *commit, void *data) { > > return is_not_in_promisor_pack_obj((struct object *) commit, data); > > } > > > > +static int parse_stdin_packs_mode(const struct option *opt, const char *arg, > > + int unset) > > +{ > > + enum stdin_packs_mode *mode = opt->value; > > + > > + if (unset) > > + *mode = STDIN_PACKS_MODE_NONE; > > + else if (!arg || !*arg) > > + *mode = STDIN_PACKS_MODE_STANDARD; > > I don't understand why you have both a None mode and a Standard mode, > especially since the implementation seems to only care about whether > or not the Follow mode has been set. Shouldn't these both be setting > mode to the same value? I'm not sure I follow your question... stdin_packs is a tri-state. It can be off, on in standard/legacy mode, or on in follow mode. > > +test_expect_success 'setup for --stdin-packs=follow' ' > > + git init stdin-packs--follow && > > + ( > > + cd stdin-packs--follow && > > + > > + for c in A B C D > > + do > > + test_commit "$c" || return 1 > > + done && > > + > > + A="$(echo A | git pack-objects --revs $packdir/pack)" && > > + B="$(echo A..B | git pack-objects --revs $packdir/pack)" && > > + C="$(echo B..C | git pack-objects --revs $packdir/pack)" && > > + D="$(echo C..D | git pack-objects --revs $packdir/pack)" && > > + > > + git prune-packed > > + ) > > +' Huh, I have no idea how this snuck in. This "setup" test does nothing and creates a repository that isn't used later on in the script. Probably leftover from writing these tests in the first place, but I've removed it. > I like the tests -- normal --stdin-packs, then --stdin-packs=follow, > then --stdin-packs=follow + --unpacked. I think the normal tests are accidental since we use pack-objects to write packs A, B, C, and D. But the --stdin-packs vs. --stdin-packs=follow and --stdin-packs=follow + --unpacked was definitely intentional. > However, would it be worthwhile to create commit E immediately after > creating the packs? Yeah, I think that is a good suggestion. We already have tests that exercise --stdin-packs with --unpacked earlier in the same script, but obviously not with --stdin-packs=follow. Moving the creation of commit E earlier up makes a lot of sense to me, thanks! Thanks, Taylor
On Mon, Apr 14, 2025 at 08:11:22PM -0700, Elijah Newren wrote: > On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > > > In ddee3703b3 (builtin/repack.c: add cruft packs to MIDX during > > geometric repack, 2022-05-20), repack began adding cruft pack(s) to the > > MIDX with '--write-midx' to ensure that the resulting MIDX was always > > closed under reachability in order to generate reachability bitmaps. > > > > Suppose you have a once-unreachable object packed in a cruft pack, which > > later on becomes reachable from one or more objects in a geometrically > > repacked pack. That once-unreachable object *won't* appear in the new > > pack, since the cruft pack was specified as neither included nor > > excluded to 'pack-objects --stdin-packs'. > > I believe you are talking about the state before your series (i.e., > this is carrying on from the previous paragraph), but it reads as > though you are talking about the state after the first seven patches > of this series. Some kind of connection wording to clarify would > really help here. Sure. > > If the bitmap selection > > process picks one or more commits which reach the once-unreachable > > objects, commit ddee3703b3 ensures that the MIDX will be closed under > > reachability. Without it, we would fail to generate a MIDX bitmap. > > After reading this part, I had to go back and re-read and figure out > what point in time everything was referring to. Yeah, this is confusing to me too after reading it back. I made some tweaks that I think clarify things. > > ddee3703b3 alludes to the fact that this is sub-optimal by saying > > > > [...] it's desirable to avoid including cruft packs in the MIDX > > because it causes the MIDX to store a bunch of objects which are > > likely to get thrown away. > > > > , which is true, but hides an even larger problem. If repositories > > rarely prune their unreachable objects and/or have many of them, the > > MIDX must keep track of a large number of objects which bloats the MIDX > > and slows down object lookup. > > > > This is doubly unfortunate because the vast majority of objects in cruft > > pack(s) are unlikely to be read, but object reads that go through the > > MIDX have to search through them anyway. > > "have to search through them"? That could be read to suggest those > individual objects are read, rather than just traversed over. Maybe > "...unlikely to be read, so the enlarged MIDX is for mostly tracking > known-to-likely-be-irrelevant objects", or something like that? Thanks for pointing out... I clarified this one as well. > > This patch causes geometrically-repacked packs to contain a copy of any > > once-unreachable object(s) with 'git pack-objects --stdin-packs=follow', > > allowing us to avoid including any cruft packs in the MIDX. This is > > because a sequence of geometrically-repacked packs that were all > > generated with '--stdin-packs=follow' are guaranteed to have their union > > be closed under reachability. > > > > Note that you cannot guarantee that a collection of packs is closed > > under reachability if not all of them were generated with following as > > maybe: ...with "follow" as above. "follow" or "following" feels like > it needs quotes so the reader understands its meant as the name of a > mode, rather than a verb in the sentence. > > > above. One tell-tale sign that not all geometrically-repacked packs in > > the MIDX were generated with following is to see if there is a pack in > > same here with "following"...and below. Great calls on both, thanks. > > @@ -808,26 +886,55 @@ static void midx_included_packs(struct string_list *include, > > } > > } > > > > - for_each_string_list_item(item, &existing->cruft_packs) { > > + if (midx_must_contain_cruft || > > + midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr, > > + include, geometry, existing)) { > > /* > > - * When doing a --geometric repack, there is no need to check > > - * for deleted packs, since we're by definition not doing an > > - * ALL_INTO_ONE repack (hence no packs will be deleted). > > - * Otherwise we must check for and exclude any packs which are > > - * enqueued for deletion. > > + * If there are one or more unknown pack(s) present (see > > + * midx_has_unknown_packs() for what makes a pack > > + * "unknown") in the MIDX before the repack, keep them > > + * as they may be required to form a reachability > > + * closure if the MIDX is bitmapped. > > * > > - * So we could omit the conditional below in the --geometric > > - * case, but doing so is unnecessary since no packs are marked > > - * as pending deletion (since we only call > > - * `mark_packs_for_deletion()` when doing an all-into-one > > - * repack). > > + * For example, a cruft pack can be required to form a > > + * reachability closure if the MIDX is bitmapped and one > > + * or more of its selected commits reaches a once-cruft > > + * object that was later made reachable. > > The antecedent of "its" is unclear here; just spell it out to reduce > how much thinking the reader needs to do? Eek, good suggestion again. Thanks, I fixed it up and made the antecedent explicit. > > */ > > - if (pack_is_marked_for_deletion(item)) > > - continue; > > + for_each_string_list_item(item, &existing->cruft_packs) { > > + /* > > + * When doing a --geometric repack, there is no > > + * need to check for deleted packs, since we're > > + * by definition not doing an ALL_INTO_ONE > > + * repack (hence no packs will be deleted). > > + * Otherwise we must check for and exclude any > > + * packs which are enqueued for deletion. > > + * > > + * So we could omit the conditional below in the > > + * --geometric case, but doing so is unnecessary > > + * since no packs are marked as pending > > + * deletion (since we only call > > + * `mark_packs_for_deletion()` when doing an > > + * all-into-one repack). > > + */ > > + if (pack_is_marked_for_deletion(item)) > > + continue; > > > > - strbuf_reset(&buf); > > - strbuf_addf(&buf, "%s.idx", item->string); > > - string_list_insert(include, buf.buf); > > + strbuf_reset(&buf); > > + strbuf_addf(&buf, "%s.idx", item->string); > > + string_list_insert(include, buf.buf); > > + } > > + } else { > > + /* > > + * Modern versions of Git will write new copies of > > + * once-cruft objects when doing a --geometric repack. > > "Modern versions of Git" -> "Modern versions of Git with the > appropriate config setting" ? Heh. Great catch. Can you tell this part was written before I added the configuration option? ;-) Thanks, Taylor
On Mon, Apr 14, 2025 at 07:57:52PM -0700, Elijah Newren wrote: > On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > > > Here is a non-RFC version of my series to explore creating MIDXs while > > repacking that don't include the cruft pack. > > > > The core idea behind this approach is to ensure that packs generated via > > geometric repacking traverse through objects that appear in packs which > > are neither included nor excluded. > > This phrasing feels confusing -- what does it mean for packs to be > neither included nor excluded? Maybe: > > "The core idea behind this approach is to allow some (most) of the > objects in a pack to be excluded, while still including some subset of > objects from that pack as part of the repack. In particular, we > include the objects in that pack which are reachable from the other > objects we repack. This is different from our current handling which > either entirely includes or entirely excludes all objects from a given > pack." I am admittedly having a little bit of a hard time parsing your version of this, but I think this part: [...] In particular, we include the objects in that pack which are reachable from the other objects we repack. isn't quite right. It's not that the output pack contains objects reachable from the other objects we repack, but rather it contains the reachable objects from the other objects we repack *if* those objects don't appear in an excluded pack given as part of the input. > > Then if some commit (for example) in > > a pack reaches some once-unreachable object stored in a cruft pack, the > > pack generated via geometric repacking will pick up and write a copy of > > that object during its traversal. > > > > If you repack consistently using this strategy, you can guarantee that > > the union of geometrically-repacked packs are closed under reachability > > without having to keep track of any cruft pack(s) in the MIDX. > > Also, if you do a single non-geometric repack with this strategy, you > are also closed under reachability, right? Is that the suggested > transition plan for those that want to use this...first do a > non-geometric repack, and then ensure that subsequent geometric > repacks are done with this strategy? Yeah, the last commit gets at this a bit. The property you have to maintain is that the union of geometrically-repacked packs (which form the MIDX) are and stay closed under reachability. I am pretty sure that the way this is constructed, adding new geometrically-repacked packs to the chain does not violate this property[^1]. But you can't guarantee it part of the way through a sequence of geometric repacks, which is what midx_has_unknown_packs() is checking for. If you do an all-into-one cruft repack first, then there is no MIDX to begin with, so there aren't any unknown packs to worry about (since there are no packs in a MIDX to begin with). When that property is met, then we can use the new behavior. Thanks, Taylor [^1]: So long as you don't drop part of the geometric progression, e.g., if you have some pack that was in the existing MIDX, but wasn't repacked or included in the new MIDX.
On Tue, Apr 15, 2025 at 1:45 PM Taylor Blau <me@ttaylorr.com> wrote: > > > @@ -4467,6 +4484,23 @@ static int is_not_in_promisor_pack(struct commit *commit, void *data) { > > > return is_not_in_promisor_pack_obj((struct object *) commit, data); > > > } > > > > > > +static int parse_stdin_packs_mode(const struct option *opt, const char *arg, > > > + int unset) > > > +{ > > > + enum stdin_packs_mode *mode = opt->value; > > > + > > > + if (unset) > > > + *mode = STDIN_PACKS_MODE_NONE; > > > + else if (!arg || !*arg) > > > + *mode = STDIN_PACKS_MODE_STANDARD; > > > > I don't understand why you have both a None mode and a Standard mode, > > especially since the implementation seems to only care about whether > > or not the Follow mode has been set. Shouldn't these both be setting > > mode to the same value? > > I'm not sure I follow your question... stdin_packs is a tri-state. It > can be off, on in standard/legacy mode, or on in follow mode. I was just confused. I looked in the code for STDIN_PACKS_MODE_{NONE,STANDARD,FOLLOW}, and other than initial setup, only the _FOLLOW variant was used anywhere. I overlooked the "if (stdin_packs" usage, which is what further distinguishes between _NONE and _STANDARD. Sorry.
Here is a non-RFC version of my series to explore creating MIDXs while repacking that don't include the cruft pack. The core idea behind this approach is to ensure that packs generated via geometric repacking traverse through objects that appear in packs which are neither included nor excluded. Then if some commit (for example) in a pack reaches some once-unreachable object stored in a cruft pack, the pack generated via geometric repacking will pick up and write a copy of that object during its traversal. If you repack consistently using this strategy, you can guarantee that the union of geometrically-repacked packs are closed under reachability without having to keep track of any cruft pack(s) in the MIDX. This version has a couple of minor changes from the RFC: - Before using a designated initializer to setup a 'struct object_info', add a new preparatory commit to explain that such designated initializers rely on the default value for non-initialized fields to be zero'd. - Less cruft-pack specific reasoning for when repack can use this new mode (thanks to a helpful discussion with Peff while thinking through and talking about these changes). - A new 'repack.midxMustContainCruft' configuration knob to opt-in to this new behavior. - More readable (IMHO) test scripts. I think this version is sufficiently ready for review. I'm going to deploy a copy of this within GitHub's infrastructure and see how it behaves on a single replica of an internal repository over a ~week and report back. Thanks in advance for any review in the meantime :-). Taylor Blau (8): pack-objects: use standard option incompatibility functions object-store-ll.h: add note about designated initializers pack-objects: limit scope in 'add_object_entry_from_pack()' pack-objects: factor out handling '--stdin-packs' pack-objects: declare 'rev_info' for '--stdin-packs' earlier pack-objects: perform name-hash traversal for unpacked objects pack-objects: introduce '--stdin-packs=follow' repack: exclude cruft pack(s) from the MIDX where possible Documentation/config/repack.adoc | 7 + Documentation/git-pack-objects.adoc | 8 +- builtin/pack-objects.c | 193 +++++++++++++++++----------- builtin/repack.c | 162 ++++++++++++++++++++--- object-store-ll.h | 8 ++ t/t5331-pack-objects-stdin.sh | 103 ++++++++++++++- t/t7704-repack-cruft.sh | 90 +++++++++++++ 7 files changed, 478 insertions(+), 93 deletions(-) Range-diff against v1: 1: 63fb4dab30 = 1: 65bc7e4630 pack-objects: use standard option incompatibility functions -: ---------- > 2: 920c91eb1e object-store-ll.h: add note about designated initializers 2: 6357633f6d = 3: f8ac36b110 pack-objects: limit scope in 'add_object_entry_from_pack()' 3: 43e889b157 = 4: 5e03b482ba pack-objects: factor out handling '--stdin-packs' 4: 07a91be3ec = 5: bccbac2ec5 pack-objects: declare 'rev_info' for '--stdin-packs' earlier 5: 241f7c87e5 = 6: 0bc2183dc3 pack-objects: perform name-hash traversal for unpacked objects 6: a0318321ec = 7: 697a337cb1 pack-objects: introduce '--stdin-packs=follow' 7: ef0bc38cf0 < -: ---------- repack: keep track of existing MIDX'd packs 8: 19b69c1246 ! 8: a2ec1b826c repack: exclude cruft pack(s) from the MIDX where possible @@ Commit message Note that you cannot guarantee that a collection of packs is closed under reachability if not all of them were generated with following as above. One tell-tale sign that not all geometrically-repacked packs in - the MIDX were generated with following is to see if there is a cruft - pack already in the MIDX. + the MIDX were generated with following is to see if there is a pack in + the existing MIDX that is not going to be somehow represented (either + verbatim or as part of a geometric rollup) in the new MIDX. If there is, then starting to generate packs with following during geometric repacking won't work, since it's open to the same race as @@ Commit message under reachability. Detect when this is the case and avoid including cruft packs in the MIDX - where possible. + where possible. The existing behavior remains the default, and the new + behavior is available with the config 'repack.midxMustIncludeCruft' set + to 'false'. Signed-off-by: Taylor Blau <me@ttaylorr.com> + ## Documentation/config/repack.adoc ## +@@ Documentation/config/repack.adoc: repack.cruftThreads:: + a cruft pack and the respective parameters are not given over + the command line. See similarly named `pack.*` configuration + variables for defaults and meaning. ++ ++repack.midxMustContainCruft:: ++ When set to true, linkgit:git-repack[1] will unconditionally include ++ cruft pack(s), if any, in the multi-pack index when invoked with ++ `--write-midx`. When false, cruft packs are only included in the MIDX ++ when necessary (e.g., because they might be required to form a ++ reachability closure with MIDX bitmaps). Defaults to true. + ## builtin/repack.c ## -@@ builtin/repack.c: static void pack_mark_in_midx(struct string_list_item *item) - item->util = (void*)((uintptr_t)item->util | PACK_IN_MIDX); +@@ builtin/repack.c: static int write_bitmaps = -1; + static int use_delta_islands; + static int run_update_server_info = 1; + static char *packdir, *packtmp_name, *packtmp; ++static int midx_must_contain_cruft = 1; + + static const char *const git_repack_usage[] = { + N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n" +@@ builtin/repack.c: static int repack_config(const char *var, const char *value, + free(cruft_po_args->threads); + return git_config_string(&cruft_po_args->threads, var, value); + } ++ if (!strcmp(var, "repack.midxmustcontaincruft")) { ++ midx_must_contain_cruft = git_config_bool(var, value); ++ return 0; ++ } + return git_default_config(var, value, ctx, cb); + } + +@@ builtin/repack.c: static void free_pack_geometry(struct pack_geometry *geometry) + free(geometry->pack); } -+static int pack_is_in_midx(struct string_list_item *item) ++static int midx_has_unknown_packs(char **midx_pack_names, ++ size_t midx_pack_names_nr, ++ struct string_list *include, ++ struct pack_geometry *geometry, ++ struct existing_packs *existing) +{ -+ return (uintptr_t)item->util & PACK_IN_MIDX; -+} ++ size_t i; + -+static int existing_has_cruft_in_midx(struct existing_packs *existing) -+{ -+ struct string_list_item *item; -+ for_each_string_list_item(item, &existing->cruft_packs) { -+ if (pack_is_in_midx(item)) -+ return 1; ++ string_list_sort(include); ++ ++ for (i = 0; i < midx_pack_names_nr; i++) { ++ const char *pack_name = midx_pack_names[i]; ++ ++ /* ++ * Determine whether or not each MIDX'd pack from the existing ++ * MIDX (if any) is represented in the new MIDX. For each pack ++ * in the MIDX, it must either be: ++ * ++ * - In the "include" list of packs to be included in the new ++ * MIDX. Note this function is called before the include ++ * list is populated with any cruft pack(s). ++ * ++ * - Below the geometric split line (if using pack geometry), ++ * indicating that the pack won't be included in the new ++ * MIDX, but its contents were rolled up as part of the ++ * geometric repack. ++ * ++ * - In the existing non-kept packs list (if not using pack ++ * geometry), and marked as non-deleted. ++ */ ++ if (string_list_has_string(include, pack_name)) { ++ continue; ++ } else if (geometry) { ++ struct strbuf buf = STRBUF_INIT; ++ uint32_t j; ++ ++ for (j = 0; j < geometry->split; j++) { ++ strbuf_reset(&buf); ++ strbuf_addstr(&buf, pack_basename(geometry->pack[j])); ++ strbuf_strip_suffix(&buf, ".pack"); ++ strbuf_addstr(&buf, ".idx"); ++ ++ if (!strcmp(pack_name, buf.buf)) { ++ strbuf_release(&buf); ++ break; ++ } ++ } ++ ++ strbuf_release(&buf); ++ ++ if (j < geometry->split) ++ continue; ++ } else { ++ struct string_list_item *item; ++ ++ item = string_list_lookup(&existing->non_kept_packs, ++ pack_name); ++ if (item && !pack_is_marked_for_deletion(item)) ++ continue; ++ } ++ ++ /* ++ * If we got to this point, the MIDX includes some pack that we ++ * don't know about. ++ */ ++ return 1; + } ++ + return 0; +} + - static void mark_packs_for_deletion_1(struct string_list *names, - struct string_list *list) + struct midx_snapshot_ref_data { + struct tempfile *f; + struct oidset seen; +@@ builtin/repack.c: static void midx_snapshot_refs(struct tempfile *f) + + static void midx_included_packs(struct string_list *include, + struct existing_packs *existing, ++ char **midx_pack_names, ++ size_t midx_pack_names_nr, + struct string_list *names, + struct pack_geometry *geometry) { @@ builtin/repack.c: static void midx_included_packs(struct string_list *include, } } - for_each_string_list_item(item, &existing->cruft_packs) { -+ if (existing_has_cruft_in_midx(existing)) { ++ if (midx_must_contain_cruft || ++ midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr, ++ include, geometry, existing)) { /* - * When doing a --geometric repack, there is no need to check - * for deleted packs, since we're by definition not doing an - * ALL_INTO_ONE repack (hence no packs will be deleted). - * Otherwise we must check for and exclude any packs which are - * enqueued for deletion. -+ * If we had one or more cruft pack(s) present in the -+ * MIDX before the repack, keep them as they may be -+ * required to form a reachability closure if the MIDX -+ * is bitmapped. ++ * If there are one or more unknown pack(s) present (see ++ * midx_has_unknown_packs() for what makes a pack ++ * "unknown") in the MIDX before the repack, keep them ++ * as they may be required to form a reachability ++ * closure if the MIDX is bitmapped. * - * So we could omit the conditional below in the --geometric - * case, but doing so is unnecessary since no packs are marked - * as pending deletion (since we only call - * `mark_packs_for_deletion()` when doing an all-into-one - * repack). -+ * A cruft pack can be required to form a reachability -+ * closure if the MIDX is bitmapped and one or more of -+ * its selected commits reaches a once-cruft object that -+ * was later made reachable. ++ * For example, a cruft pack can be required to form a ++ * reachability closure if the MIDX is bitmapped and one ++ * or more of its selected commits reaches a once-cruft ++ * object that was later made reachable. */ - if (pack_is_marked_for_deletion(item)) - continue; @@ builtin/repack.c: static void midx_included_packs(struct string_list *include, } strbuf_release(&buf); +@@ builtin/repack.c: int cmd_repack(int argc, + struct tempfile *refs_snapshot = NULL; + int i, ext, ret; + int show_progress; ++ char **midx_pack_names = NULL; ++ size_t midx_pack_names_nr = 0; + + /* variables to be filled by option parsing */ + int delete_redundant = 0; @@ builtin/repack.c: int cmd_repack(int argc, !(pack_everything & PACK_CRUFT)) strvec_push(&cmd.args, "--pack-loose-unreachable"); } else if (geometry.split_factor) { - strvec_push(&cmd.args, "--stdin-packs"); -+ if (existing_has_cruft_in_midx(&existing)) ++ if (midx_must_contain_cruft) + strvec_push(&cmd.args, "--stdin-packs"); + else + strvec_push(&cmd.args, "--stdin-packs=follow"); strvec_push(&cmd.args, "--unpacked"); } else { strvec_push(&cmd.args, "--unpacked"); +@@ builtin/repack.c: int cmd_repack(int argc, + + string_list_sort(&names); + ++ if (get_local_multi_pack_index(the_repository)) { ++ uint32_t i; ++ struct multi_pack_index *m = ++ get_local_multi_pack_index(the_repository); ++ ++ ALLOC_ARRAY(midx_pack_names, m->num_packs); ++ for (i = 0; i < m->num_packs; i++) ++ midx_pack_names[midx_pack_names_nr++] = xstrdup(m->pack_names[i]); ++ } ++ + close_object_store(the_repository->objects); + + /* +@@ builtin/repack.c: int cmd_repack(int argc, + + if (write_midx) { + struct string_list include = STRING_LIST_INIT_DUP; +- midx_included_packs(&include, &existing, &names, &geometry); ++ midx_included_packs(&include, &existing, midx_pack_names, ++ midx_pack_names_nr, &names, &geometry); + + ret = write_midx_included_packs(&include, &geometry, &names, + refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL, +@@ builtin/repack.c: int cmd_repack(int argc, + string_list_clear(&names, 1); + existing_packs_release(&existing); + free_pack_geometry(&geometry); ++ for (size_t i = 0; i < midx_pack_names_nr; i++) ++ free(midx_pack_names[i]); ++ free(midx_pack_names); + pack_objects_args_release(&po_args); + pack_objects_args_release(&cruft_po_args); + ## t/t7704-repack-cruft.sh ## @@ t/t7704-repack-cruft.sh: test_expect_success 'cruft repack respects --quiet' ' ) ' -+test_expect_success 'repack --write-midx excludes cruft where possible' ' -+ git init exclude-cruft-when-possible && ++setup_cruft_exclude_tests() { ++ git init "$1" && + ( -+ cd exclude-cruft-when-possible && ++ cd "$1" && ++ ++ git config repack.midxMustContainCruft false && + + test_commit one && + @@ t/t7704-repack-cruft.sh: test_expect_success 'cruft repack respects --quiet' ' + test_commit --no-tag three && + three="$(git rev-parse HEAD)" && + git reset --hard one && -+ + git reflog expire --all --expire=all && + -+ git repack --cruft -d && -+ ls $packdir/pack-*.idx | sort >packs.before && ++ GIT_TEST_MULTI_PACK_INDEX=0 git repack --cruft -d && + + git merge $two && -+ test_commit four && ++ test_commit four ++ ) ++} ++ ++test_expect_success 'repack --write-midx excludes cruft where possible' ' ++ setup_cruft_exclude_tests exclude-cruft-when-possible && ++ ( ++ cd exclude-cruft-when-possible && ++ ++ GIT_TEST_MULTI_PACK_INDEX=0 \ + git repack -d --geometric=2 --write-midx --write-bitmap-index && -+ ls $packdir/pack-*.idx | sort >packs.after && + -+ comm -13 packs.before packs.after >packs.new && -+ test_line_count = 1 packs.new && ++ test-tool read-midx --show-objects $objdir >midx && ++ cruft="$(ls $packdir/*.mtimes)" && ++ test_grep ! "$(basename "$cruft" .mtimes).idx" midx && + -+ git rev-list --objects --no-object-names one..four >expect.raw && -+ sort expect.raw >expect && ++ git rev-list --all --objects --no-object-names >reachable.raw && ++ sort reachable.raw >reachable.objects && ++ awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && + -+ git show-index <$(cat packs.new) >actual.raw && -+ cut -d" " -f2 actual.raw | sort >actual && ++ test_cmp reachable.objects midx.objects ++ ) ++' + -+ test_cmp expect actual && ++test_expect_success 'repack --write-midx includes cruft when instructed' ' ++ setup_cruft_exclude_tests exclude-cruft-when-instructed && ++ ( ++ cd exclude-cruft-when-instructed && + -+ test-tool read-midx --show-objects $objdir >actual.raw && -+ grep "\.pack$" actual.raw | cut -d" " -f1 | sort >actual.objects && -+ git rev-list --objects --no-object-names HEAD >expect.raw && -+ sort expect.raw >expect.objects && ++ GIT_TEST_MULTI_PACK_INDEX=0 \ ++ git -c repack.midxMustContainCruft=true repack \ ++ -d --geometric=2 --write-midx --write-bitmap-index && + -+ test_cmp expect.objects actual.objects && ++ test-tool read-midx --show-objects $objdir >midx && ++ cruft="$(ls $packdir/*.mtimes)" && ++ test_grep "$(basename "$cruft" .mtimes).idx" midx && + -+ cruft="$(basename $(ls $packdir/*.mtimes))" && -+ grep "^pack-" actual.raw >actual.packs && -+ ! test_grep "${cruft%.mtimes}.idx" actual.packs ++ git cat-file --batch-check="%(objectname)" --batch-all-objects \ ++ >all.objects && ++ awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && ++ ++ test_cmp all.objects midx.objects + ) +' + +test_expect_success 'repack --write-midx includes cruft when necessary' ' ++ setup_cruft_exclude_tests exclude-cruft-when-necessary && + ( -+ cd exclude-cruft-when-possible && ++ cd exclude-cruft-when-necessary && + ++ test_path_is_file $(ls $packdir/pack-*.mtimes) && + ls $packdir/pack-*.idx | sort >packs.all && + grep -o "pack-.*\.idx$" packs.all >in && + + git multi-pack-index write --stdin-packs --bitmap <in && + + test_commit five && ++ GIT_TEST_MULTI_PACK_INDEX=0 \ + git repack -d --geometric=2 --write-midx --write-bitmap-index && + -+ test-tool read-midx --show-objects $objdir >actual.raw && -+ grep "\.pack$" actual.raw | cut -d" " -f1 | sort >actual.objects && ++ test-tool read-midx --show-objects $objdir >midx && ++ awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects && + git cat-file --batch-all-objects --batch-check="%(objectname)" \ + >expect.objects && -+ test_cmp expect.objects actual.objects && ++ test_cmp expect.objects midx.objects && + -+ grep "^pack-" actual.raw >actual.packs && -+ test_line_count = "$(($(wc -l <packs.all) + 1))" actual.packs ++ grep "^pack-" midx >midx.packs && ++ test_line_count = "$(($(wc -l <packs.all) + 1))" midx.packs + ) +' + base-commit: 485f5f863615e670fd97ae40af744e14072cfe18