Message ID | b75869befba26899d88d6c6d413cc756aeadbd80.1701198172.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | pack-objects: multi-pack verbatim reuse | expand |
On Tue, Nov 28, 2023 at 02:08:18PM -0500, Taylor Blau wrote: > Now that we can generate packs which are disjoint with respect to the > set of currently-disjoint packs, implement a mode of `git repack` which > extends the set of disjoint packs with any new (non-cruft) pack(s) > generated during the repack. > > The idea is mostly straightforward, with a couple of gotcha's. The > straightforward part is to make sure that any new packs are disjoint > with respect to the set of currently disjoint packs which are _not_ > being removed from the repository as a result of the repack. > > If a pack which is currently marked as disjoint is, on the other hand, > about to be removed from the repository, it is OK (and expected) that > new pack(s) will contain some or all of its objects. Since the pack > originally marked as disjoint will be removed, it will necessarily leave > the disjoint set, making room for new packs with its same objects to > take its place. In other words, the resulting set of disjoint packs will > be disjoint with respect to one another. > > The gotchas mostly have to do with making sure that we do not generate a > disjoint pack in the following scenarios: Okay, let me verify whether I understand the reasons: > - promisor packs Which is because promisor packs actually don't contain any objects? > - cruft packs (which may necessarily need to include an object from a > disjoint pack in order to freshen it in certain circumstances) This one took me a while to figure out. If we'd mark crufts as disjoint, then it would mean that new packfiles cannot be marked as disjoint if objects which were previously unreachable do become reachable again. So we'd be pessimizing packfiles for live objects in favor of others which aren't. > - all-into-one repacks without '-d' Because here the old packfiles that this would make redundant aren't deleted and thus the objects are duplicate now. > - `--filter-to`, which conceptually could work with the new > `--extend-disjoint` option, but only in limited circumstances We're probably also not properly set up to handle the new alternate object directory and exclude objects that are part of a potentially disjoint packfile that exists already. Also, the current MIDX may not even cover the alternate. > Otherwise, we mark which packs were created as disjoint by using a new > bit in the `generated_pack_data` struct, and then marking those pack(s) > as disjoint accordingly when generating the MIDX. Non-deleted packs > which are marked as disjoint are retained as such by passing the > equivalent of `--retain-disjoint` when calling the MIDX API to update > the MIDX. Okay. I had a bit of trouble to sift through the various different flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint" and so on. But well, they do different things and it's been a few days since I've reviewed the preceding patches, so this is probably fine. One thing I wondered: do we need to consider the `-l` flag? When using an alternate object directory it is totally feasible that the alternate may be creating new disjoint packages without us knowing, and thus we may not be able to guarantee the disjoint property anymore. Patrick
On Thu, Dec 07, 2023 at 02:13:08PM +0100, Patrick Steinhardt wrote: > > The gotchas mostly have to do with making sure that we do not generate a > > disjoint pack in the following scenarios: > > Okay, let me verify whether I understand the reasons: > > > - promisor packs > > Which is because promisor packs actually don't contain any objects? Right. > > - cruft packs (which may necessarily need to include an object from a > > disjoint pack in order to freshen it in certain circumstances) > > This one took me a while to figure out. If we'd mark crufts as disjoint, > then it would mean that new packfiles cannot be marked as disjoint if > objects which were previously unreachable do become reachable again. > So we'd be pessimizing packfiles for live objects in favor of others > which aren't. Yeah, that's right, too. There are a couple of cases where more than one cruft pack may contain the same object, one of them being the flip-flopping between reachable and unreachable as you suggest above. Another is that you have a non-prunable unreachable object which is already in a cruft pack. If the object's mtime gets updated (and still cannot be pruned), we'll end up freshening the object loose, and then packing it again (with the more recent mtime) into a new cruft pack. That aside, I actually think that there are ways to mark cruft packs disjoint. But they're complicated, and moreover, I don't think you'd ever *want* to mark a cruft pack as disjoint. Cruft packs usually contain garbage, which is unlikely to be useful to any fetches/clones. If we did mark them as disjoint, it would mean that we could reuse verbatim sections of the cruft pack in our output, but we would likely end up with very few such sections. > > - all-into-one repacks without '-d' > > Because here the old packfiles that this would make redundant aren't > deleted and thus the objects are duplicate now. Yep. > > Otherwise, we mark which packs were created as disjoint by using a new > > bit in the `generated_pack_data` struct, and then marking those pack(s) > > as disjoint accordingly when generating the MIDX. Non-deleted packs > > which are marked as disjoint are retained as such by passing the > > equivalent of `--retain-disjoint` when calling the MIDX API to update > > the MIDX. > > Okay. I had a bit of trouble to sift through the various different > flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint" > and so on. But well, they do different things and it's been a few days > since I've reviewed the preceding patches, so this is probably fine. Yeah, I am definitely open to better naming conventions here? I figured that: - --retain-disjoint was a good name for the MIDX option, since it is retaining existing disjoint packs in the new MIDX - --extend-disjoint was a good name for the repack option, since it is extending the set of disjoint packs - --ignore-disjoint was a good name for the pack-objects option, since it is ignoring objects in disjoint packs Writing this out, I think that you could make an argument that `--exclude-disjoint` is a better name for the last option. So I'm definitely open to suggestions here, but I don't want to get too bogged down on command-line option naming (so long as we're all reasonably happy with the result). > One thing I wondered: do we need to consider the `-l` flag? When using > an alternate object directory it is totally feasible that the alternate > may be creating new disjoint packages without us knowing, and thus we > may not be able to guarantee the disjoint property anymore. I don't think so. We'd only care about one direction of this (that alternates do not create disjoint packs which overlap with ours, instead of the other way around), but since we don't put non-local packs in the MIDX, I think we're OK. I suppose that you might run into trouble if you use the chained MIDX thing (via its `->next` pointer). I haven't used that feature myself, so I'd have to play around with it. Thanks, Taylor
On Thu, Dec 07, 2023 at 03:28:18PM -0500, Taylor Blau wrote: > On Thu, Dec 07, 2023 at 02:13:08PM +0100, Patrick Steinhardt wrote: > > > - cruft packs (which may necessarily need to include an object from a > > > disjoint pack in order to freshen it in certain circumstances) > > > > This one took me a while to figure out. If we'd mark crufts as disjoint, > > then it would mean that new packfiles cannot be marked as disjoint if > > objects which were previously unreachable do become reachable again. > > So we'd be pessimizing packfiles for live objects in favor of others > > which aren't. > > Yeah, that's right, too. There are a couple of cases where more than one > cruft pack may contain the same object, one of them being the > flip-flopping between reachable and unreachable as you suggest above. > Another is that you have a non-prunable unreachable object which is > already in a cruft pack. If the object's mtime gets updated (and still > cannot be pruned), we'll end up freshening the object loose, and then > packing it again (with the more recent mtime) into a new cruft pack. > > That aside, I actually think that there are ways to mark cruft packs > disjoint. But they're complicated, and moreover, I don't think you'd > ever *want* to mark a cruft pack as disjoint. Cruft packs usually > contain garbage, which is unlikely to be useful to any fetches/clones. > > If we did mark them as disjoint, it would mean that we could reuse > verbatim sections of the cruft pack in our output, but we would likely > end up with very few such sections. Makes sense. It also doesn't feel worth it to introduce additional complexity for objects that for most of the part wouldn't ever be served on a fetch anyway. [snip] > > Okay. I had a bit of trouble to sift through the various different > > flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint" > > and so on. But well, they do different things and it's been a few days > > since I've reviewed the preceding patches, so this is probably fine. > > Yeah, I am definitely open to better naming conventions here? I figured > that: > > - --retain-disjoint was a good name for the MIDX option, since it is > retaining existing disjoint packs in the new MIDX > - --extend-disjoint was a good name for the repack option, since it is > extending the set of disjoint packs > - --ignore-disjoint was a good name for the pack-objects option, since > it is ignoring objects in disjoint packs > > Writing this out, I think that you could make an argument that > `--exclude-disjoint` is a better name for the last option. So I'm > definitely open to suggestions here, but I don't want to get too bogged > down on command-line option naming (so long as we're all reasonably > happy with the result). Yeah, as said, I don't mind it too much. It's a complex area and the flags all do different things, so it's expected that you may have to do some research on what exactly they do. That being said, I do like your proposed `--exclude-disjoint` a lot more than `--ignore-disjoint`. > > One thing I wondered: do we need to consider the `-l` flag? When using > > an alternate object directory it is totally feasible that the alternate > > may be creating new disjoint packages without us knowing, and thus we > > may not be able to guarantee the disjoint property anymore. > > I don't think so. We'd only care about one direction of this (that > alternates do not create disjoint packs which overlap with ours, instead > of the other way around), but since we don't put non-local packs in the > MIDX, I think we're OK. > > I suppose that you might run into trouble if you use the chained MIDX > thing (via its `->next` pointer). I haven't used that feature myself, so > I'd have to play around with it. We do use this feature at GitLab for forks, where forks connect to a common alternate object directory to deduplicate objects. As both the fork repository and the alternate object directory use an MIDX I think they would be set up exactly like that. I guess the only really viable solution here is to ignore disjoint packs in the main repo that connects to the alternate in the case where the alternate has any disjoint packs itself. Patrick
On Fri, Dec 08, 2023 at 09:19:25AM +0100, Patrick Steinhardt wrote: > > Writing this out, I think that you could make an argument that > > `--exclude-disjoint` is a better name for the last option. So I'm > > definitely open to suggestions here, but I don't want to get too bogged > > down on command-line option naming (so long as we're all reasonably > > happy with the result). > > Yeah, as said, I don't mind it too much. It's a complex area and the > flags all do different things, so it's expected that you may have to do > some research on what exactly they do. That being said, I do like your > proposed `--exclude-disjoint` a lot more than `--ignore-disjoint`. I think that's fair, I renamed the option to be "--exclude-disjoint" instead of "--ignore-disjoint" for any subsequent round(s) of this series. > > > One thing I wondered: do we need to consider the `-l` flag? When using > > > an alternate object directory it is totally feasible that the alternate > > > may be creating new disjoint packages without us knowing, and thus we > > > may not be able to guarantee the disjoint property anymore. > > > > I don't think so. We'd only care about one direction of this (that > > alternates do not create disjoint packs which overlap with ours, instead > > of the other way around), but since we don't put non-local packs in the > > MIDX, I think we're OK. > > > > I suppose that you might run into trouble if you use the chained MIDX > > thing (via its `->next` pointer). I haven't used that feature myself, so > > I'd have to play around with it. > > We do use this feature at GitLab for forks, where forks connect to a > common alternate object directory to deduplicate objects. As both the > fork repository and the alternate object directory use an MIDX I think > they would be set up exactly like that. Yep, that's right. I wasn't sure whether or not this feature had been used extensively in production or not (we don't use it at GitHub, since objects only live in their fork repositories for a short while before moving to the fork network repository). > I guess the only really viable solution here is to ignore disjoint packs > in the main repo that connects to the alternate in the case where the > alternate has any disjoint packs itself. I think the behavior you'd get here is that we'd only look for disjoint packs in the first MIDX in the chain (which is always the local one), and we'd only recognizes packs from that MIDX as being potentially disjoint. If you have the bulk of your repositories in the alternate, then I think you might want to consider how we combine the two. My sense is that you'd want to be disjoint with respect to anything downstream of you. Whether or not this is a feature that you/others need, I definitely think we should leave it out of this series, since I am (a) fairly certain that this is possible to do, and (b) already feel like this series on its own is complicated enough. Thanks, Taylor
On Fri, Dec 08, 2023 at 05:48:28PM -0500, Taylor Blau wrote: > On Fri, Dec 08, 2023 at 09:19:25AM +0100, Patrick Steinhardt wrote: > > > > One thing I wondered: do we need to consider the `-l` flag? When using > > > > an alternate object directory it is totally feasible that the alternate > > > > may be creating new disjoint packages without us knowing, and thus we > > > > may not be able to guarantee the disjoint property anymore. > > > > > > I don't think so. We'd only care about one direction of this (that > > > alternates do not create disjoint packs which overlap with ours, instead > > > of the other way around), but since we don't put non-local packs in the > > > MIDX, I think we're OK. > > > > > > I suppose that you might run into trouble if you use the chained MIDX > > > thing (via its `->next` pointer). I haven't used that feature myself, so > > > I'd have to play around with it. > > > > We do use this feature at GitLab for forks, where forks connect to a > > common alternate object directory to deduplicate objects. As both the > > fork repository and the alternate object directory use an MIDX I think > > they would be set up exactly like that. > > Yep, that's right. I wasn't sure whether or not this feature had been > used extensively in production or not (we don't use it at GitHub, since > objects only live in their fork repositories for a short while before > moving to the fork network repository). > > > I guess the only really viable solution here is to ignore disjoint packs > > in the main repo that connects to the alternate in the case where the > > alternate has any disjoint packs itself. > > I think the behavior you'd get here is that we'd only look for disjoint > packs in the first MIDX in the chain (which is always the local one), > and we'd only recognizes packs from that MIDX as being potentially > disjoint. > > If you have the bulk of your repositories in the alternate, then I think > you might want to consider how we combine the two. Yes, in the general case the bulk of objects is indeed contained in the alternate. > My sense is that you'd want to be disjoint with respect to anything > downstream of you. Ideally yes, but this is unfortunately not easily achievable in the general case. It's one of the many painpoints that alternates bring with them. Suppose two forks A and B are connected to the same alternate. Both A and B now gain the same set of objects via whatever means. At this point these objects can be stored in disjoint packs in each of the repos as they are not yet deduplicated via the alternate. But if you were to pull objects from either A or B into the alternate then you cannot ensure disjointedness at all anymore because you would first have to repack objects in both A and B. For two forks this might still seem manageable. But as soon as your fork network grows larger it's clear that this becomes almost impossible to do. So ultimately, I don't see an alternative to ignoring disjointedness bits in either of the two linked-together repos. > Whether or not this is a feature that you/others need, I definitely > think we should leave it out of this series, since I am (a) fairly > certain that this is possible to do, and (b) already feel like this > series on its own is complicated enough. I'm perfectly fine if we say that the benefits of your patch series cannot yet be applied to repositories with alternates. But from my point of view it's a requirement that this patch series does not silently break this usecase due to Git starting to generate disjointed packs where it cannot ensure that the disjointedness property holds. As I haven't yet read through this whole patch series, the question is boils down to whether the end result is opt-in or opt-out. If it was opt-out then I could see the above usecase breaking silently. If it was opt-in then things should be fine and we can address this ommission in a follow up patch series. We at GitLab would definitely be interested in helping out with this given that it directly affects us and that the demonstrated savings seem very promising. Patrick
On Mon, Dec 11, 2023 at 09:18:32AM +0100, Patrick Steinhardt wrote: > > If you have the bulk of your repositories in the alternate, then I think > > you might want to consider how we combine the two. > > Yes, in the general case the bulk of objects is indeed contained in the > alternate. I thought about this for a while this morning, and think that while it is possible, I'm not sure I can think of a compelling use-case where you'd want to reuse objects from packs across an alternate boundary. On the "I think it's possible front": The challenge is making sure that the set of disjoint packs among each object store is globally disjoint in one direction along the alternate path. I think the rule would require you to honor the disjointed-ness of any packs in alternate(s) you might have when constructing new disjoint packs. So if repository fork.git is an alternate of network.git (and both had long-lived MIDXs), network.git is free to make any set of disjoint packs it chooses, and fork.git can only create disjoint packs which are disjoint with respect to (a) the other disjoint packs in fork.git (if any), and (b) the disjoint packs in network.git (and recursively for any repositories that network.git is an alternate of in the general case). Now on the "I can't think of a compelling use-case front": I think the only reason you'd want to be able to reuse objects from MIDXs across the alternates boundary is if you have MIDX bitmaps in both the repository and its alternate. Indeed, the only time that we kick in pack-reuse in general is when we have a bitmap, so in order to reuse objects from both the repo and its alternate, you'd have to have a bitmap in both repositories. But having a MIDX bitmap means that any packs in the MIDX for which you're generating a bitmap have to be closed over object reachability. So unless the repository and its alternate have totally distinct lines of history (in which case, I'm not sure you would want to share objects between the two in the first place), any pack you bitmap in the child repository fundamentally couldn't be disjoint with respect to its parent. This is because if it were to be disjoint, it would have to be repacked with '-l' (or some equivalent '-l'-like flag that only ignores non-local packs which are marked as disjoint). But if you exclude those objects and any one (or more) of them is reachable from some object(s) you didn't exclude, you wouldn't be able to generate a bitmap in the first place. It's very possible that there's something about your setup that I'm not fully grokking, but I don't think in general this is something that we'd want to do (even if it is theoretically possible). > > Whether or not this is a feature that you/others need, I definitely > > think we should leave it out of this series, since I am (a) fairly > > certain that this is possible to do, and (b) already feel like this > > series on its own is complicated enough. > > I'm perfectly fine if we say that the benefits of your patch series > cannot yet be applied to repositories with alternates. But from my point > of view it's a requirement that this patch series does not silently > break this usecase due to Git starting to generate disjointed packs > where it cannot ensure that the disjointedness property holds. I think one thing you could reasonably do is use *only* the non-local MIDX bitmaps when doing pack reuse. Currently we'll use the first MIDX we find, which is guaranteed to be the local one, if it exists. This was the case before this series, and this series does not change that behavior. Unless you had a pack bitmap in the alternated repository (which I think is unlikely, since it would require a full reachability closure, thus eliminating any de-duplication benefits you'd otherwise get when using alternates), you'd be find before and after this series. > As I haven't yet read through this whole patch series, the question is > boils down to whether the end result is opt-in or opt-out. If it was > opt-out then I could see the above usecase breaking silently. If it was > opt-in then things should be fine and we can address this ommission in a > follow up patch series. We at GitLab would definitely be interested in > helping out with this given that it directly affects us and that the > demonstrated savings seem very promising. The end result is opt-in, you have to change the `pack.allowPackReuse` configuration from its default value of "true" (or the alternate spelling taught in this series, "single") to "multi" in order to enable the new behavior. Thanks, Taylor
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt index c902512a9e..50ba5e7f9c 100644 --- a/Documentation/git-repack.txt +++ b/Documentation/git-repack.txt @@ -249,6 +249,18 @@ linkgit:git-multi-pack-index[1]). Write a multi-pack index (see linkgit:git-multi-pack-index[1]) containing the non-redundant packs. +--extend-disjoint:: + Extends the set of disjoint packs. All new non-cruft pack(s) + generated are constructed to be disjoint with respect to the set + of currently disjoint packs, excluding any packs that will be + removed as a result of the repack operation. For more on + disjoint packs, see the details in linkgit:gitformat-pack[5], + under the section "`DISP` chunk and disjoint packs". ++ +Useful only with the combination of `--write-midx` and +`--write-bitmap-index`. Incompatible with `--filter-to`. Incompatible +with `-A`, `-a`, or `--cruft` unless `-d` is given. + CONFIGURATION ------------- diff --git a/builtin/repack.c b/builtin/repack.c index edaee4dbec..0601bd16c4 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -58,6 +58,7 @@ struct pack_objects_args { int no_reuse_object; int quiet; int local; + int ignore_disjoint; struct list_objects_filter_options filter_options; }; @@ -293,6 +294,8 @@ static void prepare_pack_objects(struct child_process *cmd, strvec_push(&cmd->args, "--local"); if (args->quiet) strvec_push(&cmd->args, "--quiet"); + if (args->ignore_disjoint) + strvec_push(&cmd->args, "--ignore-disjoint"); if (delta_base_offset) strvec_push(&cmd->args, "--delta-base-offset"); strvec_push(&cmd->args, out); @@ -334,9 +337,11 @@ static struct { struct generated_pack_data { struct tempfile *tempfiles[ARRAY_SIZE(exts)]; + unsigned disjoint : 1; }; -static struct generated_pack_data *populate_pack_exts(const char *name) +static struct generated_pack_data *populate_pack_exts(const char *name, + unsigned disjoint) { struct stat statbuf; struct strbuf path = STRBUF_INIT; @@ -353,6 +358,8 @@ static struct generated_pack_data *populate_pack_exts(const char *name) data->tempfiles[i] = register_tempfile(path.buf); } + data->disjoint = disjoint; + strbuf_release(&path); return data; } @@ -379,6 +386,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args, prepare_pack_objects(&cmd, args, packtmp); cmd.in = -1; + strvec_pushf(&cmd.args, "--no-ignore-disjoint"); + /* * NEEDSWORK: Giving pack-objects only the OIDs without any ordering * hints may result in suboptimal deltas in the resulting pack. See if @@ -421,7 +430,7 @@ static void repack_promisor_objects(const struct pack_objects_args *args, line.buf); write_promisor_file(promisor_name, NULL, 0); - item->util = populate_pack_exts(item->string); + item->util = populate_pack_exts(item->string, 0); free(promisor_name); } @@ -731,8 +740,13 @@ static void midx_included_packs(struct string_list *include, for_each_string_list_item(item, &existing->kept_packs) string_list_insert(include, xstrfmt("%s.idx", item->string)); - for_each_string_list_item(item, names) - string_list_insert(include, xstrfmt("pack-%s.idx", item->string)); + for_each_string_list_item(item, names) { + const char *marker = ""; + struct generated_pack_data *data = item->util; + if (data->disjoint) + marker = "+"; + string_list_insert(include, xstrfmt("%spack-%s.idx", marker, item->string)); + } if (geometry->split_factor) { struct strbuf buf = STRBUF_INIT; uint32_t i; @@ -788,7 +802,8 @@ static int write_midx_included_packs(struct string_list *include, struct pack_geometry *geometry, struct string_list *names, const char *refs_snapshot, - int show_progress, int write_bitmaps) + int show_progress, int write_bitmaps, + int exclude_disjoint) { struct child_process cmd = CHILD_PROCESS_INIT; struct string_list_item *item; @@ -852,6 +867,9 @@ static int write_midx_included_packs(struct string_list *include, if (refs_snapshot) strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot); + if (exclude_disjoint) + strvec_push(&cmd.args, "--retain-disjoint"); + ret = start_command(&cmd); if (ret) return ret; @@ -895,7 +913,7 @@ static void remove_redundant_bitmaps(struct string_list *include, static int finish_pack_objects_cmd(struct child_process *cmd, struct string_list *names, - int local) + int local, int disjoint) { FILE *out; struct strbuf line = STRBUF_INIT; @@ -913,7 +931,7 @@ static int finish_pack_objects_cmd(struct child_process *cmd, */ if (local) { item = string_list_append(names, line.buf); - item->util = populate_pack_exts(line.buf); + item->util = populate_pack_exts(line.buf, disjoint); } } fclose(out); @@ -970,7 +988,7 @@ static int write_filtered_pack(const struct pack_objects_args *args, fprintf(in, "%s%s.pack\n", caret, item->string); fclose(in); - return finish_pack_objects_cmd(&cmd, names, local); + return finish_pack_objects_cmd(&cmd, names, local, 0); } static int existing_cruft_pack_cmp(const void *va, const void *vb) @@ -1098,7 +1116,7 @@ static int write_cruft_pack(const struct pack_objects_args *args, fprintf(in, "%s.pack\n", item->string); fclose(in); - return finish_pack_objects_cmd(&cmd, names, local); + return finish_pack_objects_cmd(&cmd, names, local, 0); } static const char *find_pack_prefix(const char *packdir, const char *packtmp) @@ -1190,6 +1208,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix) N_("pack prefix to store a pack containing pruned objects")), OPT_STRING(0, "filter-to", &filter_to, N_("dir"), N_("pack prefix to store a pack containing filtered out objects")), + OPT_BOOL(0, "extend-disjoint", &po_args.ignore_disjoint, + N_("add new packs to the set of disjoint ones")), OPT_END() }; @@ -1255,6 +1275,16 @@ int cmd_repack(int argc, const char **argv, const char *prefix) strbuf_release(&path); } + if (po_args.ignore_disjoint) { + if (filter_to) + die(_("options '%s' and '%s' cannot be used together"), + "--filter-to", "--extend-disjoint"); + if (pack_everything && !delete_redundant) + die(_("cannot use '--extend-disjoint' with '%s' but not '-d'"), + pack_everything & LOOSEN_UNREACHABLE ? "-A" : + pack_everything & PACK_CRUFT ? "--cruft" : "-a"); + } + packdir = mkpathdup("%s/pack", get_object_directory()); packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid()); packtmp = mkpathdup("%s/%s", packdir, packtmp_name); @@ -1308,6 +1338,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (pack_everything & ALL_INTO_ONE) { repack_promisor_objects(&po_args, &names); + if (delete_redundant) + strvec_pushf(&cmd.args, "--no-ignore-disjoint"); + if (has_existing_non_kept_packs(&existing) && delete_redundant && !(pack_everything & PACK_CRUFT)) { @@ -1364,7 +1397,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) fclose(in); } - ret = finish_pack_objects_cmd(&cmd, &names, 1); + ret = finish_pack_objects_cmd(&cmd, &names, 1, po_args.ignore_disjoint); if (ret) goto cleanup; @@ -1387,6 +1420,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) cruft_po_args.local = po_args.local; cruft_po_args.quiet = po_args.quiet; + cruft_po_args.ignore_disjoint = 0; ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix, cruft_expiration, &names, @@ -1487,7 +1521,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix) ret = write_midx_included_packs(&include, &geometry, &names, refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL, - show_progress, write_bitmaps > 0); + show_progress, write_bitmaps > 0, + po_args.ignore_disjoint); if (!ret && write_bitmaps) remove_redundant_bitmaps(&include, packdir); diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index d2975e6c93..277f1ff1d7 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -6,6 +6,7 @@ test_description='git repack works correctly' . "${TEST_DIRECTORY}/lib-bitmap.sh" . "${TEST_DIRECTORY}/lib-midx.sh" . "${TEST_DIRECTORY}/lib-terminal.sh" +. "${TEST_DIRECTORY}/lib-disjoint.sh" commit_and_pack () { test_commit "$@" 1>&2 && @@ -525,7 +526,8 @@ test_expect_success '--filter works with --max-pack-size' ' ' objdir=.git/objects -midx=$objdir/pack/multi-pack-index +packdir=$objdir/pack +midx=$packdir/multi-pack-index test_expect_success 'setup for --write-midx tests' ' git init midx && diff --git a/t/t7705-repack-extend-disjoint.sh b/t/t7705-repack-extend-disjoint.sh new file mode 100755 index 0000000000..0c8be1cb3f --- /dev/null +++ b/t/t7705-repack-extend-disjoint.sh @@ -0,0 +1,142 @@ +#!/bin/sh + +test_description='git repack --extend-disjoint works correctly' + +. ./test-lib.sh +. "${TEST_DIRECTORY}/lib-disjoint.sh" + +packdir=.git/objects/pack + +GIT_TEST_MULTI=0 +GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 + +test_expect_success 'repack --extend-disjoint creates new disjoint packs' ' + git init repo && + ( + cd repo && + + test_commit A && + test_commit B && + + A="$(echo A | git pack-objects --revs $packdir/pack)" && + B="$(echo A..B | git pack-objects --revs $packdir/pack)" && + + git prune-packed && + + cat >in <<-EOF && + pack-$A.idx + +pack-$B.idx + EOF + git multi-pack-index write --bitmap --stdin-packs <in && + + test_must_not_be_disjoint "pack-$A.pack" && + test_must_be_disjoint "pack-$B.pack" && + + test_commit C && + + find $packdir -type f -name "*.idx" | sort >packs.before && + git repack --write-midx --write-bitmap-index --extend-disjoint && + find $packdir -type f -name "*.idx" | sort >packs.after && + + comm -13 packs.before packs.after >packs.new && + + test_line_count = 1 packs.new && + + test_must_not_be_disjoint "pack-$A.pack" && + test_must_be_disjoint "pack-$B.pack" && + test_must_be_disjoint "$(basename $(cat packs.new) .idx).pack" + ) +' + +test_expect_success 'repack --extend-disjoint combines existing disjoint packs' ' + ( + cd repo && + + test_commit D && + + git repack -a -d --write-midx --write-bitmap-index --extend-disjoint && + + find $packdir -type f -name "*.pack" >packs && + test_line_count = 1 packs && + + test_must_be_disjoint "$(basename $(cat packs))" + + ) +' + +test_expect_success 'repack --extend-disjoint with --geometric' ' + git init disjoint-geometric && + ( + cd disjoint-geometric && + + test_commit_bulk 8 && + base="$(basename $(ls $packdir/pack-*.idx))" && + echo "+$base" >>in && + + test_commit A && + A="$(echo HEAD^.. | git pack-objects --revs $packdir/pack)" && + test_commit B && + B="$(echo HEAD^.. | git pack-objects --revs $packdir/pack)" && + + git prune-packed && + + cat >>in <<-EOF && + +pack-$A.idx + +pack-$B.idx + EOF + git multi-pack-index write --bitmap --stdin-packs <in && + + test_must_be_disjoint "pack-$A.pack" && + test_must_be_disjoint "pack-$B.pack" && + test_must_be_disjoint "${base%.idx}.pack" && + + test_commit C && + + find $packdir -type f -name "*.pack" | sort >packs.before && + git repack --geometric=2 -d --write-midx --write-bitmap-index --extend-disjoint && + find $packdir -type f -name "*.pack" | sort >packs.after && + + comm -12 packs.before packs.after >packs.unchanged && + comm -23 packs.before packs.after >packs.removed && + comm -13 packs.before packs.after >packs.new && + + cat >expect <<-EOF && + $packdir/${base%.idx}.pack + EOF + test_cmp expect packs.unchanged && + + sort >expect <<-EOF && + $packdir/pack-$A.pack + $packdir/pack-$B.pack + EOF + test_cmp expect packs.removed && + + test_line_count = 1 packs.new && + + test_must_be_disjoint "$(basename $(cat packs.new))" && + test_must_be_disjoint "${base%.idx}.pack" + ) +' + +for flag in "-A" "-a" "--cruft" +do + test_expect_success "repack --extend-disjoint incompatible with $flag without -d" ' + test_must_fail git repack $flag --extend-disjoint \ + --write-midx --write-bitmap-index 2>actual && + cat >expect <<-EOF && + fatal: cannot use $SQ--extend-disjoint$SQ with $SQ$flag$SQ but not $SQ-d$SQ + EOF + test_cmp expect actual + ' +done + +test_expect_success 'repack --extend-disjoint is incompatible with --filter-to' ' + test_must_fail git repack --extend-disjoint --filter-to=dir 2>actual && + + cat >expect <<-EOF && + fatal: options $SQ--filter-to$SQ and $SQ--extend-disjoint$SQ cannot be used together + EOF + test_cmp expect actual +' + +test_done
Now that we can generate packs which are disjoint with respect to the set of currently-disjoint packs, implement a mode of `git repack` which extends the set of disjoint packs with any new (non-cruft) pack(s) generated during the repack. The idea is mostly straightforward, with a couple of gotcha's. The straightforward part is to make sure that any new packs are disjoint with respect to the set of currently disjoint packs which are _not_ being removed from the repository as a result of the repack. If a pack which is currently marked as disjoint is, on the other hand, about to be removed from the repository, it is OK (and expected) that new pack(s) will contain some or all of its objects. Since the pack originally marked as disjoint will be removed, it will necessarily leave the disjoint set, making room for new packs with its same objects to take its place. In other words, the resulting set of disjoint packs will be disjoint with respect to one another. The gotchas mostly have to do with making sure that we do not generate a disjoint pack in the following scenarios: - promisor packs - cruft packs (which may necessarily need to include an object from a disjoint pack in order to freshen it in certain circumstances) - all-into-one repacks without '-d' - `--filter-to`, which conceptually could work with the new `--extend-disjoint` option, but only in limited circumstances Otherwise, we mark which packs were created as disjoint by using a new bit in the `generated_pack_data` struct, and then marking those pack(s) as disjoint accordingly when generating the MIDX. Non-deleted packs which are marked as disjoint are retained as such by passing the equivalent of `--retain-disjoint` when calling the MIDX API to update the MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Documentation/git-repack.txt | 12 +++ builtin/repack.c | 57 +++++++++--- t/t7700-repack.sh | 4 +- t/t7705-repack-extend-disjoint.sh | 142 ++++++++++++++++++++++++++++++ 4 files changed, 203 insertions(+), 12 deletions(-) create mode 100755 t/t7705-repack-extend-disjoint.sh