Message ID | 48438876fb42a889110e100a6c42ca84e93aac49.1733011259.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | builtin/repack.c: prune unreachable objects with `--expire-to` | expand |
On Sat, Nov 30, 2024 at 07:01:03PM -0500, Taylor Blau wrote: > diff --git a/builtin/repack.c b/builtin/repack.c > index d6bb37e84ae..57cab72dcf5 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -1553,6 +1553,21 @@ int cmd_repack(int argc, > &existing); > if (show_progress) > opts |= PRUNE_PACKED_VERBOSE; > + > + if (expire_to && *expire_to) { > + char *alt = dirname(xstrdup(expire_to)); > + size_t len = strlen(alt); > + > + if (strip_suffix(alt, "pack", &len) && > + is_dir_sep(alt[len - 1])) { Argh. This clearly needs a bounds check to ensure that 'len >= 1'. I suspect that passing "--expire-to=pack/xyz" would segfault. Thanks, Taylor
On Sat, Nov 30, 2024 at 07:01:03PM -0500, Taylor Blau wrote: > When invoked with '--expire-to', 'git repack' will move unreachable > objects beyond the grace period to a separate repository outside of the > main object store. > > Later on, 'git repack' will remove any existing packs which were made > redundant by the 'repack' operation, before then pruning loose objects > which were packed. Ordinarily, unreachable objects which have expired > were already packed via some earlier 'repack' operation, and so are > removed from the main repository in the first step. > > But if a repository has unreachable objects which: > > - have an mtime earlier than the --cruft-expiration period, > - are loose, and > - have never been packed > > Then we'll create a pack containing those objects to store in the > repository specified by the '--expire-to' option, but never prune the > loose copies of those objects from the main repository. That's because > we don't have a pack in the main repository which contains those > objects, so prune_packed_objects() skips over them. > > (As an aside, for repositories that have a large number of unreachable > objects which were never packed, and are old enough to be expired, this > can be quite painful. That's because even though we expect the repack to > prune those objects which were GC'd, we don't per the above). > > Teach repack to add the repository specified by '--expire-to' as an > alternate of the main object store so that 'prune_packed_objects()' can > "see" the packed copy of those objects, and remove them appropriately. I think the solution you came up with makes sense in the confines of what "repack --cruft" and "--expire-to" do. But I can't help but feel that we may have taken a wrong (or at least somewhat confusing) turn before this. That is, why are we expecting "repack" to prune those loose objects at all? In a traditional "repack -ad" they would not be removed, because they are not packed and not reachable. It is the responsibility of a follow-up "git prune" to get rid of them, and that is run automatically by "git gc". When we added in cruft packs, now those objects are "packed" in the sense that we are storing them in a cruft pack instead of loose. But they are not really "packed" in the sense that the original "repack" would do. They're not in a pack we intend to keep, and the storage in a cruft pack is mostly an incidental optimization over the loose format. But "repack" started deleting them (because it knew they were moved into a cruft pack), rather than waiting for "git prune" to do so. Which is a little weird, but OK. But then --expire-to adds extra confusion because we are _really_ not packing them in the repository now. They are going to some magical out-of-repo spot. Which yes, happens to be a pack. But I think one could argue that they are not packed in the repo at that point, and it the responsibility of "git prune" to get rid of them. Now all of that is fairly philosophical. Is there a real-world benefit to trying to retain this more purist view-point? I don't know. Certainly I can think of some downsides, one of which is that running a follow-up git-prune requires computing the full reachability again. That's both expensive and racy with respect to what we saved in the cruft pack. So I think there's some argument there that the pruning _has_ to be part of the repack process for atomicity, and therefore --expire-to has to do the same. > @@ -1553,6 +1553,21 @@ int cmd_repack(int argc, > &existing); > if (show_progress) > opts |= PRUNE_PACKED_VERBOSE; > + > + if (expire_to && *expire_to) { > + char *alt = dirname(xstrdup(expire_to)); > + size_t len = strlen(alt); > + > + if (strip_suffix(alt, "pack", &len) && > + is_dir_sep(alt[len - 1])) { > + alt[len - 1] = '\0'; > + > + add_to_alternates_memory(alt); > + reprepare_packed_git(the_repository); > + } > + > + free(alt); > + } OK, so we are adding the containing directory as an alternate. That gives me two concerns: 1. The expire-to string is something like "path/to/objects/pack/pack", and we'll have created "path/to/objects/pack/pack-<hash>.pack" Using dirname() strips that down to "path/to/objects/pack". OK. And then we manually strip "pack/" off the end, which we have to do to get the "base" objects/ directory. But what if the path given by the user via --expire-to doesn't look like an object directory? I.e., does not end in "pack/"? Then this feature would not work at all. Should we be mentioning this in the git-repack docs? As an aside, I think the current --expire-to docs are misleading. They say: --expire-to=<dir> Write a cruft pack containing pruned objects (if any) to the directory <dir>. [...] But that isn't right. It is not a <dir> but a <base-name> similar to the one that pack-objects takes. If you do --expire-to=some/dir, then you'll get some/dir-<hash>.pack. 2. Since we're adding the whole directory, that reprepare_packed_git() will also find any existing packs that were sitting in the --expire-to directory. If you are repeatedly stuffing cruft packs into the same directory every time you repack, then you'll see those older packs, and we'll prune objects that they mention. I'm not sure it's too bad from a correctness perspective to allow that (and I think they'd have ended up in the most recent cruft pack anyway). But I wonder if the expense of checking those packs eventually adds up. One thing that could perhaps deal with both issues is to add the single cruft pack itself, rather than the surrounding directory. I.e., would it work to add_packed_git() and install_packed_git() each cruft pack (should just be one, but I think in theory there may be multiple if maxPackSize is enabled)? It's maybe a little weird to have an entry in packed_git that doesn't come from an object database that we're using. But I don't think there are any problems with that (I'd probably pass "0" for the "local" flag ;) ). The only other gotcha I see is that you probably need to reprepare_packed_git() afterwards to make sure the packed_git_mru list is refreshed. -Peff
diff --git a/builtin/repack.c b/builtin/repack.c index d6bb37e84ae..57cab72dcf5 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -1553,6 +1553,21 @@ int cmd_repack(int argc, &existing); if (show_progress) opts |= PRUNE_PACKED_VERBOSE; + + if (expire_to && *expire_to) { + char *alt = dirname(xstrdup(expire_to)); + size_t len = strlen(alt); + + if (strip_suffix(alt, "pack", &len) && + is_dir_sep(alt[len - 1])) { + alt[len - 1] = '\0'; + + add_to_alternates_memory(alt); + reprepare_packed_git(the_repository); + } + + free(alt); + } prune_packed_objects(opts); if (!keep_unreachable && diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh index 5db9f4e10f7..ee1ffcdae3c 100755 --- a/t/t7704-repack-cruft.sh +++ b/t/t7704-repack-cruft.sh @@ -30,6 +30,12 @@ test_expect_success '--expire-to stores pruned objects (now)' ' git branch -D cruft && git reflog expire --all --expire=all && + for obj in $(cat moved.want) + do + path="$objdir/$(test_oid_to_path $obj)" && + test_path_is_file "$path" || return 1 + done && + git init --bare expired.git && git repack -d \ --cruft --cruft-expiration="now" \ @@ -38,6 +44,12 @@ test_expect_success '--expire-to stores pruned objects (now)' ' expired="$(ls expired.git/objects/pack/pack-*.idx)" && test_path_is_file "${expired%.idx}.mtimes" && + for obj in $(cat moved.want) + do + path="$objdir/$(test_oid_to_path $obj)" && + test_path_is_missing "$path" || return 1 + done && + # Since the `--cruft-expiration` is "now", the effective # behavior is to move _all_ unreachable objects out to # the location in `--expire-to`.
When invoked with '--expire-to', 'git repack' will move unreachable objects beyond the grace period to a separate repository outside of the main object store. Later on, 'git repack' will remove any existing packs which were made redundant by the 'repack' operation, before then pruning loose objects which were packed. Ordinarily, unreachable objects which have expired were already packed via some earlier 'repack' operation, and so are removed from the main repository in the first step. But if a repository has unreachable objects which: - have an mtime earlier than the --cruft-expiration period, - are loose, and - have never been packed Then we'll create a pack containing those objects to store in the repository specified by the '--expire-to' option, but never prune the loose copies of those objects from the main repository. That's because we don't have a pack in the main repository which contains those objects, so prune_packed_objects() skips over them. (As an aside, for repositories that have a large number of unreachable objects which were never packed, and are old enough to be expired, this can be quite painful. That's because even though we expect the repack to prune those objects which were GC'd, we don't per the above). Teach repack to add the repository specified by '--expire-to' as an alternate of the main object store so that 'prune_packed_objects()' can "see" the packed copy of those objects, and remove them appropriately. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- builtin/repack.c | 15 +++++++++++++++ t/t7704-repack-cruft.sh | 12 ++++++++++++ 2 files changed, 27 insertions(+) base-commit: cc01bad4a9f566cf4453c7edd6b433851b0835e2