diff mbox series

builtin/repack.c: prune unreachable objects with `--expire-to`

Message ID 48438876fb42a889110e100a6c42ca84e93aac49.1733011259.git.me@ttaylorr.com (mailing list archive)
State New
Headers show
Series builtin/repack.c: prune unreachable objects with `--expire-to` | expand

Commit Message

Taylor Blau Dec. 1, 2024, 12:01 a.m. UTC
When invoked with '--expire-to', 'git repack' will move unreachable
objects beyond the grace period to a separate repository outside of the
main object store.

Later on, 'git repack' will remove any existing packs which were made
redundant by the 'repack' operation, before then pruning loose objects
which were packed. Ordinarily, unreachable objects which have expired
were already packed via some earlier 'repack' operation, and so are
removed from the main repository in the first step.

But if a repository has unreachable objects which:

  - have an mtime earlier than the --cruft-expiration period,
  - are loose, and
  - have never been packed

Then we'll create a pack containing those objects to store in the
repository specified by the '--expire-to' option, but never prune the
loose copies of those objects from the main repository. That's because
we don't have a pack in the main repository which contains those
objects, so prune_packed_objects() skips over them.

(As an aside, for repositories that have a large number of unreachable
objects which were never packed, and are old enough to be expired, this
can be quite painful. That's because even though we expect the repack to
prune those objects which were GC'd, we don't per the above).

Teach repack to add the repository specified by '--expire-to' as an
alternate of the main object store so that 'prune_packed_objects()' can
"see" the packed copy of those objects, and remove them appropriately.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c        | 15 +++++++++++++++
 t/t7704-repack-cruft.sh | 12 ++++++++++++
 2 files changed, 27 insertions(+)


base-commit: cc01bad4a9f566cf4453c7edd6b433851b0835e2

Comments

Taylor Blau Dec. 1, 2024, 4:24 a.m. UTC | #1
On Sat, Nov 30, 2024 at 07:01:03PM -0500, Taylor Blau wrote:
> diff --git a/builtin/repack.c b/builtin/repack.c
> index d6bb37e84ae..57cab72dcf5 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -1553,6 +1553,21 @@ int cmd_repack(int argc,
>  							&existing);
>  		if (show_progress)
>  			opts |= PRUNE_PACKED_VERBOSE;
> +
> +		if (expire_to && *expire_to) {
> +			char *alt = dirname(xstrdup(expire_to));
> +			size_t len = strlen(alt);
> +
> +			if (strip_suffix(alt, "pack", &len) &&
> +			    is_dir_sep(alt[len - 1])) {

Argh. This clearly needs a bounds check to ensure that 'len >= 1'.

I suspect that passing "--expire-to=pack/xyz" would segfault.

Thanks,
Taylor
Jeff King Dec. 1, 2024, 9:34 p.m. UTC | #2
On Sat, Nov 30, 2024 at 07:01:03PM -0500, Taylor Blau wrote:

> When invoked with '--expire-to', 'git repack' will move unreachable
> objects beyond the grace period to a separate repository outside of the
> main object store.
> 
> Later on, 'git repack' will remove any existing packs which were made
> redundant by the 'repack' operation, before then pruning loose objects
> which were packed. Ordinarily, unreachable objects which have expired
> were already packed via some earlier 'repack' operation, and so are
> removed from the main repository in the first step.
> 
> But if a repository has unreachable objects which:
> 
>   - have an mtime earlier than the --cruft-expiration period,
>   - are loose, and
>   - have never been packed
> 
> Then we'll create a pack containing those objects to store in the
> repository specified by the '--expire-to' option, but never prune the
> loose copies of those objects from the main repository. That's because
> we don't have a pack in the main repository which contains those
> objects, so prune_packed_objects() skips over them.
> 
> (As an aside, for repositories that have a large number of unreachable
> objects which were never packed, and are old enough to be expired, this
> can be quite painful. That's because even though we expect the repack to
> prune those objects which were GC'd, we don't per the above).
> 
> Teach repack to add the repository specified by '--expire-to' as an
> alternate of the main object store so that 'prune_packed_objects()' can
> "see" the packed copy of those objects, and remove them appropriately.

I think the solution you came up with makes sense in the confines of
what "repack --cruft" and "--expire-to" do. But I can't help but feel
that we may have taken a wrong (or at least somewhat confusing) turn
before this.

That is, why are we expecting "repack" to prune those loose objects at
all? In a traditional "repack -ad" they would not be removed, because
they are not packed and not reachable. It is the responsibility of a
follow-up "git prune" to get rid of them, and that is run automatically
by "git gc".

When we added in cruft packs, now those objects are "packed" in the
sense that we are storing them in a cruft pack instead of loose. But
they are not really "packed" in the sense that the original "repack"
would do. They're not in a pack we intend to keep, and the storage in a
cruft pack is mostly an incidental optimization over the loose format.
But "repack" started deleting them (because it knew they were moved into
a cruft pack), rather than waiting for "git prune" to do so. Which is a
little weird, but OK.

But then --expire-to adds extra confusion because we are _really_ not
packing them in the repository now. They are going to some magical
out-of-repo spot. Which yes, happens to be a pack. But I think one could
argue that they are not packed in the repo at that point, and it the
responsibility of "git prune" to get rid of them.


Now all of that is fairly philosophical. Is there a real-world benefit
to trying to retain this more purist view-point? I don't know. Certainly
I can think of some downsides, one of which is that running a follow-up
git-prune requires computing the full reachability again. That's both
expensive and racy with respect to what we saved in the cruft pack.

So I think there's some argument there that the pruning _has_ to be part
of the repack process for atomicity, and therefore --expire-to has to do
the same.

> @@ -1553,6 +1553,21 @@ int cmd_repack(int argc,
>  							&existing);
>  		if (show_progress)
>  			opts |= PRUNE_PACKED_VERBOSE;
> +
> +		if (expire_to && *expire_to) {
> +			char *alt = dirname(xstrdup(expire_to));
> +			size_t len = strlen(alt);
> +
> +			if (strip_suffix(alt, "pack", &len) &&
> +			    is_dir_sep(alt[len - 1])) {
> +				alt[len - 1] = '\0';
> +
> +				add_to_alternates_memory(alt);
> +				reprepare_packed_git(the_repository);
> +			}
> +
> +			free(alt);
> +		}

OK, so we are adding the containing directory as an alternate. That
gives me two concerns:

  1. The expire-to string is something like "path/to/objects/pack/pack",
     and we'll have created "path/to/objects/pack/pack-<hash>.pack"
     Using dirname() strips that down to "path/to/objects/pack". OK. And
     then we manually strip "pack/" off the end, which we have to do to
     get the "base" objects/ directory.

     But what if the path given by the user via --expire-to doesn't look
     like an object directory? I.e., does not end in "pack/"? Then this
     feature would not work at all.

     Should we be mentioning this in the git-repack docs?

     As an aside, I think the current --expire-to docs are misleading.
     They say:

       --expire-to=<dir>
	   Write a cruft pack containing pruned objects (if any) to the
	   directory <dir>. [...]

     But that isn't right. It is not a <dir> but a <base-name> similar
     to the one that pack-objects takes. If you do --expire-to=some/dir,
     then you'll get some/dir-<hash>.pack.

  2. Since we're adding the whole directory, that reprepare_packed_git()
     will also find any existing packs that were sitting in the
     --expire-to directory. If you are repeatedly stuffing cruft packs
     into the same directory every time you repack, then you'll see
     those older packs, and we'll prune objects that they mention. I'm
     not sure it's too bad from a correctness perspective to allow that
     (and I think they'd have ended up in the most recent cruft pack
     anyway). But I wonder if the expense of checking those packs
     eventually adds up.

One thing that could perhaps deal with both issues is to add the single
cruft pack itself, rather than the surrounding directory. I.e., would it
work to add_packed_git() and install_packed_git() each cruft pack
(should just be one, but I think in theory there may be multiple if
maxPackSize is enabled)?

It's maybe a little weird to have an entry in packed_git that doesn't
come from an object database that we're using. But I don't think there
are any problems with that (I'd probably pass "0" for the "local" flag
;) ).

The only other gotcha I see is that you probably need to
reprepare_packed_git() afterwards to make sure the packed_git_mru list
is refreshed.

-Peff
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index d6bb37e84ae..57cab72dcf5 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -1553,6 +1553,21 @@  int cmd_repack(int argc,
 							&existing);
 		if (show_progress)
 			opts |= PRUNE_PACKED_VERBOSE;
+
+		if (expire_to && *expire_to) {
+			char *alt = dirname(xstrdup(expire_to));
+			size_t len = strlen(alt);
+
+			if (strip_suffix(alt, "pack", &len) &&
+			    is_dir_sep(alt[len - 1])) {
+				alt[len - 1] = '\0';
+
+				add_to_alternates_memory(alt);
+				reprepare_packed_git(the_repository);
+			}
+
+			free(alt);
+		}
 		prune_packed_objects(opts);
 
 		if (!keep_unreachable &&
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 5db9f4e10f7..ee1ffcdae3c 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -30,6 +30,12 @@  test_expect_success '--expire-to stores pruned objects (now)' '
 		git branch -D cruft &&
 		git reflog expire --all --expire=all &&
 
+		for obj in $(cat moved.want)
+		do
+			path="$objdir/$(test_oid_to_path $obj)" &&
+			test_path_is_file "$path" || return 1
+		done &&
+
 		git init --bare expired.git &&
 		git repack -d \
 			--cruft --cruft-expiration="now" \
@@ -38,6 +44,12 @@  test_expect_success '--expire-to stores pruned objects (now)' '
 		expired="$(ls expired.git/objects/pack/pack-*.idx)" &&
 		test_path_is_file "${expired%.idx}.mtimes" &&
 
+		for obj in $(cat moved.want)
+		do
+			path="$objdir/$(test_oid_to_path $obj)" &&
+			test_path_is_missing "$path" || return 1
+		done &&
+
 		# Since the `--cruft-expiration` is "now", the effective
 		# behavior is to move _all_ unreachable objects out to
 		# the location in `--expire-to`.