diff mbox series

[v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`

Message ID 27a7f16aab35b5cac391d9831aadb0f2e2146313.1683151485.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series [v3] builtin/pack-objects.c: introduce `pack.extraCruftTips` | expand

Commit Message

Taylor Blau May 3, 2023, 10:05 p.m. UTC
This patch introduces a new multi-valued configuration option,
`pack.extraCruftTips` as an alternative means to mark certain objects in
the cruft pack as rescued, even if they would otherwise be pruned.

Traditionally, cruft packs are generated in one of two ways:

  - When not pruning, all packed and loose objects which do not appear
    in the any of the packs which wil remain (as indicated over stdin,
    with `--cruft`) are packed separately into a cruft pack.

  - When pruning, the above process happens, with two additional tweaks.
    Instead of adding every object into the cruft pack, only objects
    which have mtime that is within the grace period are kept. In
    addition, a "rescuing" traversal is performed over the remaining
    cruft objects to keep around cruft objects which would have aged out
    of the repository, but are reachable from other cruft objects which
    have not yet aged out.

This works well for repositories which have all of their "interesting"
objects worth keeping around reachable via at least one reference.

However, there is no option to be able to keep around certain objects
that have otherwise aged out of the grace period. The only way to retain
those objects is:

  - to point a reference at them, which may be undesirable or
    infeasible,
  - to track them via the reflog, which may be undesirable since the
    reflog's lifetime is limited to that of the reference it's tracking
    (and callers may want to keep those unreachable objects around for
    longer)
  - to extend the grace period, which may keep around other objects that
    the caller *does* want to discard,
  - or, to force the caller to construct the pack of objects they want
    to keep themselves, and then mark the pack as kept by adding a
    ".keep" file.

This patch introduces a new configuration, `pack.extraCruftTips` which
allows the caller to specify a program (or set of programs) whose output
is treated as a set of objects to treat as "kept", even if they are
unreachable and have aged out of the retention period.

The implementation is straightforward, and pretends as if any objects
listed by one of the `pack.extraCruftTips` hooks is recent-enough to
warrant rescuing (even for non-pruning GCs).

We then add those as tips to another reachability traversal (along with
any recent objects, if pruning), marking every object along the way as
cruft (and thus retaining it in the cruft pack).

A potential alternative here is to introduce a new mode to alter the
contents of the reachable pack instead of the cruft one. One could
imagine a new option to `pack-objects`, say `--extra-tips` that does the
same thing as above, adding the visited set of objects along the
traversal to the pack.

But this has the unfortunate side-effect of altering the reachability
closure of that pack. If parts of the unreachable object graph mentioned
by one or more of the "extra tips" programs is not closed, then the
resulting pack won't be either. This makes it impossible in the general
case to write out reachability bitmaps for that pack, since closure is a
requirement there.

Instead, keep these unreachable objects in the cruft pack instead, to
ensure that we can continue to have a pack containing just reachable
objects, which is always safe to write a bitmap over.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Here is a reworked version of the pack.extraCruftTips feature to reuse
parts of the "rescue old/unreachable, but reachable-from-recent" traversal.

It is somewhat more complicated than the original, and I would
definitely appreciate a close review from a second (or third) set of
eyes ;-).

Range-diff against v2:
1:  da1711b13b ! 1:  27a7f16aab builtin/pack-objects.c: introduce `pack.extraCruftTips`
    @@ Commit message
         is treated as a set of objects to treat as "kept", even if they are
         unreachable and have aged out of the retention period.

    -    The implementation is straightforward: after determining the set of
    -    objects to pack into the cruft pack (either by calling
    -    `enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
    -    we figure out if there are any program(s) we need to consult in order to
    -    determine if there are any objects which we need to "rescue". We then
    -    add those as tips to another reachability traversal, marking every
    -    object along the way as cruft (and thus retaining it in the cruft pack).
    +    The implementation is straightforward, and pretends as if any objects
    +    listed by one of the `pack.extraCruftTips` hooks is recent-enough to
    +    warrant rescuing (even for non-pruning GCs).
    +
    +    We then add those as tips to another reachability traversal (along with
    +    any recent objects, if pruning), marking every object along the way as
    +    cruft (and thus retaining it in the cruft pack).

         A potential alternative here is to introduce a new mode to alter the
         contents of the reachable pack instead of the cruft one. One could
    @@ builtin/pack-objects.c

      /*
       * Objects we are going to pack are collected in the `to_pack` structure.
    -@@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
    +@@ builtin/pack-objects.c: static void enumerate_cruft_objects(void)
      	stop_progress(&progress_state);
      }

    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +{
     +	struct child_process cmd = CHILD_PROCESS_INIT;
     +	struct strbuf buf = STRBUF_INIT;
    -+	FILE *out = NULL;
    ++	FILE *out;
     +	int ret = 0;
     +
     +	cmd.use_shell = 1;
     +	cmd.out = -1;
     +
     +	strvec_push(&cmd.args, args);
    -+	strvec_pushv(&cmd.env, (const char **)local_repo_env);
     +
    -+	if (start_command(&cmd)) {
    -+		ret = -1;
    -+		goto done;
    -+	}
    ++	if (start_command(&cmd))
    ++		return -1;
     +
     +	out = xfdopen(cmd.out, "r");
    -+	while (strbuf_getline_lf(&buf, out) != EOF) {
    ++	while (strbuf_getline(&buf, out) != EOF) {
     +		struct object_id oid;
     +		struct object *obj;
    ++		int type;
     +		const char *rest;
     +
     +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +			goto done;
     +		}
     +
    -+		obj = parse_object(the_repository, &oid);
    ++		type = oid_object_info(the_repository, &oid, NULL);
    ++		if (type < 0)
    ++			continue;
    ++
    ++		obj = lookup_object_by_type(the_repository, &oid, type);
     +		if (!obj)
     +			continue;
     +
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +	return ret;
     +}
     +
    -+static int enumerate_extra_cruft_tips(void)
    ++static void add_extra_cruft_tips_to_traversal(struct rev_info *revs)
     +{
    -+	struct rev_info revs;
     +	const struct string_list *programs;
     +	int ret = 0;
     +	size_t i;
     +
     +	if (git_config_get_string_multi("pack.extracrufttips", &programs))
    -+		return ret;
    -+
    -+	repo_init_revisions(the_repository, &revs, NULL);
    -+
    -+	revs.tag_objects = 1;
    -+	revs.tree_objects = 1;
    -+	revs.blob_objects = 1;
    -+
    -+	revs.include_check = cruft_include_check;
    -+	revs.include_check_obj = cruft_include_check_obj;
    -+
    -+	revs.ignore_missing_links = 1;
    ++		return;
     +
     +	if (progress)
     +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
     +	nr_seen = 0;
     +	for (i = 0; i < programs->nr; i++) {
    -+		ret = enumerate_extra_cruft_tips_1(&revs,
    ++		ret = enumerate_extra_cruft_tips_1(revs,
     +						   programs->items[i].string);
     +		if (ret)
     +			break;
    @@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct
     +
     +	if (ret)
     +		die(_("unable to enumerate additional cruft tips"));
    -+
    -+	if (prepare_revision_walk(&revs))
    -+		die(_("revision walk setup failed"));
    -+	if (progress)
    -+		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
    -+	nr_seen = 0;
    -+
    -+	/*
    -+	 * Retain all objects reachable from extra tips via
    -+	 * show_cruft_commit(), and show_cruft_object(), regardless of
    -+	 * their mtime.
    -+	 */
    -+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
    -+	stop_progress(&progress_state);
    -+
    -+	return ret;
     +}
     +
    - static void read_cruft_objects(void)
    + static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
      {
    - 	struct strbuf buf = STRBUF_INIT;
    + 	struct packed_git *p;
    + 	struct rev_info revs;
    +-	int ret;
    ++	int ret = 0;
    +
    + 	repo_init_revisions(the_repository, &revs, NULL);
    +
    +@@ builtin/pack-objects.c: static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
    +
    + 	revs.ignore_missing_links = 1;
    +
    +-	if (progress)
    +-		progress_state = start_progress(_("Enumerating cruft objects"), 0);
    +-	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
    +-						     set_cruft_mtime, 1);
    +-	stop_progress(&progress_state);
    ++	if (cruft_expiration) {
    ++		if (progress)
    ++			progress_state = start_progress(_("Enumerating cruft objects"), 0);
    ++		ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
    ++							     set_cruft_mtime, 1);
    ++		stop_progress(&progress_state);
    ++	}
    ++
    ++	add_extra_cruft_tips_to_traversal(&revs);
    +
    + 	if (ret)
    + 		die(_("unable to add cruft objects"));
     @@ builtin/pack-objects.c: static void read_cruft_objects(void)
    - 	else
    +
    + 	if (cruft_expiration)
    + 		enumerate_and_traverse_cruft_objects(&fresh_packs);
    +-	else
    ++	else {
    ++		/*
    ++		 * call both enumerate_cruft_objects() and
    ++		 * enumerate_and_traverse_cruft_objects(), since:
    ++		 *
    ++		 *   - the former is required to implement non-pruning GCs
    ++		 *   - the latter is a noop for "cruft_expiration == 0", but
    ++		 *     picks up any additional tips mentioned via
    ++		 *     `pack.extraCruftTips`.
    ++		 */
      		enumerate_cruft_objects();
    ++		enumerate_and_traverse_cruft_objects(&fresh_packs);
    ++	}

    -+	enumerate_extra_cruft_tips();
    -+
      	strbuf_release(&buf);
      	string_list_clear(&discard_packs, 0);
    - 	string_list_clear(&fresh_packs, 0);

      ## t/t5329-pack-objects-cruft.sh ##
     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft objects are freshend via loose' '

 Documentation/config/pack.txt |  11 +++
 builtin/pack-objects.c        | 104 +++++++++++++++++++--
 t/t5329-pack-objects-cruft.sh | 171 ++++++++++++++++++++++++++++++++++
 3 files changed, 279 insertions(+), 7 deletions(-)

--
2.40.1.477.g73ad7b90e1.dirty

Comments

Junio C Hamano May 3, 2023, 11:18 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

>   - When not pruning, all packed and loose objects which do not appear
>     in the any of the packs which wil remain (as indicated over stdin,
>     with `--cruft`) are packed separately into a cruft pack.

"wil remain" -> "will remain"

But more importantly, it is not clear to me whose "stdin" we are
talking about and what is sent "over stdin" here.  Somebody pipes
the output of "printf %s --cruft" to "pack-objects"???

> However, there is no option to be able to keep around certain objects
> that have otherwise aged out of the grace period. The only way to retain
> those objects is:
>
>   - to point a reference at them, which may be undesirable or
>     infeasible,
>   - to track them via the reflog, which may be undesirable since the
>     reflog's lifetime is limited to that of the reference it's tracking
>     (and callers may want to keep those unreachable objects around for
>     longer)
>   - to extend the grace period, which may keep around other objects that
>     the caller *does* want to discard,
>   - or, to force the caller to construct the pack of objects they want
>     to keep themselves, and then mark the pack as kept by adding a
>     ".keep" file.

If you do not want to point objects you want to keep with a ref,
then you are making a problem for yourself, as gc decisions are
designed around the notion that anything worthwhile to keep are
supposed to be reachable from some anchoring points (like the refs,
the reflogs, the index and its various extensions, etc.).

> Here is a reworked version of the pack.extraCruftTips feature to reuse
> parts of the "rescue old/unreachable, but reachable-from-recent" traversal.

That sounds like a sensible way to go, but as I said earlier, it
still is not clear, at least to me, why the pack.extraCruftTips
program can compute them cheaply enough when you cannot have these
tips pointed at with refs in some reserved namespaces of your own.

> +pack.extraCruftTips::

Is this consistent with the way we name a variable whose value
specifies a command to run?  At first, I thought this was a
multi-valued configuration variable, each of the value is an object
name.  extraCruftTipsCmd?  extraCruftTipsHook?

> +	When generating a cruft pack, use the shell to execute the
> +	specified command(s), and interpret their output as additional
> +	tips of objects to keep in the cruft pack, regardless of their

What is a "tip of an object"?  The first byte ;-)?

A "tip of history" would only imply commit objects, but presumably
you would want to specify a tree and protect all the blobs and trees
it recursively contains, so that is not a good name for it.

> +	age.
> ++
> +Output must contain exactly one hex object ID per line, and nothing
> +else. Objects which cannot be found in the repository are ignored.

Are objects that cannot be found the same as objects that are
missing?  How do we determine if a given object name refers to an
object which cannot be found?  Would the lazy fetching from promisor
remotes come into play?

> +Multiple hooks are supported, but all must exit successfully, else no
> +cruft pack will be generated.
> +

Now, we are told this variable refers to "hooks".  If that is the
name we want to call them, we'd better use the term from the
beginning.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a5b466839b..6f6e7872cd 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -44,6 +44,7 @@
>  #include "pack-mtimes.h"
>  #include "parse-options.h"
>  #include "wrapper.h"
> +#include "run-command.h"
>
>  /*
>   * Objects we are going to pack are collected in the `to_pack` structure.
> @@ -3543,11 +3544,85 @@ static void enumerate_cruft_objects(void)
>  	stop_progress(&progress_state);
>  }
>
> +static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)

Let's lose the _1 suffix unless you have
enumerate_extra_cruft_tips() that farms out bulk of its inner
workings to this one.

> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	FILE *out;
> +	int ret = 0;
> +
> +	cmd.use_shell = 1;
> +	cmd.out = -1;
> +
> +	strvec_push(&cmd.args, args);
> +
> +	if (start_command(&cmd))
> +		return -1;
> +
> +	out = xfdopen(cmd.out, "r");

It somehow feels funny to be reading from "out", which is usually
where we write things to.

> +	while (strbuf_getline(&buf, out) != EOF) {

Ditto.

> +		struct object_id oid;
> +		struct object *obj;
> +		int type;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}
> +
> +		type = oid_object_info(the_repository, &oid, NULL);
> +		if (type < 0)
> +			continue;
> +
> +		obj = lookup_object_by_type(the_repository, &oid, type);
> +		if (!obj)
> +			continue;

Hmph, we may want to have an interface that lets us avoid looking up
the same oid twice in the same set of tables.  Given an object
unseen so far, oid_object_info() should have done most of the work
necessary for lookup_object_by_type() to get to and start parsing
the data of the object in the good case (i.e. object exists and in a
pack---just we haven't needed it yet), but in the above sequence
there is not enough information passed between two calls to take
advantage of it.

> +		display_progress(progress_state, ++nr_seen);
> +		add_pending_object(revs, obj, "");
> +	}
> +
> +	ret = finish_command(&cmd);
> +
> +done:
> +	if (out)
> +		fclose(out);
> +	strbuf_release(&buf);
> +	child_process_clear(&cmd);
> +
> +	return ret;
> +}
> +
> +static void add_extra_cruft_tips_to_traversal(struct rev_info *revs)
> +{
> +	const struct string_list *programs;
> +	int ret = 0;
> +	size_t i;
> +
> +	if (git_config_get_string_multi("pack.extracrufttips", &programs))
> +		return;
> +
> +	if (progress)
> +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
> +	nr_seen = 0;
> +	for (i = 0; i < programs->nr; i++) {
> +		ret = enumerate_extra_cruft_tips_1(revs,
> +						   programs->items[i].string);
> +		if (ret)
> +			break;
> +	}
> +	stop_progress(&progress_state);

We iterate over the value list internal to the repo_config, but we
are only borrowing them so there is no need to do any clean-up upon
leaving.  OK.

> +	if (ret)
> +		die(_("unable to enumerate additional cruft tips"));
> +}
> +
>  static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
>  {
>  	struct packed_git *p;
>  	struct rev_info revs;
> -	int ret;
> +	int ret = 0;
>
>  	repo_init_revisions(the_repository, &revs, NULL);
>
> @@ -3560,11 +3635,15 @@ static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs
>
>  	revs.ignore_missing_links = 1;
>
> -	if (progress)
> -		progress_state = start_progress(_("Enumerating cruft objects"), 0);
> -	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
> -						     set_cruft_mtime, 1);
> -	stop_progress(&progress_state);
> +	if (cruft_expiration) {
> +		if (progress)
> +			progress_state = start_progress(_("Enumerating cruft objects"), 0);
> +		ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
> +							     set_cruft_mtime, 1);
> +		stop_progress(&progress_state);
> +	}
> +
> +	add_extra_cruft_tips_to_traversal(&revs);

Hmph, is this hunk doing two unrelated things at the same time?
What was the reason why we didn't have to be conditional on
cruft_expiration in the original code, and why does it start
mattering when we teach the code to call out to hooks?

> @@ -3639,8 +3718,19 @@ static void read_cruft_objects(void)
>
>  	if (cruft_expiration)
>  		enumerate_and_traverse_cruft_objects(&fresh_packs);
> -	else
> +	else {
> +		/*
> +		 * call both enumerate_cruft_objects() and
> +		 * enumerate_and_traverse_cruft_objects(), since:
> +		 *
> +		 *   - the former is required to implement non-pruning GCs
> +		 *   - the latter is a noop for "cruft_expiration == 0", but
> +		 *     picks up any additional tips mentioned via
> +		 *     `pack.extraCruftTips`.
> +		 */
>  		enumerate_cruft_objects();
> +		enumerate_and_traverse_cruft_objects(&fresh_packs);
> +	}
Junio C Hamano May 3, 2023, 11:42 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>> +Multiple hooks are supported, but all must exit successfully, else no
>> +cruft pack will be generated.
>> +
>
> Now, we are told this variable refers to "hooks".  If that is the
> name we want to call them, we'd better use the term from the
> beginning.

I was updating the "What's cooking" draft and here is my tentative
summary of the topic.

    * tb/pack-extra-cruft-tips (2023-05-03) 1 commit
     - builtin/pack-objects.c: introduce `pack.extraCruftTips`

     "git pack-objects" learned to invoke a new hook program that
     enumerates extra objects to be used as anchoring points to keep
     otherwise unreachable objects in cruft packs.

     source: <27a7f16aab35b5cac391d9831aadb0f2e2146313.1683151485.git.me@ttaylorr.com>

I think I earlier saw review comments on a side thread that
mentioned "repack -A"; when I read it I didn't quite get the point
of the discussion, but after summarizing the gist of the topic as
above, I _think_ I tend to agree.  This should not be about "should
we have these extra objects in cruft pack?" but is about "should we
keep these extra objects alive and exempt from pruning".  What these
extra tips tell us is that the part of the reachability graph rooted
at these extra objects should be treated as if they are reachable.

Storing them inside cruft packs may be a reasonable choice to make
today, in the sense that among various object storage mechansim, the
cruft pack mechanism may be the best fit in today's system, but it
does not have to stay that way.  Naming the variable to specify the
hooks with name "cruft" in them would make it hard to explain once
we find an even better storage mechanism to store such a "not really
used but want to keep" objects.
Taylor Blau May 3, 2023, 11:48 p.m. UTC | #3
On Wed, May 03, 2023 at 04:42:41PM -0700, Junio C Hamano wrote:
> Storing them inside cruft packs may be a reasonable choice to make
> today, in the sense that among various object storage mechansim, the
> cruft pack mechanism may be the best fit in today's system, but it
> does not have to stay that way.  Naming the variable to specify the
> hooks with name "cruft" in them would make it hard to explain once
> we find an even better storage mechanism to store such a "not really
> used but want to keep" objects.

I dunno. I thought about this too, and I get your argument, but I am not
convinced that a future mechanism would lend itself well to keeping
around additional sets of objects in the same way cruft packs do. In
that case, we would prefer having called this `pack.extraCruftTips` and
relegating it to the cruft pack system.

We could make this more generic, and extend support to the legacy
prune-via-loose mechanism. But like I said to Peff, I have a hard time
imagining anybody using it.

So, I'm torn. I see what you're saying, but I think I still tend to fall
on the side of leaving it as-is.

Thanks,
Taylor
Taylor Blau May 3, 2023, 11:50 p.m. UTC | #4
On Wed, May 03, 2023 at 04:42:41PM -0700, Junio C Hamano wrote:
> Storing them inside cruft packs may be a reasonable choice to make
> today, in the sense that among various object storage mechansim, the
> cruft pack mechanism may be the best fit in today's system, but it
> does not have to stay that way.  Naming the variable to specify the
> hooks with name "cruft" in them would make it hard to explain once
> we find an even better storage mechanism to store such a "not really
> used but want to keep" objects.

I dunno. I thought about this too, and I get your argument, but I am not
convinced that a future mechanism would lend itself well to keeping
around additional sets of objects in the same way cruft packs do. In
that case, we would prefer having called this `pack.extraCruftTips` and
relegating it to the cruft pack system.

We could make this more generic, and extend support to the legacy
prune-via-loose mechanism. But like I said to Peff, I have a hard time
imagining anybody using it.

So, I'm torn. I see what you're saying, but I think I still tend to fall
on the side of leaving it as-is.

Thanks,
Taylor
Jeff King May 5, 2023, 9:39 p.m. UTC | #5
On Wed, May 03, 2023 at 04:18:44PM -0700, Junio C Hamano wrote:

> > +	When generating a cruft pack, use the shell to execute the
> > +	specified command(s), and interpret their output as additional
> > +	tips of objects to keep in the cruft pack, regardless of their
> 
> What is a "tip of an object"?  The first byte ;-)?
> 
> A "tip of history" would only imply commit objects, but presumably
> you would want to specify a tree and protect all the blobs and trees
> it recursively contains, so that is not a good name for it.

"tips of the object graph" perhaps?

> > +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> > +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> > +			goto done;
> > +		}
> > +
> > +		type = oid_object_info(the_repository, &oid, NULL);
> > +		if (type < 0)
> > +			continue;
> > +
> > +		obj = lookup_object_by_type(the_repository, &oid, type);
> > +		if (!obj)
> > +			continue;
> 
> Hmph, we may want to have an interface that lets us avoid looking up
> the same oid twice in the same set of tables.  Given an object
> unseen so far, oid_object_info() should have done most of the work
> necessary for lookup_object_by_type() to get to and start parsing
> the data of the object in the good case (i.e. object exists and in a
> pack---just we haven't needed it yet), but in the above sequence
> there is not enough information passed between two calls to take
> advantage of it.

This code was my suggestion, but it may have actually been a bad
direction.

I don't think communicating between oid_object_info() and
lookup_object_by_type() is important. The latter is only doing a lookup
in the internal hash with lookup_object(), and then auto-vivifying using
the type if necessary (which we provide to it).

The bigger inefficiency is that we call oid_object_info() before seeing
if we have already instantiated an object struct via lookup_object().

Obviously we could do that first. But let's take a step back. My
original suggestion was thinking that we don't want to call
parse_object() because it's expensive, especially for a blob. But in the
long run, most of these objects (except blobs!) will end up parsed
anyway, because we are going to see which other objects they reach.

So it's OK to parse anything except blobs. And indeed, we have a better
tool for that these days:

  obj = parse_object_with_flags(r, oid, PARSE_OBJECT_SKIP_HASH_CHECK);

That does exactly what we want. If we already saw and parsed the object,
it's quick noop after a hash lookup. If we didn't, then it already has
optimizations to avoid reading object contents if possible (checking the
commit graph, checking the type for blobs).

Skipping the hash check might seem like a bad idea for a repack, but
it's what we already do for blobs found via traversing. A disk repack
uses the much cheaper pack idx crc for exactly this purpose: to avoid
expanding objects unnecessarily.

-Peff
Jeff King May 5, 2023, 10:19 p.m. UTC | #6
On Wed, May 03, 2023 at 06:05:45PM -0400, Taylor Blau wrote:

> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline(&buf, out) != EOF) {
> +		struct object_id oid;
> +		struct object *obj;
> +		int type;
> +		const char *rest;
> +
> +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> +			goto done;
> +		}

While adapting this for my "how about this" patch elsewhere in the
thread, I noticed the error handling here is a little funny.

The other side sent us unexpected output, so we stopped parsing. But we
never reap the child process. We just jump to "done", which clears the
memory used by the process struct, but never call finish_command().

I think you want to break out of this loop and run finish_command. But
you have to make sure to fclose(out) first, so that it knows to exit
(otherwise it may fill up the pipe buffer and block waiting for us to
read, creating a deadlock).

And then

So something like:

  while (strbuf_getline(...)) {
	if (some_error) {
		ret = error(...);
		break;
  }

  fclose(out); /* no need for "if (out)" here; it's always open */
  ret |= finish_command(&cmd);

  /* no need to child_process_clear(); already done by finish_command() */
  strbuf_release(&buf);
  return ret;

-Peff
diff mbox series

Patch

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index d4c7c9d4e4..0b79bd1751 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -67,6 +67,17 @@  pack.deltaCacheLimit::
 	result once the best match for all objects is found.
 	Defaults to 1000. Maximum value is 65535.

+pack.extraCruftTips::
+	When generating a cruft pack, use the shell to execute the
+	specified command(s), and interpret their output as additional
+	tips of objects to keep in the cruft pack, regardless of their
+	age.
++
+Output must contain exactly one hex object ID per line, and nothing
+else. Objects which cannot be found in the repository are ignored.
+Multiple hooks are supported, but all must exit successfully, else no
+cruft pack will be generated.
+
 pack.threads::
 	Specifies the number of threads to spawn when searching for best
 	delta matches.  This requires that linkgit:git-pack-objects[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a5b466839b..6f6e7872cd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -44,6 +44,7 @@ 
 #include "pack-mtimes.h"
 #include "parse-options.h"
 #include "wrapper.h"
+#include "run-command.h"

 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -3543,11 +3544,85 @@  static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }

+static int enumerate_extra_cruft_tips_1(struct rev_info *revs, const char *args)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf buf = STRBUF_INIT;
+	FILE *out;
+	int ret = 0;
+
+	cmd.use_shell = 1;
+	cmd.out = -1;
+
+	strvec_push(&cmd.args, args);
+
+	if (start_command(&cmd))
+		return -1;
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline(&buf, out) != EOF) {
+		struct object_id oid;
+		struct object *obj;
+		int type;
+		const char *rest;
+
+		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
+			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
+			goto done;
+		}
+
+		type = oid_object_info(the_repository, &oid, NULL);
+		if (type < 0)
+			continue;
+
+		obj = lookup_object_by_type(the_repository, &oid, type);
+		if (!obj)
+			continue;
+
+		display_progress(progress_state, ++nr_seen);
+		add_pending_object(revs, obj, "");
+	}
+
+	ret = finish_command(&cmd);
+
+done:
+	if (out)
+		fclose(out);
+	strbuf_release(&buf);
+	child_process_clear(&cmd);
+
+	return ret;
+}
+
+static void add_extra_cruft_tips_to_traversal(struct rev_info *revs)
+{
+	const struct string_list *programs;
+	int ret = 0;
+	size_t i;
+
+	if (git_config_get_string_multi("pack.extracrufttips", &programs))
+		return;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
+	nr_seen = 0;
+	for (i = 0; i < programs->nr; i++) {
+		ret = enumerate_extra_cruft_tips_1(revs,
+						   programs->items[i].string);
+		if (ret)
+			break;
+	}
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to enumerate additional cruft tips"));
+}
+
 static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
 {
 	struct packed_git *p;
 	struct rev_info revs;
-	int ret;
+	int ret = 0;

 	repo_init_revisions(the_repository, &revs, NULL);

@@ -3560,11 +3635,15 @@  static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs

 	revs.ignore_missing_links = 1;

-	if (progress)
-		progress_state = start_progress(_("Enumerating cruft objects"), 0);
-	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
-						     set_cruft_mtime, 1);
-	stop_progress(&progress_state);
+	if (cruft_expiration) {
+		if (progress)
+			progress_state = start_progress(_("Enumerating cruft objects"), 0);
+		ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+							     set_cruft_mtime, 1);
+		stop_progress(&progress_state);
+	}
+
+	add_extra_cruft_tips_to_traversal(&revs);

 	if (ret)
 		die(_("unable to add cruft objects"));
@@ -3639,8 +3718,19 @@  static void read_cruft_objects(void)

 	if (cruft_expiration)
 		enumerate_and_traverse_cruft_objects(&fresh_packs);
-	else
+	else {
+		/*
+		 * call both enumerate_cruft_objects() and
+		 * enumerate_and_traverse_cruft_objects(), since:
+		 *
+		 *   - the former is required to implement non-pruning GCs
+		 *   - the latter is a noop for "cruft_expiration == 0", but
+		 *     picks up any additional tips mentioned via
+		 *     `pack.extraCruftTips`.
+		 */
 		enumerate_cruft_objects();
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
+	}

 	strbuf_release(&buf);
 	string_list_clear(&discard_packs, 0);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 303f7a5d84..3ad16a9fca 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -739,4 +739,175 @@  test_expect_success 'cruft objects are freshend via loose' '
 	)
 '

+test_expect_success 'additional cruft tips may be specified via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		# Create a handful of objects.
+		#
+		#   - one reachable commit, "base", designated for the reachable
+		#     pack
+		#   - one unreachable commit, "cruft.discard", which is marked
+		#     for deletion
+		#   - one unreachable commit, "cruft.old", which would be marked
+		#     for deletion, but is rescued as an extra cruft tip
+		#   - one unreachable commit, "cruft.new", which is not marked
+		#     for deletion
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan discard &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.discard &&
+
+		git checkout --orphan old &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.old &&
+		cruft_old="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan new &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.new &&
+		cruft_new="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D discard old new &&
+		git reflog expire --all --expire=all &&
+
+		# mark cruft.old with an mtime that is many minutes
+		# older than the expiration period, and mark cruft.new
+		# with an mtime that is in the future (and thus not
+		# eligible for pruning).
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $cruft_old)" &&
+		test-tool chmtime +1000 "$objdir/$(test_oid_to_path $cruft_new)" &&
+
+		# Write the list of cruft objects we expect to
+		# accumulate, which is comprised of everything reachable
+		# from cruft.old and cruft.new, but not cruft.discard.
+		git rev-list --objects --no-object-names \
+			$cruft_old $cruft_new >cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# Write the script to list extra tips, which are limited
+		# to cruft.old, in this case.
+		write_script extra-tips <<-EOF &&
+		echo $cruft_old
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# Ensure that the "old" objects are removed after
+		# dropping the pack.extraCruftTips hook.
+		git config --unset pack.extraCruftTips &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+
+		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&
+		cp cruft.expect cruft.old &&
+		sort cruft.raw >cruft.expect &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure objects which are no longer in the cruft pack were
+		# removed from the repository
+		for object in $(comm -13 cruft.expect cruft.old)
+		do
+			test_must_fail git cat-file -t $object || return 1
+		done
+	)
+'
+
+test_expect_success 'multi-valued pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git branch -M main &&
+
+		git checkout --orphan cruft.a &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.a &&
+		cruft_a="$(git rev-parse HEAD)" &&
+
+		git checkout --orphan cruft.b &&
+		git rm -fr . &&
+		test_commit --no-tag cruft.b &&
+		cruft_b="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D cruft.a cruft.b &&
+		git reflog expire --all --expire=all &&
+
+		echo "echo $cruft_a" | write_script extra-tips.a &&
+		echo "echo $cruft_b" | write_script extra-tips.b &&
+		echo "false" | write_script extra-tips.c &&
+
+		git rev-list --objects --no-object-names $cruft_a $cruft_b \
+			>cruft.raw &&
+		sort cruft.raw >cruft.expect &&
+
+		# ensure that each extra cruft tip is saved by its
+		# respective hook
+		git config --add pack.extraCruftTips ./extra-tips.a &&
+		git config --add pack.extraCruftTips ./extra-tips.b &&
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft | sort >cruft.actual &&
+		test_cmp cruft.expect cruft.actual &&
+
+		# ensure that a dirty exit halts cruft pack generation
+		git config --add pack.extraCruftTips ./extra-tips.c &&
+		test_must_fail git repack --cruft -d 2>err &&
+		grep "unable to enumerate additional cruft tips" err &&
+
+		# and that the existing cruft pack is left alone
+		test_path_is_file "$mtimes"
+	)
+'
+
+test_expect_success 'additional cruft blobs via pack.extraCruftTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		blob=$(echo "unreachable" | git hash-object -w --stdin) &&
+
+		# mark the unreachable blob we wrote above as having
+		# aged out of the retention period
+		test-tool chmtime -2000 "$objdir/$(test_oid_to_path $blob)" &&
+
+		# Write the script to list extra tips, which is just the
+		# extra blob as above.
+		write_script extra-tips <<-EOF &&
+		echo $blob
+		EOF
+		git config pack.extraCruftTips ./extra-tips &&
+
+		git repack --cruft --cruft-expiration=now -d &&
+
+		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
+		git show-index <${mtimes%.mtimes}.idx >cruft &&
+		cut -d" " -f2 cruft >actual &&
+		echo $blob >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done