diff mbox series

[v2,2/8] revision: learn '--no-kept-objects'

Message ID ddc2896caa13b9f1cdccb2f0a5892143fa98237c.1612411123.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series repack: support repacking into a geometric sequence | expand

Commit Message

Taylor Blau Feb. 4, 2021, 3:58 a.m. UTC
A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwords.

Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
in-core and on-disk kept packs. That is: what kinds of kept pack should
stop the traversal?

Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (for e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/rev-list-options.txt |  7 +++
 list-objects.c                     |  7 +++
 revision.c                         | 15 +++++++
 revision.h                         |  4 ++
 t/t6114-keep-packs.sh              | 69 ++++++++++++++++++++++++++++++
 5 files changed, 102 insertions(+)
 create mode 100755 t/t6114-keep-packs.sh

Comments

Jeff King Feb. 16, 2021, 11:17 p.m. UTC | #1
On Wed, Feb 03, 2021 at 10:58:57PM -0500, Taylor Blau wrote:

> @@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
>  		return commit_ignore;
>  	if (revs->unpacked && has_object_pack(&commit->object.oid))
>  		return commit_ignore;
> +	if (revs->no_kept_objects) {
> +		if (has_object_kept_pack(&commit->object.oid,
> +					 revs->keep_pack_cache_flags))
> +			return commit_ignore;
> +	}

OK, so this has the same "problems" as --unpacked, which is that we can
miss some objects (i.e., things that are reachable but not-kept may not
be reported). But it should be OK in this version of the series, because
we will not be relying on it for selection of objects, but only to fill
in ordering / namehash fields.

Should we warn people about that, either as a comment or in the commit
message?

> +--no-kept-objects[=<kind>]::
> +	Halts the traversal as soon as an object in a kept pack is
> +	found. If `<kind>` is `on-disk`, only packs with a corresponding
> +	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
> +	with their in-core kept state set are ignored. Otherwise, both
> +	kinds of kept packs are ignored.

Likewise, I wonder whether we need to expose this mode to users.
Normally I'm a fan of doing so, because it allows scripted callers
access to more of the internals, but:

  - the semantics are kind of weird about where we draw the line between
    performance and absolute correctness

  - the "in-core" thing is a bit weird for callers of rev-list; how do I
    as a caller mark a pack as kept-in-core? I think it's only an
    internal pack-objects thing.

Once we support this in rev-list, we'll have to do it forever (or deal
with deprecation, etc). If we just need it internally, maybe it's wise
to leave it as a something you ask for by manipulating rev_info
directly. Or perhaps leave it as an undocumented interface we use for
testing, and not something we promise to keep working.

> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
>  			ctx->show_object(obj, name, ctx->show_data);
>  			continue;
>  		}
> +		if (ctx->revs->no_kept_objects) {
> +			struct pack_entry e;
> +			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
> +						 ctx->revs->keep_pack_cache_flags,
> +						 &e))
> +				continue;
> +		}

This hunk is interesting.

There is no similar check for revs->unpacked in list-objects.c to cut
off the traversal. And indeed, running "rev-list --unpacked" will
generally look at the _whole_ tree for a commit that is unpacked, even
if all of the tree entries are packed. That's something we might
consider changing in the name of performance (though it does increase
the number of cases where --unpacked will fail to find an unpacked but
reachable object).

But this is a funny place to put it. If I understand it correctly, it is
cutting off the traversal at the very top of the tree. I.e., if we had a
commit that is not-kept, we'd queue it's root tree. And then we might
find that the root tree is kept, and avoid traversing it. But if we _do_
traverse it, we would look at every subtree it contains, even if they
are kept! That's because we recurse the tree via the recursive
process_tree(), not by queueing more objects in the pending array here.

So this check seems to exist in a funny middle ground. I think it's
unlikely to catch anything useful (usually commits have a unique root
tree; it's all of the untouched parts of the subtrees that will be in
the kept packs). IMHO we should either drop it (and act like
"--unpacked", accepting that we may traverse some extra tree objects),
or we should go all-in on performance and cut it off in the top of
process_tree().

-Peff
Taylor Blau Feb. 17, 2021, 6:35 p.m. UTC | #2
On Tue, Feb 16, 2021 at 06:17:40PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:58:57PM -0500, Taylor Blau wrote:
>
> > @@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
> >  		return commit_ignore;
> >  	if (revs->unpacked && has_object_pack(&commit->object.oid))
> >  		return commit_ignore;
> > +	if (revs->no_kept_objects) {
> > +		if (has_object_kept_pack(&commit->object.oid,
> > +					 revs->keep_pack_cache_flags))
> > +			return commit_ignore;
> > +	}
>
> OK, so this has the same "problems" as --unpacked, which is that we can
> miss some objects (i.e., things that are reachable but not-kept may not
> be reported). But it should be OK in this version of the series, because
> we will not be relying on it for selection of objects, but only to fill
> in ordering / namehash fields.
>
> Should we warn people about that, either as a comment or in the commit
> message?

Yeah, let's warn about it in the commit message. We could put it in the
documentation, but...

> > +--no-kept-objects[=<kind>]::
> > +	Halts the traversal as soon as an object in a kept pack is
> > +	found. If `<kind>` is `on-disk`, only packs with a corresponding
> > +	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
> > +	with their in-core kept state set are ignored. Otherwise, both
> > +	kinds of kept packs are ignored.
>
> Likewise, I wonder whether we need to expose this mode to users.
> Normally I'm a fan of doing so, because it allows scripted callers
> access to more of the internals, but:
>
>   - the semantics are kind of weird about where we draw the line between
>     performance and absolute correctness
>
>   - the "in-core" thing is a bit weird for callers of rev-list; how do I
>     as a caller mark a pack as kept-in-core? I think it's only an
>     internal pack-objects thing.
>
> Once we support this in rev-list, we'll have to do it forever (or deal
> with deprecation, etc). If we just need it internally, maybe it's wise
> to leave it as a something you ask for by manipulating rev_info
> directly. Or perhaps leave it as an undocumented interface we use for
> testing, and not something we promise to keep working.

I think that you raise a good point about not advertising this option,
since doing so paints us into a corner that we have to keep it working
and behaving consistently forever.

I'm not opposed to the idea that we may eventually want to do so, but I
think that this is too early for that. As you note, we *could* just
expose it in rev_info flags, but that makes it much more difficult to
test some of the tricky cases that are added in t6114, so I think a
middle ground of having an undocumented option satisfies both of our
wants.

> > --- a/list-objects.c
> > +++ b/list-objects.c
> > @@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
> >  			ctx->show_object(obj, name, ctx->show_data);
> >  			continue;
> >  		}
> > +		if (ctx->revs->no_kept_objects) {
> > +			struct pack_entry e;
> > +			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
> > +						 ctx->revs->keep_pack_cache_flags,
> > +						 &e))
> > +				continue;
> > +		}
>
> This hunk is interesting.
>
> There is no similar check for revs->unpacked in list-objects.c to cut
> off the traversal. And indeed, running "rev-list --unpacked" will
> generally look at the _whole_ tree for a commit that is unpacked, even
> if all of the tree entries are packed. That's something we might
> consider changing in the name of performance (though it does increase
> the number of cases where --unpacked will fail to find an unpacked but
> reachable object).
>
> But this is a funny place to put it. If I understand it correctly, it is
> cutting off the traversal at the very top of the tree. I.e., if we had a
> commit that is not-kept, we'd queue it's root tree. And then we might
> find that the root tree is kept, and avoid traversing it. But if we _do_
> traverse it, we would look at every subtree it contains, even if they
> are kept! That's because we recurse the tree via the recursive
> process_tree(), not by queueing more objects in the pending array here.
>
> So this check seems to exist in a funny middle ground. I think it's
> unlikely to catch anything useful (usually commits have a unique root
> tree; it's all of the untouched parts of the subtrees that will be in
> the kept packs). IMHO we should either drop it (and act like
> "--unpacked", accepting that we may traverse some extra tree objects),
> or we should go all-in on performance and cut it off in the top of
> process_tree().

Agreed. Let's drop it.

> -Peff

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 96cc89d157..f611832277 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -861,6 +861,13 @@  ifdef::git-rev-list[]
 	Only useful with `--objects`; print the object IDs that are not
 	in packs.
 
+--no-kept-objects[=<kind>]::
+	Halts the traversal as soon as an object in a kept pack is
+	found. If `<kind>` is `on-disk`, only packs with a corresponding
+	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
+	with their in-core kept state set are ignored. Otherwise, both
+	kinds of kept packs are ignored.
+
 --object-names::
 	Only useful with `--objects`; print the names of the object IDs
 	that are found. This is the default behavior.
diff --git a/list-objects.c b/list-objects.c
index e19589baa0..b06c3bfeba 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -338,6 +338,13 @@  static void traverse_trees_and_blobs(struct traversal_context *ctx,
 			ctx->show_object(obj, name, ctx->show_data);
 			continue;
 		}
+		if (ctx->revs->no_kept_objects) {
+			struct pack_entry e;
+			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
+						 ctx->revs->keep_pack_cache_flags,
+						 &e))
+				continue;
+		}
 		if (!path)
 			path = "";
 		if (obj->type == OBJ_TREE) {
diff --git a/revision.c b/revision.c
index fbc3e607fd..4c5adb90b1 100644
--- a/revision.c
+++ b/revision.c
@@ -2336,6 +2336,16 @@  static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->unpacked = 1;
 	} else if (starts_with(arg, "--unpacked=")) {
 		die(_("--unpacked=<packfile> no longer supported"));
+	} else if (!strcmp(arg, "--no-kept-objects")) {
+		revs->no_kept_objects = 1;
+		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
+		revs->no_kept_objects = 1;
+		if (!strcmp(optarg, "in-core"))
+			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		if (!strcmp(optarg, "on-disk"))
+			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
@@ -3797,6 +3807,11 @@  enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (revs->unpacked && has_object_pack(&commit->object.oid))
 		return commit_ignore;
+	if (revs->no_kept_objects) {
+		if (has_object_kept_pack(&commit->object.oid,
+					 revs->keep_pack_cache_flags))
+			return commit_ignore;
+	}
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
 	if (revs->line_level_traverse && !want_ancestry(revs)) {
diff --git a/revision.h b/revision.h
index e6be3c845e..a20a530d52 100644
--- a/revision.h
+++ b/revision.h
@@ -148,6 +148,7 @@  struct rev_info {
 			edge_hint_aggressive:1,
 			limited:1,
 			unpacked:1,
+			no_kept_objects:1,
 			boundary:2,
 			count:1,
 			left_right:1,
@@ -317,6 +318,9 @@  struct rev_info {
 	 * This is loaded from the commit-graph being used.
 	 */
 	struct bloom_filter_settings *bloom_filter_settings;
+
+	/* misc. flags related to '--no-kept-objects' */
+	unsigned keep_pack_cache_flags;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
new file mode 100755
index 0000000000..9239d8aa46
--- /dev/null
+++ b/t/t6114-keep-packs.sh
@@ -0,0 +1,69 @@ 
+#!/bin/sh
+
+test_description='rev-list with .keep packs'
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit loose &&
+	test_commit packed &&
+	test_commit kept &&
+
+	KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/kept
+	^refs/tags/packed
+	EOF
+	) &&
+	MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/packed
+	^refs/tags/loose
+	EOF
+	) &&
+
+	touch .git/objects/pack/pack-$KEPT_PACK.keep
+'
+
+rev_list_objects () {
+	git rev-list "$@" >out &&
+	sort out
+}
+
+idx_objects () {
+	git show-index <$1 >expect-idx &&
+	cut -d" " -f2 <expect-idx | sort
+}
+
+test_expect_success '--no-kept-objects excludes trees and blobs in .keep packs' '
+	rev_list_objects --objects --all --no-object-names >kept &&
+	rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept &&
+
+	idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect &&
+	comm -3 kept no-kept >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
+	test_config core.multiPackIndex true &&
+
+	# Create a pack with just the commit object in pack, and do not mark it
+	# as kept (even though it appears in $KEPT_PACK, which does have a .keep
+	# file).
+	MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$(git rev-parse kept)
+	EOF
+	) &&
+
+	# Write a MIDX containing all packs, but use the version of the commit
+	# at "kept" in a non-kept pack by touching $MIDX_PACK.
+	touch .git/objects/pack/pack-$MIDX_PACK.pack &&
+	git multi-pack-index write &&
+
+	rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual &&
+	(
+		idx_objects .git/objects/pack/pack-$MISC_PACK.idx &&
+		git rev-list --objects --no-object-names refs/tags/loose
+	) | sort >expect &&
+	test_cmp expect actual
+'
+
+test_done