diff mbox series

[02/10] revision: learn '--no-kept-objects'

Message ID 4184529648abe7451b5c7b772493df8c067cec82.1611098616.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series repack: support repacking into a geometric sequence | expand

Commit Message

Taylor Blau Jan. 19, 2021, 11:24 p.m. UTC
Some callers want to perform a reachability traversal that terminates
when an object is found in a kept pack. The closest existing option is
'--honor-pack-keep', but this isn't quite what we want. Instead of
halting the traversal midway through, a full traversal is always
performed, and the results are only trimmed afterwords.

Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
in-core and on-disk kept packs. That is: what kinds of kept pack should
stop the traversal?

Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (for e.g., when doing a geometric repack that has
some "large" packs it wants to leave alone).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/rev-list-options.txt |  7 +++
 list-objects.c                     |  7 +++
 revision.c                         | 15 +++++++
 revision.h                         |  4 ++
 t/t6114-keep-packs.sh              | 69 ++++++++++++++++++++++++++++++
 5 files changed, 102 insertions(+)
 create mode 100755 t/t6114-keep-packs.sh

Comments

Junio C Hamano Jan. 29, 2021, 3:10 a.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> Some callers want to perform a reachability traversal that terminates
> when an object is found in a kept pack. The closest existing option is
> '--honor-pack-keep', but this isn't quite what we want. Instead of
> halting the traversal midway through, a full traversal is always
> performed, and the results are only trimmed afterwords.

True.  

Is there a reason to keep both kinds?  It is obvious that stopping
traversal once we hit a kept pack would be more time and space
efficient (I presume that the reason why .kept pack matters is
because we are repacking everything else) to enumerate the objects
that need to be repacked than traversing all the way and filtering
out objects that appear in .kept packs, but would there be some
correctness implications to replace the existing use of
"--honor-pack-keep" with "--no-kept-objects=on-disk"?  

What it means to be excluded by the former is quite clear: any
object that appears in a kept pack, whether another copy of it
appears elsewhere, is excluded from getting enumerated for
repacking.  It is quite unclear what it means to enumerate objects
with "--no-kept-objects".  It is clear from the implementation side
of the thing (stop traversal at objects that appear in any kept
pack), but it is totally unclear what such a meaning defined
operationally affects the resulting enumeration.  We know that the
enumerated objects do not appear in any of the kept pack, but it
does not mean all objects that are reachable/in-use that are not in
any kept packs are enumerated.

> diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> index 002379056a..817419d552 100644
> --- a/Documentation/rev-list-options.txt
> +++ b/Documentation/rev-list-options.txt
> @@ -856,6 +856,13 @@ ifdef::git-rev-list[]
>  	Only useful with `--objects`; print the object IDs that are not
>  	in packs.
>  
> +--no-kept-objects[=<kind>]::
> +	Halts the traversal as soon as an object in a kept pack is
> +	found. If `<kind>` is `on-disk`, only packs with a corresponding
> +	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
> +	with their in-core kept state set are ignored. Otherwise, both
> +	kinds of kept packs are ignored.

Is it explained anywhere how "in-core kept state" is bootstrapped,
modified and maintained?

The patch to C-part itself is a trivially correct implementation of
"stop at an object that can be found in a kept pack", and there is
no comment, but it is not clear to me what we want to achieve by
this.  Is the underlying assumption that no objects in .kept pack
would refer to outside world, either loose or packs that are not
kept?  How are we guaranteeing it?
Taylor Blau Jan. 29, 2021, 7:13 p.m. UTC | #2
On Thu, Jan 28, 2021 at 07:10:04PM -0800, Junio C Hamano wrote:
> We know that the enumerated objects do not appear in any of the kept
> pack, but it does not mean all objects that are reachable/in-use that
> are not in any kept packs are enumerated.

You raise a very valid point. FWIW, I originally wrote these patches as
just "enumerate the objects in these small packs, make a new pack out of
those, and then (optionally) delete the small ones. I abandoned that
idea because it needs special handling for loose objects, and it has no
idea which objects are unreachable, etc.

But maybe it is time to go back to the drawing board there. Perhaps a
`--geometric` repack implies that we keep unreachable objects in effect,
and that a full repack (i.e., one that does reachability analysis) is
required to drop them.

Other ideas are welcome.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 002379056a..817419d552 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -856,6 +856,13 @@  ifdef::git-rev-list[]
 	Only useful with `--objects`; print the object IDs that are not
 	in packs.
 
+--no-kept-objects[=<kind>]::
+	Halts the traversal as soon as an object in a kept pack is
+	found. If `<kind>` is `on-disk`, only packs with a corresponding
+	`*.keep` file are ignored. If `<kind>` is `in-core`, only packs
+	with their in-core kept state set are ignored. Otherwise, both
+	kinds of kept packs are ignored.
+
 --object-names::
 	Only useful with `--objects`; print the names of the object IDs
 	that are found. This is the default behavior.
diff --git a/list-objects.c b/list-objects.c
index e19589baa0..b06c3bfeba 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -338,6 +338,13 @@  static void traverse_trees_and_blobs(struct traversal_context *ctx,
 			ctx->show_object(obj, name, ctx->show_data);
 			continue;
 		}
+		if (ctx->revs->no_kept_objects) {
+			struct pack_entry e;
+			if (find_kept_pack_entry(ctx->revs->repo, &obj->oid,
+						 ctx->revs->keep_pack_cache_flags,
+						 &e))
+				continue;
+		}
 		if (!path)
 			path = "";
 		if (obj->type == OBJ_TREE) {
diff --git a/revision.c b/revision.c
index 1bb590ece7..ff1ea77224 100644
--- a/revision.c
+++ b/revision.c
@@ -2334,6 +2334,16 @@  static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->unpacked = 1;
 	} else if (starts_with(arg, "--unpacked=")) {
 		die(_("--unpacked=<packfile> no longer supported"));
+	} else if (!strcmp(arg, "--no-kept-objects")) {
+		revs->no_kept_objects = 1;
+		revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
+	} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
+		revs->no_kept_objects = 1;
+		if (!strcmp(optarg, "in-core"))
+			revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
+		if (!strcmp(optarg, "on-disk"))
+			revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
 	} else if (!strcmp(arg, "-r")) {
 		revs->diff = 1;
 		revs->diffopt.flags.recursive = 1;
@@ -3822,6 +3832,11 @@  enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (revs->unpacked && has_object_pack(&commit->object.oid))
 		return commit_ignore;
+	if (revs->no_kept_objects) {
+		if (has_object_kept_pack(&commit->object.oid,
+					 revs->keep_pack_cache_flags))
+			return commit_ignore;
+	}
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
 	if (revs->line_level_traverse && !want_ancestry(revs)) {
diff --git a/revision.h b/revision.h
index 086ff10280..15d0e6aee5 100644
--- a/revision.h
+++ b/revision.h
@@ -148,6 +148,7 @@  struct rev_info {
 			edge_hint_aggressive:1,
 			limited:1,
 			unpacked:1,
+			no_kept_objects:1,
 			boundary:2,
 			count:1,
 			left_right:1,
@@ -312,6 +313,9 @@  struct rev_info {
 	 * This is loaded from the commit-graph being used.
 	 */
 	struct bloom_filter_settings *bloom_filter_settings;
+
+	/* misc. flags related to '--no-kept-objects' */
+	unsigned keep_pack_cache_flags;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
new file mode 100755
index 0000000000..9239d8aa46
--- /dev/null
+++ b/t/t6114-keep-packs.sh
@@ -0,0 +1,69 @@ 
+#!/bin/sh
+
+test_description='rev-list with .keep packs'
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit loose &&
+	test_commit packed &&
+	test_commit kept &&
+
+	KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/kept
+	^refs/tags/packed
+	EOF
+	) &&
+	MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/packed
+	^refs/tags/loose
+	EOF
+	) &&
+
+	touch .git/objects/pack/pack-$KEPT_PACK.keep
+'
+
+rev_list_objects () {
+	git rev-list "$@" >out &&
+	sort out
+}
+
+idx_objects () {
+	git show-index <$1 >expect-idx &&
+	cut -d" " -f2 <expect-idx | sort
+}
+
+test_expect_success '--no-kept-objects excludes trees and blobs in .keep packs' '
+	rev_list_objects --objects --all --no-object-names >kept &&
+	rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept &&
+
+	idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect &&
+	comm -3 kept no-kept >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
+	test_config core.multiPackIndex true &&
+
+	# Create a pack with just the commit object in pack, and do not mark it
+	# as kept (even though it appears in $KEPT_PACK, which does have a .keep
+	# file).
+	MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$(git rev-parse kept)
+	EOF
+	) &&
+
+	# Write a MIDX containing all packs, but use the version of the commit
+	# at "kept" in a non-kept pack by touching $MIDX_PACK.
+	touch .git/objects/pack/pack-$MIDX_PACK.pack &&
+	git multi-pack-index write &&
+
+	rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual &&
+	(
+		idx_objects .git/objects/pack/pack-$MISC_PACK.idx &&
+		git rev-list --objects --no-object-names refs/tags/loose
+	) | sort >expect &&
+	test_cmp expect actual
+'
+
+test_done