Message ID | ddc2896caa13b9f1cdccb2f0a5892143fa98237c.1612411123.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | repack: support repacking into a geometric sequence | expand |
On Wed, Feb 03, 2021 at 10:58:57PM -0500, Taylor Blau wrote: > @@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi > return commit_ignore; > if (revs->unpacked && has_object_pack(&commit->object.oid)) > return commit_ignore; > + if (revs->no_kept_objects) { > + if (has_object_kept_pack(&commit->object.oid, > + revs->keep_pack_cache_flags)) > + return commit_ignore; > + } OK, so this has the same "problems" as --unpacked, which is that we can miss some objects (i.e., things that are reachable but not-kept may not be reported). But it should be OK in this version of the series, because we will not be relying on it for selection of objects, but only to fill in ordering / namehash fields. Should we warn people about that, either as a comment or in the commit message? > +--no-kept-objects[=<kind>]:: > + Halts the traversal as soon as an object in a kept pack is > + found. If `<kind>` is `on-disk`, only packs with a corresponding > + `*.keep` file are ignored. If `<kind>` is `in-core`, only packs > + with their in-core kept state set are ignored. Otherwise, both > + kinds of kept packs are ignored. Likewise, I wonder whether we need to expose this mode to users. Normally I'm a fan of doing so, because it allows scripted callers access to more of the internals, but: - the semantics are kind of weird about where we draw the line between performance and absolute correctness - the "in-core" thing is a bit weird for callers of rev-list; how do I as a caller mark a pack as kept-in-core? I think it's only an internal pack-objects thing. Once we support this in rev-list, we'll have to do it forever (or deal with deprecation, etc). If we just need it internally, maybe it's wise to leave it as a something you ask for by manipulating rev_info directly. Or perhaps leave it as an undocumented interface we use for testing, and not something we promise to keep working. > --- a/list-objects.c > +++ b/list-objects.c > @@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx, > ctx->show_object(obj, name, ctx->show_data); > continue; > } > + if (ctx->revs->no_kept_objects) { > + struct pack_entry e; > + if (find_kept_pack_entry(ctx->revs->repo, &obj->oid, > + ctx->revs->keep_pack_cache_flags, > + &e)) > + continue; > + } This hunk is interesting. There is no similar check for revs->unpacked in list-objects.c to cut off the traversal. And indeed, running "rev-list --unpacked" will generally look at the _whole_ tree for a commit that is unpacked, even if all of the tree entries are packed. That's something we might consider changing in the name of performance (though it does increase the number of cases where --unpacked will fail to find an unpacked but reachable object). But this is a funny place to put it. If I understand it correctly, it is cutting off the traversal at the very top of the tree. I.e., if we had a commit that is not-kept, we'd queue it's root tree. And then we might find that the root tree is kept, and avoid traversing it. But if we _do_ traverse it, we would look at every subtree it contains, even if they are kept! That's because we recurse the tree via the recursive process_tree(), not by queueing more objects in the pending array here. So this check seems to exist in a funny middle ground. I think it's unlikely to catch anything useful (usually commits have a unique root tree; it's all of the untouched parts of the subtrees that will be in the kept packs). IMHO we should either drop it (and act like "--unpacked", accepting that we may traverse some extra tree objects), or we should go all-in on performance and cut it off in the top of process_tree(). -Peff
On Tue, Feb 16, 2021 at 06:17:40PM -0500, Jeff King wrote: > On Wed, Feb 03, 2021 at 10:58:57PM -0500, Taylor Blau wrote: > > > @@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi > > return commit_ignore; > > if (revs->unpacked && has_object_pack(&commit->object.oid)) > > return commit_ignore; > > + if (revs->no_kept_objects) { > > + if (has_object_kept_pack(&commit->object.oid, > > + revs->keep_pack_cache_flags)) > > + return commit_ignore; > > + } > > OK, so this has the same "problems" as --unpacked, which is that we can > miss some objects (i.e., things that are reachable but not-kept may not > be reported). But it should be OK in this version of the series, because > we will not be relying on it for selection of objects, but only to fill > in ordering / namehash fields. > > Should we warn people about that, either as a comment or in the commit > message? Yeah, let's warn about it in the commit message. We could put it in the documentation, but... > > +--no-kept-objects[=<kind>]:: > > + Halts the traversal as soon as an object in a kept pack is > > + found. If `<kind>` is `on-disk`, only packs with a corresponding > > + `*.keep` file are ignored. If `<kind>` is `in-core`, only packs > > + with their in-core kept state set are ignored. Otherwise, both > > + kinds of kept packs are ignored. > > Likewise, I wonder whether we need to expose this mode to users. > Normally I'm a fan of doing so, because it allows scripted callers > access to more of the internals, but: > > - the semantics are kind of weird about where we draw the line between > performance and absolute correctness > > - the "in-core" thing is a bit weird for callers of rev-list; how do I > as a caller mark a pack as kept-in-core? I think it's only an > internal pack-objects thing. > > Once we support this in rev-list, we'll have to do it forever (or deal > with deprecation, etc). If we just need it internally, maybe it's wise > to leave it as a something you ask for by manipulating rev_info > directly. Or perhaps leave it as an undocumented interface we use for > testing, and not something we promise to keep working. I think that you raise a good point about not advertising this option, since doing so paints us into a corner that we have to keep it working and behaving consistently forever. I'm not opposed to the idea that we may eventually want to do so, but I think that this is too early for that. As you note, we *could* just expose it in rev_info flags, but that makes it much more difficult to test some of the tricky cases that are added in t6114, so I think a middle ground of having an undocumented option satisfies both of our wants. > > --- a/list-objects.c > > +++ b/list-objects.c > > @@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx, > > ctx->show_object(obj, name, ctx->show_data); > > continue; > > } > > + if (ctx->revs->no_kept_objects) { > > + struct pack_entry e; > > + if (find_kept_pack_entry(ctx->revs->repo, &obj->oid, > > + ctx->revs->keep_pack_cache_flags, > > + &e)) > > + continue; > > + } > > This hunk is interesting. > > There is no similar check for revs->unpacked in list-objects.c to cut > off the traversal. And indeed, running "rev-list --unpacked" will > generally look at the _whole_ tree for a commit that is unpacked, even > if all of the tree entries are packed. That's something we might > consider changing in the name of performance (though it does increase > the number of cases where --unpacked will fail to find an unpacked but > reachable object). > > But this is a funny place to put it. If I understand it correctly, it is > cutting off the traversal at the very top of the tree. I.e., if we had a > commit that is not-kept, we'd queue it's root tree. And then we might > find that the root tree is kept, and avoid traversing it. But if we _do_ > traverse it, we would look at every subtree it contains, even if they > are kept! That's because we recurse the tree via the recursive > process_tree(), not by queueing more objects in the pending array here. > > So this check seems to exist in a funny middle ground. I think it's > unlikely to catch anything useful (usually commits have a unique root > tree; it's all of the untouched parts of the subtrees that will be in > the kept packs). IMHO we should either drop it (and act like > "--unpacked", accepting that we may traverse some extra tree objects), > or we should go all-in on performance and cut it off in the top of > process_tree(). Agreed. Let's drop it. > -Peff Thanks, Taylor
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt index 96cc89d157..f611832277 100644 --- a/Documentation/rev-list-options.txt +++ b/Documentation/rev-list-options.txt @@ -861,6 +861,13 @@ ifdef::git-rev-list[] Only useful with `--objects`; print the object IDs that are not in packs. +--no-kept-objects[=<kind>]:: + Halts the traversal as soon as an object in a kept pack is + found. If `<kind>` is `on-disk`, only packs with a corresponding + `*.keep` file are ignored. If `<kind>` is `in-core`, only packs + with their in-core kept state set are ignored. Otherwise, both + kinds of kept packs are ignored. + --object-names:: Only useful with `--objects`; print the names of the object IDs that are found. This is the default behavior. diff --git a/list-objects.c b/list-objects.c index e19589baa0..b06c3bfeba 100644 --- a/list-objects.c +++ b/list-objects.c @@ -338,6 +338,13 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx, ctx->show_object(obj, name, ctx->show_data); continue; } + if (ctx->revs->no_kept_objects) { + struct pack_entry e; + if (find_kept_pack_entry(ctx->revs->repo, &obj->oid, + ctx->revs->keep_pack_cache_flags, + &e)) + continue; + } if (!path) path = ""; if (obj->type == OBJ_TREE) { diff --git a/revision.c b/revision.c index fbc3e607fd..4c5adb90b1 100644 --- a/revision.c +++ b/revision.c @@ -2336,6 +2336,16 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg revs->unpacked = 1; } else if (starts_with(arg, "--unpacked=")) { die(_("--unpacked=<packfile> no longer supported")); + } else if (!strcmp(arg, "--no-kept-objects")) { + revs->no_kept_objects = 1; + revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS; + } else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) { + revs->no_kept_objects = 1; + if (!strcmp(optarg, "in-core")) + revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + if (!strcmp(optarg, "on-disk")) + revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS; } else if (!strcmp(arg, "-r")) { revs->diff = 1; revs->diffopt.flags.recursive = 1; @@ -3797,6 +3807,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi return commit_ignore; if (revs->unpacked && has_object_pack(&commit->object.oid)) return commit_ignore; + if (revs->no_kept_objects) { + if (has_object_kept_pack(&commit->object.oid, + revs->keep_pack_cache_flags)) + return commit_ignore; + } if (commit->object.flags & UNINTERESTING) return commit_ignore; if (revs->line_level_traverse && !want_ancestry(revs)) { diff --git a/revision.h b/revision.h index e6be3c845e..a20a530d52 100644 --- a/revision.h +++ b/revision.h @@ -148,6 +148,7 @@ struct rev_info { edge_hint_aggressive:1, limited:1, unpacked:1, + no_kept_objects:1, boundary:2, count:1, left_right:1, @@ -317,6 +318,9 @@ struct rev_info { * This is loaded from the commit-graph being used. */ struct bloom_filter_settings *bloom_filter_settings; + + /* misc. flags related to '--no-kept-objects' */ + unsigned keep_pack_cache_flags; }; int ref_excluded(struct string_list *, const char *path); diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh new file mode 100755 index 0000000000..9239d8aa46 --- /dev/null +++ b/t/t6114-keep-packs.sh @@ -0,0 +1,69 @@ +#!/bin/sh + +test_description='rev-list with .keep packs' +. ./test-lib.sh + +test_expect_success 'setup' ' + test_commit loose && + test_commit packed && + test_commit kept && + + KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF + refs/tags/kept + ^refs/tags/packed + EOF + ) && + MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF + refs/tags/packed + ^refs/tags/loose + EOF + ) && + + touch .git/objects/pack/pack-$KEPT_PACK.keep +' + +rev_list_objects () { + git rev-list "$@" >out && + sort out +} + +idx_objects () { + git show-index <$1 >expect-idx && + cut -d" " -f2 <expect-idx | sort +} + +test_expect_success '--no-kept-objects excludes trees and blobs in .keep packs' ' + rev_list_objects --objects --all --no-object-names >kept && + rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept && + + idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect && + comm -3 kept no-kept >actual && + + test_cmp expect actual +' + +test_expect_success '--no-kept-objects excludes kept non-MIDX object' ' + test_config core.multiPackIndex true && + + # Create a pack with just the commit object in pack, and do not mark it + # as kept (even though it appears in $KEPT_PACK, which does have a .keep + # file). + MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF + $(git rev-parse kept) + EOF + ) && + + # Write a MIDX containing all packs, but use the version of the commit + # at "kept" in a non-kept pack by touching $MIDX_PACK. + touch .git/objects/pack/pack-$MIDX_PACK.pack && + git multi-pack-index write && + + rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual && + ( + idx_objects .git/objects/pack/pack-$MISC_PACK.idx && + git rev-list --objects --no-object-names refs/tags/loose + ) | sort >expect && + test_cmp expect actual +' + +test_done
A future caller will want to be able to perform a reachability traversal which terminates when visiting an object found in a kept pack. The closest existing option is '--honor-pack-keep', but this isn't quite what we want. Instead of halting the traversal midway through, a full traversal is always performed, and the results are only trimmed afterwords. Besides needing to introduce a new flag (since culling results post-facto can be different than halting the traversal as it's happening), there is an additional wrinkle handling the distinction in-core and on-disk kept packs. That is: what kinds of kept pack should stop the traversal? Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds of kept packs, if any, should stop a traversal. This can be useful for callers that want to perform a reachability analysis, but want to leave certain packs alone (for e.g., when doing a geometric repack that has some "large" packs which are kept in-core that it wants to leave alone). Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Documentation/rev-list-options.txt | 7 +++ list-objects.c | 7 +++ revision.c | 15 +++++++ revision.h | 4 ++ t/t6114-keep-packs.sh | 69 ++++++++++++++++++++++++++++++ 5 files changed, 102 insertions(+) create mode 100755 t/t6114-keep-packs.sh