Message ID | 20231019121024.194317-4-karthik.188@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | rev-list: add support for commits in `--missing` | expand |
Karthik Nayak <karthik.188@gmail.com> writes: > diff --git a/object.h b/object.h > index 114d45954d..b76830fce1 100644 > --- a/object.h > +++ b/object.h > @@ -62,7 +62,7 @@ void object_array_init(struct object_array *array); > > /* > * object flag allocation: > - * revision.h: 0---------10 15 23------27 > + * revision.h: 0---------10 15 22------28 > * fetch-pack.c: 01 67 > * negotiator/default.c: 2--5 > * walker.c: 0-2 > @@ -82,7 +82,7 @@ void object_array_init(struct object_array *array); > * builtin/show-branch.c: 0-------------------------------------------26 > * builtin/unpack-objects.c: 2021 > */ > -#define FLAG_BITS 28 > +#define FLAG_BITS 29 > > #define TYPE_BITS 3 I am afraid that this is not a good direction to go, given that the way FLAG_BITS is used is like this: /* * The object type is stored in 3 bits. */ struct object { unsigned parsed : 1; unsigned type : TYPE_BITS; unsigned flags : FLAG_BITS; struct object_id oid; }; 28 was there, not as a random number of bits we happen to be using. It was derived by (32 - 3 - 1), i.e. ensure the bitfields above are stored within a single word. sizeof(struct object) is 40 bytes on x86-64, with offsetof(oid) being 4 (i.e. the bitfields fit in a single 4-byte word). If we make FLAG_BITS 29, we will add 4 bytes to the structure and waste 31-bit per each and every in-core objects. Do we really need to allocate a new bit in the object flags, which is already a scarce resource?
Junio C Hamano <gitster@pobox.com> writes: > Do we really need to allocate a new bit in the object flags, which > is already a scarce resource? Clarification. I was *not* wondering if we can steal and (re|ab)use a bit that is used for other purposes, in order to avoid allocating a new bit. Rather, I was wondering if we need to use object flags to mark these objects, or can do what we want to do without using any object flags at all. For the purpose of reporting "missing" objects, wouldn't it be sufficient to walk the object graph and report our findings as we go? To avoid reporting the same object twice, as we reasonably can expect that the missing objects are minority (compared to the total number of objects), perhaps the codepath that makes such a report can use a hashmap of object_ids or something, for example. Thanks.
On Fri, Oct 20, 2023 at 1:35 AM Junio C Hamano <gitster@pobox.com> wrote: > > Junio C Hamano <gitster@pobox.com> writes: > > > Do we really need to allocate a new bit in the object flags, which > > is already a scarce resource? > > Clarification. I was *not* wondering if we can steal and (re|ab)use > a bit that is used for other purposes, in order to avoid allocating > a new bit. > > Rather, I was wondering if we need to use object flags to mark these > objects, or can do what we want to do without using any object flags > at all. For the purpose of reporting "missing" objects, wouldn't it > be sufficient to walk the object graph and report our findings as we > go? To avoid reporting the same object twice, as we reasonably can > expect that the missing objects are minority (compared to the total > number of objects), perhaps the codepath that makes such a report > can use a hashmap of object_ids or something, for example. > This is kind of hard to do since the revision walking happens in revision.c and list-objects.c but the missing commits are needed in builtin/rev-list.c. So even if we build a list of missing commits in list-objects.c, there is no nice way to send this back to rev-list.c. The only way we communicate between these layers is through callbacks, i.e. the show_commit() function in rev-list.c is called for every commit that the revision walk traverses over. I'm not entirely sure what the best path to take here to be honest. I will look and see if there are any bits where overlapping doesn't cause any issues in the meantime.
On Fri, Oct 20, 2023 at 1:14 PM Karthik Nayak <karthik.188@gmail.com> wrote: > I'm not entirely sure what the best path to take here to be honest. I will look > and see if there are any bits where overlapping doesn't cause any issues > in the meantime. Was trying to use bit number 12, which coincides METAINFO_SHOWN in builtin/blame.c. From skimming over the code, METAINFO_SHOWN is used only within blame.c and there should not be collisions here since blame.c doesn't set the do_not_die_on_missing_objects bit either. The tests seem to pass here too [1]. This could be the potential solution here, Junio, what do you think? I could send another version if this approach looks good. [1] https://gitlab.com/gitlab-org/git/-/jobs/5339132966 Diff against the tip of v3: diff --git a/object.h b/object.h index b76830fce1..0cb8c02b95 100644 --- a/object.h +++ b/object.h @@ -62,7 +62,7 @@ void object_array_init(struct object_array *array); /* * object flag allocation: - * revision.h: 0---------10 15 22------28 + * revision.h: 0---------10 12 15 22------27 * fetch-pack.c: 01 67 * negotiator/default.c: 2--5 * walker.c: 0-2 @@ -82,7 +82,7 @@ void object_array_init(struct object_array *array); * builtin/show-branch.c: 0-------------------------------------------26 * builtin/unpack-objects.c: 2021 */ -#define FLAG_BITS 29 +#define FLAG_BITS 28 #define TYPE_BITS 3 diff --git a/revision.h b/revision.h index d3c1ca0f4e..d2de9d65e4 100644 --- a/revision.h +++ b/revision.h @@ -38,6 +38,9 @@ #define PATCHSAME (1u<<9) #define BOTTOM (1u<<10) +/* WARNING: This is also used as METAINFO_SHOWN in builtin/blame.c. */ +#define MISSING (1u<<12) + /* WARNING: This is also used as REACHABLE in commit-graph.c. */ #define PULL_MERGE (1u<<15) @@ -53,7 +56,6 @@ #define ANCESTRY_PATH (1u<<27) #define ALL_REV_FLAGS (((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR | PULL_MERGE) -#define MISSING (1u<<28) #define DECORATE_SHORT_REFS 1 #define DECORATE_FULL_REFS 2
Junio C Hamano <gitster@pobox.com> writes: > Rather, I was wondering if we need to use object flags to mark these > objects, or can do what we want to do without using any object flags > at all. For the purpose of reporting "missing" objects, wouldn't it > be sufficient to walk the object graph and report our findings as we > go? To avoid reporting the same object twice, as we reasonably can > expect that the missing objects are minority (compared to the total > number of objects), perhaps the codepath that makes such a report > can use a hashmap of object_ids or something, for example. Digging from the bottom, * builtin/rev-list.c:show_commit() gets "struct rev_list_info *" that has "struct rev_info *" [*]. * list-objects.c:do_traverse() calls revision.c:get_revision() to obtain commits, some of which may be missing ones, and things behind get_revision() are responsible for marking the commit as missing. It has "struct traversal_context *", among whose members is the "revs" member that is the "struct rev_info *". * revision.c:get_revision() and machinery behind it ultimately discovers a missing commit in the revision.c:process_parents() that loops over the parents commit_list. It of course has access to "struct rev_info *". So, presumably, if we add a new member to "struct rev_info" that optionally [*] points at an oidset that records the object names of missing objects we discovered so far (i.e., the set of missing objects), the location we set the MISSING bit of a commit can instead add the object name of the commit to the set. And we can export a function that takes "struct rev_info *" and "struct object *" (or "struct object_id *") to check for membership in the "set of missing objects", which would be used where we checked the MISSING bit of a commit. I do not know the performance implications of going this route, but if we do not find a suitable vacant bit, we do not have to use any object flags bit to do this, if we go this route, I would think. I may be missing some details that breaks the above outline, though. [Footnotes] * A potential #leftoverbits tangent. Why is "rev_list_info" structure declared in <bisect.h>? I suspect that this is a fallout from recent header file shuffling, but given who uses it (among which is rev-list:show_commit() that has very little to do with bisection and uses the information in rev_list_info when doing its normal non-bisect things), it does not make much sense. * When .do_not_die_on_missing_objects is false, it can and should be left NULL, but presumably we use the "do not die" bit even when we are not necessarily collecting the missing objects? So the new member cannot replace the "do not die" bit completely.
Karthik Nayak <karthik.188@gmail.com> writes: > Was trying to use bit number 12, which coincides METAINFO_SHOWN in > builtin/blame.c. > From skimming over the code, METAINFO_SHOWN is used only within > blame.c and there > should not be collisions here since blame.c doesn't set the > do_not_die_on_missing_objects bit > either. Thanks for researching. It sounds like it may be a better bit to steal than the one used by the commit-graph, as long as there is no reason to expect that blame may want to work in a corrupt repository with missing objects, but when it happens, we may regret the decision we are making here. Thanks.
On Fri, Oct 20, 2023 at 6:41 PM Junio C Hamano <gitster@pobox.com> wrote: > > Junio C Hamano <gitster@pobox.com> writes: > > > Rather, I was wondering if we need to use object flags to mark these > > objects, or can do what we want to do without using any object flags > > at all. For the purpose of reporting "missing" objects, wouldn't it > > be sufficient to walk the object graph and report our findings as we > > go? To avoid reporting the same object twice, as we reasonably can > > expect that the missing objects are minority (compared to the total > > number of objects), perhaps the codepath that makes such a report > > can use a hashmap of object_ids or something, for example. > > Digging from the bottom, > > * builtin/rev-list.c:show_commit() gets "struct rev_list_info *" > that has "struct rev_info *" [*]. > > * list-objects.c:do_traverse() calls revision.c:get_revision() to > obtain commits, some of which may be missing ones, and things > behind get_revision() are responsible for marking the commit as > missing. It has "struct traversal_context *", among whose > members is the "revs" member that is the "struct rev_info *". > > * revision.c:get_revision() and machinery behind it ultimately > discovers a missing commit in the revision.c:process_parents() > that loops over the parents commit_list. It of course has access > to "struct rev_info *". > > So, presumably, if we add a new member to "struct rev_info" that > optionally [*] points at an oidset that records the object names of > missing objects we discovered so far (i.e., the set of missing > objects), the location we set the MISSING bit of a commit can > instead add the object name of the commit to the set. And we can > export a function that takes "struct rev_info *" and "struct object > *" (or "struct object_id *") to check for membership in the "set of > missing objects", which would be used where we checked the MISSING > bit of a commit. > > I do not know the performance implications of going this route, but > if we do not find a suitable vacant bit, we do not have to use any > object flags bit to do this, if we go this route, I would think. I > may be missing some details that breaks the above outline, though. > > > [Footnotes] > > * A potential #leftoverbits tangent. > > Why is "rev_list_info" structure declared in <bisect.h>? I > suspect that this is a fallout from recent header file shuffling, > but given who uses it (among which is rev-list:show_commit() that > has very little to do with bisection and uses the information in > rev_list_info when doing its normal non-bisect things), it does > not make much sense. > > * When .do_not_die_on_missing_objects is false, it can and should > be left NULL, but presumably we use the "do not die" bit even > when we are not necessarily collecting the missing objects? So > the new member cannot replace the "do not die" bit completely. Thanks for the suggestion, this does seem like a good way to go ahead without using flags. The only performance issue being if there are too many commits which are missing, then oidset would be large. But I think that's okay though. > Thanks for researching. It sounds like it may be a better bit to > steal than the one used by the commit-graph, as long as there is no > reason to expect that blame may want to work in a corrupt repository > with missing objects, but when it happens, we may regret the > decision we are making here. > I don't see blame working with missing commits though, because it relies on parsing commits to get information to show to the user. So I think it's a safe bit to steal. Also, when the time comes we could always release the bit and move to the solution you mentioned above. Anyways on the whole I think keeping it future compatible makes a lot more sense. I'll send a patch series to implement an oidset instead of flags soon. - Karthik
diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 98542e8b3c..604ae01d0c 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data) display_progress(progress, ++progress_counter); + if (revs->do_not_die_on_missing_objects && + commit->object.flags & MISSING) { + finish_object__ma(&commit->object); + return; + } + if (show_disk_usage) total_disk_usage += get_object_disk_usage(&commit->object); diff --git a/list-objects.c b/list-objects.c index 47296dff2f..60c5533721 100644 --- a/list-objects.c +++ b/list-objects.c @@ -387,7 +387,7 @@ static void do_traverse(struct traversal_context *ctx) * an uninteresting boundary commit may not have its tree * parsed yet, but we are not going to show them anyway */ - if (!ctx->revs->tree_objects) + if (!ctx->revs->tree_objects || commit->object.flags & MISSING) ; /* do not bother loading tree */ else if (repo_get_commit_tree(the_repository, commit)) { struct tree *tree = repo_get_commit_tree(the_repository, diff --git a/object.h b/object.h index 114d45954d..b76830fce1 100644 --- a/object.h +++ b/object.h @@ -62,7 +62,7 @@ void object_array_init(struct object_array *array); /* * object flag allocation: - * revision.h: 0---------10 15 23------27 + * revision.h: 0---------10 15 22------28 * fetch-pack.c: 01 67 * negotiator/default.c: 2--5 * walker.c: 0-2 @@ -82,7 +82,7 @@ void object_array_init(struct object_array *array); * builtin/show-branch.c: 0-------------------------------------------26 * builtin/unpack-objects.c: 2021 */ -#define FLAG_BITS 28 +#define FLAG_BITS 29 #define TYPE_BITS 3 diff --git a/revision.c b/revision.c index 219dc76716..8100806068 100644 --- a/revision.c +++ b/revision.c @@ -1110,7 +1110,7 @@ static int process_parents(struct rev_info *revs, struct commit *commit, struct commit_list *parent = commit->parents; unsigned pass_flags; - if (commit->object.flags & ADDED) + if (commit->object.flags & (ADDED | MISSING)) return 0; commit->object.flags |= ADDED; @@ -1168,7 +1168,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit, for (parent = commit->parents; parent; parent = parent->next) { struct commit *p = parent->item; int gently = revs->ignore_missing_links || - revs->exclude_promisor_objects; + revs->exclude_promisor_objects || + revs->do_not_die_on_missing_objects; if (repo_parse_commit_gently(revs->repo, p, gently) < 0) { if (revs->exclude_promisor_objects && is_promisor_object(&p->object.oid)) { @@ -1176,7 +1177,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit, break; continue; } - return -1; + + if (!revs->do_not_die_on_missing_objects) + return -1; + else + p->object.flags |= MISSING; } if (revs->sources) { char **slot = revision_sources_at(revs->sources, p); diff --git a/revision.h b/revision.h index c73c92ef40..d3c1ca0f4e 100644 --- a/revision.h +++ b/revision.h @@ -53,6 +53,8 @@ #define ANCESTRY_PATH (1u<<27) #define ALL_REV_FLAGS (((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR | PULL_MERGE) +#define MISSING (1u<<28) + #define DECORATE_SHORT_REFS 1 #define DECORATE_FULL_REFS 2 diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh new file mode 100755 index 0000000000..40265a4f66 --- /dev/null +++ b/t/t6022-rev-list-missing.sh @@ -0,0 +1,74 @@ +#!/bin/sh + +test_description='handling of missing objects in rev-list' + +TEST_PASSES_SANITIZE_LEAK=true +. ./test-lib.sh + +# We setup the repository with two commits, this way HEAD is always +# available and we can hide commit 1. +test_expect_success 'create repository and alternate directory' ' + test_commit 1 && + test_commit 2 && + test_commit 3 +' + +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t" +do + test_expect_success "rev-list --missing=error fails with missing object $obj" ' + oid="$(git rev-parse $obj)" && + path=".git/objects/$(test_oid_to_path $oid)" && + + mv "$path" "$path.hidden" && + test_when_finished "mv $path.hidden $path" && + + test_must_fail git rev-list --missing=error --objects \ + --no-object-names HEAD + ' +done + +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t" +do + for action in "allow-any" "print" + do + test_expect_success "rev-list --missing=$action with missing $obj" ' + oid="$(git rev-parse $obj)" && + path=".git/objects/$(test_oid_to_path $oid)" && + + # Before the object is made missing, we use rev-list to + # get the expected oids. + git rev-list --objects --no-object-names \ + HEAD ^$obj >expect.raw && + + # Blobs are shared by all commits, so evethough a commit/tree + # might be skipped, its blob must be accounted for. + if [ $obj != "HEAD:1.t" ]; then + echo $(git rev-parse HEAD:1.t) >>expect.raw && + echo $(git rev-parse HEAD:2.t) >>expect.raw + fi && + + mv "$path" "$path.hidden" && + test_when_finished "mv $path.hidden $path" && + + git rev-list --missing=$action --objects --no-object-names \ + HEAD >actual.raw && + + # When the action is to print, we should also add the missing + # oid to the expect list. + case $action in + allow-any) + ;; + print) + grep ?$oid actual.raw && + echo ?$oid >>expect.raw + ;; + esac && + + sort actual.raw >actual && + sort expect.raw >expect && + test_cmp expect actual + ' + done +done + +test_done
The `--missing` object option in rev-list currently works only with missing blobs/trees. For missing commits the revision walker fails with a fatal error. Let's extend the functionality of `--missing` option to also support commit objects. This is done by adding a new `MISSING` flag that the revision walker sets whenever it encounters a missing tree/commit. The revision walker will now continue the traversal and call `show_commit()` even for missing commits. In rev-list we can then check for this flag and call the existing code for parsing `--missing` objects. A scenario where this option would be used is to find the boundary objects between different object directories. Consider a repository with a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a repository, using the `--missing=print` option while disabling the alternate object directory allows us to find the boundary objects between the main and alternate object directory. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> --- builtin/rev-list.c | 6 +++ list-objects.c | 2 +- object.h | 4 +- revision.c | 11 ++++-- revision.h | 2 + t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++ 6 files changed, 93 insertions(+), 6 deletions(-) create mode 100755 t/t6022-rev-list-missing.sh