diff mbox series

[v3,3/3] rev-list: add commit object support in `--missing` option

Message ID 20231019121024.194317-4-karthik.188@gmail.com (mailing list archive)
State Superseded
Headers show
Series rev-list: add support for commits in `--missing` | expand

Commit Message

Karthik Nayak Oct. 19, 2023, 12:10 p.m. UTC
The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a new `MISSING` flag that the
revision walker sets whenever it encounters a missing tree/commit. The
revision walker will now continue the traversal and call `show_commit()`
even for missing commits. In rev-list we can then check for this flag
and call the existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 list-objects.c              |  2 +-
 object.h                    |  4 +-
 revision.c                  | 11 ++++--
 revision.h                  |  2 +
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 6 files changed, 93 insertions(+), 6 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

Comments

Junio C Hamano Oct. 19, 2023, 10:05 p.m. UTC | #1
Karthik Nayak <karthik.188@gmail.com> writes:

> diff --git a/object.h b/object.h
> index 114d45954d..b76830fce1 100644
> --- a/object.h
> +++ b/object.h
> @@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);
>  
>  /*
>   * object flag allocation:
> - * revision.h:               0---------10         15             23------27
> + * revision.h:               0---------10         15             22------28
>   * fetch-pack.c:             01    67
>   * negotiator/default.c:       2--5
>   * walker.c:                 0-2
> @@ -82,7 +82,7 @@ void object_array_init(struct object_array *array);
>   * builtin/show-branch.c:    0-------------------------------------------26
>   * builtin/unpack-objects.c:                                 2021
>   */
> -#define FLAG_BITS  28
> +#define FLAG_BITS  29
>  
>  #define TYPE_BITS 3

I am afraid that this is not a good direction to go, given that the
way FLAG_BITS is used is like this:

    /*
     * The object type is stored in 3 bits.
     */
    struct object {
            unsigned parsed : 1;
            unsigned type : TYPE_BITS;
            unsigned flags : FLAG_BITS;
            struct object_id oid;
    };

28 was there, not as a random number of bits we happen to be using.
It was derived by (32 - 3 - 1), i.e. ensure the bitfields above are
stored within a single word.

sizeof(struct object) is 40 bytes on x86-64, with offsetof(oid)
being 4 (i.e. the bitfields fit in a single 4-byte word).  If we
make FLAG_BITS 29, we will add 4 bytes to the structure and waste
31-bit per each and every in-core objects.

Do we really need to allocate a new bit in the object flags, which
is already a scarce resource?
Junio C Hamano Oct. 19, 2023, 11:35 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Do we really need to allocate a new bit in the object flags, which
> is already a scarce resource?

Clarification.  I was *not* wondering if we can steal and (re|ab)use
a bit that is used for other purposes, in order to avoid allocating
a new bit.

Rather, I was wondering if we need to use object flags to mark these
objects, or can do what we want to do without using any object flags
at all.  For the purpose of reporting "missing" objects, wouldn't it
be sufficient to walk the object graph and report our findings as we
go?  To avoid reporting the same object twice, as we reasonably can
expect that the missing objects are minority (compared to the total
number of objects), perhaps the codepath that makes such a report
can use a hashmap of object_ids or something, for example.

Thanks.
Karthik Nayak Oct. 20, 2023, 11:14 a.m. UTC | #3
On Fri, Oct 20, 2023 at 1:35 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Do we really need to allocate a new bit in the object flags, which
> > is already a scarce resource?
>
> Clarification.  I was *not* wondering if we can steal and (re|ab)use
> a bit that is used for other purposes, in order to avoid allocating
> a new bit.
>
> Rather, I was wondering if we need to use object flags to mark these
> objects, or can do what we want to do without using any object flags
> at all.  For the purpose of reporting "missing" objects, wouldn't it
> be sufficient to walk the object graph and report our findings as we
> go?  To avoid reporting the same object twice, as we reasonably can
> expect that the missing objects are minority (compared to the total
> number of objects), perhaps the codepath that makes such a report
> can use a hashmap of object_ids or something, for example.
>

This is kind of hard to do since the revision walking happens in revision.c
and list-objects.c but the missing commits are needed in builtin/rev-list.c.
So even if we build a list of missing commits in list-objects.c, there is no
nice way to send this back to rev-list.c.

The only way we communicate between these layers is through callbacks,
i.e. the show_commit() function in rev-list.c is called for every commit that
the revision walk traverses over.

I'm not entirely sure what the best path to take here to be honest. I will look
and see if there are any bits where overlapping doesn't cause any issues
in the meantime.
Karthik Nayak Oct. 20, 2023, 2:47 p.m. UTC | #4
On Fri, Oct 20, 2023 at 1:14 PM Karthik Nayak <karthik.188@gmail.com> wrote:
> I'm not entirely sure what the best path to take here to be honest. I will look
> and see if there are any bits where overlapping doesn't cause any issues
> in the meantime.

Was trying to use bit number 12, which coincides METAINFO_SHOWN in
builtin/blame.c.
From skimming over the code, METAINFO_SHOWN is used only within
blame.c and there
should not be collisions here since blame.c doesn't set the
do_not_die_on_missing_objects bit
either.

The tests seem to pass here too [1]. This could be the potential
solution here, Junio, what do you think?
I could send another version if this approach looks good.

[1] https://gitlab.com/gitlab-org/git/-/jobs/5339132966

Diff against the tip of v3:

diff --git a/object.h b/object.h
index b76830fce1..0cb8c02b95 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);

 /*
  * object flag allocation:
- * revision.h:               0---------10         15             22------28
+ * revision.h:               0---------10  12    15             22------27
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -82,7 +82,7 @@ void object_array_init(struct object_array *array);
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  29
+#define FLAG_BITS  28

 #define TYPE_BITS 3

diff --git a/revision.h b/revision.h
index d3c1ca0f4e..d2de9d65e4 100644
--- a/revision.h
+++ b/revision.h
@@ -38,6 +38,9 @@
 #define PATCHSAME    (1u<<9)
 #define BOTTOM        (1u<<10)

+/* WARNING: This is also used as METAINFO_SHOWN in builtin/blame.c. */
+#define MISSING     (1u<<12)
+
 /* WARNING: This is also used as REACHABLE in commit-graph.c. */
 #define PULL_MERGE    (1u<<15)

@@ -53,7 +56,6 @@
 #define ANCESTRY_PATH    (1u<<27)
 #define ALL_REV_FLAGS    (((1u<<11)-1) | NOT_USER_GIVEN |
TRACK_LINEAR | PULL_MERGE)

-#define MISSING (1u<<28)

 #define DECORATE_SHORT_REFS    1
 #define DECORATE_FULL_REFS    2
Junio C Hamano Oct. 20, 2023, 4:41 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> Rather, I was wondering if we need to use object flags to mark these
> objects, or can do what we want to do without using any object flags
> at all.  For the purpose of reporting "missing" objects, wouldn't it
> be sufficient to walk the object graph and report our findings as we
> go?  To avoid reporting the same object twice, as we reasonably can
> expect that the missing objects are minority (compared to the total
> number of objects), perhaps the codepath that makes such a report
> can use a hashmap of object_ids or something, for example.

Digging from the bottom,

 * builtin/rev-list.c:show_commit() gets "struct rev_list_info *"
   that has "struct rev_info *" [*].

 * list-objects.c:do_traverse() calls revision.c:get_revision() to
   obtain commits, some of which may be missing ones, and things
   behind get_revision() are responsible for marking the commit as
   missing.  It has "struct traversal_context *", among whose
   members is the "revs" member that is the "struct rev_info *".

 * revision.c:get_revision() and machinery behind it ultimately
   discovers a missing commit in the revision.c:process_parents()
   that loops over the parents commit_list.  It of course has access
   to "struct rev_info *".

So, presumably, if we add a new member to "struct rev_info" that
optionally [*] points at an oidset that records the object names of
missing objects we discovered so far (i.e., the set of missing
objects), the location we set the MISSING bit of a commit can
instead add the object name of the commit to the set.  And we can
export a function that takes "struct rev_info *" and "struct object
*" (or "struct object_id *") to check for membership in the "set of
missing objects", which would be used where we checked the MISSING
bit of a commit.

I do not know the performance implications of going this route, but
if we do not find a suitable vacant bit, we do not have to use any
object flags bit to do this, if we go this route, I would think.  I
may be missing some details that breaks the above outline, though.


[Footnotes]

 * A potential #leftoverbits tangent.

   Why is "rev_list_info" structure declared in <bisect.h>?  I
   suspect that this is a fallout from recent header file shuffling,
   but given who uses it (among which is rev-list:show_commit() that
   has very little to do with bisection and uses the information in
   rev_list_info when doing its normal non-bisect things), it does
   not make much sense.

 * When .do_not_die_on_missing_objects is false, it can and should
   be left NULL, but presumably we use the "do not die" bit even
   when we are not necessarily collecting the missing objects?  So
   the new member cannot replace the "do not die" bit completely.
Junio C Hamano Oct. 20, 2023, 5:45 p.m. UTC | #6
Karthik Nayak <karthik.188@gmail.com> writes:

> Was trying to use bit number 12, which coincides METAINFO_SHOWN in
> builtin/blame.c.
> From skimming over the code, METAINFO_SHOWN is used only within
> blame.c and there
> should not be collisions here since blame.c doesn't set the
> do_not_die_on_missing_objects bit
> either.

Thanks for researching.  It sounds like it may be a better bit to
steal than the one used by the commit-graph, as long as there is no
reason to expect that blame may want to work in a corrupt repository
with missing objects, but when it happens, we may regret the
decision we are making here.

Thanks.
Karthik Nayak Oct. 24, 2023, 11:34 a.m. UTC | #7
On Fri, Oct 20, 2023 at 6:41 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Rather, I was wondering if we need to use object flags to mark these
> > objects, or can do what we want to do without using any object flags
> > at all.  For the purpose of reporting "missing" objects, wouldn't it
> > be sufficient to walk the object graph and report our findings as we
> > go?  To avoid reporting the same object twice, as we reasonably can
> > expect that the missing objects are minority (compared to the total
> > number of objects), perhaps the codepath that makes such a report
> > can use a hashmap of object_ids or something, for example.
>
> Digging from the bottom,
>
>  * builtin/rev-list.c:show_commit() gets "struct rev_list_info *"
>    that has "struct rev_info *" [*].
>
>  * list-objects.c:do_traverse() calls revision.c:get_revision() to
>    obtain commits, some of which may be missing ones, and things
>    behind get_revision() are responsible for marking the commit as
>    missing.  It has "struct traversal_context *", among whose
>    members is the "revs" member that is the "struct rev_info *".
>
>  * revision.c:get_revision() and machinery behind it ultimately
>    discovers a missing commit in the revision.c:process_parents()
>    that loops over the parents commit_list.  It of course has access
>    to "struct rev_info *".
>
> So, presumably, if we add a new member to "struct rev_info" that
> optionally [*] points at an oidset that records the object names of
> missing objects we discovered so far (i.e., the set of missing
> objects), the location we set the MISSING bit of a commit can
> instead add the object name of the commit to the set.  And we can
> export a function that takes "struct rev_info *" and "struct object
> *" (or "struct object_id *") to check for membership in the "set of
> missing objects", which would be used where we checked the MISSING
> bit of a commit.
>
> I do not know the performance implications of going this route, but
> if we do not find a suitable vacant bit, we do not have to use any
> object flags bit to do this, if we go this route, I would think.  I
> may be missing some details that breaks the above outline, though.
>
>
> [Footnotes]
>
>  * A potential #leftoverbits tangent.
>
>    Why is "rev_list_info" structure declared in <bisect.h>?  I
>    suspect that this is a fallout from recent header file shuffling,
>    but given who uses it (among which is rev-list:show_commit() that
>    has very little to do with bisection and uses the information in
>    rev_list_info when doing its normal non-bisect things), it does
>    not make much sense.
>
>  * When .do_not_die_on_missing_objects is false, it can and should
>    be left NULL, but presumably we use the "do not die" bit even
>    when we are not necessarily collecting the missing objects?  So
>    the new member cannot replace the "do not die" bit completely.

Thanks for the suggestion, this does seem like a good way to go ahead without
using flags. The only performance issue being if there are too many commits
which are missing, then oidset would be large.

But I think that's okay though.

> Thanks for researching.  It sounds like it may be a better bit to
> steal than the one used by the commit-graph, as long as there is no
> reason to expect that blame may want to work in a corrupt repository
> with missing objects, but when it happens, we may regret the
> decision we are making here.
>

I don't see blame working with missing commits though, because it relies on
parsing commits to get information to show to the user. So I think it's a safe
bit to steal. Also, when the time comes we could always release the bit and
move to the solution you mentioned above.

Anyways on the whole I think keeping it future compatible makes a lot
more sense.
I'll send a patch series to implement an oidset instead of flags soon.

- Karthik
diff mbox series

Patch

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..604ae01d0c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@  static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    commit->object.flags & MISSING) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/list-objects.c b/list-objects.c
index 47296dff2f..60c5533721 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -387,7 +387,7 @@  static void do_traverse(struct traversal_context *ctx)
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
 		 */
-		if (!ctx->revs->tree_objects)
+		if (!ctx->revs->tree_objects || commit->object.flags & MISSING)
 			; /* do not bother loading tree */
 		else if (repo_get_commit_tree(the_repository, commit)) {
 			struct tree *tree = repo_get_commit_tree(the_repository,
diff --git a/object.h b/object.h
index 114d45954d..b76830fce1 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@  void object_array_init(struct object_array *array);
 
 /*
  * object flag allocation:
- * revision.h:               0---------10         15             23------27
+ * revision.h:               0---------10         15             22------28
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -82,7 +82,7 @@  void object_array_init(struct object_array *array);
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  28
+#define FLAG_BITS  29
 
 #define TYPE_BITS 3
 
diff --git a/revision.c b/revision.c
index 219dc76716..8100806068 100644
--- a/revision.c
+++ b/revision.c
@@ -1110,7 +1110,7 @@  static int process_parents(struct rev_info *revs, struct commit *commit,
 	struct commit_list *parent = commit->parents;
 	unsigned pass_flags;
 
-	if (commit->object.flags & ADDED)
+	if (commit->object.flags & (ADDED | MISSING))
 		return 0;
 	commit->object.flags |= ADDED;
 
@@ -1168,7 +1168,8 @@  static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1177,11 @@  static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (!revs->do_not_die_on_missing_objects)
+				return -1;
+			else
+				p->object.flags |= MISSING;
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
diff --git a/revision.h b/revision.h
index c73c92ef40..d3c1ca0f4e 100644
--- a/revision.h
+++ b/revision.h
@@ -53,6 +53,8 @@ 
 #define ANCESTRY_PATH	(1u<<27)
 #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR | PULL_MERGE)
 
+#define MISSING (1u<<28)
+
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
 
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..40265a4f66
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@ 
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2 &&
+	test_commit 3
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Blobs are shared by all commits, so evethough a commit/tree
+			# might be skipped, its blob must be accounted for.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
+				echo $(git rev-parse HEAD:2.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+test_done