diff mbox series

[v2,2/2] commit: detect commits that exist in commit-graph but not in the ODB

Message ID 0476d4855562b677ced106a4cc7788b46434cf21.1698060036.git.ps@pks.im (mailing list archive)
State Superseded
Commit 7fcce523ca47b7ada7d025ce047bd74b0bc9dc64
Headers show
Series commit-graph: detect commits missing in ODB | expand

Commit Message

Patrick Steinhardt Oct. 23, 2023, 11:27 a.m. UTC
Commit graphs can become stale and contain references to commits that do
not exist in the object database anymore. Theoretically, this can lead
to a scenario where we are able to successfully look up any such commit
via the commit graph even though such a lookup would fail if done via
the object database directly.

As the commit graph is mostly intended as a sort of cache to speed up
parsing of commits we do not want to have diverging behaviour in a
repository with and a repository without commit graphs, no matter
whether they are stale or not. As commits are otherwise immutable, the
only thing that we really need to care about is thus the presence or
absence of a commit.

To address potentially stale commit data that may exist in the graph,
our `lookup_commit_in_graph()` function will check for the commit's
existence in both the commit graph, but also in the object database. So
even if we were able to look up the commit's data in the graph, we would
still pretend as if the commit didn't exist if it is missing in the
object database.

We don't have the same safety net in `parse_commit_in_graph_one()`
though. This function is mostly used internally in "commit-graph.c"
itself to validate the commit graph, and this usage is fine. We do
expose its functionality via `parse_commit_in_graph()` though, which
gets called by `repo_parse_commit_internal()`, and that function is in
turn used in many places in our codebase.

For all I can see this function is never used to directly turn an object
ID into a commit object without additional safety checks before or after
this lookup. What it is being used for though is to walk history via the
parent chain of commits. So when commits in the parent chain of a graph
walk are missing it is possible that we wouldn't notice if that missing
commit was part of the commit graph. Thus, a query like `git rev-parse
HEAD~2` can succeed even if the intermittent commit is missing.

It's unclear whether there are additional ways in which such stale
commit graphs can lead to problems. In any case, it feels like this is a
bigger bug waiting to happen when we gain additional direct or indirect
callers of `repo_parse_commit_internal()`. So let's fix the inconsistent
behaviour by checking for object existence via the object database, as
well.

This check of course comes with a performance penalty. The following
benchmarks have been executed in a clone of linux.git with stable tags
added:

    Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master)
      Time (mean ± σ):      2.913 s ±  0.018 s    [User: 2.363 s, System: 0.548 s]
      Range (min … max):    2.894 s …  2.950 s    10 runs

    Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
      Time (mean ± σ):      3.834 s ±  0.052 s    [User: 3.276 s, System: 0.556 s]
      Range (min … max):    3.780 s …  3.961 s    10 runs

    Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master)
      Time (mean ± σ):     13.841 s ±  0.084 s    [User: 13.152 s, System: 0.687 s]
      Range (min … max):   13.714 s … 13.995 s    10 runs

    Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
      Time (mean ± σ):     13.762 s ±  0.116 s    [User: 13.094 s, System: 0.667 s]
      Range (min … max):   13.645 s … 14.038 s    10 runs

    Summary
      git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran
        1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
        4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
        4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master)

We look at a ~30% regression in general, but in general we're still a
whole lot faster than without the commit graph. To counteract this, the
new check can be turned off with the `GIT_COMMIT_GRAPH_PARANOIA` envvar.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 commit.c                | 16 +++++++++++++++-
 t/t5318-commit-graph.sh | 27 +++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Oct. 24, 2023, 7:10 p.m. UTC | #1
Patrick Steinhardt <ps@pks.im> writes:

> Commit graphs can become stale and contain references to commits that do
> not exist in the object database anymore. Theoretically, this can lead
> to a scenario where we are able to successfully look up any such commit
> via the commit graph even though such a lookup would fail if done via
> the object database directly.
>
> As the commit graph is mostly intended as a sort of cache to speed up
> parsing of commits we do not want to have diverging behaviour in a
> repository with and a repository without commit graphs, no matter
> whether they are stale or not. As commits are otherwise immutable, the
> only thing that we really need to care about is thus the presence or
> absence of a commit.
>
> To address potentially stale commit data that may exist in the graph,
> our `lookup_commit_in_graph()` function will check for the commit's
> existence in both the commit graph, but also in the object database. So
> even if we were able to look up the commit's data in the graph, we would
> still pretend as if the commit didn't exist if it is missing in the
> object database.
>
> We don't have the same safety net in `parse_commit_in_graph_one()`
> though. This function is mostly used internally in "commit-graph.c"
> itself to validate the commit graph, and this usage is fine. We do
> expose its functionality via `parse_commit_in_graph()` though, which
> gets called by `repo_parse_commit_internal()`, and that function is in
> turn used in many places in our codebase.
>
> For all I can see this function is never used to directly turn an object
> ID into a commit object without additional safety checks before or after
> this lookup. What it is being used for though is to walk history via the
> parent chain of commits. So when commits in the parent chain of a graph
> walk are missing it is possible that we wouldn't notice if that missing
> commit was part of the commit graph. Thus, a query like `git rev-parse
> HEAD~2` can succeed even if the intermittent commit is missing.
>
> It's unclear whether there are additional ways in which such stale
> commit graphs can lead to problems. In any case, it feels like this is a
> bigger bug waiting to happen when we gain additional direct or indirect
> callers of `repo_parse_commit_internal()`. So let's fix the inconsistent
> behaviour by checking for object existence via the object database, as
> well.
>
> This check of course comes with a performance penalty. The following
> benchmarks have been executed in a clone of linux.git with stable tags
> added:
>
>     Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master)
>       Time (mean ± σ):      2.913 s ±  0.018 s    [User: 2.363 s, System: 0.548 s]
>       Range (min … max):    2.894 s …  2.950 s    10 runs
>
>     Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>       Time (mean ± σ):      3.834 s ±  0.052 s    [User: 3.276 s, System: 0.556 s]
>       Range (min … max):    3.780 s …  3.961 s    10 runs
>
>     Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master)
>       Time (mean ± σ):     13.841 s ±  0.084 s    [User: 13.152 s, System: 0.687 s]
>       Range (min … max):   13.714 s … 13.995 s    10 runs
>
>     Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>       Time (mean ± σ):     13.762 s ±  0.116 s    [User: 13.094 s, System: 0.667 s]
>       Range (min … max):   13.645 s … 14.038 s    10 runs
>
>     Summary
>       git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran
>         1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>         4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
>         4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master)
>
> We look at a ~30% regression in general, but in general we're still a
> whole lot faster than without the commit graph. To counteract this, the
> new check can be turned off with the `GIT_COMMIT_GRAPH_PARANOIA` envvar.

Very nicely described.  Will queue.  I'll go offline for the rest of
the week but if there are no significant issues discovered by the
time I come back, let's declare a victory and merge these two
patches down to 'next'.

Thanks.

>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  commit.c                | 16 +++++++++++++++-
>  t/t5318-commit-graph.sh | 27 +++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index b3223478bc..7399e90212 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -28,6 +28,7 @@
>  #include "shallow.h"
>  #include "tree.h"
>  #include "hook.h"
> +#include "parse.h"
>  
>  static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **);
>  
> @@ -572,8 +573,21 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		static int object_paranoia = -1;
> +
> +		if (object_paranoia == -1)
> +			object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);
> +
> +		if (object_paranoia && !has_object(r, &item->object.oid, 0)) {
> +			unparse_commit(r, &item->object.oid);
> +			return quiet_on_missing ? -1 :
> +				error(_("commit %s exists in commit-graph but not in the object database"),
> +				      oid_to_hex(&item->object.oid));
> +		}
> +
>  		return 0;
> +	}
>  
>  	if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0)
>  		return quiet_on_missing ? -1 :
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index c0cc454538..79467d7926 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -842,4 +842,31 @@ test_expect_success 'stale commit cannot be parsed when given directly' '
>  	)
>  '
>  
> +test_expect_success 'stale commit cannot be parsed when traversing graph' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit A &&
> +		test_commit B &&
> +		test_commit C &&
> +		git commit-graph write --reachable &&
> +
> +		# Corrupt the repository by deleting the intermittent commit
> +		# object. Commands should notice that this object is absent and
> +		# thus that the repository is corrupt even if the commit graph
> +		# exists.
> +		oid=$(git rev-parse B) &&
> +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> +
> +		# Again, we should be able to parse the commit when not
> +		# being paranoid about commit graph staleness...
> +		GIT_COMMIT_GRAPH_PARANOIA=false git rev-parse HEAD~2 &&
> +		# ... but fail when we are paranoid.
> +		test_must_fail git rev-parse HEAD~2 2>error &&
> +		grep "error: commit $oid exists in commit-graph but not in the object database" error
> +	)
> +'
> +
>  test_done
Taylor Blau Oct. 30, 2023, 9:31 p.m. UTC | #2
On Mon, Oct 23, 2023 at 01:27:20PM +0200, Patrick Steinhardt wrote:
> @@ -572,8 +573,21 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		static int object_paranoia = -1;
> +
> +		if (object_paranoia == -1)
> +			object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);

The same note here about object_paranoia versus graph_paranoia, but
otherwise this patch looks good to me, modulo one typo below.

> @@ -842,4 +842,31 @@ test_expect_success 'stale commit cannot be parsed when given directly' '
>  	)
>  '
>
> +test_expect_success 'stale commit cannot be parsed when traversing graph' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit A &&
> +		test_commit B &&
> +		test_commit C &&
> +		git commit-graph write --reachable &&
> +
> +		# Corrupt the repository by deleting the intermittent commit

s/intermittent/intermediate

> +		# object. Commands should notice that this object is absent and
> +		# thus that the repository is corrupt even if the commit graph
> +		# exists.
> +		oid=$(git rev-parse B) &&
> +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> +
> +		# Again, we should be able to parse the commit when not
> +		# being paranoid about commit graph staleness...
> +		GIT_COMMIT_GRAPH_PARANOIA=false git rev-parse HEAD~2 &&
> +		# ... but fail when we are paranoid.
> +		test_must_fail git rev-parse HEAD~2 2>error &&
> +		grep "error: commit $oid exists in commit-graph but not in the object database" error
> +	)

Thanks,
Taylor
Taylor Blau Oct. 30, 2023, 9:32 p.m. UTC | #3
On Tue, Oct 24, 2023 at 12:10:13PM -0700, Junio C Hamano wrote:
> > We look at a ~30% regression in general, but in general we're still a
> > whole lot faster than without the commit graph. To counteract this, the
> > new check can be turned off with the `GIT_COMMIT_GRAPH_PARANOIA` envvar.
>
> Very nicely described.  Will queue.  I'll go offline for the rest of
> the week but if there are no significant issues discovered by the
> time I come back, let's declare a victory and merge these two
> patches down to 'next'.

I think we're close here. There are a couple of small comments that I
made throughout these two patches, but nothing major.

Thanks,
Taylor
Junio C Hamano Oct. 31, 2023, 12:21 a.m. UTC | #4
Taylor Blau <me@ttaylorr.com> writes:

> On Tue, Oct 24, 2023 at 12:10:13PM -0700, Junio C Hamano wrote:
>> > We look at a ~30% regression in general, but in general we're still a
>> > whole lot faster than without the commit graph. To counteract this, the
>> > new check can be turned off with the `GIT_COMMIT_GRAPH_PARANOIA` envvar.
>>
>> Very nicely described.  Will queue.  I'll go offline for the rest of
>> the week but if there are no significant issues discovered by the
>> time I come back, let's declare a victory and merge these two
>> patches down to 'next'.
>
> I think we're close here. There are a couple of small comments that I
> made throughout these two patches, but nothing major.

Thanks for a comment.
diff mbox series

Patch

diff --git a/commit.c b/commit.c
index b3223478bc..7399e90212 100644
--- a/commit.c
+++ b/commit.c
@@ -28,6 +28,7 @@ 
 #include "shallow.h"
 #include "tree.h"
 #include "hook.h"
+#include "parse.h"
 
 static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **);
 
@@ -572,8 +573,21 @@  int repo_parse_commit_internal(struct repository *r,
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(r, item))
+	if (use_commit_graph && parse_commit_in_graph(r, item)) {
+		static int object_paranoia = -1;
+
+		if (object_paranoia == -1)
+			object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);
+
+		if (object_paranoia && !has_object(r, &item->object.oid, 0)) {
+			unparse_commit(r, &item->object.oid);
+			return quiet_on_missing ? -1 :
+				error(_("commit %s exists in commit-graph but not in the object database"),
+				      oid_to_hex(&item->object.oid));
+		}
+
 		return 0;
+	}
 
 	if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0)
 		return quiet_on_missing ? -1 :
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index c0cc454538..79467d7926 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -842,4 +842,31 @@  test_expect_success 'stale commit cannot be parsed when given directly' '
 	)
 '
 
+test_expect_success 'stale commit cannot be parsed when traversing graph' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		test_commit A &&
+		test_commit B &&
+		test_commit C &&
+		git commit-graph write --reachable &&
+
+		# Corrupt the repository by deleting the intermittent commit
+		# object. Commands should notice that this object is absent and
+		# thus that the repository is corrupt even if the commit graph
+		# exists.
+		oid=$(git rev-parse B) &&
+		rm .git/objects/"$(test_oid_to_path "$oid")" &&
+
+		# Again, we should be able to parse the commit when not
+		# being paranoid about commit graph staleness...
+		GIT_COMMIT_GRAPH_PARANOIA=false git rev-parse HEAD~2 &&
+		# ... but fail when we are paranoid.
+		test_must_fail git rev-parse HEAD~2 2>error &&
+		grep "error: commit $oid exists in commit-graph but not in the object database" error
+	)
+'
+
 test_done