diff mbox series

[v2,1/2] commit-graph: introduce envvar to disable commit existence checks

Message ID a89c4355285bc0bb0ec339818e6fe907f9ffd30e.1698060036.git.ps@pks.im (mailing list archive)
State Superseded
Commit 756cff72edbd9e101c755b53d59ae8ad59c7a8a8
Headers show
Series commit-graph: detect commits missing in ODB | expand

Commit Message

Patrick Steinhardt Oct. 23, 2023, 11:27 a.m. UTC
Our `lookup_commit_in_graph()` helper tries to look up commits from the
commit graph and, if it doesn't exist there, falls back to parsing it
from the object database instead. This is intended to speed up the
lookup of any such commit that exists in the database. There is an edge
case though where the commit exists in the graph, but not in the object
database. To avoid returning such stale commits the helper function thus
double checks that any such commit parsed from the graph also exists in
the object database. This makes the function safe to use even when
commit graphs aren't updated regularly.

We're about to introduce the same pattern into other parts of our code
base though, namely `repo_parse_commit_internal()`. Here the extra
sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was
a newly introduced helper, and as such there was no performance hit by
adding this sanity check. If we added `repo_parse_commit_internal()`
with that sanity check right from the beginning as well, this would
probably never have been an issue to begin with. But by retrofitting it
with this sanity check now we do add a performance regression to
preexisting code, and thus there is a desire to avoid this or at least
give an escape hatch.

In practice, there is no inherent reason why either of those functions
should have the sanity check whereas the other one does not: either both
of them are able to detect this issue or none of them should be. This
also means that the default of whether we do the check should likely be
the same for both. To err on the side of caution, we thus rather want to
make `repo_parse_commit_internal()` stricter than to loosen the checks
that we already have in `lookup_commit_in_graph()`.

The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA
environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is
the default, we will double check that commits looked up in the commit
graph via `lookup_commit_in_graph()` also exist in the object database.
This same check will also be added in `repo_parse_commit_internal()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Documentation/git.txt   |  9 +++++++++
 commit-graph.c          |  6 +++++-
 commit-graph.h          |  6 ++++++
 t/t5318-commit-graph.sh | 21 +++++++++++++++++++++
 4 files changed, 41 insertions(+), 1 deletion(-)

Comments

Taylor Blau Oct. 30, 2023, 9:29 p.m. UTC | #1
On Mon, Oct 23, 2023 at 01:27:16PM +0200, Patrick Steinhardt wrote:
> Our `lookup_commit_in_graph()` helper tries to look up commits from the
> commit graph and, if it doesn't exist there, falls back to parsing it
> from the object database instead. This is intended to speed up the
> lookup of any such commit that exists in the database. There is an edge
> case though where the commit exists in the graph, but not in the object
> database. To avoid returning such stale commits the helper function thus
> double checks that any such commit parsed from the graph also exists in
> the object database. This makes the function safe to use even when
> commit graphs aren't updated regularly.
>
> We're about to introduce the same pattern into other parts of our code
> base though, namely `repo_parse_commit_internal()`. Here the extra
> sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was
> a newly introduced helper, and as such there was no performance hit by
> adding this sanity check. If we added `repo_parse_commit_internal()`
> with that sanity check right from the beginning as well, this would
> probably never have been an issue to begin with. But by retrofitting it
> with this sanity check now we do add a performance regression to
> preexisting code, and thus there is a desire to avoid this or at least
> give an escape hatch.
>
> In practice, there is no inherent reason why either of those functions
> should have the sanity check whereas the other one does not: either both
> of them are able to detect this issue or none of them should be. This
> also means that the default of whether we do the check should likely be
> the same for both. To err on the side of caution, we thus rather want to
> make `repo_parse_commit_internal()` stricter than to loosen the checks
> that we already have in `lookup_commit_in_graph()`.

All well reasoned. I think the most compelling reason is that we're
already doing this extra check in lookup_commit_in_graph(), and having
that be somewhat inconsistent with repo_parse_commit_internal() feels
error-prone to me.

> The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA
> environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is
> the default, we will double check that commits looked up in the commit
> graph via `lookup_commit_in_graph()` also exist in the object database.
> This same check will also be added in `repo_parse_commit_internal()`.

Sounds good.

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  Documentation/git.txt   |  9 +++++++++
>  commit-graph.c          |  6 +++++-
>  commit-graph.h          |  6 ++++++
>  t/t5318-commit-graph.sh | 21 +++++++++++++++++++++
>  4 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git.txt b/Documentation/git.txt
> index 11228956cd..22c2b537aa 100644
> --- a/Documentation/git.txt
> +++ b/Documentation/git.txt
> @@ -911,6 +911,15 @@ for full details.
>  	should not normally need to set this to `0`, but it may be
>  	useful when trying to salvage data from a corrupted repository.
>
> +`GIT_COMMIT_GRAPH_PARANOIA`::
> +	If this Boolean environment variable is set to false, ignore the
> +	case where commits exist in the commit graph but not in the
> +	object database. Normally, Git will check whether commits loaded
> +	from the commit graph exist in the object database to avoid
> +	issues with stale commit graphs, but this check comes with a
> +	performance penalty. The default is `1` (i.e., be paranoid about
> +	stale commits in the commit graph).
> +

The first two sentences seem to be flipped. Perhaps:

    When loading a commit object from the commit-graph, Git will perform
    an existence check on the object in the ODB before parsing it out of
    the commit-graph. The default is "true", which enables the
    aforementioned behavior. Setting this to "false" disables the
    existential check when parsing commits from a commit-graph.

>  `GIT_ALLOW_PROTOCOL`::
>  	If set to a colon-separated list of protocols, behave as if
>  	`protocol.allow` is set to `never`, and each of the listed
> diff --git a/commit-graph.c b/commit-graph.c
> index fd2f700b2e..12ec31902e 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -939,14 +939,18 @@ int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
>
>  struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
>  {
> +	static int object_paranoia = -1;
>  	struct commit *commit;
>  	uint32_t pos;
>
> +	if (object_paranoia == -1)
> +		object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);
> +

I don't think that this is a reroll-able issue, but calling this
variable object_paranoia to store a setting for *graph* paranoia feels
like a good itch to scratch. But obviously not a big deal ;-).

> @@ -821,4 +821,25 @@ test_expect_success 'overflow during generation version upgrade' '
>  	)
>  '
>
> +test_expect_success 'stale commit cannot be parsed when given directly' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +		test_commit A &&
> +		test_commit B &&
> +		git commit-graph write --reachable &&
> +
> +		oid=$(git rev-parse B) &&
> +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> +
> +		# Verify that it is possible to read the commit from the
> +		# commit graph when not being paranoid, ...
> +		GIT_COMMIT_GRAPH_PARANOIA=false git rev-list B &&
> +		# ... but parsing the commit when double checking that
> +		# it actually exists in the object database should fail.
> +		test_must_fail git rev-list -1 B

Would "cat-file -p" be more direct here than "rev-list -1"?

Thanks,
Taylor
Patrick Steinhardt Oct. 31, 2023, 6:19 a.m. UTC | #2
On Mon, Oct 30, 2023 at 05:29:33PM -0400, Taylor Blau wrote:
> On Mon, Oct 23, 2023 at 01:27:16PM +0200, Patrick Steinhardt wrote:
> > Our `lookup_commit_in_graph()` helper tries to look up commits from the
> > commit graph and, if it doesn't exist there, falls back to parsing it
> > from the object database instead. This is intended to speed up the
> > lookup of any such commit that exists in the database. There is an edge
> > case though where the commit exists in the graph, but not in the object
> > database. To avoid returning such stale commits the helper function thus
> > double checks that any such commit parsed from the graph also exists in
> > the object database. This makes the function safe to use even when
> > commit graphs aren't updated regularly.
> >
> > We're about to introduce the same pattern into other parts of our code
> > base though, namely `repo_parse_commit_internal()`. Here the extra
> > sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was
> > a newly introduced helper, and as such there was no performance hit by
> > adding this sanity check. If we added `repo_parse_commit_internal()`
> > with that sanity check right from the beginning as well, this would
> > probably never have been an issue to begin with. But by retrofitting it
> > with this sanity check now we do add a performance regression to
> > preexisting code, and thus there is a desire to avoid this or at least
> > give an escape hatch.
> >
> > In practice, there is no inherent reason why either of those functions
> > should have the sanity check whereas the other one does not: either both
> > of them are able to detect this issue or none of them should be. This
> > also means that the default of whether we do the check should likely be
> > the same for both. To err on the side of caution, we thus rather want to
> > make `repo_parse_commit_internal()` stricter than to loosen the checks
> > that we already have in `lookup_commit_in_graph()`.
> 
> All well reasoned. I think the most compelling reason is that we're
> already doing this extra check in lookup_commit_in_graph(), and having
> that be somewhat inconsistent with repo_parse_commit_internal() feels
> error-prone to me.
> 
> > The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA
> > environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is
> > the default, we will double check that commits looked up in the commit
> > graph via `lookup_commit_in_graph()` also exist in the object database.
> > This same check will also be added in `repo_parse_commit_internal()`.
> 
> Sounds good.
> 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  Documentation/git.txt   |  9 +++++++++
> >  commit-graph.c          |  6 +++++-
> >  commit-graph.h          |  6 ++++++
> >  t/t5318-commit-graph.sh | 21 +++++++++++++++++++++
> >  4 files changed, 41 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/git.txt b/Documentation/git.txt
> > index 11228956cd..22c2b537aa 100644
> > --- a/Documentation/git.txt
> > +++ b/Documentation/git.txt
> > @@ -911,6 +911,15 @@ for full details.
> >  	should not normally need to set this to `0`, but it may be
> >  	useful when trying to salvage data from a corrupted repository.
> >
> > +`GIT_COMMIT_GRAPH_PARANOIA`::
> > +	If this Boolean environment variable is set to false, ignore the
> > +	case where commits exist in the commit graph but not in the
> > +	object database. Normally, Git will check whether commits loaded
> > +	from the commit graph exist in the object database to avoid
> > +	issues with stale commit graphs, but this check comes with a
> > +	performance penalty. The default is `1` (i.e., be paranoid about
> > +	stale commits in the commit graph).
> > +
> 
> The first two sentences seem to be flipped. Perhaps:
> 
>     When loading a commit object from the commit-graph, Git will perform
>     an existence check on the object in the ODB before parsing it out of
>     the commit-graph. The default is "true", which enables the
>     aforementioned behavior. Setting this to "false" disables the
>     existential check when parsing commits from a commit-graph.

I was modelling this after the text we had in `GIT_REF_PARANOIA`, but I
like your version more indeed. I'll massage it a bit to mention _why_
one would want to disable this.

> >  `GIT_ALLOW_PROTOCOL`::
> >  	If set to a colon-separated list of protocols, behave as if
> >  	`protocol.allow` is set to `never`, and each of the listed
> > diff --git a/commit-graph.c b/commit-graph.c
> > index fd2f700b2e..12ec31902e 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -939,14 +939,18 @@ int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
> >
> >  struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
> >  {
> > +	static int object_paranoia = -1;
> >  	struct commit *commit;
> >  	uint32_t pos;
> >
> > +	if (object_paranoia == -1)
> > +		object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);
> > +
> 
> I don't think that this is a reroll-able issue, but calling this
> variable object_paranoia to store a setting for *graph* paranoia feels
> like a good itch to scratch. But obviously not a big deal ;-).

Ugh, yeah. I first had the envvar as "GIT_OBJECT_PARANOIA", but
discarded that name because I feared that it might become overloaded
with semi-related checks.

Will fix.

> > @@ -821,4 +821,25 @@ test_expect_success 'overflow during generation version upgrade' '
> >  	)
> >  '
> >
> > +test_expect_success 'stale commit cannot be parsed when given directly' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +		test_commit A &&
> > +		test_commit B &&
> > +		git commit-graph write --reachable &&
> > +
> > +		oid=$(git rev-parse B) &&
> > +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> > +
> > +		# Verify that it is possible to read the commit from the
> > +		# commit graph when not being paranoid, ...
> > +		GIT_COMMIT_GRAPH_PARANOIA=false git rev-list B &&
> > +		# ... but parsing the commit when double checking that
> > +		# it actually exists in the object database should fail.
> > +		test_must_fail git rev-list -1 B
> 
> Would "cat-file -p" be more direct here than "rev-list -1"?

No, because it doesn't use `lookup_commit_in_graph()`. I had to go
searching a bit to find something that exposes this inconsistency, and
git-rev-list(1) was the easiest one I found.

Patrick
diff mbox series

Patch

diff --git a/Documentation/git.txt b/Documentation/git.txt
index 11228956cd..22c2b537aa 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -911,6 +911,15 @@  for full details.
 	should not normally need to set this to `0`, but it may be
 	useful when trying to salvage data from a corrupted repository.
 
+`GIT_COMMIT_GRAPH_PARANOIA`::
+	If this Boolean environment variable is set to false, ignore the
+	case where commits exist in the commit graph but not in the
+	object database. Normally, Git will check whether commits loaded
+	from the commit graph exist in the object database to avoid
+	issues with stale commit graphs, but this check comes with a
+	performance penalty. The default is `1` (i.e., be paranoid about
+	stale commits in the commit graph).
+
 `GIT_ALLOW_PROTOCOL`::
 	If set to a colon-separated list of protocols, behave as if
 	`protocol.allow` is set to `never`, and each of the listed
diff --git a/commit-graph.c b/commit-graph.c
index fd2f700b2e..12ec31902e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -939,14 +939,18 @@  int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
 
 struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
 {
+	static int object_paranoia = -1;
 	struct commit *commit;
 	uint32_t pos;
 
+	if (object_paranoia == -1)
+		object_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 1);
+
 	if (!prepare_commit_graph(repo))
 		return NULL;
 	if (!search_commit_pos_in_graph(id, repo->objects->commit_graph, &pos))
 		return NULL;
-	if (!has_object(repo, id, 0))
+	if (object_paranoia && !has_object(repo, id, 0))
 		return NULL;
 
 	commit = lookup_commit(repo, id);
diff --git a/commit-graph.h b/commit-graph.h
index 20ada7e891..bd4289620c 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -8,6 +8,12 @@ 
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
 #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
+/*
+ * This environment variable controls whether commits looked up via the
+ * commit graph will be double checked to exist in the object database.
+ */
+#define GIT_COMMIT_GRAPH_PARANOIA "GIT_COMMIT_GRAPH_PARANOIA"
+
 /*
  * This method is only used to enhance coverage of the commit-graph
  * feature in the test suite with the GIT_TEST_COMMIT_GRAPH and
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index ba65f17dd9..c0cc454538 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -821,4 +821,25 @@  test_expect_success 'overflow during generation version upgrade' '
 	)
 '
 
+test_expect_success 'stale commit cannot be parsed when given directly' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+		test_commit A &&
+		test_commit B &&
+		git commit-graph write --reachable &&
+
+		oid=$(git rev-parse B) &&
+		rm .git/objects/"$(test_oid_to_path "$oid")" &&
+
+		# Verify that it is possible to read the commit from the
+		# commit graph when not being paranoid, ...
+		GIT_COMMIT_GRAPH_PARANOIA=false git rev-list B &&
+		# ... but parsing the commit when double checking that
+		# it actually exists in the object database should fail.
+		test_must_fail git rev-list -1 B
+	)
+'
+
 test_done