diff mbox series

commit: detect commits that exist in commit-graph but not in the ODB

Message ID b0bf576c51a706367a758b8e30eca37edb9c2734.1697200576.git.ps@pks.im (mailing list archive)
State New, archived
Headers show
Series commit: detect commits that exist in commit-graph but not in the ODB | expand

Commit Message

Patrick Steinhardt Oct. 13, 2023, 12:37 p.m. UTC
Commit graphs can become stale and contain references to commits that do
not exist in the object database anymore. Theoretically, this can lead
to a scenario where we are able to successfully look up any such commit
via the commit graph even though such a lookup would fail if done via
the object database directly.

As the commit graph is mostly intended as a sort of cache to speed up
parsing of commits we do not want to have diverging behaviour in a
repository with and a repository without commit graphs, no matter
whether they are stale or not. As commits are otherwise immutable, the
only thing that we really need to care about is thus the presence or
absence of a commit.

To address potentially stale commit data that may exist in the graph,
our `lookup_commit_in_graph()` function will check for the commit's
existence in both the commit graph, but also in the object database. So
even if we were able to look up the commit's data in the graph, we would
still pretend as if the commit didn't exist if it is missing in the
object database.

We don't have the same safety net in `parse_commit_in_graph_one()`
though. This function is mostly used internally in "commit-graph.c"
itself to validate the commit graph, and this usage is fine. We do
expose its functionality via `parse_commit_in_graph()` though, which
gets called by `repo_parse_commit_internal()`, and that function is in
turn used in many places in our codebase.

For all I can see this function is never used to directly turn an object
ID into a commit object without additional safety checks before or after
this lookup. What it is being used for though is to walk history via the
parent chain of commits. So when commits in the parent chain of a graph
walk are missing it is possible that we wouldn't notice if that missing
commit was part of the commit graph. Thus, a query like `git rev-parse
HEAD~2` can succeed even if the intermittent commit is missing.

It's unclear whether there are additional ways in which such stale
commit graphs can lead to problems. In any case, it feels like this is a
bigger bug waiting to happen when we gain additional direct or indirect
callers of `repo_parse_commit_internal()`. So let's fix the inconsistent
behaviour by checking for object existence via the object database, as
well.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 commit.c                |  7 ++++++-
 t/t5318-commit-graph.sh | 23 +++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Oct. 13, 2023, 6:21 p.m. UTC | #1
Patrick Steinhardt <ps@pks.im> writes:

> @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		if (!has_object(r, &item->object.oid, 0))
> +			return quiet_on_missing ? -1 :
> +				error(_("commit %s exists in commit-graph but not in the object database"),
> +				      oid_to_hex(&item->object.oid));
>  		return 0;
> +	}

Ever since this codepath was introduced by 177722b3 (commit:
integrate commit graph with commit parsing, 2018-04-10), we blindly
trusted what commit-graph file says.  This change is a strict
improvement in the correctness department, but there are two things
that are a bit worrying.

One.  The additional check should certainly be cheaper than a full
reading and parsing of an object, either from a loose object file or
from a pack entry.  It may not hurt performance too much, but it
still would give us more confidence if we know by how much we are
pessimising good cases where the commit-graph does match reality.
Our stance on these secondary files that store precomputed values
for optimization purposes is in general to use them blindly unless
in exceptional cases where the operation values the correctness even
when the validity of these secondary files is dubious (e.g., "fsck"),
and doing this extra check regardless of the caller at this low level
of the callchain is a bit worrying.

Another is that by the time parse_commit_in_graph() returns true and
we realize that the answer we got is bogus by asking has_object(),
item->object.parsed has already been toggled on, so the caller now
has a commit object that claimed it was already parsed and does not
match reality.  Hopefully the caller takes an early exit upon seeing
a failure from parse_commit_gently() and the .parsed bit does not
matter, but maybe I am missing a case where it does.  I dunno.

Other than that, sounds very sensible and the code change is clean.

Will queue.  Thanks.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index ba65f17dd9..25f8e9e2d3 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' '
>  	)
>  '
>  
> +test_expect_success 'commit exists in commit-graph but not in object database' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit A &&
> +		test_commit B &&
> +		test_commit C &&
> +		git commit-graph write --reachable &&
> +
> +		# Corrupt the repository by deleting the intermittent commit
> +		# object. Commands should notice that this object is absent and
> +		# thus that the repository is corrupt even if the commit graph
> +		# exists.
> +		oid=$(git rev-parse B) &&
> +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> +
> +		test_must_fail git rev-parse HEAD~2 2>error &&
> +		grep "error: commit $oid exists in commit-graph but not in the object database" error
> +	)
> +'
> +
>  test_done
Patrick Steinhardt Oct. 17, 2023, 6:37 a.m. UTC | #2
On Fri, Oct 13, 2023 at 11:21:48AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
> >  		return -1;
> >  	if (item->object.parsed)
> >  		return 0;
> > -	if (use_commit_graph && parse_commit_in_graph(r, item))
> > +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> > +		if (!has_object(r, &item->object.oid, 0))
> > +			return quiet_on_missing ? -1 :
> > +				error(_("commit %s exists in commit-graph but not in the object database"),
> > +				      oid_to_hex(&item->object.oid));
> >  		return 0;
> > +	}
> 
> Ever since this codepath was introduced by 177722b3 (commit:
> integrate commit graph with commit parsing, 2018-04-10), we blindly
> trusted what commit-graph file says.  This change is a strict
> improvement in the correctness department, but there are two things
> that are a bit worrying.
> 
> One.  The additional check should certainly be cheaper than a full
> reading and parsing of an object, either from a loose object file or
> from a pack entry.  It may not hurt performance too much, but it
> still would give us more confidence if we know by how much we are
> pessimising good cases where the commit-graph does match reality.
> Our stance on these secondary files that store precomputed values
> for optimization purposes is in general to use them blindly unless
> in exceptional cases where the operation values the correctness even
> when the validity of these secondary files is dubious (e.g., "fsck"),
> and doing this extra check regardless of the caller at this low level
> of the callchain is a bit worrying.

Fair point indeed. The following is a worst-case scenario benchmark of
of the change where we do a full topological walk of all reachable
commits in the graph, executed in linux.git. We parse commit parents via
`repo_parse_commit_gently()`, so the new code path now basically has to
check for object existence of every reachable commit:

Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master)
  Time (mean ± σ):      2.913 s ±  0.018 s    [User: 2.363 s, System: 0.548 s]
  Range (min … max):    2.894 s …  2.950 s    10 runs

Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
  Time (mean ± σ):      3.834 s ±  0.052 s    [User: 3.276 s, System: 0.556 s]
  Range (min … max):    3.780 s …  3.961 s    10 runs

Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master)
  Time (mean ± σ):     13.841 s ±  0.084 s    [User: 13.152 s, System: 0.687 s]
  Range (min … max):   13.714 s … 13.995 s    10 runs

Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
  Time (mean ± σ):     13.762 s ±  0.116 s    [User: 13.094 s, System: 0.667 s]
  Range (min … max):   13.645 s … 14.038 s    10 runs

Summary
  git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran
    1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
    4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
    4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master)

The added check does lead to a performance regression indeed, which is
not all that unexpected. That being said, the commit-graph still results
in a significant speedup compared to the case where we don't have it.

> Another is that by the time parse_commit_in_graph() returns true and
> we realize that the answer we got is bogus by asking has_object(),
> item->object.parsed has already been toggled on, so the caller now
> has a commit object that claimed it was already parsed and does not
> match reality.  Hopefully the caller takes an early exit upon seeing
> a failure from parse_commit_gently() and the .parsed bit does not
> matter, but maybe I am missing a case where it does.  I dunno.

We could also call `unparse_commit()` when we notice the stale commit
graph item. This would be in the same spirit as the rest of this patch
as it would lead to an overall safer end state.

In any case I'll wait for additional input before sending a v2, most
importantly to see whether we think that consistency trumps performance
in this case. Personally I'm still of the mind that it should, which
also comes from the fact that we were fighting with stale commit graphs
several times in production data.

Patrick

> Other than that, sounds very sensible and the code change is clean.
> 
> Will queue.  Thanks.
> 
> > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> > index ba65f17dd9..25f8e9e2d3 100755
> > --- a/t/t5318-commit-graph.sh
> > +++ b/t/t5318-commit-graph.sh
> > @@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' '
> >  	)
> >  '
> >  
> > +test_expect_success 'commit exists in commit-graph but not in object database' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit A &&
> > +		test_commit B &&
> > +		test_commit C &&
> > +		git commit-graph write --reachable &&
> > +
> > +		# Corrupt the repository by deleting the intermittent commit
> > +		# object. Commands should notice that this object is absent and
> > +		# thus that the repository is corrupt even if the commit graph
> > +		# exists.
> > +		oid=$(git rev-parse B) &&
> > +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> > +
> > +		test_must_fail git rev-parse HEAD~2 2>error &&
> > +		grep "error: commit $oid exists in commit-graph but not in the object database" error
> > +	)
> > +'
> > +
> >  test_done
Junio C Hamano Oct. 17, 2023, 6:34 p.m. UTC | #3
Patrick Steinhardt <ps@pks.im> writes:

> Fair point indeed. The following is a worst-case scenario benchmark of
> of the change where we do a full topological walk of all reachable
> commits in the graph, executed in linux.git. We parse commit parents via
> `repo_parse_commit_gently()`, so the new code path now basically has to
> check for object existence of every reachable commit:
> ...
> The added check does lead to a performance regression indeed, which is
> not all that unexpected. That being said, the commit-graph still results
> in a significant speedup compared to the case where we don't have it.

Yeah, I agree that both points are expected.  An extra check that is
much cheaper than the full parsing is paying a small price to be a
bit more careful than before.  The question is if the price is small
enough.  I am still not sure if the extra carefulness is warranted
for all normal cases to spend 30% extra cycles


Yeah, I agree that both points are expected.  An extra check that is
much cheaper than the full parsing is paying a small price to be a
bit more careful than before.  The question is if the price is small
enough.  I am still not sure if the extra carefulness is warranted
for all normal cases to spend 30% extra cycles.

Thanks.
Patrick Steinhardt Oct. 19, 2023, 6:45 a.m. UTC | #4
On Tue, Oct 17, 2023 at 11:34:53AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Fair point indeed. The following is a worst-case scenario benchmark of
> > of the change where we do a full topological walk of all reachable
> > commits in the graph, executed in linux.git. We parse commit parents via
> > `repo_parse_commit_gently()`, so the new code path now basically has to
> > check for object existence of every reachable commit:
> > ...
> > The added check does lead to a performance regression indeed, which is
> > not all that unexpected. That being said, the commit-graph still results
> > in a significant speedup compared to the case where we don't have it.
> 
> Yeah, I agree that both points are expected.  An extra check that is
> much cheaper than the full parsing is paying a small price to be a
> bit more careful than before.  The question is if the price is small
> enough.  I am still not sure if the extra carefulness is warranted
> for all normal cases to spend 30% extra cycles.
> 
> Thanks.

Well, seen in contexts like the thread that spawned this discussion I
think it's preferable to take a relatively smaller price compared to
disabling the commit graph entirely in some situations. With that in
mind, personally I'd be happy with either of two outcomes here:

    - We take the patch I proposed as a hardening measure, which allows
      us to use the commit graph in basically any situation where we
      could parse objects from the ODB without any downside except for a
      performance hit.

    - We say that corrupt repositories do not need to be accounted for
      when using metadata caches like the commit-graph. That would mean
      in consequence that for the `--missing` flag we would not have to
      disable commit graphs.

The test failure that was exposed in Karthik's test only stems from the
fact that we manually corrupt the repository and is not specific to the
`--missing` flag. This is further stressed by the new test aded in my
own patch, as we can trigger a similar bug when not using `--missing`.

For end users, object pruning would happen either via git-gc(1) or via
git-maintenance(1), and both of them do know to update commit graphs
already. This should significantly decrease the chance that such stale
commit graphs would be found out there in the wild, but it's still not
impossible of course. Especially in cases where you use lower-level
commands like git-repack(1) with cruft packs or git-prune(1), which is
often the case for Git hosters, the risk is higher though.

Patrick
Patrick Steinhardt Oct. 19, 2023, 8:25 a.m. UTC | #5
On Thu, Oct 19, 2023 at 08:45:34AM +0200, Patrick Steinhardt wrote:
> On Tue, Oct 17, 2023 at 11:34:53AM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > 
> > > Fair point indeed. The following is a worst-case scenario benchmark of
> > > of the change where we do a full topological walk of all reachable
> > > commits in the graph, executed in linux.git. We parse commit parents via
> > > `repo_parse_commit_gently()`, so the new code path now basically has to
> > > check for object existence of every reachable commit:
> > > ...
> > > The added check does lead to a performance regression indeed, which is
> > > not all that unexpected. That being said, the commit-graph still results
> > > in a significant speedup compared to the case where we don't have it.
> > 
> > Yeah, I agree that both points are expected.  An extra check that is
> > much cheaper than the full parsing is paying a small price to be a
> > bit more careful than before.  The question is if the price is small
> > enough.  I am still not sure if the extra carefulness is warranted
> > for all normal cases to spend 30% extra cycles.
> > 
> > Thanks.
> 
> Well, seen in contexts like the thread that spawned this discussion I
> think it's preferable to take a relatively smaller price compared to
> disabling the commit graph entirely in some situations. With that in
> mind, personally I'd be happy with either of two outcomes here:
> 
>     - We take the patch I proposed as a hardening measure, which allows
>       us to use the commit graph in basically any situation where we
>       could parse objects from the ODB without any downside except for a
>       performance hit.
> 
>     - We say that corrupt repositories do not need to be accounted for
>       when using metadata caches like the commit-graph. That would mean
>       in consequence that for the `--missing` flag we would not have to
>       disable commit graphs.
> 
> The test failure that was exposed in Karthik's test only stems from the
> fact that we manually corrupt the repository and is not specific to the
> `--missing` flag. This is further stressed by the new test aded in my
> own patch, as we can trigger a similar bug when not using `--missing`.

There's another way to handle this, which is to conditionally enable the
object existence check. This would be less of a performance hit compared
to disabling commit graphs altogether with `--missing`, but still ensure
that repository corruption was detected. Second, it would not regress
performance for all preexisting users of `repo_parse_commit_gently()`.

Patrick
Junio C Hamano Oct. 19, 2023, 5:16 p.m. UTC | #6
Patrick Steinhardt <ps@pks.im> writes:

> There's another way to handle this, which is to conditionally enable the
> object existence check. This would be less of a performance hit compared
> to disabling commit graphs altogether with `--missing`, but still ensure
> that repository corruption was detected. Second, it would not regress
> performance for all preexisting users of `repo_parse_commit_gently()`.

The above was what I meant to suggest when you demonstrated that the
code with additional check is still much more performant than
running without the commit-graph optimization, yet has observable
impact on performance for normal codepaths that do not need the
extra carefulness.

But I wasn't sure if it is easy to plumb the "do we want to double
check?  in other words, are we running something like --missing that
care the correctness a bit more than usual cases?" bit down from the
caller, because this check is so deep in the callchain.

Thanks.
Jeff King Oct. 20, 2023, 10 a.m. UTC | #7
On Thu, Oct 19, 2023 at 10:16:56AM -0700, Junio C Hamano wrote:

> Patrick Steinhardt <ps@pks.im> writes:
> 
> > There's another way to handle this, which is to conditionally enable the
> > object existence check. This would be less of a performance hit compared
> > to disabling commit graphs altogether with `--missing`, but still ensure
> > that repository corruption was detected. Second, it would not regress
> > performance for all preexisting users of `repo_parse_commit_gently()`.
> 
> The above was what I meant to suggest when you demonstrated that the
> code with additional check is still much more performant than
> running without the commit-graph optimization, yet has observable
> impact on performance for normal codepaths that do not need the
> extra carefulness.
> 
> But I wasn't sure if it is easy to plumb the "do we want to double
> check?  in other words, are we running something like --missing that
> care the correctness a bit more than usual cases?" bit down from the
> caller, because this check is so deep in the callchain.

I wonder if we would want a "be extra careful" flag that is read from
the environment? That is largely how GIT_REF_PARANOIA works, and then
particular operations set it (though actually it is the default these
days, so they no longer do so explicitly).

I guess that is really like a global variable but even more gross. ;)
But it is nice that it can cross process boundaries, because "how
careful do we want to be" may be in the eye of the caller, especially
for plumbing commands. E.g., even without --missing, you may want
"rev-list" to be extra careful about checking the existence of commits.
The caller in check_connected() could arguably turn that on by default
(the way we used to turn on GIT_REF_PARANOIA for pruning repacks before
it became the default).

-Peff
Junio C Hamano Oct. 20, 2023, 5:35 p.m. UTC | #8
Jeff King <peff@peff.net> writes:

> I guess that is really like a global variable but even more gross. ;)

Yeah, I tend to agree, but ...

> But it is nice that it can cross process boundaries, because "how
> careful do we want to be" may be in the eye of the caller, especially
> for plumbing commands. E.g., even without --missing, you may want
> "rev-list" to be extra careful about checking the existence of commits.

... these are all very good points.

> The caller in check_connected() could arguably turn that on by default
> (the way we used to turn on GIT_REF_PARANOIA for pruning repacks before
> it became the default).

True.

Thanks.
Patrick Steinhardt Oct. 23, 2023, 10:15 a.m. UTC | #9
On Fri, Oct 20, 2023 at 06:00:24AM -0400, Jeff King wrote:
> On Thu, Oct 19, 2023 at 10:16:56AM -0700, Junio C Hamano wrote:
> 
> > Patrick Steinhardt <ps@pks.im> writes:
> > 
> > > There's another way to handle this, which is to conditionally enable the
> > > object existence check. This would be less of a performance hit compared
> > > to disabling commit graphs altogether with `--missing`, but still ensure
> > > that repository corruption was detected. Second, it would not regress
> > > performance for all preexisting users of `repo_parse_commit_gently()`.
> > 
> > The above was what I meant to suggest when you demonstrated that the
> > code with additional check is still much more performant than
> > running without the commit-graph optimization, yet has observable
> > impact on performance for normal codepaths that do not need the
> > extra carefulness.
> > 
> > But I wasn't sure if it is easy to plumb the "do we want to double
> > check?  in other words, are we running something like --missing that
> > care the correctness a bit more than usual cases?" bit down from the
> > caller, because this check is so deep in the callchain.
> 
> I wonder if we would want a "be extra careful" flag that is read from
> the environment? That is largely how GIT_REF_PARANOIA works, and then
> particular operations set it (though actually it is the default these
> days, so they no longer do so explicitly).
> 
> I guess that is really like a global variable but even more gross. ;)
> But it is nice that it can cross process boundaries, because "how
> careful do we want to be" may be in the eye of the caller, especially
> for plumbing commands. E.g., even without --missing, you may want
> "rev-list" to be extra careful about checking the existence of commits.
> The caller in check_connected() could arguably turn that on by default
> (the way we used to turn on GIT_REF_PARANOIA for pruning repacks before
> it became the default).

Good point indeed, I also started to ask myself whether we'd want to
have a config option. An environment variable `GIT_OBJECT_PARANOIA`
would work equally well though and be less "official".

What I like about this idea is that it would also allow us to easily
unify the behaviour of `lookup_commit_in_graph()` and the new sanity
check in `repo_parse_commit_gently()` that I'm introducing here. I'd
personally think that the default for any such toggle should be `true`,
also to keep the current behaviour of `lookup_commit_in_graph()`. But
changing it would be easy enough.

I'll send a v2 of this series with such a toggle.

Patrick
diff mbox series

Patch

diff --git a/commit.c b/commit.c
index b3223478bc..109e9217e3 100644
--- a/commit.c
+++ b/commit.c
@@ -572,8 +572,13 @@  int repo_parse_commit_internal(struct repository *r,
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(r, item))
+	if (use_commit_graph && parse_commit_in_graph(r, item)) {
+		if (!has_object(r, &item->object.oid, 0))
+			return quiet_on_missing ? -1 :
+				error(_("commit %s exists in commit-graph but not in the object database"),
+				      oid_to_hex(&item->object.oid));
 		return 0;
+	}
 
 	if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0)
 		return quiet_on_missing ? -1 :
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index ba65f17dd9..25f8e9e2d3 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -821,4 +821,27 @@  test_expect_success 'overflow during generation version upgrade' '
 	)
 '
 
+test_expect_success 'commit exists in commit-graph but not in object database' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		test_commit A &&
+		test_commit B &&
+		test_commit C &&
+		git commit-graph write --reachable &&
+
+		# Corrupt the repository by deleting the intermittent commit
+		# object. Commands should notice that this object is absent and
+		# thus that the repository is corrupt even if the commit graph
+		# exists.
+		oid=$(git rev-parse B) &&
+		rm .git/objects/"$(test_oid_to_path "$oid")" &&
+
+		test_must_fail git rev-parse HEAD~2 2>error &&
+		grep "error: commit $oid exists in commit-graph but not in the object database" error
+	)
+'
+
 test_done