diff mbox series

commit-graph: retain commit slab when closing NULL commit_graph

Message ID 20240105054142.GA2035092@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit d70f554cdf38b0b05cfaa8e8eb9f80d54a5ae11c
Headers show
Series commit-graph: retain commit slab when closing NULL commit_graph | expand

Commit Message

Jeff King Jan. 5, 2024, 5:41 a.m. UTC
This fixes a regression introduced in ac6d45d11f (commit-graph: move
slab-clearing to close_commit_graph(), 2023-10-03), in which running:

  git -c fetch.writeCommitGraph=true fetch --recurse-submodules

multiple times in a freshly cloned repository causes a segfault. What
happens in the second (and subsequent) runs is this:

  1. We make a "struct commit" for any ref tips which we're storing
     (even if we already have them, they still go into FETCH_HEAD).

     Because the first run will have created a commit graph, we'll find
     those commits in the graph.

     The commit struct is therefore created with a NULL "maybe_tree"
     entry, because we can load its oid from the graph later. But to do
     that we need to remember that we got the commit from the graph,
     which is recorded in a global commit_graph_data_slab object.

  2. Because we're using --recurse-submodules, we'll try to fetch each
     of the possible submodules. That implies creating a separate
     "struct repository" in-process for each submodule, which will
     require a later call to repo_clear().

     The call to repo_clear() calls raw_object_store_clear(), which in
     turn calls close_object_store(), which in turn calls
     close_commit_graph(). And the latter frees the commit graph data
     slab.

  3. Later, when trying to write out a new commit graph, we'll ask for
     their tree oid via get_commit_tree_oid(), which will see that the
     object is parsed but with a NULL maybe_tree field. We'd then
     usually pull it from the graph file, but because the slab was
     cleared, we don't realize that we can do so! We end up returning
     NULL and segfaulting.

     (It seems questionable that we'd write a graph entry for such a
     commit anyway, since we know we already have one. I didn't
     double-check, but that may simply be another side effect of having
     cleared the slab).

The bug is in step (2) above. We should not be clearing the slab when
cleaning up the submodule repository structs. Prior to ac6d45d11f, we
did not do so because it was done inside a helper function that returned
early when it saw NULL. So the behavior change from that commit is that
we'll now _always_ clear the slab via repo_clear(), even if the
repository being closed did not have a commit graph (and thus would have
a NULL commit_graph struct).

The most immediate fix is to add in a NULL check in close_commit_graph(),
making it a true noop when passed in an object_store with a NULL
commit_graph (it's OK to just return early, since the rest of its code
is already a noop when passed NULL). That restores the pre-ac6d45d11f
behavior. And that's what this patch does, along with a test that
exercises it (we already have a test that uses submodules along with
fetch.writeCommitGraph, but the bug only triggers when there is a
subsequent fetch and when that fetch uses --recurse-submodules).

So that fixes the regression in the least-risky way possible.

I do think there's some fragility here that we might want to follow up
on. We have a global commit_graph_data_slab that contains graph
positions, and our global commit structs depend on the that slab
remaining valid. But close_commit_graph() is just about closing _one_
object store's graph. So it's dangerous to call that function and clear
the slab without also throwing away any "struct commit" we might have
parsed that depends on it.

Which at first glance seems like a bug we could already trigger. In the
situation described here, there is no commit graph in the submodule
repository, so our commit graph is NULL (in fact, in our test script
there is no submodule repo at all, so we immediately return from
repo_init() and call repo_clear() only to free up memory). But what
would happen if there was one? Wouldn't we see a non-NULL commit_graph
entry, and then clear the global slab anyway?

The answer is "no", but for very bizarre reasons. Remember that
repo_clear() calls raw_object_store_clear(), which then calls
close_object_store() and thus close_commit_graph(). But before it does
so, raw_object_store_clear() does something else: it frees the commit
graph and sets it to NULL! So by this code path we'll _never_ see a
non-NULL commit_graph struct, and thus never clear the slab.

So it happens to work out. But it still seems questionable to me that we
would clear a global slab (which might still be in use) when closing the
commit graph. This clearing comes from 957ba814bf (commit-graph: when
closing the graph, also release the slab, 2021-09-08), and was fixing a
case where we really did need it to be closed (and in that case we
presumably call close_object_store() more directly).

So I suspect there may still be a bug waiting to happen there, as any
object loaded before the call to close_object_store() may be stranded
with a bogus maybe_tree entry (and thus looking at it after the call
might cause an error). But I'm not sure how to trigger it, nor what the
fix should look like (you probably would need to "unparse" any objects
pulled from the graph). And so this patch punts on that for now in favor
of fixing the recent regression in the most direct way, which should not
have any other fallouts.

Signed-off-by: Jeff King <peff@peff.net>
---
I prepared this on top of jk/commit-graph-leak-fixes, which is the
branch in v2.43.0 that introduced the regression.

 commit-graph.c   | 3 +++
 t/t5510-fetch.sh | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

Comments

Taylor Blau Jan. 5, 2024, 5:07 p.m. UTC | #1
On Fri, Jan 05, 2024 at 12:41:42AM -0500, Jeff King wrote:
> This fixes a regression introduced in ac6d45d11f (commit-graph: move
> slab-clearing to close_commit_graph(), 2023-10-03), in which running:
>
>   git -c fetch.writeCommitGraph=true fetch --recurse-submodules
>
> multiple times in a freshly cloned repository causes a segfault. What
> happens in the second (and subsequent) runs is this:
>
>   1. We make a "struct commit" for any ref tips which we're storing
>      (even if we already have them, they still go into FETCH_HEAD).
>
>      Because the first run will have created a commit graph, we'll find
>      those commits in the graph.
>
>      The commit struct is therefore created with a NULL "maybe_tree"
>      entry, because we can load its oid from the graph later. But to do
>      that we need to remember that we got the commit from the graph,
>      which is recorded in a global commit_graph_data_slab object.
>
>   2. Because we're using --recurse-submodules, we'll try to fetch each
>      of the possible submodules. That implies creating a separate
>      "struct repository" in-process for each submodule, which will
>      require a later call to repo_clear().
>
>      The call to repo_clear() calls raw_object_store_clear(), which in
>      turn calls close_object_store(), which in turn calls
>      close_commit_graph(). And the latter frees the commit graph data
>      slab.
>
>   3. Later, when trying to write out a new commit graph, we'll ask for
>      their tree oid via get_commit_tree_oid(), which will see that the
>      object is parsed but with a NULL maybe_tree field. We'd then
>      usually pull it from the graph file, but because the slab was
>      cleared, we don't realize that we can do so! We end up returning
>      NULL and segfaulting.
>
>      (It seems questionable that we'd write a graph entry for such a
>      commit anyway, since we know we already have one. I didn't
>      double-check, but that may simply be another side effect of having
>      cleared the slab).

Yeah, I agree that we should not be re-writing a commit-graph entry that
already exists in an earlier (non-removed) layer of the commit-graph.
I haven't thought through all of the details here, but that sounds like
a potentially dangerous thing to be doing.

> So that fixes the regression in the least-risky way possible.

Thanks for the detailed explanation. I think that the fix presented here
is a reasonable approach, and is worthwhile to pick up.

> I do think there's some fragility here that we might want to follow up
> on. We have a global commit_graph_data_slab that contains graph
> positions, and our global commit structs depend on the that slab
> remaining valid. But close_commit_graph() is just about closing _one_
> object store's graph. So it's dangerous to call that function and clear
> the slab without also throwing away any "struct commit" we might have
> parsed that depends on it.

I definitely agree that there seems like a high risk of another
potentially-dangerous bug happening as a result of this area.

One thing that sticks out to me is that we have a scope mismatch
between the commit structs, the raw_object_store, and the commit slabs.
Slabs are tied to the lifetime of the raw_object_store, but the commit
objects are tied to the global lifetime of the process. I wonder if it
would make sense to have a slab per object-store, and require that you
pass both the commit *and* the object-store you're looking it up in as
arguments to any slab-related functions.

I dunno. I have not put a ton of thought into that ^^ approach either,
so let me know if it seems obviously bogus to you or if this is worth
looking into further.

Thanks,
Taylor
Junio C Hamano Jan. 5, 2024, 7:56 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

> This fixes a regression introduced in ac6d45d11f (commit-graph: move
> slab-clearing to close_commit_graph(), 2023-10-03), in which running:
> ...
> So it happens to work out. But it still seems questionable to me that we
> would clear a global slab (which might still be in use) when closing the
> commit graph. This clearing comes from 957ba814bf (commit-graph: when
> closing the graph, also release the slab, 2021-09-08), and was fixing a
> case where we really did need it to be closed (and in that case we
> presumably call close_object_store() more directly).

Wow, that is nasty.

But anyway, thanks for your usual "3 pages of explanation for 2
lines of change".  The patch does look the best fix in the meantime.
Jeff King Jan. 10, 2024, 11:39 a.m. UTC | #3
On Fri, Jan 05, 2024 at 12:07:44PM -0500, Taylor Blau wrote:

> > I do think there's some fragility here that we might want to follow up
> > on. We have a global commit_graph_data_slab that contains graph
> > positions, and our global commit structs depend on the that slab
> > remaining valid. But close_commit_graph() is just about closing _one_
> > object store's graph. So it's dangerous to call that function and clear
> > the slab without also throwing away any "struct commit" we might have
> > parsed that depends on it.
> 
> I definitely agree that there seems like a high risk of another
> potentially-dangerous bug happening as a result of this area.
> 
> One thing that sticks out to me is that we have a scope mismatch
> between the commit structs, the raw_object_store, and the commit slabs.
> Slabs are tied to the lifetime of the raw_object_store, but the commit
> objects are tied to the global lifetime of the process. I wonder if it
> would make sense to have a slab per object-store, and require that you
> pass both the commit *and* the object-store you're looking it up in as
> arguments to any slab-related functions.

Right, that scope mismatch is the crux of the issue. I had imagined that
if you're really closing the object database you might want to just
throw away all of the object structs. But we don't have a way to do that
(and in general I'd be slightly scared if we did, since lots of code
likes to think of object pointers as immutable).

I don't know if your solution works, though. If I "partially" load a
commit (i.e., its "parsed" flag is 1, but maybe_tree is still NULL, not
having been loaded from the graph yet), then that commit object depends
on the slab. If we later ask the commit about its tree, then it needs to
refer to the slab to see if it came from the graph. Right now that
happens in the global slab. If we did it in a per-object-store slab,
we'd need each commit to know which object store it came from!

Of course we do have a "struct repository" argument in
repo_get_commit_tree(). So...maybe that's enough? It still doesn't get
us all the way there, though. In this case the culprit was submodules,
and we really would be closing an alternate repo. But we have actual
calls to close_commit_graph() sprinkled around. Any code which closes
any commit-graph and then still accesses a "struct commit" is
potentially buggy.

I think a simpler solution would be that upon clearing the slab, we
either "finish" each commit (filling in the maybe_tree field) or
"unparse" it (unsetting the parsed flag, and then doing a regular and/or
graph lookup if it is accessed again later).

It should be easy-ish to iterate through the slab and look at the
commits that are mentioned in it. Though maybe not? Each commit knows
its slab-id, but I'm not sure if we have a master list of commits to go
the other way.

So yet another alternative is: don't clear the slab. Mark the entries as
"this came from a graph, but we don't know the correct graph position
anymore" (presumably with a new sentinel value for data->graph_pos). And
then functions like repo_get_commit_tree() would know that they need to
try harder (e.g., doing the unparse then and going back to the object
store to parse again).

Hmm. Actually, maybe COMMIT_NOT_FROM_GRAPH already functions that way.
If we see a commit that is parsed but has a NULL maybe_tree, then if we
see COMMIT_NOT_FROM_GRAPH we assume it's just a corrupt commit and
return NULL. But we could just try harder to re-parse it then. Like:

diff --git a/commit.c b/commit.c
index ef679a0b93..909898cf7f 100644
--- a/commit.c
+++ b/commit.c
@@ -423,6 +423,24 @@ struct tree *repo_get_commit_tree(struct repository *r,
 	if (commit_graph_position(commit) != COMMIT_NOT_FROM_GRAPH)
 		return get_commit_tree_in_graph(r, commit);
 
+	/*
+	 * This is either a corrupt commit, or one which we partially loaded
+	 * from a graph file but then subsequently threw away the graph data.
+	 *
+	 * Optimistically assume it's the latter and try to reload from
+	 * scratch. This gives a performance penalty if it really is a corrupt
+	 * commit, but presumably that happens rarely.
+	 *
+	 * This will throw away the parents list, which is potentially sketchy.
+	 * A better version of this would just unset commit->object.parsed
+	 * and then do a minimal version of parse_commit() that _just_ loads
+	 * the tree data (and/or graph position if available).
+	 *
+	 * Naturally we'd need to drop the "const" from our commit above, too.
+	 */
+	unparse_commit(r, &commit->object.oid);
+	repo_parse_commit(r, commit);
+
 	return NULL;
 }
 
I dunno. I do feel like this is a lurking maintenance headache, and
might even be a triggerable bug. But without knowing of a way that it
happens in the current code base, it feels like it would be easy to make
things worse rather than better.

I'm content to leave it for now and let this discussion serve as a
reference if somebody does manage to run into it in practice.

-Peff
Junio C Hamano Jan. 10, 2024, 4:38 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> I think a simpler solution would be that upon clearing the slab, we
> either "finish" each commit (filling in the maybe_tree field) or
> "unparse" it (unsetting the parsed flag, and then doing a regular and/or
> graph lookup if it is accessed again later).

Wow, that's clever.

> It should be easy-ish to iterate through the slab and look at the
> commits that are mentioned in it. Though maybe not? Each commit knows
> its slab-id, but I'm not sure if we have a master list of commits to go
> the other way.

We have table of all in-core objects, don't we?

> +	 * This will throw away the parents list, which is potentially sketchy.
> +	 * A better version of this would just unset commit->object.parsed
> +	 * and then do a minimal version of parse_commit() that _just_ loads
> +	 * the tree data (and/or graph position if available).

Yeah, it is a concern that we may be working with an in-core commit
object whose parent list has already been rewritten during revision
traversal.  Well thought out.

> +	 *
> +	 * Naturally we'd need to drop the "const" from our commit above, too.
> +	 */
> +	unparse_commit(r, &commit->object.oid);
> +	repo_parse_commit(r, commit);
> +
>  	return NULL;
>  }
>  
> I dunno. I do feel like this is a lurking maintenance headache, and
> might even be a triggerable bug. But without knowing of a way that it
> happens in the current code base, it feels like it would be easy to make
> things worse rather than better.

Unfortunately I share the feeling X-<.

Thanks.
Jeff King Jan. 11, 2024, 7:53 a.m. UTC | #5
On Wed, Jan 10, 2024 at 08:38:18AM -0800, Junio C Hamano wrote:

> > It should be easy-ish to iterate through the slab and look at the
> > commits that are mentioned in it. Though maybe not? Each commit knows
> > its slab-id, but I'm not sure if we have a master list of commits to go
> > the other way.
> 
> We have table of all in-core objects, don't we?

Oh, duh. Yes, we could iterate over obj_hash. I do think the "on demand"
version I showed later in the message is better, though, as the work
both scales with the number of affected commits (rather than the total
number of objects) and can be done lazily (so callers that are not buggy
pay no price at all).

> > +	 * This will throw away the parents list, which is potentially sketchy.
> > +	 * A better version of this would just unset commit->object.parsed
> > +	 * and then do a minimal version of parse_commit() that _just_ loads
> > +	 * the tree data (and/or graph position if available).
> 
> Yeah, it is a concern that we may be working with an in-core commit
> object whose parent list has already been rewritten during revision
> traversal.  Well thought out.

Hmm. Looking at my "a better version" sentence, I guess we know that
either:

  1. The object really is corrupt (i.e., we could not load the tree).

  2. It came from a graph in the first place.

So what if we just tried harder to look it up in the graph file (rather
than the slab) when we see COMMIT_NOT_FROM_GRAPH? And indeed, we even
have a function to do this already!

So something like this almost works:

diff --git a/commit.c b/commit.c
index b3223478bc..096a3d18d0 100644
--- a/commit.c
+++ b/commit.c
@@ -418,10 +418,12 @@ static inline void set_commit_tree(struct commit *c, struct tree *t)
 struct tree *repo_get_commit_tree(struct repository *r,
 				  const struct commit *commit)
 {
+	uint32_t pos;
+
 	if (commit->maybe_tree || !commit->object.parsed)
 		return commit->maybe_tree;
 
-	if (commit_graph_position(commit) != COMMIT_NOT_FROM_GRAPH)
+	if (repo_find_commit_pos_in_graph(r, commit, &pos))
 		return get_commit_tree_in_graph(r, commit);
 
 	return NULL;

but:

  1. It doesn't update the slab, so get_commit_tree_in_graph() then
     complains immediately, because it uses only
     commit_graph_position(). I guess we'd need to do:

       commit_graph_data_at(commit)->graph_pos = pos;

     somewhere. That does make the recently-found segfault go away.
     But...

  2. I'm not sure if other spots would want similar treatment. We store
     generation numbers in the slab, too. I think we'll figure things
     out when they're not available, so there's no segfault problem. But
     it's maybe a potential performance issue. Likewise, it is probably
     a bug that the graph-writing code is looking at this commit at all
     (since we know it is already in a graph file). So we might be
     papering over that bug (that is, even though the segfault is gone,
     we are perhaps still writing a duplicate graph entry).

> > I dunno. I do feel like this is a lurking maintenance headache, and
> > might even be a triggerable bug. But without knowing of a way that it
> > happens in the current code base, it feels like it would be easy to make
> > things worse rather than better.
> 
> Unfortunately I share the feeling X-<.

I somehow sniped myself into thinking about it more, but that has only
reinforced my feeling that I'm afraid to touch it. ;)

-Peff
Junio C Hamano Jan. 11, 2024, 6:35 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

> On Wed, Jan 10, 2024 at 08:38:18AM -0800, Junio C Hamano wrote:
>
>> > It should be easy-ish to iterate through the slab and look at the
>> > commits that are mentioned in it. Though maybe not? Each commit knows
>> > its slab-id, but I'm not sure if we have a master list of commits to go
>> > the other way.
>> 
>> We have table of all in-core objects, don't we?
>
> Oh, duh. Yes, we could iterate over obj_hash. I do think the "on demand"
> version I showed later in the message is better, though, as the work
> both scales with the number of affected commits (rather than the total
> number of objects) and can be done lazily (so callers that are not buggy
> pay no price at all).

Yeah, it is far more desirable than scanning obj_hash if we can do
the right thing on lazily.

> So what if we just tried harder to look it up in the graph file (rather
> than the slab) when we see COMMIT_NOT_FROM_GRAPH? And indeed, we even
> have a function to do this already!

;-)

> but:
> ...
> I somehow sniped myself into thinking about it more, but that has only
> reinforced my feeling that I'm afraid to touch it. ;)

Thanks for a nice summary.
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index e4d09da090..f26503295a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -727,6 +727,9 @@  struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 
 void close_commit_graph(struct raw_object_store *o)
 {
+	if (!o->commit_graph)
+		return;
+
 	clear_commit_graph_data_slab(&commit_graph_data_slab);
 	free_commit_graph(o->commit_graph);
 	o->commit_graph = NULL;
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index 19c36b57f4..bcf524d549 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -802,7 +802,8 @@  test_expect_success 'fetch.writeCommitGraph with submodules' '
 		cd super-clone &&
 		rm -rf .git/objects/info &&
 		git -c fetch.writeCommitGraph=true fetch origin &&
-		test_path_is_file .git/objects/info/commit-graphs/commit-graph-chain
+		test_path_is_file .git/objects/info/commit-graphs/commit-graph-chain &&
+		git -c fetch.writeCommitGraph=true fetch --recurse-submodules origin
 	)
 '