Message ID | 20240105054142.GA2035092@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Accepted |
Commit | d70f554cdf38b0b05cfaa8e8eb9f80d54a5ae11c |
Headers | show |
Series | commit-graph: retain commit slab when closing NULL commit_graph | expand |
On Fri, Jan 05, 2024 at 12:41:42AM -0500, Jeff King wrote: > This fixes a regression introduced in ac6d45d11f (commit-graph: move > slab-clearing to close_commit_graph(), 2023-10-03), in which running: > > git -c fetch.writeCommitGraph=true fetch --recurse-submodules > > multiple times in a freshly cloned repository causes a segfault. What > happens in the second (and subsequent) runs is this: > > 1. We make a "struct commit" for any ref tips which we're storing > (even if we already have them, they still go into FETCH_HEAD). > > Because the first run will have created a commit graph, we'll find > those commits in the graph. > > The commit struct is therefore created with a NULL "maybe_tree" > entry, because we can load its oid from the graph later. But to do > that we need to remember that we got the commit from the graph, > which is recorded in a global commit_graph_data_slab object. > > 2. Because we're using --recurse-submodules, we'll try to fetch each > of the possible submodules. That implies creating a separate > "struct repository" in-process for each submodule, which will > require a later call to repo_clear(). > > The call to repo_clear() calls raw_object_store_clear(), which in > turn calls close_object_store(), which in turn calls > close_commit_graph(). And the latter frees the commit graph data > slab. > > 3. Later, when trying to write out a new commit graph, we'll ask for > their tree oid via get_commit_tree_oid(), which will see that the > object is parsed but with a NULL maybe_tree field. We'd then > usually pull it from the graph file, but because the slab was > cleared, we don't realize that we can do so! We end up returning > NULL and segfaulting. > > (It seems questionable that we'd write a graph entry for such a > commit anyway, since we know we already have one. I didn't > double-check, but that may simply be another side effect of having > cleared the slab). Yeah, I agree that we should not be re-writing a commit-graph entry that already exists in an earlier (non-removed) layer of the commit-graph. I haven't thought through all of the details here, but that sounds like a potentially dangerous thing to be doing. > So that fixes the regression in the least-risky way possible. Thanks for the detailed explanation. I think that the fix presented here is a reasonable approach, and is worthwhile to pick up. > I do think there's some fragility here that we might want to follow up > on. We have a global commit_graph_data_slab that contains graph > positions, and our global commit structs depend on the that slab > remaining valid. But close_commit_graph() is just about closing _one_ > object store's graph. So it's dangerous to call that function and clear > the slab without also throwing away any "struct commit" we might have > parsed that depends on it. I definitely agree that there seems like a high risk of another potentially-dangerous bug happening as a result of this area. One thing that sticks out to me is that we have a scope mismatch between the commit structs, the raw_object_store, and the commit slabs. Slabs are tied to the lifetime of the raw_object_store, but the commit objects are tied to the global lifetime of the process. I wonder if it would make sense to have a slab per object-store, and require that you pass both the commit *and* the object-store you're looking it up in as arguments to any slab-related functions. I dunno. I have not put a ton of thought into that ^^ approach either, so let me know if it seems obviously bogus to you or if this is worth looking into further. Thanks, Taylor
Jeff King <peff@peff.net> writes: > This fixes a regression introduced in ac6d45d11f (commit-graph: move > slab-clearing to close_commit_graph(), 2023-10-03), in which running: > ... > So it happens to work out. But it still seems questionable to me that we > would clear a global slab (which might still be in use) when closing the > commit graph. This clearing comes from 957ba814bf (commit-graph: when > closing the graph, also release the slab, 2021-09-08), and was fixing a > case where we really did need it to be closed (and in that case we > presumably call close_object_store() more directly). Wow, that is nasty. But anyway, thanks for your usual "3 pages of explanation for 2 lines of change". The patch does look the best fix in the meantime.
On Fri, Jan 05, 2024 at 12:07:44PM -0500, Taylor Blau wrote: > > I do think there's some fragility here that we might want to follow up > > on. We have a global commit_graph_data_slab that contains graph > > positions, and our global commit structs depend on the that slab > > remaining valid. But close_commit_graph() is just about closing _one_ > > object store's graph. So it's dangerous to call that function and clear > > the slab without also throwing away any "struct commit" we might have > > parsed that depends on it. > > I definitely agree that there seems like a high risk of another > potentially-dangerous bug happening as a result of this area. > > One thing that sticks out to me is that we have a scope mismatch > between the commit structs, the raw_object_store, and the commit slabs. > Slabs are tied to the lifetime of the raw_object_store, but the commit > objects are tied to the global lifetime of the process. I wonder if it > would make sense to have a slab per object-store, and require that you > pass both the commit *and* the object-store you're looking it up in as > arguments to any slab-related functions. Right, that scope mismatch is the crux of the issue. I had imagined that if you're really closing the object database you might want to just throw away all of the object structs. But we don't have a way to do that (and in general I'd be slightly scared if we did, since lots of code likes to think of object pointers as immutable). I don't know if your solution works, though. If I "partially" load a commit (i.e., its "parsed" flag is 1, but maybe_tree is still NULL, not having been loaded from the graph yet), then that commit object depends on the slab. If we later ask the commit about its tree, then it needs to refer to the slab to see if it came from the graph. Right now that happens in the global slab. If we did it in a per-object-store slab, we'd need each commit to know which object store it came from! Of course we do have a "struct repository" argument in repo_get_commit_tree(). So...maybe that's enough? It still doesn't get us all the way there, though. In this case the culprit was submodules, and we really would be closing an alternate repo. But we have actual calls to close_commit_graph() sprinkled around. Any code which closes any commit-graph and then still accesses a "struct commit" is potentially buggy. I think a simpler solution would be that upon clearing the slab, we either "finish" each commit (filling in the maybe_tree field) or "unparse" it (unsetting the parsed flag, and then doing a regular and/or graph lookup if it is accessed again later). It should be easy-ish to iterate through the slab and look at the commits that are mentioned in it. Though maybe not? Each commit knows its slab-id, but I'm not sure if we have a master list of commits to go the other way. So yet another alternative is: don't clear the slab. Mark the entries as "this came from a graph, but we don't know the correct graph position anymore" (presumably with a new sentinel value for data->graph_pos). And then functions like repo_get_commit_tree() would know that they need to try harder (e.g., doing the unparse then and going back to the object store to parse again). Hmm. Actually, maybe COMMIT_NOT_FROM_GRAPH already functions that way. If we see a commit that is parsed but has a NULL maybe_tree, then if we see COMMIT_NOT_FROM_GRAPH we assume it's just a corrupt commit and return NULL. But we could just try harder to re-parse it then. Like: diff --git a/commit.c b/commit.c index ef679a0b93..909898cf7f 100644 --- a/commit.c +++ b/commit.c @@ -423,6 +423,24 @@ struct tree *repo_get_commit_tree(struct repository *r, if (commit_graph_position(commit) != COMMIT_NOT_FROM_GRAPH) return get_commit_tree_in_graph(r, commit); + /* + * This is either a corrupt commit, or one which we partially loaded + * from a graph file but then subsequently threw away the graph data. + * + * Optimistically assume it's the latter and try to reload from + * scratch. This gives a performance penalty if it really is a corrupt + * commit, but presumably that happens rarely. + * + * This will throw away the parents list, which is potentially sketchy. + * A better version of this would just unset commit->object.parsed + * and then do a minimal version of parse_commit() that _just_ loads + * the tree data (and/or graph position if available). + * + * Naturally we'd need to drop the "const" from our commit above, too. + */ + unparse_commit(r, &commit->object.oid); + repo_parse_commit(r, commit); + return NULL; } I dunno. I do feel like this is a lurking maintenance headache, and might even be a triggerable bug. But without knowing of a way that it happens in the current code base, it feels like it would be easy to make things worse rather than better. I'm content to leave it for now and let this discussion serve as a reference if somebody does manage to run into it in practice. -Peff
Jeff King <peff@peff.net> writes: > I think a simpler solution would be that upon clearing the slab, we > either "finish" each commit (filling in the maybe_tree field) or > "unparse" it (unsetting the parsed flag, and then doing a regular and/or > graph lookup if it is accessed again later). Wow, that's clever. > It should be easy-ish to iterate through the slab and look at the > commits that are mentioned in it. Though maybe not? Each commit knows > its slab-id, but I'm not sure if we have a master list of commits to go > the other way. We have table of all in-core objects, don't we? > + * This will throw away the parents list, which is potentially sketchy. > + * A better version of this would just unset commit->object.parsed > + * and then do a minimal version of parse_commit() that _just_ loads > + * the tree data (and/or graph position if available). Yeah, it is a concern that we may be working with an in-core commit object whose parent list has already been rewritten during revision traversal. Well thought out. > + * > + * Naturally we'd need to drop the "const" from our commit above, too. > + */ > + unparse_commit(r, &commit->object.oid); > + repo_parse_commit(r, commit); > + > return NULL; > } > > I dunno. I do feel like this is a lurking maintenance headache, and > might even be a triggerable bug. But without knowing of a way that it > happens in the current code base, it feels like it would be easy to make > things worse rather than better. Unfortunately I share the feeling X-<. Thanks.
On Wed, Jan 10, 2024 at 08:38:18AM -0800, Junio C Hamano wrote: > > It should be easy-ish to iterate through the slab and look at the > > commits that are mentioned in it. Though maybe not? Each commit knows > > its slab-id, but I'm not sure if we have a master list of commits to go > > the other way. > > We have table of all in-core objects, don't we? Oh, duh. Yes, we could iterate over obj_hash. I do think the "on demand" version I showed later in the message is better, though, as the work both scales with the number of affected commits (rather than the total number of objects) and can be done lazily (so callers that are not buggy pay no price at all). > > + * This will throw away the parents list, which is potentially sketchy. > > + * A better version of this would just unset commit->object.parsed > > + * and then do a minimal version of parse_commit() that _just_ loads > > + * the tree data (and/or graph position if available). > > Yeah, it is a concern that we may be working with an in-core commit > object whose parent list has already been rewritten during revision > traversal. Well thought out. Hmm. Looking at my "a better version" sentence, I guess we know that either: 1. The object really is corrupt (i.e., we could not load the tree). 2. It came from a graph in the first place. So what if we just tried harder to look it up in the graph file (rather than the slab) when we see COMMIT_NOT_FROM_GRAPH? And indeed, we even have a function to do this already! So something like this almost works: diff --git a/commit.c b/commit.c index b3223478bc..096a3d18d0 100644 --- a/commit.c +++ b/commit.c @@ -418,10 +418,12 @@ static inline void set_commit_tree(struct commit *c, struct tree *t) struct tree *repo_get_commit_tree(struct repository *r, const struct commit *commit) { + uint32_t pos; + if (commit->maybe_tree || !commit->object.parsed) return commit->maybe_tree; - if (commit_graph_position(commit) != COMMIT_NOT_FROM_GRAPH) + if (repo_find_commit_pos_in_graph(r, commit, &pos)) return get_commit_tree_in_graph(r, commit); return NULL; but: 1. It doesn't update the slab, so get_commit_tree_in_graph() then complains immediately, because it uses only commit_graph_position(). I guess we'd need to do: commit_graph_data_at(commit)->graph_pos = pos; somewhere. That does make the recently-found segfault go away. But... 2. I'm not sure if other spots would want similar treatment. We store generation numbers in the slab, too. I think we'll figure things out when they're not available, so there's no segfault problem. But it's maybe a potential performance issue. Likewise, it is probably a bug that the graph-writing code is looking at this commit at all (since we know it is already in a graph file). So we might be papering over that bug (that is, even though the segfault is gone, we are perhaps still writing a duplicate graph entry). > > I dunno. I do feel like this is a lurking maintenance headache, and > > might even be a triggerable bug. But without knowing of a way that it > > happens in the current code base, it feels like it would be easy to make > > things worse rather than better. > > Unfortunately I share the feeling X-<. I somehow sniped myself into thinking about it more, but that has only reinforced my feeling that I'm afraid to touch it. ;) -Peff
Jeff King <peff@peff.net> writes: > On Wed, Jan 10, 2024 at 08:38:18AM -0800, Junio C Hamano wrote: > >> > It should be easy-ish to iterate through the slab and look at the >> > commits that are mentioned in it. Though maybe not? Each commit knows >> > its slab-id, but I'm not sure if we have a master list of commits to go >> > the other way. >> >> We have table of all in-core objects, don't we? > > Oh, duh. Yes, we could iterate over obj_hash. I do think the "on demand" > version I showed later in the message is better, though, as the work > both scales with the number of affected commits (rather than the total > number of objects) and can be done lazily (so callers that are not buggy > pay no price at all). Yeah, it is far more desirable than scanning obj_hash if we can do the right thing on lazily. > So what if we just tried harder to look it up in the graph file (rather > than the slab) when we see COMMIT_NOT_FROM_GRAPH? And indeed, we even > have a function to do this already! ;-) > but: > ... > I somehow sniped myself into thinking about it more, but that has only > reinforced my feeling that I'm afraid to touch it. ;) Thanks for a nice summary.
diff --git a/commit-graph.c b/commit-graph.c index e4d09da090..f26503295a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -727,6 +727,9 @@ struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r) void close_commit_graph(struct raw_object_store *o) { + if (!o->commit_graph) + return; + clear_commit_graph_data_slab(&commit_graph_data_slab); free_commit_graph(o->commit_graph); o->commit_graph = NULL; diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh index 19c36b57f4..bcf524d549 100755 --- a/t/t5510-fetch.sh +++ b/t/t5510-fetch.sh @@ -802,7 +802,8 @@ test_expect_success 'fetch.writeCommitGraph with submodules' ' cd super-clone && rm -rf .git/objects/info && git -c fetch.writeCommitGraph=true fetch origin && - test_path_is_file .git/objects/info/commit-graphs/commit-graph-chain + test_path_is_file .git/objects/info/commit-graphs/commit-graph-chain && + git -c fetch.writeCommitGraph=true fetch --recurse-submodules origin ) '
This fixes a regression introduced in ac6d45d11f (commit-graph: move slab-clearing to close_commit_graph(), 2023-10-03), in which running: git -c fetch.writeCommitGraph=true fetch --recurse-submodules multiple times in a freshly cloned repository causes a segfault. What happens in the second (and subsequent) runs is this: 1. We make a "struct commit" for any ref tips which we're storing (even if we already have them, they still go into FETCH_HEAD). Because the first run will have created a commit graph, we'll find those commits in the graph. The commit struct is therefore created with a NULL "maybe_tree" entry, because we can load its oid from the graph later. But to do that we need to remember that we got the commit from the graph, which is recorded in a global commit_graph_data_slab object. 2. Because we're using --recurse-submodules, we'll try to fetch each of the possible submodules. That implies creating a separate "struct repository" in-process for each submodule, which will require a later call to repo_clear(). The call to repo_clear() calls raw_object_store_clear(), which in turn calls close_object_store(), which in turn calls close_commit_graph(). And the latter frees the commit graph data slab. 3. Later, when trying to write out a new commit graph, we'll ask for their tree oid via get_commit_tree_oid(), which will see that the object is parsed but with a NULL maybe_tree field. We'd then usually pull it from the graph file, but because the slab was cleared, we don't realize that we can do so! We end up returning NULL and segfaulting. (It seems questionable that we'd write a graph entry for such a commit anyway, since we know we already have one. I didn't double-check, but that may simply be another side effect of having cleared the slab). The bug is in step (2) above. We should not be clearing the slab when cleaning up the submodule repository structs. Prior to ac6d45d11f, we did not do so because it was done inside a helper function that returned early when it saw NULL. So the behavior change from that commit is that we'll now _always_ clear the slab via repo_clear(), even if the repository being closed did not have a commit graph (and thus would have a NULL commit_graph struct). The most immediate fix is to add in a NULL check in close_commit_graph(), making it a true noop when passed in an object_store with a NULL commit_graph (it's OK to just return early, since the rest of its code is already a noop when passed NULL). That restores the pre-ac6d45d11f behavior. And that's what this patch does, along with a test that exercises it (we already have a test that uses submodules along with fetch.writeCommitGraph, but the bug only triggers when there is a subsequent fetch and when that fetch uses --recurse-submodules). So that fixes the regression in the least-risky way possible. I do think there's some fragility here that we might want to follow up on. We have a global commit_graph_data_slab that contains graph positions, and our global commit structs depend on the that slab remaining valid. But close_commit_graph() is just about closing _one_ object store's graph. So it's dangerous to call that function and clear the slab without also throwing away any "struct commit" we might have parsed that depends on it. Which at first glance seems like a bug we could already trigger. In the situation described here, there is no commit graph in the submodule repository, so our commit graph is NULL (in fact, in our test script there is no submodule repo at all, so we immediately return from repo_init() and call repo_clear() only to free up memory). But what would happen if there was one? Wouldn't we see a non-NULL commit_graph entry, and then clear the global slab anyway? The answer is "no", but for very bizarre reasons. Remember that repo_clear() calls raw_object_store_clear(), which then calls close_object_store() and thus close_commit_graph(). But before it does so, raw_object_store_clear() does something else: it frees the commit graph and sets it to NULL! So by this code path we'll _never_ see a non-NULL commit_graph struct, and thus never clear the slab. So it happens to work out. But it still seems questionable to me that we would clear a global slab (which might still be in use) when closing the commit graph. This clearing comes from 957ba814bf (commit-graph: when closing the graph, also release the slab, 2021-09-08), and was fixing a case where we really did need it to be closed (and in that case we presumably call close_object_store() more directly). So I suspect there may still be a bug waiting to happen there, as any object loaded before the call to close_object_store() may be stranded with a bogus maybe_tree entry (and thus looking at it after the call might cause an error). But I'm not sure how to trigger it, nor what the fix should look like (you probably would need to "unparse" any objects pulled from the graph). And so this patch punts on that for now in favor of fixing the recent regression in the most direct way, which should not have any other fallouts. Signed-off-by: Jeff King <peff@peff.net> --- I prepared this on top of jk/commit-graph-leak-fixes, which is the branch in v2.43.0 that introduced the regression. commit-graph.c | 3 +++ t/t5510-fetch.sh | 3 ++- 2 files changed, 5 insertions(+), 1 deletion(-)