Message ID | X8qGTaIdnNa5mAfC@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | misc commit-graph and oid-array cleanups | expand |
On 12/4/2020 1:56 PM, Jeff King wrote: > That turns out not to be a problem, though. The only things we do with > the count are: > > - check if our count will overflow our data structures. But the limit > there is 2^31 commits, so it's not likely to happen in practice. You do point out that you are removing this logic, done in this if statement: > - if (count_distinct >= GRAPH_EDGE_LAST_MASK) { > - error(_("the commit graph format cannot write %d commits"), count_distinct); > - res = -1; > - goto cleanup; > - } What is the harm of keeping this? I know it is very unlikely, but I'd rather fail here than write a commit-graph that gets parsed as garbage. Thanks, -Stolee
On Fri, Dec 04, 2020 at 03:06:24PM -0500, Derrick Stolee wrote: > On 12/4/2020 1:56 PM, Jeff King wrote: > > That turns out not to be a problem, though. The only things we do with > > the count are: > > > > - check if our count will overflow our data structures. But the limit > > there is 2^31 commits, so it's not likely to happen in practice. > > You do point out that you are removing this logic, done in this > if statement: > > > - if (count_distinct >= GRAPH_EDGE_LAST_MASK) { > > - error(_("the commit graph format cannot write %d commits"), count_distinct); > > - res = -1; > > - goto cleanup; > > - } > > What is the harm of keeping this? I know it is very unlikely, but > I'd rather fail here than write a commit-graph that gets parsed as > garbage. I agree it's important to have. But this one is redundant. Look a few lines below this hunk, and we have: copy_oids_to_commits(ctx); if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) { error(_("too many commits to write graph")); res = -1; goto cleanup; } So before we were counting distinct commits, checking that our count fits, and then _ignoring_ that count in order to create the actual list of commits, and then checking that the actual list's count fits. We only need one of those checks, and the important one is the one from the actual list (they _should_ match, but due to the bug, they sometimes didn't). My "not likely to happen in practice" is not about the quality of the check, but rather that being off by one would never matter in practice. Does that make more sense? -Peff
On 12/4/2020 3:42 PM, Jeff King wrote: > On Fri, Dec 04, 2020 at 03:06:24PM -0500, Derrick Stolee wrote: > >> On 12/4/2020 1:56 PM, Jeff King wrote: >>> That turns out not to be a problem, though. The only things we do with >>> the count are: >>> >>> - check if our count will overflow our data structures. But the limit >>> there is 2^31 commits, so it's not likely to happen in practice. >> >> You do point out that you are removing this logic, done in this >> if statement: >> >>> - if (count_distinct >= GRAPH_EDGE_LAST_MASK) { >>> - error(_("the commit graph format cannot write %d commits"), count_distinct); >>> - res = -1; >>> - goto cleanup; >>> - } >> >> What is the harm of keeping this? I know it is very unlikely, but >> I'd rather fail here than write a commit-graph that gets parsed as >> garbage. > > I agree it's important to have. But this one is redundant. Look a few > lines below this hunk, and we have: > > copy_oids_to_commits(ctx); > > if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) { > error(_("too many commits to write graph")); > res = -1; > goto cleanup; > } Yep, my fault for not looking further in the context of your patch. > So before we were counting distinct commits, checking that our count > fits, and then _ignoring_ that count in order to create the actual list > of commits, and then checking that the actual list's count fits. We only > need one of those checks, and the important one is the one from the > actual list (they _should_ match, but due to the bug, they sometimes > didn't). > > My "not likely to happen in practice" is not about the quality of the > check, but rather that being off by one would never matter in practice. > > Does that make more sense? Makes sense to me. No need to change this patch at all. Thanks, -Stolee
On Fri, Dec 04, 2020 at 03:47:43PM -0500, Derrick Stolee wrote: > > So before we were counting distinct commits, checking that our count > > fits, and then _ignoring_ that count in order to create the actual list > > of commits, and then checking that the actual list's count fits. We only > > need one of those checks, and the important one is the one from the > > actual list (they _should_ match, but due to the bug, they sometimes > > didn't). > > > > My "not likely to happen in practice" is not about the quality of the > > check, but rather that being off by one would never matter in practice. > > > > Does that make more sense? > Makes sense to me. No need to change this patch at all. I brushed up the commit message a bit to make those points clearer: 7: bbccccdc5c ! 7: 23420dbd1b commit-graph: drop count_distinct_commits() function @@ Commit message the count are: - check if our count will overflow our data structures. But the limit - there is 2^31 commits, so it's not likely to happen in practice. + there is 2^31 commits, so while this is a useful check, the + off-by-one is not likely to matter. - pre-allocate the array of commit pointers. But over-allocating by - one isn't a problem. + one isn't a problem; we'll just waste a few extra bytes. The bug would be easy enough to fix, but we can observe that neither of - those steps is necessary. We'll check the count of the commit array - after we build it anyway, so checking at this point is redundant. And we - use ALLOC_GROW() when building the commit array, so there's no need to - preallocate it (it's possible that doing so is slightly more efficient, - but if we care we can just optimistically allocate one slot for each - oid; I didn't bother here). + those steps is necessary. + + After building the actual commit array, we'll likewise check its count + for overflow. So the extra check of the distinct commit count here is + redundant. + + And likewise we use ALLOC_GROW() when building the commit array, so + there's no need to preallocate it (it's possible that doing so is + slightly more efficient, but if we care we can just optimistically + allocate one slot for each oid; I didn't bother here). So count_distinct_commits() isn't doing anything useful. Let's just get rid of that step. -Peff
On 12/4/2020 3:50 PM, Jeff King wrote: > On Fri, Dec 04, 2020 at 03:47:43PM -0500, Derrick Stolee wrote: >>> Does that make more sense? >> Makes sense to me. No need to change this patch at all. > > I brushed up the commit message a bit to make those points clearer: LGTM. Thanks. -Stolee
On Fri, Dec 04 2020, Jeff King wrote: > When writing a commit graph, we collect a list of object ids in an > array, which we'll eventually copy into an array of "struct commit" > pointers. Before we do that, though, we count the number of distinct > commit entries. There's a subtle bug in this step, though. > [...] > - display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */ > - QSORT(ctx->oids.list, ctx->oids.nr, oid_compare); > [...] > + QSORT(ctx->oids.list, ctx->oids.nr, oid_compare); One thing your commit messag doesn't note is the removal of this TODO. Theoretically we still have the issue I noted in 890226ccb57 (commit-graph write: add itermediate progress, 2019-01-19), in practice I think it's fine to just forever remove this TODO. I tried now on a really big repo and couldn't even get the progress bar you removed to appear, or the one you moved the QSORT() to. Maybe I need to go even bigger, but more likely I overdid things a bit when I added all these progress bars to commit-graph.c at the time.
On Sat, Dec 05, 2020 at 03:26:56AM +0100, Ævar Arnfjörð Bjarmason wrote: > On Fri, Dec 04 2020, Jeff King wrote: > > > When writing a commit graph, we collect a list of object ids in an > > array, which we'll eventually copy into an array of "struct commit" > > pointers. Before we do that, though, we count the number of distinct > > commit entries. There's a subtle bug in this step, though. > > [...] > > - display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */ > > - QSORT(ctx->oids.list, ctx->oids.nr, oid_compare); > > [...] > > + QSORT(ctx->oids.list, ctx->oids.nr, oid_compare); > > One thing your commit messag doesn't note is the removal of this > TODO. Theoretically we still have the issue I noted in 890226ccb57 > (commit-graph write: add itermediate progress, 2019-01-19), in practice > I think it's fine to just forever remove this TODO. Yeah, I don't think we can really put progress bars on everything. If you have a dataset so large that sorting the commits by oid takes a long time, then I think you probably just need to accept that there are going to be some small pauses while Git works. -Peff
diff --git a/commit-graph.c b/commit-graph.c index 6f62a07313..1ac3516cf5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1588,35 +1588,6 @@ static void fill_oids_from_all_packs(struct write_commit_graph_context *ctx) stop_progress(&ctx->progress); } -static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx) -{ - uint32_t i, count_distinct = 1; - - if (ctx->report_progress) - ctx->progress = start_delayed_progress( - _("Counting distinct commits in commit graph"), - ctx->oids.nr); - display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */ - QSORT(ctx->oids.list, ctx->oids.nr, oid_compare); - - for (i = 1; i < ctx->oids.nr; i++) { - display_progress(ctx->progress, i + 1); - if (!oideq(&ctx->oids.list[i - 1], &ctx->oids.list[i])) { - if (ctx->split) { - struct commit *c = lookup_commit(ctx->r, &ctx->oids.list[i]); - - if (!c || commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH) - continue; - } - - count_distinct++; - } - } - stop_progress(&ctx->progress); - - return count_distinct; -} - static void copy_oids_to_commits(struct write_commit_graph_context *ctx) { uint32_t i; @@ -1628,6 +1599,7 @@ static void copy_oids_to_commits(struct write_commit_graph_context *ctx) ctx->progress = start_delayed_progress( _("Finding extra edges in commit graph"), ctx->oids.nr); + QSORT(ctx->oids.list, ctx->oids.nr, oid_compare); for (i = 0; i < ctx->oids.nr; i++) { unsigned int num_parents; @@ -2155,7 +2127,7 @@ int write_commit_graph(struct object_directory *odb, const struct commit_graph_opts *opts) { struct write_commit_graph_context *ctx; - uint32_t i, count_distinct = 0; + uint32_t i; int res = 0; int replace = 0; struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS; @@ -2268,17 +2240,6 @@ int write_commit_graph(struct object_directory *odb, close_reachable(ctx); - count_distinct = count_distinct_commits(ctx); - - if (count_distinct >= GRAPH_EDGE_LAST_MASK) { - error(_("the commit graph format cannot write %d commits"), count_distinct); - res = -1; - goto cleanup; - } - - ctx->commits.alloc = count_distinct; - ALLOC_ARRAY(ctx->commits.list, ctx->commits.alloc); - copy_oids_to_commits(ctx); if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) {
When writing a commit graph, we collect a list of object ids in an array, which we'll eventually copy into an array of "struct commit" pointers. Before we do that, though, we count the number of distinct commit entries. There's a subtle bug in this step, though. We eliminate not only duplicate oids, but also in split mode, any oids which are not commits or which are already in a graph file. However, the loop starts at index 1, always counting index 0 as distinct. And indeed it can't be a duplicate, since we check for those by comparing against the previous entry, and there isn't one for index 0. But it could be a commit that's already in a graph file, and we'd overcount the number of commits by 1 in that case. That turns out not to be a problem, though. The only things we do with the count are: - check if our count will overflow our data structures. But the limit there is 2^31 commits, so it's not likely to happen in practice. - pre-allocate the array of commit pointers. But over-allocating by one isn't a problem. The bug would be easy enough to fix, but we can observe that neither of those steps is necessary. We'll check the count of the commit array after we build it anyway, so checking at this point is redundant. And we use ALLOC_GROW() when building the commit array, so there's no need to preallocate it (it's possible that doing so is slightly more efficient, but if we care we can just optimistically allocate one slot for each oid; I didn't bother here). So count_distinct_commits() isn't doing anything useful. Let's just get rid of that step. Note that a side effect of the function was that we sorted the list of oids, which we do rely on in copy_oids_to_commits(), since it must also skip the duplicates. So we'll move the qsort there. Signed-off-by: Jeff King <peff@peff.net> --- commit-graph.c | 43 ++----------------------------------------- 1 file changed, 2 insertions(+), 41 deletions(-)