[7/9] commit-graph: drop count_distinct_commits() function

Message ID	X8qGTaIdnNa5mAfC@coredump.intra.peff.net (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Date: Fri, 4 Dec 2020 13:56:13 -0500 From: Jeff King <peff@peff.net> To: git@vger.kernel.org Cc: Derrick Stolee <dstolee@microsoft.com> Subject: [PATCH 7/9] commit-graph: drop count_distinct_commits() function Message-ID: <X8qGTaIdnNa5mAfC@coredump.intra.peff.net> References: <X8qEg/KiAQDugPC0@coredump.intra.peff.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <X8qEg/KiAQDugPC0@coredump.intra.peff.net> Precedence: bulk
Series	misc commit-graph and oid-array cleanups \| expand [0/9] misc commit-graph and oid-array cleanups [1/9] oid-array.h: drop sha1 mention from header guard [2/9] t0064: drop sha1 mention from filename [3/9] t0064: make duplicate tests more robust [4/9] cache.h: move hash/oid functions to hash.h [5/9] oid-array: make sort function public [6/9] oid-array: provide a for-loop iterator [7/9] commit-graph: drop count_distinct_commits() function [8/9] commit-graph: replace packed_oid_list with oid_array [9/9] commit-graph: use size_t for array allocation and indexing

Jeff King Dec. 4, 2020, 6:56 p.m. UTC

When writing a commit graph, we collect a list of object ids in an
array, which we'll eventually copy into an array of "struct commit"
pointers. Before we do that, though, we count the number of distinct
commit entries. There's a subtle bug in this step, though.

We eliminate not only duplicate oids, but also in split mode, any oids
which are not commits or which are already in a graph file. However, the
loop starts at index 1, always counting index 0 as distinct. And indeed
it can't be a duplicate, since we check for those by comparing against
the previous entry, and there isn't one for index 0. But it could be a
commit that's already in a graph file, and we'd overcount the number of
commits by 1 in that case.

That turns out not to be a problem, though. The only things we do with
the count are:

  - check if our count will overflow our data structures. But the limit
    there is 2^31 commits, so it's not likely to happen in practice.

  - pre-allocate the array of commit pointers. But over-allocating by
    one isn't a problem.

The bug would be easy enough to fix, but we can observe that neither of
those steps is necessary. We'll check the count of the commit array
after we build it anyway, so checking at this point is redundant. And we
use ALLOC_GROW() when building the commit array, so there's no need to
preallocate it (it's possible that doing so is slightly more efficient,
but if we care we can just optimistically allocate one slot for each
oid; I didn't bother here).

So count_distinct_commits() isn't doing anything useful. Let's just get
rid of that step.

Note that a side effect of the function was that we sorted the list of
oids, which we do rely on in copy_oids_to_commits(), since it must also
skip the duplicates. So we'll move the qsort there.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c | 43 ++-----------------------------------------
 1 file changed, 2 insertions(+), 41 deletions(-)

Derrick Stolee Dec. 4, 2020, 8:06 p.m. UTC | #1

On 12/4/2020 1:56 PM, Jeff King wrote:
> That turns out not to be a problem, though. The only things we do with
> the count are:
> 
>   - check if our count will overflow our data structures. But the limit
>     there is 2^31 commits, so it's not likely to happen in practice.

You do point out that you are removing this logic, done in this
if statement:

> -	if (count_distinct >= GRAPH_EDGE_LAST_MASK) {
> -		error(_("the commit graph format cannot write %d commits"), count_distinct);
> -		res = -1;
> -		goto cleanup;
> -	}

What is the harm of keeping this? I know it is very unlikely, but
I'd rather fail here than write a commit-graph that gets parsed as
garbage.

Thanks,
-Stolee

Jeff King Dec. 4, 2020, 8:42 p.m. UTC | #2

On Fri, Dec 04, 2020 at 03:06:24PM -0500, Derrick Stolee wrote:

> On 12/4/2020 1:56 PM, Jeff King wrote:
> > That turns out not to be a problem, though. The only things we do with
> > the count are:
> > 
> >   - check if our count will overflow our data structures. But the limit
> >     there is 2^31 commits, so it's not likely to happen in practice.
> 
> You do point out that you are removing this logic, done in this
> if statement:
> 
> > -	if (count_distinct >= GRAPH_EDGE_LAST_MASK) {
> > -		error(_("the commit graph format cannot write %d commits"), count_distinct);
> > -		res = -1;
> > -		goto cleanup;
> > -	}
> 
> What is the harm of keeping this? I know it is very unlikely, but
> I'd rather fail here than write a commit-graph that gets parsed as
> garbage.

I agree it's important to have. But this one is redundant. Look a few
lines below this hunk, and we have:

	copy_oids_to_commits(ctx);

	if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) {
		error(_("too many commits to write graph"));
		res = -1;
		goto cleanup;
	}

So before we were counting distinct commits, checking that our count
fits, and then _ignoring_ that count in order to create the actual list
of commits, and then checking that the actual list's count fits. We only
need one of those checks, and the important one is the one from the
actual list (they _should_ match, but due to the bug, they sometimes
didn't).

My "not likely to happen in practice" is not about the quality of the
check, but rather that being off by one would never matter in practice.

Does that make more sense?

-Peff

Derrick Stolee Dec. 4, 2020, 8:47 p.m. UTC | #3

On 12/4/2020 3:42 PM, Jeff King wrote:
> On Fri, Dec 04, 2020 at 03:06:24PM -0500, Derrick Stolee wrote:
> 
>> On 12/4/2020 1:56 PM, Jeff King wrote:
>>> That turns out not to be a problem, though. The only things we do with
>>> the count are:
>>>
>>>   - check if our count will overflow our data structures. But the limit
>>>     there is 2^31 commits, so it's not likely to happen in practice.
>>
>> You do point out that you are removing this logic, done in this
>> if statement:
>>
>>> -	if (count_distinct >= GRAPH_EDGE_LAST_MASK) {
>>> -		error(_("the commit graph format cannot write %d commits"), count_distinct);
>>> -		res = -1;
>>> -		goto cleanup;
>>> -	}
>>
>> What is the harm of keeping this? I know it is very unlikely, but
>> I'd rather fail here than write a commit-graph that gets parsed as
>> garbage.
> 
> I agree it's important to have. But this one is redundant. Look a few
> lines below this hunk, and we have:
> 
> 	copy_oids_to_commits(ctx);
> 
> 	if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) {
> 		error(_("too many commits to write graph"));
> 		res = -1;
> 		goto cleanup;
> 	}


Yep, my fault for not looking further in the context of your patch.

> So before we were counting distinct commits, checking that our count
> fits, and then _ignoring_ that count in order to create the actual list
> of commits, and then checking that the actual list's count fits. We only
> need one of those checks, and the important one is the one from the
> actual list (they _should_ match, but due to the bug, they sometimes
> didn't).
> 
> My "not likely to happen in practice" is not about the quality of the
> check, but rather that being off by one would never matter in practice.
> 
> Does that make more sense?
Makes sense to me. No need to change this patch at all.

Thanks,
-Stolee

Jeff King Dec. 4, 2020, 8:50 p.m. UTC | #4

On Fri, Dec 04, 2020 at 03:47:43PM -0500, Derrick Stolee wrote:

> > So before we were counting distinct commits, checking that our count
> > fits, and then _ignoring_ that count in order to create the actual list
> > of commits, and then checking that the actual list's count fits. We only
> > need one of those checks, and the important one is the one from the
> > actual list (they _should_ match, but due to the bug, they sometimes
> > didn't).
> > 
> > My "not likely to happen in practice" is not about the quality of the
> > check, but rather that being off by one would never matter in practice.
> > 
> > Does that make more sense?
> Makes sense to me. No need to change this patch at all.

I brushed up the commit message a bit to make those points clearer:

 7:  bbccccdc5c !  7:  23420dbd1b commit-graph: drop count_distinct_commits() function
    @@ Commit message
         the count are:
     
           - check if our count will overflow our data structures. But the limit
    -        there is 2^31 commits, so it's not likely to happen in practice.
    +        there is 2^31 commits, so while this is a useful check, the
    +        off-by-one is not likely to matter.
     
           - pre-allocate the array of commit pointers. But over-allocating by
    -        one isn't a problem.
    +        one isn't a problem; we'll just waste a few extra bytes.
     
         The bug would be easy enough to fix, but we can observe that neither of
    -    those steps is necessary. We'll check the count of the commit array
    -    after we build it anyway, so checking at this point is redundant. And we
    -    use ALLOC_GROW() when building the commit array, so there's no need to
    -    preallocate it (it's possible that doing so is slightly more efficient,
    -    but if we care we can just optimistically allocate one slot for each
    -    oid; I didn't bother here).
    +    those steps is necessary.
    +
    +    After building the actual commit array, we'll likewise check its count
    +    for overflow. So the extra check of the distinct commit count here is
    +    redundant.
    +
    +    And likewise we use ALLOC_GROW() when building the commit array, so
    +    there's no need to preallocate it (it's possible that doing so is
    +    slightly more efficient, but if we care we can just optimistically
    +    allocate one slot for each oid; I didn't bother here).
     
         So count_distinct_commits() isn't doing anything useful. Let's just get
         rid of that step.

-Peff

Derrick Stolee Dec. 4, 2020, 9:01 p.m. UTC | #5

On 12/4/2020 3:50 PM, Jeff King wrote:
> On Fri, Dec 04, 2020 at 03:47:43PM -0500, Derrick Stolee wrote:
>>> Does that make more sense?
>> Makes sense to me. No need to change this patch at all.
> 
> I brushed up the commit message a bit to make those points clearer:

LGTM. Thanks.

-Stolee

Ævar Arnfjörð Bjarmason Dec. 5, 2020, 2:26 a.m. UTC | #6

On Fri, Dec 04 2020, Jeff King wrote:

> When writing a commit graph, we collect a list of object ids in an
> array, which we'll eventually copy into an array of "struct commit"
> pointers. Before we do that, though, we count the number of distinct
> commit entries. There's a subtle bug in this step, though.
> [...]
> -	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
> -	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
> [...]
> +	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);

One thing your commit messag doesn't note is the removal of this
TODO. Theoretically we still have the issue I noted in 890226ccb57
(commit-graph write: add itermediate progress, 2019-01-19), in practice
I think it's fine to just forever remove this TODO.

I tried now on a really big repo and couldn't even get the progress bar
you removed to appear, or the one you moved the QSORT() to.

Maybe I need to go even bigger, but more likely I overdid things a bit
when I added all these progress bars to commit-graph.c at the time.

Jeff King Dec. 7, 2020, 7:01 p.m. UTC | #7

On Sat, Dec 05, 2020 at 03:26:56AM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Dec 04 2020, Jeff King wrote:
> 
> > When writing a commit graph, we collect a list of object ids in an
> > array, which we'll eventually copy into an array of "struct commit"
> > pointers. Before we do that, though, we count the number of distinct
> > commit entries. There's a subtle bug in this step, though.
> > [...]
> > -	display_progress(ctx->progress, 0); /* TODO: Measure QSORT() progress */
> > -	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
> > [...]
> > +	QSORT(ctx->oids.list, ctx->oids.nr, oid_compare);
> 
> One thing your commit messag doesn't note is the removal of this
> TODO. Theoretically we still have the issue I noted in 890226ccb57
> (commit-graph write: add itermediate progress, 2019-01-19), in practice
> I think it's fine to just forever remove this TODO.

Yeah, I don't think we can really put progress bars on everything. If
you have a dataset so large that sorting the commits by oid takes a long
time, then I think you probably just need to accept that there are going
to be some small pauses while Git works.

-Peff

[7/9] commit-graph: drop count_distinct_commits() function

Commit Message

Comments

Patch