diff mbox series

[3/7] commit-graph: start parsing generation v2 (again)

Message ID a3436b92a32f7f6dd02ad61eb2337a4d088d5e9c.1645735117.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Commit-graph: Generation Number v2 Fixes, v3 implementation | expand

Commit Message

Derrick Stolee Feb. 24, 2022, 8:38 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.

This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.

The problem here is that read_generation_data is not initialized to be
non-zero anywhere!

This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.

The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!

This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.

First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:

hyperfine \
	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
	--warmup=10

Benchmark 1: old
  Time (mean ± σ):      57.1 ms ±   3.1 ms
  Range (min … max):    52.9 ms …  62.0 ms    55 runs

Benchmark 2: new
  Time (mean ± σ):      45.5 ms ±   3.3 ms
  Range (min … max):    39.9 ms …  51.7 ms    59 runs

Summary
  'new' ran
    1.26 ± 0.11 times faster than 'old'

These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.

However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.

hyperfine \
        -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
        -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
        --warmup=10

Benchmark 1: old
  Time (mean ± σ):     110.4 ms ±   6.4 ms
  Range (min … max):    96.0 ms … 118.3 ms    25 runs

Benchmark 2: new
  Time (mean ± σ):     150.7 ms ±   1.1 ms
  Range (min … max):   149.3 ms … 153.4 ms    19 runs

Summary
  'old' ran
    1.36 ± 0.08 times faster than 'new'

Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).

In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Date chunk instead of
having that data in an extra chunk.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c                |  3 +++
 t/t4216-log-bloom.sh          |  2 +-
 t/t5318-commit-graph.sh       | 14 ++++++++++++--
 t/t5324-split-commit-graph.sh |  9 +++++++--
 4 files changed, 23 insertions(+), 5 deletions(-)

Comments

Patrick Steinhardt Feb. 28, 2022, 3:18 p.m. UTC | #1
On Thu, Feb 24, 2022 at 08:38:32PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> The 'read_generation_data' member of 'struct commit_graph' was
> introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
> chain does, 2021-01-16). The intention was to avoid using corrected
> commit dates if not all layers of a commit-graph had that data stored.
> The logic in validate_mixed_generation_chain() at that point incorrectly
> initialized read_generation_data to 1 if and only if the tip
> commit-graph contained the Corrected Commit Date chunk.
> 
> This was "fixed" in 448a39e65 (commit-graph: validate layers for
> generation data, 2021-02-02) to validate that read_generation_data was
> either non-zero for all layers, or it would set read_generation_data to
> zero for all layers.
> 
> The problem here is that read_generation_data is not initialized to be
> non-zero anywhere!
> 
> This change initializes read_generation_data immediately after the chunk
> is parsed, so each layer will have its value present as soon as
> possible.
> 
> The read_generation_data member is used in fill_commit_graph_info() to
> determine if we should use the corrected commit date or the topological
> levels stored in the Commit Data chunk. Due to this bug, all previous
> versions of Git were defaulting to topological levels in all cases!
> 
> This can be measured with some performance tests. Using the Linux kernel
> as a testbed, I generated a complete commit-graph containing corrected
> commit dates and tested the 'new' version against the previous, 'old'
> version.
> 
> First, rev-list with --topo-order demonstrates a 26% improvement using
> corrected commit dates:
> 
> hyperfine \
> 	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
> 	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
> 	--warmup=10
> 
> Benchmark 1: old
>   Time (mean ± σ):      57.1 ms ±   3.1 ms
>   Range (min … max):    52.9 ms …  62.0 ms    55 runs
> 
> Benchmark 2: new
>   Time (mean ± σ):      45.5 ms ±   3.3 ms
>   Range (min … max):    39.9 ms …  51.7 ms    59 runs
> 
> Summary
>   'new' ran
>     1.26 ± 0.11 times faster than 'old'
> 
> These performance improvements are due to the algorithmic improvements
> given by walking fewer commits due to the higher cutoffs from corrected
> commit dates.
> 
> However, this comes at a cost. The additional I/O cost of parsing the
> corrected commit dates is visible in case of merge-base commands that do
> not reduce the overall number of walked commits.
> 
> hyperfine \
>         -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
>         -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
>         --warmup=10
> 
> Benchmark 1: old
>   Time (mean ± σ):     110.4 ms ±   6.4 ms
>   Range (min … max):    96.0 ms … 118.3 ms    25 runs
> 
> Benchmark 2: new
>   Time (mean ± σ):     150.7 ms ±   1.1 ms
>   Range (min … max):   149.3 ms … 153.4 ms    19 runs
> 
> Summary
>   'old' ran
>     1.36 ± 0.08 times faster than 'new'
> 
> Performance issues like this are what motivated 702110aac (commit-graph:
> use config to specify generation type, 2021-02-25).
> 
> In the future, we could fix this performance problem by inserting the
> corrected commit date offsets into the Commit Date chunk instead of
> having that data in an extra chunk.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  commit-graph.c                |  3 +++
>  t/t4216-log-bloom.sh          |  2 +-
>  t/t5318-commit-graph.sh       | 14 ++++++++++++--
>  t/t5324-split-commit-graph.sh |  9 +++++++--
>  4 files changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index a19bd96c2ee..8e52bb09552 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  			&graph->chunk_generation_data);
>  		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
>  			&graph->chunk_generation_data_overflow);
> +
> +		if (graph->chunk_generation_data)
> +			graph->read_generation_data = 1;
>  	}
>  
>  	if (r->settings.commit_graph_read_changed_paths) {

I wanted to test your changes because they seem quite exciting in the
context of my work as well, but this commit seems to uncover a bug with
how we handle overflows. I originally triggered the bug when trying to
do a mirror-fetch, but as it turns it seems to trigger now whenever the
commit-graph is being read:

    $ git commit-graph verify
    fatal: commit-graph requires overflow generation data but has none

    $ git commit-graph write --split
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    fatal: commit-graph requires overflow generation data but has none

    $ git commit-graph write --split=replace
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    fatal: commit-graph requires overflow generation data but has none

I initially assumed this may be a bug with how we previously wrote the
commit-graph, but removing all chains still reliably triggers it:

    $ rm -f objects/info/commit-graphs/*
    $ git commit-graph write --split
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    fatal: commit-graph requires overflow generation data but has none

I haven't yet found the time to dig deeper into why this is happening.
While the repository is publicly accessible at [1], unfortunately the
bug seems to be triggered by a commit that's only kept alive by an
internal reference.

Patrick

[1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
Derrick Stolee Feb. 28, 2022, 4:23 p.m. UTC | #2
On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> On Thu, Feb 24, 2022 at 08:38:32PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>
...
>> diff --git a/commit-graph.c b/commit-graph.c
>> index a19bd96c2ee..8e52bb09552 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>>  			&graph->chunk_generation_data);
>>  		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
>>  			&graph->chunk_generation_data_overflow);
>> +
>> +		if (graph->chunk_generation_data)
>> +			graph->read_generation_data = 1;
>>  	}
>>  
>>  	if (r->settings.commit_graph_read_changed_paths) {
> 
> I wanted to test your changes because they seem quite exciting in the
> context of my work as well, but this commit seems to uncover a bug with
> how we handle overflows. I originally triggered the bug when trying to
> do a mirror-fetch, but as it turns it seems to trigger now whenever the
> commit-graph is being read:
> 
>     $ git commit-graph verify
>     fatal: commit-graph requires overflow generation data but has none
> 
>     $ git commit-graph write --split
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
>     $ git commit-graph write --split=replace
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
> I initially assumed this may be a bug with how we previously wrote the
> commit-graph, but removing all chains still reliably triggers it:
> 
>     $ rm -f objects/info/commit-graphs/*
>     $ git commit-graph write --split
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
> I haven't yet found the time to dig deeper into why this is happening.
> While the repository is publicly accessible at [1], unfortunately the
> bug seems to be triggered by a commit that's only kept alive by an
> internal reference.
> 
> Patrick
> 
> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git

Thanks for including this information. Just to be clear: did you
include patch 4 in your tests, or not? Patch 4 includes a fix
related to overflow values, so it would be helpful to know if you
found a _different_ bug or if it is the same one.

Thanks,
-Stolee
Patrick Steinhardt Feb. 28, 2022, 4:59 p.m. UTC | #3
On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> > On Thu, Feb 24, 2022 at 08:38:32PM +0000, Derrick Stolee via GitGitGadget wrote:
> >> From: Derrick Stolee <derrickstolee@github.com>
> ...
> >> diff --git a/commit-graph.c b/commit-graph.c
> >> index a19bd96c2ee..8e52bb09552 100644
> >> --- a/commit-graph.c
> >> +++ b/commit-graph.c
> >> @@ -407,6 +407,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >>  			&graph->chunk_generation_data);
> >>  		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
> >>  			&graph->chunk_generation_data_overflow);
> >> +
> >> +		if (graph->chunk_generation_data)
> >> +			graph->read_generation_data = 1;
> >>  	}
> >>  
> >>  	if (r->settings.commit_graph_read_changed_paths) {
> > 
> > I wanted to test your changes because they seem quite exciting in the
> > context of my work as well, but this commit seems to uncover a bug with
> > how we handle overflows. I originally triggered the bug when trying to
> > do a mirror-fetch, but as it turns it seems to trigger now whenever the
> > commit-graph is being read:
> > 
> >     $ git commit-graph verify
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> >     $ git commit-graph write --split
> >     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> >     $ git commit-graph write --split=replace
> >     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > I initially assumed this may be a bug with how we previously wrote the
> > commit-graph, but removing all chains still reliably triggers it:
> > 
> >     $ rm -f objects/info/commit-graphs/*
> >     $ git commit-graph write --split
> >     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > I haven't yet found the time to dig deeper into why this is happening.
> > While the repository is publicly accessible at [1], unfortunately the
> > bug seems to be triggered by a commit that's only kept alive by an
> > internal reference.
> > 
> > Patrick
> > 
> > [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> 
> Thanks for including this information. Just to be clear: did you
> include patch 4 in your tests, or not? Patch 4 includes a fix
> related to overflow values, so it would be helpful to know if you
> found a _different_ bug or if it is the same one.
> 
> Thanks,
> -Stolee

I initially only applied the first three patches, but after having hit
the fatal error I also applied the rest of this series to have a look at
whether it is indeed fixed already by one of your later patches. The
error remains the same though.

Patrick
Derrick Stolee Feb. 28, 2022, 6:44 p.m. UTC | #4
On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
>>> I haven't yet found the time to dig deeper into why this is happening.
>>> While the repository is publicly accessible at [1], unfortunately the
>>> bug seems to be triggered by a commit that's only kept alive by an
>>> internal reference.
>>>
>>> Patrick
>>>
>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
>>
>> Thanks for including this information. Just to be clear: did you
>> include patch 4 in your tests, or not? Patch 4 includes a fix
>> related to overflow values, so it would be helpful to know if you
>> found a _different_ bug or if it is the same one.
>>
>> Thanks,
>> -Stolee
> 
> I initially only applied the first three patches, but after having hit
> the fatal error I also applied the rest of this series to have a look at
> whether it is indeed fixed already by one of your later patches. The
> error remains the same though.

Thanks for this extra context. Is this a commit-graph that you wrote
with the first three patches and then you get an error when reading it?

Do you get the same error when deleting that file and rewriting it with
all patches included?

Thanks,
-Stolee
Patrick Steinhardt March 1, 2022, 9:46 a.m. UTC | #5
On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> > On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> >> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> >>> I haven't yet found the time to dig deeper into why this is happening.
> >>> While the repository is publicly accessible at [1], unfortunately the
> >>> bug seems to be triggered by a commit that's only kept alive by an
> >>> internal reference.
> >>>
> >>> Patrick
> >>>
> >>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> >>
> >> Thanks for including this information. Just to be clear: did you
> >> include patch 4 in your tests, or not? Patch 4 includes a fix
> >> related to overflow values, so it would be helpful to know if you
> >> found a _different_ bug or if it is the same one.
> >>
> >> Thanks,
> >> -Stolee
> > 
> > I initially only applied the first three patches, but after having hit
> > the fatal error I also applied the rest of this series to have a look at
> > whether it is indeed fixed already by one of your later patches. The
> > error remains the same though.
> 
> Thanks for this extra context. Is this a commit-graph that you wrote
> with the first three patches and then you get an error when reading it?
> 
> Do you get the same error when deleting that file and rewriting it with
> all patches included?
> 
> Thanks,
> -Stolee

Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
(The eighth batch, 2022-02-25) and still get the same results:

    $ find objects/info/commit-graphs/
    objects/info/commit-graphs/
    objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
    objects/info/commit-graphs/commit-graph-chain
    objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
    objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
    objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph

     $ git commit-graph verify
    fatal: commit-graph requires overflow generation data but has none

     $ git commit-graph write
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    Expanding reachable commits in commit graph: 2197197, done.
    Finding extra edges in commit graph: 100% (2197197/2197197), done.
    fatal: commit-graph requires overflow generation data but has none

     $ rm -rf objects/info/commit-graphs/

     $ git commit-graph write
    Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
    Expanding reachable commits in commit graph: 2197197, done.
    Finding extra edges in commit graph: 100% (2197197/2197197), done.
    fatal: commit-graph requires overflow generation data but has none)

So even generating them completely anew doesn't seem to generate the
overflow generation data.

Patrick
Patrick Steinhardt March 1, 2022, 10:35 a.m. UTC | #6
On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> > On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> > > On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> > >> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> > >>> I haven't yet found the time to dig deeper into why this is happening.
> > >>> While the repository is publicly accessible at [1], unfortunately the
> > >>> bug seems to be triggered by a commit that's only kept alive by an
> > >>> internal reference.
> > >>>
> > >>> Patrick
> > >>>
> > >>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> > >>
> > >> Thanks for including this information. Just to be clear: did you
> > >> include patch 4 in your tests, or not? Patch 4 includes a fix
> > >> related to overflow values, so it would be helpful to know if you
> > >> found a _different_ bug or if it is the same one.
> > >>
> > >> Thanks,
> > >> -Stolee
> > > 
> > > I initially only applied the first three patches, but after having hit
> > > the fatal error I also applied the rest of this series to have a look at
> > > whether it is indeed fixed already by one of your later patches. The
> > > error remains the same though.
> > 
> > Thanks for this extra context. Is this a commit-graph that you wrote
> > with the first three patches and then you get an error when reading it?
> > 
> > Do you get the same error when deleting that file and rewriting it with
> > all patches included?
> > 
> > Thanks,
> > -Stolee
> 
> Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
> (The eighth batch, 2022-02-25) and still get the same results:
> 
>     $ find objects/info/commit-graphs/
>     objects/info/commit-graphs/
>     objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
>     objects/info/commit-graphs/commit-graph-chain
>     objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
>     objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
>     objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph
> 
>      $ git commit-graph verify
>     fatal: commit-graph requires overflow generation data but has none
> 
>      $ git commit-graph write
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     Expanding reachable commits in commit graph: 2197197, done.
>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>     fatal: commit-graph requires overflow generation data but has none
> 
>      $ rm -rf objects/info/commit-graphs/
> 
>      $ git commit-graph write
>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>     Expanding reachable commits in commit graph: 2197197, done.
>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>     fatal: commit-graph requires overflow generation data but has none)
> 
> So even generating them completely anew doesn't seem to generate the
> overflow generation data.
> 
> Patrick

I stand corrected. I forgot that the repository at hand was connected to
another one via `objects/info/alternates`. If I prune commit-graphs from
that alternate, too, then it works alright with your patches.

This makes me wonder how such a bugfix should be handled though. As this
series is right now, users will be faced with repository corruption as
soon as they upgrade their Git version to one that contains this patch
series. This corruption needs manual action: they have to go into the
repository, delete the commit-graphs and then optionally create new
ones.

This is not a good user experience, and it's worse on the server-side
where we now have a timeframe where all commit-graphs are potentially
corrupt. This effectively leads to us being unable to serve those repos
at all until we have rewritten the commit-graphs because all commands
which make use of the commit-graph will now die:

    $ git log
    fatal: commit-graph requires overflow generation data but has none

So the question is whether this is a change that needs to be rolled out
over multiple releases. First we'd get in the bug fix such that we write
correct commit-graphs, and after this fix has been released we can also
release the fix that starts to actually parse the generation. This
ensures there's a grace period during which we can hopefully correct the
data on-disk such that users are not faced with failures.

The better alternative would probably be to just gracefully handle
commit-graphs which are corrupted in such a way. Can we maybe just
continue to not parse generations in case we find that the commit-graph
doesn't have overflow generation data?

This is more of a general issue though: commit-graphs are an auxiliary
cache that is not required for proper operation at all. If we fail to
parse it, then Git shouldn't die but instead fail gracefully just ignore
it. Furthermore, if we notice that graphs are corrupt when we try to
write new ones, we may just delete the corrupt versions automatically
and generate completely new ones.

Patrick
Derrick Stolee March 1, 2022, 2:06 p.m. UTC | #7
On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
> On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
>> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
>>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
>>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
>>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
>>>>>> I haven't yet found the time to dig deeper into why this is happening.
>>>>>> While the repository is publicly accessible at [1], unfortunately the
>>>>>> bug seems to be triggered by a commit that's only kept alive by an
>>>>>> internal reference.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
>>>>>
>>>>> Thanks for including this information. Just to be clear: did you
>>>>> include patch 4 in your tests, or not? Patch 4 includes a fix
>>>>> related to overflow values, so it would be helpful to know if you
>>>>> found a _different_ bug or if it is the same one.
>>>>>
>>>>> Thanks,
>>>>> -Stolee
>>>>
>>>> I initially only applied the first three patches, but after having hit
>>>> the fatal error I also applied the rest of this series to have a look at
>>>> whether it is indeed fixed already by one of your later patches. The
>>>> error remains the same though.
>>>
>>> Thanks for this extra context. Is this a commit-graph that you wrote
>>> with the first three patches and then you get an error when reading it?
>>>
>>> Do you get the same error when deleting that file and rewriting it with
>>> all patches included?
>>>
>>> Thanks,
>>> -Stolee
>>
>> Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
>> (The eighth batch, 2022-02-25) and still get the same results:
>>
>>     $ find objects/info/commit-graphs/
>>     objects/info/commit-graphs/
>>     objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
>>     objects/info/commit-graphs/commit-graph-chain
>>     objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
>>     objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
>>     objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph
>>
>>      $ git commit-graph verify
>>     fatal: commit-graph requires overflow generation data but has none
>>
>>      $ git commit-graph write
>>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>>     Expanding reachable commits in commit graph: 2197197, done.
>>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>>     fatal: commit-graph requires overflow generation data but has none
>>
>>      $ rm -rf objects/info/commit-graphs/
>>
>>      $ git commit-graph write
>>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
>>     Expanding reachable commits in commit graph: 2197197, done.
>>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
>>     fatal: commit-graph requires overflow generation data but has none)
>>
>> So even generating them completely anew doesn't seem to generate the
>> overflow generation data.
>>
>> Patrick
> 
> I stand corrected. I forgot that the repository at hand was connected to
> another one via `objects/info/alternates`. If I prune commit-graphs from
> that alternate, too, then it works alright with your patches.

OK, thanks. That clarifies the situation.

I ordered the patches such that the fix in patch 4 could be immediately
testable, which is not the case without patch 3. However, it does leave
this temporary state where information can be incorrect if only a subset
of the series is applied.

> This makes me wonder how such a bugfix should be handled though. As this
> series is right now, users will be faced with repository corruption as
> soon as they upgrade their Git version to one that contains this patch
> series. This corruption needs manual action: they have to go into the
> repository, delete the commit-graphs and then optionally create new
> ones.
> 
> This is not a good user experience, and it's worse on the server-side
> where we now have a timeframe where all commit-graphs are potentially
> corrupt. This effectively leads to us being unable to serve those repos
> at all until we have rewritten the commit-graphs because all commands
> which make use of the commit-graph will now die:
> 
>     $ git log
>     fatal: commit-graph requires overflow generation data but has none
> 
> So the question is whether this is a change that needs to be rolled out
> over multiple releases. First we'd get in the bug fix such that we write
> correct commit-graphs, and after this fix has been released we can also
> release the fix that starts to actually parse the generation. This
> ensures there's a grace period during which we can hopefully correct the
> data on-disk such that users are not faced with failures.

You are right that we need to be careful here, but I also think that
previous versions of Git always wrote the correct data. Here is my
thought process:

1. To get this bug, we need to have parsed the corrected commit date
   from an existing commit-graph in order to under-count the number
   of overflow values.

2. Before this series, Git versions were not parsing the corrected
   commit date, so they recompute the corrected commit date every
   time the commit-graph is written, getting the proper count of
   overflow values.

For these reasons, data written by previous versions of Git are
correct and can be trusted without a staged release.

Does this make sense? Or, do you experience a different result when
you build commit-graphs with a released Git version and then when
writing on top with all patches applied?

> The better alternative would probably be to just gracefully handle
> commit-graphs which are corrupted in such a way. Can we maybe just
> continue to not parse generations in case we find that the commit-graph
> doesn't have overflow generation data?
>
> This is more of a general issue though: commit-graphs are an auxiliary
> cache that is not required for proper operation at all. If we fail to
> parse it, then Git shouldn't die but instead fail gracefully just ignore
> it. Furthermore, if we notice that graphs are corrupt when we try to
> write new ones, we may just delete the corrupt versions automatically
> and generate completely new ones.

You are right that we can be better about failures here and report
and error instead of a die(). Especially in this case, we could just
revert to topological levels instead of throwing out the commit-graph
entirely.

This seems like something for another series, so we can be sure to
audit all cases of fatal errors when parsing the commit-graph so we
catch all of them and do the "best" thing in each case.

Thanks,
-Stolee
Patrick Steinhardt March 1, 2022, 2:53 p.m. UTC | #8
On Tue, Mar 01, 2022 at 09:06:44AM -0500, Derrick Stolee wrote:
> On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
> > On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
> >> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> >>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> >>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> >>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> >>>>>> I haven't yet found the time to dig deeper into why this is happening.
> >>>>>> While the repository is publicly accessible at [1], unfortunately the
> >>>>>> bug seems to be triggered by a commit that's only kept alive by an
> >>>>>> internal reference.
> >>>>>>
> >>>>>> Patrick
> >>>>>>
> >>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> >>>>>
> >>>>> Thanks for including this information. Just to be clear: did you
> >>>>> include patch 4 in your tests, or not? Patch 4 includes a fix
> >>>>> related to overflow values, so it would be helpful to know if you
> >>>>> found a _different_ bug or if it is the same one.
> >>>>>
> >>>>> Thanks,
> >>>>> -Stolee
> >>>>
> >>>> I initially only applied the first three patches, but after having hit
> >>>> the fatal error I also applied the rest of this series to have a look at
> >>>> whether it is indeed fixed already by one of your later patches. The
> >>>> error remains the same though.
> >>>
> >>> Thanks for this extra context. Is this a commit-graph that you wrote
> >>> with the first three patches and then you get an error when reading it?
> >>>
> >>> Do you get the same error when deleting that file and rewriting it with
> >>> all patches included?
> >>>
> >>> Thanks,
> >>> -Stolee
> >>
> >> Yes, I do. I've applied all four patches from v2 on top of 715d08a9e5
> >> (The eighth batch, 2022-02-25) and still get the same results:
> >>
> >>     $ find objects/info/commit-graphs/
> >>     objects/info/commit-graphs/
> >>     objects/info/commit-graphs/graph-607e641165f3e83a82d5b14af4e611bf2a688f35.graph
> >>     objects/info/commit-graphs/commit-graph-chain
> >>     objects/info/commit-graphs/graph-5f357c7573c0075d42d82b28e660bc3eac01bfe8.graph
> >>     objects/info/commit-graphs/graph-e0c12ead1b61c7c30720ae372e8a9f98d95dfb2d.graph
> >>     objects/info/commit-graphs/graph-c96723b133c2d81106a01ecd7a8773bb2ef6c2e1.graph
> >>
> >>      $ git commit-graph verify
> >>     fatal: commit-graph requires overflow generation data but has none
> >>
> >>      $ git commit-graph write
> >>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >>     Expanding reachable commits in commit graph: 2197197, done.
> >>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
> >>     fatal: commit-graph requires overflow generation data but has none
> >>
> >>      $ rm -rf objects/info/commit-graphs/
> >>
> >>      $ git commit-graph write
> >>     Finding commits for commit graph among packed objects: 100% (10235119/10235119), done.
> >>     Expanding reachable commits in commit graph: 2197197, done.
> >>     Finding extra edges in commit graph: 100% (2197197/2197197), done.
> >>     fatal: commit-graph requires overflow generation data but has none)
> >>
> >> So even generating them completely anew doesn't seem to generate the
> >> overflow generation data.
> >>
> >> Patrick
> > 
> > I stand corrected. I forgot that the repository at hand was connected to
> > another one via `objects/info/alternates`. If I prune commit-graphs from
> > that alternate, too, then it works alright with your patches.
> 
> OK, thanks. That clarifies the situation.
> 
> I ordered the patches such that the fix in patch 4 could be immediately
> testable, which is not the case without patch 3. However, it does leave
> this temporary state where information can be incorrect if only a subset
> of the series is applied.
> 
> > This makes me wonder how such a bugfix should be handled though. As this
> > series is right now, users will be faced with repository corruption as
> > soon as they upgrade their Git version to one that contains this patch
> > series. This corruption needs manual action: they have to go into the
> > repository, delete the commit-graphs and then optionally create new
> > ones.
> > 
> > This is not a good user experience, and it's worse on the server-side
> > where we now have a timeframe where all commit-graphs are potentially
> > corrupt. This effectively leads to us being unable to serve those repos
> > at all until we have rewritten the commit-graphs because all commands
> > which make use of the commit-graph will now die:
> > 
> >     $ git log
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > So the question is whether this is a change that needs to be rolled out
> > over multiple releases. First we'd get in the bug fix such that we write
> > correct commit-graphs, and after this fix has been released we can also
> > release the fix that starts to actually parse the generation. This
> > ensures there's a grace period during which we can hopefully correct the
> > data on-disk such that users are not faced with failures.
> 
> You are right that we need to be careful here, but I also think that
> previous versions of Git always wrote the correct data. Here is my
> thought process:
> 
> 1. To get this bug, we need to have parsed the corrected commit date
>    from an existing commit-graph in order to under-count the number
>    of overflow values.
> 
> 2. Before this series, Git versions were not parsing the corrected
>    commit date, so they recompute the corrected commit date every
>    time the commit-graph is written, getting the proper count of
>    overflow values.
> 
> For these reasons, data written by previous versions of Git are
> correct and can be trusted without a staged release.
> 
> Does this make sense? Or, do you experience a different result when
> you build commit-graphs with a released Git version and then when
> writing on top with all patches applied?

Just to verify my understanding: you claim that the bug I was hitting
shouldn't be encountered in the wild when the release , but
only if one were to write a commit-graph with the intermediate stafe
until patch 3/4 of your patch series?

Hum. I have re-verified, and this indeed seems to play out. So I must've
accidentally ran all my testing with the state generated without the
final patch which fixes the corruption. I do see lots of the following
warnings, but overall I can verify and write the commit-graph just fine:

    commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139

Thanks for your patience, and sorry for the noise :)

> > The better alternative would probably be to just gracefully handle
> > commit-graphs which are corrupted in such a way. Can we maybe just
> > continue to not parse generations in case we find that the commit-graph
> > doesn't have overflow generation data?
> >
> > This is more of a general issue though: commit-graphs are an auxiliary
> > cache that is not required for proper operation at all. If we fail to
> > parse it, then Git shouldn't die but instead fail gracefully just ignore
> > it. Furthermore, if we notice that graphs are corrupt when we try to
> > write new ones, we may just delete the corrupt versions automatically
> > and generate completely new ones.
> 
> You are right that we can be better about failures here and report
> and error instead of a die(). Especially in this case, we could just
> revert to topological levels instead of throwing out the commit-graph
> entirely.
> 
> This seems like something for another series, so we can be sure to
> audit all cases of fatal errors when parsing the commit-graph so we
> catch all of them and do the "best" thing in each case.

I agree.

Patrick
Derrick Stolee March 1, 2022, 3:25 p.m. UTC | #9
On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> On Tue, Mar 01, 2022 at 09:06:44AM -0500, Derrick Stolee wrote:
>> On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
>>> On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
>>>> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
>>>>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
>>>>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
>>>>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
>>>>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
...
>>> So the question is whether this is a change that needs to be rolled out
>>> over multiple releases. First we'd get in the bug fix such that we write
>>> correct commit-graphs, and after this fix has been released we can also
>>> release the fix that starts to actually parse the generation. This
>>> ensures there's a grace period during which we can hopefully correct the
>>> data on-disk such that users are not faced with failures.
>>
>> You are right that we need to be careful here, but I also think that
>> previous versions of Git always wrote the correct data. Here is my
>> thought process:
>>
>> 1. To get this bug, we need to have parsed the corrected commit date
>>    from an existing commit-graph in order to under-count the number
>>    of overflow values.
>>
>> 2. Before this series, Git versions were not parsing the corrected
>>    commit date, so they recompute the corrected commit date every
>>    time the commit-graph is written, getting the proper count of
>>    overflow values.
>>
>> For these reasons, data written by previous versions of Git are
>> correct and can be trusted without a staged release.
>>
>> Does this make sense? Or, do you experience a different result when
>> you build commit-graphs with a released Git version and then when
>> writing on top with all patches applied?
> 
> Just to verify my understanding: you claim that the bug I was hitting
> shouldn't be encountered in the wild when the release , but
> only if one were to write a commit-graph with the intermediate stafe
> until patch 3/4 of your patch series?

That is my claim. And my testing of the repo at [1] has demonstrated
that it works correctly in these cases.
 
> Hum. I have re-verified, and this indeed seems to play out. So I must've
> accidentally ran all my testing with the state generated without the
> final patch which fixes the corruption. I do see lots of the following
> warnings, but overall I can verify and write the commit-graph just fine:
> 
>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139

But I'm not able to generate these warnings from either version. I
tried generating different levels of a split commit-graph, but
could not reproduce it. If you have reproduction steps using current
'master' (or any released Git version) and the four patches here,
then I would love to get a full understanding of your errors.

Thanks,
-Stolee
Patrick Steinhardt March 2, 2022, 1:57 p.m. UTC | #10
On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> > On Tue, Mar 01, 2022 at 09:06:44AM -0500, Derrick Stolee wrote:
> >> On 3/1/2022 5:35 AM, Patrick Steinhardt wrote:
> >>> On Tue, Mar 01, 2022 at 10:46:14AM +0100, Patrick Steinhardt wrote:
> >>>> On Mon, Feb 28, 2022 at 01:44:01PM -0500, Derrick Stolee wrote:
> >>>>> On 2/28/2022 11:59 AM, Patrick Steinhardt wrote:
> >>>>>> On Mon, Feb 28, 2022 at 11:23:38AM -0500, Derrick Stolee wrote:
> >>>>>>> On 2/28/2022 10:18 AM, Patrick Steinhardt wrote:
> >>>>>>>> [1]: https://gitlab.com/gitlab-com/www-gitlab-com.git
> ...
> >>> So the question is whether this is a change that needs to be rolled out
> >>> over multiple releases. First we'd get in the bug fix such that we write
> >>> correct commit-graphs, and after this fix has been released we can also
> >>> release the fix that starts to actually parse the generation. This
> >>> ensures there's a grace period during which we can hopefully correct the
> >>> data on-disk such that users are not faced with failures.
> >>
> >> You are right that we need to be careful here, but I also think that
> >> previous versions of Git always wrote the correct data. Here is my
> >> thought process:
> >>
> >> 1. To get this bug, we need to have parsed the corrected commit date
> >>    from an existing commit-graph in order to under-count the number
> >>    of overflow values.
> >>
> >> 2. Before this series, Git versions were not parsing the corrected
> >>    commit date, so they recompute the corrected commit date every
> >>    time the commit-graph is written, getting the proper count of
> >>    overflow values.
> >>
> >> For these reasons, data written by previous versions of Git are
> >> correct and can be trusted without a staged release.
> >>
> >> Does this make sense? Or, do you experience a different result when
> >> you build commit-graphs with a released Git version and then when
> >> writing on top with all patches applied?
> > 
> > Just to verify my understanding: you claim that the bug I was hitting
> > shouldn't be encountered in the wild when the release , but
> > only if one were to write a commit-graph with the intermediate stafe
> > until patch 3/4 of your patch series?
> 
> That is my claim. And my testing of the repo at [1] has demonstrated
> that it works correctly in these cases.
>  
> > Hum. I have re-verified, and this indeed seems to play out. So I must've
> > accidentally ran all my testing with the state generated without the
> > final patch which fixes the corruption. I do see lots of the following
> > warnings, but overall I can verify and write the commit-graph just fine:
> > 
> >     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
> 
> But I'm not able to generate these warnings from either version. I
> tried generating different levels of a split commit-graph, but
> could not reproduce it. If you have reproduction steps using current
> 'master' (or any released Git version) and the four patches here,
> then I would love to get a full understanding of your errors.
> 
> Thanks,
> -Stolee

I haven't yet been able to reproduce it with publicly available data,
but with the internal references I'm able to evoke the warnings
reliably. It only works when I have two repositories connected via
alternates, when generating the commit-graph in the linked-to repo
first, and then generating the commit-graph in the linking repo.

The following recipe allows me to reproduce, but rely on private data:

    $ git --version
    git version 2.35.1

    # The pool repository is the one we're linked to from the fork.
    $ cd "$pool"
    $ rm -rf objects/info/commit-graph objects/info/commit-graph
    $ git commit-graph write --split

    $ cd "$fork"
    $ rm -rf objects/info/commit-graph objects/info/commit-graph
    $ git commit-graph write --split

    $ git commit-graph verify --no-progress
    $ echo $?
    0

    # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
    # applied on top.
    $ ~/Development/git/bin-wrappers/git --version
    git version 2.35.1.358.g7ede1bea24

    $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
    commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
    commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
    commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
    commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
    commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
    commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
    commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
    commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
    commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
    ...
    $ echo $?
    1

When generating commit-graphs with your patches applied the `verify`
step works alright.

I've also by accident stumbled over the original error again:

    fatal: commit-graph requires overflow generation data but has none

This time it's definitely not caused by generating commit-graphs with an
in-between state of your patch series because the data comes straight
from production with no changes to the commit-graphs performed by
myself. There we're running Git v2.33.1 with a couple of backported
patches (see [1]). While those patches cause us to make more use of the
commit-graph, none modify the way we generate them.

Of note is that the commit-graph contains references to commits which
don't exist in the ODB anymore.

Patrick

[1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3
Derrick Stolee March 2, 2022, 2:57 p.m. UTC | #11
On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
> On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
>> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:

>>> Hum. I have re-verified, and this indeed seems to play out. So I must've
>>> accidentally ran all my testing with the state generated without the
>>> final patch which fixes the corruption. I do see lots of the following
>>> warnings, but overall I can verify and write the commit-graph just fine:
>>>
>>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
>>
>> But I'm not able to generate these warnings from either version. I
>> tried generating different levels of a split commit-graph, but
>> could not reproduce it. If you have reproduction steps using current
>> 'master' (or any released Git version) and the four patches here,
>> then I would love to get a full understanding of your errors.
>>
>> Thanks,
>> -Stolee
> 
> I haven't yet been able to reproduce it with publicly available data,
> but with the internal references I'm able to evoke the warnings
> reliably. It only works when I have two repositories connected via
> alternates, when generating the commit-graph in the linked-to repo
> first, and then generating the commit-graph in the linking repo.
> 
> The following recipe allows me to reproduce, but rely on private data:
> 
>     $ git --version
>     git version 2.35.1
> 
>     # The pool repository is the one we're linked to from the fork.
>     $ cd "$pool"
>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
>     $ git commit-graph write --split
> 
>     $ cd "$fork"
>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
>     $ git commit-graph write --split
> 
>     $ git commit-graph verify --no-progress
>     $ echo $?
>     0
> 
>     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
>     # applied on top.
>     $ ~/Development/git/bin-wrappers/git --version
>     git version 2.35.1.358.g7ede1bea24
> 
>     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
>     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
>     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
>     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
>     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
>     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
>     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
>     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
>     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
>     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
>     ...
>     $ echo $?
>     1
> 
> When generating commit-graphs with your patches applied the `verify`
> step works alright.
> 
> I've also by accident stumbled over the original error again:
> 
>     fatal: commit-graph requires overflow generation data but has none
> 
> This time it's definitely not caused by generating commit-graphs with an
> in-between state of your patch series because the data comes straight
> from production with no changes to the commit-graphs performed by
> myself. There we're running Git v2.33.1 with a couple of backported
> patches (see [1]). While those patches cause us to make more use of the
> commit-graph, none modify the way we generate them.
> 
> Of note is that the commit-graph contains references to commits which
> don't exist in the ODB anymore.
> 
> Patrick
> 
> [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3

Thank you for your diligence here, Patrick. I really appreciate the
work you're putting in to verify the situation.

Since our repro relies on private information, but is consistent, I
wonder if we should take the patch below, which starts to ignore the
older generation number v2 data and only writes freshly-computed
numbers.

Thanks,
-Stolee

--- 8< ---

From c53d8bd52bbcab3862e8a826ee75692edc7e4173 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <derrickstolee@github.com>
Date: Wed, 2 Mar 2022 09:45:13 -0500
Subject: [PATCH v3 5/4] commit-graph: declare bankruptcy on GDAT chunks

The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks
store corrected commit date offsets, used for generation number v2.
Recent changes have demonstrated that previous versions of Git were
incorrectly parsing data from these chunks, but might have also been
writing them incorrectly.

I asserted [1] that the previous fixes were sufficient because the known
reasons for incorrectly writing generation number v2 data relied on
parsing the information incorrectly out of a commit-graph file, but the
previous versions of Git were not reading the generation number v2 data.

However, Patrick demonstrated [2] a case where in split commit-graphs
across an alternate boundary (and possibly some other special
conditions) it was possible to have a commit-graph that was generated by
a previous version of Git have incorrect generation number v2 data which
results in errors like the following:

  commit-graph generation for commit <oid> is 1623273624 < 1623273710

[1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/
[2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/

Clearly, there is something else going on. The situation is not
completely understood, but the errors do not reproduce if the
commit-graphs are all generated by a Git version including these recent
fixes.

If we cannot trust the existing data in the GDAT and GDOV chunks, then
we can alter the format to change the chunk IDs for these chunks. This
causes the new version of Git to silently ignore the older chunks (and
disabling generation number v2 in the process) while writing new
commit-graph files with correct data in the GDA2 and GDO2 chunks.

Update commit-graph-format.txt including a historical note about these
deprecated chunks.

Reported-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/technical/commit-graph-format.txt | 12 ++++++++++--
 commit-graph.c                                  |  4 ++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 87971c27dd7..484b185ba98 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -93,7 +93,7 @@ CHUNK DATA:
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
-  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+  Generation Data (ID: {'G', 'D', 'A', '2' }) (N * 4 bytes) [Optional]
     * This list of 4-byte values store corrected commit date offsets for the
       commits, arranged in the same order as commit data chunk.
     * If the corrected commit date offset cannot be stored within 31 bits,
@@ -104,7 +104,7 @@ CHUNK DATA:
       by compatible versions of Git and in case of split commit-graph chains,
       the topmost layer also has Generation Data chunk.
 
-  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+  Generation Data Overflow (ID: {'G', 'D', 'O', '2' }) [Optional]
     * This list of 8-byte values stores the corrected commit date offsets
       for commits with corrected commit date offsets that cannot be
       stored within 31 bits.
@@ -156,3 +156,11 @@ CHUNK DATA:
 TRAILER:
 
 	H-byte HASH-checksum of all of the above.
+
+== Historical Notes:
+
+The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
+the number '2' in their chunk IDs because a previous version of Git wrote
+possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
+changing the IDs, newer versions of Git will silently ignore those older
+chunks and write the new information without trusting the incorrect data.
diff --git a/commit-graph.c b/commit-graph.c
index b86a6a634fe..fb2ced0bd6d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -39,8 +39,8 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
-#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
-#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
Junio C Hamano March 2, 2022, 6:15 p.m. UTC | #12
Derrick Stolee <derrickstolee@github.com> writes:

> Since our repro relies on private information, but is consistent, I
> wonder if we should take the patch below, which starts to ignore the
> older generation number v2 data and only writes freshly-computed
> numbers.

;-)

> Clearly, there is something else going on. The situation is not
> completely understood, but the errors do not reproduce if the
> commit-graphs are all generated by a Git version including these recent
> fixes.

Do you mean "we know doing X and then Y and then Z on this
particular private data with older version of Git without those two
fixes will lead to a broken timestamp, but doing exactly the same
with the two fixes, the breakage does not reproduce"?  If so, that
is quite encouraging news.  Thanks for working well together.

> If we cannot trust the existing data in the GDAT and GDOV chunks, then
> we can alter the format to change the chunk IDs for these chunks. This
> causes the new version of Git to silently ignore the older chunks (and
> disabling generation number v2 in the process) while writing new
> commit-graph files with correct data in the GDA2 and GDO2 chunks.
>
> Update commit-graph-format.txt including a historical note about these
> deprecated chunks.

Sensible.

> @@ -156,3 +156,11 @@ CHUNK DATA:
>  TRAILER:
>  
>  	H-byte HASH-checksum of all of the above.
> +
> +== Historical Notes:
> +
> +The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
> +the number '2' in their chunk IDs because a previous version of Git wrote
> +possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
> +changing the IDs, newer versions of Git will silently ignore those older
> +chunks and write the new information without trusting the incorrect data.

Good.  How does a new version of Git skip and ignore GDAT and GDOV
in existing files?  By not having any code to recognize what they
are?

I am wondering if there is some notion of "if you do not understand
what this chunk is, you are incapable of handling this file
correctly, so do not use it" kind of bit per chunks (similar to the
index extensions where ones that begin with [A-Z] are optional) that
may negatively affect this plan.

Thanks.

> diff --git a/commit-graph.c b/commit-graph.c
> index b86a6a634fe..fb2ced0bd6d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -39,8 +39,8 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */
> +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
Derrick Stolee March 2, 2022, 6:46 p.m. UTC | #13
On 3/2/2022 1:15 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> Since our repro relies on private information, but is consistent, I
>> wonder if we should take the patch below, which starts to ignore the
>> older generation number v2 data and only writes freshly-computed
>> numbers.
> 
> ;-)
> 
>> Clearly, there is something else going on. The situation is not
>> completely understood, but the errors do not reproduce if the
>> commit-graphs are all generated by a Git version including these recent
>> fixes.
> 
> Do you mean "we know doing X and then Y and then Z on this
> particular private data with older version of Git without those two
> fixes will lead to a broken timestamp, but doing exactly the same
> with the two fixes, the breakage does not reproduce"?  If so, that
> is quite encouraging news.  Thanks for working well together.

Yes, that is my understanding.

>> If we cannot trust the existing data in the GDAT and GDOV chunks, then
>> we can alter the format to change the chunk IDs for these chunks. This
>> causes the new version of Git to silently ignore the older chunks (and
>> disabling generation number v2 in the process) while writing new
>> commit-graph files with correct data in the GDA2 and GDO2 chunks.
>>
>> Update commit-graph-format.txt including a historical note about these
>> deprecated chunks.
> 
> Sensible.
> 
>> @@ -156,3 +156,11 @@ CHUNK DATA:
>>  TRAILER:
>>  
>>  	H-byte HASH-checksum of all of the above.
>> +
>> +== Historical Notes:
>> +
>> +The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
>> +the number '2' in their chunk IDs because a previous version of Git wrote
>> +possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
>> +changing the IDs, newer versions of Git will silently ignore those older
>> +chunks and write the new information without trusting the incorrect data.
> 
> Good.  How does a new version of Git skip and ignore GDAT and GDOV
> in existing files?  By not having any code to recognize what they
> are?
> 
> I am wondering if there is some notion of "if you do not understand
> what this chunk is, you are incapable of handling this file
> correctly, so do not use it" kind of bit per chunks (similar to the
> index extensions where ones that begin with [A-Z] are optional) that
> may negatively affect this plan.

The chunk IDs do not have this special casing rule. This is a
bit unfortunate for certain cases like adding something that _must_
be understood. Here, it works to our benefit that GDAT and GDOV are
optional and can be safely ignored. Thus, clients with this patch
will ignore GDAT and GDOV and continue using topological levels
form the CDAT chunk. Older clients without this patch will ignore
the new GDA2 and GDO2 chunks and continue using topological levels.

For Git versions without this topic branch, this "continue using
topological levels" means no change of behavior at all.

Thanks,
-Stolee
Junio C Hamano March 2, 2022, 10:42 p.m. UTC | #14
Derrick Stolee <derrickstolee@github.com> writes:

>> I am wondering if there is some notion of "if you do not understand
>> what this chunk is, you are incapable of handling this file
>> correctly, so do not use it" kind of bit per chunks (similar to the
>> index extensions where ones that begin with [A-Z] are optional) that
>> may negatively affect this plan.
>
> The chunk IDs do not have this special casing rule. This is a
> bit unfortunate for certain cases like adding something that _must_
> be understood. Here, it works to our benefit that GDAT and GDOV are
> optional and can be safely ignored. Thus, clients with this patch
> will ignore GDAT and GDOV and continue using topological levels
> form the CDAT chunk. Older clients without this patch will ignore
> the new GDA2 and GDO2 chunks and continue using topological levels.
>
> For Git versions without this topic branch, this "continue using
> topological levels" means no change of behavior at all.

Excellent.  Thanks.
Patrick Steinhardt March 3, 2022, 11:19 a.m. UTC | #15
On Wed, Mar 02, 2022 at 09:57:17AM -0500, Derrick Stolee wrote:
> On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
> > On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
> >> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> 
> >>> Hum. I have re-verified, and this indeed seems to play out. So I must've
> >>> accidentally ran all my testing with the state generated without the
> >>> final patch which fixes the corruption. I do see lots of the following
> >>> warnings, but overall I can verify and write the commit-graph just fine:
> >>>
> >>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
> >>
> >> But I'm not able to generate these warnings from either version. I
> >> tried generating different levels of a split commit-graph, but
> >> could not reproduce it. If you have reproduction steps using current
> >> 'master' (or any released Git version) and the four patches here,
> >> then I would love to get a full understanding of your errors.
> >>
> >> Thanks,
> >> -Stolee
> > 
> > I haven't yet been able to reproduce it with publicly available data,
> > but with the internal references I'm able to evoke the warnings
> > reliably. It only works when I have two repositories connected via
> > alternates, when generating the commit-graph in the linked-to repo
> > first, and then generating the commit-graph in the linking repo.
> > 
> > The following recipe allows me to reproduce, but rely on private data:
> > 
> >     $ git --version
> >     git version 2.35.1
> > 
> >     # The pool repository is the one we're linked to from the fork.
> >     $ cd "$pool"
> >     $ rm -rf objects/info/commit-graph objects/info/commit-graph
> >     $ git commit-graph write --split
> > 
> >     $ cd "$fork"
> >     $ rm -rf objects/info/commit-graph objects/info/commit-graph
> >     $ git commit-graph write --split
> > 
> >     $ git commit-graph verify --no-progress
> >     $ echo $?
> >     0
> > 
> >     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
> >     # applied on top.
> >     $ ~/Development/git/bin-wrappers/git --version
> >     git version 2.35.1.358.g7ede1bea24
> > 
> >     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
> >     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
> >     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
> >     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
> >     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
> >     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
> >     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
> >     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
> >     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
> >     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
> >     ...
> >     $ echo $?
> >     1
> > 
> > When generating commit-graphs with your patches applied the `verify`
> > step works alright.
> > 
> > I've also by accident stumbled over the original error again:
> > 
> >     fatal: commit-graph requires overflow generation data but has none
> > 
> > This time it's definitely not caused by generating commit-graphs with an
> > in-between state of your patch series because the data comes straight
> > from production with no changes to the commit-graphs performed by
> > myself. There we're running Git v2.33.1 with a couple of backported
> > patches (see [1]). While those patches cause us to make more use of the
> > commit-graph, none modify the way we generate them.
> > 
> > Of note is that the commit-graph contains references to commits which
> > don't exist in the ODB anymore.
> > 
> > Patrick
> > 
> > [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3
> 
> Thank you for your diligence here, Patrick. I really appreciate the
> work you're putting in to verify the situation.
> 
> Since our repro relies on private information, but is consistent, I
> wonder if we should take the patch below, which starts to ignore the
> older generation number v2 data and only writes freshly-computed
> numbers.
> 
> Thanks,
> -Stolee

Thanks. With your patch below the `fatal:` error is gone, but I'm still
seeing the same errors with regards to the commit-graph generations.

So to summarize my findings:

    - This bug occurs when writing commit-graphs with v2.35.1, but
      reading them with your patches.

    - This bug occurs when I have two repositories connected via an
      alternates file. I haven't yet been able to reproduce it in a
      single repository that is not connected to a separate ODB.

    - This bug only occurs when I first generate the commit-graph in the
      repository I'm borrowing objects from.

    - This bug only occurs when I write commit-graphs with `--split` in
      both repositories. "Normal" commit-graphs don't have this issue,
      and neither can I see it with `--split=replace` or mixed-type
      commit-graphs.

Beware, the following explanation is based on my very basic
understanding of the commit-graph code and thus more likely to be wrong
than right:

With the old Git version, we've been mis-parsing the generation because
`read_generation_data` wasn't ever set. As a result it can happen that
the second split commit-graph we're generating computes its own
generation numbers from the wrong starting point because it uses the
mis-parsed generation numbers from the parent commit-graph.

With your patches, we start to correctly account for overflows and would
thus end up with a different value for the generation depending on where
we parse the commit from: if we parse it from the first commit-graph it
would be correct because it's contains the "root" of the generation
numbers. But if we parse a commit from the second commit-graph we may
have a mismatch because the generation numbers in there may have been
derived from generation numbers mis-parsed from the first commit-graph.
And because these would be wrong in case there was an overflow it is
clear that the new corrected generation number may be wrong, as well.

Patrick

> --- 8< ---
> 
> From c53d8bd52bbcab3862e8a826ee75692edc7e4173 Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <derrickstolee@github.com>
> Date: Wed, 2 Mar 2022 09:45:13 -0500
> Subject: [PATCH v3 5/4] commit-graph: declare bankruptcy on GDAT chunks
> 
> The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks
> store corrected commit date offsets, used for generation number v2.
> Recent changes have demonstrated that previous versions of Git were
> incorrectly parsing data from these chunks, but might have also been
> writing them incorrectly.
> 
> I asserted [1] that the previous fixes were sufficient because the known
> reasons for incorrectly writing generation number v2 data relied on
> parsing the information incorrectly out of a commit-graph file, but the
> previous versions of Git were not reading the generation number v2 data.
> 
> However, Patrick demonstrated [2] a case where in split commit-graphs
> across an alternate boundary (and possibly some other special
> conditions) it was possible to have a commit-graph that was generated by
> a previous version of Git have incorrect generation number v2 data which
> results in errors like the following:
> 
>   commit-graph generation for commit <oid> is 1623273624 < 1623273710
> 
> [1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/
> [2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/
> 
> Clearly, there is something else going on. The situation is not
> completely understood, but the errors do not reproduce if the
> commit-graphs are all generated by a Git version including these recent
> fixes.
> 
> If we cannot trust the existing data in the GDAT and GDOV chunks, then
> we can alter the format to change the chunk IDs for these chunks. This
> causes the new version of Git to silently ignore the older chunks (and
> disabling generation number v2 in the process) while writing new
> commit-graph files with correct data in the GDA2 and GDO2 chunks.
> 
> Update commit-graph-format.txt including a historical note about these
> deprecated chunks.
> 
> Reported-by: Patrick Steinhardt <ps@pks.im>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/technical/commit-graph-format.txt | 12 ++++++++++--
>  commit-graph.c                                  |  4 ++--
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 87971c27dd7..484b185ba98 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -93,7 +93,7 @@ CHUNK DATA:
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>  
> -  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> +  Generation Data (ID: {'G', 'D', 'A', '2' }) (N * 4 bytes) [Optional]
>      * This list of 4-byte values store corrected commit date offsets for the
>        commits, arranged in the same order as commit data chunk.
>      * If the corrected commit date offset cannot be stored within 31 bits,
> @@ -104,7 +104,7 @@ CHUNK DATA:
>        by compatible versions of Git and in case of split commit-graph chains,
>        the topmost layer also has Generation Data chunk.
>  
> -  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> +  Generation Data Overflow (ID: {'G', 'D', 'O', '2' }) [Optional]
>      * This list of 8-byte values stores the corrected commit date offsets
>        for commits with corrected commit date offsets that cannot be
>        stored within 31 bits.
> @@ -156,3 +156,11 @@ CHUNK DATA:
>  TRAILER:
>  
>  	H-byte HASH-checksum of all of the above.
> +
> +== Historical Notes:
> +
> +The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have
> +the number '2' in their chunk IDs because a previous version of Git wrote
> +possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By
> +changing the IDs, newer versions of Git will silently ignore those older
> +chunks and write the new information without trusting the incorrect data.
> diff --git a/commit-graph.c b/commit-graph.c
> index b86a6a634fe..fb2ced0bd6d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -39,8 +39,8 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> -#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */
> +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> -- 
> 2.35.1.138.gfc5de29e9e6
> 
> 
> 
>
Derrick Stolee March 3, 2022, 4 p.m. UTC | #16
On 3/3/2022 6:19 AM, Patrick Steinhardt wrote:
> On Wed, Mar 02, 2022 at 09:57:17AM -0500, Derrick Stolee wrote:
>> On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
>>> On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
>>>> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
>>
>>>>> Hum. I have re-verified, and this indeed seems to play out. So I must've
>>>>> accidentally ran all my testing with the state generated without the
>>>>> final patch which fixes the corruption. I do see lots of the following
>>>>> warnings, but overall I can verify and write the commit-graph just fine:
>>>>>
>>>>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
>>>>
>>>> But I'm not able to generate these warnings from either version. I
>>>> tried generating different levels of a split commit-graph, but
>>>> could not reproduce it. If you have reproduction steps using current
>>>> 'master' (or any released Git version) and the four patches here,
>>>> then I would love to get a full understanding of your errors.
>>>>
>>>> Thanks,
>>>> -Stolee
>>>
>>> I haven't yet been able to reproduce it with publicly available data,
>>> but with the internal references I'm able to evoke the warnings
>>> reliably. It only works when I have two repositories connected via
>>> alternates, when generating the commit-graph in the linked-to repo
>>> first, and then generating the commit-graph in the linking repo.
>>>
>>> The following recipe allows me to reproduce, but rely on private data:
>>>
>>>     $ git --version
>>>     git version 2.35.1
>>>
>>>     # The pool repository is the one we're linked to from the fork.
>>>     $ cd "$pool"
>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
>>>     $ git commit-graph write --split
>>>
>>>     $ cd "$fork"
>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
>>>     $ git commit-graph write --split
>>>
>>>     $ git commit-graph verify --no-progress
>>>     $ echo $?
>>>     0
>>>
>>>     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
>>>     # applied on top.
>>>     $ ~/Development/git/bin-wrappers/git --version
>>>     git version 2.35.1.358.g7ede1bea24
>>>
>>>     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
>>>     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
>>>     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
>>>     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
>>>     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
>>>     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
>>>     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
>>>     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
>>>     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
>>>     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
>>>     ...
>>>     $ echo $?
>>>     1
>>>
>>> When generating commit-graphs with your patches applied the `verify`
>>> step works alright.
>>>
>>> I've also by accident stumbled over the original error again:
>>>
>>>     fatal: commit-graph requires overflow generation data but has none
>>>
>>> This time it's definitely not caused by generating commit-graphs with an
>>> in-between state of your patch series because the data comes straight
>>> from production with no changes to the commit-graphs performed by
>>> myself. There we're running Git v2.33.1 with a couple of backported
>>> patches (see [1]). While those patches cause us to make more use of the
>>> commit-graph, none modify the way we generate them.
>>>
>>> Of note is that the commit-graph contains references to commits which
>>> don't exist in the ODB anymore.
>>>
>>> Patrick
>>>
>>> [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3
>>
>> Thank you for your diligence here, Patrick. I really appreciate the
>> work you're putting in to verify the situation.
>>
>> Since our repro relies on private information, but is consistent, I
>> wonder if we should take the patch below, which starts to ignore the
>> older generation number v2 data and only writes freshly-computed
>> numbers.
>>
>> Thanks,
>> -Stolee
> 
> Thanks. With your patch below the `fatal:` error is gone, but I'm still
> seeing the same errors with regards to the commit-graph generations.

This is disappointing and unexpected. Thanks for verifying.

> So to summarize my findings:
> 
>     - This bug occurs when writing commit-graphs with v2.35.1, but
>       reading them with your patches.
> 
>     - This bug occurs when I have two repositories connected via an
>       alternates file. I haven't yet been able to reproduce it in a
>       single repository that is not connected to a separate ODB.

This is an interesting distinction. One that I didn't think would
matter, but I'll look into the code to see how that could affect
things.

>     - This bug only occurs when I first generate the commit-graph in the
>       repository I'm borrowing objects from.
> 
>     - This bug only occurs when I write commit-graphs with `--split` in
>       both repositories. "Normal" commit-graphs don't have this issue,
>       and neither can I see it with `--split=replace` or mixed-type
>       commit-graphs.
> 
> Beware, the following explanation is based on my very basic
> understanding of the commit-graph code and thus more likely to be wrong
> than right:
> 
> With the old Git version, we've been mis-parsing the generation because
> `read_generation_data` wasn't ever set. As a result it can happen that
> the second split commit-graph we're generating computes its own
> generation numbers from the wrong starting point because it uses the
> mis-parsed generation numbers from the parent commit-graph.
> 
> With your patches, we start to correctly account for overflows and would
> thus end up with a different value for the generation depending on where
> we parse the commit from: if we parse it from the first commit-graph it
> would be correct because it's contains the "root" of the generation
> numbers. But if we parse a commit from the second commit-graph we may
> have a mismatch because the generation numbers in there may have been
> derived from generation numbers mis-parsed from the first commit-graph.
> And because these would be wrong in case there was an overflow it is
> clear that the new corrected generation number may be wrong, as well.

Hm. My expectation was that the older layers of the split commit-graph
would have read_generation_data disabled (because the new Git version
cannot read the GDAT chunk) and then the validate_mixed_generation_chain()
method would remove read_generation_data from all of the graphs in the
list.

Combining this with your thoughts on cross-alternate split commit-graphs,
this makes me think we should try this:

--- >8 ---

diff --git a/commit-graph.c b/commit-graph.c
index fb2ced0bd6..74c6534f56 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -609,8 +609,6 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
-	validate_mixed_generation_chain(g);
-
 	return g;
 }
 
@@ -668,7 +666,13 @@ static int prepare_commit_graph(struct repository *r)
 	     !r->objects->commit_graph && odb;
 	     odb = odb->next)
 		prepare_commit_graph_one(r, odb);
-	return !!r->objects->commit_graph;
+
+	if (r->objects->commit_graph) {
+		validate_mixed_generation_chain(r->objects->commit_graph);
+		return 1;
+	}
+
+	return 0;
 }
 
 int generation_numbers_enabled(struct repository *r)


--- >8 ---

Notice that I'm moving the validate_mixed_generation_chain() call
out of read_commit_graph_one() and into prepare_commit_graph(). To
my understanding, this _should_ have an equivalent end state as the
old code, but might be worth trying just as a quick check.

I will continue investigating and try to reproduce with this
additional constraint of working across an alternate.

Thanks,
-Stolee
Derrick Stolee March 4, 2022, 2:03 p.m. UTC | #17
On 3/3/2022 11:00 AM, Derrick Stolee wrote:
> On 3/3/2022 6:19 AM, Patrick Steinhardt wrote:
>> On Wed, Mar 02, 2022 at 09:57:17AM -0500, Derrick Stolee wrote:
>>> On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
>>>> On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
>>>>> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
>>>
>>>>>> Hum. I have re-verified, and this indeed seems to play out. So I must've
>>>>>> accidentally ran all my testing with the state generated without the
>>>>>> final patch which fixes the corruption. I do see lots of the following
>>>>>> warnings, but overall I can verify and write the commit-graph just fine:
>>>>>>
>>>>>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
>>>>>
>>>>> But I'm not able to generate these warnings from either version. I
>>>>> tried generating different levels of a split commit-graph, but
>>>>> could not reproduce it. If you have reproduction steps using current
>>>>> 'master' (or any released Git version) and the four patches here,
>>>>> then I would love to get a full understanding of your errors.
>>>>>
>>>>> Thanks,
>>>>> -Stolee
>>>>
>>>> I haven't yet been able to reproduce it with publicly available data,
>>>> but with the internal references I'm able to evoke the warnings
>>>> reliably. It only works when I have two repositories connected via
>>>> alternates, when generating the commit-graph in the linked-to repo
>>>> first, and then generating the commit-graph in the linking repo.
>>>>
>>>> The following recipe allows me to reproduce, but rely on private data:
>>>>
>>>>     $ git --version
>>>>     git version 2.35.1
>>>>
>>>>     # The pool repository is the one we're linked to from the fork.
>>>>     $ cd "$pool"
>>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
>>>>     $ git commit-graph write --split
>>>>
>>>>     $ cd "$fork"
>>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
>>>>     $ git commit-graph write --split
>>>>
>>>>     $ git commit-graph verify --no-progress
>>>>     $ echo $?
>>>>     0
>>>>
>>>>     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
>>>>     # applied on top.
>>>>     $ ~/Development/git/bin-wrappers/git --version
>>>>     git version 2.35.1.358.g7ede1bea24
>>>>
>>>>     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
>>>>     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
>>>>     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
>>>>     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
>>>>     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
>>>>     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
>>>>     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
>>>>     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
>>>>     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
>>>>     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
>>>>     ...
>>>>     $ echo $?
>>>>     1
>>>>
>>>> When generating commit-graphs with your patches applied the `verify`
>>>> step works alright.
>>>>
>>>> I've also by accident stumbled over the original error again:
>>>>
>>>>     fatal: commit-graph requires overflow generation data but has none
>>>>
>>>> This time it's definitely not caused by generating commit-graphs with an
>>>> in-between state of your patch series because the data comes straight
>>>> from production with no changes to the commit-graphs performed by
>>>> myself. There we're running Git v2.33.1 with a couple of backported
>>>> patches (see [1]). While those patches cause us to make more use of the
>>>> commit-graph, none modify the way we generate them.
>>>>
>>>> Of note is that the commit-graph contains references to commits which
>>>> don't exist in the ODB anymore.
>>>>
>>>> Patrick
>>>>
>>>> [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3
>>>
>>> Thank you for your diligence here, Patrick. I really appreciate the
>>> work you're putting in to verify the situation.
>>>
>>> Since our repro relies on private information, but is consistent, I
>>> wonder if we should take the patch below, which starts to ignore the
>>> older generation number v2 data and only writes freshly-computed
>>> numbers.
>>>
>>> Thanks,
>>> -Stolee
>>
>> Thanks. With your patch below the `fatal:` error is gone, but I'm still
>> seeing the same errors with regards to the commit-graph generations.
> 
> This is disappointing and unexpected. Thanks for verifying.
> 
>> So to summarize my findings:
>>
>>     - This bug occurs when writing commit-graphs with v2.35.1, but
>>       reading them with your patches.
>>
>>     - This bug occurs when I have two repositories connected via an
>>       alternates file. I haven't yet been able to reproduce it in a
>>       single repository that is not connected to a separate ODB.
> 
> This is an interesting distinction. One that I didn't think would
> matter, but I'll look into the code to see how that could affect
> things.
> 
>>     - This bug only occurs when I first generate the commit-graph in the
>>       repository I'm borrowing objects from.
>>
>>     - This bug only occurs when I write commit-graphs with `--split` in
>>       both repositories. "Normal" commit-graphs don't have this issue,
>>       and neither can I see it with `--split=replace` or mixed-type
>>       commit-graphs.
>>
>> Beware, the following explanation is based on my very basic
>> understanding of the commit-graph code and thus more likely to be wrong
>> than right:
>>
>> With the old Git version, we've been mis-parsing the generation because
>> `read_generation_data` wasn't ever set. As a result it can happen that
>> the second split commit-graph we're generating computes its own
>> generation numbers from the wrong starting point because it uses the
>> mis-parsed generation numbers from the parent commit-graph.
>>
>> With your patches, we start to correctly account for overflows and would
>> thus end up with a different value for the generation depending on where
>> we parse the commit from: if we parse it from the first commit-graph it
>> would be correct because it's contains the "root" of the generation
>> numbers. But if we parse a commit from the second commit-graph we may
>> have a mismatch because the generation numbers in there may have been
>> derived from generation numbers mis-parsed from the first commit-graph.
>> And because these would be wrong in case there was an overflow it is
>> clear that the new corrected generation number may be wrong, as well.
> 
> Hm. My expectation was that the older layers of the split commit-graph
> would have read_generation_data disabled (because the new Git version
> cannot read the GDAT chunk) and then the validate_mixed_generation_chain()
> method would remove read_generation_data from all of the graphs in the
> list.
> 
> Combining this with your thoughts on cross-alternate split commit-graphs,
> this makes me think we should try this:
> 
> --- >8 ---
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index fb2ced0bd6..74c6534f56 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -609,8 +609,6 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  	if (!g)
>  		g = load_commit_graph_chain(r, odb);
>  
> -	validate_mixed_generation_chain(g);
> -
>  	return g;
>  }
>  
> @@ -668,7 +666,13 @@ static int prepare_commit_graph(struct repository *r)
>  	     !r->objects->commit_graph && odb;
>  	     odb = odb->next)
>  		prepare_commit_graph_one(r, odb);
> -	return !!r->objects->commit_graph;
> +
> +	if (r->objects->commit_graph) {
> +		validate_mixed_generation_chain(r->objects->commit_graph);
> +		return 1;
> +	}
> +
> +	return 0;
>  }
>  
>  int generation_numbers_enabled(struct repository *r)
> 
> 
> --- >8 ---
> 
> Notice that I'm moving the validate_mixed_generation_chain() call
> out of read_commit_graph_one() and into prepare_commit_graph(). To
> my understanding, this _should_ have an equivalent end state as the
> old code, but might be worth trying just as a quick check.
> 
> I will continue investigating and try to reproduce with this
> additional constraint of working across an alternate.

My attempts to reproduce this across an alternate have failed. I
tried running the following test against Git without these patches,
then verify with the newer version of Git. (I also have generated
a few new layers on top with these patches, and they correctly drop
the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)


test_description='commit-graph with offsets across alternates'
. ./test-lib.sh

if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
then
	skip_all='skipping 64-bit timestamp tests'
	test_done
fi


UNIX_EPOCH_ZERO="@0 +0000"
FUTURE_DATE="@4147483646 +0000"

GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0

test_expect_success 'generate alternate split commit-graph' '
	git init alternate &&
	(
		cd alternate &&
		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
		test_commit --date "$FUTURE_DATE" 2 &&
		git commit-graph write --reachable &&
		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
		test_commit --date "$FUTURE_DATE" 4 &&
		git commit-graph write --reachable --split=no-merge
	) &&
	git clone --shared alternate fork &&
	(
		cd fork &&
		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
		test_commit --date "$FUTURE_DATE" 6 &&
		git commit-graph write --reachable --split=no-merge &&
		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
		test_commit --date "$FUTURE_DATE" 8 &&
		git commit-graph write --reachable --split=no-merge
	)
'

test_done


My testing after running this with -d allows me to reliably see these
layers being created with GDAT and GDOV chunks. Running the 'git
commit-graph verify' command with the new code does not show those
errors, even after adding commits and another layer to the split
commit-graph.

I look forward to any additional insights you might have here.

Thanks,
-Stolee
Patrick Steinhardt March 7, 2022, 10:34 a.m. UTC | #18
On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
> > On 3/3/2022 6:19 AM, Patrick Steinhardt wrote:
> >> On Wed, Mar 02, 2022 at 09:57:17AM -0500, Derrick Stolee wrote:
> >>> On 3/2/2022 8:57 AM, Patrick Steinhardt wrote:
> >>>> On Tue, Mar 01, 2022 at 10:25:46AM -0500, Derrick Stolee wrote:
> >>>>> On 3/1/2022 9:53 AM, Patrick Steinhardt wrote:
> >>>
> >>>>>> Hum. I have re-verified, and this indeed seems to play out. So I must've
> >>>>>> accidentally ran all my testing with the state generated without the
> >>>>>> final patch which fixes the corruption. I do see lots of the following
> >>>>>> warnings, but overall I can verify and write the commit-graph just fine:
> >>>>>>
> >>>>>>     commit-graph generation for commit c80a42de8803e2d77818d0c82f88e748d7f9425f is 1623362063 < 1623362139
> >>>>>
> >>>>> But I'm not able to generate these warnings from either version. I
> >>>>> tried generating different levels of a split commit-graph, but
> >>>>> could not reproduce it. If you have reproduction steps using current
> >>>>> 'master' (or any released Git version) and the four patches here,
> >>>>> then I would love to get a full understanding of your errors.
> >>>>>
> >>>>> Thanks,
> >>>>> -Stolee
> >>>>
> >>>> I haven't yet been able to reproduce it with publicly available data,
> >>>> but with the internal references I'm able to evoke the warnings
> >>>> reliably. It only works when I have two repositories connected via
> >>>> alternates, when generating the commit-graph in the linked-to repo
> >>>> first, and then generating the commit-graph in the linking repo.
> >>>>
> >>>> The following recipe allows me to reproduce, but rely on private data:
> >>>>
> >>>>     $ git --version
> >>>>     git version 2.35.1
> >>>>
> >>>>     # The pool repository is the one we're linked to from the fork.
> >>>>     $ cd "$pool"
> >>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
> >>>>     $ git commit-graph write --split
> >>>>
> >>>>     $ cd "$fork"
> >>>>     $ rm -rf objects/info/commit-graph objects/info/commit-graph
> >>>>     $ git commit-graph write --split
> >>>>
> >>>>     $ git commit-graph verify --no-progress
> >>>>     $ echo $?
> >>>>     0
> >>>>
> >>>>     # This is 715d08a9e51251ad8290b181b6ac3b9e1f9719d7 with your full v2
> >>>>     # applied on top.
> >>>>     $ ~/Development/git/bin-wrappers/git --version
> >>>>     git version 2.35.1.358.g7ede1bea24
> >>>>
> >>>>     $ ~/Development/git/bin-wrappers/git commit-graph verify --no-progress
> >>>>     commit-graph generation for commit 06a91bac00ed11128becd48d5ae77eacd8f24c97 is 1623273624 < 1623273710
> >>>>     commit-graph generation for commit 0ae91029f27238e8f8e109c6bb3907f864dda14f is 1622151146 < 1622151220
> >>>>     commit-graph generation for commit 0d4582a33d8c8e3eb01adbf564f5e1deeb3b56a2 is 1631045222 < 1631045225
> >>>>     commit-graph generation for commit 0daf8976439d7e0bb9710c5ee63b570580e0dc03 is 1620347739 < 1620347789
> >>>>     commit-graph generation for commit 0e0ee8ffb3fa22cee7d28e21cbd6df26454932cf is 1623783297 < 1623783380
> >>>>     commit-graph generation for commit 0f08ab3de6ec115ea8a956a1996cb9759e640e74 is 1621543278 < 1621543339
> >>>>     commit-graph generation for commit 133ed0319b5a66ae0c2be76e5a887b880452b111 is 1620949864 < 1620949915
> >>>>     commit-graph generation for commit 1341b3e6c63343ae94a8a473fa057126ddd4669a is 1637344364 < 1637344384
> >>>>     commit-graph generation for commit 15bdfc501c2c9f23e9353bf6e6a5facd9c32a07a is 1623348103 < 1623348133
> >>>>     ...
> >>>>     $ echo $?
> >>>>     1
> >>>>
> >>>> When generating commit-graphs with your patches applied the `verify`
> >>>> step works alright.
> >>>>
> >>>> I've also by accident stumbled over the original error again:
> >>>>
> >>>>     fatal: commit-graph requires overflow generation data but has none
> >>>>
> >>>> This time it's definitely not caused by generating commit-graphs with an
> >>>> in-between state of your patch series because the data comes straight
> >>>> from production with no changes to the commit-graphs performed by
> >>>> myself. There we're running Git v2.33.1 with a couple of backported
> >>>> patches (see [1]). While those patches cause us to make more use of the
> >>>> commit-graph, none modify the way we generate them.
> >>>>
> >>>> Of note is that the commit-graph contains references to commits which
> >>>> don't exist in the ODB anymore.
> >>>>
> >>>> Patrick
> >>>>
> >>>> [1]: https://gitlab.com/gitlab-org/gitlab-git/-/commits/pks-v2.33.1.gl3
> >>>
> >>> Thank you for your diligence here, Patrick. I really appreciate the
> >>> work you're putting in to verify the situation.
> >>>
> >>> Since our repro relies on private information, but is consistent, I
> >>> wonder if we should take the patch below, which starts to ignore the
> >>> older generation number v2 data and only writes freshly-computed
> >>> numbers.
> >>>
> >>> Thanks,
> >>> -Stolee
> >>
> >> Thanks. With your patch below the `fatal:` error is gone, but I'm still
> >> seeing the same errors with regards to the commit-graph generations.
> > 
> > This is disappointing and unexpected. Thanks for verifying.
> > 
> >> So to summarize my findings:
> >>
> >>     - This bug occurs when writing commit-graphs with v2.35.1, but
> >>       reading them with your patches.
> >>
> >>     - This bug occurs when I have two repositories connected via an
> >>       alternates file. I haven't yet been able to reproduce it in a
> >>       single repository that is not connected to a separate ODB.
> > 
> > This is an interesting distinction. One that I didn't think would
> > matter, but I'll look into the code to see how that could affect
> > things.
> > 
> >>     - This bug only occurs when I first generate the commit-graph in the
> >>       repository I'm borrowing objects from.
> >>
> >>     - This bug only occurs when I write commit-graphs with `--split` in
> >>       both repositories. "Normal" commit-graphs don't have this issue,
> >>       and neither can I see it with `--split=replace` or mixed-type
> >>       commit-graphs.
> >>
> >> Beware, the following explanation is based on my very basic
> >> understanding of the commit-graph code and thus more likely to be wrong
> >> than right:
> >>
> >> With the old Git version, we've been mis-parsing the generation because
> >> `read_generation_data` wasn't ever set. As a result it can happen that
> >> the second split commit-graph we're generating computes its own
> >> generation numbers from the wrong starting point because it uses the
> >> mis-parsed generation numbers from the parent commit-graph.
> >>
> >> With your patches, we start to correctly account for overflows and would
> >> thus end up with a different value for the generation depending on where
> >> we parse the commit from: if we parse it from the first commit-graph it
> >> would be correct because it's contains the "root" of the generation
> >> numbers. But if we parse a commit from the second commit-graph we may
> >> have a mismatch because the generation numbers in there may have been
> >> derived from generation numbers mis-parsed from the first commit-graph.
> >> And because these would be wrong in case there was an overflow it is
> >> clear that the new corrected generation number may be wrong, as well.
> > 
> > Hm. My expectation was that the older layers of the split commit-graph
> > would have read_generation_data disabled (because the new Git version
> > cannot read the GDAT chunk) and then the validate_mixed_generation_chain()
> > method would remove read_generation_data from all of the graphs in the
> > list.
> > 
> > Combining this with your thoughts on cross-alternate split commit-graphs,
> > this makes me think we should try this:
> > 
> > --- >8 ---
> > 
> > diff --git a/commit-graph.c b/commit-graph.c
> > index fb2ced0bd6..74c6534f56 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -609,8 +609,6 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
> >  	if (!g)
> >  		g = load_commit_graph_chain(r, odb);
> >  
> > -	validate_mixed_generation_chain(g);
> > -
> >  	return g;
> >  }
> >  
> > @@ -668,7 +666,13 @@ static int prepare_commit_graph(struct repository *r)
> >  	     !r->objects->commit_graph && odb;
> >  	     odb = odb->next)
> >  		prepare_commit_graph_one(r, odb);
> > -	return !!r->objects->commit_graph;
> > +
> > +	if (r->objects->commit_graph) {
> > +		validate_mixed_generation_chain(r->objects->commit_graph);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> >  }
> >  
> >  int generation_numbers_enabled(struct repository *r)
> > 
> > 
> > --- >8 ---
> > 
> > Notice that I'm moving the validate_mixed_generation_chain() call
> > out of read_commit_graph_one() and into prepare_commit_graph(). To
> > my understanding, this _should_ have an equivalent end state as the
> > old code, but might be worth trying just as a quick check.
> > 
> > I will continue investigating and try to reproduce with this
> > additional constraint of working across an alternate.
> 
> My attempts to reproduce this across an alternate have failed. I
> tried running the following test against Git without these patches,
> then verify with the newer version of Git. (I also have generated
> a few new layers on top with these patches, and they correctly drop
> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
> 
> 
> test_description='commit-graph with offsets across alternates'
> . ./test-lib.sh
> 
> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
> then
> 	skip_all='skipping 64-bit timestamp tests'
> 	test_done
> fi
> 
> 
> UNIX_EPOCH_ZERO="@0 +0000"
> FUTURE_DATE="@4147483646 +0000"
> 
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> 
> test_expect_success 'generate alternate split commit-graph' '
> 	git init alternate &&
> 	(
> 		cd alternate &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
> 		test_commit --date "$FUTURE_DATE" 2 &&
> 		git commit-graph write --reachable &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
> 		test_commit --date "$FUTURE_DATE" 4 &&
> 		git commit-graph write --reachable --split=no-merge
> 	) &&
> 	git clone --shared alternate fork &&
> 	(
> 		cd fork &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
> 		test_commit --date "$FUTURE_DATE" 6 &&
> 		git commit-graph write --reachable --split=no-merge &&
> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
> 		test_commit --date "$FUTURE_DATE" 8 &&
> 		git commit-graph write --reachable --split=no-merge
> 	)
> '
> 
> test_done
> 
> 
> My testing after running this with -d allows me to reliably see these
> layers being created with GDAT and GDOV chunks. Running the 'git
> commit-graph verify' command with the new code does not show those
> errors, even after adding commits and another layer to the split
> commit-graph.
> 
> I look forward to any additional insights you might have here.

I don't really know why, but now I've become unable to reproduce it
again. I think we should just go with your patch 5/4 on top -- it does
fix the most important issue, which is the `die()` I saw on almost all
commands. The second part about the warnings I'm just not sure about,
but I don't think it should stop this patch series given my own
uncertainty.

Patrick
Derrick Stolee March 7, 2022, 1:45 p.m. UTC | #19
On 3/7/2022 5:34 AM, Patrick Steinhardt wrote:
> On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
>> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
...
>>> I will continue investigating and try to reproduce with this
>>> additional constraint of working across an alternate.
>>
>> My attempts to reproduce this across an alternate have failed. I
>> tried running the following test against Git without these patches,
>> then verify with the newer version of Git. (I also have generated
>> a few new layers on top with these patches, and they correctly drop
>> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
>>
>>
>> test_description='commit-graph with offsets across alternates'
>> . ./test-lib.sh
>>
>> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
>> then
>> 	skip_all='skipping 64-bit timestamp tests'
>> 	test_done
>> fi
>>
>>
>> UNIX_EPOCH_ZERO="@0 +0000"
>> FUTURE_DATE="@4147483646 +0000"
>>
>> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
>>
>> test_expect_success 'generate alternate split commit-graph' '
>> 	git init alternate &&
>> 	(
>> 		cd alternate &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
>> 		test_commit --date "$FUTURE_DATE" 2 &&
>> 		git commit-graph write --reachable &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
>> 		test_commit --date "$FUTURE_DATE" 4 &&
>> 		git commit-graph write --reachable --split=no-merge
>> 	) &&
>> 	git clone --shared alternate fork &&
>> 	(
>> 		cd fork &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
>> 		test_commit --date "$FUTURE_DATE" 6 &&
>> 		git commit-graph write --reachable --split=no-merge &&
>> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
>> 		test_commit --date "$FUTURE_DATE" 8 &&
>> 		git commit-graph write --reachable --split=no-merge
>> 	)
>> '
>>
>> test_done
>>
>>
>> My testing after running this with -d allows me to reliably see these
>> layers being created with GDAT and GDOV chunks. Running the 'git
>> commit-graph verify' command with the new code does not show those
>> errors, even after adding commits and another layer to the split
>> commit-graph.
>>
>> I look forward to any additional insights you might have here.
> 
> I don't really know why, but now I've become unable to reproduce it
> again. I think we should just go with your patch 5/4 on top -- it does
> fix the most important issue, which is the `die()` I saw on almost all
> commands. The second part about the warnings I'm just not sure about,
> but I don't think it should stop this patch series given my own
> uncertainty.

Thanks for following up. I agree that with 5/4 we should be safe.

I'll remain available to quickly respond if anything else surprising
comes up in this area.

Thanks!
-Stolee
Junio C Hamano March 7, 2022, 5:22 p.m. UTC | #20
Derrick Stolee <derrickstolee@github.com> writes:

> Thanks for following up. I agree that with 5/4 we should be safe.
>
> I'll remain available to quickly respond if anything else surprising
> comes up in this area.

Thanks.  I just picked up the "bankruptcy" step, so we should be
good to go.
Patrick Steinhardt March 10, 2022, 1:58 p.m. UTC | #21
On Mon, Mar 07, 2022 at 08:45:07AM -0500, Derrick Stolee wrote:
> On 3/7/2022 5:34 AM, Patrick Steinhardt wrote:
> > On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
> >> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
> ...
> >>> I will continue investigating and try to reproduce with this
> >>> additional constraint of working across an alternate.
> >>
> >> My attempts to reproduce this across an alternate have failed. I
> >> tried running the following test against Git without these patches,
> >> then verify with the newer version of Git. (I also have generated
> >> a few new layers on top with these patches, and they correctly drop
> >> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
> >>
> >>
> >> test_description='commit-graph with offsets across alternates'
> >> . ./test-lib.sh
> >>
> >> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
> >> then
> >> 	skip_all='skipping 64-bit timestamp tests'
> >> 	test_done
> >> fi
> >>
> >>
> >> UNIX_EPOCH_ZERO="@0 +0000"
> >> FUTURE_DATE="@4147483646 +0000"
> >>
> >> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> >>
> >> test_expect_success 'generate alternate split commit-graph' '
> >> 	git init alternate &&
> >> 	(
> >> 		cd alternate &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
> >> 		test_commit --date "$FUTURE_DATE" 2 &&
> >> 		git commit-graph write --reachable &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
> >> 		test_commit --date "$FUTURE_DATE" 4 &&
> >> 		git commit-graph write --reachable --split=no-merge
> >> 	) &&
> >> 	git clone --shared alternate fork &&
> >> 	(
> >> 		cd fork &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
> >> 		test_commit --date "$FUTURE_DATE" 6 &&
> >> 		git commit-graph write --reachable --split=no-merge &&
> >> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
> >> 		test_commit --date "$FUTURE_DATE" 8 &&
> >> 		git commit-graph write --reachable --split=no-merge
> >> 	)
> >> '
> >>
> >> test_done
> >>
> >>
> >> My testing after running this with -d allows me to reliably see these
> >> layers being created with GDAT and GDOV chunks. Running the 'git
> >> commit-graph verify' command with the new code does not show those
> >> errors, even after adding commits and another layer to the split
> >> commit-graph.
> >>
> >> I look forward to any additional insights you might have here.
> > 
> > I don't really know why, but now I've become unable to reproduce it
> > again. I think we should just go with your patch 5/4 on top -- it does
> > fix the most important issue, which is the `die()` I saw on almost all
> > commands. The second part about the warnings I'm just not sure about,
> > but I don't think it should stop this patch series given my own
> > uncertainty.
> 
> Thanks for following up. I agree that with 5/4 we should be safe.
> 
> I'll remain available to quickly respond if anything else surprising
> comes up in this area.
> 
> Thanks!
> -Stolee

There is another surprise I hit today in the context of generation
numbers. In production, I found the following bug:

    signal: aborted (core dumped): BUG: chunk-format.c:88: expected to write 8 bytes to chunk 47444f56, but wrote 168304 instead

47444f56 is the GENERATION_DATA_OVERFLOW chunk ID, and seemingly the
precomputed size we intended to write was mismatching the data we have
actually been writing to disk. And I think this stems from a mismatch in
how we precompute the number of generation data overflows compared to
how we're actually writing the data to disk:

    - We precompute how many generation number overflows there are in
      `compute_generation_numbers()`. Here we only increment the number
      of overflows in case all parents of a given commit have a non-zero
      generation number and if the generation is bigger than OFFSET_MAX.
      Seemingly we have found only a single commit which matches this
      criteria because we pass `sizeof(timestamp_t) * overflows` as
      expected size, and `sizeof(timestamp_t) == 8`.

    - On the other hand, when we write generation numbers to disk in
      `write_graph_chunk_generation_data_overflow()`, we always write a
      chunk in case its offset is bigger than OFFSET_MAX. So we don't
      care about the parents here, and this seems to extend the number
      of commits which match this criteria to 21038 commits we write
      into the file.

The result is that the sanity check we do where we compare that the
actually written amount of data matches what we expect fails because of
the different ways we count this data.

This time I don't have access to the repository myself, I only tried to
combine what's happening based on the bug message and the code.

Patrick
Derrick Stolee March 10, 2022, 5:18 p.m. UTC | #22
On 3/10/2022 8:58 AM, Patrick Steinhardt wrote:
> On Mon, Mar 07, 2022 at 08:45:07AM -0500, Derrick Stolee wrote:
>> On 3/7/2022 5:34 AM, Patrick Steinhardt wrote:
>>> On Fri, Mar 04, 2022 at 09:03:15AM -0500, Derrick Stolee wrote:
>>>> On 3/3/2022 11:00 AM, Derrick Stolee wrote:
>> ...
>>>>> I will continue investigating and try to reproduce with this
>>>>> additional constraint of working across an alternate.
>>>>
>>>> My attempts to reproduce this across an alternate have failed. I
>>>> tried running the following test against Git without these patches,
>>>> then verify with the newer version of Git. (I also have generated
>>>> a few new layers on top with these patches, and they correctly drop
>>>> the GDA2 and GDO2 chunks when the lower layers "don't have gen v2".)
>>>>
>>>>
>>>> test_description='commit-graph with offsets across alternates'
>>>> . ./test-lib.sh
>>>>
>>>> if ! test_have_prereq TIME_IS_64BIT || ! test_have_prereq TIME_T_IS_64BIT
>>>> then
>>>> 	skip_all='skipping 64-bit timestamp tests'
>>>> 	test_done
>>>> fi
>>>>
>>>>
>>>> UNIX_EPOCH_ZERO="@0 +0000"
>>>> FUTURE_DATE="@4147483646 +0000"
>>>>
>>>> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
>>>>
>>>> test_expect_success 'generate alternate split commit-graph' '
>>>> 	git init alternate &&
>>>> 	(
>>>> 		cd alternate &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
>>>> 		test_commit --date "$FUTURE_DATE" 2 &&
>>>> 		git commit-graph write --reachable &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
>>>> 		test_commit --date "$FUTURE_DATE" 4 &&
>>>> 		git commit-graph write --reachable --split=no-merge
>>>> 	) &&
>>>> 	git clone --shared alternate fork &&
>>>> 	(
>>>> 		cd fork &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 5 &&
>>>> 		test_commit --date "$FUTURE_DATE" 6 &&
>>>> 		git commit-graph write --reachable --split=no-merge &&
>>>> 		test_commit --date "$UNIX_EPOCH_ZERO" 7 &&
>>>> 		test_commit --date "$FUTURE_DATE" 8 &&
>>>> 		git commit-graph write --reachable --split=no-merge
>>>> 	)
>>>> '
>>>>
>>>> test_done
>>>>
>>>>
>>>> My testing after running this with -d allows me to reliably see these
>>>> layers being created with GDAT and GDOV chunks. Running the 'git
>>>> commit-graph verify' command with the new code does not show those
>>>> errors, even after adding commits and another layer to the split
>>>> commit-graph.
>>>>
>>>> I look forward to any additional insights you might have here.
>>>
>>> I don't really know why, but now I've become unable to reproduce it
>>> again. I think we should just go with your patch 5/4 on top -- it does
>>> fix the most important issue, which is the `die()` I saw on almost all
>>> commands. The second part about the warnings I'm just not sure about,
>>> but I don't think it should stop this patch series given my own
>>> uncertainty.
>>
>> Thanks for following up. I agree that with 5/4 we should be safe.
>>
>> I'll remain available to quickly respond if anything else surprising
>> comes up in this area.
>>
>> Thanks!
>> -Stolee
> 
> There is another surprise I hit today in the context of generation
> numbers. In production, I found the following bug:

"In production" makes me think this is on a version of Git without
these patches. Am I right?

>     signal: aborted (core dumped): BUG: chunk-format.c:88: expected to write 8 bytes to chunk 47444f56, but wrote 168304 instead
> 
> 47444f56 is the GENERATION_DATA_OVERFLOW chunk ID, and seemingly the
> precomputed size we intended to write was mismatching the data we have
> actually been writing to disk. And I think this stems from a mismatch in
> how we precompute the number of generation data overflows compared to
> how we're actually writing the data to disk:

Yes, and this count was supposed to be fixed in patch v3 3/5.

The issue is that we skip the increment if the commit was already
parsed, so we would undercount. The fix was to do a loop at the
end focused on counting these overflows.

If you have v3 applied and you still get this error, then we need
to look closely at this issue.

For your production's sake, you might want to set
"commitGraph.generationVersion=1" in your config until you have the
fixes from this series.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index a19bd96c2ee..8e52bb09552 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -407,6 +407,9 @@  struct commit_graph *parse_commit_graph(struct repository *r,
 			&graph->chunk_generation_data);
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
+
+		if (graph->chunk_generation_data)
+			graph->read_generation_data = 1;
 	}
 
 	if (r->settings.commit_graph_read_changed_paths) {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 5ed6d2a21c1..fa9d32facfb 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,7 +48,7 @@  graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
-	options: bloom(1,10,7)
+	options: bloom(1,10,7) read_generation_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index f9bffe38013..1afee1c2705 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -100,11 +100,21 @@  graph_read_expect() {
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
 	fi
+	GENERATION_VERSION=2
+	if test ! -z "$3"
+	then
+		GENERATION_VERSION=$3
+	fi
+	OPTIONS=
+	if test $GENERATION_VERSION -gt 1
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -498,7 +508,7 @@  test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse commits/8 | git -c commitGraph.generationVersion=1 commit-graph write --stdin-commits &&
 	git commit-graph verify >output &&
-	graph_read_expect 9 extra_edges
+	graph_read_expect 9 extra_edges 1
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 778fa418de2..669ddc645fa 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -30,11 +30,16 @@  graph_read_expect() {
 	then
 		NUM_BASE=$2
 	fi
+	OPTIONS=
+	if test -z "$3"
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -624,7 +629,7 @@  test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
-		options:
+		options: read_generation_data
 		EOF
 		test_cmp expect output
 	)