diff mbox series

[5/7] commit-graph: document file format v2

Message ID 7f9b65bd22551fd7fd5d2f0bf18aee8c25f1db99.1645735117.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Commit-graph: Generation Number v2 Fixes, v3 implementation | expand

Commit Message

Derrick Stolee Feb. 24, 2022, 8:38 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The corrected commit date was first documented in 5a3b130ca (doc: add
corrected commit date info, 2021-01-16) and it used an optional chunk to
augment the commit-graph format without modifying the file format
version.

One major benefit to this approach is that corrected commit dates could
be written without causing a backwards compatibility issue with Git
versions that do not understand them. The topological level was still
available in the CDAT chunk as it was before.

However, this causes a different issue: more data needs to be loaded
from disk when parsing commits from the commit-graph. In cases where
there is no significant algorithmic gain from using corrected commit
dates, commit walks take up to 20% longer because of this extra data.

Create a new file format version for the commit-graph format that
differs only in the CDAT chunk: it now stores corrected commit date
offsets. This brings our data back to normal and will demonstrate
performance gains in almost all cases.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 .../technical/commit-graph-format.txt         | 22 ++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

Comments

Junio C Hamano Feb. 24, 2022, 10:55 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> The corrected commit date was first documented in 5a3b130ca (doc: add
> corrected commit date info, 2021-01-16) and it used an optional chunk to
> augment the commit-graph format without modifying the file format
> version.
>
> One major benefit to this approach is that corrected commit dates could
> be written without causing a backwards compatibility issue with Git
> versions that do not understand them. The topological level was still
> available in the CDAT chunk as it was before.
>
> However, this causes a different issue: more data needs to be loaded
> from disk when parsing commits from the commit-graph. In cases where
> there is no significant algorithmic gain from using corrected commit
> dates, commit walks take up to 20% longer because of this extra data.
>
> Create a new file format version for the commit-graph format that
> differs only in the CDAT chunk: it now stores corrected commit date
> offsets. This brings our data back to normal and will demonstrate
> performance gains in almost all cases.

OK.
Ævar Arnfjörð Bjarmason Feb. 25, 2022, 10:31 p.m. UTC | #2
On Thu, Feb 24 2022, Derrick Stolee via GitGitGadget wrote:

> The corrected commit date was first documented in 5a3b130ca (doc: add
> corrected commit date info, 2021-01-16) and it used an optional chunk to
> augment the commit-graph format without modifying the file format
> version.
>
> One major benefit to this approach is that corrected commit dates could
> be written without causing a backwards compatibility issue with Git
> versions that do not understand them. The topological level was still
> available in the CDAT chunk as it was before.
>
> However, this causes a different issue: more data needs to be loaded
> from disk when parsing commits from the commit-graph. In cases where
> there is no significant algorithmic gain from using corrected commit
> dates, commit walks take up to 20% longer because of this extra data.
>
> Create a new file format version for the commit-graph format that
> differs only in the CDAT chunk: it now stores corrected commit date
> offsets. This brings our data back to normal and will demonstrate
> performance gains in almost all cases.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  .../technical/commit-graph-format.txt         | 22 ++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 87971c27dd7..2cb48993314 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -36,7 +36,7 @@ HEADER:
>        The signature is: {'C', 'G', 'P', 'H'}
>  
>    1-byte version number:
> -      Currently, the only valid version is 1.
> +      This version number can be 1 or 2.
>  
>    1-byte Hash Version
>        We infer the hash length (H) from this value:
> @@ -85,13 +85,22 @@ CHUNK DATA:
>        position. If there are more than two parents, the second value
>        has its most-significant bit on and the other bits store an array
>        position into the Extra Edge List chunk.
> -    * The next 8 bytes store the topological level (generation number v1)
> -      of the commit and
> -      the commit time in seconds since EPOCH. The generation number
> -      uses the higher 30 bits of the first 4 bytes, while the commit
> +    * The next 8 bytes store the generation number information of the
> +      commit and the commit time in seconds since EPOCH. The generation
> +      number uses the higher 30 bits of the first 4 bytes, while the commit
>        time uses the 32 bits of the second 4 bytes, along with the lowest
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
> +      - If the commit-graph file format is version 1, then the higher 30
> +	bits contain the topological level (generation number v1) for the
> +	commit.
> +      - If the commit-graph file format is version 2, then the higher 30
> +	bits contain the corrected commit date offset (generation number
> +	v2) for the commit, except if the offset cannot be stored within
> +	29 bits. If the offset is too large for 29 bits, then the value
> +	stored here has its most-significant bit on and the other bits
> +	store the position of the corrected commit date in the Generation
> +	Date Overflow chunk.
>  
>    Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
>      * This list of 4-byte values store corrected commit date offsets for the
> @@ -103,6 +112,9 @@ CHUNK DATA:
>      * Generation Data chunk is present only when commit-graph file is written
>        by compatible versions of Git and in case of split commit-graph chains,
>        the topmost layer also has Generation Data chunk.
> +    * This chunk does not exist if the commit-graph file format version is 2,
> +      because the corrected commit date offset data is stored in the Commit
> +      Data chunk.
>  
>    Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
>      * This list of 8-byte values stores the corrected commit date offsets

We talked a while ago now about how we do commit-graph format changes
and this is partially echoing those earlier questions[1] from 2019.

I fully understand why we're writing this amended CDAT chunk in a
different layout. By not having the GDAT side-chunk to look up in the
data is more local, that part of the file is more compact etc.

What I don't understand is why getting those performance improvements
requires the breaking version change & the writing of the incompatible
version number.

I.e. couldn't the differently formatted CDAT chunk be written instead to a new
chunk name (say "2DAT") instead? Per [1] we'd pay a small fixed cost for
a possibly empty chunk (I didn't re-do those numbers), but surely the
performance improvements will be about the same for that miniscule
overhead.

It will give you something you can't have here, which is optional
compatibility with older clients by writing both versions. That'll be a
~2x as large file on disk, but with the page cache & each client version
skipping to the data it needs caching characteristics & data locality
should work out to about the same thing.

Or maybe they won't. I just found it surprising when reviewing this to
not find an answer to why that approach wasn't
considered.

E.g. 76ffbca71a9 (commit-graph: write Bloom filters to commit graph
file, 2020-04-06) is a commit adding such new optional and
backwards-compatible data.

1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
Derrick Stolee Feb. 28, 2022, 1:44 p.m. UTC | #3
On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Feb 24 2022, Derrick Stolee via GitGitGadget wrote:
> 
...
>>    Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
>>      * This list of 4-byte values store corrected commit date offsets for the
>> @@ -103,6 +112,9 @@ CHUNK DATA:
>>      * Generation Data chunk is present only when commit-graph file is written
>>        by compatible versions of Git and in case of split commit-graph chains,
>>        the topmost layer also has Generation Data chunk.
>> +    * This chunk does not exist if the commit-graph file format version is 2,
>> +      because the corrected commit date offset data is stored in the Commit
>> +      Data chunk.
>>  
>>    Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
>>      * This list of 8-byte values stores the corrected commit date offsets
> 
> We talked a while ago now about how we do commit-graph format changes
> and this is partially echoing those earlier questions[1] from 2019.
> 
> I fully understand why we're writing this amended CDAT chunk in a
> different layout. By not having the GDAT side-chunk to look up in the
> data is more local, that part of the file is more compact etc.
> 
> What I don't understand is why getting those performance improvements
> requires the breaking version change & the writing of the incompatible
> version number.
> 
> I.e. couldn't the differently formatted CDAT chunk be written instead to a new
> chunk name (say "2DAT") instead? Per [1] we'd pay a small fixed cost for
> a possibly empty chunk (I didn't re-do those numbers), but surely the
> performance improvements will be about the same for that miniscule
> overhead.

CDAT is a required chunk. It is part of the v1 spec that CDAT exists
and is correct. All other Git clients will error out when reading a
"v1" graph without such a chunk, and in a way that is less helpful to
users. Instead of clearly indicating "file version is too new" it will
say "commit-graph is missing the Commit Data chunk" which is not
helpful.

> It will give you something you can't have here, which is optional
> compatibility with older clients by writing both versions. That'll be a
> ~2x as large file on disk, but with the page cache & each client version
> skipping to the data it needs caching characteristics & data locality
> should work out to about the same thing.

Writing both is the only way that this could work without incrementing
the graph version number, but I'd rather just update the number and
avoid wasting the effort to write that extra data.

It seems you are hyper-focused on "we don't _need_ to update the version
number" and you are willing to recommend wasteful approaches in order to
support that stance.

So: you're right. We don't _need_ to update the version number. But this
is the best choice among the options available.

> Or maybe they won't. I just found it surprising when reviewing this to
> not find an answer to why that approach wasn't
> considered.

The point is to create a new format that can be chosen when deployed
in an environment where older Git versions will not exist (such as
a Git server). The new version is not chosen by default and instead
is opt-in through the commitGraph.generationVersion config option.

Perhaps in a year or two we would consider making this the new
default, but there is no rush to do so.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Feb. 28, 2022, 2:27 p.m. UTC | #4
On Mon, Feb 28 2022, Derrick Stolee wrote:

> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Thu, Feb 24 2022, Derrick Stolee via GitGitGadget wrote:
>> 
> ...
>>>    Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
>>>      * This list of 4-byte values store corrected commit date offsets for the
>>> @@ -103,6 +112,9 @@ CHUNK DATA:
>>>      * Generation Data chunk is present only when commit-graph file is written
>>>        by compatible versions of Git and in case of split commit-graph chains,
>>>        the topmost layer also has Generation Data chunk.
>>> +    * This chunk does not exist if the commit-graph file format version is 2,
>>> +      because the corrected commit date offset data is stored in the Commit
>>> +      Data chunk.
>>>  
>>>    Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
>>>      * This list of 8-byte values stores the corrected commit date offsets
>> 
>> We talked a while ago now about how we do commit-graph format changes
>> and this is partially echoing those earlier questions[1] from 2019.
>> 
>> I fully understand why we're writing this amended CDAT chunk in a
>> different layout. By not having the GDAT side-chunk to look up in the
>> data is more local, that part of the file is more compact etc.
>> 
>> What I don't understand is why getting those performance improvements
>> requires the breaking version change & the writing of the incompatible
>> version number.
>> 
>> I.e. couldn't the differently formatted CDAT chunk be written instead to a new
>> chunk name (say "2DAT") instead? Per [1] we'd pay a small fixed cost for
>> a possibly empty chunk (I didn't re-do those numbers), but surely the
>> performance improvements will be about the same for that miniscule
>> overhead.
>
> CDAT is a required chunk. It is part of the v1 spec that CDAT exists
> and is correct. All other Git clients will error out when reading a
> "v1" graph without such a chunk, and in a way that is less helpful to
> users. Instead of clearly indicating "file version is too new" it will
> say "commit-graph is missing the Commit Data chunk" which is not
> helpful.

Yes. That would be the worst of both worlds.

I thought the reference to the 2019-era post made it clear (which is
explicit about this aspect), but I'm talking about writing one of:

 A. An empty chunk
 B. Keeping a "stale" chunk around (as we re-write the graph)
 C. Duplicate writes of new/old chunks.

And not simply omitting the CDAT chunk. As you point out would give you
all the drawbacks of a version number change, with none of the benefits.

I haven't re-tested this now, but at the time doing any of (A..C) would
work smoothly for older clients, while giving newer ones improved data.

>> It will give you something you can't have here, which is optional
>> compatibility with older clients by writing both versions. That'll be a
>> ~2x as large file on disk, but with the page cache & each client version
>> skipping to the data it needs caching characteristics & data locality
>> should work out to about the same thing.
>
> Writing both is the only way that this could work without incrementing
> the graph version number, but I'd rather just update the number and
> avoid wasting the effort to write that extra data.

...

> It seems you are hyper-focused on "we don't _need_ to update the version
> number" and you are willing to recommend wasteful approaches in order to
> support that stance.

I'd say less hyper-focused, and more clarifying an IMO major unstated
trade-off of the proposed format change.

> So: you're right. We don't _need_ to update the version number. But this
> is the best choice among the options available.

...

>> Or maybe they won't. I just found it surprising when reviewing this to
>> not find an answer to why that approach wasn't
>> considered.
>
> The point is to create a new format that can be chosen when deployed
> in an environment where older Git versions will not exist (such as
> a Git server). The new version is not chosen by default and instead
> is opt-in through the commitGraph.generationVersion config option.
>
> Perhaps in a year or two we would consider making this the new
> default, but there is no rush to do so.

Looking into this a bit more I think that in either case this is less of
a big deal after my 43d35618055 (commit-graph write: don't die if the
existing graph is corrupt, 2019-03-25), which came out of some of those
discussions at the time of [1].

I.e. now a client that only understands version N-1 will warn when
loading it, wheras it's only if a pre-v2.22.0 client (which has that
commit) reads the repository that we'd hard die on it, correct?

But speaking of hyper-focus. I think that arguably applies to you in
this case when considering the trade-offs of these sorts of format
changes :)

I.e. you're primarily considering cases of say a git server (presumably
running on GitHub) or another such deployment where it's easy to have
full control over all of your versions "in the wild".

And thus a three-phase rollout of something like a format change can be
done in a timely and predictable manner.

But git is used by *a lot* of people in a bunch of different
scenarios. E.g.:

 * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
 * A tarred-up directory including a .git, which may be transferred to
   a machine with a pre-v2.22.0 version.

Or even softer cases of failure, such as:

 * A cronjob causes an alert/incident somewhere because the server 
   operator started writing a new version, but forgot about a set
   of machines that are still on the old version.

I think that even if it's less conceptually clean it's worth considering
being over backwards to be kinder to such use-cases, unless it's really
required for other reasons to break such in-the-wild use-cases.

Or in this case, if it's thought to be worth it to help reviewers decide
by separating the performance improvement aspect from the changed
interaction between new graphs and older clients.

As a further nit on the proposed end-state here: Do I understand it
correctly that commitGraph.generationVersion=[1|2] (i.e. on current
"master") will always result in a file that's compatible with older
versions, since the only thing "v2" there controls now is to write the
optional GDAT and GDOV chunks?

Whereas going from commitGraph.generationVersion=2 to
commitGraph.generationVersion=3 in this series will impact older clients
as noted above, since we're bumping the version (of the file, to 2 if
the config is 3, which as Junio noted is a bit confusing).

I think if you're set on going down the path of bumping the top-level
version that deserves to be made much clearer in the added
documentation. Right now the only hint to that is a passing mention that
for v3:

    [it] will be incompatible with some old versions of Git

Which if we're opting for breaking format changes really should note
some of the caveats above, that pre-v2.22.0 hard-dies, and probably
describe "some old versions of Git" a bit more clearly.

It actually means once this gets released "the git version that was the
latest one you could download yesterday". Which a reader of the docs
probably won't expect when starting to play with this in mixed-version
environment.

1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
Derrick Stolee Feb. 28, 2022, 4:39 p.m. UTC | #5
On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Feb 28 2022, Derrick Stolee wrote:
> 
>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:

>>> Or maybe they won't. I just found it surprising when reviewing this to
>>> not find an answer to why that approach wasn't
>>> considered.
>>
>> The point is to create a new format that can be chosen when deployed
>> in an environment where older Git versions will not exist (such as
>> a Git server). The new version is not chosen by default and instead
>> is opt-in through the commitGraph.generationVersion config option.
>>
>> Perhaps in a year or two we would consider making this the new
>> default, but there is no rush to do so.
> 
> Looking into this a bit more I think that in either case this is less of
> a big deal after my 43d35618055 (commit-graph write: don't die if the
> existing graph is corrupt, 2019-03-25), which came out of some of those
> discussions at the time of [1].
> 
> I.e. now a client that only understands version N-1 will warn when
> loading it, wheras it's only if a pre-v2.22.0 client (which has that
> commit) reads the repository that we'd hard die on it, correct?
> 
> But speaking of hyper-focus. I think that arguably applies to you in
> this case when considering the trade-offs of these sorts of format
> changes :)
> 
> I.e. you're primarily considering cases of say a git server (presumably
> running on GitHub) or another such deployment where it's easy to have
> full control over all of your versions "in the wild".

I'm thinking of servers, yes, but also 99% of clients who only upgrade
(or _maybe_ downgrade to a recent, previous version occasionally).
 
> And thus a three-phase rollout of something like a format change can be
> done in a timely and predictable manner.
> 
> But git is used by *a lot* of people in a bunch of different
> scenarios. E.g.:
> 
>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>  * A tarred-up directory including a .git, which may be transferred to
>    a machine with a pre-v2.22.0 version.
> 
> Or even softer cases of failure, such as:
> 
>  * A cronjob causes an alert/incident somewhere because the server 
>    operator started writing a new version, but forgot about a set
>    of machines that are still on the old version.

It is important to continue supporting these cases, and this change does
not cause any issues for them. However, this handful of corner cases
should not block progress in the main cases.

> I think that even if it's less conceptually clean it's worth considering
> being over backwards to be kinder to such use-cases, unless it's really
> required for other reasons to break such in-the-wild use-cases.
> 
> Or in this case, if it's thought to be worth it to help reviewers decide
> by separating the performance improvement aspect from the changed
> interaction between new graphs and older clients.
> 
> As a further nit on the proposed end-state here: Do I understand it
> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
> "master") will always result in a file that's compatible with older
> versions, since the only thing "v2" there controls now is to write the
> optional GDAT and GDOV chunks?
> 
> Whereas going from commitGraph.generationVersion=2 to
> commitGraph.generationVersion=3 in this series will impact older clients
> as noted above, since we're bumping the version (of the file, to 2 if
> the config is 3, which as Junio noted is a bit confusing).
> 
> I think if you're set on going down the path of bumping the top-level
> version that deserves to be made much clearer in the added
> documentation. Right now the only hint to that is a passing mention that
> for v3:
> 
>     [it] will be incompatible with some old versions of Git
> 
> Which if we're opting for breaking format changes really should note
> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
> describe "some old versions of Git" a bit more clearly.
> 
> It actually means once this gets released "the git version that was the
> latest one you could download yesterday". Which a reader of the docs
> probably won't expect when starting to play with this in mixed-version
> environment.
> 
> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/

This documentation could be altered to be specific about versions,
but such a specific change makes assumptions of the version that will
include it. As of now, the generation number v2 fixes will _probably_
get in for 2.36 and the format change would have enough time to cook
for 2.37, so I'll update the docs to refer to that version explicitly.

The pre-2.22.0 change might be helpful to mention, but it could also be
noise to the reader. We can revisit this when these patches are
submitted again in another thread. There's also concern about third-
party tools like libgit2. I'd rather draw the line as "tread carefully
here" than "here is so much information that a reader might think it
is all they need to know".

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Feb. 28, 2022, 9:14 p.m. UTC | #6
On Mon, Feb 28 2022, Derrick Stolee wrote:

> On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>> 
>>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>
>>>> Or maybe they won't. I just found it surprising when reviewing this to
>>>> not find an answer to why that approach wasn't
>>>> considered.
>>>
>>> The point is to create a new format that can be chosen when deployed
>>> in an environment where older Git versions will not exist (such as
>>> a Git server). The new version is not chosen by default and instead
>>> is opt-in through the commitGraph.generationVersion config option.
>>>
>>> Perhaps in a year or two we would consider making this the new
>>> default, but there is no rush to do so.
>> 
>> Looking into this a bit more I think that in either case this is less of
>> a big deal after my 43d35618055 (commit-graph write: don't die if the
>> existing graph is corrupt, 2019-03-25), which came out of some of those
>> discussions at the time of [1].
>> 
>> I.e. now a client that only understands version N-1 will warn when
>> loading it, wheras it's only if a pre-v2.22.0 client (which has that
>> commit) reads the repository that we'd hard die on it, correct?
>> 
>> But speaking of hyper-focus. I think that arguably applies to you in
>> this case when considering the trade-offs of these sorts of format
>> changes :)
>> 
>> I.e. you're primarily considering cases of say a git server (presumably
>> running on GitHub) or another such deployment where it's easy to have
>> full control over all of your versions "in the wild".
>
> I'm thinking of servers, yes, but also 99% of clients who only upgrade
> (or _maybe_ downgrade to a recent, previous version occasionally).

*nod*

>> And thus a three-phase rollout of something like a format change can be
>> done in a timely and predictable manner.
>> 
>> But git is used by *a lot* of people in a bunch of different
>> scenarios. E.g.:
>> 
>>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>>  * A tarred-up directory including a .git, which may be transferred to
>>    a machine with a pre-v2.22.0 version.
>> 
>> Or even softer cases of failure, such as:
>> 
>>  * A cronjob causes an alert/incident somewhere because the server 
>>    operator started writing a new version, but forgot about a set
>>    of machines that are still on the old version.
>
> It is important to continue supporting these cases, and this change does
> not cause any issues for them.

The issues in those cases will range from warnings on older versions
when loading the graph to errors if it's pre-v2.22.0, with the
performance benefits v3 placing them out of range of v2-only clients.

I think arguable that's OK/worth it, but it's "not [any] issues", no?

> However, this handful of corner cases should not block progress in the
> main cases.

What progress would be blocked?

I'm only talking about whether we choose to consider a "new graph" to be an:

    <existing version number>
    <existing chunk name (old content, possibly empty)>
    <new chunk name (new content)>

v.s.:

    <old/new version number>
    <existing chunk name old/new (incompatible) content>

I.e. the "progress" this series is about is in getting the data locality
with smaller data with the new content.

But that's also possible to get with a very low amount of fixed-overhead.

Per the referenced E-Mail an "empty" commit-graph file was ~1k bytes in
2019, I haven't re-checked. In terms of wasted space it's miniscule &
<1/4 of one FS page on Linux.

I'm not just trying to rehash the same points, I *think* the version
bump is just an aesthetic choice & we're not getting any performance
difference out of that.

But I'm not sure from the "block progress" etc., so maybe I'm still
missing something...

>> I think that even if it's less conceptually clean it's worth considering
>> being over backwards to be kinder to such use-cases, unless it's really
>> required for other reasons to break such in-the-wild use-cases.
>> 
>> Or in this case, if it's thought to be worth it to help reviewers decide
>> by separating the performance improvement aspect from the changed
>> interaction between new graphs and older clients.
>> 
>> As a further nit on the proposed end-state here: Do I understand it
>> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
>> "master") will always result in a file that's compatible with older
>> versions, since the only thing "v2" there controls now is to write the
>> optional GDAT and GDOV chunks?
>> 
>> Whereas going from commitGraph.generationVersion=2 to
>> commitGraph.generationVersion=3 in this series will impact older clients
>> as noted above, since we're bumping the version (of the file, to 2 if
>> the config is 3, which as Junio noted is a bit confusing).
>> 
>> I think if you're set on going down the path of bumping the top-level
>> version that deserves to be made much clearer in the added
>> documentation. Right now the only hint to that is a passing mention that
>> for v3:
>> 
>>     [it] will be incompatible with some old versions of Git
>> 
>> Which if we're opting for breaking format changes really should note
>> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
>> describe "some old versions of Git" a bit more clearly.
>> 
>> It actually means once this gets released "the git version that was the
>> latest one you could download yesterday". Which a reader of the docs
>> probably won't expect when starting to play with this in mixed-version
>> environment.
>> 
>> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
>
> This documentation could be altered to be specific about versions,
> but such a specific change makes assumptions of the version that will
> include it. As of now, the generation number v2 fixes will _probably_
> get in for 2.36 and the format change would have enough time to cook
> for 2.37, so I'll update the docs to refer to that version explicitly.

...

> The pre-2.22.0 change might be helpful to mention, but it could also be
> noise to the reader. We can revisit this when these patches are
> submitted again in another thread. There's also concern about third-
> party tools like libgit2. I'd rather draw the line as "tread carefully
> here" than "here is so much information that a reader might think it
> is all they need to know".

In terms of concern about libgit2 or any other implementation (which I
haven't looked at) isn't "tread carefully" to do it with new chunks if
possible, which we've done before with BIDX/BDAT, v.s. a version bump we
haven't done?

I'd think it wouldn't be an issue either way for any reader of the
format, and libgit2 is more specialized & won't have someone on RHEL6 or
whatever trying to inspect a random repo.

It just seems like a win-win to have a performance improvement with
smooth backwards compatibility v.s. without, if that's possible.
Derrick Stolee March 1, 2022, 2:19 p.m. UTC | #7
On 2/28/2022 4:14 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Feb 28 2022, Derrick Stolee wrote:
> 
>> On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>>>
>>>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>>>>> Or maybe they won't. I just found it surprising when reviewing this to
>>>>> not find an answer to why that approach wasn't
>>>>> considered.
>>>>
>>>> The point is to create a new format that can be chosen when deployed
>>>> in an environment where older Git versions will not exist (such as
>>>> a Git server). The new version is not chosen by default and instead
>>>> is opt-in through the commitGraph.generationVersion config option.
>>>>
>>>> Perhaps in a year or two we would consider making this the new
>>>> default, but there is no rush to do so.
>>>
>>> Looking into this a bit more I think that in either case this is less of
>>> a big deal after my 43d35618055 (commit-graph write: don't die if the
>>> existing graph is corrupt, 2019-03-25), which came out of some of those
>>> discussions at the time of [1].
>>>
>>> I.e. now a client that only understands version N-1 will warn when
>>> loading it, wheras it's only if a pre-v2.22.0 client (which has that
>>> commit) reads the repository that we'd hard die on it, correct?
>>>
>>> But speaking of hyper-focus. I think that arguably applies to you in
>>> this case when considering the trade-offs of these sorts of format
>>> changes :)
>>>
>>> I.e. you're primarily considering cases of say a git server (presumably
>>> running on GitHub) or another such deployment where it's easy to have
>>> full control over all of your versions "in the wild".
>>
>> I'm thinking of servers, yes, but also 99% of clients who only upgrade
>> (or _maybe_ downgrade to a recent, previous version occasionally).
> 
> *nod*
> 
>>> And thus a three-phase rollout of something like a format change can be
>>> done in a timely and predictable manner.
>>>
>>> But git is used by *a lot* of people in a bunch of different
>>> scenarios. E.g.:
>>>
>>>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>>>  * A tarred-up directory including a .git, which may be transferred to
>>>    a machine with a pre-v2.22.0 version.
>>>
>>> Or even softer cases of failure, such as:
>>>
>>>  * A cronjob causes an alert/incident somewhere because the server 
>>>    operator started writing a new version, but forgot about a set
>>>    of machines that are still on the old version.
>>
>> It is important to continue supporting these cases, and this change does
>> not cause any issues for them.
> 
> The issues in those cases will range from warnings on older versions
> when loading the graph to errors if it's pre-v2.22.0, with the
> performance benefits v3 placing them out of range of v2-only clients.
> 
> I think arguable that's OK/worth it, but it's "not [any] issues", no?

What I mean is that this change does not enable the new graph version
by default, so these users do not have any issues unless someone opts
in to the feature while in this mixed scenario.

>> However, this handful of corner cases should not block progress in the
>> main cases.
> 
> What progress would be blocked?
> 
> I'm only talking about whether we choose to consider a "new graph" to be an:
> 
>     <existing version number>
>     <existing chunk name (old content, possibly empty)>
>     <new chunk name (new content)>
> 
> v.s.:
> 
>     <old/new version number>
>     <existing chunk name old/new (incompatible) content>
> 
> I.e. the "progress" this series is about is in getting the data locality
> with smaller data with the new content.
> 
> But that's also possible to get with a very low amount of fixed-overhead.
> 
> Per the referenced E-Mail an "empty" commit-graph file was ~1k bytes in
> 2019, I haven't re-checked. In terms of wasted space it's miniscule &
> <1/4 of one FS page on Linux.

If you're talking "empty" data, then you need to have an empty Commit
Data chunk _and_ and empty OID Lookup chunk in order to not have
breakage. So you'd need duplicate versions of these chunks for the
new "Commit Data 2" chunk. Then we need special-casing for all of this
during parsing that is unnecessary complexity.

Finally, the end result becomes "older versions get slower without
any warning" instead of "older versions get a message about not
understanding the commit-graph file".
 
> I'm not just trying to rehash the same points, I *think* the version
> bump is just an aesthetic choice & we're not getting any performance
> difference out of that.
> 
> But I'm not sure from the "block progress" etc., so maybe I'm still
> missing something...

The fact that we have a Generation Data chunk instead of already
bumping the file format version number is already a concession to
this concern about backwards compatibility.

With the point above about empty Commit Data Chunks, the only way
to properly conserve backwards compatibility is to have a full
Commit Data Chunk as well as a second copy that contains the new
offsets instead of topological levels. This is wasteful.

>>> I think that even if it's less conceptually clean it's worth considering
>>> being over backwards to be kinder to such use-cases, unless it's really
>>> required for other reasons to break such in-the-wild use-cases.
>>>
>>> Or in this case, if it's thought to be worth it to help reviewers decide
>>> by separating the performance improvement aspect from the changed
>>> interaction between new graphs and older clients.
>>>
>>> As a further nit on the proposed end-state here: Do I understand it
>>> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
>>> "master") will always result in a file that's compatible with older
>>> versions, since the only thing "v2" there controls now is to write the
>>> optional GDAT and GDOV chunks?
>>>
>>> Whereas going from commitGraph.generationVersion=2 to
>>> commitGraph.generationVersion=3 in this series will impact older clients
>>> as noted above, since we're bumping the version (of the file, to 2 if
>>> the config is 3, which as Junio noted is a bit confusing).
>>>
>>> I think if you're set on going down the path of bumping the top-level
>>> version that deserves to be made much clearer in the added
>>> documentation. Right now the only hint to that is a passing mention that
>>> for v3:
>>>
>>>     [it] will be incompatible with some old versions of Git
>>>
>>> Which if we're opting for breaking format changes really should note
>>> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
>>> describe "some old versions of Git" a bit more clearly.
>>>
>>> It actually means once this gets released "the git version that was the
>>> latest one you could download yesterday". Which a reader of the docs
>>> probably won't expect when starting to play with this in mixed-version
>>> environment.
>>>
>>> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
>>
>> This documentation could be altered to be specific about versions,
>> but such a specific change makes assumptions of the version that will
>> include it. As of now, the generation number v2 fixes will _probably_
>> get in for 2.36 and the format change would have enough time to cook
>> for 2.37, so I'll update the docs to refer to that version explicitly.
> 
> ...
> 
>> The pre-2.22.0 change might be helpful to mention, but it could also be
>> noise to the reader. We can revisit this when these patches are
>> submitted again in another thread. There's also concern about third-
>> party tools like libgit2. I'd rather draw the line as "tread carefully
>> here" than "here is so much information that a reader might think it
>> is all they need to know".
> 
> In terms of concern about libgit2 or any other implementation (which I
> haven't looked at) isn't "tread carefully" to do it with new chunks if
> possible, which we've done before with BIDX/BDAT, v.s. a version bump we
> haven't done?

New chunks adding new information is part of the design. Changing
the location of existing data is new here.

> I'd think it wouldn't be an issue either way for any reader of the
> format, and libgit2 is more specialized & won't have someone on RHEL6 or
> whatever trying to inspect a random repo.
> 
> It just seems like a win-win to have a performance improvement with
> smooth backwards compatibility v.s. without, if that's possible.

You are right that it is _possible_, but I don't think that the
side-effects are worth it. Those being:

* "Empty CDAT Chunk": Silently slowing down older clients.
* "Duplicate CDAT Chunk": Wasted data.

Finally, I want to reiterate that by making this opt-in, users make
the call about whether or not they are in a scenario where this
compatibility issue is appropriate for them. This includes waiting
to see if third-party tools like libgit2 are updated to understand
this version.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason March 1, 2022, 2:29 p.m. UTC | #8
On Tue, Mar 01 2022, Derrick Stolee wrote:

> On 2/28/2022 4:14 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>> 
>>> On 2/28/2022 9:27 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>>>>
>>>>> On 2/25/2022 5:31 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>>>>> Or maybe they won't. I just found it surprising when reviewing this to
>>>>>> not find an answer to why that approach wasn't
>>>>>> considered.
>>>>>
>>>>> The point is to create a new format that can be chosen when deployed
>>>>> in an environment where older Git versions will not exist (such as
>>>>> a Git server). The new version is not chosen by default and instead
>>>>> is opt-in through the commitGraph.generationVersion config option.
>>>>>
>>>>> Perhaps in a year or two we would consider making this the new
>>>>> default, but there is no rush to do so.
>>>>
>>>> Looking into this a bit more I think that in either case this is less of
>>>> a big deal after my 43d35618055 (commit-graph write: don't die if the
>>>> existing graph is corrupt, 2019-03-25), which came out of some of those
>>>> discussions at the time of [1].
>>>>
>>>> I.e. now a client that only understands version N-1 will warn when
>>>> loading it, wheras it's only if a pre-v2.22.0 client (which has that
>>>> commit) reads the repository that we'd hard die on it, correct?
>>>>
>>>> But speaking of hyper-focus. I think that arguably applies to you in
>>>> this case when considering the trade-offs of these sorts of format
>>>> changes :)
>>>>
>>>> I.e. you're primarily considering cases of say a git server (presumably
>>>> running on GitHub) or another such deployment where it's easy to have
>>>> full control over all of your versions "in the wild".
>>>
>>> I'm thinking of servers, yes, but also 99% of clients who only upgrade
>>> (or _maybe_ downgrade to a recent, previous version occasionally).
>> 
>> *nod*
>> 
>>>> And thus a three-phase rollout of something like a format change can be
>>>> done in a timely and predictable manner.
>>>>
>>>> But git is used by *a lot* of people in a bunch of different
>>>> scenarios. E.g.:
>>>>
>>>>  * A shared (hopefully read-only) NFS mounted by remote "unmanaged" clients.
>>>>  * A tarred-up directory including a .git, which may be transferred to
>>>>    a machine with a pre-v2.22.0 version.
>>>>
>>>> Or even softer cases of failure, such as:
>>>>
>>>>  * A cronjob causes an alert/incident somewhere because the server 
>>>>    operator started writing a new version, but forgot about a set
>>>>    of machines that are still on the old version.
>>>
>>> It is important to continue supporting these cases, and this change does
>>> not cause any issues for them.
>> 
>> The issues in those cases will range from warnings on older versions
>> when loading the graph to errors if it's pre-v2.22.0, with the
>> performance benefits v3 placing them out of range of v2-only clients.
>> 
>> I think arguable that's OK/worth it, but it's "not [any] issues", no?
>
> What I mean is that this change does not enable the new graph version
> by default, so these users do not have any issues unless someone opts
> in to the feature while in this mixed scenario.

Indeed. FWIW I wasn't confused about that bit. I'm just commenting on
/how/ we do version upgrades, and if we can save users unnecessary
hassle relatively easily.

But I also think the writing is on the wall that you'll want to
(understandably) bump the default sooner than later, or if not for this
data for other future chunks.

>>> However, this handful of corner cases should not block progress in the
>>> main cases.
>> 
>> What progress would be blocked?
>> 
>> I'm only talking about whether we choose to consider a "new graph" to be an:
>> 
>>     <existing version number>
>>     <existing chunk name (old content, possibly empty)>
>>     <new chunk name (new content)>
>> 
>> v.s.:
>> 
>>     <old/new version number>
>>     <existing chunk name old/new (incompatible) content>
>> 
>> I.e. the "progress" this series is about is in getting the data locality
>> with smaller data with the new content.
>> 
>> But that's also possible to get with a very low amount of fixed-overhead.
>> 
>> Per the referenced E-Mail an "empty" commit-graph file was ~1k bytes in
>> 2019, I haven't re-checked. In terms of wasted space it's miniscule &
>> <1/4 of one FS page on Linux.
>
> If you're talking "empty" data, then you need to have an empty Commit
> Data chunk _and_ and empty OID Lookup chunk in order to not have
> breakage. So you'd need duplicate versions of these chunks for the
> new "Commit Data 2" chunk. Then we need special-casing for all of this
> during parsing that is unnecessary complexity.

Why does it need to be special-cased? Don't we just call pair_chunk() on
the new chunk name, and if it doesn't exist fall back on the old
chunk. We'll then note what format we're parsing, just as this series
does.

> Finally, the end result becomes "older versions get slower without
> any warning" instead of "older versions get a message about not
> understanding the commit-graph file".

Sure, IF you want to write such an empty chunk. The point is that you
now have the option.

And this is the same edge case we already dealt with for
GDAT/GDOV. I.e. older readers who didn't understand it would be slower.

We can still have a feature to make older clients intentionally
break/warn, it seems to me that if you'd want such a thing you'd want it
aside from the specific mechanism of this proposed upgrade.

Or you could dual-write the data for older clients, which I think
probably isn't worth the hassle.

I.e. if you're worried about silent slowdown older clients happily
ignoring the BIDX and BDAT chunks are silently slower.

>> I'm not just trying to rehash the same points, I *think* the version
>> bump is just an aesthetic choice & we're not getting any performance
>> difference out of that.
>> 
>> But I'm not sure from the "block progress" etc., so maybe I'm still
>> missing something...
>
> The fact that we have a Generation Data chunk instead of already
> bumping the file format version number is already a concession to
> this concern about backwards compatibility.

Sure, but not taking that version bumping route in the past shouldn't
bias us towards doing it now, should it?

> With the point above about empty Commit Data Chunks, the only way
> to properly conserve backwards compatibility is to have a full
> Commit Data Chunk as well as a second copy that contains the new
> offsets instead of topological levels. This is wasteful.

Empty chunks would be a handful of bytes, and not produce those
errors/warnings, and AFAICT without any downsides.

But that's me assuming that the overlap between people for whom the
commit-graph is critical for performance and those using wildly
different versions is pretty much zero.

The reason I mentioned it at all initially in
https://lore.kernel.org/git/220228.86pmn73toq.gmgdl@evledraar.gmail.com/
was in reference to trying to understand the context of the performance
gains, i.e. whether they'd be equivalent with a new chunk or dual-write
data.

>>>> I think that even if it's less conceptually clean it's worth considering
>>>> being over backwards to be kinder to such use-cases, unless it's really
>>>> required for other reasons to break such in-the-wild use-cases.
>>>>
>>>> Or in this case, if it's thought to be worth it to help reviewers decide
>>>> by separating the performance improvement aspect from the changed
>>>> interaction between new graphs and older clients.
>>>>
>>>> As a further nit on the proposed end-state here: Do I understand it
>>>> correctly that commitGraph.generationVersion=[1|2] (i.e. on current
>>>> "master") will always result in a file that's compatible with older
>>>> versions, since the only thing "v2" there controls now is to write the
>>>> optional GDAT and GDOV chunks?
>>>>
>>>> Whereas going from commitGraph.generationVersion=2 to
>>>> commitGraph.generationVersion=3 in this series will impact older clients
>>>> as noted above, since we're bumping the version (of the file, to 2 if
>>>> the config is 3, which as Junio noted is a bit confusing).
>>>>
>>>> I think if you're set on going down the path of bumping the top-level
>>>> version that deserves to be made much clearer in the added
>>>> documentation. Right now the only hint to that is a passing mention that
>>>> for v3:
>>>>
>>>>     [it] will be incompatible with some old versions of Git
>>>>
>>>> Which if we're opting for breaking format changes really should note
>>>> some of the caveats above, that pre-v2.22.0 hard-dies, and probably
>>>> describe "some old versions of Git" a bit more clearly.
>>>>
>>>> It actually means once this gets released "the git version that was the
>>>> latest one you could download yesterday". Which a reader of the docs
>>>> probably won't expect when starting to play with this in mixed-version
>>>> environment.
>>>>
>>>> 1. https://lore.kernel.org/git/87h8acivkh.fsf@evledraar.gmail.com/
>>>
>>> This documentation could be altered to be specific about versions,
>>> but such a specific change makes assumptions of the version that will
>>> include it. As of now, the generation number v2 fixes will _probably_
>>> get in for 2.36 and the format change would have enough time to cook
>>> for 2.37, so I'll update the docs to refer to that version explicitly.
>> 
>> ...
>> 
>>> The pre-2.22.0 change might be helpful to mention, but it could also be
>>> noise to the reader. We can revisit this when these patches are
>>> submitted again in another thread. There's also concern about third-
>>> party tools like libgit2. I'd rather draw the line as "tread carefully
>>> here" than "here is so much information that a reader might think it
>>> is all they need to know".
>> 
>> In terms of concern about libgit2 or any other implementation (which I
>> haven't looked at) isn't "tread carefully" to do it with new chunks if
>> possible, which we've done before with BIDX/BDAT, v.s. a version bump we
>> haven't done?
>
> New chunks adding new information is part of the design. Changing
> the location of existing data is new here.

We've never bumped the top-level version number, and hard dying on "git
status" or whatever was also part of the initial design :)

I think it's legitimate to ask/argue that these version number bumps are
something we should be reserving for truly incompatible format bumps,
v.s. "new indexes".

I.e. this case is similar to us having a SQL database with N tables, and
we'd like to add a new table or index.

We could have a central "schema_version", or we could just add a new
table in a backwards-compatiable way. Older clients read older
data/tables, which is possibly empty.

The commit-graph is essentially such a key-value store when it comes to
top-level chunks, and we're already making decisions on what data to
load/use based on chunk existence.

>> I'd think it wouldn't be an issue either way for any reader of the
>> format, and libgit2 is more specialized & won't have someone on RHEL6 or
>> whatever trying to inspect a random repo.
>> 
>> It just seems like a win-win to have a performance improvement with
>> smooth backwards compatibility v.s. without, if that's possible.
>
> You are right that it is _possible_, but I don't think that the
> side-effects are worth it. Those being:
>
> * "Empty CDAT Chunk": Silently slowing down older clients.
> * "Duplicate CDAT Chunk": Wasted data.
>
> Finally, I want to reiterate that by making this opt-in, users make
> the call about whether or not they are in a scenario where this
> compatibility issue is appropriate for them. This includes waiting
> to see if third-party tools like libgit2 are updated to understand
> this version.

I think we're probably not getting any further here & this back & forth,
it's certainly been interesting, and thanks a lot for your patience and
time.

I'll try to look at some of this once the "prep" patches land.

If you'll end up doing this via the version route I'm not going to
strongly object to it or anything, I was just trying to review this to
see if I understood the trade-offs & constraints involved. In particular
with reference to those changes in 2019 that I did to make
format/corruption/version transitions non-fatal.

Thanks!
Derrick Stolee March 1, 2022, 3:59 p.m. UTC | #9
On 3/1/2022 9:29 AM, Ævar Arnfjörð Bjarmason wrote:

I agree that this discussion has mostly run its course and I'll do my
best to summarize it in the commit messages of a future patch series.

I just wanted to focus on two things in the most-recent reply:

> On Tue, Mar 01 2022, Derrick Stolee wrote:
> 
>> On 2/28/2022 4:14 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Feb 28 2022, Derrick Stolee wrote:
>>> I think arguable that's OK/worth it, but it's "not [any] issues", no?
>>
>> What I mean is that this change does not enable the new graph version
>> by default, so these users do not have any issues unless someone opts
>> in to the feature while in this mixed scenario.
> 
> Indeed. FWIW I wasn't confused about that bit. I'm just commenting on
> /how/ we do version upgrades, and if we can save users unnecessary
> hassle relatively easily.
> 
> But I also think the writing is on the wall that you'll want to
> (understandably) bump the default sooner than later, or if not for this
> data for other future chunks.

I think somewhere I said we wouldn't want to update this default for
at least a year after it ships, but I'm also happy to never update it
and let experts opt-in when they want.

>> Finally, the end result becomes "older versions get slower without
>> any warning" instead of "older versions get a message about not
>> understanding the commit-graph file".
> 
> Sure, IF you want to write such an empty chunk. The point is that you
> now have the option.
> 
> And this is the same edge case we already dealt with for
> GDAT/GDOV. I.e. older readers who didn't understand it would be slower.
> 
> We can still have a feature to make older clients intentionally
> break/warn, it seems to me that if you'd want such a thing you'd want it
> aside from the specific mechanism of this proposed upgrade.
> 
> Or you could dual-write the data for older clients, which I think
> probably isn't worth the hassle.
> 
> I.e. if you're worried about silent slowdown older clients happily
> ignoring the BIDX and BDAT chunks are silently slower.

Older clients ignoring BIDX and BDAT chunks means they are silently
slower than newer clients, but they are still as fast as they were
yesterday. The empty chunk approach will make those older Git versions
slower than they were yesterday.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 87971c27dd7..2cb48993314 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -36,7 +36,7 @@  HEADER:
       The signature is: {'C', 'G', 'P', 'H'}
 
   1-byte version number:
-      Currently, the only valid version is 1.
+      This version number can be 1 or 2.
 
   1-byte Hash Version
       We infer the hash length (H) from this value:
@@ -85,13 +85,22 @@  CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the topological level (generation number v1)
-      of the commit and
-      the commit time in seconds since EPOCH. The generation number
-      uses the higher 30 bits of the first 4 bytes, while the commit
+    * The next 8 bytes store the generation number information of the
+      commit and the commit time in seconds since EPOCH. The generation
+      number uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
+      - If the commit-graph file format is version 1, then the higher 30
+	bits contain the topological level (generation number v1) for the
+	commit.
+      - If the commit-graph file format is version 2, then the higher 30
+	bits contain the corrected commit date offset (generation number
+	v2) for the commit, except if the offset cannot be stored within
+	29 bits. If the offset is too large for 29 bits, then the value
+	stored here has its most-significant bit on and the other bits
+	store the position of the corrected commit date in the Generation
+	Date Overflow chunk.
 
   Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
     * This list of 4-byte values store corrected commit date offsets for the
@@ -103,6 +112,9 @@  CHUNK DATA:
     * Generation Data chunk is present only when commit-graph file is written
       by compatible versions of Git and in case of split commit-graph chains,
       the topmost layer also has Generation Data chunk.
+    * This chunk does not exist if the commit-graph file format version is 2,
+      because the corrected commit date offset data is stored in the Commit
+      Data chunk.
 
   Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
     * This list of 8-byte values stores the corrected commit date offsets