[v3,11/11] doc: add corrected commit date info

Message ID	f6f91af30587ec24e2eee052c89a536cbff42c4f.1597509583.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=NddV=BZ=vger.kernel.org=git-owner@kernel.org> Message-Id: <f6f91af30587ec24e2eee052c89a536cbff42c4f.1597509583.git.gitgitgadget@gmail.com> In-Reply-To: <pull.676.v3.git.1597509583.gitgitgadget@gmail.com> References: <pull.676.v2.git.1596941624.gitgitgadget@gmail.com> <pull.676.v3.git.1597509583.gitgitgadget@gmail.com> From: "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> Date: Sat, 15 Aug 2020 16:39:43 +0000 Subject: [PATCH v3 11/11] doc: add corrected commit date info Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee <stolee@gmail.com>, Jakub =?utf-8?b?TmFyxJlic2tp?= <jnareb@gmail.com>, Taylor Blau <me@ttaylor.com>, Abhishek Kumar <abhishekkumar8222@gmail.com>, Abhishek Kumar <abhishekkumar8222@gmail.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Implement Corrected Commit Date \| expand [v3,00/11,GSoC] Implement Corrected Commit Date [v3,01/11] commit-graph: fix regression when computing bloom filter [v3,02/11] revision: parse parent in indegree_walk_step() [v3,03/11] commit-graph: consolidate fill_commit_graph_info [v3,04/11] commit-graph: consolidate compare_commits_by_gen [v3,05/11] commit-graph: return 64-bit generation number [v3,06/11] commit-graph: add a slab to store topological levels [v3,07/11] commit-graph: implement corrected commit date [v3,08/11] commit-graph: implement generation data chunk [v3,09/11] commit-graph: use generation v2 only if entire chain does [v3,10/11] commit-reach: use corrected commit dates in paint_down_to_common() [v3,11/11] doc: add corrected commit date info

Philippe Blain via GitGitGadget Aug. 15, 2020, 4:39 p.m. UTC

From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 .../technical/commit-graph-format.txt         | 12 ++---
 Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
 2 files changed, 36 insertions(+), 21 deletions(-)

Jakub Narębski Aug. 22, 2020, 10:20 p.m. UTC | #1

Hello,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

All right.

> ---
>  .../technical/commit-graph-format.txt         | 12 ++---
>  Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
>  2 files changed, 36 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 440541045d..71c43884ec 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -4,11 +4,7 @@ Git commit graph format
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
>
> -- The generation number of the commit. Commits with no parents have
> -  generation number 1; commits with parents have generation number
> -  one more than the maximum generation number of its parents. We
> -  reserve zero as special, and can be used to mark a generation
> -  number invalid or as "not computed".
> +- The generation number of the commit.

All right, that was duplicated information.  Now that we need to talk
about two of them, it would not make sense to duplicate that.

>
>  - The root tree OID.
>
> @@ -88,6 +84,12 @@ CHUNK DATA:

Shouldn't we also replace 'generation number' occurences in description
of the Commit Data (CDAT) chunk with either 'topological level' or
'generation number v1'?

>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>
> +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]

It is not exactly 'optional', as it implies that we need to turn it on
(or that we can turn it off).  It is more 'conditional', as it can be
not present due to outside influences (mixed-version environment).

> +    * This list of 4-byte values store corrected commit date offsets for the
> +      commits, arranged in the same order as commit data chunk.

I have just realized purely theoretical, but possible, problem with
storing non-monotinic generation number related values like corrected
commit date offset in constrained space.  There are problems with
clamping them.

Say that somewhere in the ancestry chain there is a commit A with commit
date far in the future by mistake, for example 2120-08-22; it is
important for that date to be not able to be represented using uint32_t.
Say that a later descendant commit B is malformed, and has committer
date of 0, that is 1970-01-01. This means that the corrected commit date
for B must be larger than 2120-08-22 - which for this commit means that
corrected commit date offset do not fit in 32 bits, and must be clamped
(replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.

Say that we have commit C that is child of B, and it has correct commit
date.  Because of mistake in commit A, it has corrected commit date of
more than 2120-08-22 (corrected commit date degenerated into topological
level plus constant).

Now C can reach B, and B can reach A.  However, if we recover corrected
commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
we get a number that is smaller than correct corrected commit date.  We
will have

   gen(A) > date(B) + offset(B) < gen(C)

Which breaks reachability condition guarantee.

If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
we would get

  gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)

And again reachability condition is broken.

This is a very contrived but possible example.  This shouldn't happen,
but ufortunately it can happen.

The question is how to deal with this issue.  Ignore it as unlikely?
Switch to storing corrected commit date, which is monotonic, so if there
is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
larger commit-graph file?

> +    * This list can be later modified to store future generation number related
> +      data.

How can it be later modified?  There is no header, no version number.
How would we add another generation number data?

> +
>    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>        This list of 4-byte values store the second through nth parents for
>        all octopus merges. The second parent value in the commit data stores
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 808fa30b99..f27145328c 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
>
>  Values 1-4 satisfy the requirements of parse_commit_gently().
>
> -Define the "generation number" of a commit recursively as follows:
> +There are two definitions of generation number:
> +1. Corrected committer dates
> +2. Topological levels

Should we add versioning info, that is:

  +1. Corrected committer dates  (generation number v2)
  +2. Topological levels  (generation number v1)

> +
> +Define "corrected committer date" of a commit recursively as follows:
> +
> +  * A commit with no parents (a root commit) has corrected committer date
> +    equal to its committer date.
> +
> +  * A commit with at least one parent has corrected committer date equal to
> +    the maximum of its commiter date and one more than the largest corrected
> +    committer date among its parents.
> +
> +Define the "topological level" of a commit recursively as follows:
>
>   * A commit with no parents (a root commit) has generation number one.

Shouldn't this be

    * A commit with no parents (a root commit) has topological level of one.

>
> - * A commit with at least one parent has generation number one more than
> -   the largest generation number among its parents.
> + * A commit with at least one parent has topological level one more than
> +   the largest topological level among its parents.
>
> -Equivalently, the generation number of a commit A is one more than the
> +Equivalently, the topological level of a commit A is one more than the
>  length of a longest path from A to a root commit. The recursive definition
>  is easier to use for computation and observing the following property:

We should probably explicitly state that the property state applies to
both versions of generation number, not only to topological level.

>
> @@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
>      If A and B are commits with commit time X and Y, respectively, and
>      X < Y, then A _probably_ cannot reach B.
>
> -This heuristic is currently used whenever the computation is allowed to
> -violate topological relationships due to clock skew (such as "git log"
> -with default order), but is not used when the topological order is
> -required (such as merge base calculations, "git log --graph").
> -

To be overly pedantic, this heuristic is still used, but now in much
more rare case.  In addition to what is stated above, at least one layer
in the split commit-graph chain must have been generated by "Old" Git,
for the date heuristic to be used.

But that might be unnecessary level of detail.

>  In practice, we expect some commits to be created recently and not stored
>  in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>
> -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +We use the macro GENERATION_NUMBER_INFINITY to mark commits not

All right.

>  in the commit-graph file. If a commit-graph file was written by a version
>  of Git that did not compute generation numbers, then those commits will
>  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> @@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
>  walking a few extra commits, but the simplicity in dealing with commits
>  with generation number *_INFINITY or *_ZERO is valuable.
>
> -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> -generation numbers are computed to be at least this value. We limit at
> -this value since it is the largest value that can be stored in the
> -commit-graph file using the 30 bits available to generation numbers. This
> -presents another case where a commit can have generation number equal to
> -that of a parent.
> +We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
> +are computed to be at least this value. We limit at this value since it is
> +the largest value that can be stored in the commit-graph file using the
> +available to generation numbers. This presents another case where a
> +commit can have generation number equal to that of a parent.

All right, though it could have been done without re-wrapping, so that
only first line would be marked as changed.

As I wrote, there is theoretical problem with this for offsets.

>
>  Design Details
>  --------------
> @@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
>  number of commits) could be extracted into config settings for full
>  flexibility.
>
> +We also merge commit-graph chains when we try to write a commit graph with
> +two different generation number definitions as they cannot be compared directly.
> +We overwrite the existing chain and create a commit-graph with the newer or more
> +efficient defintion. For example, overwriting topological levels commit graph
> +chain to create a corrected commit dates commit graph chain.
> +

This is more complicated than that.

I think we should explicitly state that Git ensures that in split
commit-graph chain, if there are layers without the GDAT chunk (that
force Git to use topological levels for generation numbers), then they
are top layers.  So if there is commit-graph file created by "Old" Git,
then when addig new layer it would also be GDAT-less.

Now how to write this...

>  ## Deleting graph-{hash} files
>
>  After a new tip file is written, some `graph-{hash}` files may no longer

Best,

Abhishek Kumar Aug. 27, 2020, 6:39 a.m. UTC | #2

On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
> Hello,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> All right.
> 
> > ---
> >  .../technical/commit-graph-format.txt         | 12 ++---
> >  Documentation/technical/commit-graph.txt      | 45 ++++++++++++-------
> >  2 files changed, 36 insertions(+), 21 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index 440541045d..71c43884ec 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -4,11 +4,7 @@ Git commit graph format
> >  The Git commit graph stores a list of commit OIDs and some associated
> >  metadata, including:
> >
> > -- The generation number of the commit. Commits with no parents have
> > -  generation number 1; commits with parents have generation number
> > -  one more than the maximum generation number of its parents. We
> > -  reserve zero as special, and can be used to mark a generation
> > -  number invalid or as "not computed".
> > +- The generation number of the commit.
> 
> All right, that was duplicated information.  Now that we need to talk
> about two of them, it would not make sense to duplicate that.
> 
> >
> >  - The root tree OID.
> >
> > @@ -88,6 +84,12 @@ CHUNK DATA:
> 
> Shouldn't we also replace 'generation number' occurences in description
> of the Commit Data (CDAT) chunk with either 'topological level' or
> 'generation number v1'?

Yes, we should.

> 
> >        2 bits of the lowest byte, storing the 33rd and 34th bit of the
> >        commit time.
> >
> > +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> 
> It is not exactly 'optional', as it implies that we need to turn it on
> (or that we can turn it off).  It is more 'conditional', as it can be
> not present due to outside influences (mixed-version environment).
> 
> > +    * This list of 4-byte values store corrected commit date offsets for the
> > +      commits, arranged in the same order as commit data chunk.
> 
> I have just realized purely theoretical, but possible, problem with
> storing non-monotinic generation number related values like corrected
> commit date offset in constrained space.  There are problems with
> clamping them.
> 
> Say that somewhere in the ancestry chain there is a commit A with commit
> date far in the future by mistake, for example 2120-08-22; it is
> important for that date to be not able to be represented using uint32_t.
> Say that a later descendant commit B is malformed, and has committer
> date of 0, that is 1970-01-01. This means that the corrected commit date
> for B must be larger than 2120-08-22 - which for this commit means that
> corrected commit date offset do not fit in 32 bits, and must be clamped
> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
> 
> Say that we have commit C that is child of B, and it has correct commit
> date.  Because of mistake in commit A, it has corrected commit date of
> more than 2120-08-22 (corrected commit date degenerated into topological
> level plus constant).
> 
> Now C can reach B, and B can reach A.  However, if we recover corrected
> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
> we get a number that is smaller than correct corrected commit date.  We
> will have
> 
>    gen(A) > date(B) + offset(B) < gen(C)
> 
> Which breaks reachability condition guarantee.
> 
> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
> we would get
> 
>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
> 
> And again reachability condition is broken.
> 
> This is a very contrived but possible example.  This shouldn't happen,
> but ufortunately it can happen.
> 

Yes, that's very unfortunate. 

Here's a much simpler example:

A commit P has an reasonable commit date (i.e. after release of Git to
present) D and has a child commit C with committer date 0. Now, the 
corrected commiter date of C would D + 1 and the offset would be same too,
as the committer date is zero. This overflows as reasonable dates are of
the order 2 ^ 34.

> 
> The question is how to deal with this issue.  Ignore it as unlikely?
> Switch to storing corrected commit date, which is monotonic, so if there
> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
> larger commit-graph file?
> 

To be honest, I would prefer storing corrected committer dates over
storing offsets.

While it is 7% of the size of commit-graph file, it is also *only* around
~3.5 MB for a repository of the size of linux kernel (and IIRC
correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

Minimizing space and memory requirements are a top priority, but
shouldn't making sure our program is correct and efficient to be a
greater priority?

I would love to hear your and Dr. Stolee's opinions on this.

> > +    * This list can be later modified to store future generation number related
> > +      data.
> 
> How can it be later modified?  There is no header, no version number.
> How would we add another generation number data?
> 

We could modify the graph version in future. Here's how I think it would
work:

Graph Version 1, No GDAT -> Topological level
Graph Version 2, GDAT    -> Corrected committer dates
Graph Version 3, GDAT    -> Generation number v3

and so on.

Of course, we do not have to update generation number definition for
each graph version.

However, my statement could still be wrong for things that we do not
foresee (similar to how we missed the hard die on different graph version),
so I am removing the statement.

> > +
> >    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> >        This list of 4-byte values store the second through nth parents for
> >        all octopus merges. The second parent value in the commit data stores
> > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> > index 808fa30b99..f27145328c 100644
> > --- a/Documentation/technical/commit-graph.txt
> > +++ b/Documentation/technical/commit-graph.txt
> > @@ -38,14 +38,27 @@ A consumer may load the following info for a commit from the graph:
> >
> >  Values 1-4 satisfy the requirements of parse_commit_gently().
> >
> > -Define the "generation number" of a commit recursively as follows:
> > +There are two definitions of generation number:
> > +1. Corrected committer dates
> > +2. Topological levels
> 
> Should we add versioning info, that is:
> 
>   +1. Corrected committer dates  (generation number v2)
>   +2. Topological levels  (generation number v1)
> 

Yes, added.

> > +
> > +Define "corrected committer date" of a commit recursively as follows:
> > +
> > +  * A commit with no parents (a root commit) has corrected committer date
> > +    equal to its committer date.
> > +
> > +  * A commit with at least one parent has corrected committer date equal to
> > +    the maximum of its commiter date and one more than the largest corrected
> > +    committer date among its parents.
> > +
> > +Define the "topological level" of a commit recursively as follows:
> >
> >   * A commit with no parents (a root commit) has generation number one.
> 
> Shouldn't this be
> 
>     * A commit with no parents (a root commit) has topological level of one.
> 

Thanks, fixed!

> >
> > - * A commit with at least one parent has generation number one more than
> > -   the largest generation number among its parents.
> > + * A commit with at least one parent has topological level one more than
> > +   the largest topological level among its parents.
> >
> > -Equivalently, the generation number of a commit A is one more than the
> > +Equivalently, the topological level of a commit A is one more than the
> >  length of a longest path from A to a root commit. The recursive definition
> >  is easier to use for computation and observing the following property:
> 
> We should probably explicitly state that the property state applies to
> both versions of generation number, not only to topological level.
> 
> >
> > @@ -67,17 +80,12 @@ numbers, the general heuristic is the following:
> >      If A and B are commits with commit time X and Y, respectively, and
> >      X < Y, then A _probably_ cannot reach B.
> >
> > -This heuristic is currently used whenever the computation is allowed to
> > -violate topological relationships due to clock skew (such as "git log"
> > -with default order), but is not used when the topological order is
> > -required (such as merge base calculations, "git log --graph").
> > -
> 
> To be overly pedantic, this heuristic is still used, but now in much
> more rare case.  In addition to what is stated above, at least one layer
> in the split commit-graph chain must have been generated by "Old" Git,
> for the date heuristic to be used.
> 
> But that might be unnecessary level of detail.
> 
> >  In practice, we expect some commits to be created recently and not stored
> >  in the commit graph. We can treat these commits as having "infinite"
> >  generation number and walk until reaching commits with known generation
> >  number.
> >
> > -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> > +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
> 
> All right.
> 
> >  in the commit-graph file. If a commit-graph file was written by a version
> >  of Git that did not compute generation numbers, then those commits will
> >  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> > @@ -93,12 +101,11 @@ fully-computed generation numbers. Using strict inequality may result in
> >  walking a few extra commits, but the simplicity in dealing with commits
> >  with generation number *_INFINITY or *_ZERO is valuable.
> >
> > -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> > -generation numbers are computed to be at least this value. We limit at
> > -this value since it is the largest value that can be stored in the
> > -commit-graph file using the 30 bits available to generation numbers. This
> > -presents another case where a commit can have generation number equal to
> > -that of a parent.
> > +We use the macro GENERATION_NUMBER_MAX for commits whose generation numbers
> > +are computed to be at least this value. We limit at this value since it is
> > +the largest value that can be stored in the commit-graph file using the
> > +available to generation numbers. This presents another case where a
> > +commit can have generation number equal to that of a parent.
> 
> All right, though it could have been done without re-wrapping, so that
> only first line would be marked as changed.
> 
> As I wrote, there is theoretical problem with this for offsets.
> 
> >
> >  Design Details
> >  --------------
> > @@ -267,6 +274,12 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> >  number of commits) could be extracted into config settings for full
> >  flexibility.
> >
> > +We also merge commit-graph chains when we try to write a commit graph with
> > +two different generation number definitions as they cannot be compared directly.
> > +We overwrite the existing chain and create a commit-graph with the newer or more
> > +efficient defintion. For example, overwriting topological levels commit graph
> > +chain to create a corrected commit dates commit graph chain.
> > +
> 
> This is more complicated than that.
> 
> I think we should explicitly state that Git ensures that in split
> commit-graph chain, if there are layers without the GDAT chunk (that
> force Git to use topological levels for generation numbers), then they
> are top layers.  So if there is commit-graph file created by "Old" Git,
> then when addig new layer it would also be GDAT-less.
> 
> Now how to write this...

Thinking about this, I feel creating a new section called "Handling
Mixed Generation Number Chains" made more sense:

  ## Handling Mixed Generation Number Chains

  With the introduction of generation number v2 and generation data chunk,
  the following scenario is possible:

  1. "New" Git writes a commit-graph with a GDAT chunk.
  2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

  The commits in the lower layer will be interpreted as having very large
  generation values (commit date plus offset) compared to the generation
  numbers in the top layer (toplogical level). This violates the
  expectation that the generation of a parent is strictly smaller than the
  generation of a child. In such cases, we revert to using topological
  levels for all layers to maintain backwards compatability.

  When writing a new layer in split commit-graph, we write a GDAT chunk
  only if the topmost layer has a GDAT chunk. This guarantees that if a
  lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.

  Rewriting layers follows similar approach: if the topmost layer below
  set of layers being rewriteen (in the split commit-graph chain) exists,
  and it does not contain GDAT chunk, then the result of rewrite does not
  have GDAT chunks either.

> 
> >  ## Deleting graph-{hash} files
> >
> >  After a new tip file is written, some `graph-{hash}` files may no longer
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek

Jakub Narębski Aug. 27, 2020, 12:43 p.m. UTC | #3

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]

>>> +    * This list of 4-byte values store corrected commit date offsets for the
>>> +      commits, arranged in the same order as commit data chunk.
>>
>> I have just realized purely theoretical, but possible, problem with
>> storing non-monotinic generation number related values like corrected
>> commit date offset in constrained space.  There are problems with
>> clamping them.
>>
>> Say that somewhere in the ancestry chain there is a commit A with commit
>> date far in the future by mistake, for example 2120-08-22; it is
>> important for that date to be not able to be represented using uint32_t.
>> Say that a later descendant commit B is malformed, and has committer
>> date of 0, that is 1970-01-01. This means that the corrected commit date
>> for B must be larger than 2120-08-22 - which for this commit means that
>> corrected commit date offset do not fit in 32 bits, and must be clamped
>> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
>>
>> Say that we have commit C that is child of B, and it has correct commit
>> date.  Because of mistake in commit A, it has corrected commit date of
>> more than 2120-08-22 (corrected commit date degenerated into topological
>> level plus constant).
>>
>> Now C can reach B, and B can reach A.  However, if we recover corrected
>> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
>> we get a number that is smaller than correct corrected commit date.  We
>> will have
>>
>>    gen(A) > date(B) + offset(B) < gen(C)
>>
>> Which breaks reachability condition guarantee.
>>
>> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
>> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
>> we would get
>>
>>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
>>
>> And again reachability condition is broken.
>>
>> This is a very contrived but possible example.  This shouldn't happen,
>> but ufortunately it can happen.
>>
>
> Yes, that's very unfortunate.
>
> Here's a much simpler example:
>
> A commit P has an reasonable commit date (i.e. after release of Git to
> present) D and has a child commit C with committer date 0. Now, the
> corrected commiter date of C would D + 1 and the offset would be same too,
> as the committer date is zero. This overflows as reasonable dates are of
> the order 2 ^ 34.

No, we need the value of date D that doesn't fit in 2^32 _unsigned_ value,
so it needs to be even more in the future than Y2k38 (2038-01-19 03:14:07),
which is related to storing date as a _signed_ 32-bit integer

The current-ish Unix epoch time is 1598524281 - let's use it for value
of D.  Then the offset for commit C would be 1598524282.  The current
proposal uses 32 bits to store commit date offsets (as unsigned value).
The maximum value of offset that we can store is therefore 2^32 - 1,
which is 4294967295.

   corrected commit date offset(C) = 1,598,524,282
   GENERATION_NUMBER_V2_MAX        = 4,294,967,295

As you can see there is no overflow in the simplified example.

>>
>> The question is how to deal with this issue.  Ignore it as unlikely?
>> Switch to storing corrected commit date, which is monotonic, so if there
>> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
>> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
>> larger commit-graph file?
>>
>
> To be honest, I would prefer storing corrected committer dates over
> storing offsets.
>
> While it is 7% of the size of commit-graph file, it is also *only* around
> ~3.5 MB for a repository of the size of linux kernel (and IIRC
> correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

It is up to 7% of per-commit data, and it doesn't take into account EDGE
chunk (for octopus merges), and it doesn't also take into account the
size of changed-paths Bloom filters data take in the commit-graph.

> Minimizing space and memory requirements are a top priority, but
> shouldn't making sure our program is correct and efficient to be a
> greater priority?

On the other hand the case where we would encounter offsets that do not
fit in uint32_t is extremply unlikely in sane repositories.

I can think of three solutions:

1. use 64-bit corrected commit dates
   - advantages:
     * simplest code,
     * no need for overflow handling, as we can store all possible values
       of timestamp_t
   - disadvantages:
     * commit-graph size increased by up to 7%

2. use 32-bit corrected commit date offsets,
   but simply do not store GDAT chunk if there is offset that would not
   fit in 32-bit wide field
   - advantages:
     * commit-graph is smaller
     * relatively simple overflow handling
   - disadvantages:
     * performance penalty (generation number v1 vs v2) for abnormal
       repositories (with overflow not fitting in uint32_t)
     * tests would be needed to exercise the overflow code

3. use 32-bit for corrected commit date offset,
   with oveflow handling, for example using most significant bit
   to denote that other bits store position into offset overflow
   with 64-bits for those offsets that do not fit in 31-bits
   - advantages:
     * commit-graph is smaller, increasing for abnormal repos
   - disadvantages:
     * most complex code of all proposed solutions
     * smaller overflow limit of 2^31 - 1
     * tests would be needed to exercise the overflow code

I think because the situation where we encounter overflow in 32-bit
corrected commit date offset is rare, we should go with either 1 or 2
solution.

> I would love to hear your and Dr. Stolee's opinions on this.

I have CC-ed Junio C Hamano to ask for his opinion.

>>> +    * This list can be later modified to store future generation number related
>>> +      data.
>>
>> How can it be later modified?  There is no header, no version number.
>> How would we add another generation number data?
>>
>
> We could modify the graph version in future. Here's how I think it would
> work:
>
> Graph Version 1, No GDAT -> Topological level
> Graph Version 2, GDAT    -> Corrected committer dates
> Graph Version 3, GDAT    -> Generation number v3
>
> and so on.
>
> Of course, we do not have to update generation number definition for
> each graph version.

So it was about generic mechanism, not something specific to the GDAT chunk.

> However, my statement could still be wrong for things that we do not
> foresee (similar to how we missed the hard die on different graph version),
> so I am removing the statement.

Good.

[...]
>>> +We also merge commit-graph chains when we try to write a commit graph with
>>> +two different generation number definitions as they cannot be compared directly.
>>> +We overwrite the existing chain and create a commit-graph with the newer or more
>>> +efficient defintion. For example, overwriting topological levels commit graph
>>> +chain to create a corrected commit dates commit graph chain.
>>> +
>>
>> This is more complicated than that.
>>
>> I think we should explicitly state that Git ensures that in split
>> commit-graph chain, if there are layers without the GDAT chunk (that
>> force Git to use topological levels for generation numbers), then they
>> are top layers.  So if there is commit-graph file created by "Old" Git,
>> then when addig new layer it would also be GDAT-less.
>>
>> Now how to write this...
>
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
>
>   ## Handling Mixed Generation Number Chains
>
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
>
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.
>
>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

Good idea, and nice writeup.

Best,

Derrick Stolee Aug. 27, 2020, 1:15 p.m. UTC | #4

On 8/27/2020 2:39 AM, Abhishek Kumar wrote:
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
> 
>   ## Handling Mixed Generation Number Chains
> 
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
> 
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

I like the idea of this section, and this setup is good.

>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.

s/toplogical/topological

But also, we don't want to phrase this as "in this case, we do the wrong
thing" but instead

  A naive approach of using the newest available generation number from
  each layer would lead to violated expectations: the lower layer would
  use corrected commit dates which are much larger than the topological
  levels of the higher layer. For this reason, Git inspects each layer
  to see if any layer is missing corrected commit dates. In such a case,
  Git only uses topological levels.

>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.

s/lyer/layer

Perhaps leaving this at a higher level than referencing "GDAT chunk" is
advisable. Perhaps use "we write corrected commit dates" or "all lower
layers must store corrected commit dates as well", for example.

>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

This could use more positive language to make it clear that sometimes
we _do_ want to write corrected commit dates when merging layers:

  When merging layers, we do not consider whether the merged layers had
  corrected commit dates. Instead, the new layer will have corrected
  commit dates if and only if all existing layers below the new layer
  have corrected commit dates.

Thanks,
-Stolee

Abhishek Kumar Sept. 1, 2020, 1:01 p.m. UTC | #5

On Thu, Aug 27, 2020 at 09:15:56AM -0400, Derrick Stolee wrote:
> On 8/27/2020 2:39 AM, Abhishek Kumar wrote:
> > Thinking about this, I feel creating a new section called "Handling
> > Mixed Generation Number Chains" made more sense:
> > 
> >   ## Handling Mixed Generation Number Chains
> > 
> >   With the introduction of generation number v2 and generation data chunk,
> >   the following scenario is possible:
> > 
> >   1. "New" Git writes a commit-graph with a GDAT chunk.
> >   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> I like the idea of this section, and this setup is good.
> 
> >   The commits in the lower layer will be interpreted as having very large
> >   generation values (commit date plus offset) compared to the generation
> >   numbers in the top layer (toplogical level). This violates the
> >   expectation that the generation of a parent is strictly smaller than the
> >   generation of a child. In such cases, we revert to using topological
> >   levels for all layers to maintain backwards compatability.
> 
> s/toplogical/topological
> 
> But also, we don't want to phrase this as "in this case, we do the wrong
> thing" but instead
> 
>   A naive approach of using the newest available generation number from
>   each layer would lead to violated expectations: the lower layer would
>   use corrected commit dates which are much larger than the topological
>   levels of the higher layer. For this reason, Git inspects each layer
>   to see if any layer is missing corrected commit dates. In such a case,
>   Git only uses topological levels.
> 
> >   When writing a new layer in split commit-graph, we write a GDAT chunk
> >   only if the topmost layer has a GDAT chunk. This guarantees that if a
> >   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
> 
> s/lyer/layer
> 
> Perhaps leaving this at a higher level than referencing "GDAT chunk" is
> advisable. Perhaps use "we write corrected commit dates" or "all lower
> layers must store corrected commit dates as well", for example.
> 
> >   Rewriting layers follows similar approach: if the topmost layer below
> >   set of layers being rewriteen (in the split commit-graph chain) exists,
> >   and it does not contain GDAT chunk, then the result of rewrite does not
> >   have GDAT chunks either.
> 
> This could use more positive language to make it clear that sometimes
> we _do_ want to write corrected commit dates when merging layers:
> 
>   When merging layers, we do not consider whether the merged layers had
>   corrected commit dates. Instead, the new layer will have corrected
>   commit dates if and only if all existing layers below the new layer
>   have corrected commit dates.

Thanks, that is a great suggestion! Using positive language is more
straightforward and easier to understand.

> 
> Thanks,
> -Stolee

[v3,11/11] doc: add corrected commit date info

Commit Message

Comments

Patch