mbox series

[0/6,GSoC] Implement Corrected Commit Date

Message ID pull.676.git.1595927632.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Implement Corrected Commit Date | expand

Message

Bruce Perry via GitGitGadget July 28, 2020, 9:13 a.m. UTC
This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Here's what left for the PR (which I intend to take on with the second
version of pull request):

 1. Add an option to skip writing generation data chunk (to test whether new
    Git works without GDAT as intended).
 2. Handle writing to commit-graph for mismatched version (that is, merging
    all graphs into a new graph with a GDAT chunk).
 3. Update technical documentation.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
unrelated to my code changes. Still need to investigate further.

Abhishek Kumar (6):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: implement generation data chunk
  commit-graph: implement corrected commit date offset

 blame.c                       |   2 +-
 commit-graph.c                | 181 +++++++++++++++++++++-------------
 commit-graph.h                |   7 +-
 commit-reach.c                |  47 +++------
 commit-reach.h                |   2 +-
 commit.c                      |   9 +-
 commit.h                      |   3 +
 revision.c                    |  17 ++--
 t/helper/test-read-graph.c    |   2 +
 t/t4216-log-bloom.sh          |   4 +-
 t/t5000-tar-tree.sh           |   4 +-
 t/t5318-commit-graph.sh       |  21 ++--
 t/t5324-split-commit-graph.sh |  12 +--
 upload-pack.c                 |   2 +-
 14 files changed, 178 insertions(+), 135 deletions(-)


base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Comments

Taylor Blau July 28, 2020, 2:54 p.m. UTC | #1
Hi Abhishek,

On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Very exciting. I have been eagerly following your blog and asking
Stolee about your progress, so I am excited to read these patches.

> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

Interestingly, we use a very similar metric at GitHub to sort commits in
various UI views which have lots of existing machinery that sorts
an abstract collection by each element's "date". Since that sort is
stable, and we want to respect the order that Git delivered, we take the
pairwise max of each successive pair of commits.

> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

I'm sure that I'll learn more when I get to this point, but I would like
to hear more about why you want to store the offset rather than the
corrected commit date itself. It seems that the offset could be either
positive or negative, so you'd only have the range of a signed integer
(rather than storing 8 bytes of a time_t for the full breadth of
possibilities).

I know also that Peff is working on negative timestamp support, so I
would want to hear about what he thinks of this, too.

> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
>
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This will be good to gradually roll-out the new chunk. Another thought
is to control whether or not the commit-graph machinery _reads_ this
chunk if it's present. That can be useful for debugging too (eg., I have
a commit-graph with a GDAT chunk that is broken in some way, what
happens if I don't read that chunk?)

Maybe something like `commitgraph.readsGenerationData`? Incidentally,
I'm preparing a `commitgraph.readsChangedPaths` to control whether or
not we read the Bloom index and data chunks. I'll send that to the list
shortly (it's in my fork somewhere if you want an earlier look), but
that may be a useful reference for you.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).
>  3. Update technical documentation.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> The build fails for t9807-git-p4-submit.sh on osx-clang, which I feel is
> unrelated to my code changes. Still need to investigate further.
>
> Abhishek Kumar (6):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: implement generation data chunk
>   commit-graph: implement corrected commit date offset
>
>  blame.c                       |   2 +-
>  commit-graph.c                | 181 +++++++++++++++++++++-------------
>  commit-graph.h                |   7 +-
>  commit-reach.c                |  47 +++------
>  commit-reach.h                |   2 +-
>  commit.c                      |   9 +-
>  commit.h                      |   3 +
>  revision.c                    |  17 ++--
>  t/helper/test-read-graph.c    |   2 +
>  t/t4216-log-bloom.sh          |   4 +-
>  t/t5000-tar-tree.sh           |   4 +-
>  t/t5318-commit-graph.sh       |  21 ++--
>  t/t5324-split-commit-graph.sh |  12 +--
>  upload-pack.c                 |   2 +-
>  14 files changed, 178 insertions(+), 135 deletions(-)
>
>
> base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
> --
> gitgitgadget

Thanks,
Taylor
Derrick Stolee July 28, 2020, 4:35 p.m. UTC | #2
On 7/28/2020 5:13 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
> 
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
> 
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.
> 
> Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
> 
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
> 
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents. Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.
> 
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.
> 
> Here's what left for the PR (which I intend to take on with the second
> version of pull request):
> 
>  1. Add an option to skip writing generation data chunk (to test whether new
>     Git works without GDAT as intended).

This would be a good idea, if only as a GIT_TEST_* environment variable.
I think it important we have a test for the compatibility scenario where
we have an "old" commit-graph with the new code and test that reading and
writing still works properly.

>  2. Handle writing to commit-graph for mismatched version (that is, merging
>     all graphs into a new graph with a GDAT chunk).

This is an excellent thing to do. There are a few options when writing an
incremental commit-graph when the base graphs do not have the GDAT chunk:

   i. Do not write the GDAT chunk unless we are merging all levels
      (based on the merging strategy).

  ii. Merge all levels, then write the GDAT chunk.

>  3. Update technical documentation.

Yes, I was going to ask for a patch that updates
Documentation/technical/commit-graph-format.txt.

This is an excellent v1. A lot of small things, but no
really big issues.

Thanks,
-Stolee
Abhishek Kumar July 30, 2020, 7:47 a.m. UTC | #3
On Tue, Jul 28, 2020 at 10:54:58AM -0400, Taylor Blau wrote:
> Hi Abhishek,
> 
> On Tue, Jul 28, 2020 at 09:13:45AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> Very exciting. I have been eagerly following your blog and asking
> Stolee about your progress, so I am excited to read these patches.
> 

I am so glad to hear that!

> 
> [...]
> 
> I'm sure that I'll learn more when I get to this point, but I would like
> to hear more about why you want to store the offset rather than the
> corrected commit date itself. It seems that the offset could be either
> positive or negative, so you'd only have the range of a signed integer
> (rather than storing 8 bytes of a time_t for the full breadth of
> possibilities).
> 

Corrected commit dates are at least as big as the committer date, so the
offset (i.e. corrected date - committer date) would never be negative.

We store offsets instead of corrected commit dates because:
- We save 4 bytes for each commit, which amounts to 7-8% of the size of
  commit graph file (of course, dependent on the other chunks used).
- We save some time while writing the commit-graph file too, around
  ~200ms for the Linux repo.

While the savings are modest, writing corrected dates does not offer any
advantage that we could think of, at the time.

> I know also that Peff is working on negative timestamp support, so I
> would want to hear about what he thinks of this, too.

I have read up on Peff's work with negative timestamp support and it's
pretty exciting.

> [...]

Thanks
- Abhishek