diff mbox series

[1/6] commit-graph: fix regression when computing bloom filter

Message ID 91e6e97a66aff88e0b860e34659dddc3396c7f28.1595927632.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Implement Corrected Commit Date | expand

Commit Message

Derrick Stolee via GitGitGadget July 28, 2020, 9:13 a.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With 3d112755 (commit-graph: examine commits by generation number), Git
knew to sort by generation number before examining the diff when not
using pack order. c49c82aa (commit: move members graph_pos, generation
to a slab, 2020-06-17) moved generation number into a slab and
introduced a helper which returns GENERATION_NUMBER_INFINITY when
writing the graph. Sorting is no longer useful and essentially reverts
the earlier commit.

Let's fix this by accessing generation number directly through the slab.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Taylor Blau July 28, 2020, 3:28 p.m. UTC | #1
On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.

This last sentence is slightly confusing. Do you think it would be more
clear if you said elaborated a bit? Perhaps something like:

  [...]

  commit_gen_cmp is used when writing a commit-graph to sort commits in
  generation order before computing Bloom filters. Since c49c82aa made
  it so that 'commit_graph_generation()' returns
  'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
  this function. Instead, access the generation number directly through
  the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
  order to access it while writing.

I think the above would be a good extra paragraph in the commit message
provided that you remove the sentence beginning with "Sorting is no
longer useful..."

> Let's fix this by accessing generation number directly through the slab.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +

Nit; this whitespace diff is extraneous, but it's not hurting anything
either. Since it looks like you're rerolling anyway, it would be good to
just get rid of it.

Otherwise this fix makes sense to me.

>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;
> --
> gitgitgadget

Thanks,
Taylor
Abhishek Kumar July 30, 2020, 5:24 a.m. UTC | #2
On Tue, Jul 28, 2020 at 11:28:44AM -0400, Taylor Blau wrote:
> On Tue, Jul 28, 2020 at 09:13:46AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> 
> This last sentence is slightly confusing. Do you think it would be more
> clear if you said elaborated a bit? Perhaps something like:
> 
>   [...]
> 
>   commit_gen_cmp is used when writing a commit-graph to sort commits in
>   generation order before computing Bloom filters. Since c49c82aa made
>   it so that 'commit_graph_generation()' returns
>   'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within
>   this function. Instead, access the generation number directly through
>   the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in
>   order to access it while writing.
> 

Thanks! That is clearer. Will change.

> I think the above would be a good extra paragraph in the commit message
> provided that you remove the sentence beginning with "Sorting is no
> longer useful..."
> 
> > Let's fix this by accessing generation number directly through the slab.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;
> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> 
> Nit; this whitespace diff is extraneous, but it's not hurting anything
> either. Since it looks like you're rerolling anyway, it would be good to
> just get rid of it.
> 
> Otherwise this fix makes sense to me.
> 
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
> > --
> > gitgitgadget
> 
> Thanks,
> Taylor
Jakub Narębski Aug. 4, 2020, 12:46 a.m. UTC | #3
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With 3d112755 (commit-graph: examine commits by generation number), Git
> knew to sort by generation number before examining the diff when not
> using pack order. c49c82aa (commit: move members graph_pos, generation
> to a slab, 2020-06-17) moved generation number into a slab and
> introduced a helper which returns GENERATION_NUMBER_INFINITY when
> writing the graph. Sorting is no longer useful and essentially reverts
> the earlier commit.
>
> Let's fix this by accessing generation number directly through the slab.

It looks like unfortunate and unforeseen consequence of putting together
graph position and generation number in the commit_graph_data struct.
During writing of the commit-graph file generation number is computed,
but graph position is undefined (yet), and commit_graph_generation()
uses graph_pos field to find if the data for commit is initialized;
in this case wrongly.

Anyway, when writing the commit graph we first compute generation
number, then (if requested) the changed-paths Bloom filter.  Skipping
the unnecessary check is a good thing... assuming that commit_gen_cmp()
is used only when writing the commit graph, and not when traversing it
(because then some commits may not have generation number set, and maybe
even do not have any data on the commit slab) - which is the case.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..5d3c9bd23c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c

We might want to add function comment either here or in the header that
this comparisonn function is to be used only for `git commit-graph
write`, and not for graph traversal (even if similar funnction exists in
other modules).

> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> +
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
--
Jakub Narębski
Taylor Blau Aug. 4, 2020, 12:56 a.m. UTC | #4
On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With 3d112755 (commit-graph: examine commits by generation number), Git
> > knew to sort by generation number before examining the diff when not
> > using pack order. c49c82aa (commit: move members graph_pos, generation
> > to a slab, 2020-06-17) moved generation number into a slab and
> > introduced a helper which returns GENERATION_NUMBER_INFINITY when
> > writing the graph. Sorting is no longer useful and essentially reverts
> > the earlier commit.
> >
> > Let's fix this by accessing generation number directly through the slab.
>
> It looks like unfortunate and unforeseen consequence of putting together
> graph position and generation number in the commit_graph_data struct.
> During writing of the commit-graph file generation number is computed,
> but graph position is undefined (yet), and commit_graph_generation()
> uses graph_pos field to find if the data for commit is initialized;
> in this case wrongly.
>
> Anyway, when writing the commit graph we first compute generation
> number, then (if requested) the changed-paths Bloom filter.  Skipping
> the unnecessary check is a good thing... assuming that commit_gen_cmp()
> is used only when writing the commit graph, and not when traversing it
> (because then some commits may not have generation number set, and maybe
> even do not have any data on the commit slab) - which is the case.
>
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1af68c297d..5d3c9bd23c 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
>
> We might want to add function comment either here or in the header that
> this comparisonn function is to be used only for `git commit-graph
> write`, and not for graph traversal (even if similar funnction exists in
> other modules).

I think that probably within the function is just fine, and that we can
avoid touching commit-graph.h here.

>
> > @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	const struct commit *a = *(const struct commit **)va;
> >  	const struct commit *b = *(const struct commit **)vb;

Maybe something like:

  /*
   * Access the generation number directly with
   * 'commit_graph_data_at(...)->generation' instead of going through
   * the slab as usual to avoid accessing a yet-uncomputed value.
   */

Folks that are curious for more can blame this commit and read there.
I'd err on the side of being brief in the code comment and verbose in
the commit message than the other way around ;).

> >
> > -	uint32_t generation_a = commit_graph_generation(a);
> > -	uint32_t generation_b = commit_graph_generation(b);
> > +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> > +	uint32_t generation_b = commit_graph_data_at(b)->generation;
> > +
> >  	/* lower generation commits first */
> >  	if (generation_a < generation_b)
> >  		return -1;
>
> Best,
> --
> Jakub Narębski

Thanks,
Taylor
Jakub Narębski Aug. 4, 2020, 7:55 a.m. UTC | #5
Jakub Narębski <jnareb@gmail.com> writes:

[...]
>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>  	const struct commit *a = *(const struct commit **)va;
>>  	const struct commit *b = *(const struct commit **)vb;
>>
>> -	uint32_t generation_a = commit_graph_generation(a);
>> -	uint32_t generation_b = commit_graph_generation(b);
>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>> +
>>  	/* lower generation commits first */
>>  	if (generation_a < generation_b)
>>  		return -1;

NOTE: One more thing: we would want to check if corrected commit date
(generation number v2) or topological level (generation number v1) is
better for this purpose, that is gives better performance.

The commit 3d11275505 (commit-graph: examine commits by generation
number) which introduced using commit_gen_cmp when writing commit graph
when finding commits via `--reachable` flags describes the following
performance improvement:

    On the Linux kernel repository, this change reduced the computation
    time for 'git commit-graph write --reachable --changed-paths' from
    3m00s to 1m37s.

We would probably want time for no sorting, for sorting by generation
number v2, and for sorting by topological level (generation number v1)
for the same or similar case.

Best,
--
Jakub Narębski
Jakub Narębski Aug. 4, 2020, 10:10 a.m. UTC | #6
Taylor Blau <me@ttaylorr.com> writes:
> On Tue, Aug 04, 2020 at 02:46:55AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

[...]
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index 1af68c297d..5d3c9bd23c 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>
>> We might want to add function comment either here or in the header that
>> this comparisonn function is to be used only for `git commit-graph
>> write`, and not for graph traversal (even if similar funnction exists in
>> other modules).
>
> I think that probably within the function is just fine, and that we can
> avoid touching commit-graph.h here.
>
>>
>>> @@ -144,8 +144,9 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>  	const struct commit *a = *(const struct commit **)va;
>>>  	const struct commit *b = *(const struct commit **)vb;
>
> Maybe something like:
>
>   /*
>    * Access the generation number directly with
>    * 'commit_graph_data_at(...)->generation' instead of going through
>    * the slab as usual to avoid accessing a yet-uncomputed value.
>    */

I think the last part of this comment should read:

[...]
     * 'commit_graph_data_at(...)->generation' instead of going through
     * the commit_graph_generation() helper function to access just
     * computed data [during `git commit-graph write --reachable --changed-paths`].
     */

Or something like that (the part in square brackets is optional; I am
not sure if adding it helps or not).

>
> Folks that are curious for more can blame this commit and read there.
> I'd err on the side of being brief in the code comment and verbose in
> the commit message than the other way around ;).

I agree.

>>>
>>> -	uint32_t generation_a = commit_graph_generation(a);
>>> -	uint32_t generation_b = commit_graph_generation(b);
>>> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
>>> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>>> +
>>>  	/* lower generation commits first */
>>>  	if (generation_a < generation_b)
>>>  		return -1;

Best,
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index 1af68c297d..5d3c9bd23c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,9 @@  static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
+
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;