diff mbox series

name-rev: test showing failure with non-monotonic commit dates

Message ID 20220214210136.1532574-1-jacob.e.keller@intel.com (mailing list archive)
State New, archived
Headers show
Series name-rev: test showing failure with non-monotonic commit dates | expand

Commit Message

Jacob Keller Feb. 14, 2022, 9:01 p.m. UTC
From: Jacob Keller <jacob.keller@gmail.com>

If a commit in a sequence of linear history has a non-monotonically
increasing commit timestamp, git name-rev will not properly name the
commit.

However, if you use --annotate-stdin then the commit does actually get
picked up and named properly.

Analyzing the source, it appears to be caused by the cutoff logic which
is some sort of heuristic which relies on monotonically increasing
commit dates.

This seems like the cutoff using commit date is some sort of heuristic
which reduces the cost of describing something.. but --annotate-stdin
and --all don't use it.

In the example setup I could do:

echo "<commit id>" | git name-rev --annotate-stdin

and get the expected result without the cutoff logic, and it seems at
least on small repositories to be as fast as the normal attempt, except
it produces accurate results.

Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
---
 t/t6120-describe.sh | 62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

Comments

Junio C Hamano Feb. 14, 2022, 9:50 p.m. UTC | #1
Jacob Keller <jacob.e.keller@intel.com> writes:

> From: Jacob Keller <jacob.keller@gmail.com>
>
> If a commit in a sequence of linear history has a non-monotonically
> increasing commit timestamp, git name-rev will not properly name the
> commit.
>
> However, if you use --annotate-stdin then the commit does actually get
> picked up and named properly.

IIRC, this is to be expected.

When preparing to answer --annotate-stdin request, the command has
to dig down to the root of the history, which would be too expensive
in some repositories and wants to stop traversal early when it knows
particular commits it needs to describe.

Dscho?  I think this is pretty much a fundamental part of the
initial version added by bd321bcc (Add git-name-rev, 2005-10-26) and
kept that way to this day, I think.
Jacob Keller Feb. 14, 2022, 10:07 p.m. UTC | #2
On Mon, Feb 14, 2022 at 1:50 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jacob Keller <jacob.e.keller@intel.com> writes:
>
> > From: Jacob Keller <jacob.keller@gmail.com>
> >
> > If a commit in a sequence of linear history has a non-monotonically
> > increasing commit timestamp, git name-rev will not properly name the
> > commit.
> >
> > However, if you use --annotate-stdin then the commit does actually get
> > picked up and named properly.
>
> IIRC, this is to be expected.
>

Right. I figured this is somehow expected behavior...

> When preparing to answer --annotate-stdin request, the command has
> to dig down to the root of the history, which would be too expensive
> in some repositories and wants to stop traversal early when it knows
> particular commits it needs to describe.
>

And this method of cutting the search relies on monotonic commit times right?

Is there any other method we could use (commit graph?) or perhaps to
add an option so that you could go "git name-rev --no-cutoff <commid
id>"? That would potentially allow working around this particular
problem on repositories where its known to be problematic.

Alternatively is there some other way to apply the cutoff heuristic
only in some cases? I get the sense this is intended to allow cutting
off merged branches? i.e. not applying it when history is linear? I'd
have to study it further but the existing algorithm seems to break
because as it goes up the history it has found an "older" commit, so
it stops trying to blame that line...?

> Dscho?  I think this is pretty much a fundamental part of the
> initial version added by bd321bcc (Add git-name-rev, 2005-10-26) and
> kept that way to this day, I think.
>

The reason we ended up with non-monotonic commit timestamps is a bit
strange, and at least going forward I can remove the cause. However
the history I deal with already has this, so perhaps a simple
"--no-cutoff" option would be sufficient? This would allow getting the
equivalent behavior to annotate-stdin without needing to script it
myself.

Thanks,
Jake
Ævar Arnfjörð Bjarmason Feb. 15, 2022, 7:15 a.m. UTC | #3
On Mon, Feb 14 2022, Jacob Keller wrote:

> From: Jacob Keller <jacob.keller@gmail.com>
>
> If a commit in a sequence of linear history has a non-monotonically
> increasing commit timestamp, git name-rev will not properly name the
> commit.
>
> However, if you use --annotate-stdin then the commit does actually get
> picked up and named properly.
>
> Analyzing the source, it appears to be caused by the cutoff logic which
> is some sort of heuristic which relies on monotonically increasing
> commit dates.
>
> This seems like the cutoff using commit date is some sort of heuristic
> which reduces the cost of describing something.. but --annotate-stdin
> and --all don't use it.
>
> In the example setup I could do:
>
> echo "<commit id>" | git name-rev --annotate-stdin
>
> and get the expected result without the cutoff logic, and it seems at
> least on small repositories to be as fast as the normal attempt, except
> it produces accurate results.
>
> Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
> ---
>  t/t6120-describe.sh | 62 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
>
> diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
> index 9781b92aeddf..e9f897e42591 100755
> --- a/t/t6120-describe.sh
> +++ b/t/t6120-describe.sh
> @@ -488,6 +488,68 @@ test_expect_success 'name-rev covers all conditions while looking at parents' '
>  	)
>  '
>  
> +# A-B-C-D-E-main
> +#
> +# Where C has a non-monotonically increasing commit timestamp w.r.t. other
> +# commits
> +test_expect_success 'non-monotonic commit dates setup' '
> +	git init non-monotonic &&
> +	(
> +		cd non-monotonic &&
> +
> +		echo A >file &&
> +		git add file &&
> +		GIT_COMMITTER_DATE="2020-01-01 18:00" git commit -m A &&
> +
> +		echo B >file &&
> +		git add file &&
> +		GIT_COMMITTER_DATE="2020-01-02 18:00" git commit -m B &&
> +
> +		echo C >file &&
> +		git add file &&
> +		GIT_COMMITTER_DATE="2005-01-01 18:00" git commit -m C &&
> +
> +		echo D >file &&
> +		git add file &&
> +		GIT_COMMITTER_DATE="2020-01-04 18:00" git commit -m D &&
> +
> +		echo E >file &&
> +		git add file &&
> +		GIT_COMMITTER_DATE="2020-01-05 18:00" git commit -m E
> +	)

Shorter & avoids the needless subshell as:

    git init repo &&
    test_commit -C repo --date="2020-01-01 18:00" A &&
    test_commit -C repo --date="2020-01-02 18:00" B &&
    [...]


> +test_expect_failure 'name-rev commit timestamp prevents naming commits' '
> +	(
> +		cd non-monotonic &&
> +
> +		B=$(git rev-parse main~3) &&
> +
> +		echo "$B main~3" >expect &&
> +		git name-rev $B >actual &&
> +
> +		test_cmp expect actual
> +	)
> +'

I haven't checked, but is the explicit peeling to $B really needed here,
are the results different with a main~3 or main~3^{commit}?

I.e. the first column of the output will of course be, but will the
result on the second column? I suspect not, but haven't run this. In any
case I tihnk teh test/commit message could do with an explanation.

> +test_expect_success 'name-rev --all works with non-monotonic' '
> +	(
> +		cd non-monotonic &&
> +
> +		cat >expect <<EOF &&

You can use "<<-\EOF" here so you can indent these:

> +main
> +main~1
> +main~2
> +main~3
> +main~4
> +EOF
> +
> +		git log --pretty=%H | git name-rev --annotate-stdin --name-only >actual &&

Don't use "git" on the LHS of a pipe, in case it segfaults, so:

    git log [...] >revs &&
    git name-rev [...] <revs >actual

> +
> +		test_cmp expect actual
> +	)
> +'
> +
>  #               B
>  #               o
>  #                \
Derrick Stolee Feb. 15, 2022, 2:48 p.m. UTC | #4
On 2/14/2022 5:07 PM, Jacob Keller wrote:
> On Mon, Feb 14, 2022 at 1:50 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Jacob Keller <jacob.e.keller@intel.com> writes:
>>
>>> From: Jacob Keller <jacob.keller@gmail.com>
>>>
>>> If a commit in a sequence of linear history has a non-monotonically
>>> increasing commit timestamp, git name-rev will not properly name the
>>> commit.
>>>
>>> However, if you use --annotate-stdin then the commit does actually get
>>> picked up and named properly.
>>
>> IIRC, this is to be expected.
>>
> 
> Right. I figured this is somehow expected behavior...
> 
>> When preparing to answer --annotate-stdin request, the command has
>> to dig down to the root of the history, which would be too expensive
>> in some repositories and wants to stop traversal early when it knows
>> particular commits it needs to describe.
>>
> 
> And this method of cutting the search relies on monotonic commit times right?
> 
> Is there any other method we could use (commit graph?) or perhaps to
> add an option so that you could go "git name-rev --no-cutoff <commid
> id>"? That would potentially allow working around this particular
> problem on repositories where its known to be problematic.

I initially thought that generation numbers could help. The usual way
is to use a priority queue that explores by generation, not commit
date. This approach was immediately stifled by these lines:

	memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
	prio_queue_put(&queue, start_commit);

So the queue is really a stack.
 
> Alternatively is there some other way to apply the cutoff heuristic
> only in some cases? I get the sense this is intended to allow cutting
> off merged branches? i.e. not applying it when history is linear? I'd
> have to study it further but the existing algorithm seems to break
> because as it goes up the history it has found an "older" commit, so
> it stops trying to blame that line...?

It is still possible that the cutoff could be altered to use generation
numbers instead of commit dates, but I haven't looked closely enough to
be sure.

Here is a very basic attempt. With GIT_TEST_COMMIT_GRAPH=1, your
test_expect_failure turns into a success.

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index 138e3c30a2b..f7ad1dd8b4d 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -9,6 +9,7 @@
 #include "prio-queue.h"
 #include "hash-lookup.h"
 #include "commit-slab.h"
+#include "commit-graph.h"
 
 /*
  * One day.  See the 'name a rev shortly after epoch' test in t6120 when
@@ -27,6 +28,7 @@ struct rev_name {
 define_commit_slab(commit_rev_name, struct rev_name);
 
 static timestamp_t cutoff = TIME_MAX;
+static timestamp_t generation_cutoff = 0;
 static struct commit_rev_name rev_names;
 
 /* How many generations are maximally preferred over _one_ merge traversal? */
@@ -151,7 +153,10 @@ static void name_rev(struct commit *start_commit,
 	struct rev_name *start_name;
 
 	parse_commit(start_commit);
-	if (start_commit->date < cutoff)
+	if (generation_cutoff && generation_cutoff < GENERATION_NUMBER_INFINITY) {
+		if (commit_graph_generation(start_commit) < generation_cutoff)
+			return;
+	} else if (start_commit->date < cutoff)
 		return;
 
 	start_name = create_or_update_name(start_commit, taggerdate, 0, 0,
@@ -599,6 +604,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 		if (commit) {
 			if (cutoff > commit->date)
 				cutoff = commit->date;
+			if (generation_cutoff > commit_graph_generation(commit))
+				generation_cutoff = commit_graph_generation(commit);
 		}
 
 		if (peel_tag) {

Thanks,
-Stolee
Jacob Keller Feb. 15, 2022, 11:38 p.m. UTC | #5
On 2/15/2022 6:48 AM, Derrick Stolee wrote:
> On 2/14/2022 5:07 PM, Jacob Keller wrote:
>> On Mon, Feb 14, 2022 at 1:50 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> Jacob Keller <jacob.e.keller@intel.com> writes:
>>>
>>>> From: Jacob Keller <jacob.keller@gmail.com>
>>>>
>>>> If a commit in a sequence of linear history has a non-monotonically
>>>> increasing commit timestamp, git name-rev will not properly name the
>>>> commit.
>>>>
>>>> However, if you use --annotate-stdin then the commit does actually get
>>>> picked up and named properly.
>>>
>>> IIRC, this is to be expected.
>>>
>>
>> Right. I figured this is somehow expected behavior...
>>
>>> When preparing to answer --annotate-stdin request, the command has
>>> to dig down to the root of the history, which would be too expensive
>>> in some repositories and wants to stop traversal early when it knows
>>> particular commits it needs to describe.
>>>
>>
>> And this method of cutting the search relies on monotonic commit times right?
>>
>> Is there any other method we could use (commit graph?) or perhaps to
>> add an option so that you could go "git name-rev --no-cutoff <commid
>> id>"? That would potentially allow working around this particular
>> problem on repositories where its known to be problematic.
> 
> I initially thought that generation numbers could help. The usual way
> is to use a priority queue that explores by generation, not commit
> date. This approach was immediately stifled by these lines:
> 
> 	memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
> 	prio_queue_put(&queue, start_commit);
> 
> So the queue is really a stack.
> 

Right. A closer look at the name-rev algorithm seems to be that it
starts looking at each tag and scanning down history to see if it can
find the given commit. It uses the commit timestamp as a heuristic to
decide whether or not to stop looking. This can speed up the search
because it prevents scanning the entire history.. but it breaks when
that heuristic is no longer true such as in the particular setup like
mine with a funky timestamp.


>> Alternatively is there some other way to apply the cutoff heuristic
>> only in some cases? I get the sense this is intended to allow cutting
>> off merged branches? i.e. not applying it when history is linear? I'd
>> have to study it further but the existing algorithm seems to break
>> because as it goes up the history it has found an "older" commit, so
>> it stops trying to blame that line...?
> 
> It is still possible that the cutoff could be altered to use generation
> numbers instead of commit dates, but I haven't looked closely enough to
> be sure.
> 

Right. Using generation number would work for this I think.. The real
question being if it satisfies the other requirements.

I think it does, but I'm not 100% sure yet.

> Here is a very basic attempt. With GIT_TEST_COMMIT_GRAPH=1, your
> test_expect_failure turns into a success.
> 
> diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> index 138e3c30a2b..f7ad1dd8b4d 100644
> --- a/builtin/name-rev.c
> +++ b/builtin/name-rev.c
> @@ -9,6 +9,7 @@
>  #include "prio-queue.h"
>  #include "hash-lookup.h"
>  #include "commit-slab.h"
> +#include "commit-graph.h"
>  
>  /*
>   * One day.  See the 'name a rev shortly after epoch' test in t6120 when
> @@ -27,6 +28,7 @@ struct rev_name {
>  define_commit_slab(commit_rev_name, struct rev_name);
>  
>  static timestamp_t cutoff = TIME_MAX;
> +static timestamp_t generation_cutoff = 0;
>  static struct commit_rev_name rev_names;
>  
>  /* How many generations are maximally preferred over _one_ merge traversal? */
> @@ -151,7 +153,10 @@ static void name_rev(struct commit *start_commit,
>  	struct rev_name *start_name;
>  
>  	parse_commit(start_commit);
> -	if (start_commit->date < cutoff)
> +	if (generation_cutoff && generation_cutoff < GENERATION_NUMBER_INFINITY) {
> +		if (commit_graph_generation(start_commit) < generation_cutoff)
> +			return;
> +	} else if (start_commit->date < cutoff)
>  		return;
>  
>  	start_name = create_or_update_name(start_commit, taggerdate, 0, 0,
> @@ -599,6 +604,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
>  		if (commit) {
>  			if (cutoff > commit->date)
>  				cutoff = commit->date;
> +			if (generation_cutoff > commit_graph_generation(commit))
> +				generation_cutoff = commit_graph_generation(commit);
>  		}
>  
>  		if (peel_tag) {
> 
> Thanks,
> -Stolee
Junio C Hamano Feb. 16, 2022, 12:51 a.m. UTC | #6
Derrick Stolee <derrickstolee@github.com> writes:

> I initially thought that generation numbers could help. The usual way
> is to use a priority queue that explores by generation, not commit
> date. This approach was immediately stifled by these lines:
>
> 	memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
> 	prio_queue_put(&queue, start_commit);
>
> So the queue is really a stack.

Hmph, I am not sure if stifled is a word, but I agree that this one
is not solvable by "we have a priority queue that explores by commit
date, and using generation numbers instead of commit date will give
us a more stable result when clock skews are involved", because the
traversal is not the usual "we pop the newest commit seen so far to
dig into older history".

However.

> It is still possible that the cutoff could be altered to use generation
> numbers instead of commit dates, but I haven't looked closely enough to
> be sure.

In "name-rev [--tags] C", we look for a tag to use in describing the
given commit C as an ancestry path starting at the tag T (e.g. T~4,
T~2^2).  There can be multiple such tags (e.g. it is likely that a
commit that is v1.0~2 is also reachable from tag v2.0, even though
it would require more hops).  We try to and find a tag that gives
the "simplest" path.  For that purpose, it is no use to consider any
tag that is not a descendant of the given commit, because doing an
ancestry traversal starting from such a tag will never reach the
commit.  In a world where everybody's clock is in sync, if commit
was made at time X, any tag that was made before time X will not be
a descendant of the commit, so we do not add such a tag to the
candidate pool.

I think the idea of "cutoff" heuristic is exactly what generation
numbers can improve, in an imperfect world where there are imperfect
clocks.
Jacob Keller Feb. 16, 2022, 1:16 a.m. UTC | #7
> -----Original Message-----
> From: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> Sent: Monday, February 14, 2022 11:16 PM
> To: Keller, Jacob E <jacob.e.keller@intel.com>
> Cc: git@vger.kernel.org; Jacob Keller <jacob.keller@gmail.com>
> Subject: Re: [PATCH] name-rev: test showing failure with non-monotonic commit
> dates
> 
> 
> Shorter & avoids the needless subshell as:
> 
>     git init repo &&
>     test_commit -C repo --date="2020-01-01 18:00" A &&
>     test_commit -C repo --date="2020-01-02 18:00" B &&
>     [...]
> 
>

Sure. I can pick these improvements up if if we end up actually wanting the test case. I think we're still discussing the core problem in this thread too.

 
> > +test_expect_failure 'name-rev commit timestamp prevents naming commits' '
> > +	(
> > +		cd non-monotonic &&
> > +
> > +		B=$(git rev-parse main~3) &&
> > +
> > +		echo "$B main~3" >expect &&
> > +		git name-rev $B >actual &&
> > +
> > +		test_cmp expect actual
> > +	)
> > +'
> 
> I haven't checked, but is the explicit peeling to $B really needed here,
> are the results different with a main~3 or main~3^{commit}?
> 
> I.e. the first column of the output will of course be, but will the
> result on the second column? I suspect not, but haven't run this. In any
> case I tihnk teh test/commit message could do with an explanation.
> 

The first column is the commit id, so we need to get that either way to compare the expected and actual output. As far as I know passing "main~3" instead of the commit id to name-rev doesn't change this.

Thanks,
Jake
Jacob Keller Feb. 27, 2022, 10:05 p.m. UTC | #8
On Tue, Feb 15, 2022 at 4:51 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> In "name-rev [--tags] C", we look for a tag to use in describing the
> given commit C as an ancestry path starting at the tag T (e.g. T~4,
> T~2^2).  There can be multiple such tags (e.g. it is likely that a
> commit that is v1.0~2 is also reachable from tag v2.0, even though
> it would require more hops).  We try to and find a tag that gives
> the "simplest" path.  For that purpose, it is no use to consider any
> tag that is not a descendant of the given commit, because doing an
> ancestry traversal starting from such a tag will never reach the
> commit.  In a world where everybody's clock is in sync, if commit
> was made at time X, any tag that was made before time X will not be
> a descendant of the commit, so we do not add such a tag to the
> candidate pool.
>
> I think the idea of "cutoff" heuristic is exactly what generation
> numbers can improve, in an imperfect world where there are imperfect
> clocks.

Yep. I have a patch that will implement this behavior based on
Derrick's suggestion.

Thanks,
Jake
Johannes Schindelin March 9, 2022, 9:56 p.m. UTC | #9
Hi Junio,

On Mon, 14 Feb 2022, Junio C Hamano wrote:

> Jacob Keller <jacob.e.keller@intel.com> writes:
>
> > From: Jacob Keller <jacob.keller@gmail.com>
> >
> > If a commit in a sequence of linear history has a non-monotonically
> > increasing commit timestamp, git name-rev will not properly name the
> > commit.
> >
> > However, if you use --annotate-stdin then the commit does actually get
> > picked up and named properly.
>
> IIRC, this is to be expected.
>
> When preparing to answer --annotate-stdin request, the command has
> to dig down to the root of the history, which would be too expensive
> in some repositories and wants to stop traversal early when it knows
> particular commits it needs to describe.
>
> Dscho?  I think this is pretty much a fundamental part of the
> initial version added by bd321bcc (Add git-name-rev, 2005-10-26) and
> kept that way to this day, I think.

Yes, it was.

Looks like Stolee had great suggestions while I did not find time to catch
up with the mailing list.

Ciao,
Dscho
diff mbox series

Patch

diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index 9781b92aeddf..e9f897e42591 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -488,6 +488,68 @@  test_expect_success 'name-rev covers all conditions while looking at parents' '
 	)
 '
 
+# A-B-C-D-E-main
+#
+# Where C has a non-monotonically increasing commit timestamp w.r.t. other
+# commits
+test_expect_success 'non-monotonic commit dates setup' '
+	git init non-monotonic &&
+	(
+		cd non-monotonic &&
+
+		echo A >file &&
+		git add file &&
+		GIT_COMMITTER_DATE="2020-01-01 18:00" git commit -m A &&
+
+		echo B >file &&
+		git add file &&
+		GIT_COMMITTER_DATE="2020-01-02 18:00" git commit -m B &&
+
+		echo C >file &&
+		git add file &&
+		GIT_COMMITTER_DATE="2005-01-01 18:00" git commit -m C &&
+
+		echo D >file &&
+		git add file &&
+		GIT_COMMITTER_DATE="2020-01-04 18:00" git commit -m D &&
+
+		echo E >file &&
+		git add file &&
+		GIT_COMMITTER_DATE="2020-01-05 18:00" git commit -m E
+	)
+'
+
+test_expect_failure 'name-rev commit timestamp prevents naming commits' '
+	(
+		cd non-monotonic &&
+
+		B=$(git rev-parse main~3) &&
+
+		echo "$B main~3" >expect &&
+		git name-rev $B >actual &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'name-rev --all works with non-monotonic' '
+	(
+		cd non-monotonic &&
+
+		cat >expect <<EOF &&
+main
+main~1
+main~2
+main~3
+main~4
+EOF
+
+		git log --pretty=%H | git name-rev --annotate-stdin --name-only >actual &&
+
+		test_cmp expect actual
+	)
+'
+
 #               B
 #               o
 #                \