diff mbox series

diffcore-delta: avoid ignoring final 'line' of file

Message ID pull.1637.git.1705006074626.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series diffcore-delta: avoid ignoring final 'line' of file | expand

Commit Message

Elijah Newren Jan. 11, 2024, 8:47 p.m. UTC
From: Elijah Newren <newren@gmail.com>

hash_chars() would hash lines to integers, and store them in a spanhash,
but cut lines at 64 characters.  Thus, whenever it reached 64 characters
or a newline, it would create a new spanhash.  The problem is, the final
part of the file might not end 64 characters after the previous 'line'
and might not end with a newline.  This could, for example, cause an
85-byte file with 12 lines and only the first character in the file
differing to appear merely 23% similar rather than the expected 97%.
Ensure the last line is included, and add a testcase that would have
caught this problem.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    diffcore-delta: avoid ignoring final 'line' of file
    
    Found while experimenting with converting portions of diffcore-delta to
    Rust.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1637%2Fnewren%2Ffix-diffcore-final-line-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1637/newren/fix-diffcore-final-line-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1637

 diffcore-delta.c       |  4 ++++
 t/t4001-diff-rename.sh | 19 +++++++++++++++++++
 2 files changed, 23 insertions(+)


base-commit: 055bb6e9969085777b7fab83e3fee0017654f134

Comments

Taylor Blau Jan. 11, 2024, 9:45 p.m. UTC | #1
On Thu, Jan 11, 2024 at 08:47:54PM +0000, Elijah Newren via GitGitGadget wrote:
> ---
>  diffcore-delta.c       |  4 ++++
>  t/t4001-diff-rename.sh | 19 +++++++++++++++++++
>  2 files changed, 23 insertions(+)

Nice find and fix. This patch LGTM.

Thanks,
Taylor
Junio C Hamano Jan. 11, 2024, 11 p.m. UTC | #2
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/diffcore-delta.c b/diffcore-delta.c
> index c30b56e983b..7136c3dd203 100644
> --- a/diffcore-delta.c
> +++ b/diffcore-delta.c
> @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r,
>  		n = 0;
>  		accum1 = accum2 = 0;
>  	}
> +	if (n > 0) {
> +		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
> +		hash = add_spanhash(hash, hashval, n);
> +	}

OK, so we were ignoring the final short bit that is not terminated
with LF due to the "continue".  Nicely found.

> diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> index 85be1367de6..29299acbce7 100755
> --- a/t/t4001-diff-rename.sh
> +++ b/t/t4001-diff-rename.sh
> @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' '
>  	test_cmp expected actual
>  '
>  
> +test_expect_success 'last line matters too' '
> +	test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
> +	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
> +	git add nonewline &&
> +	git commit -m "original version of file with no final newline" &&

I found it misleading that the file whose name is nonewline has
bunch of LF including on its last line ;-).

> +	# Change ONLY the first character of the whole file
> +	test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&

Same here, but it is too much to bother rewriting it ...

	{
		test_write_lines ...
		printf ...
	} >incomplete

... like so ("incomplete" stands for "file ending with an incomplete line"),
so I'll let it pass.

> +	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&


> +	git add nonewline &&
> +	git mv nonewline still-no-newline &&
> +	git commit -a -m "rename nonewline -> still-no-newline" &&
> +	git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
> +	cat >expected <<-\EOF &&
> +	R097	nonewline	still-no-newline

I am not very happy with the hardcoded 97.  You are already using
the non-standard 10% threshold.  If the delta detection that
forgets about the last line is so broken as your proposed log
message noted, shouldn't you be able to construct a sample pair of
preimage and postimage for which the broken version gives so low
similarity to be judged not worth treating as a rename, while the
fixed version gives reasonable similarity to be made into a rename,
by the default threshold?  That way, the test only needs to see if
we got a rename (with any similarity) or a delete and an add.
Elijah Newren Jan. 13, 2024, 1:45 a.m. UTC | #3
On Thu, Jan 11, 2024 at 3:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/diffcore-delta.c b/diffcore-delta.c
> > index c30b56e983b..7136c3dd203 100644
> > --- a/diffcore-delta.c
> > +++ b/diffcore-delta.c
> > @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r,
> >               n = 0;
> >               accum1 = accum2 = 0;
> >       }
> > +     if (n > 0) {
> > +             hashval = (accum1 + accum2 * 0x61) % HASHBASE;
> > +             hash = add_spanhash(hash, hashval, n);
> > +     }
>
> OK, so we were ignoring the final short bit that is not terminated
> with LF due to the "continue".  Nicely found.
>
> > diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> > index 85be1367de6..29299acbce7 100755
> > --- a/t/t4001-diff-rename.sh
> > +++ b/t/t4001-diff-rename.sh
> > @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' '
> >       test_cmp expected actual
> >  '
> >
> > +test_expect_success 'last line matters too' '
> > +     test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
> > +     printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
> > +     git add nonewline &&
> > +     git commit -m "original version of file with no final newline" &&
>
> I found it misleading that the file whose name is nonewline has
> bunch of LF including on its last line ;-).

Yeah, sorry.  It's been a while, but I think I started with a file
with only one line that had no LF, but then realized for inexact
rename detection to kick in I needed some other lines, at least one of
which didn't match...but I forgot to update the filename after adding
the other lines...

> > +     # Change ONLY the first character of the whole file
> > +     test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&
>
> Same here, but it is too much to bother rewriting it ...
>
>         {
>                 test_write_lines ...
>                 printf ...
>         } >incomplete
>
> ... like so ("incomplete" stands for "file ending with an incomplete line"),
> so I'll let it pass.
>
> > +     printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
>
>
> > +     git add nonewline &&
> > +     git mv nonewline still-no-newline &&
> > +     git commit -a -m "rename nonewline -> still-no-newline" &&
> > +     git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
> > +     cat >expected <<-\EOF &&
> > +     R097    nonewline       still-no-newline
>
> I am not very happy with the hardcoded 97.  You are already using
> the non-standard 10% threshold.  If the delta detection that
> forgets about the last line is so broken as your proposed log
> message noted, shouldn't you be able to construct a sample pair of
> preimage and postimage for which the broken version gives so low
> similarity to be judged not worth treating as a rename, while the
> fixed version gives reasonable similarity to be made into a rename,
> by the default threshold?  That way, the test only needs to see if
> we got a rename (with any similarity) or a delete and an add.

Oops, the threshold is entirely unnecessary here; not sure why I
didn't remember to take it out (originally used the threshold while
testing without the fix to just how low of a similarity git thought
these nearly identical files had).

Since you don't like the threshold, and since we don't seem to have a
summary format that reports on the rename without the percentage, I
guess I need to munge the output with sed:

      sed -e "s/^R[0-9]* /R /" actual >actual.munged &&

and then compare the expected output to that.  I'll send in a patch
doing so and fix up the filenames and drop the rename threshold while
at it.
Junio C Hamano Jan. 13, 2024, 6:21 a.m. UTC | #4
Elijah Newren <newren@gmail.com> writes:

>> I am not very happy with the hardcoded 97.  You are already using
>> the non-standard 10% threshold.  If the delta detection that
>> forgets about the last line is so broken as your proposed log
>> message noted, shouldn't you be able to construct a sample pair of
>> preimage and postimage for which the broken version gives so low
>> similarity to be judged not worth treating as a rename, while the
>> fixed version gives reasonable similarity to be made into a rename,
>> by the default threshold?  That way, the test only needs to see if
>> we got a rename (with any similarity) or a delete and an add.
>
> Oops, the threshold is entirely unnecessary here; not sure why I
> didn't remember to take it out (originally used the threshold while
> testing without the fix to just how low of a similarity git thought
> these nearly identical files had).
>
> Since you don't like the threshold, and since we don't seem to have a
> summary format that reports on the rename without the percentage, I
> guess I need to munge the output with sed:
>
>       sed -e "s/^R[0-9]* /R /" actual >actual.munged &&

Heh, I was hoping that we should be able to use "diff --name-only".

 $ git mv Makefile Breakfile
 $ git diff --name-only -M HEAD
 Breakfile
 $ git reset --hard
 $ git rm Makefile
 $ >Breakfile && git add Breakfile
 $ git diff --name-only -M HEAD
 Breakfile
 Makefile
 $ git reset --hard
Elijah Newren Jan. 19, 2024, 1:54 a.m. UTC | #5
On Fri, Jan 12, 2024 at 10:22 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> I am not very happy with the hardcoded 97.  You are already using
> >> the non-standard 10% threshold.  If the delta detection that
> >> forgets about the last line is so broken as your proposed log
> >> message noted, shouldn't you be able to construct a sample pair of
> >> preimage and postimage for which the broken version gives so low
> >> similarity to be judged not worth treating as a rename, while the
> >> fixed version gives reasonable similarity to be made into a rename,
> >> by the default threshold?  That way, the test only needs to see if
> >> we got a rename (with any similarity) or a delete and an add.
> >
> > Oops, the threshold is entirely unnecessary here; not sure why I
> > didn't remember to take it out (originally used the threshold while
> > testing without the fix to just how low of a similarity git thought
> > these nearly identical files had).
> >
> > Since you don't like the threshold, and since we don't seem to have a
> > summary format that reports on the rename without the percentage, I
> > guess I need to munge the output with sed:
> >
> >       sed -e "s/^R[0-9]* /R /" actual >actual.munged &&
>
> Heh, I was hoping that we should be able to use "diff --name-only".
>
>  $ git mv Makefile Breakfile
>  $ git diff --name-only -M HEAD
>  Breakfile
>  $ git reset --hard
>  $ git rm Makefile
>  $ >Breakfile && git add Breakfile
>  $ git diff --name-only -M HEAD
>  Breakfile
>  Makefile
>  $ git reset --hard

I guess we could, since the only thing in this repository is a file
which is being renamed, and we can then deduce based on the setup that
the change we expected must have happened.

However, I didn't like the deduce bit; since --name-only only lists
one of the two filenames and doesn't provide any hint that the change
involved is a rename, it felt to me like using --name-only would make
the test not really be a rename test.
Junio C Hamano Jan. 19, 2024, 3:06 a.m. UTC | #6
Elijah Newren <newren@gmail.com> writes:

>> Heh, I was hoping that we should be able to use "diff --name-only".
>>
>>  $ git mv Makefile Breakfile
>>  $ git diff --name-only -M HEAD
>>  Breakfile
>>  $ git reset --hard
>>  $ git rm Makefile
>>  $ >Breakfile && git add Breakfile
>>  $ git diff --name-only -M HEAD
>>  Breakfile
>>  Makefile
>>  $ git reset --hard
>
> I guess we could, since the only thing in this repository is a file
> which is being renamed, and we can then deduce based on the setup that
> the change we expected must have happened.
>
> However, I didn't like the deduce bit; since --name-only only lists
> one of the two filenames and doesn't provide any hint that the change
> involved is a rename, it felt to me like using --name-only would make
> the test not really be a rename test.

Hmph, I am not quite seeing what you are saying.  If the "mv" from
Makefile to Breakfile in the above example is between preimage and
postimage that are similar enough, then we will see "one" paths,
i.e. the file in the "after" side of the diff.  But if the "mv" from
Makefile to Breakfile involves too large a content change (like,
say, going from 3800+ lines to an empty file, in the second example
above), then because such a change from Makefile to Breakfile is too
dissimilar, we do not judge it as "renamed" and show "two" paths.  I
do not quite see where we need to "deduce".
Elijah Newren Jan. 19, 2024, 5:05 a.m. UTC | #7
On Thu, Jan 18, 2024 at 7:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> Heh, I was hoping that we should be able to use "diff --name-only".
> >>
> >>  $ git mv Makefile Breakfile
> >>  $ git diff --name-only -M HEAD
> >>  Breakfile
> >>  $ git reset --hard
> >>  $ git rm Makefile
> >>  $ >Breakfile && git add Breakfile
> >>  $ git diff --name-only -M HEAD
> >>  Breakfile
> >>  Makefile
> >>  $ git reset --hard
> >
> > I guess we could, since the only thing in this repository is a file
> > which is being renamed, and we can then deduce based on the setup that
> > the change we expected must have happened.
> >
> > However, I didn't like the deduce bit; since --name-only only lists
> > one of the two filenames and doesn't provide any hint that the change
> > involved is a rename, it felt to me like using --name-only would make
> > the test not really be a rename test.
>
> Hmph, I am not quite seeing what you are saying.  If the "mv" from
> Makefile to Breakfile in the above example is between preimage and
> postimage that are similar enough, then we will see "one" paths,
> i.e. the file in the "after" side of the diff.  But if the "mv" from
> Makefile to Breakfile involves too large a content change (like,
> say, going from 3800+ lines to an empty file, in the second example
> above), then because such a change from Makefile to Breakfile is too
> dissimilar, we do not judge it as "renamed" and show "two" paths.  I
> do not quite see where we need to "deduce".

You just wrote a very well worded paragraph going through the
reasoning involved to prove that a rename is involved.  You reasoned,
or deduced, the necessary conclusion quite well.  Sure, it might be a
simple deduction given the knowledge of the test setup, but it's still
a deduction.

But perhaps I can put it another way:  You can't just look at the
output of `diff --name-only` and say a rename was involved -- unless
you know the test setup and the previous operations.  In fact, how
about a possibly contrived alternate scenario: What if `git mv $1 $2`
had a bug where, after doing the expected work, it did the equivalent
of running `git checkout HEAD -- $1` at the end of its operation.
Then we'd see:

   $ <tweak Makefile slightly>
   $ git mv Makefile Breakfile
   $ git diff --name-only -M HEAD
   Breakfile

i.e. we get the same output as without the git-mv bug (note that
Makefile will not be listed since it is unmodified), but with the bug
in git-mv there definitely is no rename involved (since there's no
delete to pair with the addition of Breakfile).  As such, we'd still
pass the test despite there being no rename.  Sure, the example is
somewhat contrived, but I'm just saying that the --name-only output
doesn't actually test or prove that a rename occurred.  And since this
test, and all other tests in this particular testfile, are
specifically about renames, the fact that we aren't specifically
testing for something being renamed feels odd to me.

If you still like `diff --name-only` better anyway, that's fine and
I'll switch it.  I'm just stating why it seems suboptimal to me.
Junio C Hamano Jan. 19, 2024, 6:27 a.m. UTC | #8
Elijah Newren <newren@gmail.com> writes:

> But perhaps I can put it another way:  You can't just look at the
> output of `diff --name-only` and say a rename was involved -- unless
> you know the test setup and the previous operations.

That is true.  But that is exactly what a test is about.  You have
this and that file, and you do this operation, now what should
happen?  Does the observation match the expectation?  That is what
our tests are done.

And your argument should not have to rely on a bug in "git mv".
After all, you should be able to do the same with "mv A B && git add
B && git add -u" (or "git rm -f A") and you won't be affected by
such a bug.

> If you still like `diff --name-only` better anyway, that's fine and
> I'll switch it.  I'm just stating why it seems suboptimal to me.

I'd prefer to omit "sed" involved, but I'd even more prefer not
waste more time on the test.  As long as we can make a robust test
(which an extra process running sed would certainly give us), I'd be
fine.

Thanks.
diff mbox series

Patch

diff --git a/diffcore-delta.c b/diffcore-delta.c
index c30b56e983b..7136c3dd203 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -159,6 +159,10 @@  static struct spanhash_top *hash_chars(struct repository *r,
 		n = 0;
 		accum1 = accum2 = 0;
 	}
+	if (n > 0) {
+		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
+		hash = add_spanhash(hash, hashval, n);
+	}
 	QSORT(hash->data, (size_t)1ul << hash->alloc_log2, spanhash_cmp);
 	return hash;
 }
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 85be1367de6..29299acbce7 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -286,4 +286,23 @@  test_expect_success 'basename similarity vs best similarity' '
 	test_cmp expected actual
 '
 
+test_expect_success 'last line matters too' '
+	test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
+	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
+	git add nonewline &&
+	git commit -m "original version of file with no final newline" &&
+
+	# Change ONLY the first character of the whole file
+	test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&
+	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
+	git add nonewline &&
+	git mv nonewline still-no-newline &&
+	git commit -a -m "rename nonewline -> still-no-newline" &&
+	git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
+	cat >expected <<-\EOF &&
+	R097	nonewline	still-no-newline
+	EOF
+	test_cmp expected actual
+'
+
 test_done