diff mbox series

Histogram/patience diff matching lines with different counts

Message ID CANiSa6jtwizbR4K-DqdKjVeZqAkbswnPXCBZZrrfNy2CKBEQVg@mail.gmail.com (mailing list archive)
State New
Headers show
Series Histogram/patience diff matching lines with different counts | expand

Commit Message

Martin von Zweigbergk Jan. 9, 2025, 9:31 p.m. UTC
Hi,

Let's say you have this a file with this content:
```
a
b
c
d
e
f
```

Then you change it to this:
```
a
b2
c
d2
c
e2
f
```

Note that most lines changed, but `c` remains unchanged but duplicated.

Now `git diff --diff-algorithm=histogram` will show this diff:
```
```

I'm surprised the first "c" line is considered unchanged. I thought
histogram diff was supposed to first match up unique lines between the
two sides and then gradually try higher and higher counts if there
were no unique lines. In this case, only "a" and "f" have count 1
(i.e. are unique) on both sides, so they would be matched up first.
After that, "c" is unique on the left side but has a different count
(namely 2) on the right side, so I would have thought that it should
not be considered matching. Does anyone know if it's implemented this
way on purpose? Actually, I think I remember reading that Git falls
back to Myers in some cases, so maybe that's what's going on here?

As some of you know, I work on the Jujutsu/jj VCS
(https://github.com/jj-vcs/jj). We also use histogram diff (and only
histogram diff) and actually allowed matching up lines with different
counts a while ago, but I thought it seemed too arbitrary to line up
the first matches if there were different counts, so we changed that.
Then we got a report from a user that Git behaves differently. See
https://github.com/jj-vcs/jj/issues/761#issuecomment-2581219294 for
more details.

Thanks

Comments

Jonathan Tan Jan. 10, 2025, 7:33 p.m. UTC | #1
Martin von Zweigbergk <martinvonz@gmail.com> writes:
> After that, "c" is unique on the left side but has a different count
> (namely 2) on the right side, so I would have thought that it should
> not be considered matching. Does anyone know if it's implemented this
> way on purpose? 

The purpose, if any, might be lost to history. The implementation of
histogram diff in Git seems to be a port from JGit (8c912eea94 (teach
--histogram to diff, 2011-07-12)). The one in JGit [1] seems to be an
original extension of patience diff by Shawn Pearce, who has passed away
a few years ago.

The class documentation comment in [1] does not give any rationale for
or against rejecting lines with different counts, but quoting from it:

> * By always selecting a LCS position with the lowest occurrence count, this
> * algorithm behaves exactly like Bram Cohen's patience diff whenever there is a
> * unique common element available between the two sequences. When no unique
> * elements exist, the lowest occurrence element is chosen instead. This offers
> * more readable diffs than simply falling back on the standard Myers' O(ND)
> * algorithm would produce.

I think it makes sense to reject lines with different counts, just like
how jj does it today, since the original motivation for low-occurring
lines in both the patience diff and the histogram diff algorithms was
to keep high-signal lines (i.e. not lines such as "}" and "return;") as
context (instead of + or -), and if the count of a line differs, it is
probably not a high-signal line in the first place.

I don't think it's worth changing it now, though, especially in Git. In
the scenario you describe, even if we change Git to reject lines with
different counts, failing to find a matching line means we fall back to
Myers, which matches up the only "c" on the left and the first "c" on
the right anyway (so in the end, we still won't get the result that you
might want - reporting that the whole block has changed). (In jj's case,
in which there is no fall back to Myers, I think it's reasonable to make
histogram diff work only with non-different counts, since the lack of
a fall back will indeed mean that we report that the whole block has
changed. This sounds like the "highly ambiguous" case that Bram Cohen,
the inventor of patience diff, mentions in [3].)

To further complicate things, in both JGit and Git, a line with
different counts is not considered matching only if the count in
"A" (the left hand side) is greater than the count in "B" [2]. I can't
think of a reason for this asymmetry, and the class documentation
comment in [1] doesn't explain that either.

[1] https://eclipse.googlesource.com/jgit/jgit/+/refs/heads/master/org.eclipse.jgit/src/org/eclipse/jgit/diff/HistogramDiff.java
[2] https://eclipse.googlesource.com/jgit/jgit/+/refs/heads/master/org.eclipse.jgit/src/org/eclipse/jgit/diff/HistogramDiffIndex.java#206
[3] https://lore.kernel.org/git/alpine.DEB.1.00.0902052113590.7491@intel-tinevez-2-302/

> As some of you know, I work on the Jujutsu/jj VCS
> (https://github.com/jj-vcs/jj). We also use histogram diff (and only
> histogram diff) and actually allowed matching up lines with different
> counts a while ago, but I thought it seemed too arbitrary to line up
> the first matches if there were different counts, so we changed that.
> Then we got a report from a user that Git behaves differently. See
> https://github.com/jj-vcs/jj/issues/761#issuecomment-2581219294 for
> more details.
> 
> Thanks

I think that it is unavoidable that different VCSes may produce
different diffs. Even in Git itself, there are many options (including
which algorithm to use) that can change the nature of the diff produced.
diff mbox series

Patch

diff --git a/file b/file
index 0fdf397..7cfb042 100644
--- a/file
+++ b/file
@@ -1,6 +1,7 @@ 
 a
-b
+b2
 c
-d
-e
+d2
+c
+e2
 f