Message ID | CANiSa6jtwizbR4K-DqdKjVeZqAkbswnPXCBZZrrfNy2CKBEQVg@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Histogram/patience diff matching lines with different counts | expand |
Martin von Zweigbergk <martinvonz@gmail.com> writes: > After that, "c" is unique on the left side but has a different count > (namely 2) on the right side, so I would have thought that it should > not be considered matching. Does anyone know if it's implemented this > way on purpose? The purpose, if any, might be lost to history. The implementation of histogram diff in Git seems to be a port from JGit (8c912eea94 (teach --histogram to diff, 2011-07-12)). The one in JGit [1] seems to be an original extension of patience diff by Shawn Pearce, who has passed away a few years ago. The class documentation comment in [1] does not give any rationale for or against rejecting lines with different counts, but quoting from it: > * By always selecting a LCS position with the lowest occurrence count, this > * algorithm behaves exactly like Bram Cohen's patience diff whenever there is a > * unique common element available between the two sequences. When no unique > * elements exist, the lowest occurrence element is chosen instead. This offers > * more readable diffs than simply falling back on the standard Myers' O(ND) > * algorithm would produce. I think it makes sense to reject lines with different counts, just like how jj does it today, since the original motivation for low-occurring lines in both the patience diff and the histogram diff algorithms was to keep high-signal lines (i.e. not lines such as "}" and "return;") as context (instead of + or -), and if the count of a line differs, it is probably not a high-signal line in the first place. I don't think it's worth changing it now, though, especially in Git. In the scenario you describe, even if we change Git to reject lines with different counts, failing to find a matching line means we fall back to Myers, which matches up the only "c" on the left and the first "c" on the right anyway (so in the end, we still won't get the result that you might want - reporting that the whole block has changed). (In jj's case, in which there is no fall back to Myers, I think it's reasonable to make histogram diff work only with non-different counts, since the lack of a fall back will indeed mean that we report that the whole block has changed. This sounds like the "highly ambiguous" case that Bram Cohen, the inventor of patience diff, mentions in [3].) To further complicate things, in both JGit and Git, a line with different counts is not considered matching only if the count in "A" (the left hand side) is greater than the count in "B" [2]. I can't think of a reason for this asymmetry, and the class documentation comment in [1] doesn't explain that either. [1] https://eclipse.googlesource.com/jgit/jgit/+/refs/heads/master/org.eclipse.jgit/src/org/eclipse/jgit/diff/HistogramDiff.java [2] https://eclipse.googlesource.com/jgit/jgit/+/refs/heads/master/org.eclipse.jgit/src/org/eclipse/jgit/diff/HistogramDiffIndex.java#206 [3] https://lore.kernel.org/git/alpine.DEB.1.00.0902052113590.7491@intel-tinevez-2-302/ > As some of you know, I work on the Jujutsu/jj VCS > (https://github.com/jj-vcs/jj). We also use histogram diff (and only > histogram diff) and actually allowed matching up lines with different > counts a while ago, but I thought it seemed too arbitrary to line up > the first matches if there were different counts, so we changed that. > Then we got a report from a user that Git behaves differently. See > https://github.com/jj-vcs/jj/issues/761#issuecomment-2581219294 for > more details. > > Thanks I think that it is unavoidable that different VCSes may produce different diffs. Even in Git itself, there are many options (including which algorithm to use) that can change the nature of the diff produced.
diff --git a/file b/file index 0fdf397..7cfb042 100644 --- a/file +++ b/file @@ -1,6 +1,7 @@ a -b +b2 c -d -e +d2 +c +e2 f