diff mbox series

diff.renames not working?

Message ID CAHd499BT35jvPtsuD9gfJB0HJ=NxtzyQOaiD7-=sHJbFYhphpg@mail.gmail.com (mailing list archive)
State New, archived
Headers show
Series diff.renames not working? | expand

Commit Message

Robert Dailey Sept. 13, 2019, 8:24 p.m. UTC
I'm using Git version 2.23. I have the `diff.renames` setting set to
`copies`. My code base has a file named `JniPaymentManager.hpp` (and
`cpp`) that had too much code in it, so I refactored this file by
splitting out a significant portion of the code in it to another file,
named `ZPayClient.hpp` (and `cpp`). When I diff the topic branch
against the mainline, it shows it as a copy with a percentage:

```
$ git diff --name-status master...topic
C055    Jni/JniPaymentManager.cpp    ZPay/ZPayClient.cpp
C070    Jni/JniPaymentManager.hpp    ZPay/ZPayClient.hpp
```

Now my goal is to diff `ZPayClient.hpp` and see the changes to the
moved-out portion of code as it relates to the original state of that
code in `JniPaymentManager.hpp`. To do this, I tried this command:

```
$ git diff master...topic -- ZPay/ZPayClient.hpp
```

The unified diff header I got back is:

```
+++ ZPay/ZPayClient.hpp
```

Hmm, it's treating it as a new file. Even though I have `diff.renames`
set to `copies`? Even though `diff --name-status` acknowledges the
relationship with the original file for the code on `master`? This is
confusing...

Out of curiosity, I thought I'd try this command:

```
git diff --follow master...topic -- ZPay/ZPayClient.hpp
```

And I get this unified diff header:

```
+++ ZPay/ZPayClient.hpp
```

Now this looks more like it. I can actually see a useful diff here,
instead of everything looking like a new file. But there is a lot of
confusion here:

1. `diff --follow` is not a documented[1] option. Why does it work?
2. `diff -M` doesn't actually work either. It should, though. In fact,
I expected it to work as `--follow` does. But it doesn't.
3. The `diff.renames` config doesn't seem to be working here, when it should.

Can someone explain the behavior I'm seeing? I really am confused
about all this...

[1]: https://git-scm.com/docs/git-diff

Comments

Bryan Turner Sept. 13, 2019, 10:36 p.m. UTC | #1
On Fri, Sep 13, 2019 at 1:24 PM Robert Dailey <rcdailey.lists@gmail.com> wrote:
>
> 2. `diff -M` doesn't actually work either. It should, though. In fact,
> I expected it to work as `--follow` does. But it doesn't.

Just a small point of clarification: Is -M really what you mean here?
Given you indicated you have "diff.renames=copies", wouldn't you need
-C? -M only detects renames, not copies.

(I haven't looked too deeply into the rest, but this detail caught my eye.)

Best regards,
Bryan Turner


> 3. The `diff.renames` config doesn't seem to be working here, when it should.
>
> Can someone explain the behavior I'm seeing? I really am confused
> about all this...
>
> [1]: https://git-scm.com/docs/git-diff
Jeff King Sept. 14, 2019, 3:30 a.m. UTC | #2
On Fri, Sep 13, 2019 at 03:24:06PM -0500, Robert Dailey wrote:

> Now my goal is to diff `ZPayClient.hpp` and see the changes to the
> moved-out portion of code as it relates to the original state of that
> code in `JniPaymentManager.hpp`. To do this, I tried this command:
> 
> ```
> $ git diff master...topic -- ZPay/ZPayClient.hpp
> ```
> 
> The unified diff header I got back is:
> 
> ```
> diff --git ZPay/ZPayClient.hpp ZPay/ZPayClient.hpp
> new file mode 100644
> index 00000000..6ebc2a9a
> --- /dev/null
> +++ ZPay/ZPayClient.hpp
> ```
> 
> Hmm, it's treating it as a new file. Even though I have `diff.renames`
> set to `copies`? Even though `diff --name-status` acknowledges the
> relationship with the original file for the code on `master`? This is
> confusing...

This is due to the way that rename detection works. When looking for
renames, the tree diff gives us a series of candidate "sources" that
were deleted and candidates "dests" that were added. And then we try to
match them up.

Copy detection differs in that it uses any file touched in the diff as a
source, not just deletions.

And then "--find-copies-harder" uses even unmodified files as sources
(but see below).

So here's a simplified setup:

  git init repo
  cd repo

  seq 100 >a
  git add a
  git commit -m a

  cp a b
  seq 99 >a
  git add a b
  git commit -m b

and then we can try these commands:

  # this won't find a rename, because "a" was not deleted
  git diff-tree --name-status -M HEAD^ HEAD

  # this will find the copy, because now we consider "a" a source
  git diff-tree --name-status -C HEAD^ HEAD

  # this _won't_ find the copy, because we limited our tree diff to
  # just look at "b"; hence we don't even consider "a" a source
  git diff-tree --name-status -C HEAD^ HEAD -- b

And that last one is the one that confused you. Naively it seems like
doing this would work (two "-C" are the same as "--find-copies-harder"):

  git diff-tree --name-status -C -C HEAD^ HEAD -- b

but it doesn't. That's because we're still using the tree diff to find
sources, and just adding unmodified entries to the source list. But our
pathspec prevents the diff from even considering "a". While this might
seem useless at first, you can imagine something like:

  git diff-tree -C -C HEAD^ HEAD -- subdir/

which would consider all files in subdir as sources, but not those
outside (which may be especially important for performance).

But there's no (clean) way to expand the set of paths that we consider
as sources without also showing them in the output. There are two
useful variants I could imagine, though:

  - a way to consider _all_ paths in the repository, not just those in
    the pathspec, as sources, but show only the entries from the
    pathspec. This could probably be a "harder" version of
    "--find-copies-harder", something like "-C -C -C <revs> -- b".
    Naturally this would be even more expensive in a big repo.

  - a way to independently specify the source pathspec and the
    output-limiting pathspec. This is a cheaper version of the one
    above, where you could look at a subset of the tree a sources, but
    limit the set of shown paths even further. It's not conceptually
    that difficult, but syntactically it gets weird since you have two
    lists of pathspecs on the command-line.

I think the first one wouldn't be _too_ hard if somebody is interested
in getting their feet wet with the diffcore-rename.c code. The second is
probably not worth the effort.

> Out of curiosity, I thought I'd try this command:
> 
> ```
> git diff --follow master...topic -- ZPay/ZPayClient.hpp
> ```

So yes, that does work, and is why I added the "(clean)" qualifier
above. It behaves like the "-C -C -C" I proposed. But the fact that it
does so is entirely accidental. What happens is this:

  - we're a little sloppy about what constitutes a traversal option and
    what is a diff option. Many diff commands rely on setup_revisions(),
    which parses both. So "diff --follow" probably _should_ be flagged
    as an error, but isn't.

  - the implementation of "--follow" works by doing a separate,
    from-scratch tree-level diff on each commit (it _has_ to ignore your
    pathspec, since by definition it allows only a single file to be in
    the pathspec). And then rather than throwing away that result, it
    feeds it to the rest of the diff pipeline, which then shows the
    output you expected.

So it does do what you want, but only for the single-file case. And
certainly it was never intended to, and that might change in the future.

> Now this looks more like it. I can actually see a useful diff here,
> instead of everything looking like a new file. But there is a lot of
> confusion here:
> 
> 1. `diff --follow` is not a documented[1] option. Why does it work?

Accident. :) See above.

> 2. `diff -M` doesn't actually work either. It should, though. In fact,
> I expected it to work as `--follow` does. But it doesn't.

It doesn't work because this is a copy, not a rename.

> 3. The `diff.renames` config doesn't seem to be working here, when it should.

It does, but the pathspec prevents it from finding a source candidate.

-Peff
Robert Dailey Sept. 16, 2019, 2:25 p.m. UTC | #3
On Fri, Sep 13, 2019 at 10:30 PM Jeff King <peff@peff.net> wrote:
> SNIP
>
> > Now this looks more like it. I can actually see a useful diff here,
> > instead of everything looking like a new file. But there is a lot of
> > confusion here:
> >
> > 1. `diff --follow` is not a documented[1] option. Why does it work?
>
> Accident. :) See above.
>
> > 2. `diff -M` doesn't actually work either. It should, though. In fact,
> > I expected it to work as `--follow` does. But it doesn't.
>
> It doesn't work because this is a copy, not a rename.
>
> > 3. The `diff.renames` config doesn't seem to be working here, when it should.
>
> It does, but the pathspec prevents it from finding a source candidate.

Jeff, thanks so much. All of your examples help to contrast the
different behavior. I thought -M did copies & renames, that was my
misunderstanding. But the fact that you explained that the file spec
would prevent it from working as I'd like anyway means it doesn't
matter too much.

Ultimately my goal is to use the pathspec to filter what is visible, I
don't necessarily intend it to obscure my results due to internal
behavior. I realize there are performance obligations too (as you
described) but I feel like something like this should be
straightforward for end users. I consider myself a step above most
people with my understanding of Git and how it functions, and even I
was confused by this. Especially when it comes to folks I work with at
my day job, I just don't see them making sense of this.

To me, as transparently as possible, moves and copies should be
handled intrinsically. I realize there are technical challenges
(performance, ambiguities, etc), but anything that can be done to help
with that would go a long way with most users. Maybe a 3rd option from
your list to solve this issue might be adding some metadata to commits
or blob object types to record the pathspec of the file it was a copy
or move of. Even if this required you to first do a git cp/mv before
modifying the content, or a strict similarity threshold, I think it
would go a long way. That means that `diff` can utilize the metadata
in the object itself to find the path specs to follow, instead of
requiring some complex deduction logic that has a steep performance
cost.
Junio C Hamano Sept. 16, 2019, 5:35 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

>   - a way to independently specify the source pathspec and the
>     output-limiting pathspec. This is a cheaper version of the one
>     above, where you could look at a subset of the tree a sources, but
>     limit the set of shown paths even further. It's not conceptually
>     that difficult, but syntactically it gets weird since you have two
>     lists of pathspecs on the command-line.

Yes, the pathspec has always been input limiter and I think we have
discussed adding separate output limiter from time to time, but
nothing has happened.

Back in the scripted porcelain days, it would conceptually have been
simpler and cleaner, as you probably would run "diff-tree --raw"
with no (or "input") pathspec, filter its output with "output"
pathspec, and then convert the raw diff to a patch ( we used to have
such a filter, before we gave the -p option to all three "diff"
family backends).

Introducing an option to say "Pretend as if the system supported
output pathspec, and use the pathspec as such, without any input
pathspec" feels like a dirty hack compared to conceptual purity of
having two separate sets, and having to run without any input
pathspec may not be usable in truly large project, but I suspect it
may be good enough for most people (i.e. your "first one" that
wouldn't be too hard).  I am not sure if I like it, though X-<.

>> ```
>> git diff --follow master...topic -- ZPay/ZPayClient.hpp
>> ```
>
> So yes, that does work, and is why I added the "(clean)" qualifier
> above. It behaves like the "-C -C -C" I proposed. But the fact that it
> does so is entirely accidental. What happens is this:

I am kinda surprised that "diff" gets affected by "--follow" (I knew
that the command line parser would take that option meant for "log"
family without complaining) at all.
diff mbox series

Patch

diff --git ZPay/ZPayClient.hpp ZPay/ZPayClient.hpp
new file mode 100644
index 00000000..6ebc2a9a
--- /dev/null
diff --git Jni/JniPaymentManager.hpp ZPay/ZPayClient.hpp
similarity index 70%
copy from Jni/JniPaymentManager.hpp
copy to ZPay/ZPayClient.hpp
index fc18e2d2..6ebc2a9a 100644
--- Jni/JniPaymentManager.hpp