diff mbox series

[v2] diff: Fix modified lines stats with --stat and --numstat

Message ID 20200920130945.26399-1-tguyot@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v2] diff: Fix modified lines stats with --stat and --numstat | expand

Commit Message

Thomas Guyot Sept. 20, 2020, 1:09 p.m. UTC
In builtin_diffstat(), when both files are coming from "stdin" (which
could be better described as the file's content being written directly
into the file object), oideq() compares two null hashes and ignores the
actual differences for the statistics.

This patch checks if is_stdin flag is set on both sides and compare
contents directly.

Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
---
Range-diff:
1:  479c2835fc ! 1:  1f25713d44 diff: Fix modified lines stats with --stat and --numstat
    @@ -20,8 +20,12 @@
      	}
      
     -	same_contents = oideq(&one->oid, &two->oid);
    ++	/* What is_stdin really means is that the file's content is only
    ++	 * in the filespec's buffer and its oid is zero. We can't compare
    ++	 * oid's if both are null and we can just diff the buffers */
     +	if (one->is_stdin && two->is_stdin)
    -+		same_contents = !strcmp(one->data, two->data);
    ++		same_contents = (one->size == two->size ?
    ++			!memcmp(one->data, two->data, one->size) : 0);
     +	else
     +		same_contents = oideq(&one->oid, &two->oid);
      

 diff.c                | 9 ++++++++-
 t/t3206-range-diff.sh | 8 ++++----
 2 files changed, 12 insertions(+), 5 deletions(-)

Comments

Taylor Blau Sept. 20, 2020, 3:39 p.m. UTC | #1
On Sun, Sep 20, 2020 at 09:09:46AM -0400, Thomas Guyot-Sionnest wrote:
> In builtin_diffstat(), when both files are coming from "stdin" (which
> could be better described as the file's content being written directly
> into the file object), oideq() compares two null hashes and ignores the
> actual differences for the statistics.
>
> This patch checks if is_stdin flag is set on both sides and compare
> contents directly.
>
> Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
> ---
> Range-diff:
> 1:  479c2835fc ! 1:  1f25713d44 diff: Fix modified lines stats with --stat and --numstat
>     @@ -20,8 +20,12 @@
>       	}
>
>      -	same_contents = oideq(&one->oid, &two->oid);
>     ++	/* What is_stdin really means is that the file's content is only
>     ++	 * in the filespec's buffer and its oid is zero. We can't compare
>     ++	 * oid's if both are null and we can just diff the buffers */
>      +	if (one->is_stdin && two->is_stdin)
>     -+		same_contents = !strcmp(one->data, two->data);
>     ++		same_contents = (one->size == two->size ?
>     ++			!memcmp(one->data, two->data, one->size) : 0);
>      +	else
>      +		same_contents = oideq(&one->oid, &two->oid);

After reading your explanation in [1], this version makes more sense to
me.

Thanks.

[1]: https://lore.kernel.org/git/f4c4cb48-f4b5-3d4d-066d-b94e961dcbb5@gmail.com/

Taylor
Thomas Guyot Sept. 20, 2020, 4:38 p.m. UTC | #2
On 2020-09-20 11:39, Taylor Blau wrote:
> On Sun, Sep 20, 2020 at 09:09:46AM -0400, Thomas Guyot-Sionnest wrote:
>> In builtin_diffstat(), when both files are coming from "stdin" (which
>> could be better described as the file's content being written directly
>> into the file object), oideq() compares two null hashes and ignores the
>> actual differences for the statistics.
>>
>> This patch checks if is_stdin flag is set on both sides and compare
>> contents directly.
>>
>> Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
>> ---
>> Range-diff:
>> 1:  479c2835fc ! 1:  1f25713d44 diff: Fix modified lines stats with --stat and --numstat
>>     @@ -20,8 +20,12 @@
>>       	}
>>
>>      -	same_contents = oideq(&one->oid, &two->oid);
>>     ++	/* What is_stdin really means is that the file's content is only
>>     ++	 * in the filespec's buffer and its oid is zero. We can't compare
>>     ++	 * oid's if both are null and we can just diff the buffers */
>>      +	if (one->is_stdin && two->is_stdin)
>>     -+		same_contents = !strcmp(one->data, two->data);
>>     ++		same_contents = (one->size == two->size ?
>>     ++			!memcmp(one->data, two->data, one->size) : 0);
>>      +	else
>>      +		same_contents = oideq(&one->oid, &two->oid);
> 
> After reading your explanation in [1], this version makes more sense to
> me.
> 
> Thanks.
> 
> [1]: https://lore.kernel.org/git/f4c4cb48-f4b5-3d4d-066d-b94e961dcbb5@gmail.com/

There's a little bit missing... Just before the new code example:,
previous to last paragraph:

> it's assumed [we can just call oidcmp()] and I won't make complex

--
Thomas
Junio C Hamano Sept. 20, 2020, 7:11 p.m. UTC | #3
Taylor Blau <me@ttaylorr.com> writes:

> On Sun, Sep 20, 2020 at 09:09:46AM -0400, Thomas Guyot-Sionnest wrote:
>> In builtin_diffstat(), when both files are coming from "stdin" (which
>> could be better described as the file's content being written directly
>> into the file object), oideq() compares two null hashes and ignores the
>> actual differences for the statistics.
>>
>> This patch checks if is_stdin flag is set on both sides and compare
>> contents directly.
>>
>> Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
>> ---
>> Range-diff:
>> 1:  479c2835fc ! 1:  1f25713d44 diff: Fix modified lines stats with --stat and --numstat
>>     @@ -20,8 +20,12 @@
>>       	}
>>
>>      -	same_contents = oideq(&one->oid, &two->oid);
>>     ++	/* What is_stdin really means is that the file's content is only
>>     ++	 * in the filespec's buffer and its oid is zero. We can't compare
>>     ++	 * oid's if both are null and we can just diff the buffers */
>>      +	if (one->is_stdin && two->is_stdin)
>>     -+		same_contents = !strcmp(one->data, two->data);
>>     ++		same_contents = (one->size == two->size ?
>>     ++			!memcmp(one->data, two->data, one->size) : 0);
>>      +	else
>>      +		same_contents = oideq(&one->oid, &two->oid);
>
> After reading your explanation in [1], this version makes more sense to
> me.

These oid fields are prepared by calling diff_fill_oid_info(), and
even for paths that are dirty (hence no "free" oid available from
index or tree entry), an appropriate oid is computed.

But there are paths for which oid cannot be computed without
destroying their contents.  Such paths are marked by the function
with null_oid.

It happens to be that stdin is the only class of paths that are
treated that way _right_ _now_, but future code may support
different kind of paths that share the same trait.

When we want to know "is comparing the oid sufficient?", we
shouldn't inspect the is_stdin flag ourselves in a caller of
diff_fill_oid_info(), because the helper _is_ responsible for
knowing what kind of paths are special, and signals that "assume
this would not be equal to anything else" by giving null_oid back.

The caller should use the info left by diff_fill_oid_info(), namely,
"even if the oid on both sides are the same, if it is null_oid, then
we know diff_fill_oid_info() didn't actually compute the oid, and we
need to compare the blob ourselves".

And there is no point in doing memcmp() here, I think.  

The same_contents() check is done as an optimization to avoid xdl.
Even if the two sides were thought to be different at the oid level,
xdl comparison may find that there is no difference after all
(e.g. think of whitespace ignoring comparison), so we should assume
and rely on that the downstream code MUST BE prepared to handle
false negatives (i.e. same_contents says they are different, but
they actually produce no diffstat).  Running memcmp() over contents
in potentially a large buffer to find that they are different, and
then have xdl process that large buffer again, would be a waste.

Summarizing the above, I think the second best fix is this (which
means that the posted patch is the third best):

	/*
	 * diff_fill_oid_info() marked one/two->oid with null_oid
	 * for a path whose oid is not available.  Disable early
	 * return optimization for them.
	 */
	if (oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
		same_contents = 0; /* could be different */
	else if (oideq(&one->oid, &two->oid))
		same_contents = 1; /* definitely the same */
	else
		same_contents = 0; /* definitely different */

But I suspect that the best fix is to teach diff_fill_oid_info() to
hash the in-memory data to compute the oid, instead of punting and
filling the oid field with null_oid.  If function builtin_diffstat()
is allowed to look at the contents and run memcmp() here, the 'data'
field should have been filled and valid when diff_fill_oid_info()
looked at it already.

The "best" fix will have wider consequences, so we may not want to
jump to it right away without careful consideration.

For example, the "best" fix will fix another bug.  The 'index'
header shows a NULL object name in normal "diff --patch" output for
these paths in the current code, which means they cannot be used
with "apply --3way".  That way, this codepath does not have to know
anything about the "null means we don't know" convention.  

Try:

    $ (cat COPYING; echo) >RENAMING
    $ git diff --no-index COPYING - <RENAMING | grep '^index '
    index 536e55524d..0000000000 100644

and notice that the stdin side has a null object name in the current
code.  I think we will show the right object name if we fix the
diff_fill_oid_info().

Thanks.
Junio C Hamano Sept. 20, 2020, 8:08 p.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:

> But I suspect that the best fix is to teach diff_fill_oid_info() to
> hash the in-memory data to compute the oid, instead of punting and
> filling the oid field with null_oid.  If function builtin_diffstat()
> is allowed to look at the contents and run memcmp() here, the 'data'
> field should have been filled and valid when diff_fill_oid_info()
> looked at it already.
>
> The "best" fix will have wider consequences, so we may not want to
> jump to it right away without careful consideration.

And then after giving a bit more thought, I don't recommend to go
with this approach, because it breaks an established convention that
objects with unknown name is perfectly OK and shown with the null
oid.

In other words, I'd suggest to use the "second best" one I gave in
the message I am responding to.

Thanks.
Junio C Hamano Sept. 20, 2020, 8:36 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> Summarizing the above, I think the second best fix is this (which
> means that the posted patch is the third best):
>
> 	/*
> 	 * diff_fill_oid_info() marked one/two->oid with null_oid
> 	 * for a path whose oid is not available.  Disable early
> 	 * return optimization for them.
> 	 */
> 	if (oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
> 		same_contents = 0; /* could be different */
> 	else if (oideq(&one->oid, &two->oid))
> 		same_contents = 1; /* definitely the same */
> 	else
> 		same_contents = 0; /* definitely different */

A tangent.

There is this code in diff.c::fill_metainfo() that is used to
populate the "index" header element of "diff --patch" output:

	if (one && two && !oideq(&one->oid, &two->oid)) {
		const unsigned hexsz = the_hash_algo->hexsz;
		int abbrev = o->abbrev ? o->abbrev : DEFAULT_ABBREV;

		if (o->flags.full_index)
			abbrev = hexsz;

		if (o->flags.binary) {
			mmfile_t mf;
			if ((!fill_mmfile(o->repo, &mf, one) &&
			     diff_filespec_is_binary(o->repo, one)) ||
			    (!fill_mmfile(o->repo, &mf, two) &&
			     diff_filespec_is_binary(o->repo, two)))
				abbrev = hexsz;
		}
		strbuf_addf(msg, "%s%sindex %s..%s", line_prefix, set,
			    diff_abbrev_oid(&one->oid, abbrev),
			    diff_abbrev_oid(&two->oid, abbrev));
		if (one->mode == two->mode)
			strbuf_addf(msg, " %06o", one->mode);
		strbuf_addf(msg, "%s\n", reset);
	}

Currently it is OK because there can only be one side that
diff_fill_oid_info() would mark as "oid unavailable" (e.g. reading
standard input stream).  If a new feature is introducing a situation
where both ends have null_oid, which was so far been impossible,
we'd probably need to factor out the condition used in the above
into a helper function, e.g.

    static int cannot_be_the_same(struct diff_filespec *one, struct diff_filespec *two)
    {
	if ((oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
		return 1;
	else if (oideq(&one->oid, &two->oid))
		return 0;
	else
		return 1;
    }

and rewrite the conditional in fill_metainfo() to

	if (one && two && cannot_be_the_same(one, two)) {
		...

The "second best fix" could then become a single liner:

	same_contents = !cannot_be_the_same(one, two);

using the helper.
Junio C Hamano Sept. 20, 2020, 10:15 p.m. UTC | #6
Junio C Hamano <gitster@pobox.com> writes:

Sorry, this will be the last message from me on this topic for now.

> we'd probably need to factor out the condition used in the above
> into a helper function, e.g.
>
>     static int cannot_be_the_same(struct diff_filespec *one, struct diff_filespec *two)

The naming of this helper is tricky.  In both potential callers,
what we want to see is "one and two may be different, we cannot say
they are the same with certainty", so "cannot be the same" is a
misnomer.  Worse, the negated form is hard to grok.

Perhaps "may_differ()" is a more correct name.  If either side is
NULL oid, we cannot say they are the same, so it is true.  If two
oid that are not NULL oid are the same, there is no possibility that
they are different, so we return false.  And two oid that are not
NULL oid are different, we know they are different, so we return
true.

>     {
> 	if ((oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
> 		return 1;
> 	else if (oideq(&one->oid, &two->oid))
> 		return 0;
> 	else
> 		return 1;
>     }
>
> and rewrite the conditional in fill_metainfo() to
>
> 	if (one && two && cannot_be_the_same(one, two)) {
> 		...

And this becomes much easier to read and understand, i.e.

	if (one && two && may_differ(one, two)) {
		... create the index one->oid..two->oid header ...

> The "second best fix" could then become a single liner:
>
> 	same_contents = !cannot_be_the_same(one, two);
>
> using the helper.

And this becomes

	same_contents = !may_differ(one, two);

meaning that "there is no possibility that one and two are
different".  That allows us to optimize out the invocation of the
xdl machinery.
Jeff King Sept. 21, 2020, 7:26 p.m. UTC | #7
On Sun, Sep 20, 2020 at 12:11:20PM -0700, Junio C Hamano wrote:

> > After reading your explanation in [1], this version makes more sense to
> > me.
> 
> These oid fields are prepared by calling diff_fill_oid_info(), and
> even for paths that are dirty (hence no "free" oid available from
> index or tree entry), an appropriate oid is computed.

This is the part that confused me earlier. I expected these "stdin"
entries, just like other some other entries (e.g., stat dirty ones for
diff-files, or anything for "diff --no-index") to have bogus oids.

But that diff_fill_oid_info() is what actually computes the sha1 from
scratch for them. I get why that is needed for generating a git diff, as
we have an "index from...to" line there that we'd want to fill.

For diffstat, though, it seems like a waste of time; we don't care what
the object hash is. I.e., if we were to do this:

diff --git a/diff.c b/diff.c
index 16eeaf6645..1934af29a5 100644
--- a/diff.c
+++ b/diff.c
@@ -4564,9 +4564,6 @@ static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
 	if (o->prefix_length)
 		strip_prefix(o->prefix_length, &name, &other);
 
-	diff_fill_oid_info(p->one, o->repo->index);
-	diff_fill_oid_info(p->two, o->repo->index);
-
 	builtin_diffstat(name, other, p->one, p->two,
 			 diffstat, o, p);
 }

then everything seems to work fine _except_ a "git diff --stat
--no-index", exactly because it hits this "same_contents" check we've
been discussing. And once that is fixed properly (to handle any case
where we have no oid, not just when the stdin flag is set), then perhaps
it is worth doing.

Or perhaps not. Even if we have to memcmp sometimes in
builtin_diffstat(), it would be faster than computing the individual
hashes. But it may not be measurably so, and it would be no difference
for the common case of filespecs for which we do know the oid for free.
I also suspect we'd need to be a little smarter about combined formats
(e.g., "--stat --patch" might as well compute the oid as early as
possible, since we'll need it eventually for the patch; but we'd hit the
call in builtin_diffstat() before the one in run_diff()).

> But there are paths for which oid cannot be computed without
> destroying their contents.  Such paths are marked by the function
> with null_oid.

I'm not clear how computing the oid destroys the contents. We have them
in an in-memory buffer at this point, don't we? So we _could_ generate
an oid even for stdin, like this:

diff --git a/cache.h b/cache.h
index 55d7f61087..1ace143eac 100644
--- a/cache.h
+++ b/cache.h
@@ -858,6 +858,7 @@ int ie_modified(struct index_state *, const struct cache_entry *, struct stat *,
 #define HASH_RENORMALIZE  4
 int index_fd(struct index_state *istate, struct object_id *oid, int fd, struct stat *st, enum object_type type, const char *path, unsigned flags);
 int index_path(struct index_state *istate, struct object_id *oid, const char *path, struct stat *st, unsigned flags);
+int index_mem(struct index_state *istate, struct object_id *oid, void *buf, size_t size, enum object_type type, const char *path, unsigned flags);
 
 /*
  * Record to sd the data from st that we use to check whether a file
diff --git a/diff.c b/diff.c
index 16eeaf6645..181b632114 100644
--- a/diff.c
+++ b/diff.c
@@ -4463,7 +4463,10 @@ static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *is
 		if (!one->oid_valid) {
 			struct stat st;
 			if (one->is_stdin) {
-				oidclr(&one->oid);
+				if (index_mem(istate, &one->oid,
+					      one->data, one->size,
+					      OBJ_BLOB, one->path, 0))
+					die("cannot hash diff file from stdin");
 				return;
 			}
 			if (lstat(one->path, &st) < 0)
diff --git a/sha1-file.c b/sha1-file.c
index 770501d6d1..c7d017b3e0 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -2046,10 +2046,10 @@ static void check_tag(const void *buf, size_t size)
 		die(_("corrupt tag"));
 }
 
-static int index_mem(struct index_state *istate,
-		     struct object_id *oid, void *buf, size_t size,
-		     enum object_type type,
-		     const char *path, unsigned flags)
+int index_mem(struct index_state *istate,
+	      struct object_id *oid, void *buf, size_t size,
+	      enum object_type type,
+	      const char *path, unsigned flags)
 {
 	int ret, re_allocated = 0;
 	int write_object = flags & HASH_WRITE_OBJECT;

which is basically your "best fix" from below. It fixes the bug here,
and it gives you a non-null index line. I'd consider coupling it with
calling fill_oid less often, though (something like range-diff computes
a bunch of fake-stdin diffs, and doesn't need to waste time computing
the oids at all).

> Summarizing the above, I think the second best fix is this (which
> means that the posted patch is the third best):
> 
> 	/*
> 	 * diff_fill_oid_info() marked one/two->oid with null_oid
> 	 * for a path whose oid is not available.  Disable early
> 	 * return optimization for them.
> 	 */
> 	if (oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
> 		same_contents = 0; /* could be different */
> 	else if (oideq(&one->oid, &two->oid))
> 		same_contents = 1; /* definitely the same */
> 	else
> 		same_contents = 0; /* definitely different */

This is the direction I was getting at in my earlier emails, except that
I imagined that first conditional could be checking:

  if (!one->oid_valid || !two->oid_valid)

but I was surprised to see that diff_fill_oid_info() does not set
oid_valid. Is that a bug?

I also imagined that we'd have to determine right then whether the
contents are actually different or not with a memcmp(), to avoid
emitting a "0 changes" line, but we do handle that case within the
"!same_contents" conditional. See the comment starting with "Omit
diffstats..." added recently by 1cf3d5db9b (diff: teach --stat to ignore
uninteresting modifications, 2020-08-20).

-Peff
Junio C Hamano Sept. 21, 2020, 9:51 p.m. UTC | #8
Jeff King <peff@peff.net> writes:

> For diffstat, though, it seems like a waste of time; we don't care what
> the object hash is. I.e., if we were to do this:
>
> diff --git a/diff.c b/diff.c
> index 16eeaf6645..1934af29a5 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4564,9 +4564,6 @@ static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
>  	if (o->prefix_length)
>  		strip_prefix(o->prefix_length, &name, &other);
>  
> -	diff_fill_oid_info(p->one, o->repo->index);
> -	diff_fill_oid_info(p->two, o->repo->index);
> -
>  	builtin_diffstat(name, other, p->one, p->two,
>  			 diffstat, o, p);
>  }
>
> then everything seems to work fine _except_ a "git diff --stat
> --no-index", exactly because it hits this "same_contents" check we've
> been discussing. And once that is fixed properly (to handle any case
> where we have no oid, not just when the stdin flag is set), then perhaps
> it is worth doing.

> Or perhaps not. Even if we have to memcmp sometimes in
> builtin_diffstat(), it would be faster than computing the individual
> hashes. But it may not be measurably so, and it would be no difference
> for the common case of filespecs for which we do know the oid for free.
> I also suspect we'd need to be a little smarter about combined formats
> (e.g., "--stat --patch" might as well compute the oid as early as
> possible, since we'll need it eventually for the patch; but we'd hit the
> call in builtin_diffstat() before the one in run_diff()).
>
>> But there are paths for which oid cannot be computed without
>> destroying their contents.  Such paths are marked by the function
>> with null_oid.
>
> I'm not clear how computing the oid destroys the contents. We have them
> in an in-memory buffer at this point, don't we? So we _could_ generate
> an oid even for stdin, like this:

Yes, yes yes.  That is the "best" (which later retracted) approach I
suggested in the same message, but it would end up filling a
real-looking object name for working tree side of diff-files, which
has a far larger consequence we need to think about and consumes
more brain cycles than warranted here, I would think.

>> Summarizing the above, I think the second best fix is this (which
>> means that the posted patch is the third best):
>> 
>> 	/*
>> 	 * diff_fill_oid_info() marked one/two->oid with null_oid
>> 	 * for a path whose oid is not available.  Disable early
>> 	 * return optimization for them.
>> 	 */
>> 	if (oideq(&one->oid, &null_oid) || oideq(&two->oid, &null_oid))
>> 		same_contents = 0; /* could be different */
>> 	else if (oideq(&one->oid, &two->oid))
>> 		same_contents = 1; /* definitely the same */
>> 	else
>> 		same_contents = 0; /* definitely different */
>
> This is the direction I was getting at in my earlier emails, except that
> I imagined that first conditional could be checking:
>
>   if (!one->oid_valid || !two->oid_valid)
>
> but I was surprised to see that diff_fill_oid_info() does not set
> oid_valid. Is that a bug?

I do not think so.  oid_valid refers to the state during the
collection phase (those who called diff_addremove() etc.) and
updating it in diff_fill_oid_info() would lose information.  Maybe
nobody looks at the bit at this late in the processing chain these
days, in which case we can start flipping the bit there, but I
offhand do not know what consequences such a change would trigger.

> I also imagined that we'd have to determine right then whether the
> contents are actually different or not with a memcmp(), to avoid
> emitting a "0 changes" line, but we do handle that case within the
> "!same_contents" conditional. See the comment starting with "Omit
> diffstats..." added recently by 1cf3d5db9b (diff: teach --stat to ignore
> uninteresting modifications, 2020-08-20).

Yes, we are essentially on the same page---same_contents bit is
merely an optimization to decide cheaply when we do not have to do
xdl, but the codepath that does the xdl must be prepared to deal
with the "we thought they are different, but after all they turn out
to be equivalent" case.  Therefore false positive to declare two
different things as same cannot be tolerated, but false negative to
declare two things that are the same as !same_contents is fine.
Jeff King Sept. 21, 2020, 10:20 p.m. UTC | #9
On Mon, Sep 21, 2020 at 02:51:21PM -0700, Junio C Hamano wrote:

> > This is the direction I was getting at in my earlier emails, except that
> > I imagined that first conditional could be checking:
> >
> >   if (!one->oid_valid || !two->oid_valid)
> >
> > but I was surprised to see that diff_fill_oid_info() does not set
> > oid_valid. Is that a bug?
> 
> I do not think so.  oid_valid refers to the state during the
> collection phase (those who called diff_addremove() etc.) and
> updating it in diff_fill_oid_info() would lose information.  Maybe
> nobody looks at the bit at this late in the processing chain these
> days, in which case we can start flipping the bit there, but I
> offhand do not know what consequences such a change would trigger.

We use the flag to determine whether we need to compute the oid from
scratch. So I would think the current code causes us to compute the oid
multiple times in many cases. For example, with this patch:

diff --git a/diff.c b/diff.c
index ee8e8189e9..8363abab5b 100644
--- a/diff.c
+++ b/diff.c
@@ -4424,6 +4424,8 @@ static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *is
 				die_errno("stat '%s'", one->path);
 			if (index_path(istate, &one->oid, one->path, &st, 0))
 				die("cannot hash %s", one->path);
+			warning("computed oid of %s as %s",
+				one->path, oid_to_hex(&one->oid));
 		}
 	}
 	else

I get (because diff.c is dirty in my working tree due to the patch):

  $ ./git diff --stat -p
  warning: computed oid of diff.c as 8363abab5b51479ac8cc9fb1c96b39fb90041f88
   diff.c | 2 ++
   1 file changed, 2 insertions(+)
  
  warning: computed oid of diff.c as 8363abab5b51479ac8cc9fb1c96b39fb90041f88
  diff --git a/diff.c b/diff.c
  index ee8e8189e9..8363abab5b 100644
  --- a/diff.c
  +++ b/diff.c
  @@ -4424,6 +4424,8 @@ static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *is
   				die_errno("stat '%s'", one->path);
   			if (index_path(istate, &one->oid, one->path, &st, 0))
   				die("cannot hash %s", one->path);
  +			warning("computed oid of %s as %s",
  +				one->path, oid_to_hex(&one->oid));
   		}
   	}
   	else

even though we already know the oid in the second call, so it's wasted
work. I agree that other code could be depending on oid_valid in a weird
way, but IMHO that code is probably wrong to do so. But it may not be
worth digging into, if nobody has complained about the waste.

> > I also imagined that we'd have to determine right then whether the
> > contents are actually different or not with a memcmp(), to avoid
> > emitting a "0 changes" line, but we do handle that case within the
> > "!same_contents" conditional. See the comment starting with "Omit
> > diffstats..." added recently by 1cf3d5db9b (diff: teach --stat to ignore
> > uninteresting modifications, 2020-08-20).
> 
> Yes, we are essentially on the same page---same_contents bit is
> merely an optimization to decide cheaply when we do not have to do
> xdl, but the codepath that does the xdl must be prepared to deal
> with the "we thought they are different, but after all they turn out
> to be equivalent" case.  Therefore false positive to declare two
> different things as same cannot be tolerated, but false negative to
> declare two things that are the same as !same_contents is fine.

I thought it may matter on "maint", where we do not have 1cf3d5db9b.
I.e., I expected:

  echo foo >a
  echo foo >b
  git diff --no-index --stat a b

might switch from no output to having a line like:

  a => b | 0

But we don't even get to builtin_diffstat() there. We throw out the pair
in diffcore_skip_stat_unmatch(). Likewise, if you get past that with
something like a mode change:

  chmod +x b
  git diff --no-index --stat a b

then that does generate the "0" stat line. But it does so both before
and after the proposed change. The same thing happens in no-index mode:

  git init
  echo foo >file
  git add .
  git commit -am no-bit
  chmod +x file
  git commit -am exec-bit
  git show --stat

will give you:

   file | 0

I'm not sure if that's the desired behavior or not, but at any rate
fixing this builtin_diffstat() conditional won't change it either way. :)

-Peff
Junio C Hamano Sept. 21, 2020, 10:37 p.m. UTC | #10
Jeff King <peff@peff.net> writes:

> .... I agree that other code could be depending on oid_valid in a weird
> way, but IMHO that code is probably wrong to do so. But it may not be
> worth digging into, if nobody has complained about the waste.

Yup, that was my feeling when I wrote the message you are responding
to.
diff mbox series

Patch

diff --git a/diff.c b/diff.c
index ee8e8189e9..2e47bf824e 100644
--- a/diff.c
+++ b/diff.c
@@ -3681,7 +3681,14 @@  static void builtin_diffstat(const char *name_a, const char *name_b,
 		return;
 	}
 
-	same_contents = oideq(&one->oid, &two->oid);
+	/* What is_stdin really means is that the file's content is only
+	 * in the filespec's buffer and its oid is zero. We can't compare
+	 * oid's if both are null and we can just diff the buffers */
+	if (one->is_stdin && two->is_stdin)
+		same_contents = (one->size == two->size ?
+			!memcmp(one->data, two->data, one->size) : 0);
+	else
+		same_contents = oideq(&one->oid, &two->oid);
 
 	if (diff_filespec_is_binary(o->repo, one) ||
 	    diff_filespec_is_binary(o->repo, two)) {
diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
index e024cff65c..4715e75b68 100755
--- a/t/t3206-range-diff.sh
+++ b/t/t3206-range-diff.sh
@@ -258,11 +258,11 @@  test_expect_success 'changed commit with --stat diff option' '
 	     a => b | 0
 	     1 file changed, 0 insertions(+), 0 deletions(-)
 	3:  $(test_oid t3) ! 3:  $(test_oid c3) s/11/B/
-	     a => b | 0
-	     1 file changed, 0 insertions(+), 0 deletions(-)
+	     a => b | 2 +-
+	     1 file changed, 1 insertion(+), 1 deletion(-)
 	4:  $(test_oid t4) ! 4:  $(test_oid c4) s/12/B/
-	     a => b | 0
-	     1 file changed, 0 insertions(+), 0 deletions(-)
+	     a => b | 2 +-
+	     1 file changed, 1 insertion(+), 1 deletion(-)
 	EOF
 	test_cmp expect actual
 '