[v2] log -G: Ignore binary files

Message ID c4eac0b0ff0812e5aa8b081e603fc8bdd042ddeb.1543403143.git.thomas.braun@virtuell-zuhause.de (mailing list archive)
State New, archived

Commit Message

Thomas Braun Nov. 28, 2018, 11:32 a.m. UTC
The -G<regex> option of log looks for the differences whose patch text
contains added/removed lines that match regex.

As the concept of patch text only makes sense for text files, we need to
ignore binary files when searching with -G <regex> as well.

The -S<block of text> option of log looks for differences that change
the number of occurrences of the specified block of text (i.e.
addition/deletion) in a file. As we want to keep the current behaviour,
add a test to ensure it.

Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
---

Changes since v1:
- Merged both patches into one
- Adapted commit messages
- Added missing support for -a flag with tests
- Placed new code into correct location to be able to reuse an existing
  optimization
- Uses help-suggested -Ga writing without spaces
- Uses orphan branches instead of cannonball cleanup with rm -rf
- Changed search text to make it clear that it is not about the \0 boundary

 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c            |  6 ++++++
 t/t4209-log-pickaxe.sh        | 40 +++++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+), 1 deletion(-)

Comments

Ævar Arnfjörð Bjarmason Nov. 28, 2018, 12:54 p.m. UTC | #1
On Wed, Nov 28 2018, Thomas Braun wrote:

Looks much better this time around.

> The -G<regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> As the concept of patch text only makes sense for text files, we need to
> ignore binary files when searching with -G <regex> as well.
>
> The -S<block of text> option of log looks for differences that change
> the number of occurrences of the specified block of text (i.e.
> addition/deletion) in a file. As we want to keep the current behaviour,
> add a test to ensure it.
> [...]
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.

Now that we support --text that should be documented. I tried to come up
with something on top:

    diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
    index 0378cd574e..42ae65fb57 100644
    --- a/Documentation/diff-options.txt
    +++ b/Documentation/diff-options.txt
    @@ -524,6 +524,10 @@ struct), and want to know the history of that block since it first
     came into being: use the feature iteratively to feed the interesting
     block in the preimage back into `-S`, and keep going until you get the
     very first version of the block.
    ++
    +Unlike `-G` the `-S` option will always search through binary files
    +without a textconv filter. [[TODO: Don't we want to support --no-text
    +then as an optimization?]].

     -G<regex>::
     	Look for differences whose patch text contains added/removed
    @@ -545,6 +549,15 @@ occurrences of that string did not change).
     +
     See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more
     information.
    ++
    +Unless `--text` is supplied binary files without a textconv filter
    +will be ignored.  This was not the case before Git version 2.21.
    ++
    +With `--text`, instead of patch lines we <some example similar to the
    +above diff showing what we actually do for binary files. [[TODO: How
    +does that work? Could just link to the "diffcore-pickaxe: For
    +Detecting Addition/Deletion of Specified String" section in
    +gitdiffcore(7) which could explain it]]

     --find-object=<object-id>::
     	Look for differences that change the number of occurrences of
    diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
    index c0a60f3158..26880b4149 100644
    --- a/Documentation/gitdiffcore.txt
    +++ b/Documentation/gitdiffcore.txt
    @@ -251,6 +251,10 @@ criterion in a changeset, the entire changeset is kept.  This behavior
     is designed to make reviewing changes in the context of the whole
     changeset easier.

    +Both `-S` and `-G` will ignore binary files without a textconv filter
    +by default; this can be overridden with `--text`. With `--text` the
    +binary patch we look through is generated as [[TODO: ???]].
    +
     diffcore-order: For Sorting the Output Based on Filenames
     ---------------------------------------------------------

But as you can see given the TODO comments I don't know how this works
exactly. I *could* dig, but that's my main outstanding problem with this
patch, the commit message / docs aren't being updated to reflect the new
behavior.

I.e. let's leave the docs in some state where the reader can know what
to expect with -G and these binary diffs we've been implicitly
supporting as unambiguously as with the textual diffs. Ideally with some
examples of how to generate them (re my question about the base85 output
in v1).

Part of that's obviously behavior we've had all along, but it's much
more convincing to say:

    We are changing X which we've done for ages, it works exactly like
    this, and here's a switch to get it back.

Instead of:

    X doesn't make sense, let's turn it off.

Also the diffcore docs already say stuff about how slow/fast things are,
and in a side-thread you said:

    My main motivation is to speed up "log -G" as that takes a
    considerable amount of time when it wades through MBs of binary
    files which change often.

Makes sense, but then let's say something about that in that section of
the docs.

>  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
>  that match their respective criterion are kept in the output.  When
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..4cea086f80 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  	if (textconv_one == textconv_two && diff_unmodified_pair(p))
>  		return 0;
>
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    !o->flags.text &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;
> +
>  	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
>  	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
>
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..5c3e2a16b2 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>
> +test_expect_success 'log -G ignores binary files' '
> +	git checkout --orphan orphan1 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Ga >result &&
> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with -a' '
> +	git checkout --orphan orphan2 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -a -Ga >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'

A large part of the questions I have above, and that future readers
would presumably have, would be answered if these tests used more
realistic test data, i.e. data that also contains \n, to show whether
-G is also line-based in this binary case.
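
A minimal sketch of how that could be probed, assuming a Git that already includes this change (the thread expects it in 2.21) and a throwaway repository. The file contents, commit messages and identity below are made up for illustration. A text line is appended to a file that stays binary because of its NUL byte, and the regex is anchored so that a hit can only come from that line being matched on its own:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# First commit: Git classifies the file as binary because of the NUL byte.
printf 'a\0a\n' >data.bin
git add data.bin
git commit -q -m "create binary file"

# Second commit: append an ordinary text line to the same binary file.
printf 'b\n' >>data.bin
git commit -q -m "append text line" data.bin

# Without --text the binary file is skipped; with --text the added
# line "b" is matched on its own by the anchored regex.
without_text=$(git log --format=%s -G'^b$')
with_text=$(git log --format=%s --text -G'^b$')
printf 'without --text: [%s]\n' "$without_text"
printf 'with --text:    [%s]\n' "$with_text"
```

Only the second commit should be reported with `--text`, which suggests that -G still works on added and removed lines when the file is binary.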

> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	git checkout --orphan orphan3 &&
> +	echo "* diff=bin" > .gitattributes &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -Ga >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> +	git checkout --orphan orphan4 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Sa >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done

These tests have way too much repeated boilerplate for no reason. This
could just be (as-is, without the better test data suggested above):

diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..23ed6cc4b1 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,34 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '

+test_expect_success 'setup log -[GS] binary & --text' '
+	git checkout --orphan GS-binary-and-text &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log >full-log
+'
+
+test_expect_success 'log -G ignores binary files' '
+	git log -Ga >result &&
+	test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with -a' '
+	git log -a -Ga >actual &&
+	test_cmp actual full-log
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	echo "* diff=bin" >.gitattributes &&
+	git -c diff.bin.textconv=cat log -Ga >actual &&
+	test_cmp actual full-log
+'
+
+test_expect_success 'log -S looks into binary files' '
+	>.gitattributes &&
+	git log -Sa >actual &&
+	test_cmp actual full-log
+'
+
 test_done
Junio C Hamano Nov. 29, 2018, 7:10 a.m. UTC | #2
Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:

> Subject: Re: [PATCH v2] log -G: Ignore binary files

s/Ig/ig/; (will locally munge--this alone is no reason to reroll).

The code changes looked sensible.

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..5c3e2a16b2 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>  
> +test_expect_success 'log -G ignores binary files' '
> +	git checkout --orphan orphan1 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Ga >result &&
> +	test_must_be_empty result
> +'

As this is the first mention of data.bin, this is adding a new file
data.bin that has two 'a' but is a binary file.  And that is the
only commit in the history leading to orphan1.

The fact that "log -Ga" won't find any means it missed the creation
event, because the blob is binary.  Good.

> +test_expect_success 'log -G looks into binary files with -a' '
> +	git checkout --orphan orphan2 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&

This starts from the state left by the previous test piece, i.e. we
have a binary data.bin file with two 'a' in it.  We pretend to
modify and add, but these two steps are no-ops if the previous test
succeeded; even if it failed, we still get what we want in the
data.bin file.  And then we make an initial commit the same way.

> +	git log -a -Ga >actual &&
> +	git log >expected &&

And we ran the same test but this time with "-a" to tell Git that
binary-ness should not matter.  It will find the sole commit.  Good.

> +	test_cmp actual expected
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	git checkout --orphan orphan3 &&
> +	echo "* diff=bin" > .gitattributes &&

s/> />/; (will locally munge--this alone is no reason to reroll).

> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -Ga >actual &&

This exposes a slight iffy-ness in the design.  The textconv filter
used here does not strip the "binary-ness" from the payload, but it
is enough to tell the machinery that -G should look into the
difference.  Is that really desirable, though?

IOW, if this weren't the initial commit (which is handled by the
codepath to special-case creation and deletion in diff_grep()
function), would "log -Ga" show it without "-a"?  Should it?

I think this test piece (and probably the previous ones for "-a" vs
"no -a" without textconv, as well) should be using a history with
three commits, where

    - the root commit introduces "a\0a" to data.bin (creation event)

    - the second commit adds another instance of "a\0a" to data.bin
      (forces comparison)

    - the third commit removes data.bin (deletion event)

and make sure that the three are treated identically.  If "log -Ga"
finds one (with the combination of other conditions like use of
textconv or -a option), it should find all three, and vice versa.
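
The three-commit history described above could be sketched as follows. This assumes a Git that already contains this patch and uses a scratch repository. The commit messages, the identity and the `count` helper are hypothetical and not taken from the patch:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# Creation event.
printf 'a\0a' >data.bin
git add data.bin
git commit -q -m "create binary file"

# Modification: a second instance forces a real comparison.
printf 'a\0a' >>data.bin
git commit -q -m "modify binary file" data.bin

# Deletion event.
git rm -q data.bin
git commit -q -m "delete binary file"

# Hypothetical helper: number of commits a given "git log" query reports.
count() {
	git "$@" --format=%s | wc -l | tr -d ' '
}

plain=$(count log -Ga)        # binary file ignored
with_a=$(count log -a -Ga)    # -a (--text) looks into it
with_s=$(count log -Sa)       # -S keeps searching binary files

# A textconv filter also makes -G look into the file.
echo "* diff=bin" >.gitattributes
with_textconv=$(count -c diff.bin.textconv=cat log -Ga)

echo "plain=$plain -a=$with_a -S=$with_s textconv=$with_textconv"
```

Each query should report either none or all three of the commits, so creation, modification and deletion are treated alike.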

> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> +	git checkout --orphan orphan4 &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -Sa >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'

Likewise.  This would also benefit from a three-commit history.

Perhaps you can create such a history at the beginning of these
additions as another "setup -G/-S binary test" step and test
different variations in subsequent tests without the setup?

>  test_done
Junio C Hamano Nov. 29, 2018, 7:22 a.m. UTC | #3
Junio C Hamano <gitster@pobox.com> writes:

>> +test_expect_success 'log -G ignores binary files' '
>> +	git checkout --orphan orphan1 &&
>> +	printf "a\0a" >data.bin &&
>> +	git add data.bin &&
>> +	git commit -m "message" &&
>> +	git log -Ga >result &&
>> +	test_must_be_empty result
>> +'
>
> As this is the first mention of data.bin, this is adding a new file
> data.bin that has two 'a' but is a binary file.  And that is the
> only commit in the history leading to orphan1.
>
> The fact that "log -Ga" won't find any means it missed the creation
> event, because the blob is binary.  Good.

By the way, this root commit records another file whose path is
"file" and has "Picked<LF>" in it.  If the file had 'a' in it, it
would have been included in "git log" output, but that is too subtle
a point to be noticed by the readers who are only reading this patch
without seeing what has been done to the index before this test
piece.

If you are going to restructure these tests to create a three-commit
history in a single expect_success that is inspected with various
"log -Ga" invocations in subsequent tests, it is worth removing that
other file (or rather, starting with "read-tree --empty" immediately
after checking out the orphan branch, to clarify to the readers that
there is nothing but what you add in the set-up step in the index)
to make the test more robust.
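
The effect of "read-tree --empty" can be seen in a small sketch. It uses a scratch repository in which a single tracked file stands in for the state left by the earlier tests. The branch name is taken from the patch; everything else is made up for illustration:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# Stand-in for the state left behind by earlier tests: one tracked text file.
printf 'Picked\n' >file
git add file
git commit -q -m "earlier test state"

# --orphan starts a new history but leaves the index populated.
git checkout -q --orphan orphan1
before=$(git ls-files)

# Emptying the index means the next commit records only what is added afterwards.
git read-tree --empty
after=$(git ls-files)

printf 'index after checkout --orphan: [%s]\n' "$before"
printf 'index after read-tree --empty: [%s]\n' "$after"
```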
Thomas Braun Dec. 14, 2018, 6:44 p.m. UTC | #4
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on November 28, 2018 at 13:54:
> 
> 
> 
> On Wed, Nov 28 2018, Thomas Braun wrote:
> 
> Looks much better this time around.

Thanks.
 
> > The -G<regex> option of log looks for the differences whose patch text
> > contains added/removed lines that match regex.
> >
> > As the concept of patch text only makes sense for text files, we need to
> > ignore binary files when searching with -G <regex> as well.
> >
> > The -S<block of text> option of log looks for differences that change
> > the number of occurrences of the specified block of text (i.e.
> > addition/deletion) in a file. As we want to keep the current behaviour,
> > add a test to ensure it.
> > [...]
> > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> > index c0a60f3158..059ddd3431 100644
> > --- a/Documentation/gitdiffcore.txt
> > +++ b/Documentation/gitdiffcore.txt
> > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
> >  regular expression.  This means that it will detect in-file (or what
> >  rename-detection considers the same file) moves, which is noise.  The
> >  implementation runs diff twice and greps, and this can be quite
> > -expensive.
> > +expensive.  Binary files without textconv filter are ignored.
> 
> Now that we support --text that should be documented. I tried to come up
> with something on top:
> 
>     diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
>     index 0378cd574e..42ae65fb57 100644
>     --- a/Documentation/diff-options.txt
>     +++ b/Documentation/diff-options.txt
>     @@ -524,6 +524,10 @@ struct), and want to know the history of that block since it first
>      came into being: use the feature iteratively to feed the interesting
>      block in the preimage back into `-S`, and keep going until you get the
>      very first version of the block.
>     ++
>     +Unlike `-G` the `-S` option will always search through binary files
>     +without a textconv filter. [[TODO: Don't we want to support --no-text
>     +then as an optimization?]].
> 
>      -G<regex>::
>      	Look for differences whose patch text contains added/removed
>     @@ -545,6 +549,15 @@ occurrences of that string did not change).
>      +
>      See the 'pickaxe' entry in linkgit:gitdiffcore[7] for more
>      information.
>     ++
>     +Unless `--text` is supplied binary files without a textconv filter
>     +will be ignored.  This was not the case before Git version 2.21.
>     ++
>     +With `--text`, instead of patch lines we <some example similar to the
>     +above diff showing what we actually do for binary files. [[TODO: How
>     +does that work? Could just link to the "diffcore-pickaxe: For
>     +Detecting Addition/Deletion of Specified String" section in
>     +gitdiffcore(7) which could explain it]]
> 
>      --find-object=<object-id>::
>      	Look for differences that change the number of occurrences of
>     diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
>     index c0a60f3158..26880b4149 100644
>     --- a/Documentation/gitdiffcore.txt
>     +++ b/Documentation/gitdiffcore.txt
>     @@ -251,6 +251,10 @@ criterion in a changeset, the entire changeset is kept.  This behavior
>      is designed to make reviewing changes in the context of the whole
>      changeset easier.
> 
>     +Both `-S` and `-G` will ignore binary files without a textconv filter
>     +by default; this can be overridden with `--text`. With `--text` the
>     +binary patch we look through is generated as [[TODO: ???]].
>     +
>      diffcore-order: For Sorting the Output Based on Filenames
>      ---------------------------------------------------------
> 
> But as you can see given the TODO comments I don't know how this works
> exactly. I *could* dig, but that's my main outstanding problem with this
> patch, the commit message / docs aren't being updated to reflect the new
> behavior.

v3 will have some more documentation, which took inspiration from your sketches here.
I've not included a reference to Git version 2.21, in which this patch will hopefully
land, as that does not seem to be common in the documentation.

I see tweaking the behaviour of -S as outside the scope of this patch series.
 
> I.e. let's leave the docs in some state where the reader can know what
> to expect with -G and these binary diffs we've been implicitly
> supporting as unambiguously as with the textual diffs. Ideally with some
> examples of how to generate them (re my question about the base85 output
> in v1).
> 
> Part of that's obviously behavior we've had all along, but it's much
> more convincing to say:
> 
>     We are changing X which we've done for ages, it works exactly like
>     this, and here's a switch to get it back.
> 
> Instead of:
> 
>     X doesn't make sense, let's turn it off.
> 
> Also the diffcore docs already say stuff about how slow/fast things are,
> and in a side-thread you said:
> 
>     My main motivation is to speed up "log -G" as that takes a
>     considerable amount of time when it wades through MBs of binary
>     files which change often.
> 
> Makes sense, but then let's say something about that in that section of
> the docs.

Done.

> >  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
> >  that match their respective criterion are kept in the output.  When
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..4cea086f80 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -154,6 +154,12 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
> >  	if (textconv_one == textconv_two && diff_unmodified_pair(p))
> >  		return 0;
> >
> > +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +	    !o->flags.text &&
> > +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> > +		return 0;
> > +
> >  	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
> >  	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
> >
> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..5c3e2a16b2 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> >  	rm .gitattributes
> >  '
> >
> > +test_expect_success 'log -G ignores binary files' '
> > +	git checkout --orphan orphan1 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Ga >result &&
> > +	test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with -a' '
> > +	git checkout --orphan orphan2 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -a -Ga >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> 
> A large part of the question(s) I have above & future readers would
> presumably have would be answered by these tests using more realistic
> test data. I.e. also with \n in there to see whether -G is also
> line-based in this binary case.
> 
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	git checkout --orphan orphan3 &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -Ga >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> > +test_expect_success 'log -S looks into binary files' '
> > +	git checkout --orphan orphan4 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Sa >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done

Done.

> These tests have way too much repeated boilerplate for no reason. This
> could just be (as-is, without the better test data suggested above):
> 
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..23ed6cc4b1 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,34 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
> 
> +test_expect_success 'setup log -[GS] binary & --text' '
> +	git checkout --orphan GS-binary-and-text &&
> +	printf "a\0a" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log >full-log
> +'
> +
> +test_expect_success 'log -G ignores binary files' '
> +	git log -Ga >result &&
> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with -a' '
> +	git log -a -Ga >actual &&
> +	test_cmp actual full-log
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	echo "* diff=bin" >.gitattributes &&
> +	git -c diff.bin.textconv=cat log -Ga >actual &&
> +	test_cmp actual full-log
> +'
> +
> +test_expect_success 'log -S looks into binary files' '
> +	>.gitattributes &&
> +	git log -Sa >actual &&
> +	test_cmp actual full-log
> +'
> +
>  test_done

Thanks for the pointer. This is resolved in v3 as well. I'm not used to test cases which
depend on each other, but you are totally right.

Thanks for the review.
Thomas Braun Dec. 14, 2018, 6:45 p.m. UTC | #5
> Junio C Hamano <gitster@pobox.com> wrote on November 29, 2018 at 08:10:
> 
> 
> Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:
> 
> > Subject: Re: [PATCH v2] log -G: Ignore binary files
> 
> s/Ig/ig/; (will locally munge--this alone is no reason to reroll).

Done.
 
> The code changes looked sensible.

Thanks.

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..5c3e2a16b2 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,44 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> >  	rm .gitattributes
> >  '
> >  
> > +test_expect_success 'log -G ignores binary files' '
> > +	git checkout --orphan orphan1 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Ga >result &&
> > +	test_must_be_empty result
> > +'
> 
> As this is the first mention of data.bin, this is adding a new file
> data.bin that has two 'a' but is a binary file.  And that is the
> only commit in the history leading to orphan1.
> 
> The fact that "log -Ga" won't find any means it missed the creation
> event, because the blob is binary.  Good.
> 
> > +test_expect_success 'log -G looks into binary files with -a' '
> > +	git checkout --orphan orphan2 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> 
> > This starts from the state left by the previous test piece, i.e. we
> > have a binary data.bin file with two 'a' in it.  We pretend to
> > modify and add, but these two steps are no-ops if the previous test
> > succeeded; even if it failed, we still get what we want in the
> > data.bin file.  And then we make an initial commit the same way.
> 
> > +	git log -a -Ga >actual &&
> > +	git log >expected &&
> 
> And we ran the same test but this time with "-a" to tell Git that
> binary-ness should not matter.  It will find the sole commit.  Good.
> 
> > +	test_cmp actual expected
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	git checkout --orphan orphan3 &&
> > +	echo "* diff=bin" > .gitattributes &&
> 
> s/> />/; (will locally munge--this alone is no reason to reroll).

Done.

> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -Ga >actual &&
> 
> This exposes a slight iffy-ness in the design.  The textconv filter
> used here does not strip the "binary-ness" from the payload, but it
> is enough to tell the machinery that -G should look into the
> difference.  Is that really desirable, though?
> 
> IOW, if this weren't the initial commit (which is handled by the
> codepath to special-case creation and deletion in diff_grep()
> function), would "log -Ga" show it without "-a"?  Should it?

Yes. Without "-a" and with cat as the textconv filter, "log -Ga" finds all three
commits (creation, modification, deletion) that are present in v3.

I can make that more explicit with a textconv filter which removes the binary-ness:

git -c diff.bin.textconv="sed -e \"s/\x00//g\"" log -Ga >log &&

(diff.bin.textconv="cat -v" works here as well but seems non-portable)

Now we could also search for "aa", as the NUL separating them is gone, but that
might be getting too clever. What do you think?
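
What such a NUL-stripping filter does to the test payload can be checked outside Git. The sketch below swaps in `tr` for the `sed` expression, on the assumption that not every `sed` understands the `\x00` escape. The `strip_nul` helper is hypothetical and not part of the patch:

```shell
# Hypothetical stand-in for the textconv filter: drop every NUL byte from stdin.
strip_nul() {
	tr -d '\000'
}

# The test payload "a\0a" after filtering: the two 'a' bytes become adjacent.
converted=$(printf 'a\0a' | strip_nul)
printf '%s\n' "$converted"
```

With the NUL removed, the converted text contains the string "aa", which is why a search for "aa" would match after such a filter.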

> I think this test piece (and probably the previous ones for "-a" vs
> "no -a" without textconv, as well) should be using a history with
> three commits, where
> 
>     - the root commit introduces "a\0a" to data.bin (creation event)
> 
>     - the second commit adds another instance of "a\0a" to data.bin
>       (forces comparison)
> 
>     - the third commit removes data.bin (deletion event)
> 
> and make sure that the three are treated identically.  If "log -Ga"
> finds one (with the combination of other conditions like use of
> textconv or -a option), it should find all three, and vice versa.

Good point. I've added that.

> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> > +test_expect_success 'log -S looks into binary files' '
> > +	git checkout --orphan orphan4 &&
> > +	printf "a\0a" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -Sa >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> 
> Likewise.  This would also benefit from a three-commit history.
> 
> Perhaps you can create such a history at the beginning of these
> additions as another "setup -G/-S binary test" step and test
> different variations in subsequent tests without the setup?

Done.
Thomas Braun Dec. 14, 2018, 6:45 p.m. UTC | #6
> Junio C Hamano <gitster@pobox.com> wrote on November 29, 2018 at 08:22:
> 
> 
> Junio C Hamano <gitster@pobox.com> writes:
> 
> >> +test_expect_success 'log -G ignores binary files' '
> >> +	git checkout --orphan orphan1 &&
> >> +	printf "a\0a" >data.bin &&
> >> +	git add data.bin &&
> >> +	git commit -m "message" &&
> >> +	git log -Ga >result &&
> >> +	test_must_be_empty result
> >> +'
> >
> > As this is the first mention of data.bin, this is adding a new file
> > data.bin that has two 'a' but is a binary file.  And that is the
> > only commit in the history leading to orphan1.
> >
> > The fact that "log -Ga" won't find any means it missed the creation
> > event, because the blob is binary.  Good.
> 
> By the way, this root commit records another file whose path is
> "file" and has "Picked<LF>" in it.  If the file had 'a' in it, it
> would have been included in "git log" output, but that is too subtle
> a point to be noticed by the readers who are only reading this patch
> without seeing what has been done to the index before this test
> piece.
> 
> If you are going to restructure these tests to create a three-commit
> history in a single expect_success that is inspected with various
> "log -Ga" invocations in subsequent tests, it is worth removing that
> other file (or rather, starting with "read-tree --empty" immediately
> after checking out the orphan branch, to clarify to the readers that
> there is nothing but what you add in the set-up step in the index)
> to make the test more robust.

Thanks for the explanation. At first I thought that "checkout --orphan"
already takes care of everything, but "read-tree --empty" is the way to go.

Done.
Patch

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..059ddd3431 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -242,7 +242,7 @@  textual diff has an added or a deleted line that matches the given
 regular expression.  This means that it will detect in-file (or what
 rename-detection considers the same file) moves, which is noise.  The
 implementation runs diff twice and greps, and this can be quite
-expensive.
+expensive.  Binary files without textconv filter are ignored.
 
 When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
 that match their respective criterion are kept in the output.  When
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69fc55ea1e..4cea086f80 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -154,6 +154,12 @@  static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
 	if (textconv_one == textconv_two && diff_unmodified_pair(p))
 		return 0;
 
+	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
+	    !o->flags.text &&
+	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
+	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
+		return 0;
+
 	mf1.size = fill_textconv(o->repo, textconv_one, p->one, &mf1.ptr);
 	mf2.size = fill_textconv(o->repo, textconv_two, p->two, &mf2.ptr);
 
diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..5c3e2a16b2 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,44 @@  test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '
 
+test_expect_success 'log -G ignores binary files' '
+	git checkout --orphan orphan1 &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -Ga >result &&
+	test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with -a' '
+	git checkout --orphan orphan2 &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -a -Ga >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	git checkout --orphan orphan3 &&
+	echo "* diff=bin" > .gitattributes &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git -c diff.bin.textconv=cat log -Ga >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
+test_expect_success 'log -S looks into binary files' '
+	git checkout --orphan orphan4 &&
+	printf "a\0a" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -Sa >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
 test_done