[v1,1/2] log -G: Ignore binary files

Message ID 590f2ca6b5323c17365a1645b5d10e9ab30623c4.1542833244.git.thomas.braun@virtuell-zuhause.de (mailing list archive)
State New, archived
Series Teach log -G to ignore binary files

Commit Message

Thomas Braun Nov. 21, 2018, 8:52 p.m. UTC
The -G <regex> option of log looks for the differences whose patch text
contains added/removed lines that match regex.

The concept of differences only makes sense for text files, therefore
we need to ignore binary files when searching with -G <regex> as well.

Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
---
 Documentation/gitdiffcore.txt |  2 +-
 diffcore-pickaxe.c            |  5 +++++
 t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Nov. 22, 2018, 1:29 a.m. UTC | #1
Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:

> The -G <regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G <regex> as well.
>
> Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> ---
>  Documentation/gitdiffcore.txt |  2 +-
>  diffcore-pickaxe.c            |  5 +++++
>  t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
>  3 files changed, 28 insertions(+), 1 deletion(-)

OK.

> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.

OK.

> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  		textconv_two = get_textconv(o->repo->index, p->two);
>  	}
>  
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;
> +
>  	/*
>  	 * If we have an unmodified pair, we know that the count will be the
>  	 * same and don't even have to load the blobs. Unless textconv is in

Shouldn't this new test come after the existing optimization, which
allows us to leave without loading the blob contents (which is
needed once you call diff_filespec_is_binary())?
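
A minimal sketch of the ordering suggested here, using hypothetical stand-in types rather than git's real `struct diff_filepair`. All names below are made up for illustration: `mock_is_binary()` only counts how often blob contents would have to be inspected, the unmodified-pair shortcut is modeled as a plain id comparison, and textconv is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; git's real structures are much richer. */
struct mock_filespec {
	int object_id;   /* equal ids on both sides: unmodified pair */
	int is_binary;   /* what a content inspection would report */
	int loads;       /* how many times the content was inspected */
};

struct mock_filepair {
	struct mock_filespec *one, *two;
};

/* Models a binary check that needs the blob contents to answer. */
int mock_is_binary(struct mock_filespec *spec)
{
	spec->loads++;
	return spec->is_binary;
}

/*
 * Returns 1 when the pair can be skipped without running the diff.
 * The cheap unmodified-pair test runs first, so identical pairs never
 * reach the binary check that has to look at the contents.
 */
int skip_pair_sketch(struct mock_filepair *p)
{
	assert(p != NULL && p->one != NULL && p->two != NULL);

	if (p->one->object_id == p->two->object_id)
		return 1; /* unmodified: nothing was loaded */

	if (mock_is_binary(p->one) || mock_is_binary(p->two))
		return 1; /* binary content: the proposed patch ignores it */

	return 0;
}
```

With the checks in this order, an unmodified pair leaves both `loads` counters at zero, which is the saving the existing optimization is meant to preserve.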

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>  
> +test_expect_success 'log -G ignores binary files' '
> +	rm -rf .git &&
> +	git init &&

Please never never ever do the above two unless you are writing a
test that checks low-level repository details.

If you want a clean history that has specific lineage of commits
without getting affected by commits that have been made by the
previous test pieces, it is OK to "checkout --orphan" to create an
empty history to work with.

> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -G a >result &&
> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	rm -rf .git &&
> +	git init &&
> +	echo "* diff=bin" > .gitattributes &&
> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -G a >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done
Ævar Arnfjörð Bjarmason Nov. 22, 2018, 10:16 a.m. UTC | #2
On Wed, Nov 21 2018, Thomas Braun wrote:

> The -G <regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G <regex> as well.
>
> Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> ---
>  Documentation/gitdiffcore.txt |  2 +-
>  diffcore-pickaxe.c            |  5 +++++
>  t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
>  3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c0a60f3158..059ddd3431 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
>  regular expression.  This means that it will detect in-file (or what
>  rename-detection considers the same file) moves, which is noise.  The
>  implementation runs diff twice and greps, and this can be quite
> -expensive.
> +expensive.  Binary files without textconv filter are ignored.
>
>  When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
>  that match their respective criterion are kept in the output.  When
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  		textconv_two = get_textconv(o->repo->index, p->two);
>  	}
>
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;
> +
>  	/*
>  	 * If we have an unmodified pair, we know that the count will be the
>  	 * same and don't even have to load the blobs. Unless textconv is in
> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
>  	rm .gitattributes
>  '
>
> +test_expect_success 'log -G ignores binary files' '
> +	rm -rf .git &&
> +	git init &&
> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git log -G a >result &&

Would be less confusing as "-Ga" since that's the invocation we
document, even though I see (but wasn't aware that...) "-G a" works too.

> +	test_must_be_empty result
> +'
> +
> +test_expect_success 'log -G looks into binary files with textconv filter' '
> +	rm -rf .git &&
> +	git init &&
> +	echo "* diff=bin" > .gitattributes &&
> +	printf "a\0b" >data.bin &&
> +	git add data.bin &&
> +	git commit -m "message" &&
> +	git -c diff.bin.textconv=cat log -G a >actual &&
> +	git log >expected &&
> +	test_cmp actual expected
> +'
> +
>  test_done

This patch seems like the wrong direction to me. In particular the
assertion that "the concept of differences only makes sense for text
files". That's just not true. This patch breaks this:

    (
        rm -rf /tmp/g-test &&
        git init /tmp/g-test &&
        cd /tmp/g-test &&
        for i in {1..10}; do
            echo "Always matching thensome 5" >file &&
            printf "a thensome %d binary \0" $i >>file &&
            git add file &&
            git commit -m"Bump $i"
        done &&
        git log -Gthensome.*5
    )

Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
[156]". The 1st one because it introduces the "Always matching thensome
5". Then 5/6 because they add/remove the string "a thensome 5 binary",
respectively. Which matches /thensome.*5/.

I.e. in the first one we do a regex match against the content here
because we don't have both sides:
https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53

And then for the later ones where we have both sides we end up in
diffgrep_consume():
https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36
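
The second path can be pictured roughly as follows. This is a simplified sketch, not a copy of git's `diffgrep_consume()`: it uses POSIX `regcomp`/`regexec` on NUL-terminated lines, and the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical callback state: the compiled pattern and a sticky hit flag. */
struct grep_state_sketch {
	regex_t *regexp;
	int hit;
};

/*
 * Called once per line of a textual diff. Only added ('+') and removed
 * ('-') lines are searched; context lines and hunk headers are skipped.
 */
void consume_line_sketch(struct grep_state_sketch *state, const char *line)
{
	assert(state != NULL && line != NULL);

	if (line[0] != '+' && line[0] != '-')
		return;
	if (state->hit)
		return; /* already matched, nothing more to learn */
	/* Skip the leading '+' or '-' marker before matching. */
	state->hit = (regexec(state->regexp, line + 1, 0, NULL, 0) == 0);
}

/* Feeds diff lines to the callback; returns 1 if any +/- line matched. */
int diff_lines_match_sketch(const char *pattern, const char **lines, size_t n)
{
	regex_t re;
	struct grep_state_sketch state = { &re, 0 };
	size_t i;

	if (regcomp(&re, pattern, REG_NOSUB))
		return 0; /* invalid pattern: treat as no match */
	for (i = 0; i < n; i++)
		consume_line_sketch(&state, lines[i]);
	regfree(&re);
	return state.hit;
}
```

In this model, an unchanged context line containing "thensome 5" does not count as a hit. Only the added or removed lines decide the result.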

I think there may be a real issue here to address, which might be some
combination of:

 a) Even though the diffcore can do a binary diff internally, this is
    not what it exposes with "-p", we just say "Binary files differ".

    I don't know how to emit the raw version we'll end up passing to
    diffgrep_consume() in this case. Is it just --binary without the
    encoding? I don't know...

 b) Your test case shows that you're matching a string at a \0
    boundary. Is this perhaps something you ran into? I.e. that we don't
    have some -F version of -G so we can't supply regexes that match
    past a \0? I had some related work on grep for this that hasn't been
    carried over to the diffcore:

        git log --grep='grep:.*\\0' --author=Ævar

 c) Is this binary diff we end up matching against just bad in some
    cases? I haven't dug but that wouldn't surprise me, i.e. that it's
    trying to be line-based so we'll overmatch in many cases.

So maybe this is something that should be passed down as a flag? See a
recent discussion at
https://public-inbox.org/git/87lg77cmr1.fsf@evledraar.gmail.com/ for how
that could be done.

Also if we don't have some tests already that were failing with this
patch we really should have those as "let's test the current behavior
first". Unfortunately tests in this area are really lacking, see
e.g. my:

    git log --author=Junio --min-parents=2 --grep=ab/.*grep

For some series of patches to grep where to get one patch in I needed to
often lead with 5-10 test patches to convince reviewers that I knew what
I was changing, and also to be comfortable that I'd covered all the edge
cases we currently supported, but weren't testing for.
Jeff King Nov. 22, 2018, 4:20 p.m. UTC | #3
On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote:

> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 69fc55ea1e..8c2558b07d 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
>  		textconv_two = get_textconv(o->repo->index, p->two);
>  	}
>  
> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> +		return 0;

If the user passes "-a" to treat binary files as text, we should
probably skip the binary check. I think we'd need to check
"o->flags.text" here.
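
As a rough sketch of the resulting condition for one side of the pair: the struct below is a hypothetical, flattened view of the inputs and is not git's `struct diff_options`. Only the idea of consulting the text flag before the binary check comes from the suggestion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of the inputs to the check under review. */
struct binary_check_sketch {
	int pickaxe_kind_g;  /* -G was given */
	int text_flag;       /* -a / --text was given */
	int has_textconv;    /* a textconv filter applies to this side */
	int is_binary;       /* content looks binary */
};

/* Returns 1 when this side should make -G skip the pair. */
int side_is_ignored_sketch(const struct binary_check_sketch *side)
{
	assert(side != NULL);

	if (!side->pickaxe_kind_g)
		return 0; /* only -G is affected */
	if (side->text_flag)
		return 0; /* -a: the user asked for binary to be treated as text */
	if (side->has_textconv)
		return 0; /* a textconv filter supplies text to search */
	return side->is_binary;
}
```

Checking the text flag and textconv first also means the binary test, which may need the blob contents, is only reached when neither escape hatch applies.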

> diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> index 844df760f7..42cc8afd8b 100755
> --- a/t/t4209-log-pickaxe.sh
> +++ b/t/t4209-log-pickaxe.sh
> @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> [...]
> +test_expect_success 'log -G ignores binary files' '
> [...]
> +test_expect_success 'log -G looks into binary files with textconv filter' '

And likewise add a test here similar to the textconv one.

-Peff
Jeff King Nov. 22, 2018, 4:27 p.m. UTC | #4
On Thu, Nov 22, 2018 at 11:16:38AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -G a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:

But "-G" is defined as "look for differences whose patch text contains
added/removed lines that match <regex>". We don't have patch text here,
let alone added/removed lines.

For binary files, "-Sfoo" is better defined. I think we _could_ define
"search for <pattern> in the added/removed bytes of a binary file".  But
I don't think that's what the current code does (it really does a line
diff on a binary file, which is likely to put tons of unchanged crap
into the "added and removed" lines, because the line divisions aren't
meaningful in the first place).
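
One rough way to see the overmatching, assuming content is split into lines only at '\n' bytes. The helper below is hypothetical and is not xdiff or git code. It measures how large the "line" around a single changed byte is.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the length of the newline-delimited "line" that contains the
 * byte at offset pos. A line-based diff reports that whole line as
 * removed and re-added when any single byte in it changes.
 */
size_t changed_line_length_sketch(const char *buf, size_t len, size_t pos)
{
	size_t start = pos, end = pos;

	assert(buf != NULL && pos < len);

	/* Walk back to just after the previous '\n' (or the buffer start). */
	while (start > 0 && buf[start - 1] != '\n')
		start--;
	/* Walk forward to the next '\n' (or the buffer end). */
	while (end < len && buf[end] != '\n')
		end++;
	return end - start;
}
```

For ordinary text the changed line is a few bytes long. For a buffer with no newline at all, the entire buffer counts as one changed line, so a regex can match bytes that never changed.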

>     (
>         rm -rf /tmp/g-test &&
>         git init /tmp/g-test &&
>         cd /tmp/g-test &&
>         for i in {1..10}; do
>             echo "Always matching thensome 5" >file &&
>             printf "a thensome %d binary \0" $i >>file &&
>             git add file &&
>             git commit -m"Bump $i"
>         done &&
>         git log -Gthensome.*5
>     )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because they add/remove the string "a thensome 5 binary",
> respectively. Which matches /thensome.*5/.

Right, this will sometimes do the right thing. But it will also often do
the wrong thing. It's also very expensive (we specifically avoid feeding
large binary files to xdiff, but I think "-G" will happily do so -- I
didn't double check, though).

-Peff
Junio C Hamano Nov. 24, 2018, 2:32 a.m. UTC | #5
Jeff King <peff@peff.net> writes:

>> +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
>> +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
>> +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
>> +		return 0;
>
> If the user passes "-a" to treat binary files as text, we should
> probably skip the binary check. I think we'd need to check
> "o->flags.text" here.

Yeah, I forgot about that option.  It would give an escape hatch
that has a sane explanation.
Stefan Beller Nov. 26, 2018, 8:19 p.m. UTC | #6
On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
<thomas.braun@virtuell-zuhause.de> wrote:
>
> The -G <regex> option of log looks for the differences whose patch text
> contains added/removed lines that match regex.
>
> The concept of differences only makes sense for text files, therefore
> we need to ignore binary files when searching with -G <regex> as well.

What about partial text/partial binary files?

I recall using text searching tools (not necessarily git machinery,
my memory is fuzzy) to check for strings in pdf files, which are
usually marked binary in context of git, such that we do not
see their diffs in `log -p`.

But I would expect a search with -G or -S to still work...
until I find the exception in the docs, only to wonder if
there is a switch to turn off this optimisation for this
corner case.

Stefan
Junio C Hamano Nov. 27, 2018, 12:51 a.m. UTC | #7
Stefan Beller <sbeller@google.com> writes:

> On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
> <thomas.braun@virtuell-zuhause.de> wrote:
>>
>> The -G <regex> option of log looks for the differences whose patch text
>> contains added/removed lines that match regex.
>>
>> The concept of differences only makes sense for text files, therefore
>> we need to ignore binary files when searching with -G <regex> as well.
>
> What about partial text/partial binary files?

Good point. You'd use "-a" (or "--text") to tell the diff machinery
to treat the contents as text, and the new logic must pay attention
to that command line option.
Thomas Braun Nov. 28, 2018, 11:31 a.m. UTC | #8
> Junio C Hamano <gitster@pobox.com> wrote on 22 November 2018 at 02:29:
> 
> 
> Thomas Braun <thomas.braun@virtuell-zuhause.de> writes:
> 
> > The -G <regex> option of log looks for the differences whose patch text
> > contains added/removed lines that match regex.
> >
> > The concept of differences only makes sense for text files, therefore
> > we need to ignore binary files when searching with -G <regex> as well.
> >
> > Signed-off-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> > ---
> >  Documentation/gitdiffcore.txt |  2 +-
> >  diffcore-pickaxe.c            |  5 +++++
> >  t/t4209-log-pickaxe.sh        | 22 ++++++++++++++++++++++
> >  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> OK.
> 
> > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> > index c0a60f3158..059ddd3431 100644
> > --- a/Documentation/gitdiffcore.txt
> > +++ b/Documentation/gitdiffcore.txt
> > @@ -242,7 +242,7 @@ textual diff has an added or a deleted line that matches the given
> >  regular expression.  This means that it will detect in-file (or what
> >  rename-detection considers the same file) moves, which is noise.  The
> >  implementation runs diff twice and greps, and this can be quite
> > -expensive.
> > +expensive.  Binary files without textconv filter are ignored.
> 
> OK.
> 
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..8c2558b07d 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
> >  		textconv_two = get_textconv(o->repo->index, p->two);
> >  	}
> >  
> > +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> > +		return 0;
> > +
> >  	/*
> >  	 * If we have an unmodified pair, we know that the count will be the
> >  	 * same and don't even have to load the blobs. Unless textconv is in
> 
> Shouldn't this new test come after the existing optimization, which
> allows us to leave without loading the blob contents (which is
> needed once you call diff_filespec_is_binary())?

Yes, good point.

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..42cc8afd8b 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> >  	rm .gitattributes
> >  '
> >  
> > +test_expect_success 'log -G ignores binary files' '
> > +	rm -rf .git &&
> > +	git init &&
> 
> Please never never ever do the above two unless you are writing a
> test that checks low-level repository details.
> 
> If you want a clean history that has specific lineage of commits
> without getting affected by commits that have been made by the
> previous test pieces, it is OK to "checkout --orphan" to create an
> empty history to work with.

Thanks for the hint. I thought I had seen a less intrusive way of getting an empty history.
Changed.

> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -G a >result &&
> > +	test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -G a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
>
Thomas Braun Nov. 28, 2018, 11:31 a.m. UTC | #9
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on 22 November 2018 at 11:16:

[...]

> >
> > +test_expect_success 'log -G ignores binary files' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -G a >result &&
> 
> Would be less confusing as "-Ga" since that's the invocation we
> document, even though I see (but wasn't aware that...) "-G a" works too.

Done.

> > +	test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -G a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:
> 
>     (
>         rm -rf /tmp/g-test &&
>         git init /tmp/g-test &&
>         cd /tmp/g-test &&
>         for i in {1..10}; do
>             echo "Always matching thensome 5" >file &&
>             printf "a thensome %d binary \0" $i >>file &&
>             git add file &&
>             git commit -m"Bump $i"
>         done &&
>         git log -Gthensome.*5
>     )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because they add/remove the string "a thensome 5 binary",
> respectively. Which matches /thensome.*5/.

log -p does not show you the patch text in your example because the file is
treated as binary. Currently "log -G" has a different opinion about what it
looks into and what it ignores. My patch tries to bring the two more in line.
 
> I.e. in the first one we do a regex match against the content here
> because we don't have both sides:
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53
> 
> And then for the later ones where we have both sides we end up in
> diffgrep_consume():
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36
> 
> I think there may be a real issue here to address, which might be some
> combination of:
> 
>  a) Even though the diffcore can do a binary diff internally, this is
>     not what it exposes with "-p", we just say "Binary files differ".
> 
>     I don't know how to emit the raw version we'll end up passing to
>     diffgrep_consume() in this case. Is it just --binary without the
>     encoding? I don't know...
> 
>  b) Your test case shows that you're matching a string at a \0
>     boundary. Is this perhaps something you ran into? I.e. that we don't
>     have some -F version of -G so we can't supply regexes that match
>     past a \0? I had some related work on grep for this that hasn't been
>     carried over to the diffcore:
> 
>         git log --grep='grep:.*\\0' --author=Ævar
> 
>  c) Is this binary diff we end up matching against just bad in some
>     cases? I haven't dug but that wouldn't surprise me, i.e. that it's
>     trying to be line-based so we'll overmatch in many cases.
> 
> So maybe this is something that should be passed down as a flag? See a
> recent discussion at
> https://public-inbox.org/git/87lg77cmr1.fsf@evledraar.gmail.com/ for how
> that could be done.

It is not about the \0 boundary; v2 of the patches will clarify that. My main
motivation is to speed up "log -G", which takes a considerable amount of time
when it wades through MBs of frequently changing binary files. In multiple
places I can already treat binary files differently (e.g. turn off delta
compression, skip trying to diff them, no EOL normalization). For me, making
log -G ignore what git considers binary files draws a clearer line between what
should be treated as binary and what as text.

> Also if we don't have some tests already that were failing with this
> patch we really should have those as "let's test the current behavior
> first". Unfortunately tests in this area are really lacking, see
> e.g. my:
> 
>     git log --author=Junio --min-parents=2 --grep=ab/.*grep
> 
> For some series of patches to grep where to get one patch in I needed to
> often lead with 5-10 test patches to convince reviewers that I knew what
> I was changing, and also to be comfortable that I'd covered all the edge
> cases we currently supported, but weren't testing for.

I'm happy to add more test cases to convince everyone involved :)
Thomas Braun Nov. 28, 2018, 11:31 a.m. UTC | #10
> Jeff King <peff@peff.net> wrote on 22 November 2018 at 17:20:
> 
> 
> On Wed, Nov 21, 2018 at 09:52:27PM +0100, Thomas Braun wrote:
> 
> > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> > index 69fc55ea1e..8c2558b07d 100644
> > --- a/diffcore-pickaxe.c
> > +++ b/diffcore-pickaxe.c
> > @@ -144,6 +144,11 @@ static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
> >  		textconv_two = get_textconv(o->repo->index, p->two);
> >  	}
> >  
> > +	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
> > +	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
> > +	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
> > +		return 0;
> 
> If the user passes "-a" to treat binary files as text, we should
> probably skip the binary check. I think we'd need to check
> "o->flags.text" here.

Good point. I missed that flag. Added.

> > diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
> > index 844df760f7..42cc8afd8b 100755
> > --- a/t/t4209-log-pickaxe.sh
> > +++ b/t/t4209-log-pickaxe.sh
> > @@ -106,4 +106,26 @@ test_expect_success 'log -S --no-textconv (missing textconv tool)' '
> > [...]
> > +test_expect_success 'log -G ignores binary files' '
> > [...]
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> 
> And likewise add a test here similar to the textconv one.

Added as well.
Thomas Braun Nov. 28, 2018, 11:31 a.m. UTC | #11
> Junio C Hamano <gitster@pobox.com> wrote on 27 November 2018 at 01:51:
> 
> 
> Stefan Beller <sbeller@google.com> writes:
> 
> > On Wed, Nov 21, 2018 at 1:08 PM Thomas Braun
> > <thomas.braun@virtuell-zuhause.de> wrote:
> >>
> >> The -G <regex> option of log looks for the differences whose patch text
> >> contains added/removed lines that match regex.
> >>
> >> The concept of differences only makes sense for text files, therefore
> >> we need to ignore binary files when searching with -G <regex> as well.
> >
> > What about partial text/partial binary files?
> 
> Good point. You'd use "-a" (or "--text") to tell the diff machinery
> to treat the contents as text, and the new logic must pay attention
> to that command line option.

Yes exactly. Either use -a for the occasional use or a textconv filter
for permanent use.

Coming from the opposite side: I usually mark svg files as binary, as the
textual diff is, well, let's say uninspiring.
Thomas Braun Nov. 28, 2018, 11:31 a.m. UTC | #12
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on 22 November 2018 at 11:16:

[...]

> >
> > +test_expect_success 'log -G ignores binary files' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git log -G a >result &&
> 
> Would be less confusing as "-Ga" since that's the invocation we
> document, even though I see (but wasn't aware that...) "-G a" works too.

Done.

> > +	test_must_be_empty result
> > +'
> > +
> > +test_expect_success 'log -G looks into binary files with textconv filter' '
> > +	rm -rf .git &&
> > +	git init &&
> > +	echo "* diff=bin" > .gitattributes &&
> > +	printf "a\0b" >data.bin &&
> > +	git add data.bin &&
> > +	git commit -m "message" &&
> > +	git -c diff.bin.textconv=cat log -G a >actual &&
> > +	git log >expected &&
> > +	test_cmp actual expected
> > +'
> > +
> >  test_done
> 
> This patch seems like the wrong direction to me. In particular the
> assertion that "the concept of differences only makes sense for text
> files". That's just not true. This patch breaks this:
> 
>     (
>         rm -rf /tmp/g-test &&
>         git init /tmp/g-test &&
>         cd /tmp/g-test &&
>         for i in {1..10}; do
>             echo "Always matching thensome 5" >file &&
>             printf "a thensome %d binary \0" $i >>file &&
>             git add file &&
>             git commit -m"Bump $i"
>         done &&
>         git log -Gthensome.*5
>     )
> 
> Right now this will emit 3/10 patches, and the right ones! I.e. "Bump
> [156]". The 1st one because it introduces the "Always matching thensome
> 5". Then 5/6 because they add/remove the string "a thensome 5 binary",
> respectively. Which matches /thensome.*5/.

log -p does not show you the patch text in your example because the file is
treated as binary. Currently "log -G" has a different opinion about what it
looks into and what it ignores. My patch tries to bring the two more in line.
 
> I.e. in the first one we do a regex match against the content here
> because we don't have both sides:
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L48-L53
> 
> And then for the later ones where we have both sides we end up in
> diffgrep_consume():
> https://github.com/git/git/blob/v2.19.2/diffcore-pickaxe.c#L27-L36
> 
> I think there may be a real issue here to address, which might be some
> combination of:
> 
>  a) Even though the diffcore can do a binary diff internally, this is
>     not what it exposes with "-p", we just say "Binary files differ".
> 
>     I don't know how to emit the raw version we'll end up passing to
>     diffgrep_consume() in this case. Is it just --binary without the
>     encoding? I don't know...
> 
>  b) Your test case shows that you're matching a string at a \0
>     boundary. Is this perhaps something you ran into? I.e. that we don't
>     have some -F version of -G so we can't supply regexes that match
>     past a \0? I had some related work on grep for this that hasn't been
>     carried over to the diffcore:
> 
>         git log --grep='grep:.*\\0' --author=Ævar
> 
>  c) Is this binary diff we end up matching against just bad in some
>     cases? I haven't dug but that wouldn't surprise me, i.e. that it's
>     trying to be line-based so we'll overmatch in many cases.
> 
> So maybe this is something that should be passed down as a flag? See a
> recent discussion at
> https://public-inbox.org/git/87lg77cmr1.fsf@evledraar.gmail.com/ for how
> that could be done.

It is not about the \0 boundary; v2 of the patches will clarify that. My main
motivation is to speed up "log -G", which takes a considerable amount of time
when it wades through MBs of frequently changing binary files. In multiple
places I can already treat binary files differently (e.g. turn off delta
compression, skip trying to diff them, no EOL normalization). For me, making
log -G ignore what git considers binary files draws a clearer line between what
should be treated as binary and what as text.

> Also if we don't have some tests already that were failing with this
> patch we really should have those as "let's test the current behavior
> first". Unfortunately tests in this area are really lacking, see
> e.g. my:
> 
>     git log --author=Junio --min-parents=2 --grep=ab/.*grep
> 
> For some series of patches to grep where to get one patch in I needed to
> often lead with 5-10 test patches to convince reviewers that I knew what
> I was changing, and also to be comfortable that I'd covered all the edge
> cases we currently supported, but weren't testing for.

I'm happy to add more test cases to convince everyone involved :)
Patch

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c0a60f3158..059ddd3431 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -242,7 +242,7 @@  textual diff has an added or a deleted line that matches the given
 regular expression.  This means that it will detect in-file (or what
 rename-detection considers the same file) moves, which is noise.  The
 implementation runs diff twice and greps, and this can be quite
-expensive.
+expensive.  Binary files without textconv filter are ignored.
 
 When `-S` or `-G` are used without `--pickaxe-all`, only filepairs
 that match their respective criterion are kept in the output.  When
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69fc55ea1e..8c2558b07d 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -144,6 +144,11 @@  static int pickaxe_match(struct diff_filepair *p, struct diff_options *o,
 		textconv_two = get_textconv(o->repo->index, p->two);
 	}
 
+	if ((o->pickaxe_opts & DIFF_PICKAXE_KIND_G) &&
+	    ((!textconv_one && diff_filespec_is_binary(o->repo, p->one)) ||
+	     (!textconv_two && diff_filespec_is_binary(o->repo, p->two))))
+		return 0;
+
 	/*
 	 * If we have an unmodified pair, we know that the count will be the
 	 * same and don't even have to load the blobs. Unless textconv is in
diff --git a/t/t4209-log-pickaxe.sh b/t/t4209-log-pickaxe.sh
index 844df760f7..42cc8afd8b 100755
--- a/t/t4209-log-pickaxe.sh
+++ b/t/t4209-log-pickaxe.sh
@@ -106,4 +106,26 @@  test_expect_success 'log -S --no-textconv (missing textconv tool)' '
 	rm .gitattributes
 '
 
+test_expect_success 'log -G ignores binary files' '
+	rm -rf .git &&
+	git init &&
+	printf "a\0b" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git log -G a >result &&
+	test_must_be_empty result
+'
+
+test_expect_success 'log -G looks into binary files with textconv filter' '
+	rm -rf .git &&
+	git init &&
+	echo "* diff=bin" > .gitattributes &&
+	printf "a\0b" >data.bin &&
+	git add data.bin &&
+	git commit -m "message" &&
+	git -c diff.bin.textconv=cat log -G a >actual &&
+	git log >expected &&
+	test_cmp actual expected
+'
+
 test_done