[19/25] pickaxe -G: set -U0 for diff generation

Series: grep: PCREv2 fixes, remove kwset.[ch]

[00/25] grep: PCREv2 fixes, remove kwset.[ch]
[01/25] grep/pcre2 tests: reword comments referring to kwset
[02/25] grep/pcre2: drop needless assignment + assert() on opt->pcre2
[03/25] grep/pcre2: drop needless assignment to NULL
[04/25] grep/pcre2: correct reference to grep_init() in comment
[05/25] grep/pcre2: prepare to add debugging to pcre2_malloc()
[06/25] grep/pcre2: add GREP_PCRE2_DEBUG_MALLOC debug mode
[07/25] grep/pcre2: use compile-time PCREv2 version test
[08/25] grep/pcre2: use pcre2_maketables_free() function
[09/25] grep/pcre2: actually make pcre2 use custom allocator
[10/25] grep/pcre2: move back to thread-only PCREv2 structures
[11/25] grep/pcre2: move definitions of pcre2_{malloc,free}
[12/25] pickaxe tests: refactor to use test_commit --append
[13/25] pickaxe -S: support content with NULs under --pickaxe-regex
[14/25] pickaxe -S: remove redundant "sz" check in while-loop
[15/25] pickaxe/style: consolidate declarations and assignments
[16/25] pickaxe tests: add test for diffgrep_consume() internals
[17/25] pickaxe tests: add test for "log -S" not being a regex
[18/25] perf: add performance test for pickaxe
[19/25] pickaxe -G: set -U0 for diff generation
[20/25] grep.h: make patmatch() a public function
[21/25] pickaxe: use PCREv2 for -G and -S
[22/25] Remove unused kwset.[ch]
[23/25] xdiff-interface: allow early return from xdiff_emit_{line,hunk}_fn
[24/25] xdiff-interface: support early exit in xdiff_outf()
[25/25] pickaxe -G: terminate early on matching lines

Message ID: 20210203032811.14979-20-avarab@gmail.com (mailing list archive)
State: New, archived

Headers

From: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>,
 Johannes Schindelin <johannes.schindelin@gmx.de>,
 Carlo Marcelo Arenas Belón <carenas@gmail.com>,
 Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Subject: [PATCH 19/25] pickaxe -G: set -U0 for diff generation
Date: Wed,  3 Feb 2021 04:28:05 +0100
Message-Id: <20210203032811.14979-20-avarab@gmail.com>
In-Reply-To: <20210203032811.14979-1-avarab@gmail.com>
References: <20210203032811.14979-1-avarab@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk

Commit Message

Ævar Arnfjörð Bjarmason Feb. 3, 2021, 3:28 a.m. UTC

Set the equivalent of -U0 when generating diffs for "git log -G". As
seen in diffgrep_consume() we ignore any lines that aren't the "+" and
"-" lines, so the rest of the output wasn't being used.

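To illustrate why the context lines are wasted work, here is a minimal sketch of a per-line callback that only runs the regex on "+" and "-" lines. It is not the actual diffgrep_consume() from diffcore-pickaxe.c: struct grep_state, its "examined" counter and consume_line() are hypothetical names, and it assumes NUL-terminated lines and POSIX regexec().

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical state, loosely modelled on what a -G callback needs. */
struct grep_state {
	regex_t *regexp;
	int hit;      /* set to 1 once any changed line matches */
	int examined; /* number of lines the regex was actually run on */
};

/*
 * Called once per line of diff output. Context lines (leading ' ')
 * and anything else that is not an added or removed line are
 * skipped, so generating them is pure overhead for the matcher.
 */
static void consume_line(struct grep_state *st, const char *line)
{
	assert(st && line);
	if (line[0] != '+' && line[0] != '-')
		return;
	if (st->hit)
		return; /* already matched, nothing more to learn */
	st->examined++;
	/* Skip the leading '+' or '-' marker before matching. */
	if (!regexec(st->regexp, line + 1, 0, NULL, 0))
		st->hit = 1;
}
```

Feeding this callback a context line leaves both "hit" and "examined" untouched, which is the work that a context length of 0 avoids producing in the first place.
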
It turns out that we spent quite a bit of CPU just on this[1]:

    Test                                             HEAD~             HEAD
    -----------------------------------------------------------------------------------------
    4209.2: git log -G'a' <limit-rev>..              0.60(0.54+0.06)   0.52(0.46+0.05) -13.3%
    4209.8: git log -G'uncommon' <limit-rev>..       0.61(0.54+0.07)   0.53(0.47+0.06) -13.1%
    4209.14: git log -G'[þæö]' <limit-rev>..         0.60(0.55+0.04)   0.56(0.48+0.04) -6.7%
    4209.21: git log -i -G'a' <limit-rev>..          0.63(0.56+0.03)   0.54(0.48+0.05) -14.3%
    4209.27: git log -i -G'uncommon' <limit-rev>..   0.61(0.55+0.05)   0.53(0.47+0.06) -13.1%
    4209.33: git log -i -G'[þæö]' <limit-rev>..      0.61(0.53+0.07)   0.53(0.47+0.05) -13.1%
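
For reading the table: each cell is elapsed(user+sys) in seconds, and the percentage is the relative change in elapsed time against the first column. A small sketch of that arithmetic follows; relative_change_pct() is a made-up helper for illustration, not part of the perf suite.

```c
#include <assert.h>

/* Relative change of 'new_time' versus 'old_time', in percent. */
static double relative_change_pct(double old_time, double new_time)
{
	assert(old_time > 0.0);
	return 100.0 * (new_time - old_time) / old_time;
}
```

For example, 4209.2 goes from 0.60s to 0.52s, i.e. 100 * (0.52 - 0.60) / 0.60, which rounds to -13.3%.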

I also experimented with setting diff.interHunkContext to 10, 100
etc. As noted above it's useless for -G to have non-"+" and non-"-"
lines for the matching itself, but there's going to be some sweet spot
where if we can be handed bigger hunks at a time our matching might be
faster.
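
As a simplified model of how the two settings interact, assume that two neighbouring changes are emitted as one hunk when the number of unchanged lines between them is at most 2 * ctxlen + interhunkctxlen. The sketch below encodes only that assumed rule; it is not the xdiff code, and same_hunk() and its parameters are hypothetical names.

```c
#include <assert.h>

/*
 * Simplified model (an assumption, not copied from xdiff): two
 * changes separated by 'gap' unchanged lines share a hunk when
 * their context windows touch or overlap, or when the remaining
 * distance is within the inter-hunk allowance.
 */
static int same_hunk(long gap, long ctxlen, long interhunkctxlen)
{
	assert(gap >= 0 && ctxlen >= 0 && interhunkctxlen >= 0);
	return gap <= 2 * ctxlen + interhunkctxlen;
}
```

Under this model, with ctxlen forced to 0 only interhunkctxlen decides whether nearby changes reach the callback as one larger hunk, which is what values like 10 and 100 were probing.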

But alas, the results of that were:

    Test                                             HEAD~2            HEAD~                    HEAD
    ------------------------------------------------------------------------------------------------------------------
    4209.2: git log -G'a' <limit-rev>..              0.61(0.53+0.07)   0.51(0.46+0.05) -16.4%   0.51(0.46+0.05) -16.4%
    4209.8: git log -G'uncommon' <limit-rev>..       0.66(0.55+0.05)   0.53(0.48+0.04) -19.7%   0.52(0.49+0.03) -21.2%
    4209.14: git log -G'[þæö]' <limit-rev>..         0.63(0.54+0.06)   0.51(0.44+0.07) -19.0%   0.52(0.46+0.06) -17.5%
    4209.21: git log -i -G'a' <limit-rev>..          0.62(0.54+0.07)   0.51(0.46+0.04) -17.7%   0.53(0.45+0.07) -14.5%
    4209.27: git log -i -G'uncommon' <limit-rev>..   0.62(0.56+0.06)   0.53(0.48+0.05) -14.5%   0.53(0.46+0.07) -14.5%
    4209.33: git log -i -G'[þæö]' <limit-rev>..      0.63(0.57+0.03)   0.58(0.46+0.06) -7.9%    0.53(0.46+0.06) -15.9%

I.e. maybe it's faster in some cases, but probably slower in general.

Those results are going to be crappy because we're matching a line at
a time, as opposed to some version of /m matching across the whole
diff (if possible). So that approach might be worth revisiting in the
future.

Comments

Ævar Arnfjörð Bjarmason Feb. 3, 2021, 2:26 p.m. UTC | #1

On Wed, Feb 03 2021, Ævar Arnfjörð Bjarmason wrote:

> Set the equivalent of -U0 when generating diffs for "git log -G". As
> seen in diffgrep_consume() we ignore any lines that aren't the "+" and
> "-" lines, so the rest of the output wasn't being used.
>
> It turns out that we spent quite a bit of CPU just on this[1]:
>
>     Test                                             HEAD~             HEAD
>     -----------------------------------------------------------------------------------------
>     4209.2: git log -G'a' <limit-rev>..              0.60(0.54+0.06)   0.52(0.46+0.05) -13.3%
>     4209.8: git log -G'uncommon' <limit-rev>..       0.61(0.54+0.07)   0.53(0.47+0.06) -13.1%
>     4209.14: git log -G'[þæö]' <limit-rev>..         0.60(0.55+0.04)   0.56(0.48+0.04) -6.7%
>     4209.21: git log -i -G'a' <limit-rev>..          0.63(0.56+0.03)   0.54(0.48+0.05) -14.3%
>     4209.27: git log -i -G'uncommon' <limit-rev>..   0.61(0.55+0.05)   0.53(0.47+0.06) -13.1%
>     4209.33: git log -i -G'[þæö]' <limit-rev>..      0.61(0.53+0.07)   0.53(0.47+0.05) -13.1%
>
> I also experimented with setting diff.interHunkContext to 10, 100
> etc. As noted above it's useless for -G to have non-"+" and non-"-"
> lines for the matching itself, but there's going to be some sweet spot
> where if we can be handed bigger hunks at a time our matching might be
> faster.
>
> But alas, the results of that were:
>
>     Test                                             HEAD~2            HEAD~                    HEAD
>     ------------------------------------------------------------------------------------------------------------------
>     4209.2: git log -G'a' <limit-rev>..              0.61(0.53+0.07)   0.51(0.46+0.05) -16.4%   0.51(0.46+0.05) -16.4%
>     4209.8: git log -G'uncommon' <limit-rev>..       0.66(0.55+0.05)   0.53(0.48+0.04) -19.7%   0.52(0.49+0.03) -21.2%
>     4209.14: git log -G'[þæö]' <limit-rev>..         0.63(0.54+0.06)   0.51(0.44+0.07) -19.0%   0.52(0.46+0.06) -17.5%
>     4209.21: git log -i -G'a' <limit-rev>..          0.62(0.54+0.07)   0.51(0.46+0.04) -17.7%   0.53(0.45+0.07) -14.5%
>     4209.27: git log -i -G'uncommon' <limit-rev>..   0.62(0.56+0.06)   0.53(0.48+0.05) -14.5%   0.53(0.46+0.07) -14.5%
>     4209.33: git log -i -G'[þæö]' <limit-rev>..      0.63(0.57+0.03)   0.58(0.46+0.06) -7.9%    0.53(0.46+0.06) -15.9%
>
> I.e. maybe it's faster in some cases, but probably slower in general.
>
> Those results are going to be crappy because we're matching a line at
> a time, as opposed to some version of /m matching across the whole
> diff (if possible). So that approach might be worth revisiting in the
> future.
>
> 1. GIT_SKIP_TESTS="p4209.[1379] p4209.15 p4209.2[028] p4209.34" GIT_PERF_EXTRA= GIT_PERF_REPO=~/g/git/ GIT_PERF_REPEAT_COUNT=5 GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD -- p4209-pickaxe.sh
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  diffcore-pickaxe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index cb865c8b29..5161c81057 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -60,7 +60,7 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
>  	memset(&xecfg, 0, sizeof(xecfg));
>  	ecbdata.regexp = regexp;
>  	ecbdata.hit = 0;
> -	xecfg.ctxlen = o->context;
> +	xecfg.ctxlen = 0;
>  	xecfg.interhunkctxlen = o->interhunkcontext;
>  	if (xdi_diff_outf(one, two, discard_hunk_line, diffgrep_consume,
>  			  &ecbdata, &xpp, &xecfg))

I have since discovered Junio's f01cae918f (diff: teach --stat/--numstat to
honor -U$num, 2011-09-22). As an aside, we have no test for that
behavior.

I haven't looked carefully, but I don't think we'll have the same issue
here, as pickaxe currently doesn't care whether something is on
the + or the - line. From briefly looking at the diffstat edge cases, it
seems that is what differs based on -U<n> for the diffstat.

Junio C Hamano Feb. 3, 2021, 7:42 p.m. UTC | #2

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I since discovered Junio's f01cae918f (diff: teach --stat/--numstat to
> honor -U$num, 2011-09-22) (as an aside we have no test for that
> behavior).
>
> I haven't looked carefully, but I don't think we'll have the same issue
> here, as pickaxe currently doesn't care about whether something is on
> the + or - line, when briefly looking at the diffstat edge cases it
> seems that's what differs based on -U<n> for the diffstat.

With -U0 or different <n> in general, the matching between preimage
and postimage may become different, and both -U3 (usual) and -U0 may
express the same change "correctly" from the point of view of a
program like "git apply", but humans would see them as different
patches, and "diffstat" that counts number of +/- would give
different results.  The patch IDs may also be different.  The old
commit was to pessimize the logic (because we do not need context
just to count +/- lines for the purpose of diffstat) to match human
expectations.  They expect "'diffstat' must be counting 'diff -p'
output" and we were counting "diff -p -U0" instead, resulting in
different numbers.

With internally using -U0, the updated "pickaxe -G" is likely to get
the same complaints: "'pickaxe -G<token>' found this commit, but in
the 'git show' output, the token does not seem to be affected".

You could respond with "try 'git show -U0' and now you'd see the
<token>", but again that is probably breaking human expectations.


1. GIT_SKIP_TESTS="p4209.[1379] p4209.15 p4209.2[028] p4209.34" GIT_PERF_EXTRA= GIT_PERF_REPO=~/g/git/ GIT_PERF_REPEAT_COUNT=5 GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD -- p4209-pickaxe.sh

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 diffcore-pickaxe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index cb865c8b29..5161c81057 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -60,7 +60,7 @@  static int diff_grep(mmfile_t *one, mmfile_t *two,
 	memset(&xecfg, 0, sizeof(xecfg));
 	ecbdata.regexp = regexp;
 	ecbdata.hit = 0;
-	xecfg.ctxlen = o->context;
+	xecfg.ctxlen = 0;
 	xecfg.interhunkctxlen = o->interhunkcontext;
 	if (xdi_diff_outf(one, two, discard_hunk_line, diffgrep_consume,
 			  &ecbdata, &xpp, &xecfg))

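To get a feel for how much output the one-line change avoids, the following sketch counts the content lines of a single isolated hunk in unified format. It is a back-of-the-envelope model, not code from xdiff; emitted_lines() and min_long() are hypothetical helpers, and hunk headers are not counted.

```c
#include <assert.h>

/* Smaller of two non-negative line counts. */
static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * Content lines emitted for one isolated change in unified format:
 * the changed ('+'/'-') lines plus up to 'ctxlen' context lines on
 * each side, limited by how many unchanged lines exist there.
 */
static long emitted_lines(long changed, long before, long after, long ctxlen)
{
	assert(changed >= 0 && before >= 0 && after >= 0 && ctxlen >= 0);
	return min_long(ctxlen, before) + changed + min_long(ctxlen, after);
}
```

For a single added line in the middle of a file, three lines of context give 7 content lines, of which only 1 can ever match; with ctxlen = 0 only that 1 line is produced.
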