mbox series

[v3,0/2] Teach diff to honor diff algorithms set through git attributes

Message ID pull.1452.v3.git.git.1676665285.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Teach diff to honor diff algorithms set through git attributes | expand

Message

Philippe Blain via GitGitGadget Feb. 17, 2023, 8:21 p.m. UTC
When a repository contains different kinds of files, it may be desirable to
use different algorithms based on file type. This is currently not feasible
through the command line or using git configs. However, we can leverage the
fact that gitattributes are path aware.

Teach the diff machinery to check gitattributes when diffing files by using
the existing diff. scheme, and add an "algorithm" type to the external
driver config.

Changes since V2:

 * minor clean up and variable renaming
 * avoid parsing attribute files for the driver if the diff algorithm is set
   through the command line

Changes since V1:

 * utilize the existing diff.<driver>.* scheme where the driver is defined
   in gitattributes, but the algorithm is defined in the gitconfig.

To address some of the performance concerns in the previous series, a
benchmark shows that a performance penalty is no longer incurred, now that
we are no longer adding an additional attributes parsing call:

$ hyperfine -r 5 -L a bin-wrappers/git,git '{a} diff v2.0.0 v2.28.0'
Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0 Time (mean ± σ): 1.072 s ±
0.289 s [User: 0.626 s, System: 0.081 s] Range (min … max): 0.772 s … 1.537
s 5 runs

Benchmark 2: git diff v2.0.0 v2.28.0 Time (mean ± σ): 1.003 s ± 0.065 s
[User: 0.684 s, System: 0.067 s] Range (min … max): 0.914 s … 1.091 s 5 runs

Summary 'git diff v2.0.0 v2.28.0' ran 1.07 ± 0.30 times faster than
'git-bin-wrapper diff v2.0.0 v2.28.0'

John Cai (2):
  diff: consolidate diff algorithm option parsing
  diff: teach diff to read algorithm from diff driver

 Documentation/gitattributes.txt | 37 ++++++++++++++
 diff.c                          | 90 ++++++++++++++++++++++++---------
 diff.h                          |  1 +
 t/lib-diff-alternative.sh       | 38 +++++++++++++-
 userdiff.c                      |  4 +-
 userdiff.h                      |  1 +
 6 files changed, 146 insertions(+), 25 deletions(-)


base-commit: c867e4fa180bec4750e9b54eb10f459030dbebfd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1452%2Fjohn-cai%2Fjc%2Fattr-diff-algo-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1452/john-cai/jc/attr-diff-algo-v3
Pull-Request: https://github.com/git/git/pull/1452

Range-diff vs v2:

 1:  0c5e1fc6c26 ! 1:  816c47aa414 diff: consolidate diff algorithm option parsing
     @@ Metadata
       ## Commit message ##
          diff: consolidate diff algorithm option parsing
      
     -    The diff option parsing for --minimal, --patience, --histgoram can all
     -    be consolidated into one function. This is a preparatory step for the
     -    subsequent commit which teaches diff to keep track of whether or not a
     -    diff algorithm has been set via the command line.
     +    A subsequent commit will need the ability to tell if the diff algorithm
     +    was set through the command line through setting a new member of
     +    diff_options. While this logic can be added to the
     +    diff_opt_diff_algorithm() callback, the `--minimal` and `--histogram`
     +    options are handled via OPT_BIT without a callback.
      
     -    Additionally, the logic that sets the diff algorithm in
     -    diff_opt_diff_algorithm() can  be refactored into a helper that will
     +    Remedy this by consolidating the options parsing logic for --minimal and
     +    --histogram into one callback. This way we can modify `diff_options` in
     +    that function.
     +
     +    As an additional refactor, the logic that sets the diff algorithm in
     +    diff_opt_diff_algorithm() can be refactored into a helper that will
          allow multiple callsites to set the diff algorithm.
      
          Signed-off-by: John Cai <johncai86@gmail.com>
     @@ diff.c: static int diff_filepair_is_phoney(struct diff_filespec *one,
      +	long value = parse_algorithm_value(alg);
      +
      +	if (value < 0)
     -+		return 1;
     ++		return -1;
      +
      +	/* clear out previous settings */
      +	DIFF_XDL_CLR(opts, NEED_MINIMAL);
     @@ diff.c: static int diff_opt_diff_algorithm(const struct option *opt,
      +	BUG_ON_OPT_NEG(unset);
      +	BUG_ON_OPT_ARG(arg);
      +
     -+	if (!strcmp(opt->long_name, "patience")) {
     -+		size_t i;
     -+		/*
     -+		 * Both --patience and --anchored use PATIENCE_DIFF
     -+		 * internally, so remove any anchors previously
     -+		 * specified.
     -+		 */
     -+		for (i = 0; i < options->anchors_nr; i++)
     -+			free(options->anchors[i]);
     -+		options->anchors_nr = 0;
     -+	}
     -+
      +	if (set_diff_algorithm(options, opt->long_name))
      +		BUG("available diff algorithms include \"myers\", "
      +			       "\"minimal\", \"patience\" and \"histogram\"");
     @@ diff.c: static int diff_opt_diff_algorithm(const struct option *opt,
       	return 0;
       }
       
     -@@ diff.c: static enum parse_opt_result diff_opt_output(struct parse_opt_ctx_t *ctx,
     - 	return 0;
     - }
     +@@ diff.c: static int diff_opt_patience(const struct option *opt,
       
     --static int diff_opt_patience(const struct option *opt,
     --			     const char *arg, int unset)
     --{
     --	struct diff_options *options = opt->value;
     --	int i;
     --
     --	BUG_ON_OPT_NEG(unset);
     --	BUG_ON_OPT_ARG(arg);
     + 	BUG_ON_OPT_NEG(unset);
     + 	BUG_ON_OPT_ARG(arg);
      -	options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
     --	/*
     --	 * Both --patience and --anchored use PATIENCE_DIFF
     --	 * internally, so remove any anchors previously
     --	 * specified.
     --	 */
     --	for (i = 0; i < options->anchors_nr; i++)
     --		free(options->anchors[i]);
     --	options->anchors_nr = 0;
     + 	/*
     + 	 * Both --patience and --anchored use PATIENCE_DIFF
     + 	 * internally, so remove any anchors previously
     +@@ diff.c: static int diff_opt_patience(const struct option *opt,
     + 	for (i = 0; i < options->anchors_nr; i++)
     + 		free(options->anchors[i]);
     + 	options->anchors_nr = 0;
      -	return 0;
     --}
     --
     ++
     ++	return set_diff_algorithm(options, "patience");
     + }
     + 
       static int diff_opt_ignore_regex(const struct option *opt,
     - 				 const char *arg, int unset)
     - {
      @@ diff.c: struct option *add_diff_options(const struct option *opts,
       			    N_("prevent rename/copy detection if the number of rename/copy targets exceeds given limit")),
       
     @@ diff.c: struct option *add_diff_options(const struct option *opts,
       			  N_("ignore whitespace when comparing lines"),
       			  XDF_IGNORE_WHITESPACE, PARSE_OPT_NONEG),
      @@ diff.c: struct option *add_diff_options(const struct option *opts,
     - 		OPT_CALLBACK_F(0, "patience", options, NULL,
       			       N_("generate diff using the \"patience diff\" algorithm"),
       			       PARSE_OPT_NONEG | PARSE_OPT_NOARG,
     --			       diff_opt_patience),
     + 			       diff_opt_patience),
      -		OPT_BITOP(0, "histogram", &options->xdl_opts,
      -			  N_("generate diff using the \"histogram diff\" algorithm"),
      -			  XDF_HISTOGRAM_DIFF, XDF_DIFF_ALGORITHM_MASK),
     -+			       diff_opt_diff_algorithm_no_arg),
      +		OPT_CALLBACK_F(0, "histogram", options, NULL,
      +			       N_("generate diff using the \"histogram diff\" algorithm"),
      +			       PARSE_OPT_NONEG | PARSE_OPT_NOARG,
 2:  cb030563149 ! 2:  b330222ce83 diff: teach diff to read gitattribute diff-algorithm
     @@ Metadata
      Author: John Cai <johncai86@gmail.com>
      
       ## Commit message ##
     -    diff: teach diff to read gitattribute diff-algorithm
     +    diff: teach diff to read algorithm from diff driver
      
          It can be useful to specify diff algorithms per file type. For example,
          one may want to use the minimal diff algorithm for .json files, another
          for .c files, etc.
      
     -    Teach the diff machinery to check attributes for a diff driver. Also
     -    teach the diff driver parser a new type "algorithm" to look for in the
     +    The diff machinery already checks attributes for a diff driver. Teach
     +    the diff driver parser a new type "algorithm" to look for in the
          config, which will be used if a driver has been specified through the
          attributes.
      
     -    Enforce precedence of diff algorithm by favoring the command line option,
     -    then looking at the driver attributes & config combination, then finally
     -    the diff.algorithm config.
     +    Enforce precedence of the diff algorithm by favoring the command line
     +    option, then looking at the driver attributes & config combination, then
     +    finally the diff.algorithm config.
      
     -    To enforce precedence order, use the `xdl_opts_command_line` member
     +    To enforce precedence order, use a new `ignore_driver_algorithm` member
          during options pasing to indicate the diff algorithm was set via command
          line args.
      
          Signed-off-by: John Cai <johncai86@gmail.com>
      
       ## Documentation/gitattributes.txt ##
     -@@ Documentation/gitattributes.txt: String::
     - 	by the configuration variables in the "diff.foo" section of the
     - 	Git config file.
     - 
     --
     - Defining an external diff driver
     - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     - 
      @@ Documentation/gitattributes.txt: with the above configuration, i.e. `j-c-diff`, with 7
       parameters, just like `GIT_EXTERNAL_DIFF` program is called.
       See linkgit:git[1] for details.
     @@ Documentation/gitattributes.txt: with the above configuration, i.e. `j-c-diff`,
      +The diff algorithm can be set through the `diff.algorithm` config key, but
      +sometimes it may be helpful to set the diff algorithm by path. For example, one
      +might wish to set a diff algorithm automatically for all `.json` files such that
     -+the user would not need to pass in a separate command line `--diff-algorithm` flag each
     -+time.
     ++the user would not need to pass in a separate command line `--diff-algorithm`
     ++flag each time.
      +
     -+First, in `.gitattributes`, you would assign the `diff` attribute for paths.
     ++First, in `.gitattributes`, assign the `diff` attribute for paths.
      +
     -+*Git attributes*
      +------------------------
      +*.json diff=<name>
      +------------------------
      +
     -+Then, you would define a "diff.<name>.algorithm" configuration to specify the
     -+diff algorithm, choosing from `meyers`, `patience`, `minimal`, and `histogram`.
     -+
     -+*Git config*
     ++Then, define a "diff.<name>.algorithm" configuration to specify the diff
     ++algorithm, choosing from `meyers`, `patience`, `minimal`, or `histogram`.
      +
      +----------------------------------------------------------------
      +[diff "<name>"]
      +  algorithm = histogram
      +----------------------------------------------------------------
      +
     -+This diff algorithm applies to git-diff(1), including the `--stat` output.
     ++This diff algorithm applies to user facing diff output like git-diff(1),
     ++git-show(1) and is used for the `--stat` output as well. The merge machinery
     ++will not use the diff algorithm set through this method.
      +
      +NOTE: If the `command` key also exists, then Git will treat this as an external
      +diff and attempt to use the value set for `command` as an external program. For
      +instance, the following config, combined with the above `.gitattributes` file,
      +will result in `command` favored over `algorithm`.
      +
     -+*Git config*
     -+
      +----------------------------------------------------------------
      +[diff "<name>"]
      +  command = j-c-diff
     @@ diff.c: static void run_diff_cmd(const char *pgm,
       	const char *xfrm_msg = NULL;
       	int complete_rewrite = (p->status == DIFF_STATUS_MODIFIED) && p->score;
       	int must_show_header = 0;
     -+	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, attr_path);
     ++	struct userdiff_driver *drv = NULL;
       
      -
      -	if (o->flags.allow_external) {
      -		struct userdiff_driver *drv;
      -
     --		drv = userdiff_find_by_path(o->repo->index, attr_path);
     -+	if (o->flags.allow_external)
     - 		if (drv && drv->external)
     - 			pgm = drv->external;
     ++	if (o->flags.allow_external || !o->ignore_driver_algorithm)
     + 		drv = userdiff_find_by_path(o->repo->index, attr_path);
     +-		if (drv && drv->external)
     +-			pgm = drv->external;
      -	}
     ++
     ++	if (o->flags.allow_external && drv && drv->external)
     ++		pgm = drv->external;
       
       	if (msg) {
       		/*
     @@ diff.c: static void run_diff_cmd(const char *pgm,
       	}
      -	if (one && two)
      +	if (one && two) {
     -+		if (!o->xdl_opts_command_line)
     -+			if (drv && drv->algorithm)
     -+				set_diff_algorithm(o, drv->algorithm);
     ++		if (drv && !o->ignore_driver_algorithm && drv->algorithm)
     ++			set_diff_algorithm(o, drv->algorithm);
      +
       		builtin_diff(name, other ? other : name,
       			     one, two, xfrm_msg, must_show_header,
     @@ diff.c: static void run_diffstat(struct diff_filepair *p, struct diff_options *o
       	const char *name;
       	const char *other;
       
     -+	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, p->one->path);
     -+	if (drv && drv->algorithm)
     -+		set_diff_algorithm(o, drv->algorithm);
     ++	if (!o->ignore_driver_algorithm) {
     ++		struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, p->one->path);
     ++
     ++		if (drv && drv->algorithm) {
     ++			set_diff_algorithm(o, drv->algorithm);
     ++		}
     ++	}
      +
       	if (DIFF_PAIR_UNMERGED(p)) {
       		/* unmerged */
     @@ diff.c: static int diff_opt_diff_algorithm(const struct option *opt,
       		return error(_("option diff-algorithm accepts \"myers\", "
       			       "\"minimal\", \"patience\" and \"histogram\""));
       
     -+	options->xdl_opts_command_line = 1;
     ++	options->ignore_driver_algorithm = 1;
      +
       	return 0;
       }
     @@ diff.c: static int diff_opt_diff_algorithm_no_arg(const struct option *opt,
       		BUG("available diff algorithms include \"myers\", "
       			       "\"minimal\", \"patience\" and \"histogram\"");
       
     -+	options->xdl_opts_command_line = 1;
     ++	options->ignore_driver_algorithm = 1;
      +
       	return 0;
       }
       
     +@@ diff.c: static int diff_opt_patience(const struct option *opt,
     + 	for (i = 0; i < options->anchors_nr; i++)
     + 		free(options->anchors[i]);
     + 	options->anchors_nr = 0;
     ++	options->ignore_driver_algorithm = 1;
     + 
     + 	return set_diff_algorithm(options, "patience");
     + }
      
       ## diff.h ##
      @@ diff.h: struct diff_options {
       	int prefix_length;
       	const char *stat_sep;
       	int xdl_opts;
     -+	/* If xdl_opts has been set via the command line. */
     -+	int xdl_opts_command_line;
     ++	int ignore_driver_algorithm;
       
       	/* see Documentation/diff-options.txt */
       	char **anchors;

Comments

Elijah Newren Feb. 18, 2023, 1:16 a.m. UTC | #1
On Fri, Feb 17, 2023 at 12:21 PM John Cai via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> When a repository contains different kinds of files, it may be desirable to
> use different algorithms based on file type. This is currently not feasible
> through the command line or using git configs. However, we can leverage the
> fact that gitattributes are path aware.
>
> Teach the diff machinery to check gitattributes when diffing files by using
> the existing diff. scheme, and add an "algorithm" type to the external
> driver config.
[...]
> To address some of the performance concerns in the previous series, a
> benchmark shows that a performance penalty is no longer incurred, now that
> we are no longer adding an additional attributes parsing call:
>
> $ hyperfine -r 5 -L a bin-wrappers/git,git '{a} diff v2.0.0 v2.28.0'
> Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0 Time (mean ± σ): 1.072 s ±
> 0.289 s [User: 0.626 s, System: 0.081 s] Range (min … max): 0.772 s … 1.537
> s 5 runs
>
> Benchmark 2: git diff v2.0.0 v2.28.0 Time (mean ± σ): 1.003 s ± 0.065 s
> [User: 0.684 s, System: 0.067 s] Range (min … max): 0.914 s … 1.091 s 5 runs
>
> Summary 'git diff v2.0.0 v2.28.0' ran 1.07 ± 0.30 times faster than
> 'git-bin-wrapper diff v2.0.0 v2.28.0'

I'm sorry, I don't understand this.  What are you measuring?  I
presume bin-wrappers/git refers to the version of git built with your
changes, but what version of git does "git" refer to?  Also, do you
have any .gitattributes or .git/config changes present when you are
testing to trigger the new functionality you have written?

Also, doesn't this benchmark demonstrate the opposite of your claim?
You said there was no performance penalty, but the benchmark shows a
7% slowdown.  We've battled hard to get smaller improvements than
that, so this is still worrisome, even if it's no longer a factor of 2
or whatever it was.  But, again, I'm not sure what is being measured.
If the difference is because patience diff was used for some files,
then it's not an apples-to-apples comparison, and a 7% slowdown would
be no cause for concern.

Since I was curious, I compiled both a version of git from directly
before your series, and directly after, then added a '*.[ch]
diff=other' line to the end of .gitattributes, then ran:

$ hyperfine -L a ./older-git,./newer-git '{a} -c
diff.other.algorithm=myers diff --numstat v2.0.0 v2.28.0'
Benchmark 1: ./older-git -c diff.other.algorithm=myers diff --numstat
v2.0.0 v2.28.0
  Time (mean ± σ):     870.2 ms ±   4.4 ms    [User: 755.2 ms, System: 109.8 ms]
  Range (min … max):   861.0 ms … 876.8 ms    10 runs

Benchmark 2: ./newer-git -c diff.other.algorithm=myers diff --numstat
v2.0.0 v2.28.0
  Time (mean ± σ):     876.9 ms ±   4.8 ms    [User: 758.0 ms, System: 113.1 ms]
  Range (min … max):   870.7 ms … 884.1 ms    10 runs

Summary
  './older-git -c diff.other.algorithm=myers diff --numstat v2.0.0 v2.28.0' ran
    1.01 ± 0.01 times faster than './newer-git -c
diff.other.algorithm=myers diff --numstat v2.0.0 v2.28.0'

I specifically specified 'myers' to match what we'd get from the
default anyway, so I would only be testing the slowdown from the
.gitattribute parsing.  So, I think the performance overhead comes out
to just 1% rather than 7% (and further that's when I make it only
print overall stats about the diff rather than the full diff, since I
know that's faster.  If I didn't do that, the perf hit might appear to
be less than 1%).
John Cai Feb. 20, 2023, 1:37 p.m. UTC | #2
Hi Elijah,

On 17 Feb 2023, at 20:16, Elijah Newren wrote:

> On Fri, Feb 17, 2023 at 12:21 PM John Cai via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> When a repository contains different kinds of files, it may be desirable to
>> use different algorithms based on file type. This is currently not feasible
>> through the command line or using git configs. However, we can leverage the
>> fact that gitattributes are path aware.
>>
>> Teach the diff machinery to check gitattributes when diffing files by using
>> the existing diff. scheme, and add an "algorithm" type to the external
>> driver config.
> [...]
>> To address some of the performance concerns in the previous series, a
>> benchmark shows that a performance penalty is no longer incurred, now that
>> we are no longer adding an additional attributes parsing call:
>>
>> $ hyperfine -r 5 -L a bin-wrappers/git,git '{a} diff v2.0.0 v2.28.0'
>> Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0 Time (mean ± σ): 1.072 s ±
>> 0.289 s [User: 0.626 s, System: 0.081 s] Range (min … max): 0.772 s … 1.537
>> s 5 runs
>>
>> Benchmark 2: git diff v2.0.0 v2.28.0 Time (mean ± σ): 1.003 s ± 0.065 s
>> [User: 0.684 s, System: 0.067 s] Range (min … max): 0.914 s … 1.091 s 5 runs
>>
>> Summary 'git diff v2.0.0 v2.28.0' ran 1.07 ± 0.30 times faster than
>> 'git-bin-wrapper diff v2.0.0 v2.28.0'
>
> I'm sorry, I don't understand this.  What are you measuring?  I
> presume bin-wrappers/git refers to the version of git built with your
> changes, but what version of git does "git" refer to?  Also, do you
> have any .gitattributes or .git/config changes present when you are
> testing to trigger the new functionality you have written?
>
> Also, doesn't this benchmark demonstrate the opposite of your claim?
> You said there was no performance penalty, but the benchmark shows a
> 7% slowdown.  We've battled hard to get smaller improvements than
> that, so this is still worrisome, even if it's no longer a factor of 2
> or whatever it was.  But, again, I'm not sure what is being measured.
> If the difference is because patience diff was used for some files,
> then it's not an apples-to-apples comparison, and a 7% slowdown would
> be no cause for concern.
>
> Since I was curious, I compiled both a version of git from directly
> before your series, and directly after, then added a '*.[ch]
> diff=other' line to the end of .gitattributes, then ran:
>
> $ hyperfine -L a ./older-git,./newer-git '{a} -c
> diff.other.algorithm=myers diff --numstat v2.0.0 v2.28.0'
> Benchmark 1: ./older-git -c diff.other.algorithm=myers diff --numstat
> v2.0.0 v2.28.0
>   Time (mean ± σ):     870.2 ms ±   4.4 ms    [User: 755.2 ms, System: 109.8 ms]
>   Range (min … max):   861.0 ms … 876.8 ms    10 runs
>
> Benchmark 2: ./newer-git -c diff.other.algorithm=myers diff --numstat
> v2.0.0 v2.28.0
>   Time (mean ± σ):     876.9 ms ±   4.8 ms    [User: 758.0 ms, System: 113.1 ms]
>   Range (min … max):   870.7 ms … 884.1 ms    10 runs
>
> Summary
>   './older-git -c diff.other.algorithm=myers diff --numstat v2.0.0 v2.28.0' ran
>     1.01 ± 0.01 times faster than './newer-git -c
> diff.other.algorithm=myers diff --numstat v2.0.0 v2.28.0'
>
> I specifically specified 'myers' to match what we'd get from the
> default anyway, so I would only be testing the slowdown from the
> .gitattribute parsing.  So, I think the performance overhead comes out
> to just 1% rather than 7% (and further that's when I make it only
> print overall stats about the diff rather than the full diff, since I
> know that's faster.  If I didn't do that, the perf hit might appear to
> be less than 1%).

Thanks for taking the time to do this! I should have been a bit more careful
about this benchmark, and more explicit about what it was benchmarking. I just
ran it again and made sure that the same algorithm was used, and I got results
similar to you.

Will update the cover letter, thanks!