diff mbox series

[v2,2/3] builtin: introduce diff-pairs command

Message ID 20250212041825.2455031-3-jltobler@gmail.com (mailing list archive)
State New
Headers show
Series batch blob diff generation | expand

Commit Message

Justin Tobler Feb. 12, 2025, 4:18 a.m. UTC
Through git-diff(1), a single diff can be generated from a pair of blob
revisions directly. Unfortunately, there is not a mechanism to compute
batches of specific file pair diffs in a single process. Such a feature
is particularly useful on the server-side where diffing between a large
set of changes is not feasible all at once due to timeout concerns.

To facilitate this, introduce git-diff-pairs(1) which takes the
null-terminated raw diff format as input on stdin and produces diffs in
other formats. As the raw diff format already contains the necessary
metadata, it becomes possible to progressively generate batches of diffs
without having to recompute rename detection or retrieve object context.
Something like the following:

	git diff-tree -r -z -M $old $new |
	git diff-pairs -p

should generate the same output as `git diff-tree -p -M`. Furthermore,
each line of raw diff formatted input can also be individually fed to a
separate git-diff-pairs(1) process and still produce the same output.

Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Justin Tobler <jltobler@gmail.com>
---
 .gitignore                        |   1 +
 Documentation/git-diff-pairs.adoc |  62 +++++++++++
 Documentation/meson.build         |   1 +
 Makefile                          |   1 +
 builtin.h                         |   1 +
 builtin/diff-pairs.c              | 178 ++++++++++++++++++++++++++++++
 command-list.txt                  |   1 +
 git.c                             |   1 +
 meson.build                       |   1 +
 t/meson.build                     |   1 +
 t/t4070-diff-pairs.sh             |  80 ++++++++++++++
 11 files changed, 328 insertions(+)
 create mode 100644 Documentation/git-diff-pairs.adoc
 create mode 100644 builtin/diff-pairs.c
 create mode 100755 t/t4070-diff-pairs.sh

Comments

Patrick Steinhardt Feb. 12, 2025, 9:23 a.m. UTC | #1
On Tue, Feb 11, 2025 at 10:18:24PM -0600, Justin Tobler wrote:
> Through git-diff(1), a single diff can be generated from a pair of blob
> revisions directly. Unfortunately, there is not a mechanism to compute
> batches of specific file pair diffs in a single process. Such a feature
> is particularly useful on the server-side where diffing between a large
> set of changes is not feasible all at once due to timeout concerns.
> 
> To facilitate this, introduce git-diff-pairs(1) which takes the
> null-terminated raw diff format as input on stdin and produces diffs in

s/null/NUL/

> other formats. As the raw diff format already contains the necessary
> metadata, it becomes possible to progressively generate batches of diffs
> without having to recompute rename detection or retrieve object context.
> Something like the following:
> 
> 	git diff-tree -r -z -M $old $new |
> 	git diff-pairs -p
> 
> should generate the same output as `git diff-tree -p -M`. Furthermore,
> each line of raw diff formatted input can also be individually fed to a
> separate git-diff-pairs(1) process and still produce the same output.
> 
> Based-on-patch-by: Jeff King <peff@peff.net>
> Signed-off-by: Justin Tobler <jltobler@gmail.com>

I really like this new design, thanks for working well together!

> diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc
> new file mode 100644
> index 0000000000..e9ef4a6615
> --- /dev/null
> +++ b/Documentation/git-diff-pairs.adoc
> @@ -0,0 +1,62 @@
> +git-diff-pairs(1)
> +=================
> +
> +NAME
> +----
> +git-diff-pairs - Compare blob pairs generated by `diff-tree --raw`
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git diff-pairs' [diff-options]

This should use `[synopsis]`, which allows you to drop the quoting.

> +DESCRIPTION
> +-----------
> +
> +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will
> +reformat that output into whatever format is requested on its command

Reformatting from my point of view implies that we only rearrange bits a
bit. But we're not only reformatting the input, but actually compute the
diffs.

> +line.  For example:
> +
> +-----------------------------
> +git diff-tree -z -M $a $b |
> +git diff-pairs -p
> +-----------------------------
> +
> +will compute the tree diff in one step (including renames), and then
> +`diff-pairs` will compute and format the blob-level diffs for each pair.
> +This can be used to modify the raw diff in the middle (without having to
> +parse or re-create more complicated formats like `--patch`), or to
> +compute diffs progressively over the course of multiple invocations of
> +`diff-pairs`.
> +
> +Each blob pair is fed to the diff machinery individually queued and the output
> +is flushed on stdin EOF.

I think the "flushing" part is a bit hard to understand without knowing
anything about the command internals. As an unknowing reader I would
assume you're talking about fflush(3p), but I think you're rather
talking about when we "flush" the internal diff queue and thus compute
the diffs.

So I'd rephrase this and not talk about flushing, but about the
behaviour observed by the user instead.

> +OPTIONS
> +-------
> +
> +include::diff-options.adoc[]
> +
> +include::diff-generate-patch.adoc[]
> +
> +NOTES
> +----
> +
> +`diff-pairs` should handle any input generated by `diff-tree --raw -z`.
> +It may choke or otherwise misbehave on output from `diff-files`, etc.

This reads a bit weird. The first thing that trips me is the "should".
Does it or doesn't it handle the output of git-diff-tree(1)? The second
part is that this, at least to me, implies that other formats of course
aren't accepted, so why point that out explicitly?

> +Here's an incomplete list of things that `diff-pairs` could do, but
> +doesn't (mostly in the name of simplicity):
> +
> + - Only `-z` input is accepted, not normal `--raw` input.
> +
> + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you

s/sha1s/object IDs/

> +   want to abbreviate the output, you can pass `--abbrev` to
> +   `diff-pairs`.
> +
> + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via
> +   the initial `diff-tree` invocation.

Makes sense.

> diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c
> new file mode 100644
> index 0000000000..08f3ee81e5
> --- /dev/null
> +++ b/builtin/diff-pairs.c
> @@ -0,0 +1,178 @@
> +#include "builtin.h"
> +#include "commit.h"
> +#include "config.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "gettext.h"
> +#include "hex.h"
> +#include "object.h"
> +#include "parse-options.h"
> +#include "revision.h"
> +#include "strbuf.h"
> +
> +static unsigned parse_mode_or_die(const char *mode, const char **endp)
> +{
> +	uint16_t ret;
> +
> +	*endp = parse_mode(mode, &ret);
> +	if (!*endp)
> +		die("unable to parse mode: %s", mode);

Missing translation.

> +	return ret;
> +}
> +
> +static void parse_oid(const char *p, struct object_id *oid, const char **endp,
> +		      const struct git_hash_algo *algop)
> +{
> +	if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ')
> +		die("unable to parse object id: %s", p);

Here, too. Do we also name this `parse_oid_or_die()` to stay consistent
with `parse_mode_or_die()`? The same is also true for `parse_score()`.

> +}
> +
> +static unsigned short parse_score(const char *score)
> +{
> +	unsigned long ret;
> +	char *endp;
> +
> +	errno = 0;
> +	ret = strtoul(score, &endp, 10);
> +	ret *= MAX_SCORE / 100;
> +	if (errno || endp == score || *endp || (unsigned short)ret != ret)
> +		die("unable to parse rename/copy score: %s", score);
> +	return ret;
> +}

You can use `strtoul_ui()` instead, which does most of the error
handling for you.

> +static void flush_diff_queue(struct diff_options *options)
> +{
> +	/*
> +	 * If rename detection is not requested, use rename information from the
> +	 * raw diff formatted input. Setting found_follow ensures diffcore_std()
> +	 * does not mess with rename information already present in queued
> +	 * filepairs.
> +	 */
> +	if (!options->detect_rename)
> +		options->found_follow = 1;
> +	diffcore_std(options);
> +	diff_flush(options);
> +}
> +
> +int cmd_diff_pairs(int argc, const char **argv, const char *prefix,
> +		   struct repository *repo)
> +{
> +	struct strbuf path_dst = STRBUF_INIT;
> +	struct strbuf path = STRBUF_INIT;
> +	struct strbuf meta = STRBUF_INIT;
> +	struct rev_info revs;
> +	int ret;
> +
> +	const char * const usage[] = {
> +		N_("git diff-pairs [diff-options]"),
> +		NULL
> +	};
> +	struct option options[] = {
> +		OPT_END()
> +	};
> +
> +	show_usage_with_options_if_asked(argc, argv, usage, options);
> +
> +	repo_init_revisions(repo, &revs, prefix);
> +	repo_config(repo, git_diff_basic_config, NULL);
> +	revs.disable_stdin = 1;
> +	revs.abbrev = 0;
> +	revs.diff = 1;
> +
> +	argc = setup_revisions(argc, argv, &revs, NULL);

We need to check whether `argc > 0` here. Otherwise, unknown parameters
may simply be ignored, I think.

> +
> +	/* Don't allow pathspecs at all. */
> +	if (revs.prune_data.nr)
> +		usage_with_options(usage, options);

Should we give a better error in this case?

> +	if (!revs.diffopt.output_format)
> +		revs.diffopt.output_format = DIFF_FORMAT_RAW;
> +
> +	while (1) {
> +		struct object_id oid_a, oid_b;
> +		struct diff_filepair *pair;
> +		unsigned mode_a, mode_b;
> +		const char *p;
> +		char status;
> +
> +		if (strbuf_getline_nul(&meta, stdin) == EOF)
> +			break;
> +
> +		p = meta.buf;
> +		if (*p != ':')
> +			die("invalid raw diff input");
> +		p++;
> +
> +		mode_a = parse_mode_or_die(p, &p);
> +		mode_b = parse_mode_or_die(p, &p);
> +
> +		parse_oid(p, &oid_a, &p, repo->hash_algo);
> +		parse_oid(p, &oid_b, &p, repo->hash_algo);
> +
> +		status = *p++;
> +
> +		if (strbuf_getline_nul(&path, stdin) == EOF)
> +			die("got EOF while reading path");

Missing translation.

> +		switch (status) {
> +		case DIFF_STATUS_ADDED:
> +			pair = diff_filepair_addremove(&revs.diffopt, '+',
> +						       mode_b, &oid_b,
> +						       1, path.buf, 0);
> +			if (pair)
> +				pair->status = status;
> +			break;
> +
> +		case DIFF_STATUS_DELETED:
> +			pair = diff_filepair_addremove(&revs.diffopt, '-',
> +						       mode_a, &oid_a,
> +						       1, path.buf, 0);
> +			if (pair)
> +				pair->status = status;
> +			break;
> +
> +		case DIFF_STATUS_TYPE_CHANGED:
> +		case DIFF_STATUS_MODIFIED:
> +			pair = diff_filepair_change(&revs.diffopt,
> +						    mode_a, mode_b,
> +						    &oid_a, &oid_b, 1, 1,
> +						    path.buf, 0, 0);
> +			if (pair)
> +				pair->status = status;
> +			break;
> +
> +		case DIFF_STATUS_RENAMED:
> +		case DIFF_STATUS_COPIED:
> +			{
> +				struct diff_filespec *a, *b;
> +
> +				if (strbuf_getline_nul(&path_dst, stdin) == EOF)
> +					die("got EOF while reading destination path");

Missing translation.

> +				a = alloc_filespec(path.buf);
> +				b = alloc_filespec(path_dst.buf);
> +				fill_filespec(a, &oid_a, 1, mode_a);
> +				fill_filespec(b, &oid_b, 1, mode_b);
> +
> +				pair = diff_queue(&diff_queued_diff, a, b);
> +				pair->status = status;
> +				pair->score = parse_score(p);
> +				pair->renamed_pair = 1;
> +			}
> +			break;
> +
> +		default:
> +			die("unknown diff status: %c", status);

Missing translation.

> diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh
> new file mode 100755
> index 0000000000..e0a8e6f0a0
> --- /dev/null
> +++ b/t/t4070-diff-pairs.sh
> @@ -0,0 +1,80 @@
[snip]
> +test_expect_success 'split input across multiple diff-pairs' '
> +	write_script split-raw-diff "$PERL_PATH" <<-\EOF &&
> +	$/ = "\0";
> +	while (<>) {
> +	  my $meta = $_;
> +	  my $path = <>;
> +	  # renames have an extra path
> +	  my $path2 = <> if $meta =~ /[RC]\d+/;
> +
> +	  open(my $fh, ">", sprintf "diff%03d", $.);
> +	  print $fh $meta, $path, $path2;
> +	}
> +	EOF
> +
> +	git diff-tree -p -M -C -C base new >expect &&
> +
> +	git diff-tree -r -z -M -C -C base new |
> +	./split-raw-diff &&
> +	for i in diff*; do
> +		git diff-pairs -p <$i || return 1

Formatting:

	for i in diff*
	do
		git diff-pairs -p <$i ||
		return 1
	done >actual

> +	done >actual &&
> +	test_cmp expect actual
> +'

Patrick
Karthik Nayak Feb. 12, 2025, 9:51 a.m. UTC | #2
Justin Tobler <jltobler@gmail.com> writes:

> Through git-diff(1), a single diff can be generated from a pair of blob
> revisions directly. Unfortunately, there is not a mechanism to compute
> batches of specific file pair diffs in a single process. Such a feature
> is particularly useful on the server-side where diffing between a large
> set of changes is not feasible all at once due to timeout concerns.
>
> To facilitate this, introduce git-diff-pairs(1) which takes the
> null-terminated raw diff format as input on stdin and produces diffs in
> other formats. As the raw diff format already contains the necessary
> metadata, it becomes possible to progressively generate batches of diffs
> without having to recompute rename detection or retrieve object context.
> Something like the following:
>
> 	git diff-tree -r -z -M $old $new |
> 	git diff-pairs -p
>
> should generate the same output as `git diff-tree -p -M`. Furthermore,
> each line of raw diff formatted input can also be individually fed to a
> separate git-diff-pairs(1) process and still produce the same output.
>
> Based-on-patch-by: Jeff King <peff@peff.net>
> Signed-off-by: Justin Tobler <jltobler@gmail.com>
> ---
>  .gitignore                        |   1 +
>  Documentation/git-diff-pairs.adoc |  62 +++++++++++
>  Documentation/meson.build         |   1 +
>  Makefile                          |   1 +
>  builtin.h                         |   1 +
>  builtin/diff-pairs.c              | 178 ++++++++++++++++++++++++++++++
>  command-list.txt                  |   1 +
>  git.c                             |   1 +
>  meson.build                       |   1 +
>  t/meson.build                     |   1 +
>  t/t4070-diff-pairs.sh             |  80 ++++++++++++++
>  11 files changed, 328 insertions(+)
>  create mode 100644 Documentation/git-diff-pairs.adoc
>  create mode 100644 builtin/diff-pairs.c
>  create mode 100755 t/t4070-diff-pairs.sh
>
> diff --git a/.gitignore b/.gitignore
> index e82aa19df0..03448c076a 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -54,6 +54,7 @@
>  /git-diff
>  /git-diff-files
>  /git-diff-index
> +/git-diff-pairs
>  /git-diff-tree
>  /git-difftool
>  /git-difftool--helper
> diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc
> new file mode 100644
> index 0000000000..e9ef4a6615
> --- /dev/null
> +++ b/Documentation/git-diff-pairs.adoc
> @@ -0,0 +1,62 @@
> +git-diff-pairs(1)
> +=================
> +
> +NAME
> +----
> +git-diff-pairs - Compare blob pairs generated by `diff-tree --raw`
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git diff-pairs' [diff-options]
> +
> +DESCRIPTION
> +-----------
> +
> +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will
> +reformat that output into whatever format is requested on its command
> +line.  For example:
> +
> +-----------------------------
> +git diff-tree -z -M $a $b |
> +git diff-pairs -p
> +-----------------------------
> +
> +will compute the tree diff in one step (including renames), and then
> +`diff-pairs` will compute and format the blob-level diffs for each pair.
> +This can be used to modify the raw diff in the middle (without having to
> +parse or re-create more complicated formats like `--patch`), or to
> +compute diffs progressively over the course of multiple invocations of
> +`diff-pairs`.
> +
> +Each blob pair is fed to the diff machinery individually queued and the output
> +is flushed on stdin EOF.

I found this hard to understand.

After reading below, perhaps it would be easier to understand something
simpler which doesn't mention the internal queuing mechanism and only
talks about how the output is only steamed once we read EOF on stdin.

> +
> +OPTIONS
> +-------
> +
> +include::diff-options.adoc[]
> +
> +include::diff-generate-patch.adoc[]
> +
> +NOTES
> +----
> +
> +`diff-pairs` should handle any input generated by `diff-tree --raw -z`.
> +It may choke or otherwise misbehave on output from `diff-files`, etc.
> +
> +Here's an incomplete list of things that `diff-pairs` could do, but
> +doesn't (mostly in the name of simplicity):
> +
> + - Only `-z` input is accepted, not normal `--raw` input.
> +
> + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you
> +   want to abbreviate the output, you can pass `--abbrev` to
> +   `diff-pairs`.
> +
> + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via
> +   the initial `diff-tree` invocation.
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
> diff --git a/Documentation/meson.build b/Documentation/meson.build
> index ead8e48213..e5ee177022 100644
> --- a/Documentation/meson.build
> +++ b/Documentation/meson.build
> @@ -41,6 +41,7 @@ manpages = {
>    'git-diagnose.adoc' : 1,
>    'git-diff-files.adoc' : 1,
>    'git-diff-index.adoc' : 1,
> +  'git-diff-pairs.adoc' : 1,
>    'git-difftool.adoc' : 1,
>    'git-diff-tree.adoc' : 1,
>    'git-diff.adoc' : 1,
> diff --git a/Makefile b/Makefile
> index 896d02339e..3b8e1ad15e 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1232,6 +1232,7 @@ BUILTIN_OBJS += builtin/describe.o
>  BUILTIN_OBJS += builtin/diagnose.o
>  BUILTIN_OBJS += builtin/diff-files.o
>  BUILTIN_OBJS += builtin/diff-index.o
> +BUILTIN_OBJS += builtin/diff-pairs.o
>  BUILTIN_OBJS += builtin/diff-tree.o
>  BUILTIN_OBJS += builtin/diff.o
>  BUILTIN_OBJS += builtin/difftool.o
> diff --git a/builtin.h b/builtin.h
> index f7b166b334..b2d2e9eb07 100644
> --- a/builtin.h
> +++ b/builtin.h
> @@ -152,6 +152,7 @@ int cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit
>  int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo);
>  int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo);
>  int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo);
> +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo);
>  int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo);
>  int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo);
>  int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo);
> diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c
> new file mode 100644
> index 0000000000..08f3ee81e5
> --- /dev/null
> +++ b/builtin/diff-pairs.c
> @@ -0,0 +1,178 @@
> +#include "builtin.h"
> +#include "commit.h"
> +#include "config.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "gettext.h"
> +#include "hex.h"
> +#include "object.h"
> +#include "parse-options.h"
> +#include "revision.h"
> +#include "strbuf.h"
> +
> +static unsigned parse_mode_or_die(const char *mode, const char **endp)
> +{
> +	uint16_t ret;
> +
> +	*endp = parse_mode(mode, &ret);
> +	if (!*endp)
> +		die("unable to parse mode: %s", mode);
> +	return ret;
> +}
> +
> +static void parse_oid(const char *p, struct object_id *oid, const char **endp,
> +		      const struct git_hash_algo *algop)

Nit: similar to the function above, should this be called
`parse_oid_or_die`?

> +{
> +	if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ')
> +		die("unable to parse object id: %s", p);
> +}
> +
> +static unsigned short parse_score(const char *score)
> +{
> +	unsigned long ret;
> +	char *endp;
> +
> +	errno = 0;
> +	ret = strtoul(score, &endp, 10);
> +	ret *= MAX_SCORE / 100;
> +	if (errno || endp == score || *endp || (unsigned short)ret != ret)
> +		die("unable to parse rename/copy score: %s", score);
> +	return ret;
> +}
> +
> +static void flush_diff_queue(struct diff_options *options)
> +{
> +	/*
> +	 * If rename detection is not requested, use rename information from the
> +	 * raw diff formatted input. Setting found_follow ensures diffcore_std()
> +	 * does not mess with rename information already present in queued
> +	 * filepairs.
> +	 */
> +	if (!options->detect_rename)
> +		options->found_follow = 1;
> +	diffcore_std(options);
> +	diff_flush(options);
> +}
> +
> +int cmd_diff_pairs(int argc, const char **argv, const char *prefix,
> +		   struct repository *repo)
> +{
> +	struct strbuf path_dst = STRBUF_INIT;
> +	struct strbuf path = STRBUF_INIT;
> +	struct strbuf meta = STRBUF_INIT;
> +	struct rev_info revs;
> +	int ret;
> +
> +	const char * const usage[] = {
> +		N_("git diff-pairs [diff-options]"),
> +		NULL
> +	};
> +	struct option options[] = {
> +		OPT_END()
> +	};
> +
> +	show_usage_with_options_if_asked(argc, argv, usage, options);
> +
> +	repo_nit_revisions(repo, &revs, prefix);
> +	repo_config(repo, git_diff_basic_config, NULL);
> +	revs.disable_stdin = 1;
> +	revs.abbrev = 0;
> +	revs.diff = 1;
> +
> +	argc = setup_revisions(argc, argv, &revs, NULL);
> +
> +	/* Don't allow pathspecs at all. */
> +	if (revs.prune_data.nr)
> +		usage_with_options(usage, options);
> +
> +	if (!revs.diffopt.output_format)
> +		revs.diffopt.output_format = DIFF_FORMAT_RAW;
> +
> +	while (1) {
> +		struct object_id oid_a, oid_b;
> +		struct diff_filepair *pair;
> +		unsigned mode_a, mode_b;
> +		const char *p;
> +		char status;
> +
> +		if (strbuf_getline_nul(&meta, stdin) == EOF)
> +			break;
> +
> +		p = meta.buf;
> +		if (*p != ':')
> +			die("invalid raw diff input");
> +		p++;
> +
> +		mode_a = parse_mode_or_die(p, &p);
> +		mode_b = parse_mode_or_die(p, &p);
> +
> +		parse_oid(p, &oid_a, &p, repo->hash_algo);
> +		parse_oid(p, &oid_b, &p, repo->hash_algo);
> +
> +		status = *p++;
> +
> +		if (strbuf_getline_nul(&path, stdin) == EOF)
> +			die("got EOF while reading path");
> +
> +		switch (status) {
> +		case DIFF_STATUS_ADDED:
> +			pair = diff_filepair_addremove(&revs.diffopt, '+',
> +						       mode_b, &oid_b,
> +						       1, path.buf, 0);
> +			if (pair)
> +				pair->status = status;
> +			break;
> +
> +		case DIFF_STATUS_DELETED:
> +			pair = diff_filepair_addremove(&revs.diffopt, '-',
> +						       mode_a, &oid_a,
> +						       1, path.buf, 0);
> +			if (pair)
> +				pair->status = status;
> +			break;
> +
> +		case DIFF_STATUS_TYPE_CHANGED:
> +		case DIFF_STATUS_MODIFIED:
> +			pair = diff_filepair_change(&revs.diffopt,
> +						    mode_a, mode_b,
> +						    &oid_a, &oid_b, 1, 1,
> +						    path.buf, 0, 0);
> +			if (pair)
> +				pair->status = status;
> +			break;
> +
> +		case DIFF_STATUS_RENAMED:
> +		case DIFF_STATUS_COPIED:
> +			{
> +				struct diff_filespec *a, *b;
> +
> +				if (strbuf_getline_nul(&path_dst, stdin) == EOF)
> +					die("got EOF while reading destination path");
> +
> +				a = alloc_filespec(path.buf);
> +				b = alloc_filespec(path_dst.buf);
> +				fill_filespec(a, &oid_a, 1, mode_a);
> +				fill_filespec(b, &oid_b, 1, mode_b);
> +
> +				pair = diff_queue(&diff_queued_diff, a, b);
> +				pair->status = status;
> +				pair->score = parse_score(p);
> +				pair->renamed_pair = 1;
> +			}
> +			break;
> +
> +		default:

The only state I think is missing is `DIFF_STATUS_UNMERGED` (from
'diff.h'). Is that a state we need to handle?

> +			die("unknown diff status: %c", status);
> +		}
> +	}
> +
> +	flush_diff_queue(&revs.diffopt);

Now I understand what you meant by queuing the diffs.

> +	ret = diff_result_code(&revs);
> +
> +	strbuf_release(&path_dst);
> +	strbuf_release(&path);
> +	strbuf_release(&meta);
> +	release_revisions(&revs);
> +
> +	return ret;
> +}
> diff --git a/command-list.txt b/command-list.txt
> index e0bb87b3b5..bb8acd51d8 100644
> --- a/command-list.txt
> +++ b/command-list.txt
> @@ -95,6 +95,7 @@ git-diagnose                            ancillaryinterrogators
>  git-diff                                mainporcelain           info
>  git-diff-files                          plumbinginterrogators
>  git-diff-index                          plumbinginterrogators
> +git-diff-pairs                          plumbinginterrogators
>  git-diff-tree                           plumbinginterrogators
>  git-difftool                            ancillaryinterrogators          complete
>  git-fast-export                         ancillarymanipulators
> diff --git a/git.c b/git.c
> index b23761480f..12bba872bb 100644
> --- a/git.c
> +++ b/git.c
> @@ -540,6 +540,7 @@ static struct cmd_struct commands[] = {
>  	{ "diff", cmd_diff, NO_PARSEOPT },
>  	{ "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
>  	{ "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT },
> +	{ "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT },
>  	{ "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT },
>  	{ "difftool", cmd_difftool, RUN_SETUP_GENTLY },
>  	{ "fast-export", cmd_fast_export, RUN_SETUP },
> diff --git a/meson.build b/meson.build
> index fbb8105d96..66ce3326e8 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -537,6 +537,7 @@ builtin_sources = [
>    'builtin/diagnose.c',
>    'builtin/diff-files.c',
>    'builtin/diff-index.c',
> +  'builtin/diff-pairs.c',
>    'builtin/diff-tree.c',
>    'builtin/diff.c',
>    'builtin/difftool.c',
> diff --git a/t/meson.build b/t/meson.build
> index 4574280590..7ff17c6d29 100644
> --- a/t/meson.build
> +++ b/t/meson.build
> @@ -500,6 +500,7 @@ integration_tests = [
>    't4067-diff-partial-clone.sh',
>    't4068-diff-symmetric-merge-base.sh',
>    't4069-remerge-diff.sh',
> +  't4070-diff-pairs.sh',
>    't4100-apply-stat.sh',
>    't4101-apply-nonl.sh',
>    't4102-apply-rename.sh',
> diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh
> new file mode 100755
> index 0000000000..e0a8e6f0a0
> --- /dev/null
> +++ b/t/t4070-diff-pairs.sh
> @@ -0,0 +1,80 @@
> +#!/bin/sh
> +
> +test_description='basic diff-pairs tests'
> +. ./test-lib.sh
> +
> +# This creates a diff with added, modified, deleted, renamed, copied, and
> +# typechange entries. That includes one in a subdirectory for non-recursive
> +# tests, and both exact and inexact similarity scores.
> +test_expect_success 'create commit with various diffs' '

Generally, tests for setup are named 'setup' so we can do something
like:
  sh ./t0050-filesystem.sh --run=setup,9-11

Can we renmae this to 'setup'?

> +	echo to-be-gone >deleted &&
> +	echo original >modified &&
> +	echo now-a-file >symlink &&
> +	test_seq 200 >two-hundred &&
> +	test_seq 201 500 >five-hundred &&
> +	git add . &&
> +	test_tick &&
> +	git commit -m base &&
> +	git tag base &&
> +
> +	echo now-here >added &&
> +	echo new >modified &&
> +	rm deleted &&
> +	mkdir subdir &&
> +	echo content >subdir/file &&
> +	mv two-hundred renamed &&
> +	test_seq 201 500 | sed s/300/modified/ >copied &&
> +	rm symlink &&
> +	git add -A . &&
> +	test_ln_s_add dest symlink &&
> +	test_tick &&
> +	git commit -m new &&
> +	git tag new
> +'
> +
> +test_expect_success 'diff-pairs recreates --raw' '
> +	git diff-tree -r -M -C -C base new >expect &&
> +	git diff-tree -r -M -C -C -z base new |
> +	git diff-pairs >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'diff-pairs can create -p output' '
> +	git diff-tree -p -M -C -C base new >expect &&
> +	git diff-tree -r -M -C -C -z base new |
> +	git diff-pairs -p >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'non-recursive --raw retains tree entry' '
> +	git diff-tree base new >expect &&
> +	git diff-tree -z base new |
> +	git diff-pairs >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'split input across multiple diff-pairs' '
> +	write_script split-raw-diff "$PERL_PATH" <<-\EOF &&
> +	$/ = "\0";
> +	while (<>) {
> +	  my $meta = $_;
> +	  my $path = <>;
> +	  # renames have an extra path
> +	  my $path2 = <> if $meta =~ /[RC]\d+/;
> +
> +	  open(my $fh, ">", sprintf "diff%03d", $.);
> +	  print $fh $meta, $path, $path2;
> +	}
> +	EOF
> +
> +	git diff-tree -p -M -C -C base new >expect &&
> +
> +	git diff-tree -r -z -M -C -C base new |
> +	./split-raw-diff &&
> +	for i in diff*; do
> +		git diff-pairs -p <$i || return 1
> +	done >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done
> --
> 2.48.1
Jean-Noël Avila Feb. 12, 2025, 11:40 a.m. UTC | #3
Le 12/02/2025 à 05:18, Justin Tobler a écrit :

>
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git diff-pairs' [diff-options]
> +

This should read:

[synopsis]
git-diff-pairs [<diff-options>]



> +DESCRIPTION
> +-----------
> +
> +Given the output of `diff-tree -z` on its stdin, `diff-pairs` will

Please do not use the future form when describing the actual behavior.

> +reformat that output into whatever format is requested on its command
> +line.  For example:
> +
> +-----------------------------
> +git diff-tree -z -M $a $b |
> +git diff-pairs -p
> +-----------------------------
> +
> +will compute the tree diff in one step (including renames), and then
> +`diff-pairs` will compute and format the blob-level diffs for each pair.
> +This can be used to modify the raw diff in the middle (without having to
> +parse or re-create more complicated formats like `--patch`), or to
> +compute diffs progressively over the course of multiple invocations of
> +`diff-pairs`.
> +

JN
Junio C Hamano Feb. 12, 2025, 4:50 p.m. UTC | #4
Justin Tobler <jltobler@gmail.com> writes:

> +NOTES
> +----
> +
> +`diff-pairs` should handle any input generated by `diff-tree --raw -z`.
> +It may choke or otherwise misbehave on output from `diff-files`, etc.
> +
> +Here's an incomplete list of things that `diff-pairs` could do, but
> +doesn't (mostly in the name of simplicity):
> +
> + - Only `-z` input is accepted, not normal `--raw` input.
> +
> + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you
> +   want to abbreviate the output, you can pass `--abbrev` to
> +   `diff-pairs`.
> +
> + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via
> +   the initial `diff-tree` invocation.

Which of the above limitations are fundamental, and which are merely
due to incomplete implementation that could be improved in the
future iterations?  Without reading the code deeply, a lot of them
look like merely due to this iteration being at a WIP state and not
quite ready for the general public.

What is especially curious is the reason why it is limited to
diff-tree (by the way, don't you require '-r' if you are fed
'diff-tree' output, or are you prepared to expand tree objects in
the input yourself?).

I can guess that the 0{40} object names in the postimage to signal
paths with working tree changes unadded to the index is something
this fundamentally cannot work with, but you should be able to grok
'diff-index --cached', which does not have that issue, just fine.

> diff --git a/Documentation/meson.build b/Documentation/meson.build
> index ead8e48213..e5ee177022 100644
> --- a/Documentation/meson.build
> +++ b/Documentation/meson.build
> @@ -41,6 +41,7 @@ manpages = {
>    'git-diagnose.adoc' : 1,
>    'git-diff-files.adoc' : 1,
>    'git-diff-index.adoc' : 1,
> +  'git-diff-pairs.adoc' : 1,
>    'git-difftool.adoc' : 1,
>    'git-diff-tree.adoc' : 1,
>    'git-diff.adoc' : 1,

This apparently does not apply to 'master' and the base at least
needs to contain 1f010d6b (doc: use .adoc extension for AsciiDoc
files, 2025-01-20).  Please clearly mark the series as such in the
cover letter if the series is not built on top of recent 'master'
(or 'maint' if it is a series to fix breakage, but it does not apply
to this series).

Thanks.
Phillip Wood Feb. 17, 2025, 2:38 p.m. UTC | #5
Hi Justin

On 12/02/2025 04:18, Justin Tobler wrote:
> Through git-diff(1), a single diff can be generated from a pair of blob
> revisions directly. Unfortunately, there is not a mechanism to compute
> batches of specific file pair diffs in a single process. Such a feature
> is particularly useful on the server-side where diffing between a large
> set of changes is not feasible all at once due to timeout concerns.
> 
> To facilitate this, introduce git-diff-pairs(1) which takes the
> null-terminated raw diff format as input on stdin and produces diffs in
> other formats. As the raw diff format already contains the necessary
> metadata, it becomes possible to progressively generate batches of diffs
> without having to recompute rename detection or retrieve object context.
> Something like the following:
> 
> 	git diff-tree -r -z -M $old $new |
> 	git diff-pairs -p
> 
> should generate the same output as `git diff-tree -p -M`. Furthermore,
> each line of raw diff formatted input can also be individually fed to a
> separate git-diff-pairs(1) process and still produce the same output.

I like the idea of this, I've left a few comments mainly around the UI.

 > +Here's an incomplete list of things that `diff-pairs` could do, but
 > +doesn't (mostly in the name of simplicity):
 > +
 > + - Only `-z` input is accepted, not normal `--raw` input.

I think only accepting NUL terminated input is fine, but if we want to 
accept other formats we should  have a plan for how to do that in a 
backwards compatible way as we cannot use `-z` to distinguish between 
input formats.

> +	const char * const usage[] = {
> +		N_("git diff-pairs [diff-options]"),

Normally the option summary printed by "git foo -h" is generated by the 
option parser. In this case we don't define any options and use 
setup_revisions() instead so we need to provide the option summary 
ourselves. Looking at diff-files.c we can add

	"\n"
	COMMON_DIFF_OPTIONS_HELP;

to do that.

> +	argc = setup_revisions(argc, argv, &revs, NULL);

I think we should check that there are no options left on the 
commandline after setup_revisions() returns

> +	/* Don't allow pathspecs at all. */
> +	if (revs.prune_data.nr)
> +		usage_with_options(usage, options);

It is not just pathspecs that we want to reject but all revision related 
options. Looking at diff-files.c we can do

	if (rev.pending.nr ||
	    rev.min_age != -1 || rev.max_age != -1 ||
	    rev.max_count != -1)
		usage_with_option(usage, options);

To catch some of that but it still accepts things like "--first-parent", 
"--merges" and "--ancestry-path". We may just have to live with that as 
I don't think it is worth expanding a huge amount of effort to prevent them.

> +	if (!revs.diffopt.output_format)
> +		revs.diffopt.output_format = DIFF_FORMAT_RAW;

This matches the other diff plumbing commands but I'm not sure it is the 
most helpful default for a command that is supposed to transform raw 
diffs into another format. Maybe we should default to DIFF_FORMAT_PATCH?


> +test_expect_success 'split input across multiple diff-pairs' '

This needs a PERL prerequisite I think. I'm a bit unsure what this test 
adds compared to the others.

Best Wishes

Phillip

> +	write_script split-raw-diff "$PERL_PATH" <<-\EOF &&
> +	$/ = "\0";
> +	while (<>) {
> +	  my $meta = $_;
> +	  my $path = <>;
> +	  # renames have an extra path
> +	  my $path2 = <> if $meta =~ /[RC]\d+/;
> +
> +	  open(my $fh, ">", sprintf "diff%03d", $.);
> +	  print $fh $meta, $path, $path2;
> +	}
> +	EOF
> +
> +	git diff-tree -p -M -C -C base new >expect &&
> +
> +	git diff-tree -r -z -M -C -C base new |
> +	./split-raw-diff &&
> +	for i in diff*; do
> +		git diff-pairs -p <$i || return 1
> +	done >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done
Justin Tobler Feb. 19, 2025, 8:51 p.m. UTC | #6
On 25/02/17 02:38PM, Phillip Wood wrote:
> Hi Justin
> 
> On 12/02/2025 04:18, Justin Tobler wrote:
> > Through git-diff(1), a single diff can be generated from a pair of blob
> > revisions directly. Unfortunately, there is not a mechanism to compute
> > batches of specific file pair diffs in a single process. Such a feature
> > is particularly useful on the server-side where diffing between a large
> > set of changes is not feasible all at once due to timeout concerns.
> > 
> > To facilitate this, introduce git-diff-pairs(1) which takes the
> > null-terminated raw diff format as input on stdin and produces diffs in
> > other formats. As the raw diff format already contains the necessary
> > metadata, it becomes possible to progressively generate batches of diffs
> > without having to recompute rename detection or retrieve object context.
> > Something like the following:
> > 
> > 	git diff-tree -r -z -M $old $new |
> > 	git diff-pairs -p
> > 
> > should generate the same output as `git diff-tree -p -M`. Furthermore,
> > each line of raw diff formatted input can also be individually fed to a
> > separate git-diff-pairs(1) process and still produce the same output.
> 
> I like the idea of this, I've left a few comments mainly around the UI.
> 
> > +Here's an incomplete list of things that `diff-pairs` could do, but
> > +doesn't (mostly in the name of simplicity):
> > +
> > + - Only `-z` input is accepted, not normal `--raw` input.
> 
> I think only accepting NUL terminated input is fine, but if we want to
> accept other formats we should  have a plan for how to do that in a
> backwards compatible way as we cannot use `-z` to distinguish between input
> formats.

If in the future we want to support the normal format, we could introduce
an `--input-format=normal` option or something along those lines. 

> > +	const char * const usage[] = {
> > +		N_("git diff-pairs [diff-options]"),
> 
> Normally the option summary printed by "git foo -h" is generated by the
> option parser. In this case we don't define any options and use
> setup_revisions() instead so we need to provide the option summary
> ourselves. Looking at diff-files.c we can add
> 
> 	"\n"
> 	COMMON_DIFF_OPTIONS_HELP;
> 
> to do that.

Would this be preferable even if git-diff-pairs doesn't support all of
the common diff options?

> > +	argc = setup_revisions(argc, argv, &revs, NULL);
> 
> I think we should check that there are no options left on the commandline
> after setup_revisions() returns

Good call, will do in the next version.

> > +	/* Don't allow pathspecs at all. */
> > +	if (revs.prune_data.nr)
> > +		usage_with_options(usage, options);
> 
> It is not just pathspecs that we want to reject but all revision related
> options. Looking at diff-files.c we can do
> 
> 	if (rev.pending.nr ||
> 	    rev.min_age != -1 || rev.max_age != -1 ||
> 	    rev.max_count != -1)
> 		usage_with_option(usage, options);
> 
> To catch some of that but it still accepts things like "--first-parent",
> "--merges" and "--ancestry-path". We may just have to live with that as I
> don't think it is worth expanding a huge amount of effort to prevent them.

Yes, we should also reject revision as well as pathspec arguments. Will
update.

> > +	if (!revs.diffopt.output_format)
> > +		revs.diffopt.output_format = DIFF_FORMAT_RAW;
> 
> This matches the other diff plumbing commands but I'm not sure it is the
> most helpful default for a command that is supposed to transform raw diffs
> into another format. Maybe we should default to DIFF_FORMAT_PATCH?

As you mentioned, defaulting to DIFF_FORMAT_RAW isn't the most useful
behavior. I agree that it makes more sense to use DIFF_FORMAT_PATCH as
the default. Will update in the next version.

> > +test_expect_success 'split input across multiple diff-pairs' '
> 
> This needs a PERL prerequisite I think. I'm a bit unsure what this test adds
> compared to the others.

This test demonstrates that the raw diff input can be split across
separate git-diff-pairs(1) processes and still produce equivilant
output which is one of the main usecases for the command. That being
said, this test isn't really exercising different behavior of
git-diff-pairs(1) itself, so maybe it would be best to drop it.

Thanks for the review :)

-Justin
Junio C Hamano Feb. 19, 2025, 9:57 p.m. UTC | #7
Justin Tobler <jltobler@gmail.com> writes:

>> I think only accepting NUL terminated input is fine, but if we want to
>> accept other formats we should  have a plan for how to do that in a
>> backwards compatible way as we cannot use `-z` to distinguish between input
>> formats.
>
> If in the future we want to support the normal format, we could introduce
> an `--input-format=normal` option or something along those lines. 

Please don't.  Have an explicit '-z' option from the beginning, and
if the initial version is incapable of reading from text input, then
it is perfectly fine to have

	if (!nul_termination)
		die(_("working without -z not supported (yet)");

Otherwise people have to remember that unlike everybody else that
uses "-z" to signal NUL termination, this one alone wants to use a
"--input-format" option that nobody else uses.

>> > +	/* Don't allow pathspecs at all. */
>> > +	if (revs.prune_data.nr)
>> > +		usage_with_options(usage, options);

Hmph, this is very unfortuate.

The "--raw" format was originally designed as an interchange format
between the frontend and backend.  

The frontend programs take two sets of contents stored in various
places (like tree vs index, tree vs another tree) and express
comparison of corresponding paths in (<from mode+contents> <to
mode+contents> <path>) tuples" (a rough equivalent to what we
internally have on the diff_queued_diff queue in core).

The "--raw" format was designed to "dump" what is in the
diff_queued_diff list.

And then it would be passed to the single backend, that takes
"--raw" format, pass them through the diffcore transform machinery
(like matching removal and addition to detect renames), and produce
various forms of output (like patch, diffstat, etc.).

To me, what you are writing is the output phase of that pipeline,
i.e. the backend.  We do want to (evantually) be able to filter with
pathspec, and all other things the current diff machinery does after
the existing "all-in-one" "git diff" and "git diff-{files,index,tree}"
commands do from their call to diffcore_std() and diffcore_flush().

The revisions option parsing machinery does accept options that
would *not* make sense to expect for them to make any difference to
the result of running "diff".  Rejecting them is a nice thing to
have, e.g. "git diff --no-merges HEAD^ HEAD" does not error out, but
some people may want it to barf (I don't care---I am not sick enough
to give apparently nonsense options to random commands), but it is
perfectly fine to start your implementation with "nonsense options
may be ignored".

But in a "git diff-* -z | git diff-pairs -z" pipeline, I do not see
a particular reason why you would want to forbid the downstream
command to further limit the paths it processes with its own
pathspec, e.g.

    git diff-tree -z --raw A B -- t/ | git diff-pairs -z t/helper/

sounds like a perfectly sensible request to grant.

My recommendation is to avoid deciding to reject things your initial
implementation happens not to support (yet) too early.  In the end,
we want this backend half just as powerful as, if not more than, the
real "git diff" machinery that has both front- and backend in the
same binary.

Thanks.
Justin Tobler Feb. 19, 2025, 10:19 p.m. UTC | #8
On 25/02/12 08:50AM, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
> 
> > +NOTES
> > +----
> > +
> > +`diff-pairs` should handle any input generated by `diff-tree --raw -z`.
> > +It may choke or otherwise misbehave on output from `diff-files`, etc.
> > +
> > +Here's an incomplete list of things that `diff-pairs` could do, but
> > +doesn't (mostly in the name of simplicity):
> > +
> > + - Only `-z` input is accepted, not normal `--raw` input.
> > +
> > + - Abbreviated sha1s are rejected in the input from `diff-tree`; if you
> > +   want to abbreviate the output, you can pass `--abbrev` to
> > +   `diff-pairs`.
> > +
> > + - Pathspecs are not handled by `diff-pairs`; you can limit the diff via
> > +   the initial `diff-tree` invocation.
> 
> Which of the above limitations are fundamental, and which are merely
> due to incomplete implementation that could be improved in the
> future iterations?  

Thinking about this some more, I'm a bit unsure whether
git-diff-pairs(1) should support "normal" `--raw` input. Furthermore, if
we do want to support it, maybe it should be the default?

From my perspective, ultimately I don't think there is much additional
value provided by supporting multiple input options for
git-diff-pairs(1) since the end result would be the same and its just an
intermediate format. As I see it, the benefit of the NUL delimited raw
diff ouput format is that it is a bit simpler to parse and likely a bit
more efficient as it wouldn't have to deal with unquoting paths with
special characters. The benefit of the "normal" raw format is probably
that it is the more intuitive default option.

I'm certainly interested in what folks think about this :)

For abbreviated object IDs, supporting them would make the input format
more flexible, but it would be simpler to just require the full OID be
provided thus making the input format more explicit. My current thinking
is to leave this unless others think it would be useful to support.

Regarding pathspec support, being that git-diff-pairs(1) operates solely
on the provided set of file pairs produced via some other Git operation,
I don't think further limiting would provide much additional value
either. If we do want this though, I think support could be added in the
future.

> Without reading the code deeply, a lot of them
> look like merely due to this iteration being at a WIP state and not
> quite ready for the general public.
> 
> What is especially curious is the reason why it is limited to
> diff-tree (by the way, don't you require '-r' if you are fed
> 'diff-tree' output, or are you prepared to expand tree objects in
> the input yourself?).

The tree objects in the input are not expanded. With `git diff-pairs
--raw` these objects are just printed again. With the `--patch` option,
they are just ommitted.

> I can guess that the 0{40} object names in the postimage to signal
> paths with working tree changes unadded to the index is something
> this fundamentally cannot work with, but you should be able to grok
> 'diff-index --cached', which does not have that issue, just fine.

I'll rework the documentation in the next version. git-diff-tree(1) is
the command I have in mind as the common usecase to use in combination
with git-diff-pairs(1), but it is not solely limited to it. As you
mentioned, there are other commands that could be used to provide input
here.

> > diff --git a/Documentation/meson.build b/Documentation/meson.build
> > index ead8e48213..e5ee177022 100644
> > --- a/Documentation/meson.build
> > +++ b/Documentation/meson.build
> > @@ -41,6 +41,7 @@ manpages = {
> >    'git-diagnose.adoc' : 1,
> >    'git-diff-files.adoc' : 1,
> >    'git-diff-index.adoc' : 1,
> > +  'git-diff-pairs.adoc' : 1,
> >    'git-difftool.adoc' : 1,
> >    'git-diff-tree.adoc' : 1,
> >    'git-diff.adoc' : 1,
> 
> This apparently does not apply to 'master' and the base at least
> needs to contain 1f010d6b (doc: use .adoc extension for AsciiDoc
> files, 2025-01-20).  Please clearly mark the series as such in the
> cover letter if the series is not built on top of recent 'master'
> (or 'maint' if it is a series to fix breakage, but it does not apply
> to this series).

Will do

Thanks
-Justin
Justin Tobler Feb. 19, 2025, 10:38 p.m. UTC | #9
On 25/02/19 01:57PM, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
> 
> >> I think only accepting NUL terminated input is fine, but if we want to
> >> accept other formats we should  have a plan for how to do that in a
> >> backwards compatible way as we cannot use `-z` to distinguish between input
> >> formats.
> >
> > If in the future we want to support the normal format, we could introduce
> > an `--input-format=normal` option or something along those lines. 
> 
> Please don't.  Have an explicit '-z' option from the beginning, and
> if the initial version is incapable of reading from text input, then
> it is perfectly fine to have
> 
> 	if (!nul_termination)
> 		die(_("working without -z not supported (yet)");
> 
> Otherwise people have to remember that unlike everybody else that
> uses "-z" to signal NUL termination, this one alone wants to use a
> "--input-format" option that nobody else uses.

Thanks, I think this is a much better approach! :)

> 
> >> > +	/* Don't allow pathspecs at all. */
> >> > +	if (revs.prune_data.nr)
> >> > +		usage_with_options(usage, options);
> 
> Hmph, this is very unfortuate.
> 
> The "--raw" format was originally designed as an interchange format
> between the frontend and backend.  
> 
> The frontend programs take two sets of contents stored in various
> places (like tree vs index, tree vs another tree) and express
> comparison of corresponding paths in (<from mode+contents> <to
> mode+contents> <path>) tuples" (a rough equivalent to what we
> internally have on the diff_queued_diff queue in core).
> 
> The "--raw" format was designed to "dump" what is in the
> diff_queued_diff list.
> 
> And then it would be passed to the single backend, that takes
> "--raw" format, pass them through the diffcore transform machinery
> (like matching removal and addition to detect renames), and produce
> various forms of output (like patch, diffstat, etc.).
> 
> To me, what you are writing is the output phase of that pipeline,
> i.e. the backend.  We do want to (evantually) be able to filter with
> pathspec, and all other things the current diff machinery does after
> the existing "all-in-one" "git diff" and "git diff-{files,index,tree}"
> commands do from their call to diffcore_std() and diffcore_flush().
> 
> The revisions option parsing machinery does accept options that
> would *not* make sense to expect for them to make any difference to
> the result of running "diff".  Rejecting them is a nice thing to
> have, e.g. "git diff --no-merges HEAD^ HEAD" does not error out, but
> some people may want it to barf (I don't care---I am not sick enough
> to give apparently nonsense options to random commands), but it is
> perfectly fine to start your implementation with "nonsense options
> may be ignored".
> 
> But in a "git diff-* -z | git diff-pairs -z" pipeline, I do not see
> a particular reason why you would want to forbid the downstream
> command to further limit the paths it processes with its own
> pathspec, e.g.
> 
>     git diff-tree -z --raw A B -- t/ | git diff-pairs -z t/helper/
> 
> sounds like a perfectly sensible request to grant.
> 
> My recommendation is to avoid deciding to reject things your initial
> implementation happens not to support (yet) too early.  In the end,
> we want this backend half just as powerful as, if not more than, the
> real "git diff" machinery that has both front- and backend in the
> same binary.

Ok, that makes sense. I was originally thinking pathspec limiting could
just be handled upstream, but it probably doesn't make much to
arbitrarily limit this functionality and remain more flexible. I'll do
this in the next version.

Thanks
-Justin
Junio C Hamano Feb. 19, 2025, 11:19 p.m. UTC | #10
Justin Tobler <jltobler@gmail.com> writes:

> Thinking about this some more, I'm a bit unsure whether
> git-diff-pairs(1) should support "normal" `--raw` input. Furthermore, if
> we do want to support it, maybe it should be the default?
>
> From my perspective, ultimately I don't think there is much additional
> value provided by supporting multiple input options for
> git-diff-pairs(1) since the end result would be the same and its just an
> intermediate format. As I see it, the benefit of the NUL delimited raw
> diff ouput format is that it is a bit simpler to parse and likely a bit
> more efficient as it wouldn't have to deal with unquoting paths with
> special characters. The benefit of the "normal" raw format is probably
> that it is the more intuitive default option.
>
> I'm certainly interested in what folks think about this :)

FWIW, in our toolset, "-z" is not the default primarily because text
format were chosen to help debuggability, which used to really matter
in the early days.

> For abbreviated object IDs, supporting them would make the input format
> more flexible, but it would be simpler to just require the full OID be
> provided thus making the input format more explicit. My current thinking
> is to leave this unless others think it would be useful to support.

Abbreviated object names would only be at "might be nice to have"
level, I would think.  We are talking about tools-to-tools
communication after all.

> Regarding pathspec support, being that git-diff-pairs(1) operates solely
> on the provided set of file pairs produced via some other Git operation,
> I don't think further limiting would provide much additional value
> either. If we do want this though, I think support could be added in the
> future.

Another consideration is which side of the pipeline should take the
responsibility to invoke the diffcore machinery.  We certainly could
make it the job for the upstream/frontend, in which case diff-pairs
does not have to call into diffcore-rename, BUT it also means the
downstream/backend needs to be able to parse two paths (renamed from
and renamed to).  Or we could make it the job for the downstream,
and forbid the upstream/frontend from feeding renamed pairs (i.e.
any input with status letter R or C are invalid), in which case
diff-pairs can choose to invoke rename detection or not by paying
attention to the -M option and invoking diffcore_rename() itself
(which should be at no-cost from coding point of view, as it should
be just the matter of calling diffcore_std()).

> The tree objects in the input are not expanded. With `git diff-pairs
> --raw` these objects are just printed again. With the `--patch` option,
> they are just ommitted.

Instead of getting expanded into its subpaths?

>> This apparently does not apply to 'master' and the base at least
>> needs to contain 1f010d6b (doc: use .adoc extension for AsciiDoc
>> files, 2025-01-20).  Please clearly mark the series as such in the
>> cover letter if the series is not built on top of recent 'master'
>> (or 'maint' if it is a series to fix breakage, but it does not apply
>> to this series).
>
> Will do

No longer needed, as the tip of 'master' now lives in the .adoc
world.  Hurrah!
Junio C Hamano Feb. 19, 2025, 11:47 p.m. UTC | #11
Junio C Hamano <gitster@pobox.com> writes:

>> Regarding pathspec support, being that git-diff-pairs(1) operates solely
>> on the provided set of file pairs produced via some other Git operation,
>> I don't think further limiting would provide much additional value
>> either. If we do want this though, I think support could be added in the
>> future.
>
> Another consideration is which side of the pipeline should take the
> responsibility to invoke the diffcore machinery.  We certainly could
> make it the job for the upstream/frontend, in which case diff-pairs
> does not have to call into diffcore-rename, BUT it also means the
> downstream/backend needs to be able to parse two paths (renamed from
> and renamed to).  Or we could make it the job for the downstream,
> and forbid the upstream/frontend from feeding renamed pairs (i.e.
> any input with status letter R or C are invalid), in which case
> diff-pairs can choose to invoke rename detection or not by paying
> attention to the -M option and invoking diffcore_rename() itself
> (which should be at no-cost from coding point of view, as it should
> be just the matter of calling diffcore_std()).

Sorry, but I hit <SEND> too early before finishing the most
important part.  We can move the features between upstream frontends
and downstream diff-pairs.  Depending on our goals, the best
division of labor would be different.  If we want to make it easy
for people to write their custom frontends, for example, it might
make sense to allow them to be as stupid and simple as possible and
make all the heavy lifting the responsibility of the diff-pairs
backend, which is the shared resource these frontends share and rely
on (so that they have incentive to help us make sure diff-pairs will
stay correct and performant).  If on the other hand we want to allow
people to do fancy processing in their custom frontends, maybe keeping
diff-pairs as stupid and transparent would be a better option to give
the people who write upstream/frontends more predictable behaviour.

Where to do the pathspec limiting is one of these things.  You could
make it responsibility for the frontends if we assume that frontends
must do their own limiting.  Or you could make it an optional feature
of the backends, so that frontends that does not do its own limiting
can ask diff-pairs to limit.  Which side to burden more really depends
on whose job we are trying to make it easier.

Thanks.
Justin Tobler Feb. 20, 2025, 12:32 a.m. UTC | #12
On 25/02/19 03:47PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> >> Regarding pathspec support, being that git-diff-pairs(1) operates solely
> >> on the provided set of file pairs produced via some other Git operation,
> >> I don't think further limiting would provide much additional value
> >> either. If we do want this though, I think support could be added in the
> >> future.
> >
> > Another consideration is which side of the pipeline should take the
> > responsibility to invoke the diffcore machinery.  We certainly could
> > make it the job for the upstream/frontend, in which case diff-pairs
> > does not have to call into diffcore-rename, BUT it also means the
> > downstream/backend needs to be able to parse two paths (renamed from
> > and renamed to).  Or we could make it the job for the downstream,
> > and forbid the upstream/frontend from feeding renamed pairs (i.e.
> > any input with status letter R or C are invalid), in which case
> > diff-pairs can choose to invoke rename detection or not by paying
> > attention to the -M option and invoking diffcore_rename() itself
> > (which should be at no-cost from coding point of view, as it should
> > be just the matter of calling diffcore_std()).

In the current implementation, diff-pairs is capable of handling input
containing rename/copy filepairs computed upstream. It does so by
parsing the input line and manually setting the status, score, and paths
for the queued `diff_filepair`.

I think diff-pairs should support rename and copy input as it would
allow for rename/copy detection to be performed upfront in a single
pass by the upstream and the resulting output could be split up and fed
to separate downstream diff-pairs. This is particularly useful for
server-side diffs to break up what would be large diffs.

> Sorry, but I hit <SEND> too early before finishing the most
> important part.  We can move the features between upstream frontends
> and downstream diff-pairs.  Depending on our goals, the best
> division of labor would be different.  If we want to make it easy
> for people to write their custom frontends, for example, it might
> make sense to allow them to be as stupid and simple as possible and
> make all the heavy lifting the responsibility of the diff-pairs
> backend, which is the shared resource these frontends share and rely
> on (so that they have incentive to help us make sure diff-pairs will
> stay correct and performant).  If on the other hand we want to allow
> people to do fancy processing in their custom frontends, maybe keeping
> diff-pairs as stupid and transparent would be a better option to give
> the people who write upstream/frontends more predictable behaviour.
> 
> Where to do the pathspec limiting is one of these things.  You could
> make it responsibility for the frontends if we assume that frontends
> must do their own limiting.  Or you could make it an optional feature
> of the backends, so that frontends that does not do its own limiting
> can ask diff-pairs to limit.  Which side to burden more really depends
> on whose job we are trying to make it easier.

For the server-side diff usecase, I think that aligns more towards
having a front-end that does more of the heavy lifting such rename/copy
detection and pathspec limiting, while the diff-pairs really just needs
to compute the individual diffs for the already specified file pairs.

I do see value though in keeping the door open for diff-pairs to become
more robust and flexible. Maybe it would be fine for now to say pathspec
limiting is not supported, but it could be in the future?

> >> The tree objects in the input are not expanded. With `git diff-pairs
> >> --raw` these objects are just printed again. With the `--patch` option,
> >> they are just ommitted.
> 
> >Instead of getting expanded into its subpaths?

The current implementation of diff-pairs is rather simple. It relies on
the upstream to feed it the file pairs with all the info upfront so it
can setup the diff queue. This means input with tree objects is also
queued as-is without being expanded further. I could maybe see a future
though where we want diff-pairs to be a more robust backend and supports
expanding these paths via -r option. Following previous discussion,
maybe it's fine to keep the initial implementation of diff-pairs on the
simple side for now. We could make diff-pairs die() for now if the -r
option is explicitly set.

Thanks
-Justin
Justin Tobler Feb. 20, 2025, 2:56 p.m. UTC | #13
On 25/02/19 06:32PM, Justin Tobler wrote:
> > >> The tree objects in the input are not expanded. With `git diff-pairs
> > >> --raw` these objects are just printed again. With the `--patch` option,
> > >> they are just ommitted.
> > 
> > >Instead of getting expanded into its subpaths?
> 
> The current implementation of diff-pairs is rather simple. It relies on
> the upstream to feed it the file pairs with all the info upfront so it
> can setup the diff queue. This means input with tree objects is also
> queued as-is without being expanded further. I could maybe see a future
> though where we want diff-pairs to be a more robust backend and supports
> expanding these paths via -r option. Following previous discussion,
> maybe it's fine to keep the initial implementation of diff-pairs on the
> simple side for now. We could make diff-pairs die() for now if the -r
> option is explicitly set.

Thinking about this some more, adding support to expand trees in
diff-pairs would alter patch output behavior. To better enable backwards
compatible inclusion of this feature in the future, we may just want to
die() for now if any tree object is present in diff-pairs input.

-Justin
Junio C Hamano Feb. 20, 2025, 4:14 p.m. UTC | #14
Justin Tobler <jltobler@gmail.com> writes:

> Thinking about this some more, adding support to expand trees in
> diff-pairs would alter patch output behavior. To better enable backwards
> compatible inclusion of this feature in the future, we may just want to
> die() for now if any tree object is present in diff-pairs input.

Sounds sensible.

Thanks.
diff mbox series

Patch

diff --git a/.gitignore b/.gitignore
index e82aa19df0..03448c076a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -54,6 +54,7 @@ 
 /git-diff
 /git-diff-files
 /git-diff-index
+/git-diff-pairs
 /git-diff-tree
 /git-difftool
 /git-difftool--helper
diff --git a/Documentation/git-diff-pairs.adoc b/Documentation/git-diff-pairs.adoc
new file mode 100644
index 0000000000..e9ef4a6615
--- /dev/null
+++ b/Documentation/git-diff-pairs.adoc
@@ -0,0 +1,62 @@ 
+git-diff-pairs(1)
+=================
+
+NAME
+----
+git-diff-pairs - Compare blob pairs generated by `diff-tree --raw`
+
+SYNOPSIS
+--------
+[verse]
+'git diff-pairs' [diff-options]
+
+DESCRIPTION
+-----------
+
+Given the output of `diff-tree -z` on its stdin, `diff-pairs` will
+reformat that output into whatever format is requested on its command
+line.  For example:
+
+-----------------------------
+git diff-tree -z -M $a $b |
+git diff-pairs -p
+-----------------------------
+
+will compute the tree diff in one step (including renames), and then
+`diff-pairs` will compute and format the blob-level diffs for each pair.
+This can be used to modify the raw diff in the middle (without having to
+parse or re-create more complicated formats like `--patch`), or to
+compute diffs progressively over the course of multiple invocations of
+`diff-pairs`.
+
+Each blob pair is fed to the diff machinery individually queued and the output
+is flushed on stdin EOF.
+
+OPTIONS
+-------
+
+include::diff-options.adoc[]
+
+include::diff-generate-patch.adoc[]
+
+NOTES
+----
+
+`diff-pairs` should handle any input generated by `diff-tree --raw -z`.
+It may choke or otherwise misbehave on output from `diff-files`, etc.
+
+Here's an incomplete list of things that `diff-pairs` could do, but
+doesn't (mostly in the name of simplicity):
+
+ - Only `-z` input is accepted, not normal `--raw` input.
+
+ - Abbreviated sha1s are rejected in the input from `diff-tree`; if you
+   want to abbreviate the output, you can pass `--abbrev` to
+   `diff-pairs`.
+
+ - Pathspecs are not handled by `diff-pairs`; you can limit the diff via
+   the initial `diff-tree` invocation.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index ead8e48213..e5ee177022 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -41,6 +41,7 @@  manpages = {
   'git-diagnose.adoc' : 1,
   'git-diff-files.adoc' : 1,
   'git-diff-index.adoc' : 1,
+  'git-diff-pairs.adoc' : 1,
   'git-difftool.adoc' : 1,
   'git-diff-tree.adoc' : 1,
   'git-diff.adoc' : 1,
diff --git a/Makefile b/Makefile
index 896d02339e..3b8e1ad15e 100644
--- a/Makefile
+++ b/Makefile
@@ -1232,6 +1232,7 @@  BUILTIN_OBJS += builtin/describe.o
 BUILTIN_OBJS += builtin/diagnose.o
 BUILTIN_OBJS += builtin/diff-files.o
 BUILTIN_OBJS += builtin/diff-index.o
+BUILTIN_OBJS += builtin/diff-pairs.o
 BUILTIN_OBJS += builtin/diff-tree.o
 BUILTIN_OBJS += builtin/diff.o
 BUILTIN_OBJS += builtin/difftool.o
diff --git a/builtin.h b/builtin.h
index f7b166b334..b2d2e9eb07 100644
--- a/builtin.h
+++ b/builtin.h
@@ -152,6 +152,7 @@  int cmd_diagnose(int argc, const char **argv, const char *prefix, struct reposit
 int cmd_diff_files(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_diff_index(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_diff(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_diff_pairs(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_diff_tree(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_difftool(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_env__helper(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c
new file mode 100644
index 0000000000..08f3ee81e5
--- /dev/null
+++ b/builtin/diff-pairs.c
@@ -0,0 +1,178 @@ 
+#include "builtin.h"
+#include "commit.h"
+#include "config.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "gettext.h"
+#include "hex.h"
+#include "object.h"
+#include "parse-options.h"
+#include "revision.h"
+#include "strbuf.h"
+
+static unsigned parse_mode_or_die(const char *mode, const char **endp)
+{
+	uint16_t ret;
+
+	*endp = parse_mode(mode, &ret);
+	if (!*endp)
+		die("unable to parse mode: %s", mode);
+	return ret;
+}
+
+static void parse_oid(const char *p, struct object_id *oid, const char **endp,
+		      const struct git_hash_algo *algop)
+{
+	if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ')
+		die("unable to parse object id: %s", p);
+}
+
+static unsigned short parse_score(const char *score)
+{
+	unsigned long ret;
+	char *endp;
+
+	errno = 0;
+	ret = strtoul(score, &endp, 10);
+	ret *= MAX_SCORE / 100;
+	if (errno || endp == score || *endp || (unsigned short)ret != ret)
+		die("unable to parse rename/copy score: %s", score);
+	return ret;
+}
+
+static void flush_diff_queue(struct diff_options *options)
+{
+	/*
+	 * If rename detection is not requested, use rename information from the
+	 * raw diff formatted input. Setting found_follow ensures diffcore_std()
+	 * does not mess with rename information already present in queued
+	 * filepairs.
+	 */
+	if (!options->detect_rename)
+		options->found_follow = 1;
+	diffcore_std(options);
+	diff_flush(options);
+}
+
+int cmd_diff_pairs(int argc, const char **argv, const char *prefix,
+		   struct repository *repo)
+{
+	struct strbuf path_dst = STRBUF_INIT;
+	struct strbuf path = STRBUF_INIT;
+	struct strbuf meta = STRBUF_INIT;
+	struct rev_info revs;
+	int ret;
+
+	const char * const usage[] = {
+		N_("git diff-pairs [diff-options]"),
+		NULL
+	};
+	struct option options[] = {
+		OPT_END()
+	};
+
+	show_usage_with_options_if_asked(argc, argv, usage, options);
+
+	repo_init_revisions(repo, &revs, prefix);
+	repo_config(repo, git_diff_basic_config, NULL);
+	revs.disable_stdin = 1;
+	revs.abbrev = 0;
+	revs.diff = 1;
+
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	/* Don't allow pathspecs at all. */
+	if (revs.prune_data.nr)
+		usage_with_options(usage, options);
+
+	if (!revs.diffopt.output_format)
+		revs.diffopt.output_format = DIFF_FORMAT_RAW;
+
+	while (1) {
+		struct object_id oid_a, oid_b;
+		struct diff_filepair *pair;
+		unsigned mode_a, mode_b;
+		const char *p;
+		char status;
+
+		if (strbuf_getline_nul(&meta, stdin) == EOF)
+			break;
+
+		p = meta.buf;
+		if (*p != ':')
+			die("invalid raw diff input");
+		p++;
+
+		mode_a = parse_mode_or_die(p, &p);
+		mode_b = parse_mode_or_die(p, &p);
+
+		parse_oid(p, &oid_a, &p, repo->hash_algo);
+		parse_oid(p, &oid_b, &p, repo->hash_algo);
+
+		status = *p++;
+
+		if (strbuf_getline_nul(&path, stdin) == EOF)
+			die("got EOF while reading path");
+
+		switch (status) {
+		case DIFF_STATUS_ADDED:
+			pair = diff_filepair_addremove(&revs.diffopt, '+',
+						       mode_b, &oid_b,
+						       1, path.buf, 0);
+			if (pair)
+				pair->status = status;
+			break;
+
+		case DIFF_STATUS_DELETED:
+			pair = diff_filepair_addremove(&revs.diffopt, '-',
+						       mode_a, &oid_a,
+						       1, path.buf, 0);
+			if (pair)
+				pair->status = status;
+			break;
+
+		case DIFF_STATUS_TYPE_CHANGED:
+		case DIFF_STATUS_MODIFIED:
+			pair = diff_filepair_change(&revs.diffopt,
+						    mode_a, mode_b,
+						    &oid_a, &oid_b, 1, 1,
+						    path.buf, 0, 0);
+			if (pair)
+				pair->status = status;
+			break;
+
+		case DIFF_STATUS_RENAMED:
+		case DIFF_STATUS_COPIED:
+			{
+				struct diff_filespec *a, *b;
+
+				if (strbuf_getline_nul(&path_dst, stdin) == EOF)
+					die("got EOF while reading destination path");
+
+				a = alloc_filespec(path.buf);
+				b = alloc_filespec(path_dst.buf);
+				fill_filespec(a, &oid_a, 1, mode_a);
+				fill_filespec(b, &oid_b, 1, mode_b);
+
+				pair = diff_queue(&diff_queued_diff, a, b);
+				pair->status = status;
+				pair->score = parse_score(p);
+				pair->renamed_pair = 1;
+			}
+			break;
+
+		default:
+			die("unknown diff status: %c", status);
+		}
+	}
+
+	flush_diff_queue(&revs.diffopt);
+	ret = diff_result_code(&revs);
+
+	strbuf_release(&path_dst);
+	strbuf_release(&path);
+	strbuf_release(&meta);
+	release_revisions(&revs);
+
+	return ret;
+}
diff --git a/command-list.txt b/command-list.txt
index e0bb87b3b5..bb8acd51d8 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -95,6 +95,7 @@  git-diagnose                            ancillaryinterrogators
 git-diff                                mainporcelain           info
 git-diff-files                          plumbinginterrogators
 git-diff-index                          plumbinginterrogators
+git-diff-pairs                          plumbinginterrogators
 git-diff-tree                           plumbinginterrogators
 git-difftool                            ancillaryinterrogators          complete
 git-fast-export                         ancillarymanipulators
diff --git a/git.c b/git.c
index b23761480f..12bba872bb 100644
--- a/git.c
+++ b/git.c
@@ -540,6 +540,7 @@  static struct cmd_struct commands[] = {
 	{ "diff", cmd_diff, NO_PARSEOPT },
 	{ "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "diff-index", cmd_diff_index, RUN_SETUP | NO_PARSEOPT },
+	{ "diff-pairs", cmd_diff_pairs, RUN_SETUP | NO_PARSEOPT },
 	{ "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT },
 	{ "difftool", cmd_difftool, RUN_SETUP_GENTLY },
 	{ "fast-export", cmd_fast_export, RUN_SETUP },
diff --git a/meson.build b/meson.build
index fbb8105d96..66ce3326e8 100644
--- a/meson.build
+++ b/meson.build
@@ -537,6 +537,7 @@  builtin_sources = [
   'builtin/diagnose.c',
   'builtin/diff-files.c',
   'builtin/diff-index.c',
+  'builtin/diff-pairs.c',
   'builtin/diff-tree.c',
   'builtin/diff.c',
   'builtin/difftool.c',
diff --git a/t/meson.build b/t/meson.build
index 4574280590..7ff17c6d29 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -500,6 +500,7 @@  integration_tests = [
   't4067-diff-partial-clone.sh',
   't4068-diff-symmetric-merge-base.sh',
   't4069-remerge-diff.sh',
+  't4070-diff-pairs.sh',
   't4100-apply-stat.sh',
   't4101-apply-nonl.sh',
   't4102-apply-rename.sh',
diff --git a/t/t4070-diff-pairs.sh b/t/t4070-diff-pairs.sh
new file mode 100755
index 0000000000..e0a8e6f0a0
--- /dev/null
+++ b/t/t4070-diff-pairs.sh
@@ -0,0 +1,80 @@ 
+#!/bin/sh
+
+test_description='basic diff-pairs tests'
+. ./test-lib.sh
+
+# This creates a diff with added, modified, deleted, renamed, copied, and
+# typechange entries. That includes one in a subdirectory for non-recursive
+# tests, and both exact and inexact similarity scores.
+test_expect_success 'create commit with various diffs' '
+	echo to-be-gone >deleted &&
+	echo original >modified &&
+	echo now-a-file >symlink &&
+	test_seq 200 >two-hundred &&
+	test_seq 201 500 >five-hundred &&
+	git add . &&
+	test_tick &&
+	git commit -m base &&
+	git tag base &&
+
+	echo now-here >added &&
+	echo new >modified &&
+	rm deleted &&
+	mkdir subdir &&
+	echo content >subdir/file &&
+	mv two-hundred renamed &&
+	test_seq 201 500 | sed s/300/modified/ >copied &&
+	rm symlink &&
+	git add -A . &&
+	test_ln_s_add dest symlink &&
+	test_tick &&
+	git commit -m new &&
+	git tag new
+'
+
+test_expect_success 'diff-pairs recreates --raw' '
+	git diff-tree -r -M -C -C base new >expect &&
+	git diff-tree -r -M -C -C -z base new |
+	git diff-pairs >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff-pairs can create -p output' '
+	git diff-tree -p -M -C -C base new >expect &&
+	git diff-tree -r -M -C -C -z base new |
+	git diff-pairs -p >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'non-recursive --raw retains tree entry' '
+	git diff-tree base new >expect &&
+	git diff-tree -z base new |
+	git diff-pairs >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'split input across multiple diff-pairs' '
+	write_script split-raw-diff "$PERL_PATH" <<-\EOF &&
+	$/ = "\0";
+	while (<>) {
+	  my $meta = $_;
+	  my $path = <>;
+	  # renames have an extra path
+	  my $path2 = <> if $meta =~ /[RC]\d+/;
+
+	  open(my $fh, ">", sprintf "diff%03d", $.);
+	  print $fh $meta, $path, $path2;
+	}
+	EOF
+
+	git diff-tree -p -M -C -C base new >expect &&
+
+	git diff-tree -r -z -M -C -C base new |
+	./split-raw-diff &&
+	for i in diff*; do
+		git diff-pairs -p <$i || return 1
+	done >actual &&
+	test_cmp expect actual
+'
+
+test_done