[0/3] batch blob diff generation

Message ID	20241213042312.2890841-1-jltobler@gmail.com (mailing list archive)
Headers	show Received: from mail-oo1-f50.google.com (mail-oo1-f50.google.com [209.85.161.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CF69A17BB1C for <git@vger.kernel.org>; Fri, 13 Dec 2024 04:26:13 +0000 (UTC) From: Justin Tobler <jltobler@gmail.com> To: git@vger.kernel.org Cc: ps@pks.im, Justin Tobler <jltobler@gmail.com> Subject: [PATCH 0/3] batch blob diff generation Date: Thu, 12 Dec 2024 22:23:09 -0600 Message-ID: <20241213042312.2890841-1-jltobler@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	batch blob diff generation \| expand [0/3] batch blob diff generation [1/3] builtin: introduce diff-blob command [2/3] builtin/diff-blob: add "--stdin" option [3/3] builtin/diff-blob: Add "-z" option

Justin Tobler Dec. 13, 2024, 4:23 a.m. UTC

Through git-diff(1) it is possible to generate a diff directly between
two blobs. This is particularly useful when the pre-image and post-image
blobs are known and we only care about the diff between them.
Unfortunately, if a user has a batch of known blob pairs to compute
diffs for, there is currently not a way to do so via a single Git
process.

To enable support for batch diffs of multiple blob pairs, this
series introduces a new diff plumbing command git-diff-blob(1). Similar
to git-diff-tree(1), it provides a "--stdin" option that reads a pair of
blobs on each line of input and generates the diffs. This is intended to
be used for scripting purposes where more fine-grained control for diff
generation is desired. Below is an example for each usage:

    $ git diff-blob HEAD~5000:README.md HEAD:README.md

    $ git diff-blob --stdin <<EOF
    88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b
    HEAD~5000:README.md HEAD:README.md
    EOF

Some alternative approaches that were considered:
Instead of creating a new plumbing command, the existing git-diff(1)
could have been extended with a similar "--batch" option ("--stdin" is
techinically already handled through setup_revisions() since it isn't
disabled). This option could read from stdin and generate diffs for any
valid revision pair that gets provided (not just blob diffs). The
primary reason for not going down this route was that git-diff-tree(1)
already has support for batch diff generation for commits/trees through
its "--stdin" option and teaching git-diff(1) a superset of this
functionality would further complicate this porcelain interface for
something that seems like more of a plumbing feature.

Another idea was to extend the existing git-diff-tree(1) to support
generating diffs for blob pairs through its "--stdin" option. This
didn't seem like a good fit either though as it is really outside the
scope of responsibilities for that command.

Ultimately I couldn't find an existing place that seemed like a good fit
thus the new plumbing command route was chosen. I'm still not sure
though if a standalone "diff-blob" command is the right choice here
either. Its primary function of generating a single blob pair diff is a
direct subset of git-diff(1) and is thus largely redundant. The only
additional value comes from its "--stdin" option which enables batch
processing. To an extent it seems much of the existing diff plumbing
commands feature set can also be accessed through git-diff(1) so maybe
this isn't a big deal. Feedback and suggestions are much appreciated.

This series is structured as follows:

    - Patch 1 introduces the "diff-blob" plumbing command and its
      surrounding setup.

    - Patch 2 teaches "diff-blob" the "--stdin" option which allows
      multiple blob pair diffs to be specified and processed.

    - Patch 3 teaches "diff-blob" the "-z" option which, when used
      with "--stdin", uses the NUL character to delimit the inputed
      blobs and outputted diffs.

The series is built on top of caacdb5d (The fifteenth batch, 2024-12-10)
with ps/build at 904339ed (Introduce support for the Meson build system,
2024-12-06) merged into it. This is done so the new command is
integrated with the meson build system.

-Justin

Justin Tobler (3):
  builtin: introduce diff-blob command
  builtin/diff-blob: add "--stdin" option
  builtin/diff-blob: Add "-z" option

 .gitignore                      |   1 +
 Documentation/git-diff-blob.txt |  39 +++++++
 Documentation/meson.build       |   1 +
 Makefile                        |   1 +
 builtin.h                       |   1 +
 builtin/diff-blob.c             | 200 ++++++++++++++++++++++++++++++++
 command-list.txt                |   1 +
 git.c                           |   1 +
 meson.build                     |   1 +
 t/t4063-diff-blobs.sh           | 108 ++++++++++-------
 10 files changed, 309 insertions(+), 45 deletions(-)
 create mode 100644 Documentation/git-diff-blob.txt
 create mode 100644 builtin/diff-blob.c

Jeff King Dec. 13, 2024, 8:12 a.m. UTC | #1

On Thu, Dec 12, 2024 at 10:23:09PM -0600, Justin Tobler wrote:

> To enable support for batch diffs of multiple blob pairs, this
> series introduces a new diff plumbing command git-diff-blob(1). Similar
> to git-diff-tree(1), it provides a "--stdin" option that reads a pair of
> blobs on each line of input and generates the diffs. This is intended to
> be used for scripting purposes where more fine-grained control for diff
> generation is desired. Below is an example for each usage:
> 
>     $ git diff-blob HEAD~5000:README.md HEAD:README.md
> 
>     $ git diff-blob --stdin <<EOF
>     88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b
>     HEAD~5000:README.md HEAD:README.md
>     EOF

In the first example, I think just using "git diff" would work (though
it is not a plumbing command). But the stdin example is what's
interesting here anyway, since it can handle arbitrary inputs. So let's
focus on that.

Feeding just blob ids has a big drawback: we don't have any context! So
you get bogus filenames in the patch, no mode data, and so on.

Feeding the paths along with their commits, as you do on the second
line, gives you those things from the lookup context. But it also has
some problems. One, it's needlessly expensive; we have to traverse
HEAD~5000, and then dig into its tree to find the blobs (which
presumably you already did, since how else would you end up with those
oids). And two, there are parsing ambiguities, since arbitrary revision
names can contain spaces. E.g., are we looking for the file "README.md
HEAD:README.md" in HEAD~5000?

So ideally we'd have an input format that encapsulates that extra
context data and provides some mechanism for quoting. And it turns out
we do: the --raw diff format.

If the program takes that format, then you can manually feed it two
arbitrary blob oids if you have them (and put whatever you like for the
mode/path context), like:

  git diff-blob --stdin <<\EOF
  :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M	README.md
  EOF

Or you can get the real context yourself (though it seems to me that
this is a gap in what "cat-file --batch" should be able to do in a
single process):

  git ls-tree HEAD~5000 README.md >out
  read mode_a blob oid_a path <out
  git ls-tree HEAD README.md >out
  read mode_b blob oid_b path <out
  printf ":$mode_a $mode_b $oid_a $oid_b M\tREADME.md" |
  git diff-blob --stdin

But it also means you can use --raw output directly. So:

  git diff-tree --raw -r HEAD~5000 HEAD -- README.md |
  git diff-blob --stdin

Now that command by itself doesn't look all that useful; you could have
just asked for patches from diff-tree. But by splitting the two, you can
filter the set of paths in between (for example, to omit some entries,
or to batch a large diff into more manageable chunks for pagination,
etc).

The patch might look something like this:

  https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/

:) That is what has been powering the diffs at github.com since 2016 or
so. And continues to do so, as far as I know. I don't have access to
their internal repository anymore, but I've continued to rebase the
topic forward in my personal repo. You can fetch it from:

  https://github.com/peff/git jk/diff-pairs

in case that is helpful.

-Peff

Junio C Hamano Dec. 13, 2024, 10:17 a.m. UTC | #2

Jeff King <peff@peff.net> writes:

> Feeding just blob ids has a big drawback: we don't have any context! So
> you get bogus filenames in the patch, no mode data, and so on.

And the lack of filenames and the tree object name at the root level
means you do not get anything out of the attribute subsystem, which
in turn may affect a few more things.

Unfortunately the format used in the output from "diff --raw" does
not capture this.

Does this want to be a building block for the server side diff?
Would it be a bit too low level for each "request" to comparing only
two blob objects?  Can we place a lot more assumptions on the
relationship among blob pairs that appear in the --stdin request
(e.g., they all come from the same tree and share the same attr
stack to look up attributes applicable for them)?

Jeff King Dec. 13, 2024, 10:38 a.m. UTC | #3

On Fri, Dec 13, 2024 at 07:17:59PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Feeding just blob ids has a big drawback: we don't have any context! So
> > you get bogus filenames in the patch, no mode data, and so on.
> 
> And the lack of filenames and the tree object name at the root level
> means you do not get anything out of the attribute subsystem, which
> in turn may affect a few more things.
> 
> Unfortunately the format used in the output from "diff --raw" does
> not capture this.

Don't we just use the working tree .gitattributes by default, and ignore
what's in the endpoints? In a bare repo we wouldn't have that, but I
think the recently added attr.tree and --attr-source options would work
there.

You can't use attributes from multiple trees in a single request, but I
doubt that would be a big drawback.

-Peff

Justin Tobler Dec. 13, 2024, 4:41 p.m. UTC | #4

On 24/12/13 03:12AM, Jeff King wrote:

> In the first example, I think just using "git diff" would work (though
> it is not a plumbing command). But the stdin example is what's
> interesting here anyway, since it can handle arbitrary inputs. So let's
> focus on that.
> 
> Feeding just blob ids has a big drawback: we don't have any context! So
> you get bogus filenames in the patch, no mode data, and so on.
> 
> Feeding the paths along with their commits, as you do on the second
> line, gives you those things from the lookup context. But it also has
> some problems. One, it's needlessly expensive; we have to traverse
> HEAD~5000, and then dig into its tree to find the blobs (which
> presumably you already did, since how else would you end up with those
> oids). And two, there are parsing ambiguities, since arbitrary revision
> names can contain spaces. E.g., are we looking for the file "README.md
> HEAD:README.md" in HEAD~5000?
> 
> So ideally we'd have an input format that encapsulates that extra
> context data and provides some mechanism for quoting. And it turns out
> we do: the --raw diff format.

I had not considered using the raw diff format as the input source. As
you pointed out, using blob IDs alone loses some of the useful context.
By using path-scoped revisions, we can still get this context, but at an
added cost of having to traverse the tree to get the underlying
information.

As you also mentioned, this is potentially wasteful if, for example, the
blobs diffs you are trying to generate are a subset of git-diff-tree(1)
output and thus the context is already known ahead of time. Which is
exactly what we are hoping to accomplish.

> If the program takes that format, then you can manually feed it two
> arbitrary blob oids if you have them (and put whatever you like for the
> mode/path context), like:
> 
>   git diff-blob --stdin <<\EOF
>   :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M	README.md
>   EOF
> 
> Or you can get the real context yourself (though it seems to me that
> this is a gap in what "cat-file --batch" should be able to do in a
> single process):
> 
>   git ls-tree HEAD~5000 README.md >out
>   read mode_a blob oid_a path <out
>   git ls-tree HEAD README.md >out
>   read mode_b blob oid_b path <out
>   printf ":$mode_a $mode_b $oid_a $oid_b M\tREADME.md" |
>   git diff-blob --stdin
> 
> But it also means you can use --raw output directly. So:
> 
>   git diff-tree --raw -r HEAD~5000 HEAD -- README.md |
>   git diff-blob --stdin
> 
> Now that command by itself doesn't look all that useful; you could have
> just asked for patches from diff-tree. But by splitting the two, you can
> filter the set of paths in between (for example, to omit some entries,
> or to batch a large diff into more manageable chunks for pagination,
> etc).

Yup, this is exactly what I'm hoping to achieve! As a single commit may
contain an unbounded number changes, being able to control diff
generation at the blob level is quite useful. Using the raw diff format
as input looks like a rather elegant solution and I think it makes
sense to use it here for the "--stdin" option over just reading two
blobs.

> The patch might look something like this:
> 
>   https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/
> 
> :) That is what has been powering the diffs at github.com since 2016 or
> so. And continues to do so, as far as I know. I don't have access to
> their internal repository anymore, but I've continued to rebase the
> topic forward in my personal repo. You can fetch it from:
> 
>   https://github.com/peff/git jk/diff-pairs
> 
> in case that is helpful.

Thanks Peff! From looking at the mentioned thread and branch, it looks
like I'm basically trying to accomplish the same thing here. Just a bit
late to the conversation. :)

While the use-case is rather narrow, I think it would be nice to see
this functionality provided upstream. I see this as a means to faciliate
more fine-grained control of the blob diffs we actually want to compute
at a given time and it seems like it would be reasonable to expose as
part of the diff plumbing. I would certainly be interested in adapting
this series to instead use raw input from git-diff-tree(1) or trying to
revive the previous series if that is preferred. 

If there is interest in continuing, some lingering questions I have:

Being that the primary purpose of git-diff-blob(1) here is to handle
generating blob diffs as specified by stdin, is there any reason to have
a normal mode that accepts a blob pair as arguments? Or would it be best
to limit the input mechanism to stdin entirely? If the user wanted to
compute a single blob diff they could just use git-diff(1) already so
providing this as a part of git-diff-blob(1) is a bit redundant. Having
it as an option for the user does seem a bit more friendly though.

-Justin

Junio C Hamano Dec. 13, 2024, 10:34 p.m. UTC | #5

Jeff King <peff@peff.net> writes:

> So ideally we'd have an input format that encapsulates that extra
> context data and provides some mechanism for quoting. And it turns out
> we do: the --raw diff format.

Funny.  The raw diff format indeed was designed as an interchange
format from various "compare two sets of things" front-ends (like
diff-files, diff-cache, and diff-tree) that emits the raw format, to
be read by "diff-helper" (initially called "diff-tree-helper") that
takes the raw format and

 - matches removed and added paths with similar contents to detect
   renames and copies

 - computes the output in various formats including "patch".

So I guess we came a full circle, finally ;-).  Looking in the archive
for messages exchanged between junkio@ and torvalds@ mentioning diff
before 2005-05-30 finds some interesting gems.

https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/

Junio C Hamano Dec. 15, 2024, 2:07 a.m. UTC | #6

Jeff King <peff@peff.net> writes:

>> Unfortunately the format used in the output from "diff --raw" does
>> not capture this.
>
> Don't we just use the working tree .gitattributes by default, and ignore
> what's in the endpoints? In a bare repo we wouldn't have that, but I
> think the recently added attr.tree and --attr-source options would work
> there.

Yeah, you're right.  I forgot about attr.tree (does not seem to be
documented, by the way) and --attr-source & GIT_ATTR_SOURCE.  I
imagine that this feature is primarily meant to be used on the
server side, and in bare repositories, only "diff-tree" makes sense
among the diff-* family of commands, which (as server environments
lack "the index" nor "the working tree") would already be using
these mechanisms, so there is no new issues introduced here.

> You can't use attributes from multiple trees in a single request, but I
> doubt that would be a big drawback.

I think it is also true with the normal diff-tree and friends; I do
not think it looks up attributes from each tree independently when
you run "git diff-tree -r A B" to compare the blob in tree A that is
CRLF with the blob at the same path in tree B.  So we should be OK.

Thanks.

Junio C Hamano Dec. 15, 2024, 2:17 a.m. UTC | #7

Junio C Hamano <gitster@pobox.com> writes:

> Yeah, you're right.  I forgot about attr.tree (does not seem to be
> documented, by the way)

We do have an entry in Documentation/config/attr.txt that describes
the three; I simply assumed it is not documented as I didn't see it
mentioned in Documentation/git.txt where --attr-source &
GIT_ATTR_SOURCE are described.

We may want to add something like this, perhaps?

----- >8 -----
Subject: doc: give attr.tree a bit more visibility

In "git help config" output, attr.tree mentions both --attr-source
and GIT_ATTR_SOURCE, but the description of --attr-source and
GIT_ATTR_SOURCE that appear in "git help git", attr.tree is missing.

Add it so that these three are described together in both places.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/git.txt | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git c/Documentation/git.txt w/Documentation/git.txt
index d15a869762..13f6785408 100644
--- c/Documentation/git.txt
+++ w/Documentation/git.txt
@@ -228,7 +228,10 @@ If you just want to run git as if it was started in `<path>` then use
 --attr-source=<tree-ish>::
 	Read gitattributes from <tree-ish> instead of the worktree. See
 	linkgit:gitattributes[5]. This is equivalent to setting the
-	`GIT_ATTR_SOURCE` environment variable.
+	`GIT_ATTR_SOURCE` environment variable.  The `attr.tree`
+	configuration variable is used as a fallback when this option
+	or the environment variable are not in use.
+

 GIT COMMANDS
 ------------

Junio C Hamano Dec. 15, 2024, 11:24 p.m. UTC | #8

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> So ideally we'd have an input format that encapsulates that extra
>> context data and provides some mechanism for quoting. And it turns out
>> we do: the --raw diff format.
>
> Funny.  The raw diff format indeed was designed as an interchange
> format from various "compare two sets of things" front-ends (like
> diff-files, diff-cache, and diff-tree) that emits the raw format, to
> be read by "diff-helper" (initially called "diff-tree-helper") that
> takes the raw format and
>
>  - matches removed and added paths with similar contents to detect
>    renames and copies
>
>  - computes the output in various formats including "patch".
>
> So I guess we came a full circle, finally ;-).  Looking in the archive
> for messages exchanged between junkio@ and torvalds@ mentioning diff
> before 2005-05-30 finds some interesting gems.
>
> https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/

So, if we were to do what Justin tried to do honoring the overall
design of our diff machinery, I think what we can do is as follows:

 * Use the "diff --raw" output format as the input, but with a bit
   of twist.

   (1) a narrow special case that takes only a single diff_filepair
       of <old> and <new> blobs, and immediately run diff_queue() on
       that single diff_filepair, which is Justin's use case.  For
       this mode of operation, "flush after reach record of input"
       may be sufficient.

   (2) as a general "interchange format" to feed "comparison between
       two sets of <object, path>" into our diff machinery, we are
       better off if we can treat the input stream as multiple
       records that describes comparison between two sets.  Imagine
       "git log --oneline --first-parent -2 --raw HEAD", where one
       set of "diff --raw" records show the changed blobs with their
       paths between HEAD~1 and HEAD, and another set does so for
       HEAD~2 and HEAD~1.  We need to be able to tell where the
       first set ends and the second set starts, so that rename
       detection and other things, if requested, can be done within
       each set.

   My recommendation is to use a single blank line as a separator,
   e.g.

        :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt
        :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c
        :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh

        :100644 100644 c11213f520 8953d1c6d3 M	refs/files-backend.c
        :100644 100644 b2e3ba877d bec5962deb M	refs/reftable-backend.c

   so an application that wants to compare only one diff_filepair
   at a time would issue something like

        :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt

        :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c

        :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh

   so the parsing machinery does not have to worry about case (1) above.

 * Parse and append the input into diff_queue(), until you see an
   blank line.

   - If at EOF you are done, but if you have something accumulated
     in diff_queue(), show them (like below) first.  In any case, at
     EOF, you are done.

 * Run diffcore_std() followed by diff_flush() to have the contents
   of the queue nicely formatted and emptied.  Go back to parsing
   more input lines.

Jeff King Dec. 16, 2024, 11:11 a.m. UTC | #9

On Sat, Dec 14, 2024 at 06:17:01PM -0800, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Yeah, you're right.  I forgot about attr.tree (does not seem to be
> > documented, by the way)
> 
> We do have an entry in Documentation/config/attr.txt that describes
> the three; I simply assumed it is not documented as I didn't see it
> mentioned in Documentation/git.txt where --attr-source &
> GIT_ATTR_SOURCE are described.
> 
> We may want to add something like this, perhaps?
> 
> ----- >8 -----
> Subject: doc: give attr.tree a bit more visibility

Yeah, that looks good to me.

I recall that the performance of attr.tree is not great for _some_
commands (like pack-objects). So it's perhaps reasonable to use for
single commands like "git diff" but not to set in your on-disk config.
It's possible we'd want to warn people about that before advertising it
more widely? I dunno.

-Peff

Jeff King Dec. 16, 2024, 11:18 a.m. UTC | #10

On Fri, Dec 13, 2024 at 10:41:25AM -0600, Justin Tobler wrote:

> > The patch might look something like this:
> > 
> >   https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/
> > 
> > :) That is what has been powering the diffs at github.com since 2016 or
> > so. And continues to do so, as far as I know. I don't have access to
> > their internal repository anymore, but I've continued to rebase the
> > topic forward in my personal repo. You can fetch it from:
> > 
> >   https://github.com/peff/git jk/diff-pairs
> > 
> > in case that is helpful.
> 
> Thanks Peff! From looking at the mentioned thread and branch, it looks
> like I'm basically trying to accomplish the same thing here. Just a bit
> late to the conversation. :)
> 
> While the use-case is rather narrow, I think it would be nice to see
> this functionality provided upstream. I see this as a means to faciliate
> more fine-grained control of the blob diffs we actually want to compute
> at a given time and it seems like it would be reasonable to expose as
> part of the diff plumbing. I would certainly be interested in adapting
> this series to instead use raw input from git-diff-tree(1) or trying to
> revive the previous series if that is preferred.

Yeah, if you want to take it in that direction, either by adapting the
idea, or by starting with diff-pairs and polishing it up, I'm happy
either way. GitHub folks may be happy if you keep the name "diff-pairs"
and match the interface. ;)

> If there is interest in continuing, some lingering questions I have:
> 
> Being that the primary purpose of git-diff-blob(1) here is to handle
> generating blob diffs as specified by stdin, is there any reason to have
> a normal mode that accepts a blob pair as arguments? Or would it be best
> to limit the input mechanism to stdin entirely? If the user wanted to
> compute a single blob diff they could just use git-diff(1) already so
> providing this as a part of git-diff-blob(1) is a bit redundant. Having
> it as an option for the user does seem a bit more friendly though.

I don't have a strong opinion. I agree it _could_ be more friendly, but
it also raises questions about how that mode/path context is filled in.
And also about other modes. E.g., "git diff HEAD:foo bar" will diff
between a blob and a working tree. Should a new plumbing command support
that, too?

If those aren't things you immediately care about, I'd probably punt on
it for now. I think it could be added later without losing compatibility
(command-line arguments as appropriate for a  single pair, and since the
stdin format starts with ":mode" there's room to extend it).

-Peff

Jeff King Dec. 16, 2024, 11:30 a.m. UTC | #11

On Sun, Dec 15, 2024 at 03:24:11PM -0800, Junio C Hamano wrote:

> > Funny.  The raw diff format indeed was designed as an interchange
> > format from various "compare two sets of things" front-ends (like
> > diff-files, diff-cache, and diff-tree) that emits the raw format, to
> > be read by "diff-helper" (initially called "diff-tree-helper") that
> > takes the raw format and
> >
> >  - matches removed and added paths with similar contents to detect
> >    renames and copies
> >
> >  - computes the output in various formats including "patch".
> >
> > So I guess we came a full circle, finally ;-).  Looking in the archive
> > for messages exchanged between junkio@ and torvalds@ mentioning diff
> > before 2005-05-30 finds some interesting gems.
> >
> > https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/

:) That same thread was linked when I posted the original diff-pairs
many years ago.

> So, if we were to do what Justin tried to do honoring the overall
> design of our diff machinery, I think what we can do is as follows:
> 
>  * Use the "diff --raw" output format as the input, but with a bit
>    of twist.
> 
>    (1) a narrow special case that takes only a single diff_filepair
>        of <old> and <new> blobs, and immediately run diff_queue() on
>        that single diff_filepair, which is Justin's use case.  For
>        this mode of operation, "flush after reach record of input"
>        may be sufficient.

My understanding was that he does not actually care about this case
(just feeding two blobs), but is actually processing --raw output in the
first place. Or did you just mean that we'd still be feeding raw output,
but just getting the flush behavior?

>    (2) as a general "interchange format" to feed "comparison between
>        two sets of <object, path>" into our diff machinery, we are
>        better off if we can treat the input stream as multiple
>        records that describes comparison between two sets.  Imagine
>        "git log --oneline --first-parent -2 --raw HEAD", where one
>        set of "diff --raw" records show the changed blobs with their
>        paths between HEAD~1 and HEAD, and another set does so for
>        HEAD~2 and HEAD~1.  We need to be able to tell where the
>        first set ends and the second set starts, so that rename
>        detection and other things, if requested, can be done within
>        each set.

Seems reasonable. For the use of diff-pairs at GitHub, I always just did
full-tree things like rename detection in the initial diff-tree
invocation. Since my goal was splitting/filtering, doing it after would
yield wrong answers (since diff-pairs never sees the complete set).

But it's possible for somebody to want to filter the intermediate
results, then do full-tree stuff on the result (or even just delay the
cost of rename detection). And certainly it's possible to want to feed a
whole bunch of unrelated diff segments without having to spawn a process
for each.

So it's not something I wanted, but I agree it's good to plan for.

>    My recommendation is to use a single blank line as a separator,
>    e.g.
> 
>         :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt
>         :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c
>         :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh
> 
>         :100644 100644 c11213f520 8953d1c6d3 M	refs/files-backend.c
>         :100644 100644 b2e3ba877d bec5962deb M	refs/reftable-backend.c
> 
>    so an application that wants to compare only one diff_filepair
>    at a time would issue something like
> 
>         :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt
> 
>         :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c
> 
>         :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh
> 
>    so the parsing machinery does not have to worry about case (1) above.

Yeah, that seems good. And it is backwards-compatible with the existing
diff-pairs format (which just barfs on a blank line). That's not a
big concern for the project, but it is nice that it makes life a bit
simpler for folks who would eventually be tasked with switching from it
to this new tool. ;)

>  * Parse and append the input into diff_queue(), until you see an
>    blank line.
> 
>    - If at EOF you are done, but if you have something accumulated
>      in diff_queue(), show them (like below) first.  In any case, at
>      EOF, you are done.

Yep, makes sense.

>  * Run diffcore_std() followed by diff_flush() to have the contents
>    of the queue nicely formatted and emptied.  Go back to parsing
>    more input lines.

That makes sense. I don't think my diff-pairs runs diffcore_std() at
all. The plumbing defaults mean it would always be a noop unless you
explicitly passed in rename, etc, options, and I never wanted to do
that.

We might have to check the interaction of diffcore code on a set of
queued diffs that already have values for renames, etc. I.e., that:

  git diff-tree --raw -M |
  git diff-pairs -M

does not barf, since the input step in diff-pairs is going to set status
to 'R', etc, in the pairs.

-Peff

Junio C Hamano Dec. 16, 2024, 4:29 p.m. UTC | #12

Jeff King <peff@peff.net> writes:

> I recall that the performance of attr.tree is not great for _some_
> commands (like pack-objects). So it's perhaps reasonable to use for
> single commands like "git diff" but not to set in your on-disk config.
> It's possible we'd want to warn people about that before advertising it
> more widely? I dunno.

Or we disable the unusably-inefficient feature before doing so.
Would attr.tree be much less efficient than GIT_ATTR_SOURCE?

Jeff King Dec. 18, 2024, 11:39 a.m. UTC | #13

On Mon, Dec 16, 2024 at 08:29:41AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I recall that the performance of attr.tree is not great for _some_
> > commands (like pack-objects). So it's perhaps reasonable to use for
> > single commands like "git diff" but not to set in your on-disk config.
> > It's possible we'd want to warn people about that before advertising it
> > more widely? I dunno.
> 
> Or we disable the unusably-inefficient feature before doing so.
> Would attr.tree be much less efficient than GIT_ATTR_SOURCE?

Whether it's unusably inefficient depends on what you throw at it. IIRC,
the performance difference for pack-objects on git.git was fairly
negligible. The problem in linux.git is that besides being big, it has a
deep(er) directory structure. So collecting all of the attributes for a
file like drivers/gpu/drm/foo/bar.h needs to open all of those
intermediate trees.

So I'd be inclined to leave it in place, in case somebody is actually
happily using it.

GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd
presumably only use it with a few select commands (as far as I know,
pack-objects is the worst case because it's looking up one attribute on
every single blob in all of history).

-Peff

Junio C Hamano Dec. 18, 2024, 2:53 p.m. UTC | #14

Jeff King <peff@peff.net> writes:

>> > I recall that the performance of attr.tree is not great for _some_
>> > commands (like pack-objects). So it's perhaps reasonable to use for
>> > single commands like "git diff" but not to set in your on-disk config.
>> > It's possible we'd want to warn people about that before advertising it
>> > more widely? I dunno.
>> 
>> Or we disable the unusably-inefficient feature before doing so.
>> Would attr.tree be much less efficient than GIT_ATTR_SOURCE?
>
> Whether it's unusably inefficient depends on what you throw at it. IIRC,
> the performance difference for pack-objects on git.git was fairly
> negligible. The problem in linux.git is that besides being big, it has a
> deep(er) directory structure. So collecting all of the attributes for a
> file like drivers/gpu/drm/foo/bar.h needs to open all of those
> intermediate trees.
>
> So I'd be inclined to leave it in place, in case somebody is actually
> happily using it.
>
> GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd
> presumably only use it with a few select commands (as far as I know,
> pack-objects is the worst case because it's looking up one attribute on
> every single blob in all of history).

Ah, OK.  So your "caution" was about the underlying mechanism to
allow attributes corrected from the specified tree, and not
specifically about using "attr.tree" to specify that tree?  That was
what got me confused.

If that is the case, I do not think the documentation patch that
started this exchange that adds attr.tree to where GIT_ATTR_SOURCE
and --attr-source are already mentioned makes anything worse.

Thanks.

Jeff King Dec. 20, 2024, 9:09 a.m. UTC | #15

On Wed, Dec 18, 2024 at 06:53:31AM -0800, Junio C Hamano wrote:

> > Whether it's unusably inefficient depends on what you throw at it. IIRC,
> > the performance difference for pack-objects on git.git was fairly
> > negligible. The problem in linux.git is that besides being big, it has a
> > deep(er) directory structure. So collecting all of the attributes for a
> > file like drivers/gpu/drm/foo/bar.h needs to open all of those
> > intermediate trees.
> >
> > So I'd be inclined to leave it in place, in case somebody is actually
> > happily using it.
> >
> > GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd
> > presumably only use it with a few select commands (as far as I know,
> > pack-objects is the worst case because it's looking up one attribute on
> > every single blob in all of history).
> 
> Ah, OK.  So your "caution" was about the underlying mechanism to
> allow attributes corrected from the specified tree, and not
> specifically about using "attr.tree" to specify that tree?  That was
> what got me confused.
> 
> If that is the case, I do not think the documentation patch that
> started this exchange that adds attr.tree to where GIT_ATTR_SOURCE
> and --attr-source are already mentioned makes anything worse.

Yeah, I agree it's somewhat orthogonal. Your patch made me think about
it because it is advertising the config variant more widely. Somebody
doing:

  git --attr-source=foo diff ...

is probably OK, but:

  git --attr-source=foo pack-objects ...

is less so. Using attr.tree instead means you're going to do the latter
whether you intended to or not.

-Peff

Jeff King Dec. 20, 2024, 9:10 a.m. UTC | #16

On Fri, Dec 20, 2024 at 04:09:08AM -0500, Jeff King wrote:

> > Ah, OK.  So your "caution" was about the underlying mechanism to
> > allow attributes corrected from the specified tree, and not
> > specifically about using "attr.tree" to specify that tree?  That was
> > what got me confused.
> > 
> > If that is the case, I do not think the documentation patch that
> > started this exchange that adds attr.tree to where GIT_ATTR_SOURCE
> > and --attr-source are already mentioned makes anything worse.
> 
> Yeah, I agree it's somewhat orthogonal. Your patch made me think about
> it because it is advertising the config variant more widely. Somebody
> doing:
> 
>   git --attr-source=foo diff ...
> 
> is probably OK, but:
> 
>   git --attr-source=foo pack-objects ...
> 
> is less so. Using attr.tree instead means you're going to do the latter
> whether you intended to or not.

Re-reading my message, I guess I didn't really give any conclusion. ;)

I should add: I'm OK with leaving the performance implications
undocumented for now. Hopefully in the long run somebody is interested
in addressing them.

-Peff

[0/3] batch blob diff generation

Message

Comments