diff mbox series

[2/2] Allow passing pipes for input pipes to diff --no-index

Message ID 20200918113256.8699-3-tguyot@gmail.com (mailing list archive)
State Superseded
Headers show
Series [1/2] diff: Fix modified lines stats with --stat and --numstat | expand

Commit Message

Thomas Guyot Sept. 18, 2020, 11:32 a.m. UTC
A very handy way to pass data to applications is to use the <() process
substitution syntax in bash variants. It allow comparing files streamed
from a remote server or doing on-the-fly stream processing to alter the
diff. These are usually implemented as a symlink that points to a bogus
name (ex "pipe:[209326419]") but opens as a pipe.

Git normally tracks symlinks targets. This patch makes it detect such
pipes in --no-index mode and read the file normally like it would do for
stdin ("-"), so they can be compared directly.

Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
---
 diff-no-index.c          |  56 ++++++++++--
 t/t4053-diff-no-index.sh | 189 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 238 insertions(+), 7 deletions(-)

Comments

Taylor Blau Sept. 18, 2020, 2:36 p.m. UTC | #1
Hi Thomas,

On Fri, Sep 18, 2020 at 07:32:56AM -0400, Thomas Guyot-Sionnest wrote:
> A very handy way to pass data to applications is to use the <() process
> substitution syntax in bash variants. It allow comparing files streamed
> from a remote server or doing on-the-fly stream processing to alter the
> diff. These are usually implemented as a symlink that points to a bogus
> name (ex "pipe:[209326419]") but opens as a pipe.

This is true in bash, but sh does not support process substitution with
<().

> Git normally tracks symlinks targets. This patch makes it detect such
> pipes in --no-index mode and read the file normally like it would do for
> stdin ("-"), so they can be compared directly.
>
> Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
> ---
>  diff-no-index.c          |  56 ++++++++++--
>  t/t4053-diff-no-index.sh | 189 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 238 insertions(+), 7 deletions(-)
>
> diff --git a/diff-no-index.c b/diff-no-index.c
> index 7814eabfe0..779c686d23 100644
> --- a/diff-no-index.c
> +++ b/diff-no-index.c
> @@ -41,6 +41,33 @@ static int read_directory_contents(const char *path, struct string_list *list)
>   */
>  static const char file_from_standard_input[] = "-";
>
> +/* Check that file is - (STDIN) or unnamed pipe - explicitly
> + * avoid on-disk named pipes which could block
> + */
> +static int ispipe(const char *name)
> +{
> +	struct stat st;
> +
> +	if (name == file_from_standard_input)
> +		return 1;  /* STDIN */
> +
> +	if (!lstat(name, &st)) {
> +		if (S_ISLNK(st.st_mode)) {

I had to read this a few times to make sure that I got it; you want to
stat the link itself, and then check that it links to a pipe.

I'm not sure why, though. Do you want to avoid handling named FIFOs in
the code below? Your comment that they "could block" makes me think you
do, but I don't know why that would be a problem.

> +			/* symlink - read it and check it doesn't exists
> +			 * as a file yet link to a pipe */
> +			struct strbuf sb = STRBUF_INIT;
> +			strbuf_realpath(&sb, name, 0);
> +			/* We're abusing strbuf_realpath here, it may append
> +			 * pipe:[NNNNNNNNN] to an abs path */
> +			if (!stat(sb.buf, &st))

Statting sb.buf is confusing to me (especially when followed up by
another stat right below. Could you explain?

> +test_expect_success 'diff --no-index can diff piped subshells' '
> +	echo 1 >non/git/c &&
> +	test_expect_code 0 git diff --no-index non/git/b <(cat non/git/c) &&
> +	test_expect_code 0 git diff --no-index <(cat non/git/b) non/git/c &&
> +	test_expect_code 0 git diff --no-index <(cat non/git/b) <(cat non/git/c) &&
> +	test_expect_code 0 cat non/git/b | git diff --no-index - non/git/c &&
> +	test_expect_code 0 cat non/git/c | git diff --no-index non/git/b - &&
> +	test_expect_code 0 cat non/git/b | git diff --no-index - <(cat non/git/c) &&
> +	test_expect_code 0 cat non/git/c | git diff --no-index <(cat non/git/b) -
> +'

Indeed this test fails (Git thinks that the HERE-DOC is broken, but I
suspect it's just getting confused by the '<()'). This test (like almost
all other tests in Git) use /bin/sh as its shebang. Does your /bin/sh
actually point to bash?

If you did want to test something like this, you'd need to source
t/lib-bash.sh instead of t/test-lib.sh.

Unrelated to the above comment, but there are a few small style nits
that I notice:

  - There is no need to run with 'test_expect_code 0' since the test is
    marked as 'test_expect_success' and the commands are all in an '&&'
    chain. (This does appear to be common style for others in t4053, so
    you may just be matching it--which is fine--but an additional
    clean-up on top to modernize would be appreciated, too).

  - The cat pipe is unnecessary, and is also violating a rule that we
    don't place 'git' on the right-hand side of a pipe (can you redirect
    the file at the end instead?).

Documentation/CodingGuidelines is a great place to look if you are ever
curious about whether something is in good style.

> +test_expect_success 'diff --no-index finds diff in piped subshells' '
> +	(
> +		set -- <(cat /dev/null) <(cat /dev/null)

Why is this necessary?

> +		cat <<-EOF >expect
> +			diff --git a$1 b$2
> +			--- a$1
> +			+++ b$2
> +			@@ -1 +1 @@
> +			-1
> +			+2
> +		EOF
> +	) &&
> +	test_expect_code 1 \
> +		git diff --no-index <(cat non/git/b) <(sed s/1/2/ non/git/c) >actual &&
> +	test_cmp expect actual
> +'

Thanks,
Taylor
Thomas Guyot Sept. 18, 2020, 4:34 p.m. UTC | #2
Hi Taylor,

On Fri, 18 Sep 2020 at 10:36, Taylor Blau <me@ttaylorr.com> wrote:
> On Fri, Sep 18, 2020 at 07:32:56AM -0400, Thomas Guyot-Sionnest wrote:
> > A very handy way to pass data to applications is to use the <() process
> > substitution syntax in bash variants. It allow comparing files streamed
> > from a remote server or doing on-the-fly stream processing to alter the
> > diff. These are usually implemented as a symlink that points to a bogus
> > name (ex "pipe:[209326419]") but opens as a pipe.
>
> This is true in bash, but sh does not support process substitution with
> <().

Bash, ksh, zsh and likely any more moden shell. Other programming
languages also setup such pipes. It's much cleaner than creating temp
files and cleaning them up and in some cases faster too (I've ran
diff's like this over GB's of test data, it's very handy to remove
known patterns that would cause needless diffs).

> > +/* Check that file is - (STDIN) or unnamed pipe - explicitly
> > + * avoid on-disk named pipes which could block
> > + */
> > +static int ispipe(const char *name)
> > +{
> > +     struct stat st;
> > +
> > +     if (name == file_from_standard_input)
> > +             return 1;  /* STDIN */
> > +
> > +     if (!lstat(name, &st)) {
> > +             if (S_ISLNK(st.st_mode)) {
>
> I had to read this a few times to make sure that I got it; you want to
> stat the link itself, and then check that it links to a pipe.
>
> I'm not sure why, though. Do you want to avoid handling named FIFOs in
> the code below? Your comment that they "could block" makes me think you
> do, but I don't know why that would be a problem.

I'll admit the comment was written first and is a bit naive  - i'll
rephrase that. Yes you don't want to block on pipes like if you run a
"grep -R" on a subtree that has fifos - but as I coded this I realized
the obvious: git tracks symlinks name so the real bugger would be to
detect one as a pipe and try reading it instead or calling readlink().

> > +                     /* symlink - read it and check it doesn't exists
> > +                      * as a file yet link to a pipe */
> > +                     struct strbuf sb = STRBUF_INIT;
> > +                     strbuf_realpath(&sb, name, 0);
> > +                     /* We're abusing strbuf_realpath here, it may append
> > +                      * pipe:[NNNNNNNNN] to an abs path */
> > +                     if (!stat(sb.buf, &st))
>
> Statting sb.buf is confusing to me (especially when followed up by
> another stat right below. Could you explain?

The whole block is under lstat/S_ISLNK (see previous chunk), so the
path provided to us was a symlink.

Initially I looked at what differentiate these - mainly, stat() st_dev
- but that struct is os-specific, you'd want to check major(st_dev) ==
0 (at least on linux) and even if we knew how each os behaves, the
code isn't portable and would be a pain to support. Gnu's difftools
have very incomplete historical source code in git but there's
indications they have gotten rid of it too.

So what I'm doing instead is trying to resolve the link and see if the
destination exists (a clear no). Luckily strbuf_realpath does the
heavy lifting and leaves me with a real path to the file the symlink
points to (especially useful for relative links), which is bogus for
the special pipes we're interested in.

Then the block right after (not shown) do a stat() on the initial name
and return whenever it's a fifo or not (if it is, but the link is
broken, we know it's a special device).

Now you mention it, maybe I could do that stat first, rule this out
from the beginning... less work for the general case.

*untested*:

    if (!lstat(name, &st)) {
        if (!S_ISLNK(st.st_mode))
            return(0);
        if (!stat(name, &st)) {
            if (!S_ISFIFO(st.st_mode))
                return(0);

            /* We have a symlink that points to a pipe. If it's resolved
             * target doesn't really exist we can safely assume it's a
             * special file and use it */
            struct strbuf sb = STRBUF_INIT;
            strbuf_realpath(&sb, name, 0);
            /* We're abusing strbuf_realpath here, it may append
             * pipe:[NNNNNNNNN] to an abs path */
            if (stat(sb.buf, &st))
                return(1); /* stat failed, special one */
        }
    }
    return(0);

TL;DR - the conditions we need:

- lstat(name) == 0  // name exists
- islink(lstat(name))  // name is a symlink
- stat(name) == 0  // target of name is reachable
- isfifo(stat(name))  // Target of name is a fifo
- stat(realpath(readlink(name))) != 0  // Although we can reach it,
name's destination doesn't actually exist.

BTW is st/sb too confusing ? I took examples elsewhere in the code, I
can rename them if it's easier to read.

> > +test_expect_success 'diff --no-index can diff piped subshells' '
> > +     echo 1 >non/git/c &&
> > +     test_expect_code 0 git diff --no-index non/git/b <(cat non/git/c) &&
> > +     test_expect_code 0 git diff --no-index <(cat non/git/b) non/git/c &&
> > +     test_expect_code 0 git diff --no-index <(cat non/git/b) <(cat non/git/c) &&
> > +     test_expect_code 0 cat non/git/b | git diff --no-index - non/git/c &&
> > +     test_expect_code 0 cat non/git/c | git diff --no-index non/git/b - &&
> > +     test_expect_code 0 cat non/git/b | git diff --no-index - <(cat non/git/c) &&
> > +     test_expect_code 0 cat non/git/c | git diff --no-index <(cat non/git/b) -
> > +'
>
> Indeed this test fails (Git thinks that the HERE-DOC is broken, but I
> suspect it's just getting confused by the '<()'). This test (like almost
> all other tests in Git) use /bin/sh as its shebang. Does your /bin/sh
> actually point to bash?
>
> If you did want to test something like this, you'd need to source
> t/lib-bash.sh instead of t/test-lib.sh.

Thanks for the tip - indeed I think I ran the testsuite directly with
back, but the make test failed.

> Unrelated to the above comment, but there are a few small style nits
> that I notice:
>
>   - There is no need to run with 'test_expect_code 0' since the test is
>     marked as 'test_expect_success' and the commands are all in an '&&'
>     chain. (This does appear to be common style for others in t4053, so
>     you may just be matching it--which is fine--but an additional
>     clean-up on top to modernize would be appreciated, too).
>
>   - The cat pipe is unnecessary, and is also violating a rule that we
>     don't place 'git' on the right-hand side of a pipe (can you redirect
>     the file at the end instead?).

Cleanup, no pipelines (I read too fast / assumed last command was ok) - will do!

> Documentation/CodingGuidelines is a great place to look if you are ever
> curious about whether something is in good style.
>
> > +test_expect_success 'diff --no-index finds diff in piped subshells' '
> > +     (
> > +             set -- <(cat /dev/null) <(cat /dev/null)

Precautions/portability. The file names are somewhat dynamic (at least
the fd part...) this is to be sure I capture the names of the pipes
that will be used (assuming the fd's will be reallocated in the same
order which I think is fairly safe). An alternative is to sed "actual"
to remove known variables, but then I hope it would be reliable (and
can I use sed -r?). IIRC earlier versions of bash or on some systems a
temp file could be used for these - although it defeats the purpose
it's not a reason to fail....

I cannot develop this on other systems but I tested the pipe names on
Windows and Sunos, and also using ksh and zsh on Linux (zsh uses /proc
directly, kss uses lower fd's which means it can easily clash with
scripts if you don't use named fd's, but not our problem....)

Thanks,

Thomas
Jeff King Sept. 18, 2020, 5:19 p.m. UTC | #3
On Fri, Sep 18, 2020 at 12:34:48PM -0400, Thomas Guyot-Sionnest wrote:

> Hi Taylor,
> 
> On Fri, 18 Sep 2020 at 10:36, Taylor Blau <me@ttaylorr.com> wrote:
> > On Fri, Sep 18, 2020 at 07:32:56AM -0400, Thomas Guyot-Sionnest wrote:
> > > A very handy way to pass data to applications is to use the <() process
> > > substitution syntax in bash variants. It allow comparing files streamed
> > > from a remote server or doing on-the-fly stream processing to alter the
> > > diff. These are usually implemented as a symlink that points to a bogus
> > > name (ex "pipe:[209326419]") but opens as a pipe.
> >
> > This is true in bash, but sh does not support process substitution with
> > <().
> 
> Bash, ksh, zsh and likely any more moden shell. Other programming
> languages also setup such pipes. It's much cleaner than creating temp
> files and cleaning them up and in some cases faster too (I've ran
> diff's like this over GB's of test data, it's very handy to remove
> known patterns that would cause needless diffs).

Yeah, it's definitely a reasonable thing to want (see below). And from a
portability perspective, it is outside of Git's scope; users with those
shells can use the feature, and people on other shells don't have to
care.

But we do have to account for this in the test suite, which must be able
to run under a vanilla POSIX shell. So you'd probably want to set up a
prerequisite that lets us skip these tests on other shells, like:

  test_lazy_prereq PROCESS_SUBSTITUTION '
	echo foo >expect &&
	cat >actual <(echo foo) &&
	test_cmp expect actual
  '

  test_expect_success PROCESS_SUBSTITUTION 'some test...' '
	# safe because we skip this test on shells that do not support it
	git diff --no-index <(cat whatever)
  '

Though it is a little sad that people running the suite with a vanilla
/bin/sh like dash wouldn't ever run the tests. I wonder if there's a
more portable way to formulate it.

Getting back to the overall feature, this is definitely something that
has come up before. The last I know of is:

  https://lore.kernel.org/git/20181220002610.43832-1-sandals@crustytoothpaste.net/

which everybody seemed to like the direction of; I suspect the original
author (cc'd) just never got around to it again. Compared to this
approach, it uses a command-line option to avoid dereferencing symlinks.
That puts an extra burden on the caller to pass the option, but it's way
less magical; you could drop all of the "does this look like a symlink
to a pipe" heuristics. It would also be much easier to test. ;)

-Peff
Jeff King Sept. 18, 2020, 5:20 p.m. UTC | #4
On Fri, Sep 18, 2020 at 10:36:47AM -0400, Taylor Blau wrote:

>   - The cat pipe is unnecessary, and is also violating a rule that we
>     don't place 'git' on the right-hand side of a pipe (can you redirect
>     the file at the end instead?).

What's wrong with git on the right-hand side of a pipe?

On the left-hand side we lose its exit code, which is bad. But on the
right hand side, we are only losing the exit code of "cat", which we
don't care about.

(Though I agree that "cat" is pointless here; we could just be
redirecting from a file).

-Peff
Jeff King Sept. 18, 2020, 5:21 p.m. UTC | #5
On Fri, Sep 18, 2020 at 01:19:50PM -0400, Jeff King wrote:

> Getting back to the overall feature, this is definitely something that
> has come up before. The last I know of is:
> 
>   https://lore.kernel.org/git/20181220002610.43832-1-sandals@crustytoothpaste.net/
> 
> which everybody seemed to like the direction of; I suspect the original
> author (cc'd) just never got around to it again. Compared to this
> approach, it uses a command-line option to avoid dereferencing symlinks.
> That puts an extra burden on the caller to pass the option, but it's way
> less magical; you could drop all of the "does this look like a symlink
> to a pipe" heuristics. It would also be much easier to test. ;)

Of course I forgot to add the cc. +cc brian.
Thomas Guyot Sept. 18, 2020, 5:39 p.m. UTC | #6
Hi Jeff,

On Fri, 18 Sep 2020 at 13:19, Jeff King <peff@peff.net> wrote:
> On Fri, Sep 18, 2020 at 12:34:48PM -0400, Thomas Guyot-Sionnest wrote:
> But we do have to account for this in the test suite, which must be able
> to run under a vanilla POSIX shell. So you'd probably want to set up a
> prerequisite that lets us skip these tests on other shells, like:

Indeed, the bash test library that was suggested earlier may be better
as it exec() bash rather than skipping tests. Testing as a prereq
works, another approach which may be harder to swallow but allow
testing when default shell is /bin/sh is to run each git command
through bash - could be coupled with a dep if bash isn't installed at
all.

> Though it is a little sad that people running the suite with a vanilla
> /bin/sh like dash wouldn't ever run the tests. I wonder if there's a
> more portable way to formulate it.

Debian defaults to dash, a minimalistic and afaik POSIX-compliant shell.

> Getting back to the overall feature, this is definitely something that
> has come up before. The last I know of is:
>
>   https://lore.kernel.org/git/20181220002610.43832-1-sandals@crustytoothpaste.net/
>
> which everybody seemed to like the direction of; I suspect the original
> author (cc'd) just never got around to it again. Compared to this
> approach, it uses a command-line option to avoid dereferencing symlinks.
> That puts an extra burden on the caller to pass the option, but it's way
> less magical; you could drop all of the "does this look like a symlink
> to a pipe" heuristics. It would also be much easier to test. ;)

Thanks for the info. Another consideration is how other commands -
diff, vim, sed, curl, etc can take input the same way, so being able
to swap-in git is a plus imho,and one less switch to learn (or even
learn about, I never looked for one for this issue).

Regards,

Thomas
Junio C Hamano Sept. 18, 2020, 5:48 p.m. UTC | #7
Jeff King <peff@peff.net> writes:

> Getting back to the overall feature, this is definitely something that
> has come up before. The last I know of is:
>
>   https://lore.kernel.org/git/20181220002610.43832-1-sandals@crustytoothpaste.net/
>
> which everybody seemed to like the direction of; I suspect the original
> author (cc'd) just never got around to it again. Compared to this
> approach, it uses a command-line option to avoid dereferencing symlinks.
> That puts an extra burden on the caller to pass the option, but it's way
> less magical; you could drop all of the "does this look like a symlink
> to a pipe" heuristics. It would also be much easier to test. ;)

Yes, I do remember liking the approach very much and wanted to take
it once the "do not dereference symlinks everywhere---just limit it
to what was given from the command line" was done.

To be quite honest, I think "git diff --no-index A B" should
unconditionally dereference A and/or B if they are symlinks, whether
they are symlinks to pipes, regular files or directories, and
otherwise treat symlinks in A and B the same way as "git diff" if A
and B are directories.  But that is a design guideline that becomes
needed only after we start resurrecting Brian's effort, not with
these patches that started this thread.

Thanks.
Taylor Blau Sept. 18, 2020, 5:58 p.m. UTC | #8
Hi Thomas,

On Fri, Sep 18, 2020 at 12:34:48PM -0400, Thomas Guyot-Sionnest wrote:
> Hi Taylor,
>
> On Fri, 18 Sep 2020 at 10:36, Taylor Blau <me@ttaylorr.com> wrote:
> > On Fri, Sep 18, 2020 at 07:32:56AM -0400, Thomas Guyot-Sionnest wrote:
> > > A very handy way to pass data to applications is to use the <() process
> > > substitution syntax in bash variants. It allow comparing files streamed
> > > from a remote server or doing on-the-fly stream processing to alter the
> > > diff. These are usually implemented as a symlink that points to a bogus
> > > name (ex "pipe:[209326419]") but opens as a pipe.
> >
> > This is true in bash, but sh does not support process substitution with
> > <().
>
> Bash, ksh, zsh and likely any more moden shell. Other programming
> languages also setup such pipes. It's much cleaner than creating temp
> files and cleaning them up and in some cases faster too (I've ran
> diff's like this over GB's of test data, it's very handy to remove
> known patterns that would cause needless diffs).

Oh, yes, I definitely agree that it will be useful for callers who use
more modern shells. I was making sure that we would still be able to run
this in the test suite (for us, that means making a new file that
sources lib-bash and tests only in environments where bash is
supported).

> Now you mention it, maybe I could do that stat first, rule this out
> from the beginning... less work for the general case.
>
> *untested*:
>
>     if (!lstat(name, &st)) {
>         if (!S_ISLNK(st.st_mode))
>             return(0);
>         if (!stat(name, &st)) {
>             if (!S_ISFIFO(st.st_mode))
>                 return(0);

I still don't think I quite understand why FIFOs aren't allowed...
>
>             /* We have a symlink that points to a pipe. If it's resolved
>              * target doesn't really exist we can safely assume it's a
>              * special file and use it */
>             struct strbuf sb = STRBUF_INIT;
>             strbuf_realpath(&sb, name, 0);
>             /* We're abusing strbuf_realpath here, it may append
>              * pipe:[NNNNNNNNN] to an abs path */
>             if (stat(sb.buf, &st))
>                 return(1); /* stat failed, special one */

By the time that I got to this point I think that what you wrote is much
easier to follow. Thanks.

>         }
>     }
>     return(0);
>
> TL;DR - the conditions we need:
>
> - lstat(name) == 0  // name exists
> - islink(lstat(name))  // name is a symlink
> - stat(name) == 0  // target of name is reachable
> - isfifo(stat(name))  // Target of name is a fifo
> - stat(realpath(readlink(name))) != 0  // Although we can reach it,
> name's destination doesn't actually exist.
>
> BTW is st/sb too confusing ? I took examples elsewhere in the code, I
> can rename them if it's easier to read.

No, it's fine. I think anecdotally I read 'struct strbuf buf' more than
I read 'struct strbuf sb', but I guess that's just the areas of code
that I happen to frequent, since some quick grepping around shows 462
uses of 'sb' followed by 425 uses of 'buf' (the next most common names
are 'err' and 'out', with 191 and 121 uses, respectively).

> > > +test_expect_success 'diff --no-index finds diff in piped subshells' '
> > > +     (
> > > +             set -- <(cat /dev/null) <(cat /dev/null)
>
> Precautions/portability. The file names are somewhat dynamic (at least
> the fd part...) this is to be sure I capture the names of the pipes
> that will be used (assuming the fd's will be reallocated in the same
> order which I think is fairly safe). An alternative is to sed "actual"
> to remove known variables, but then I hope it would be reliable (and
> can I use sed -r?). IIRC earlier versions of bash or on some systems a
> temp file could be used for these - although it defeats the purpose
> it's not a reason to fail....

OK.

> I cannot develop this on other systems but I tested the pipe names on
> Windows and Sunos, and also using ksh and zsh on Linux (zsh uses /proc
> directly, kss uses lower fd's which means it can easily clash with
> scripts if you don't use named fd's, but not our problem....)
>
> Thanks,
>
> Thomas

Thanks,
Taylor
Taylor Blau Sept. 18, 2020, 6 p.m. UTC | #9
On Fri, Sep 18, 2020 at 01:20:44PM -0400, Jeff King wrote:
> On Fri, Sep 18, 2020 at 10:36:47AM -0400, Taylor Blau wrote:
>
> >   - The cat pipe is unnecessary, and is also violating a rule that we
> >     don't place 'git' on the right-hand side of a pipe (can you redirect
> >     the file at the end instead?).
>
> What's wrong with git on the right-hand side of a pipe?

Ack, ignore me. The problem would be on the left-hand side only, without
set -o pipefail, which we don't do.

> On the left-hand side we lose its exit code, which is bad. But on the
> right hand side, we are only losing the exit code of "cat", which we
> don't care about.
>
> (Though I agree that "cat" is pointless here; we could just be
> redirecting from a file).

Yep, but an easy mistake to make nonetheless.

> -Peff

Thanks,
Taylor
Jeff King Sept. 18, 2020, 6:02 p.m. UTC | #10
On Fri, Sep 18, 2020 at 10:48:41AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Getting back to the overall feature, this is definitely something that
> > has come up before. The last I know of is:
> >
> >   https://lore.kernel.org/git/20181220002610.43832-1-sandals@crustytoothpaste.net/
> >
> > which everybody seemed to like the direction of; I suspect the original
> > author (cc'd) just never got around to it again. Compared to this
> > approach, it uses a command-line option to avoid dereferencing symlinks.
> > That puts an extra burden on the caller to pass the option, but it's way
> > less magical; you could drop all of the "does this look like a symlink
> > to a pipe" heuristics. It would also be much easier to test. ;)
> 
> Yes, I do remember liking the approach very much and wanted to take
> it once the "do not dereference symlinks everywhere---just limit it
> to what was given from the command line" was done.
> 
> To be quite honest, I think "git diff --no-index A B" should
> unconditionally dereference A and/or B if they are symlinks, whether
> they are symlinks to pipes, regular files or directories, and
> otherwise treat symlinks in A and B the same way as "git diff" if A
> and B are directories.  But that is a design guideline that becomes
> needed only after we start resurrecting Brian's effort, not with
> these patches that started this thread.

Yeah, I think I'd be fine with that approach, too. It makes "git diff
--no-index" more like other tools out of the box. And if we took brian's
patch first, then we'd just be flipping its default, and the option it
adds would give an easy escape hatch for somebody who really wants to
diff two maybe-symlinks.

-Peff
Jeff King Sept. 18, 2020, 6:05 p.m. UTC | #11
On Fri, Sep 18, 2020 at 01:58:36PM -0400, Taylor Blau wrote:

> > *untested*:
> >
> >     if (!lstat(name, &st)) {
> >         if (!S_ISLNK(st.st_mode))
> >             return(0);
> >         if (!stat(name, &st)) {
> >             if (!S_ISFIFO(st.st_mode))
> >                 return(0);
> 
> I still don't think I quite understand why FIFOs aren't allowed...

It's the other way around. Non-fifos return "0" from this "is it a pipe"
function.

I think it is to prevent a false positive with:

  rm bar
  ln -s foo bar
  git diff --no-index foo something-else

We'd still want to treat "foo" as a symlink there. I.e., the heuristic
for guessing it's a process substitution pipe is:

  - it's a symlink that doesn't point anywhere
  - it's also a fifo

-Peff
brian m. carlson Sept. 18, 2020, 9:56 p.m. UTC | #12
On 2020-09-18 at 11:32:56, Thomas Guyot-Sionnest wrote:
> diff --git a/diff-no-index.c b/diff-no-index.c
> index 7814eabfe0..779c686d23 100644
> --- a/diff-no-index.c
> +++ b/diff-no-index.c
> @@ -41,6 +41,33 @@ static int read_directory_contents(const char *path, struct string_list *list)
>   */
>  static const char file_from_standard_input[] = "-";
>  
> +/* Check that file is - (STDIN) or unnamed pipe - explicitly
> + * avoid on-disk named pipes which could block
> + */
> +static int ispipe(const char *name)
> +{
> +	struct stat st;
> +
> +	if (name == file_from_standard_input)
> +		return 1;  /* STDIN */
> +
> +	if (!lstat(name, &st)) {
> +		if (S_ISLNK(st.st_mode)) {
> +			/* symlink - read it and check it doesn't exists
> +			 * as a file yet link to a pipe */
> +			struct strbuf sb = STRBUF_INIT;
> +			strbuf_realpath(&sb, name, 0);
> +			/* We're abusing strbuf_realpath here, it may append
> +			 * pipe:[NNNNNNNNN] to an abs path */
> +			if (!stat(sb.buf, &st))
> +				return 0; /* link target exists , not pipe */
> +			if (!stat(name, &st))
> +				return S_ISFIFO(st.st_mode);

This makes a lot of assumptions about the implementation which are
specific to Linux, namely that an anonymous pipe will be a symlink to a
FIFO.  That's not the case on macOS, where the /dev/fd entries are
actually named pipes themselves.

Granted, in that case, Git just chokes and fails instead of diffing the
symlink values, but I suspect you'll want this to work on macOS as well.
I don't use macOS that often, but I would appreciate it if it worked
when I did, and I'm sure others will as well.

I think we can probably get away with just doing a regular stat and
seeing if S_ISFIFO is true, which is both simpler and cheaper.

> diff --git a/t/t4053-diff-no-index.sh b/t/t4053-diff-no-index.sh
> index 0168946b63..e49f773515 100755
> --- a/t/t4053-diff-no-index.sh
> +++ b/t/t4053-diff-no-index.sh
> @@ -144,4 +144,193 @@ test_expect_success 'diff --no-index allows external diff' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'diff --no-index can diff piped subshells' '
> +	echo 1 >non/git/c &&
> +	test_expect_code 0 git diff --no-index non/git/b <(cat non/git/c) &&
> +	test_expect_code 0 git diff --no-index <(cat non/git/b) non/git/c &&
> +	test_expect_code 0 git diff --no-index <(cat non/git/b) <(cat non/git/c) &&
> +	test_expect_code 0 cat non/git/b | git diff --no-index - non/git/c &&
> +	test_expect_code 0 cat non/git/c | git diff --no-index non/git/b - &&
> +	test_expect_code 0 cat non/git/b | git diff --no-index - <(cat non/git/c) &&
> +	test_expect_code 0 cat non/git/c | git diff --no-index <(cat non/git/b) -
> +'

As mentioned by others, this requires non-POSIX syntax.  /bin/sh on my
Debian system is dash, which doesn't support this.  You can either use a
prerequisite, or just test by piping from standard input and assume that
Git can handle the rest.  I would recommend at least adding some POSIX
testcases that use only a pipe from standard input to avoid regressing
this behavior on Debian and Ubuntu.
Thomas Guyot Sept. 20, 2020, 12:54 p.m. UTC | #13
On 2020-09-18 14:02, Jeff King wrote:
> On Fri, Sep 18, 2020 at 10:48:41AM -0700, Junio C Hamano wrote:
> 
>> Jeff King <peff@peff.net> writes:
>>
>>> Getting back to the overall feature, this is definitely something that
>>> has come up before. The last I know of is:
>>>
>>>   https://lore.kernel.org/git/20181220002610.43832-1-sandals@crustytoothpaste.net/
>>>
>>> which everybody seemed to like the direction of; I suspect the original
>>> author (cc'd) just never got around to it again. Compared to this
>>> approach, it uses a command-line option to avoid dereferencing symlinks.
>>> That puts an extra burden on the caller to pass the option, but it's way
>>> less magical; you could drop all of the "does this look like a symlink
>>> to a pipe" heuristics. It would also be much easier to test. ;)
>>
>> Yes, I do remember liking the approach very much and wanted to take
>> it once the "do not dereference symlinks everywhere---just limit it
>> to what was given from the command line" was done.
>>
>> To be quite honest, I think "git diff --no-index A B" should
>> unconditionally dereference A and/or B if they are symlinks, whether
>> they are symlinks to pipes, regular files or directories, and
>> otherwise treat symlinks in A and B the same way as "git diff" if A
>> and B are directories.  But that is a design guideline that becomes
>> needed only after we start resurrecting Brian's effort, not with
>> these patches that started this thread.
> 
> Yeah, I think I'd be fine with that approach, too. It makes "git diff
> --no-index" more like other tools out of the box. And if we took brian's
> patch first, then we'd just be flipping its default, and the option it
> adds would give an easy escape hatch for somebody who really wants to
> diff two maybe-symlinks.

Considering the issue with MacOS I'm starting to think the best solution
is to not use any heuristic and read passed-in files directly. That
said, I don't think it makes much change either way (if I resurrect
Brian's patch is will probably end up being a hybrid between the two as
both read the pipe at the same place and my approach was simpler further
down).

I'm not sure which way I prefer to start first - will you accept a patch
that reads passed in files as-is if I I start with this one?

In the mean time I will submit the first patch fixed) as a standalone one.

Regards,

Thomas
Jeff King Sept. 21, 2020, 7:31 p.m. UTC | #14
On Sun, Sep 20, 2020 at 08:54:53AM -0400, Thomas Guyot wrote:

> Considering the issue with MacOS I'm starting to think the best solution
> is to not use any heuristic and read passed-in files directly. That
> said, I don't think it makes much change either way (if I resurrect
> Brian's patch is will probably end up being a hybrid between the two as
> both read the pipe at the same place and my approach was simpler further
> down).
> 
> I'm not sure which way I prefer to start first - will you accept a patch
> that reads passed in files as-is if I I start with this one?

I think the ideal is:

  - implement a command-line option to read the content of paths on the
    command-line literally (i.e., reading from pipes, dereferencing
    symlinks, etc)

  - make sure we have the inverse option (which you should get for free
    in step 1 if you use parse_options)

  - flip the default to do literal reads

We'd sometimes wait several versions before that last step to give
people time to adjust scripts, etc. But in this case, I suspect it would
be OK to just flip it immediately. We don't consider "git diff" itself
part of the stable plumbing, and the --no-index part of it I would
consider even less stable. And AFAICT most people consider the current
behavior a bug because it doesn't behave like other diff tools.

-Peff
Junio C Hamano Sept. 21, 2020, 8:14 p.m. UTC | #15
Jeff King <peff@peff.net> writes:

> We'd sometimes wait several versions before that last step to give
> people time to adjust scripts, etc. But in this case, I suspect it would
> be OK to just flip it immediately. We don't consider "git diff" itself
> part of the stable plumbing, and the --no-index part of it I would
> consider even less stable. And AFAICT most people consider the current
> behavior a bug because it doesn't behave like other diff tools.

The "git diff" proper gets no filenames from the command line and
the above strictly applies only to the no-index mode, with or
without the explicit "--no-index" option.

It was a way to give Git niceties like colored diffs, renames,
etc. to non-Git managed two sets of paths, and the primary reason
why we have it as a mode of "git diff" is because we chose not to
bother with interacting with upstream maintainers of "diff".  In an
ideal world, "GNU diff" and others would have learned things like
renames, word diffs, etc., instead of "git diff" adding "--no-index"
mode.

And that makes me agree that users expect "git diff --no-index" to
behave like other people's diff and that is more important than
behaving like "git diff" for that mode.
diff mbox series

Patch

diff --git a/diff-no-index.c b/diff-no-index.c
index 7814eabfe0..779c686d23 100644
--- a/diff-no-index.c
+++ b/diff-no-index.c
@@ -41,6 +41,33 @@  static int read_directory_contents(const char *path, struct string_list *list)
  */
 static const char file_from_standard_input[] = "-";
 
+/* Check that file is - (STDIN) or unnamed pipe - explicitly
+ * avoid on-disk named pipes which could block
+ */
+static int ispipe(const char *name)
+{
+	struct stat st;
+
+	if (name == file_from_standard_input)
+		return 1;  /* STDIN */
+
+	if (!lstat(name, &st)) {
+		if (S_ISLNK(st.st_mode)) {
+			/* symlink - read it and check it doesn't exists
+			 * as a file yet link to a pipe */
+			struct strbuf sb = STRBUF_INIT;
+			strbuf_realpath(&sb, name, 0);
+			/* We're abusing strbuf_realpath here, it may append
+			 * pipe:[NNNNNNNNN] to an abs path */
+			if (!stat(sb.buf, &st))
+				return 0; /* link target exists , not pipe */
+			if (!stat(name, &st))
+				return S_ISFIFO(st.st_mode);
+		}
+	}
+	return 0;
+}
+
 static int get_mode(const char *path, int *mode)
 {
 	struct stat st;
@@ -51,7 +78,7 @@  static int get_mode(const char *path, int *mode)
 	else if (!strcasecmp(path, "nul"))
 		*mode = 0;
 #endif
-	else if (path == file_from_standard_input)
+	else if (ispipe(path))
 		*mode = create_ce_mode(0666);
 	else if (lstat(path, &st))
 		return error("Could not access '%s'", path);
@@ -60,13 +87,13 @@  static int get_mode(const char *path, int *mode)
 	return 0;
 }
 
-static int populate_from_stdin(struct diff_filespec *s)
+static int populate_from_fd(struct diff_filespec *s, int fd)
 {
 	struct strbuf buf = STRBUF_INIT;
 	size_t size = 0;
 
-	if (strbuf_read(&buf, 0, 0) < 0)
-		return error_errno("error while reading from stdin");
+	if (strbuf_read(&buf, fd, 0) < 0)
+		return error_errno(_("error while reading from fd %i"), fd);
 
 	s->should_munmap = 0;
 	s->data = strbuf_detach(&buf, &size);
@@ -76,6 +103,20 @@  static int populate_from_stdin(struct diff_filespec *s)
 	return 0;
 }
 
+static int populate_from_pipe(struct diff_filespec *s, const char *name)
+{
+	int ret;
+	FILE *f;
+
+	f = fopen(name, "r");
+	if (!f)
+		return error_errno(_("cannot open %s"), name);
+
+	ret = populate_from_fd(s, fileno(f));
+	fclose(f);
+	return ret;
+}
+
 static struct diff_filespec *noindex_filespec(const char *name, int mode)
 {
 	struct diff_filespec *s;
@@ -85,7 +126,9 @@  static struct diff_filespec *noindex_filespec(const char *name, int mode)
 	s = alloc_filespec(name);
 	fill_filespec(s, &null_oid, 0, mode);
 	if (name == file_from_standard_input)
-		populate_from_stdin(s);
+		populate_from_fd(s, 0);
+	else if (ispipe(name))
+		populate_from_pipe(s, name);
 	return s;
 }
 
@@ -218,8 +261,7 @@  static void fixup_paths(const char **path, struct strbuf *replacement)
 {
 	unsigned int isdir0, isdir1;
 
-	if (path[0] == file_from_standard_input ||
-	    path[1] == file_from_standard_input)
+	if (ispipe(path[0]) || ispipe(path[1]))
 		return;
 	isdir0 = is_directory(path[0]);
 	isdir1 = is_directory(path[1]);
diff --git a/t/t4053-diff-no-index.sh b/t/t4053-diff-no-index.sh
index 0168946b63..e49f773515 100755
--- a/t/t4053-diff-no-index.sh
+++ b/t/t4053-diff-no-index.sh
@@ -144,4 +144,193 @@  test_expect_success 'diff --no-index allows external diff' '
 	test_cmp expect actual
 '
 
+test_expect_success 'diff --no-index can diff piped subshells' '
+	echo 1 >non/git/c &&
+	test_expect_code 0 git diff --no-index non/git/b <(cat non/git/c) &&
+	test_expect_code 0 git diff --no-index <(cat non/git/b) non/git/c &&
+	test_expect_code 0 git diff --no-index <(cat non/git/b) <(cat non/git/c) &&
+	test_expect_code 0 cat non/git/b | git diff --no-index - non/git/c &&
+	test_expect_code 0 cat non/git/c | git diff --no-index non/git/b - &&
+	test_expect_code 0 cat non/git/b | git diff --no-index - <(cat non/git/c) &&
+	test_expect_code 0 cat non/git/c | git diff --no-index <(cat non/git/b) -
+'
+
+test_expect_success 'diff --no-index finds diff in piped subshells' '
+	(
+		set -- <(cat /dev/null) <(cat /dev/null)
+		cat <<-EOF >expect
+			diff --git a$1 b$2
+			--- a$1
+			+++ b$2
+			@@ -1 +1 @@
+			-1
+			+2
+		EOF
+	) &&
+	test_expect_code 1 \
+		git diff --no-index <(cat non/git/b) <(sed s/1/2/ non/git/c) >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff --no-index with stat and numstat' '
+	(
+		set -- <(cat /dev/null) <(cat /dev/null)
+		min=$((${#1} < ${#2} ? ${#1} : ${#2}))
+		for ((i=0; i<min; i++)); do [ "${1:i:1}" = "${2:i:1}" ] || break; done
+		base=${1:0:i-1}
+		cat <<-EOF >expect1
+			 $base{${1#$base} => ${2#$base}} | 2 +-
+			 1 file changed, 1 insertion(+), 1 deletion(-)
+		EOF
+		cat <<-EOF >expect2
+			1	1	$base{${1#$base} => ${2#$base}}
+		EOF
+	) &&
+	test_expect_code 1 \
+		git diff --no-index --stat <(cat non/git/a) <(sed s/1/2/ non/git/b) >actual &&
+	test_cmp expect1 actual &&
+	test_expect_code 1 \
+		git diff --no-index --numstat <(cat non/git/a) <(sed s/1/2/ non/git/b) >actual &&
+	test_cmp expect2 actual
+'
+
+test_expect_success PIPE 'diff --no-index on filesystem pipes' '
+	(
+		cd non/git &&
+		mkdir f g &&
+		mkfifo f/1 g/1 &&
+		test_expect_code 128 git diff --no-index f g &&
+		test_expect_code 128 git diff --no-index ../../a f &&
+		test_expect_code 128 git diff --no-index g ../../a &&
+		test_expect_code 128 git diff --no-index f/1 g/1 &&
+		test_expect_code 128 git diff --no-index f/1 ../../a/1 &&
+		test_expect_code 128 git diff --no-index ../../a/1 g/1
+	)
+'
+
+test_expect_success PIPE 'diff --no-index reads symlinks to named pipes as symlinks' '
+	(
+		cd non/git &&
+		mkdir h i &&
+		ln -s ../f/1 h/1 &&
+		ln -s ../g/1 i/1 &&
+		test_expect_code 1 git diff --no-index h i >actual &&
+		cat <<-EOF >expect &&
+			diff --git a/h/1 b/i/1
+			index d0b5850..d8b9c34 120000
+			--- a/h/1
+			+++ b/i/1
+			@@ -1 +1 @@
+			-../f/1
+			\ No newline at end of file
+			+../g/1
+			\ No newline at end of file
+		EOF
+		test_cmp expect actual &&
+		test_expect_code 1 git diff --no-index ../../a h >actual &&
+		cat <<-EOF >expect &&
+			diff --git a/../../a/1 b/../../a/1
+			deleted file mode 100644
+			index d00491f..0000000
+			--- a/../../a/1
+			+++ /dev/null
+			@@ -1 +0,0 @@
+			-1
+			diff --git a/h/1 b/h/1
+			new file mode 120000
+			index 0000000..d0b5850
+			--- /dev/null
+			+++ b/h/1
+			@@ -0,0 +1 @@
+			+../f/1
+			\ No newline at end of file
+			diff --git a/../../a/2 b/../../a/2
+			deleted file mode 100644
+			index 0cfbf08..0000000
+			--- a/../../a/2
+			+++ /dev/null
+			@@ -1 +0,0 @@
+			-2
+		EOF
+		test_cmp expect actual &&
+		test_expect_code 1 git diff --no-index i ../../a >actual &&
+		cat <<-EOF >expect &&
+			diff --git a/i/1 b/i/1
+			deleted file mode 120000
+			index d8b9c34..0000000
+			--- a/i/1
+			+++ /dev/null
+			@@ -1 +0,0 @@
+			-../g/1
+			\ No newline at end of file
+			diff --git a/../../a/1 b/../../a/1
+			new file mode 100644
+			index 0000000..d00491f
+			--- /dev/null
+			+++ b/../../a/1
+			@@ -0,0 +1 @@
+			+1
+			diff --git a/../../a/2 b/../../a/2
+			new file mode 100644
+			index 0000000..0cfbf08
+			--- /dev/null
+			+++ b/../../a/2
+			@@ -0,0 +1 @@
+			+2
+		EOF
+		test_cmp expect actual &&
+		test_expect_code 1 git diff --no-index h/1 i/1 >actual &&
+		cat <<-EOF >expect &&
+			diff --git a/h/1 b/i/1
+			index d0b5850..d8b9c34 120000
+			--- a/h/1
+			+++ b/i/1
+			@@ -1 +1 @@
+			-../f/1
+			\ No newline at end of file
+			+../g/1
+			\ No newline at end of file
+		EOF
+		test_cmp expect actual &&
+		test_expect_code 1 git diff --no-index h/1 ../../a/1 >actual &&
+		cat <<-EOF >expect &&
+			diff --git a/h/1 b/h/1
+			deleted file mode 120000
+			index d0b5850..0000000
+			--- a/h/1
+			+++ /dev/null
+			@@ -1 +0,0 @@
+			-../f/1
+			\ No newline at end of file
+			diff --git a/../../a/1 b/../../a/1
+			new file mode 100644
+			index 0000000..d00491f
+			--- /dev/null
+			+++ b/../../a/1
+			@@ -0,0 +1 @@
+			+1
+		EOF
+		test_cmp expect actual &&
+		test_expect_code 1 git diff --no-index ../../a/1 i/1 >actual &&
+		cat <<-EOF >expect &&
+			diff --git a/../../a/1 b/../../a/1
+			deleted file mode 100644
+			index d00491f..0000000
+			--- a/../../a/1
+			+++ /dev/null
+			@@ -1 +0,0 @@
+			-1
+			diff --git a/i/1 b/i/1
+			new file mode 120000
+			index 0000000..d8b9c34
+			--- /dev/null
+			+++ b/i/1
+			@@ -0,0 +1 @@
+			+../g/1
+			\ No newline at end of file
+		EOF
+		test_cmp expect actual
+	)
+'
+
 test_done