diff mbox series

diff: add support for reading files literally with --no-index

Message ID 20181220002610.43832-1-sandals@crustytoothpaste.net (mailing list archive)
State New, archived
Headers show
Series diff: add support for reading files literally with --no-index | expand

Commit Message

brian m. carlson Dec. 20, 2018, 12:26 a.m. UTC
In some shells, such as bash and zsh, it's possible to use a command
substitution to provide the output of a command as a file argument to
another process, like so:

  diff -u <(printf "a\nb\n") <(printf "a\nc\n")

However, this syntax does not produce useful results with git diff
--no-index.

On macOS, the arguments to the command are named pipes under /dev/fd,
and git diff doesn't know how to handle a named pipe. On Linux, the
arguments are symlinks to pipes, so git diff "helpfully" diffs these
symlinks, comparing their targets like "pipe:[1234]" and "pipe:[5678]".

Because this behavior is not very helpful, and because git diff has many
features that people would like to use even on non-Git files, add an
option to git diff --no-index to read files literally, dereferencing
symlinks and reading them as a normal file.

Note that this behavior requires that the files be read entirely into
memory, just as we do when reading from standard input.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
This is a long-standing annoyance of mine, and it also makes some use
cases possible (for example, diffing filtered and non-filtered objects).

We don't include a test for the pipe scenario because I couldn't get
that case to work in portable shell (although of course it works in
bash). I have, however, tested it on both macOS and Linux. No clue how
this works on Windows.

 Documentation/git-diff.txt |  5 +++++
 diff-no-index.c            | 34 +++++++++++++++++++++++++++-------
 diff.c                     | 24 +++++++++++++-----------
 diff.h                     |  1 +
 diffcore.h                 |  1 +
 t/t4053-diff-no-index.sh   | 28 ++++++++++++++++++++++++++++
 6 files changed, 75 insertions(+), 18 deletions(-)

Comments

Jeff King Dec. 20, 2018, 3:48 p.m. UTC | #1
On Thu, Dec 20, 2018 at 12:26:10AM +0000, brian m. carlson wrote:

> In some shells, such as bash and zsh, it's possible to use a command
> substitution to provide the output of a command as a file argument to
> another process, like so:
> 
>   diff -u <(printf "a\nb\n") <(printf "a\nc\n")
> 
> However, this syntax does not produce useful results with git diff
> --no-index.
> 
> On macOS, the arguments to the command are named pipes under /dev/fd,
> and git diff doesn't know how to handle a named pipe. On Linux, the
> arguments are symlinks to pipes, so git diff "helpfully" diffs these
> symlinks, comparing their targets like "pipe:[1234]" and "pipe:[5678]".
> 
> Because this behavior is not very helpful, and because git diff has many
> features that people would like to use even on non-Git files, add an
> option to git diff --no-index to read files literally, dereferencing
> symlinks and reading them as a normal file.
> 
> Note that this behavior requires that the files be read entirely into
> memory, just as we do when reading from standard input.

Yeah, I think this is a useful feature to have. It seemed familiar, and
indeed there were some patches a while back:

  https://public-inbox.org/git/20170113102021.6054-1-dennis@kaarsemaker.net/

Compared to that series, the implementation here seems simpler and less
likely to have unexpected effects.

Reading the patch, one thing I _thought_ it did (but it does not) is to
impact only the actual arguments on the command line, not entries we
find from recursing trees.

I.e., if I said:

  ln -s bar a/foo
  ln -s baz b/foo
  git diff --no-index --literally a/foo b/foo
  git diff --no-index --literally a/ b/

should I still see a/foo as a symlink in the second one, or not?

The distinction is a bit subtle, but I think treating only the actual
top-level arguments as symlinks would solve your problem, but still
allow a more detailed diff for the recursive cases.

There's some more discussion on this in the earlier iteration of that
series:

  https://public-inbox.org/git/20161111201958.2175-1-dennis@kaarsemaker.net/

I also suspect people might want a config option to make this the
default (which should be OK, as git-diff is, after all, porcelain). But
either way, a command line option is the first step.

> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> This is a long-standing annoyance of mine, and it also makes some use
> cases possible (for example, diffing filtered and non-filtered objects).
> 
> We don't include a test for the pipe scenario because I couldn't get
> that case to work in portable shell (although of course it works in
> bash). I have, however, tested it on both macOS and Linux. No clue how
> this works on Windows.

I think focusing on symlinks makes sense. You could probably test pipes
with mkfifo, but looking at your implementation, it's pretty clear that
it will operate on anything that works with read().

> diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt
> index 030f162f30..4c4695c88d 100644
> --- a/Documentation/git-diff.txt
> +++ b/Documentation/git-diff.txt
> @@ -111,6 +111,11 @@ include::diff-options.txt[]
>  	"Unmerged".  Can be used only when comparing the working tree
>  	with the index.
>  
> +--literally::
> +  Read the specified files literally, as `diff` would,
> +  dereferencing any symlinks and reading data from pipes.
> +  This option only works with `--no-index`.
> +

Looks like spaces for indent, whereas the context uses tabs.

I think "literal" is a good way to describe this concept. If we do grow
a config option to make this the default, then countermanding it would
be "--no-literally", which parses a bit funny as English.

If you agree with my "only the top-level" line of reasoning above, maybe
"--literal-arguments" and "--no-literal-arguments" might make sense.

We could also in theory offer several levels: no literals, top-level
literals, everything literal, at which point it becomes a tri-state.

> -static struct diff_filespec *noindex_filespec(const char *name, int mode)
> +static int populate_literally(struct diff_filespec *s)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	size_t size = 0;
> +	int fd = xopen(s->path, O_RDONLY);
> +
> +	if (strbuf_read(&buf, fd, 0) < 0)
> +		return error_errno("error while reading from '%s'", s->path);
> +
> +	s->should_munmap = 0;
> +	s->data = strbuf_detach(&buf, &size);
> +	s->size = size;
> +	s->should_free = 1;
> +	s->read_literally = 1;
> +	return 0;
> +}
> +
> +static struct diff_filespec *noindex_filespec(const char *name, int mode,
> +					      struct diff_options *o)
> [...]
> +	else if (o->flags.read_literally)
> +		populate_literally(s);

The implementation is pleasantly simple. If we want to do the "only
top-level" thing, we'd just have to pass in a depth flag here.

> diff --git a/diff.c b/diff.c
> index dc9965e836..740d0087b9 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4282,18 +4282,18 @@ static void run_diff_cmd(const char *pgm,
>  		fprintf(o->file, "* Unmerged path %s\n", name);
>  }
>  
> -static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *istate)
> +static void diff_fill_oid_info(struct diff_filespec *one, struct diff_options *o)

It might be worth breaking these "pass the options around" hunks into a
separate preparatory patch.

-Peff
Duy Nguyen Dec. 20, 2018, 5:06 p.m. UTC | #2
On Thu, Dec 20, 2018 at 1:26 AM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> @@ -5159,6 +5159,8 @@ int diff_opt_parse(struct diff_options *options,
>                 options->flags.funccontext = 1;
>         else if (!strcmp(arg, "--no-function-context"))
>                 options->flags.funccontext = 0;
> +       else if (!strcmp(arg, "--literally"))
> +               options->flags.read_literally = 1;

You probably want to check in diff_setup_done() that if
flags.read_literally is set but flags.no_index is not, then abandon
ship and die() because --literally is not used with --no-index. Even
when --no-index is implicit, flags.no_index should be set.

I wonder if --follow-symlinks would be a good alternative for this
(then if the final destination is unmmapable then we just read the
file whole in memory without the user asking, so it will work with
pipes). --follow-symlinks then could be made work with non-"no-index"
case too. But --literally is also ok.
Jeff King Dec. 20, 2018, 5:17 p.m. UTC | #3
On Thu, Dec 20, 2018 at 06:06:11PM +0100, Duy Nguyen wrote:

> On Thu, Dec 20, 2018 at 1:26 AM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > @@ -5159,6 +5159,8 @@ int diff_opt_parse(struct diff_options *options,
> >                 options->flags.funccontext = 1;
> >         else if (!strcmp(arg, "--no-function-context"))
> >                 options->flags.funccontext = 0;
> > +       else if (!strcmp(arg, "--literally"))
> > +               options->flags.read_literally = 1;
> 
> You probably want to check in diff_setup_done() that if
> flags.read_literally is set but flags.no_index is not, then abandon
> ship and die() because --literally is not used with --no-index. Even
> when --no-index is implicit, flags.no_index should be set.

Yeah, good catch. "git diff --literally HEAD" should report an error.

> I wonder if --follow-symlinks would be a good alternative for this
> (then if the final destination is unmmapable then we just read the
> file whole in memory without the user asking, so it will work with
> pipes). --follow-symlinks then could be made work with non-"no-index"
> case too. But --literally is also ok.

It's more than symlinks, though. Reading from a named pipe, we'd want to
see the actual contents with --literally (and not "oops, I don't know
how to represent a named pipe").

-Peff
Duy Nguyen Dec. 20, 2018, 5:23 p.m. UTC | #4
On Thu, Dec 20, 2018 at 6:18 PM Jeff King <peff@peff.net> wrote:
> > I wonder if --follow-symlinks would be a good alternative for this
> > (then if the final destination is unmmapable then we just read the
> > file whole in memory without the user asking, so it will work with
> > pipes). --follow-symlinks then could be made work with non-"no-index"
> > case too. But --literally is also ok.
>
> It's more than symlinks, though. Reading from a named pipe, we'd want to
> see the actual contents with --literally (and not "oops, I don't know
> how to represent a named pipe").

Yes, but I think at least --no-index it makes sense to just fall back
to read() if we can't mmap(). mmap is more of an optimization than a
requirement. There's no loss going from "oops I don't know how to
represent it" to "here's the content from whatever what that device
is". Symlinks are different because we have two possible
representations and the user should choose.
Jeff King Dec. 20, 2018, 5:32 p.m. UTC | #5
On Thu, Dec 20, 2018 at 06:23:42PM +0100, Duy Nguyen wrote:

> On Thu, Dec 20, 2018 at 6:18 PM Jeff King <peff@peff.net> wrote:
> > > I wonder if --follow-symlinks would be a good alternative for this
> > > (then if the final destination is unmmapable then we just read the
> > > file whole in memory without the user asking, so it will work with
> > > pipes). --follow-symlinks then could be made work with non-"no-index"
> > > case too. But --literally is also ok.
> >
> > It's more than symlinks, though. Reading from a named pipe, we'd want to
> > see the actual contents with --literally (and not "oops, I don't know
> > how to represent a named pipe").
> 
> Yes, but I think at least --no-index it makes sense to just fall back
> to read() if we can't mmap(). mmap is more of an optimization than a
> requirement. There's no loss going from "oops I don't know how to
> represent it" to "here's the content from whatever what that device
> is". Symlinks are different because we have two possible
> representations and the user should choose.

Oh, I see. I misread your paragraph. Yeah, I think that is the behavior
I would want by default, too, though the previous thread makes me thing
some people would literally rather have the "I can't represent this"
behavior (because they really do want a diff that can reconstruct the
filesystem state, and an error if not).

-Peff
Duy Nguyen Dec. 20, 2018, 5:37 p.m. UTC | #6
On Thu, Dec 20, 2018 at 6:32 PM Jeff King <peff@peff.net> wrote:
>
> On Thu, Dec 20, 2018 at 06:23:42PM +0100, Duy Nguyen wrote:
>
> > On Thu, Dec 20, 2018 at 6:18 PM Jeff King <peff@peff.net> wrote:
> > > > I wonder if --follow-symlinks would be a good alternative for this
> > > > (then if the final destination is unmmapable then we just read the
> > > > file whole in memory without the user asking, so it will work with
> > > > pipes). --follow-symlinks then could be made work with non-"no-index"
> > > > case too. But --literally is also ok.
> > >
> > > It's more than symlinks, though. Reading from a named pipe, we'd want to
> > > see the actual contents with --literally (and not "oops, I don't know
> > > how to represent a named pipe").
> >
> > Yes, but I think at least --no-index it makes sense to just fall back
> > to read() if we can't mmap(). mmap is more of an optimization than a
> > requirement. There's no loss going from "oops I don't know how to
> > represent it" to "here's the content from whatever what that device
> > is". Symlinks are different because we have two possible
> > representations and the user should choose.
>
> Oh, I see. I misread your paragraph. Yeah, I think that is the behavior
> I would want by default, too, though the previous thread makes me thing
> some people would literally rather have the "I can't represent this"
> behavior (because they really do want a diff that can reconstruct the
> filesystem state, and an error if not).

I can't see a good use case for that. But yeah it would not harm to be
a bit more conservative, then make --literally default later and leave
--no-literally as an escape hatch.
Ævar Arnfjörð Bjarmason Dec. 20, 2018, 9:43 p.m. UTC | #7
On Thu, Dec 20 2018, brian m. carlson wrote:

> We don't include a test for the pipe scenario because I couldn't get
> that case to work in portable shell (although of course it works in
> bash). I have, however, tested it on both macOS and Linux. No clue how
> this works on Windows.

You can (and should) add tests using the PROCESS_SUBSTITUTION prereq
Dennis used in an earlier version of this:
https://public-inbox.org/git/20170113102021.6054-3-dennis@kaarsemaker.net/

Even if that wasn't possible a bash-specific test is better than no
test, and can be had by including t/lib-bash.sh in your test, see
e.g. the completion tests.
brian m. carlson Dec. 20, 2018, 11:54 p.m. UTC | #8
On Thu, Dec 20, 2018 at 10:43:06PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Dec 20 2018, brian m. carlson wrote:
> 
> > We don't include a test for the pipe scenario because I couldn't get
> > that case to work in portable shell (although of course it works in
> > bash). I have, however, tested it on both macOS and Linux. No clue how
> > this works on Windows.
> 
> You can (and should) add tests using the PROCESS_SUBSTITUTION prereq
> Dennis used in an earlier version of this:
> https://public-inbox.org/git/20170113102021.6054-3-dennis@kaarsemaker.net/

I'm happy to steal that code. I didn't know this had come up before, so
I didn't think of it.
brian m. carlson Dec. 21, 2018, 12:25 a.m. UTC | #9
On Thu, Dec 20, 2018 at 10:48:41AM -0500, Jeff King wrote:
> The distinction is a bit subtle, but I think treating only the actual
> top-level arguments as symlinks would solve your problem, but still
> allow a more detailed diff for the recursive cases.

Yeah, I think that would be better. I'll add a test.

> Looks like spaces for indent, whereas the context uses tabs.

Will fix.

> I think "literal" is a good way to describe this concept. If we do grow
> a config option to make this the default, then countermanding it would
> be "--no-literally", which parses a bit funny as English.
> 
> If you agree with my "only the top-level" line of reasoning above, maybe
> "--literal-arguments" and "--no-literal-arguments" might make sense.
> 
> We could also in theory offer several levels: no literals, top-level
> literals, everything literal, at which point it becomes a tri-state.

Yeah, this is what the POSIX symlink options (-H, -L, -P) do. I
originally came up with "--dereference" as the name, but I decided to
change it, because it doesn't affect just symlinks. I think
--literal-arguments is better.

Theoretically, we could adopt the POSIX options for symlink/pipe
handling if we want to in the future, but I think that's a decision we
should make later.

> > diff --git a/diff.c b/diff.c
> > index dc9965e836..740d0087b9 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -4282,18 +4282,18 @@ static void run_diff_cmd(const char *pgm,
> >  		fprintf(o->file, "* Unmerged path %s\n", name);
> >  }
> >  
> > -static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *istate)
> > +static void diff_fill_oid_info(struct diff_filespec *one, struct diff_options *o)
> 
> It might be worth breaking these "pass the options around" hunks into a
> separate preparatory patch.

I can do that.
Johannes Schindelin Dec. 21, 2018, 11:52 a.m. UTC | #10
Hi Brian,

On Thu, 20 Dec 2018, brian m. carlson wrote:

> In some shells, such as bash and zsh, it's possible to use a command
> substitution to provide the output of a command as a file argument to
> another process, like so:
> 
>   diff -u <(printf "a\nb\n") <(printf "a\nc\n")
> 
> However, this syntax does not produce useful results with git diff
> --no-index.
> 
> On macOS, the arguments to the command are named pipes under /dev/fd,
> and git diff doesn't know how to handle a named pipe. On Linux, the
> arguments are symlinks to pipes, so git diff "helpfully" diffs these
> symlinks, comparing their targets like "pipe:[1234]" and "pipe:[5678]".


... and in Git for Windows' Bash, it would result in something like:

	$ git -P diff --no-index <(printf "a\nb\n") <(printf "a\nc\n")
	error: Could not access '/proc/24012/fd/63'

... because the Bash is "MSYS2-aware" and knows about `/proc/` while
`git.exe` is a pure Win32 executable that has no idea what Bash is talking
about.

Sadly, your patch does not change the situation one bit (because it does
not change the fact that the MSYS2 Bash passes a path to `git.exe` that is
not a valid Windows path, and neither could it, but that's not the problem
of your patch).

I reviewed your patch and it looks good to me!

Thanks,
Dscho

> Because this behavior is not very helpful, and because git diff has many
> features that people would like to use even on non-Git files, add an
> option to git diff --no-index to read files literally, dereferencing
> symlinks and reading them as a normal file.
> 
> Note that this behavior requires that the files be read entirely into
> memory, just as we do when reading from standard input.
> 
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> This is a long-standing annoyance of mine, and it also makes some use
> cases possible (for example, diffing filtered and non-filtered objects).
> 
> We don't include a test for the pipe scenario because I couldn't get
> that case to work in portable shell (although of course it works in
> bash). I have, however, tested it on both macOS and Linux. No clue how
> this works on Windows.
> 
>  Documentation/git-diff.txt |  5 +++++
>  diff-no-index.c            | 34 +++++++++++++++++++++++++++-------
>  diff.c                     | 24 +++++++++++++-----------
>  diff.h                     |  1 +
>  diffcore.h                 |  1 +
>  t/t4053-diff-no-index.sh   | 28 ++++++++++++++++++++++++++++
>  6 files changed, 75 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt
> index 030f162f30..4c4695c88d 100644
> --- a/Documentation/git-diff.txt
> +++ b/Documentation/git-diff.txt
> @@ -111,6 +111,11 @@ include::diff-options.txt[]
>  	"Unmerged".  Can be used only when comparing the working tree
>  	with the index.
>  
> +--literally::
> +  Read the specified files literally, as `diff` would,
> +  dereferencing any symlinks and reading data from pipes.
> +  This option only works with `--no-index`.
> +
>  <path>...::
>  	The <paths> parameters, when given, are used to limit
>  	the diff to the named paths (you can give directory
> diff --git a/diff-no-index.c b/diff-no-index.c
> index 9414e922d1..2707206aee 100644
> --- a/diff-no-index.c
> +++ b/diff-no-index.c
> @@ -75,7 +75,25 @@ static int populate_from_stdin(struct diff_filespec *s)
>  	return 0;
>  }
>  
> -static struct diff_filespec *noindex_filespec(const char *name, int mode)
> +static int populate_literally(struct diff_filespec *s)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	size_t size = 0;
> +	int fd = xopen(s->path, O_RDONLY);
> +
> +	if (strbuf_read(&buf, fd, 0) < 0)
> +		return error_errno("error while reading from '%s'", s->path);
> +
> +	s->should_munmap = 0;
> +	s->data = strbuf_detach(&buf, &size);
> +	s->size = size;
> +	s->should_free = 1;
> +	s->read_literally = 1;
> +	return 0;
> +}
> +
> +static struct diff_filespec *noindex_filespec(const char *name, int mode,
> +					      struct diff_options *o)
>  {
>  	struct diff_filespec *s;
>  
> @@ -85,6 +103,8 @@ static struct diff_filespec *noindex_filespec(const char *name, int mode)
>  	fill_filespec(s, &null_oid, 0, mode);
>  	if (name == file_from_standard_input)
>  		populate_from_stdin(s);
> +	else if (o->flags.read_literally)
> +		populate_literally(s);
>  	return s;
>  }
>  
> @@ -101,14 +121,14 @@ static int queue_diff(struct diff_options *o,
>  
>  		if (S_ISDIR(mode1)) {
>  			/* 2 is file that is created */
> -			d1 = noindex_filespec(NULL, 0);
> -			d2 = noindex_filespec(name2, mode2);
> +			d1 = noindex_filespec(NULL, 0, o);
> +			d2 = noindex_filespec(name2, mode2, o);
>  			name2 = NULL;
>  			mode2 = 0;
>  		} else {
>  			/* 1 is file that is deleted */
> -			d1 = noindex_filespec(name1, mode1);
> -			d2 = noindex_filespec(NULL, 0);
> +			d1 = noindex_filespec(name1, mode1, o);
> +			d2 = noindex_filespec(NULL, 0, o);
>  			name1 = NULL;
>  			mode1 = 0;
>  		}
> @@ -189,8 +209,8 @@ static int queue_diff(struct diff_options *o,
>  			SWAP(name1, name2);
>  		}
>  
> -		d1 = noindex_filespec(name1, mode1);
> -		d2 = noindex_filespec(name2, mode2);
> +		d1 = noindex_filespec(name1, mode1, o);
> +		d2 = noindex_filespec(name2, mode2, o);
>  		diff_queue(&diff_queued_diff, d1, d2);
>  		return 0;
>  	}
> diff --git a/diff.c b/diff.c
> index dc9965e836..740d0087b9 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4282,18 +4282,18 @@ static void run_diff_cmd(const char *pgm,
>  		fprintf(o->file, "* Unmerged path %s\n", name);
>  }
>  
> -static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *istate)
> +static void diff_fill_oid_info(struct diff_filespec *one, struct diff_options *o)
>  {
>  	if (DIFF_FILE_VALID(one)) {
>  		if (!one->oid_valid) {
>  			struct stat st;
> -			if (one->is_stdin) {
> +			if (one->is_stdin || one->read_literally) {
>  				oidclr(&one->oid);
>  				return;
>  			}
>  			if (lstat(one->path, &st) < 0)
>  				die_errno("stat '%s'", one->path);
> -			if (index_path(istate, &one->oid, one->path, &st, 0))
> +			if (index_path(o->repo->index, &one->oid, one->path, &st, 0))
>  				die("cannot hash %s", one->path);
>  		}
>  	}
> @@ -4341,8 +4341,8 @@ static void run_diff(struct diff_filepair *p, struct diff_options *o)
>  		return;
>  	}
>  
> -	diff_fill_oid_info(one, o->repo->index);
> -	diff_fill_oid_info(two, o->repo->index);
> +	diff_fill_oid_info(one, o);
> +	diff_fill_oid_info(two, o);
>  
>  	if (!pgm &&
>  	    DIFF_FILE_VALID(one) && DIFF_FILE_VALID(two) &&
> @@ -4389,8 +4389,8 @@ static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
>  	if (o->prefix_length)
>  		strip_prefix(o->prefix_length, &name, &other);
>  
> -	diff_fill_oid_info(p->one, o->repo->index);
> -	diff_fill_oid_info(p->two, o->repo->index);
> +	diff_fill_oid_info(p->one, o);
> +	diff_fill_oid_info(p->two, o);
>  
>  	builtin_diffstat(name, other, p->one, p->two,
>  			 diffstat, o, p);
> @@ -4414,8 +4414,8 @@ static void run_checkdiff(struct diff_filepair *p, struct diff_options *o)
>  	if (o->prefix_length)
>  		strip_prefix(o->prefix_length, &name, &other);
>  
> -	diff_fill_oid_info(p->one, o->repo->index);
> -	diff_fill_oid_info(p->two, o->repo->index);
> +	diff_fill_oid_info(p->one, o);
> +	diff_fill_oid_info(p->two, o);
>  
>  	builtin_checkdiff(name, other, attr_path, p->one, p->two, o);
>  }
> @@ -5159,6 +5159,8 @@ int diff_opt_parse(struct diff_options *options,
>  		options->flags.funccontext = 1;
>  	else if (!strcmp(arg, "--no-function-context"))
>  		options->flags.funccontext = 0;
> +	else if (!strcmp(arg, "--literally"))
> +		options->flags.read_literally = 1;
>  	else if ((argcount = parse_long_opt("output", av, &optarg))) {
>  		char *path = prefix_filename(prefix, optarg);
>  		options->file = xfopen(path, "w");
> @@ -5720,8 +5722,8 @@ static int diff_get_patch_id(struct diff_options *options, struct object_id *oid
>  		if (DIFF_PAIR_UNMERGED(p))
>  			continue;
>  
> -		diff_fill_oid_info(p->one, options->repo->index);
> -		diff_fill_oid_info(p->two, options->repo->index);
> +		diff_fill_oid_info(p->one, options);
> +		diff_fill_oid_info(p->two, options);
>  
>  		len1 = remove_space(p->one->path, strlen(p->one->path));
>  		len2 = remove_space(p->two->path, strlen(p->two->path));
> diff --git a/diff.h b/diff.h
> index ce5e8a8183..7dedd3bcd1 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -97,6 +97,7 @@ struct diff_flags {
>  	unsigned stat_with_summary:1;
>  	unsigned suppress_diff_headers:1;
>  	unsigned dual_color_diffed_diffs:1;
> +	unsigned read_literally:1;
>  };
>  
>  static inline void diff_flags_or(struct diff_flags *a,
> diff --git a/diffcore.h b/diffcore.h
> index b651061c0e..363869447a 100644
> --- a/diffcore.h
> +++ b/diffcore.h
> @@ -48,6 +48,7 @@ struct diff_filespec {
>  #define DIRTY_SUBMODULE_UNTRACKED 1
>  #define DIRTY_SUBMODULE_MODIFIED  2
>  	unsigned is_stdin : 1;
> +	unsigned read_literally : 1;
>  	unsigned has_more_entries : 1; /* only appear in combined diff */
>  	/* data should be considered "binary"; -1 means "don't know yet" */
>  	signed int is_binary : 2;
> diff --git a/t/t4053-diff-no-index.sh b/t/t4053-diff-no-index.sh
> index 6e0dd6f9e5..53e6bcdc19 100755
> --- a/t/t4053-diff-no-index.sh
> +++ b/t/t4053-diff-no-index.sh
> @@ -137,4 +137,32 @@ test_expect_success 'diff --no-index from repo subdir with absolute paths' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'diff --no-index --literally' '
> +	echo "diff --git a/../../non/git/a b/../../non/git/b" >expect &&
> +	test_expect_code 1 \
> +		git -C repo/sub \
> +		diff --literally ../../non/git/a ../../non/git/b >actual &&
> +	head -n 1 <actual >actual.head &&
> +	test_cmp expect actual.head
> +'
> +
> +test_expect_success SYMLINKS 'diff --no-index --literally with symlinks' '
> +	test_write_lines a b c >f1 &&
> +	test_write_lines a d c >f2 &&
> +	ln -s f1 s1 &&
> +	ln -s f2 s2 &&
> +	cat >expect <<-\EOF &&
> +	diff --git a/s1 b/s2
> +	--- a/s1
> +	+++ b/s2
> +	@@ -1,3 +1,3 @@
> +	 a
> +	-b
> +	+d
> +	 c
> +	EOF
> +	test_expect_code 1 git diff --no-index --literally s1 s2 >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_done
>
brian m. carlson Dec. 21, 2018, 11:20 p.m. UTC | #11
On Fri, Dec 21, 2018 at 12:52:04PM +0100, Johannes Schindelin wrote:
> ... and in Git for Windows' Bash, it would result in something like:
> 
> 	$ git -P diff --no-index <(printf "a\nb\n") <(printf "a\nc\n")
> 	error: Could not access '/proc/24012/fd/63'
> 
> ... because the Bash is "MSYS2-aware" and knows about `/proc/` while
> `git.exe` is a pure Win32 executable that has no idea what Bash is talking
> about.
> 
> Sadly, your patch does not change the situation one bit (because it does
> not change the fact that the MSYS2 Bash passes a path to `git.exe` that is
> not a valid Windows path, and neither could it, but that's not the problem
> of your patch).

That tells me that I need to exclude Windows (excepting Cygwin) if I add
a PROCESS_SUBSTITUTION test like Ævar suggested, and that indeed is
helpful information. I'm sorry my patch won't be useful on Windows,
though.
Junio C Hamano Jan. 2, 2019, 6:56 p.m. UTC | #12
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> +test_expect_success SYMLINKS 'diff --no-index --literally with symlinks' '
> +	test_write_lines a b c >f1 &&
> +	test_write_lines a d c >f2 &&
> +	ln -s f1 s1 &&
> +	ln -s f2 s2 &&
> +	cat >expect <<-\EOF &&
> +	diff --git a/s1 b/s2
> +	--- a/s1
> +	+++ b/s2
> +	@@ -1,3 +1,3 @@
> +	 a
> +	-b
> +	+d
> +	 c
> +	EOF
> +	test_expect_code 1 git diff --no-index --literally s1 s2 >actual &&
> +	test_cmp expect actual
> +'

This is good as a goal, but the implementation seems to be overly
eager to dereference any symlink or non-regular file found in any
level of recursion.  The use case presented as the justification in
the proposed log message, and the explanation in the documentation,
suggests that only the paths given from the command line are treated
this way.

It may make sense to do this as two orthgonal flags.  Dereference
symlinks and named pipes given from the command line is one use
case.  Dereference any symlinks encountered during recursion is
another.  And the latter might be useful even inside a repository
as an option, even though the former would never make sense unless
running in --no-index mode.

Thanks.
brian m. carlson Jan. 4, 2019, 2:08 a.m. UTC | #13
On Wed, Jan 02, 2019 at 10:56:45AM -0800, Junio C Hamano wrote:
> This is good as a goal, but the implementation seems to be overly
> eager to dereference any symlink or non-regular file found in any
> level of recursion.  The use case presented as the justification in
> the proposed log message, and the explanation in the documentation,
> suggests that only the paths given from the command line are treated
> this way.
> 
> It may make sense to do this as two orthgonal flags.  Dereference
> symlinks and named pipes given from the command line is one use
> case.  Dereference any symlinks encountered during recursion is
> another.  And the latter might be useful even inside a repository
> as an option, even though the former would never make sense unless
> running in --no-index mode.

Yeah, I definitely think I'll be doing a reroll to address the recursion
issue as Peff suggested. I've just been on vacation and will be
traveling again soon, so I haven't gotten to it yet. Feel free to wait
for the reroll before picking it up.
Jonathan Nieder Jan. 4, 2019, 2:18 a.m. UTC | #14
Hi,

brian m. carlson wrote:

> In some shells, such as bash and zsh, it's possible to use a command
> substitution to provide the output of a command as a file argument to
> another process, like so:
>
>   diff -u <(printf "a\nb\n") <(printf "a\nc\n")
>
> However, this syntax does not produce useful results with git diff
> --no-index.

Thanks much for fixing this.  It's something I've run into, too.

[...]
> --- a/Documentation/git-diff.txt
> +++ b/Documentation/git-diff.txt
> @@ -111,6 +111,11 @@ include::diff-options.txt[]
>  	"Unmerged".  Can be used only when comparing the working tree
>  	with the index.
>  
> +--literally::
> +  Read the specified files literally, as `diff` would,
> +  dereferencing any symlinks and reading data from pipes.
> +  This option only works with `--no-index`.

I may be a minority in this opinion, but I had trouble understanding
what --literally would do from its name.  I suspect we can come up
with a better name.

Unfortunately, I'm terrible at coming up with names. :-P

--dereference would be a good name when it comes to symlinks, but
it's not a good name for reading what is on the other side of a pipe.
On the plus side, it matches "diff" and "cp"'s name for the "follow
symbolic links" option.

--plain captures the desire a little better --- we want a plain
read(2) from the file instead of trying to be smart and look at
whether it is e.g. a block device.  But in the context of "diff", that
would seem more like an option that affects the output.

What would you think of

 - always reading from fifos instead of describing them, since I've
   never encountered a use case where people want the latter

 - --dereference to control whether to follow symlinks

?

[...]
> --- a/diff-no-index.c
> +++ b/diff-no-index.c
> @@ -75,7 +75,25 @@ static int populate_from_stdin(struct diff_filespec *s)
>  	return 0;
>  }
>  
> -static struct diff_filespec *noindex_filespec(const char *name, int mode)
> +static int populate_literally(struct diff_filespec *s)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	size_t size = 0;
> +	int fd = xopen(s->path, O_RDONLY);
> +
> +	if (strbuf_read(&buf, fd, 0) < 0)
> +		return error_errno("error while reading from '%s'", s->path);
> +
> +	s->should_munmap = 0;
> +	s->data = strbuf_detach(&buf, &size);
> +	s->size = size;
> +	s->should_free = 1;
> +	s->read_literally = 1;

Oh!  --read-literally works perfectly for me as a name. :)

Jonathan
brian m. carlson Jan. 4, 2019, 2:57 a.m. UTC | #15
On Thu, Jan 03, 2019 at 06:18:55PM -0800, Jonathan Nieder wrote:
> I may be a minority in this opinion, but I had trouble understanding
> what --literally would do from its name.  I suspect we can come up
> with a better name.
> 
> Unfortunately, I'm terrible at coming up with names. :-P
> 
> --dereference would be a good name when it comes to symlinks, but
> it's not a good name for reading what is on the other side of a pipe.
> On the plus side, it matches "diff" and "cp"'s name for the "follow
> symbolic links" option.
> 
> --plain captures the desire a little better --- we want a plain
> read(2) from the file instead of trying to be smart and look at
> whether it is e.g. a block device.  But in the context of "diff", that
> would seem more like an option that affects the output.
> 
> What would you think of
> 
>  - always reading from fifos instead of describing them, since I've
>    never encountered a use case where people want the latter
> 
>  - --dereference to control whether to follow symlinks

This is actually surprisingly difficult. The reason I implemented this
only for no-index mode is because there are actually several places we
can stat a file in the diff code, and implementing a --dereference
option that catches all of those cases and getting the option passed
down to them is non-trivial.

Since my primary environment is on Linux, I need to implement functional
--dereference semantics before implementing the FIFO functionality to
know if it works in my use case.

I agree that would be a nicer approach overall, so let me see if I can
take another stab at it in a way that isn't excessively complex. It
would also make the name less controversial, since "--dereference" is
the obvious choice for dereferencing symlinks.

> [...]
> > --- a/diff-no-index.c
> > +++ b/diff-no-index.c
> > @@ -75,7 +75,25 @@ static int populate_from_stdin(struct diff_filespec *s)
> >  	return 0;
> >  }
> >  
> > -static struct diff_filespec *noindex_filespec(const char *name, int mode)
> > +static int populate_literally(struct diff_filespec *s)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +	size_t size = 0;
> > +	int fd = xopen(s->path, O_RDONLY);
> > +
> > +	if (strbuf_read(&buf, fd, 0) < 0)
> > +		return error_errno("error while reading from '%s'", s->path);
> > +
> > +	s->should_munmap = 0;
> > +	s->data = strbuf_detach(&buf, &size);
> > +	s->size = size;
> > +	s->should_free = 1;
> > +	s->read_literally = 1;
> 
> Oh!  --read-literally works perfectly for me as a name. :)

If I can't manage to split this out into two pieces, I'll pick that as
the name for the reroll.
Junio C Hamano Jan. 4, 2019, 7:26 p.m. UTC | #16
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>>  - --dereference to control whether to follow symlinks
>
> This is actually surprisingly difficult. The reason I implemented this
> only for no-index mode is because there are actually several places we
> can stat a file in the diff code, and implementing a --dereference
> option that catches all of those cases and getting the option passed
> down to them is non-trivial.

Another thing to worry about is symlinks that point outside the
working tree.  When a tracked content "dir/link" is a symlink to
"/etc/motd", it probably makes sense to open("/etc/motd") and read()
it on the working tree side of the diff, and probably even on the
index side of the diff, but what about obtaining contents for
"dir/link" in a year-old commit under --deference mode?  I am not
sure if it makes sense to read from the filesystem in such a case.

I personally am perfectly fine if this "do not compare readlink(2),
but read contents literally" is limited to the --no-index mode.

Thanks.
brian m. carlson Jan. 5, 2019, 5:39 p.m. UTC | #17
On Fri, Jan 04, 2019 at 11:26:56AM -0800, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> >>  - --dereference to control whether to follow symlinks
> >
> > This is actually surprisingly difficult. The reason I implemented this
> > only for no-index mode is because there are actually several places we
> > can stat a file in the diff code, and implementing a --dereference
> > option that catches all of those cases and getting the option passed
> > down to them is non-trivial.
> 
> Another thing to worry about is symlinks that point outside the
> working tree.  When a tracked content "dir/link" is a symlink to
> "/etc/motd", it probably makes sense to open("/etc/motd") and read()
> it on the working tree side of the diff, and probably even on the
> index side of the diff, but what about obtaining contents for
> "dir/link" in a year-old commit under --deference mode?  I am not
> sure if it makes sense to read from the filesystem in such a case.
> 
> I personally am perfectly fine if this "do not compare readlink(2),
> but read contents literally" is limited to the --no-index mode.

That's a good point. I think I'll stick with the current design, then,
since that seems like the least surprising way forward. It also means
that we don't read outside of the working tree unless --no-index is in
use, which may be beneficial for security purposes.

Thanks for a helpful perspective.
diff mbox series

Patch

diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt
index 030f162f30..4c4695c88d 100644
--- a/Documentation/git-diff.txt
+++ b/Documentation/git-diff.txt
@@ -111,6 +111,11 @@  include::diff-options.txt[]
 	"Unmerged".  Can be used only when comparing the working tree
 	with the index.
 
+--literally::
+  Read the specified files literally, as `diff` would,
+  dereferencing any symlinks and reading data from pipes.
+  This option only works with `--no-index`.
+
 <path>...::
 	The <paths> parameters, when given, are used to limit
 	the diff to the named paths (you can give directory
diff --git a/diff-no-index.c b/diff-no-index.c
index 9414e922d1..2707206aee 100644
--- a/diff-no-index.c
+++ b/diff-no-index.c
@@ -75,7 +75,25 @@  static int populate_from_stdin(struct diff_filespec *s)
 	return 0;
 }
 
-static struct diff_filespec *noindex_filespec(const char *name, int mode)
+static int populate_literally(struct diff_filespec *s)
+{
+	struct strbuf buf = STRBUF_INIT;
+	size_t size = 0;
+	int fd = xopen(s->path, O_RDONLY);
+
+	if (strbuf_read(&buf, fd, 0) < 0)
+		return error_errno("error while reading from '%s'", s->path);
+
+	s->should_munmap = 0;
+	s->data = strbuf_detach(&buf, &size);
+	s->size = size;
+	s->should_free = 1;
+	s->read_literally = 1;
+	return 0;
+}
+
+static struct diff_filespec *noindex_filespec(const char *name, int mode,
+					      struct diff_options *o)
 {
 	struct diff_filespec *s;
 
@@ -85,6 +103,8 @@  static struct diff_filespec *noindex_filespec(const char *name, int mode)
 	fill_filespec(s, &null_oid, 0, mode);
 	if (name == file_from_standard_input)
 		populate_from_stdin(s);
+	else if (o->flags.read_literally)
+		populate_literally(s);
 	return s;
 }
 
@@ -101,14 +121,14 @@  static int queue_diff(struct diff_options *o,
 
 		if (S_ISDIR(mode1)) {
 			/* 2 is file that is created */
-			d1 = noindex_filespec(NULL, 0);
-			d2 = noindex_filespec(name2, mode2);
+			d1 = noindex_filespec(NULL, 0, o);
+			d2 = noindex_filespec(name2, mode2, o);
 			name2 = NULL;
 			mode2 = 0;
 		} else {
 			/* 1 is file that is deleted */
-			d1 = noindex_filespec(name1, mode1);
-			d2 = noindex_filespec(NULL, 0);
+			d1 = noindex_filespec(name1, mode1, o);
+			d2 = noindex_filespec(NULL, 0, o);
 			name1 = NULL;
 			mode1 = 0;
 		}
@@ -189,8 +209,8 @@  static int queue_diff(struct diff_options *o,
 			SWAP(name1, name2);
 		}
 
-		d1 = noindex_filespec(name1, mode1);
-		d2 = noindex_filespec(name2, mode2);
+		d1 = noindex_filespec(name1, mode1, o);
+		d2 = noindex_filespec(name2, mode2, o);
 		diff_queue(&diff_queued_diff, d1, d2);
 		return 0;
 	}
diff --git a/diff.c b/diff.c
index dc9965e836..740d0087b9 100644
--- a/diff.c
+++ b/diff.c
@@ -4282,18 +4282,18 @@  static void run_diff_cmd(const char *pgm,
 		fprintf(o->file, "* Unmerged path %s\n", name);
 }
 
-static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *istate)
+static void diff_fill_oid_info(struct diff_filespec *one, struct diff_options *o)
 {
 	if (DIFF_FILE_VALID(one)) {
 		if (!one->oid_valid) {
 			struct stat st;
-			if (one->is_stdin) {
+			if (one->is_stdin || one->read_literally) {
 				oidclr(&one->oid);
 				return;
 			}
 			if (lstat(one->path, &st) < 0)
 				die_errno("stat '%s'", one->path);
-			if (index_path(istate, &one->oid, one->path, &st, 0))
+			if (index_path(o->repo->index, &one->oid, one->path, &st, 0))
 				die("cannot hash %s", one->path);
 		}
 	}
@@ -4341,8 +4341,8 @@  static void run_diff(struct diff_filepair *p, struct diff_options *o)
 		return;
 	}
 
-	diff_fill_oid_info(one, o->repo->index);
-	diff_fill_oid_info(two, o->repo->index);
+	diff_fill_oid_info(one, o);
+	diff_fill_oid_info(two, o);
 
 	if (!pgm &&
 	    DIFF_FILE_VALID(one) && DIFF_FILE_VALID(two) &&
@@ -4389,8 +4389,8 @@  static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
 	if (o->prefix_length)
 		strip_prefix(o->prefix_length, &name, &other);
 
-	diff_fill_oid_info(p->one, o->repo->index);
-	diff_fill_oid_info(p->two, o->repo->index);
+	diff_fill_oid_info(p->one, o);
+	diff_fill_oid_info(p->two, o);
 
 	builtin_diffstat(name, other, p->one, p->two,
 			 diffstat, o, p);
@@ -4414,8 +4414,8 @@  static void run_checkdiff(struct diff_filepair *p, struct diff_options *o)
 	if (o->prefix_length)
 		strip_prefix(o->prefix_length, &name, &other);
 
-	diff_fill_oid_info(p->one, o->repo->index);
-	diff_fill_oid_info(p->two, o->repo->index);
+	diff_fill_oid_info(p->one, o);
+	diff_fill_oid_info(p->two, o);
 
 	builtin_checkdiff(name, other, attr_path, p->one, p->two, o);
 }
@@ -5159,6 +5159,8 @@  int diff_opt_parse(struct diff_options *options,
 		options->flags.funccontext = 1;
 	else if (!strcmp(arg, "--no-function-context"))
 		options->flags.funccontext = 0;
+	else if (!strcmp(arg, "--literally"))
+		options->flags.read_literally = 1;
 	else if ((argcount = parse_long_opt("output", av, &optarg))) {
 		char *path = prefix_filename(prefix, optarg);
 		options->file = xfopen(path, "w");
@@ -5720,8 +5722,8 @@  static int diff_get_patch_id(struct diff_options *options, struct object_id *oid
 		if (DIFF_PAIR_UNMERGED(p))
 			continue;
 
-		diff_fill_oid_info(p->one, options->repo->index);
-		diff_fill_oid_info(p->two, options->repo->index);
+		diff_fill_oid_info(p->one, options);
+		diff_fill_oid_info(p->two, options);
 
 		len1 = remove_space(p->one->path, strlen(p->one->path));
 		len2 = remove_space(p->two->path, strlen(p->two->path));
diff --git a/diff.h b/diff.h
index ce5e8a8183..7dedd3bcd1 100644
--- a/diff.h
+++ b/diff.h
@@ -97,6 +97,7 @@  struct diff_flags {
 	unsigned stat_with_summary:1;
 	unsigned suppress_diff_headers:1;
 	unsigned dual_color_diffed_diffs:1;
+	unsigned read_literally:1;
 };
 
 static inline void diff_flags_or(struct diff_flags *a,
diff --git a/diffcore.h b/diffcore.h
index b651061c0e..363869447a 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -48,6 +48,7 @@  struct diff_filespec {
 #define DIRTY_SUBMODULE_UNTRACKED 1
 #define DIRTY_SUBMODULE_MODIFIED  2
 	unsigned is_stdin : 1;
+	unsigned read_literally : 1;
 	unsigned has_more_entries : 1; /* only appear in combined diff */
 	/* data should be considered "binary"; -1 means "don't know yet" */
 	signed int is_binary : 2;
diff --git a/t/t4053-diff-no-index.sh b/t/t4053-diff-no-index.sh
index 6e0dd6f9e5..53e6bcdc19 100755
--- a/t/t4053-diff-no-index.sh
+++ b/t/t4053-diff-no-index.sh
@@ -137,4 +137,32 @@  test_expect_success 'diff --no-index from repo subdir with absolute paths' '
 	test_cmp expect actual
 '
 
+test_expect_success 'diff --no-index --literally' '
+	echo "diff --git a/../../non/git/a b/../../non/git/b" >expect &&
+	test_expect_code 1 \
+		git -C repo/sub \
+		diff --literally ../../non/git/a ../../non/git/b >actual &&
+	head -n 1 <actual >actual.head &&
+	test_cmp expect actual.head
+'
+
+test_expect_success SYMLINKS 'diff --no-index --literally with symlinks' '
+	test_write_lines a b c >f1 &&
+	test_write_lines a d c >f2 &&
+	ln -s f1 s1 &&
+	ln -s f2 s2 &&
+	cat >expect <<-\EOF &&
+	diff --git a/s1 b/s2
+	--- a/s1
+	+++ b/s2
+	@@ -1,3 +1,3 @@
+	 a
+	-b
+	+d
+	 c
+	EOF
+	test_expect_code 1 git diff --no-index --literally s1 s2 >actual &&
+	test_cmp expect actual
+'
+
 test_done