diff mbox series

[3/3] diff: detect pathspec magic not supported by --follow

Message ID 20230601174351.GC4165297@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit 8260bc59023136edeaed1f1006a03f44cc849883
Headers show
Series handling pathspec magic with --follow | expand

Commit Message

Jeff King June 1, 2023, 5:43 p.m. UTC
The --follow code doesn't handle most forms of pathspec magic. We check
that no unexpected ones have made it to try_to_follow_renames() with a
runtime GUARD_PATHSPEC() check, which gives behavior like this:

  $ git log --follow ':(icase)makefile' >/dev/null
  BUG: tree-diff.c:596: unsupported magic 10
  Aborted

The same is true of ":(glob)", ":(attr)", and so on. It's good that we
notice the problem rather than continuing and producing a wrong answer.
But there are two non-ideal things:

  1. The idea of GUARD_PATHSPEC() is to catch programming errors where
     low-level code gets unexpected pathspecs. We'd usually try to catch
     unsupported pathspecs by passing a magic_mask to parse_pathspec(),
     which would give the user a much better message like:

       pathspec magic not supported by this command: 'icase'

     That doesn't happen here because git-log usually _does_ support
     all types of pathspec magic, and so it passes "0" for the mask
     (this call actually happens in setup_revisions()). It needs to
     distinguish the normal case from the "--follow" one but currently
     doesn't.

  2. In addition to --follow, we have the log.follow config option. When
     that is set, we try to turn on --follow mode only when there is a
     single pathspec (since --follow doesn't handle anything else). But
     really, that ought to be expanded to "use --follow when the
     pathspec supports it". Otherwise, we'd complain any time you use an
     exotic pathspec:

       $ git config log.follow true
       $ git log ':(icase)makefile' >/dev/null
       BUG: tree-diff.c:596: unsupported magic 10
       Aborted

     We should instead just avoid enabling follow mode if it's not
     supported by this particular invocation.

This patch expands our diff_check_follow_pathspec() function to cover
pathspec magic, solving both problems.

A few final notes:

  - we could also solve (1) by passing the appropriate mask to
    parse_pathspec(). But that's not great for two reasons. One is that
    the error message is less precise. It says "magic not supported by
    this command", but really it is not the command, but rather the
    --follow option which is the problem. The second is that it always
    calls die(). But for our log.follow code, we want to speculatively
    ask "is this pathspec OK?" and just get a boolean result.

  - This is obviously the right thing to do for ':(icase)' and most
    other magic options. But ':(glob)' is a bit odd here. The --follow
    code doesn't support wildcards, but we allow them anyway. From
    try_to_follow_renames():

	#if 0
	        /*
	         * We should reject wildcards as well. Unfortunately we
	         * haven't got a reliable way to detect that 'foo\*bar' in
	         * fact has no wildcards. nowildcard_len is merely a hint for
	         * optimization. Let it slip for now until wildmatch is taught
	         * about dry-run mode and returns wildcard info.
	         */
	        if (opt->pathspec.has_wildcard)
	                BUG("wildcards are not supported");
	#endif

    So something like "git log --follow 'Make*'" is already doing the
    wrong thing, since ":(glob)" behavior is already the default (it is
    used only to countermand an earlier --noglob-pathspecs).

    So we _could_ loosen the guard to allow :(glob), since it just
    behaves the same as pathspecs do by default. But it seems like a
    backwards step to do so. It already doesn't work (it hits the BUG()
    case currently), and given that the user took an explicit step to
    say "this pathspec should glob", it is reasonable for us to say "no,
    --follow does not support globbing" (or in the case of log.follow,
    avoid turning on follow mode). Which is what happens after this
    patch.

  - The set of allowed pathspec magic is obviously the same as in
    GUARD_PATHSPEC(). We could perhaps factor these out to avoid
    repetition. The point of having separate masks and GUARD calls is
    that we don't necessarily know which parsed pathspecs will be used
    where. But in this case, the two are heavily correlated. Still,
    there may be some value in keeping them separate; it would make
    anyone think twice about adding new magic to the list in
    diff_check_follow_pathspec(). They'd need to touch
    try_to_follow_renames() as well, which is the code that would
    actually need to be updated to handle more exotic pathspecs.

  - The documentation for log.follow says that it enables --follow
    "...when a single <path> is given". We could possibly expand that to
    say "with no unsupported pathspec magic", but that raises the
    question of documenting which magic is supported. I think the
    existing wording of "single <path>" sufficiently encompasses the
    idea (the forbidden magic is stuff that might match multiple
    entries), and the spirit remains the same.

Reported-by: Jim Pryor <dubiousjim@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 diff.c         | 15 +++++++++++++++
 t/t4202-log.sh | 15 +++++++++++++++
 2 files changed, 30 insertions(+)

Comments

Junio C Hamano June 2, 2023, 7:27 a.m. UTC | #1
Jeff King <peff@peff.net> writes:

> The --follow code doesn't handle most forms of pathspec magic. We check
> that no unexpected ones have made it to try_to_follow_renames() with a
> runtime GUARD_PATHSPEC() check, which gives behavior like this:
>
>   $ git log --follow ':(icase)makefile' >/dev/null
>   BUG: tree-diff.c:596: unsupported magic 10
>   Aborted

Stepping back a bit, isn't the real reason why follow mode is
incompatible with the pathspec magic because it does not want to
deal with anything but a single literal pathname?

In other words, in "git log --follow ':(icase)makefile'", if the
given pathspec matches a single path in the starting commit, we
should be able to start from the commit inspecting that uniquely
found path, compare the commit with its parent(s) at the path, and
when the parent does not have the path, detect rename and continue
following that the name before the rename as the single path
starting from that parent.  Perhaps the commit we start at, HEAD,
has "Makefile" and not "makefile" (if we have two, that is an
error), so we start traversal, following the pathname "Makefile".
At some point, we may discover that the parent did not have
"Makefile", but has "GNUMakefile" with a similar contents, at which
point we swap the path we follow from "Makefile" to "GNUMakefile".

At each step, tree-diff does not need to and does not want to do
anything fancy with pathspec---the operation is always "this path in
the parent, does it exist in the parent? If not, did it come from a
different path?", and use of "literal pathspec" mode makes sense
there.

So why does try_to_follow_renames() even *need* to say anything
other than "We are in follow mode; what we have is a single path;
use it as a literal pathspec that matches the path literally without
any funky magic"?  GUARD_PATHSPEC() is "please inspect the pathspec
and see what kind of special pattern matching it requests---we
cannot allow it to ask for certain things but do allow it to ask for
other things", that sounds terribly misguided here.  The string is
literal, so we do not even want to look at it to see what it
asks---it is taken as nothing but a literal string and we do not let
it ask for anything special.

For example, if we started at HEAD to follow Makefile and in an
ancestor commit we find that the Makefile came from a funny path
":(icase)foo", instead of allowing the pathspec match code to
interprete it as a magic pathspec that matches any case permutations
of 'f/o/o', we want to force the pathspec match code to match
nothing but the funny path "colon parens icase close parens foo" by
using literal pathspec magic.

So perhaps having GUARD_PATHSPEC() there is the mistake, no?

For the above idea to work, I think we need to resolve the pathspec
early; "log --follow" should find its starting point, apply the
given pathspec to its treeish to grab the literal pathname, and use
that single literal pathname as the literal pathspec and continue,
which I do not think it does right now.  But with it done upfront,
the pathspec limiting done during the history traversal, including
try_to_follow_renames() bit, should be able to limit itself to "is
the path literally the string given as the pathspec"?

I dunno.
Jeff King June 15, 2023, 7:26 a.m. UTC | #2
On Fri, Jun 02, 2023 at 04:27:05PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The --follow code doesn't handle most forms of pathspec magic. We check
> > that no unexpected ones have made it to try_to_follow_renames() with a
> > runtime GUARD_PATHSPEC() check, which gives behavior like this:
> >
> >   $ git log --follow ':(icase)makefile' >/dev/null
> >   BUG: tree-diff.c:596: unsupported magic 10
> >   Aborted
> 
> Stepping back a bit, isn't the real reason why follow mode is
> incompatible with the pathspec magic because it does not want to
> deal with anything but a single literal pathname?

Sorry, I'm a little behind on list reading; I know this topic has
advanced almost to 'master' in the meantime, but I had a few thoughts
worth getting out.

Basically, yes, I agree with everything in your email. My series is more
or less papering over deficiencies in the --follow code, and that code
would be much better off thinking of single literal paths that it is
following.

I was trying with my series not to go down the rabbit hole of --follow
deficiencies. So in that sense, I think it is still a good idea for the
short-term. And though we might perhaps revert some of it if we actually
improve how --follow works, I don't think it would be too hard to do so
later. But of course if we are going to improve --follow _now_, then it
could replace my series.

And what you outlined here:

> For the above idea to work, I think we need to resolve the pathspec
> early; "log --follow" should find its starting point, apply the
> given pathspec to its treeish to grab the literal pathname, and use
> that single literal pathname as the literal pathspec and continue,
> which I do not think it does right now.  But with it done upfront,
> the pathspec limiting done during the history traversal, including
> try_to_follow_renames() bit, should be able to limit itself to "is
> the path literally the string given as the pathspec"?

seems mostly sensible to me, except that the notion of "starting point"
is ambiguous. If I do:

  git log --follow branch1 branch2 -- foo*

then we have two treeishes. Do we look at one? Both? What happens if the
pathspec resolves differently?

Off the top of my head, I'd say that we should look at all starting
tips, resolve the pathspec in all of them, and complain if it doesn't
resolve to the same single path in each one. Because it otherwise would
produce garbled results.

To be fair, this same garbled case already happens when traversing a
merge sends us down two paths of history, as a followed rename might be
present on one but not the other, and we keep only a single path. But as
you know, that is a long-time deficiency of --follow. :) At least
detecting it from the starting tips is easy and something the user can
deal with by modifying the command invocation.

So I dunno. That doesn't sound _too_ hard to do, though it probably also
needs a lot of the same infrastructure I introduced in my series. Namely
that log.follow needs to ask "hey, is this pathspec usable for
following?" so that it can quietly turn the feature off, rather than
triggering an error.

So I'd suggest to let my series continue graduating, and then if anybody
wants to work on more sensible pathspec handling, to do so on top.

-Peff
Junio C Hamano June 15, 2023, 7 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> ... except that the notion of "starting point"
> is ambiguous. If I do:
>
>   git log --follow branch1 branch2 -- foo*
>
> then we have two treeishes. Do we look at one? Both? What happens if the
> pathspec resolves differently?

If we consider a middle-ground where we do not keep a separate
pathname that each history traversal is following but only keep a
single pathname, then the answer is one of "ignore others and pick
one" (which is essentially what needs to happen in the current
implementation when a path gets renamed between a child commit and
its parent while the other parents do not have renames), "ignore
others and pick one, but give a warning that the result may be
incomplete", or "barf".  The latter two would need us to look at all
treeishes.

A false-positive-ridden "solution" that would slightly be better
than that would be to still keep a global, single, pathspec that has
more than one elements (all of which are literal pathspec elements).
Every time we discover a rename, we add the newly discovered old
name as an additional element, instead of replacing the single
element.  If we were to go that route, then starting from multiple
points, each following a pathname, would be a simple matter of
looking at all trees and finding which pathname each history
traversal wants to follow and creating a pathspec with all these
(literal) pathnames.

But once we fix the underlying problem, and start tracking which
path each history traversal trajectory is currently following, then
it is not a problem for each starting point to have a pathname that
is being followed.  Each commit would _know_ which pathname it is
interested in, and when a traversal from a commit to its parent sees
a rename, the pathname being followed for the parent commit will be
different from the child.

When this happens, the parent may have a pathname it is already
interested in, set by traversing from another child (i.e. one side
of a fork renamed the path one way, while the other side of a fork
kept the same name or renamed it to another way, and we traversed
both forks down and are now at the fork point).  It most likely
should be the same name we discovered from the other child, but in
weird cases it may be different (e.g. a child thinks its path A came
from parent's path X while another child thinks its path A came from
parent's path Y), in which case we may have to follow two pathnames
in the parent commit, as if the pathname we have been following in
the child had combined contents from multiple paths.

> So I'd suggest to let my series continue graduating, and then if anybody
> wants to work on more sensible pathspec handling, to do so on top.

I think we are in agreement on this---otherwise I wouldn't have
merged the patch as-is down to 'next' ;-)

As I said multiple times in the past, the current "log --follow" is
more of a checkbox item than a feature, and in that context, I think
your "fix", which may be papering over just one obvious breakage, is
what the codepath should be happy with, before it gets reworked more
seriously.

Thanks.
diff mbox series

Patch

diff --git a/diff.c b/diff.c
index 9f548f3471..8ca16fefe7 100644
--- a/diff.c
+++ b/diff.c
@@ -4753,11 +4753,26 @@  unsigned diff_filter_bit(char status)
 
 int diff_check_follow_pathspec(struct pathspec *ps, int die_on_error)
 {
+	unsigned forbidden_magic;
+
 	if (ps->nr != 1) {
 		if (die_on_error)
 			die(_("--follow requires exactly one pathspec"));
 		return 0;
 	}
+
+	forbidden_magic = ps->items[0].magic & ~(PATHSPEC_FROMTOP |
+						 PATHSPEC_LITERAL);
+	if (forbidden_magic) {
+		if (die_on_error) {
+			struct strbuf sb = STRBUF_INIT;
+			pathspec_magic_names(forbidden_magic, &sb);
+			die(_("pathspec magic not supported by --follow: %s"),
+			    sb.buf);
+		}
+		return 0;
+	}
+
 	return 1;
 }
 
diff --git a/t/t4202-log.sh b/t/t4202-log.sh
index ae73aef922..5b54d7928e 100755
--- a/t/t4202-log.sh
+++ b/t/t4202-log.sh
@@ -187,6 +187,21 @@  test_expect_success 'git config log.follow does not die with no paths' '
 	git log --
 '
 
+test_expect_success 'git log --follow rejects unsupported pathspec magic' '
+	test_must_fail git log --follow ":(top,glob,icase)ichi" 2>stderr &&
+	# check full error message; we want to be sure we mention both
+	# of the rejected types (glob,icase), but not the allowed one (top)
+	echo "fatal: pathspec magic not supported by --follow: ${SQ}glob${SQ}, ${SQ}icase${SQ}" >expect &&
+	test_cmp expect stderr
+'
+
+test_expect_success 'log.follow disabled with unsupported pathspec magic' '
+	test_config log.follow true &&
+	git log --format=%s ":(glob,icase)ichi" >actual &&
+	echo third >expect &&
+	test_cmp expect actual
+'
+
 test_expect_success 'git config log.follow is overridden by --no-follow' '
 	test_config log.follow true &&
 	git log --no-follow --pretty="format:%s" ichi >actual &&