diff mbox series

[3/3] shorten_unambiguous_ref(): avoid sscanf()

Message ID Y+vV8Ifkj1QV7KF0@coredump.intra.peff.net (mailing list archive)
State Superseded
Headers show
Series get rid of sscanf() when shortening refs | expand

Commit Message

Jeff King Feb. 14, 2023, 6:41 p.m. UTC
To shorten a fully qualified ref (e.g., taking "refs/heads/foo" to just
"foo"), we munge the usual lookup rules ("refs/heads/%.*s", etc) to drop
the ".*" modifier (so "refs/heads/%s"), and then use sscanf() to match
that against the refname, pulling the "%s" content into a separate
buffer.

This has two downsides:

  - sscanf("%s") reportedly misbehaves on macOS with some input and
    locale combinations, returning a partial or garbled string. See
    this thread:

      https://lore.kernel.org/git/CAGF3oAcCi+fG12j-1U0hcrWwkF5K_9WhOi6ZPHBzUUzfkrZDxA@mail.gmail.com/

  - scanf in general is an error-prone interface. For example, scanning
    for "%s" will copy bytes into a destination string, which must have
    been correctly sized ahead of time to avoid a buffer overflow. In
    this case, the code is OK (the buffer is pessimistically sized to
    match the original string, which should give us a maximum). But in
    general, we do not want to encourage people to use scanf at all.

So instead, let's note that our lookup rules are not arbitrary format
strings, but all contain exactly one "%.*s" placeholder. We already rely
on this, both for lookup (we feed the lookup format along with exactly
one int/ptr combo to snprintf, etc) and for shortening (we munge "%.*s"
to "%s", and then insist that sscanf() finds exactly one result).

We can parse this manually by just matching the bytes that occur before
and after the "%.*s" placeholder. While we have a few extra lines of
parsing code, the result is arguably simpler, as can skip the
preprocessing step and its tricky memory management entirely.

The in-code comments should explain the parsing strategy, but there's
one subtle change here. The original code allocated a single buffer, and
then overwrote it in each loop iteration, since that's the only option
sscanf() gives us. But our parser can actually return a ptr/len combo
for the matched string, which is all we need (since we just feed it back
to the lookup rules with "%.*s"), and then copy it only when returning
to the caller.

Reported-by: 孟子易 <mengziyi540841@gmail.com>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Jeff King <peff@peff.net>
---
BTW, this diff is generated with --patience, which generates a _much_
nicer output in this case. Not important to this series, but since there
was discussion of switching the default in a nearby thread, it seemed
like an interesting example.

 refs.c | 77 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 42 insertions(+), 35 deletions(-)

Comments

Junio C Hamano Feb. 14, 2023, 9:48 p.m. UTC | #1
Jeff King <peff@peff.net> writes:

> +/*
> + * Check that the string refname matches a rule of the form
> + * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
> + * "foo/%.*s/baz", and return the string "bar".
> + */
> +static const char *match_parse_rule(const char *refname, const char *rule,
> +				    size_t *len)
> +{
> +	/*
> +	 * Check that rule matches refname up to the first percent
> +	 * in the rule. This is basically skip_prefix(), but
> +	 * ending at percent in the prefix, rather than end-of-string.
> +	 */
> +	do {
> +		if (!*rule)
> +			BUG("rev-parse rule did not have percent");
> +		if (*rule == '%')
> +			break;
> +	} while (*refname++ == *rule++);

So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then
we'll scan refname and rule to skip over their "refs/" prefix, and
the next iteration, where post-increment moved the pointers to point
at 'h' (at the beginning of "heads/frotz") on the refname side and
'%' on the rule side, we iterate once more, notice *rule is '%', and
break out of the loop.  We have refname="heads/frotz" and rule="%.*s"

If we have refname="refsXheads/frotz" and rule="refs/%.*s", after
skipping over "refs", refname points at 'X' while rule points at '/'
and the loop needs to break.  Both pointers are post-incremented,
and now we have refname="heads/frotz" and rule="%.*s".

Am I reading the loop correctly?  I wanted the bogus refname not to
match the rule, but without peeking back refname[-1], I cannot tell
the two cases apart at this point.

> +	/*
> +	 * Check that we matched all the way to the "%" placeholder,
> +	 * and skip past it within the rule string. If so, "refname" at
> +	 * this point is the beginning of a potential match.
> +	 */
> +	if (!skip_prefix(rule, "%.*s", &rule))
> +		return NULL;

And we now have rule pointing at "" (i.e. "refs/%.*s" has been fully
consumed).  refname points at "heads/frotz".

> +	/*
> +	 * And now check that our suffix (if any) matches.
> +	 */
> +	if (!strip_suffix(refname, rule, len))
> +		return NULL;
> +
> +	return refname; /* len set by strip_suffix() */
> +}

And the suffix "" is stripped and we yield "heads/frotz".
Junio C Hamano Feb. 14, 2023, 10:25 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Am I reading the loop correctly?  I wanted the bogus refname not to
> match the rule, but without peeking back refname[-1], I cannot tell
> the two cases apart at this point.


Heh.

    $ git symbolic-ref BOGO refsXheads/naster
    $ git symbolic-ref --short BOGO
    heads/naster
Jeff King Feb. 14, 2023, 10:30 p.m. UTC | #3
On Tue, Feb 14, 2023 at 01:48:57PM -0800, Junio C Hamano wrote:

> > +	do {
> > +		if (!*rule)
> > +			BUG("rev-parse rule did not have percent");
> > +		if (*rule == '%')
> > +			break;
> > +	} while (*refname++ == *rule++);
> 
> So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then
> we'll scan refname and rule to skip over their "refs/" prefix, and
> the next iteration, where post-increment moved the pointers to point
> at 'h' (at the beginning of "heads/frotz") on the refname side and
> '%' on the rule side, we iterate once more, notice *rule is '%', and
> break out of the loop.  We have refname="heads/frotz" and rule="%.*s"
> 
> If we have refname="refsXheads/frotz" and rule="refs/%.*s", after
> skipping over "refs", refname points at 'X' while rule points at '/'
> and the loop needs to break.  Both pointers are post-incremented,
> and now we have refname="heads/frotz" and rule="%.*s".

Thanks for being careful. I had originally detected a match by setting a
flag in the loop when we see the "%", but then thought it wasn't needed.
And it's not for the matching case, but it is for the non-match.

This would fix it:

diff --git a/refs.c b/refs.c
index d8ce7e9ee1..2c26cf02d3 100644
--- a/refs.c
+++ b/refs.c
@@ -1318,6 +1318,8 @@ int update_ref(const char *msg, const char *refname,
 static const char *match_parse_rule(const char *refname, const char *rule,
 				    size_t *len)
 {
+	int matched = 0;
+
 	/*
 	 * Check that rule matches refname up to the first percent
 	 * in the rule. This is basically skip_prefix(), but
@@ -1326,10 +1328,15 @@ static const char *match_parse_rule(const char *refname, const char *rule,
 	do {
 		if (!*rule)
 			BUG("rev-parse rule did not have percent");
-		if (*rule == '%')
+		if (*rule == '%') {
+			matched = 1;
 			break;
+		}
 	} while (*refname++ == *rule++);
 
+	if (!matched)
+		return 0;
+
 	/*
 	 * Check that we matched all the way to the "%" placeholder,
 	 * and skip past it within the rule string. If so, "refname" at

but I have a feeling that it gets more readable if we flip the break
conditional and the loop condition.

I had also imagined this as a skip_prefix_to_percent() helper, which
makes the logic nicer, but we actually need to advance in both the
refname and the prefix, which makes for a weird interface.

-Peff
Junio C Hamano Feb. 14, 2023, 10:34 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> but I have a feeling that it gets more readable if we flip the break
> conditional and the loop condition.

Yeah, the somewhoat unusual loop structure was what motivated me to
look at its corner case.  Flipping the logic around may make it more
straight forward.
Jeff King Feb. 14, 2023, 10:40 p.m. UTC | #5
On Tue, Feb 14, 2023 at 02:34:01PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > but I have a feeling that it gets more readable if we flip the break
> > conditional and the loop condition.
> 
> Yeah, the somewhoat unusual loop structure was what motivated me to
> look at its corner case.  Flipping the logic around may make it more
> straight forward.

It does indeed. I pulled the logic from skip_prefix(), thinking that by
relying on it I would avoid making a stupid mistake. Oh well. :)

Doing it like this is much more readable:

diff --git a/refs.c b/refs.c
index d8ce7e9ee1..725adafcd8 100644
--- a/refs.c
+++ b/refs.c
@@ -1323,12 +1323,12 @@ static const char *match_parse_rule(const char *refname, const char *rule,
 	 * in the rule. This is basically skip_prefix(), but
 	 * ending at percent in the prefix, rather than end-of-string.
 	 */
-	do {
+	while (*rule != '%') {
 		if (!*rule)
 			BUG("rev-parse rule did not have percent");
-		if (*rule == '%')
-			break;
-	} while (*refname++ == *rule++);
+		if (*refname++ != *rule++)
+			return 0;
+	}
 
 	/*
 	 * Check that we matched all the way to the "%" placeholder,

I'll hold on to that (plus an adjustment to the comment below to match,
and perhaps a test for this negative-match case) for a day or so to give
anybody else a chance to comment, and then send out a v2 tomorrow.

-Peff
Eric Sunshine Feb. 14, 2023, 11:20 p.m. UTC | #6
On Tue, Feb 14, 2023 at 4:48 PM Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
> > +     do {
> > +             if (!*rule)
> > +                     BUG("rev-parse rule did not have percent");
> > +             if (*rule == '%')
> > +                     break;
> > +     } while (*refname++ == *rule++);
>
> So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then
> we'll scan refname and rule to skip over their "refs/" prefix, and
> the next iteration, where post-increment moved the pointers to point
> at 'h' (at the beginning of "heads/frotz") on the refname side and
> '%' on the rule side, we iterate once more, notice *rule is '%', and
> break out of the loop.  We have refname="heads/frotz" and rule="%.*s"
>
> If we have refname="refsXheads/frotz" and rule="refs/%.*s", after
> skipping over "refs", refname points at 'X' while rule points at '/'
> and the loop needs to break.  Both pointers are post-incremented,
> and now we have refname="heads/frotz" and rule="%.*s".

Nice catch. I had sat staring at and worrying about the combined
comparison and post-increment, trying to come up with a failing edge
case but missed this one entirely.

This logic error missed by two people suggests that the patch probably
ought to be accompanied by some new tests.
Junio C Hamano Feb. 15, 2023, 5:10 a.m. UTC | #7
Jeff King <peff@peff.net> writes:

> It does indeed. I pulled the logic from skip_prefix(), thinking that by
> relying on it I would avoid making a stupid mistake. Oh well. :)
>
> Doing it like this is much more readable:
> ...
> I'll hold on to that (plus an adjustment to the comment below to match,
> and perhaps a test for this negative-match case) for a day or so to give
> anybody else a chance to comment, and then send out a v2 tomorrow.

Thanks, and surely that is very readable.

Alternatively, I think you can just compare refname and rule until
they diverge, without doing any special casing for per-cent on the
rule side inside the loop.

If you do not find any difference, or the byte that differ is not
the per-cent at the beginning of "%.*s" on the rule side, they they
do not match.
Jeff King Feb. 15, 2023, 2:30 p.m. UTC | #8
On Tue, Feb 14, 2023 at 09:10:04PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > It does indeed. I pulled the logic from skip_prefix(), thinking that by
> > relying on it I would avoid making a stupid mistake. Oh well. :)
> >
> > Doing it like this is much more readable:
> > ...
> > I'll hold on to that (plus an adjustment to the comment below to match,
> > and perhaps a test for this negative-match case) for a day or so to give
> > anybody else a chance to comment, and then send out a v2 tomorrow.
> 
> Thanks, and surely that is very readable.
> 
> Alternatively, I think you can just compare refname and rule until
> they diverge, without doing any special casing for per-cent on the
> rule side inside the loop.
> 
> If you do not find any difference, or the byte that differ is not
> the per-cent at the beginning of "%.*s" on the rule side, they they
> do not match.

I had a similar thought, but I think it is fooled by "refs/heads/%foo".
The correct shortening there is "%foo".  But we'd parse the
"refs/heads/%.*s" rule up to the ".", and then complain that they do not
match.

-Peff
Junio C Hamano Feb. 15, 2023, 4:41 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

> I had a similar thought, but I think it is fooled by "refs/heads/%foo".
> The correct shortening there is "%foo".  But we'd parse the
> "refs/heads/%.*s" rule up to the ".", and then complain that they do not
> match.

Yeah, you're right.
diff mbox series

Patch

diff --git a/refs.c b/refs.c
index 84f344d8af..d8ce7e9ee1 100644
--- a/refs.c
+++ b/refs.c
@@ -1310,53 +1310,61 @@  int update_ref(const char *msg, const char *refname,
 			       old_oid, flags, onerr);
 }
 
+/*
+ * Check that the string refname matches a rule of the form
+ * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
+ * "foo/%.*s/baz", and return the string "bar".
+ */
+static const char *match_parse_rule(const char *refname, const char *rule,
+				    size_t *len)
+{
+	/*
+	 * Check that rule matches refname up to the first percent
+	 * in the rule. This is basically skip_prefix(), but
+	 * ending at percent in the prefix, rather than end-of-string.
+	 */
+	do {
+		if (!*rule)
+			BUG("rev-parse rule did not have percent");
+		if (*rule == '%')
+			break;
+	} while (*refname++ == *rule++);
+
+	/*
+	 * Check that we matched all the way to the "%" placeholder,
+	 * and skip past it within the rule string. If so, "refname" at
+	 * this point is the beginning of a potential match.
+	 */
+	if (!skip_prefix(rule, "%.*s", &rule))
+		return NULL;
+
+	/*
+	 * And now check that our suffix (if any) matches.
+	 */
+	if (!strip_suffix(refname, rule, len))
+		return NULL;
+
+	return refname; /* len set by strip_suffix() */
+}
+
 char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 				   const char *refname, int strict)
 {
 	int i;
-	static char **scanf_fmts;
-	char *short_name;
 	struct strbuf resolved_buf = STRBUF_INIT;
 
-	if (!scanf_fmts) {
-		/*
-		 * Pre-generate scanf formats from ref_rev_parse_rules[].
-		 * Generate a format suitable for scanf from a
-		 * ref_rev_parse_rules rule by interpolating "%s" at the
-		 * location of the "%.*s".
-		 */
-		size_t total_len = 0;
-		size_t offset = 0;
-
-		for (i = 0; i < NUM_REV_PARSE_RULES; i++)
-			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
-			total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1;
-
-		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), NUM_REV_PARSE_RULES), total_len));
-
-		offset = 0;
-		for (i = 0; i < NUM_REV_PARSE_RULES; i++) {
-			assert(offset < total_len);
-			scanf_fmts[i] = (char *)&scanf_fmts[NUM_REV_PARSE_RULES] + offset;
-			offset += xsnprintf(scanf_fmts[i], total_len - offset,
-					    ref_rev_parse_rules[i], 2, "%s") + 1;
-		}
-	}
-
-	/* buffer for scanf result, at most refname must fit */
-	short_name = xstrdup(refname);
-
 	/* skip first rule, it will always match */
 	for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
+		const char *short_name;
 		size_t short_name_len;
 
-		if (1 != sscanf(refname, scanf_fmts[i], short_name))
+		short_name = match_parse_rule(refname, ref_rev_parse_rules[i],
+					      &short_name_len);
+		if (!short_name)
 			continue;
 
-		short_name_len = strlen(short_name);
-
 		/*
 		 * in strict mode, all (except the matched one) rules
 		 * must fail to resolve to a valid non-ambiguous ref
@@ -1394,12 +1402,11 @@  char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		 */
 		if (j == rules_to_fail) {
 			strbuf_release(&resolved_buf);
-			return short_name;
+			return xmemdupz(short_name, short_name_len);
 		}
 	}
 
 	strbuf_release(&resolved_buf);
-	free(short_name);
 	return xstrdup(refname);
 }