Message ID | Y+vV8Ifkj1QV7KF0@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | get rid of sscanf() when shortening refs | expand |
Jeff King <peff@peff.net> writes: > +/* > + * Check that the string refname matches a rule of the form > + * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule > + * "foo/%.*s/baz", and return the string "bar". > + */ > +static const char *match_parse_rule(const char *refname, const char *rule, > + size_t *len) > +{ > + /* > + * Check that rule matches refname up to the first percent > + * in the rule. This is basically skip_prefix(), but > + * ending at percent in the prefix, rather than end-of-string. > + */ > + do { > + if (!*rule) > + BUG("rev-parse rule did not have percent"); > + if (*rule == '%') > + break; > + } while (*refname++ == *rule++); So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then we'll scan refname and rule to skip over their "refs/" prefix, and the next iteration, where post-increment moved the pointers to point at 'h' (at the beginning of "heads/frotz") on the refname side and '%' on the rule side, we iterate once more, notice *rule is '%', and break out of the loop. We have refname="heads/frotz" and rule="%.*s" If we have refname="refsXheads/frotz" and rule="refs/%.*s", after skipping over "refs", refname points at 'X' while rule points at '/' and the loop needs to break. Both pointers are post-incremented, and now we have refname="heads/frotz" and rule="%.*s". Am I reading the loop correctly? I wanted the bogus refname not to match the rule, but without peeking back refname[-1], I cannot tell the two cases apart at this point. > + /* > + * Check that we matched all the way to the "%" placeholder, > + * and skip past it within the rule string. If so, "refname" at > + * this point is the beginning of a potential match. > + */ > + if (!skip_prefix(rule, "%.*s", &rule)) > + return NULL; And we now have rule pointing at "" (i.e. "refs/%.*s" has been fully consumed). refname points at "heads/frotz". > + /* > + * And now check that our suffix (if any) matches. > + */ > + if (!strip_suffix(refname, rule, len)) > + return NULL; > + > + return refname; /* len set by strip_suffix() */ > +} And the suffix "" is stripped and we yield "heads/frotz".
Junio C Hamano <gitster@pobox.com> writes: > Am I reading the loop correctly? I wanted the bogus refname not to > match the rule, but without peeking back refname[-1], I cannot tell > the two cases apart at this point. Heh. $ git symbolic-ref BOGO refsXheads/naster $ git symbolic-ref --short BOGO heads/naster
On Tue, Feb 14, 2023 at 01:48:57PM -0800, Junio C Hamano wrote: > > + do { > > + if (!*rule) > > + BUG("rev-parse rule did not have percent"); > > + if (*rule == '%') > > + break; > > + } while (*refname++ == *rule++); > > So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then > we'll scan refname and rule to skip over their "refs/" prefix, and > the next iteration, where post-increment moved the pointers to point > at 'h' (at the beginning of "heads/frotz") on the refname side and > '%' on the rule side, we iterate once more, notice *rule is '%', and > break out of the loop. We have refname="heads/frotz" and rule="%.*s" > > If we have refname="refsXheads/frotz" and rule="refs/%.*s", after > skipping over "refs", refname points at 'X' while rule points at '/' > and the loop needs to break. Both pointers are post-incremented, > and now we have refname="heads/frotz" and rule="%.*s". Thanks for being careful. I had originally detected a match by setting a flag in the loop when we see the "%", but then thought it wasn't needed. And it's not for the matching case, but it is for the non-match. This would fix it: diff --git a/refs.c b/refs.c index d8ce7e9ee1..2c26cf02d3 100644 --- a/refs.c +++ b/refs.c @@ -1318,6 +1318,8 @@ int update_ref(const char *msg, const char *refname, static const char *match_parse_rule(const char *refname, const char *rule, size_t *len) { + int matched = 0; + /* * Check that rule matches refname up to the first percent * in the rule. This is basically skip_prefix(), but @@ -1326,10 +1328,15 @@ static const char *match_parse_rule(const char *refname, const char *rule, do { if (!*rule) BUG("rev-parse rule did not have percent"); - if (*rule == '%') + if (*rule == '%') { + matched = 1; break; + } } while (*refname++ == *rule++); + if (!matched) + return 0; + /* * Check that we matched all the way to the "%" placeholder, * and skip past it within the rule string. If so, "refname" at but I have a feeling that it gets more readable if we flip the break conditional and the loop condition. I had also imagined this as a skip_prefix_to_percent() helper, which makes the logic nicer, but we actually need to advance in both the refname and the prefix, which makes for a weird interface. -Peff
Jeff King <peff@peff.net> writes: > but I have a feeling that it gets more readable if we flip the break > conditional and the loop condition. Yeah, the somewhoat unusual loop structure was what motivated me to look at its corner case. Flipping the logic around may make it more straight forward.
On Tue, Feb 14, 2023 at 02:34:01PM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > but I have a feeling that it gets more readable if we flip the break > > conditional and the loop condition. > > Yeah, the somewhoat unusual loop structure was what motivated me to > look at its corner case. Flipping the logic around may make it more > straight forward. It does indeed. I pulled the logic from skip_prefix(), thinking that by relying on it I would avoid making a stupid mistake. Oh well. :) Doing it like this is much more readable: diff --git a/refs.c b/refs.c index d8ce7e9ee1..725adafcd8 100644 --- a/refs.c +++ b/refs.c @@ -1323,12 +1323,12 @@ static const char *match_parse_rule(const char *refname, const char *rule, * in the rule. This is basically skip_prefix(), but * ending at percent in the prefix, rather than end-of-string. */ - do { + while (*rule != '%') { if (!*rule) BUG("rev-parse rule did not have percent"); - if (*rule == '%') - break; - } while (*refname++ == *rule++); + if (*refname++ != *rule++) + return 0; + } /* * Check that we matched all the way to the "%" placeholder, I'll hold on to that (plus an adjustment to the comment below to match, and perhaps a test for this negative-match case) for a day or so to give anybody else a chance to comment, and then send out a v2 tomorrow. -Peff
On Tue, Feb 14, 2023 at 4:48 PM Junio C Hamano <gitster@pobox.com> wrote: > Jeff King <peff@peff.net> writes: > > + do { > > + if (!*rule) > > + BUG("rev-parse rule did not have percent"); > > + if (*rule == '%') > > + break; > > + } while (*refname++ == *rule++); > > So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then > we'll scan refname and rule to skip over their "refs/" prefix, and > the next iteration, where post-increment moved the pointers to point > at 'h' (at the beginning of "heads/frotz") on the refname side and > '%' on the rule side, we iterate once more, notice *rule is '%', and > break out of the loop. We have refname="heads/frotz" and rule="%.*s" > > If we have refname="refsXheads/frotz" and rule="refs/%.*s", after > skipping over "refs", refname points at 'X' while rule points at '/' > and the loop needs to break. Both pointers are post-incremented, > and now we have refname="heads/frotz" and rule="%.*s". Nice catch. I had sat staring at and worrying about the combined comparison and post-increment, trying to come up with a failing edge case but missed this one entirely. This logic error missed by two people suggests that the patch probably ought to be accompanied by some new tests.
Jeff King <peff@peff.net> writes: > It does indeed. I pulled the logic from skip_prefix(), thinking that by > relying on it I would avoid making a stupid mistake. Oh well. :) > > Doing it like this is much more readable: > ... > I'll hold on to that (plus an adjustment to the comment below to match, > and perhaps a test for this negative-match case) for a day or so to give > anybody else a chance to comment, and then send out a v2 tomorrow. Thanks, and surely that is very readable. Alternatively, I think you can just compare refname and rule until they diverge, without doing any special casing for per-cent on the rule side inside the loop. If you do not find any difference, or the byte that differ is not the per-cent at the beginning of "%.*s" on the rule side, they they do not match.
On Tue, Feb 14, 2023 at 09:10:04PM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > It does indeed. I pulled the logic from skip_prefix(), thinking that by > > relying on it I would avoid making a stupid mistake. Oh well. :) > > > > Doing it like this is much more readable: > > ... > > I'll hold on to that (plus an adjustment to the comment below to match, > > and perhaps a test for this negative-match case) for a day or so to give > > anybody else a chance to comment, and then send out a v2 tomorrow. > > Thanks, and surely that is very readable. > > Alternatively, I think you can just compare refname and rule until > they diverge, without doing any special casing for per-cent on the > rule side inside the loop. > > If you do not find any difference, or the byte that differ is not > the per-cent at the beginning of "%.*s" on the rule side, they they > do not match. I had a similar thought, but I think it is fooled by "refs/heads/%foo". The correct shortening there is "%foo". But we'd parse the "refs/heads/%.*s" rule up to the ".", and then complain that they do not match. -Peff
Jeff King <peff@peff.net> writes: > I had a similar thought, but I think it is fooled by "refs/heads/%foo". > The correct shortening there is "%foo". But we'd parse the > "refs/heads/%.*s" rule up to the ".", and then complain that they do not > match. Yeah, you're right.
diff --git a/refs.c b/refs.c index 84f344d8af..d8ce7e9ee1 100644 --- a/refs.c +++ b/refs.c @@ -1310,53 +1310,61 @@ int update_ref(const char *msg, const char *refname, old_oid, flags, onerr); } +/* + * Check that the string refname matches a rule of the form + * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule + * "foo/%.*s/baz", and return the string "bar". + */ +static const char *match_parse_rule(const char *refname, const char *rule, + size_t *len) +{ + /* + * Check that rule matches refname up to the first percent + * in the rule. This is basically skip_prefix(), but + * ending at percent in the prefix, rather than end-of-string. + */ + do { + if (!*rule) + BUG("rev-parse rule did not have percent"); + if (*rule == '%') + break; + } while (*refname++ == *rule++); + + /* + * Check that we matched all the way to the "%" placeholder, + * and skip past it within the rule string. If so, "refname" at + * this point is the beginning of a potential match. + */ + if (!skip_prefix(rule, "%.*s", &rule)) + return NULL; + + /* + * And now check that our suffix (if any) matches. + */ + if (!strip_suffix(refname, rule, len)) + return NULL; + + return refname; /* len set by strip_suffix() */ +} + char *refs_shorten_unambiguous_ref(struct ref_store *refs, const char *refname, int strict) { int i; - static char **scanf_fmts; - char *short_name; struct strbuf resolved_buf = STRBUF_INIT; - if (!scanf_fmts) { - /* - * Pre-generate scanf formats from ref_rev_parse_rules[]. - * Generate a format suitable for scanf from a - * ref_rev_parse_rules rule by interpolating "%s" at the - * location of the "%.*s". - */ - size_t total_len = 0; - size_t offset = 0; - - for (i = 0; i < NUM_REV_PARSE_RULES; i++) - /* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */ - total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1; - - scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), NUM_REV_PARSE_RULES), total_len)); - - offset = 0; - for (i = 0; i < NUM_REV_PARSE_RULES; i++) { - assert(offset < total_len); - scanf_fmts[i] = (char *)&scanf_fmts[NUM_REV_PARSE_RULES] + offset; - offset += xsnprintf(scanf_fmts[i], total_len - offset, - ref_rev_parse_rules[i], 2, "%s") + 1; - } - } - - /* buffer for scanf result, at most refname must fit */ - short_name = xstrdup(refname); - /* skip first rule, it will always match */ for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) { int j; int rules_to_fail = i; + const char *short_name; size_t short_name_len; - if (1 != sscanf(refname, scanf_fmts[i], short_name)) + short_name = match_parse_rule(refname, ref_rev_parse_rules[i], + &short_name_len); + if (!short_name) continue; - short_name_len = strlen(short_name); - /* * in strict mode, all (except the matched one) rules * must fail to resolve to a valid non-ambiguous ref @@ -1394,12 +1402,11 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs, */ if (j == rules_to_fail) { strbuf_release(&resolved_buf); - return short_name; + return xmemdupz(short_name, short_name_len); } } strbuf_release(&resolved_buf); - free(short_name); return xstrdup(refname); }
To shorten a fully qualified ref (e.g., taking "refs/heads/foo" to just "foo"), we munge the usual lookup rules ("refs/heads/%.*s", etc) to drop the ".*" modifier (so "refs/heads/%s"), and then use sscanf() to match that against the refname, pulling the "%s" content into a separate buffer. This has two downsides: - sscanf("%s") reportedly misbehaves on macOS with some input and locale combinations, returning a partial or garbled string. See this thread: https://lore.kernel.org/git/CAGF3oAcCi+fG12j-1U0hcrWwkF5K_9WhOi6ZPHBzUUzfkrZDxA@mail.gmail.com/ - scanf in general is an error-prone interface. For example, scanning for "%s" will copy bytes into a destination string, which must have been correctly sized ahead of time to avoid a buffer overflow. In this case, the code is OK (the buffer is pessimistically sized to match the original string, which should give us a maximum). But in general, we do not want to encourage people to use scanf at all. So instead, let's note that our lookup rules are not arbitrary format strings, but all contain exactly one "%.*s" placeholder. We already rely on this, both for lookup (we feed the lookup format along with exactly one int/ptr combo to snprintf, etc) and for shortening (we munge "%.*s" to "%s", and then insist that sscanf() finds exactly one result). We can parse this manually by just matching the bytes that occur before and after the "%.*s" placeholder. While we have a few extra lines of parsing code, the result is arguably simpler, as can skip the preprocessing step and its tricky memory management entirely. The in-code comments should explain the parsing strategy, but there's one subtle change here. The original code allocated a single buffer, and then overwrote it in each loop iteration, since that's the only option sscanf() gives us. But our parser can actually return a ptr/len combo for the matched string, which is all we need (since we just feed it back to the lookup rules with "%.*s"), and then copy it only when returning to the caller. Reported-by: 孟子易 <mengziyi540841@gmail.com> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Jeff King <peff@peff.net> --- BTW, this diff is generated with --patience, which generates a _much_ nicer output in this case. Not important to this series, but since there was discussion of switching the default in a nearby thread, it seemed like an interesting example. refs.c | 77 ++++++++++++++++++++++++++++++++-------------------------- 1 file changed, 42 insertions(+), 35 deletions(-)