Message ID | Y+z3RfhAxW/2iNYP@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Accepted |
Commit | 613bef56b820cf7f24dc4b3ae65fc91826368185 |
Headers | show |
Series | get rid of sscanf() when shortening refs | expand |
On Wed, Feb 15, 2023 at 10:16:21AM -0500, Jeff King wrote: [] > +test_expect_success 'symbolic-ref --short handles complex utf8 case' ' > + name="测试-加-增加-加-增加" && > + git symbolic-ref TEST_SYMREF "refs/heads/$name" && > + # In the real world, we saw problems with this case only > + # when the locale includes UTF-8. Set it here to try to make things as > + # hard as possible for us to pass, but in practice we should do the > + # right thing regardless (and of course some platforms may not even > + # have this locale). > + LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual && > + echo "$name" >expect && > + test_cmp expect actual > +' > + Hm, I may have 1 or two questions here. The real world was MacOs, should that be mentioned here? The other thing seems to be that there is a bug even with LANG=C, see the response from Eric here: $ git symbolic-ref --short HEAD | xxd 00000000: e6b5 8be8 af95 2de5 8a0a ......-... $ LANG=C git symbolic-ref --short HEAD | xxd 00000000: e6b5 8be8 af95 2de5 8aa0 2de5 a29e e58a ......-...-..... 00000010: a02d e58a a02d e5a2 9ee5 8aa0 0a .-...-....... Does it make sense to a) Use the local locale, what ever that is b) Re-run with LC_ALL=en_US.UTF-8 c) Re-run with LANG=C (that is where I had suspected problems when using UTF-8) d) Mention MacOs here ? Somewhat in that style: test_expect_success 'symbolic-ref --short handles complex utf8 case' ' name="测试-加-增加-加-增加" && git symbolic-ref TEST_SYMREF "refs/heads/$name" && # In the real world, we saw problems with this case only under MacOs Ventura # when the locale includes UTF-8. Try it here to try to make things as # hard as possible for us to pass, but in practice we should do the # right thing regardless (and of course some platforms may not even # have this locale). # Use try even the default and LANG=C git symbolic-ref --short TEST_SYMREF >actual && echo "$name" >expect && test_cmp expect actual && LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual && echo "$name" >expect && test_cmp expect actual && LANG=C git symbolic-ref --short TEST_SYMREF >actual && echo "$name" >expect && test_cmp expect actual '
On Thu, Feb 16, 2023 at 12:56 AM Torsten Bögershausen <tboegi@web.de> wrote: > On Wed, Feb 15, 2023 at 10:16:21AM -0500, Jeff King wrote: > > +test_expect_success 'symbolic-ref --short handles complex utf8 case' ' > > + name="测试-加-增加-加-增加" && > > + git symbolic-ref TEST_SYMREF "refs/heads/$name" && > > + # In the real world, we saw problems with this case only > > + # when the locale includes UTF-8. Set it here to try to make things as > > + # hard as possible for us to pass, but in practice we should do the > > + # right thing regardless (and of course some platforms may not even > > + # have this locale). > > + LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual && > > + echo "$name" >expect && > > + test_cmp expect actual > > +' > > The other thing seems to be that there is a bug even with > LANG=C, see the response from Eric here: > > $ git symbolic-ref --short HEAD | xxd > 00000000: e6b5 8be8 af95 2de5 8a0a ......-... > $ LANG=C git symbolic-ref --short HEAD | xxd > 00000000: e6b5 8be8 af95 2de5 8aa0 2de5 a29e e58a ......-...-..... > 00000010: a02d e58a a02d e5a2 9ee5 8aa0 0a .-...-....... I'm confused. To what bug do you refer? In my tests, LANG=C seemed to sidestep the problem. > Does it make sense to > a) Use the local locale, what ever that is > b) Re-run with LC_ALL=en_US.UTF-8 > c) Re-run with LANG=C (that is where I had suspected problems when using UTF-8) In my tests, LANG=C is the only case which seemed to work correctly when the implementation used fscanf(). > d) Mention MacOs here ? Certainly, a good idea. > Somewhat in that style: > > test_expect_success 'symbolic-ref --short handles complex utf8 case' ' > name="测试-加-增加-加-增加" && > git symbolic-ref TEST_SYMREF "refs/heads/$name" && > # In the real world, we saw problems with this case only under MacOs Ventura I'm on ancient High Sierra (10.13) using HFS+, so the problem is not Ventura-specific. The original bug report did mention Ventura (which presumably is using APFS). > # when the locale includes UTF-8. Try it here to try to make things as > # hard as possible for us to pass, but in practice we should do the > # right thing regardless (and of course some platforms may not even > # have this locale). > # Use try even the default and LANG=C > git symbolic-ref --short TEST_SYMREF >actual && > echo "$name" >expect && > test_cmp expect actual && > LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual && > echo "$name" >expect && > test_cmp expect actual && > LANG=C git symbolic-ref --short TEST_SYMREF >actual && > echo "$name" >expect && > test_cmp expect actual > '
Eric Sunshine <sunshine@sunshineco.com> writes: >> d) Mention MacOs here ? > > Certainly, a good idea. Yes. >> test_expect_success 'symbolic-ref --short handles complex utf8 case' ' >> name="测试-加-增加-加-增加" && >> git symbolic-ref TEST_SYMREF "refs/heads/$name" && >> # In the real world, we saw problems with this case only under MacOs Ventura > > I'm on ancient High Sierra (10.13) using HFS+, so the problem is not > Ventura-specific. The original bug report did mention Ventura (which > presumably is using APFS). I do not think the bug is in sscanf() and not in the code that deals with filenames in the filesystem. The symbolic-ref TEST_SYMREF is implemented by writing the problematic string into a regular file, and we read it as a stream of bytes---there is no chance for things like filename normalization the filesystem tries to do to corrupt it. So reference to "under macOS Ventuara" I think is about their C library, not a particular filesystem it uses. How about a bit more detail on sscanf(), like this? t/t1401-symbolic-ref.sh | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git c/t/t1401-symbolic-ref.sh w/t/t1401-symbolic-ref.sh index be23be30c7..dafcb4d61b 100755 --- c/t/t1401-symbolic-ref.sh +++ w/t/t1401-symbolic-ref.sh @@ -192,11 +192,13 @@ test_expect_success 'symbolic-ref pointing at another' ' test_expect_success 'symbolic-ref --short handles complex utf8 case' ' name="测试-加-增加-加-增加" && git symbolic-ref TEST_SYMREF "refs/heads/$name" && - # In the real world, we saw problems with this case only - # when the locale includes UTF-8. Set it here to try to make things as - # hard as possible for us to pass, but in practice we should do the - # right thing regardless (and of course some platforms may not even - # have this locale). + # In the real world, we saw this case misbehaved on macOS only + # when the locale includes UTF-8, back when "symbolic-ref --short" + # used sscanf(3) as part of its implementation. Set it here to + # try to make things as hard as possible for us to pass, but in + # practice we should do the right thing regardless (and of course + # some platforms may not even have this locale), as we no longer + # use platform sscanf(3) there. LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual && echo "$name" >expect && test_cmp expect actual
On Thu, Feb 16, 2023 at 09:21:15AM -0800, Junio C Hamano wrote: > How about a bit more detail on sscanf(), like this? > > t/t1401-symbolic-ref.sh | 12 +++++++----- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git c/t/t1401-symbolic-ref.sh w/t/t1401-symbolic-ref.sh > index be23be30c7..dafcb4d61b 100755 > --- c/t/t1401-symbolic-ref.sh > +++ w/t/t1401-symbolic-ref.sh > @@ -192,11 +192,13 @@ test_expect_success 'symbolic-ref pointing at another' ' > test_expect_success 'symbolic-ref --short handles complex utf8 case' ' > name="测试-加-增加-加-增加" && > git symbolic-ref TEST_SYMREF "refs/heads/$name" && > - # In the real world, we saw problems with this case only > - # when the locale includes UTF-8. Set it here to try to make things as > - # hard as possible for us to pass, but in practice we should do the > - # right thing regardless (and of course some platforms may not even > - # have this locale). > + # In the real world, we saw this case misbehaved on macOS only > + # when the locale includes UTF-8, back when "symbolic-ref --short" > + # used sscanf(3) as part of its implementation. Set it here to > + # try to make things as hard as possible for us to pass, but in > + # practice we should do the right thing regardless (and of course > + # some platforms may not even have this locale), as we no longer > + # use platform sscanf(3) there. > LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual && > echo "$name" >expect && > test_cmp expect actual I am OK with that squashed in. I hadn't bothered to mention macOS, etc, because that was covered in the commit message. My point in this comment was mostly to say "don't just remove this LC_ALL! It is doing something". But your text makes the situation even more clear. I do kind of wonder if this test is even doing much. It is nice to verify the fix (and hopefully somebody with macOS did indeed verify that it fails before the fix!). But it does not seem all that likely that we are going to regress in this area. I think it's reasonable to include it, but if the LC_ALL bit starts creating any portability problems, my first instinct would be to drop the test. -Peff
On Thu, Feb 16, 2023 at 06:56:41AM +0100, Torsten Bögershausen wrote: > Does it make sense to > a) Use the local locale, what ever that is It is always "C", because that's what our tests do. And we found that wasn't particularly interesting to this case. > b) Re-run with LC_ALL=en_US.UTF-8 That part's what we already do, and... > c) Re-run with LANG=C (that is where I had suspected problems when using UTF-8) This won't do anything, because we set LC_ALL in the test scripts, which takes precedence over LANG. Eric used it in his real-world testing because he quite sensibly does not have LC_ALL set in his usual shell. :) > d) Mention MacOs here ? Yup, discussed downthread. -Peff
Jeff King <peff@peff.net> writes: > I do kind of wonder if this test is even doing much. It is nice to > verify the fix (and hopefully somebody with macOS did indeed verify that > it fails before the fix!). But it does not seem all that likely that we > are going to regress in this area. I think it's reasonable to include > it, but if the LC_ALL bit starts creating any portability problems, my > first instinct would be to drop the test. That was my thought as well. I do not see this particular test to be about protecting against regression. With the greediness of its patterns, it is not likely that our future update will reintroduce use of sscanf(3) and suffer from the same platform bug. It is about documenting the historical bug, i.e. the value of the test part of the patch primarily is that macOS folks can apply it (and revert the code fix) and demonstrate the presence of a problem. The refs/remotes/%s/HEAD test does have value---it is to keep us away from the temptation to go back to the sscanf(3) based solution. So do the other two to lessor degree, as they cover corner cases that a similar/rejected implementation would have failed to handle. Thanks.
On Thu, Feb 16, 2023 at 12:31:48PM -0500, Jeff King wrote:
> On Thu, Feb 16, 2023 at 06:56:41AM +0100, Torsten Bögershausen wrote:
[]
Thanks for the patience with my somewhat confusing emails-
I was to much confused by the fact that I couldn't reproduce the problem here.
In short: The bug is in good hands.
diff --git a/refs.c b/refs.c index 84f344d8af..aeae31c972 100644 --- a/refs.c +++ b/refs.c @@ -1310,53 +1310,62 @@ int update_ref(const char *msg, const char *refname, old_oid, flags, onerr); } -char *refs_shorten_unambiguous_ref(struct ref_store *refs, - const char *refname, int strict) +/* + * Check that the string refname matches a rule of the form + * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule + * "foo/%.*s/baz", and return the string "bar". + */ +static const char *match_parse_rule(const char *refname, const char *rule, + size_t *len) { - int i; - static char **scanf_fmts; - char *short_name; - struct strbuf resolved_buf = STRBUF_INIT; - - if (!scanf_fmts) { - /* - * Pre-generate scanf formats from ref_rev_parse_rules[]. - * Generate a format suitable for scanf from a - * ref_rev_parse_rules rule by interpolating "%s" at the - * location of the "%.*s". - */ - size_t total_len = 0; - size_t offset = 0; + /* + * Check that rule matches refname up to the first percent in the rule. + * We can bail immediately if not, but otherwise we leave "rule" at the + * %-placeholder, and "refname" at the start of the potential matched + * name. + */ + while (*rule != '%') { + if (!*rule) + BUG("rev-parse rule did not have percent"); + if (*refname++ != *rule++) + return NULL; + } - for (i = 0; i < NUM_REV_PARSE_RULES; i++) - /* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */ - total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1; + /* + * Check that our "%" is the expected placeholder. This assumes there + * are no other percents (placeholder or quoted) in the string, but + * that is sufficient for our rev-parse rules. + */ + if (!skip_prefix(rule, "%.*s", &rule)) + return NULL; - scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), NUM_REV_PARSE_RULES), total_len)); + /* + * And now check that our suffix (if any) matches. + */ + if (!strip_suffix(refname, rule, len)) + return NULL; - offset = 0; - for (i = 0; i < NUM_REV_PARSE_RULES; i++) { - assert(offset < total_len); - scanf_fmts[i] = (char *)&scanf_fmts[NUM_REV_PARSE_RULES] + offset; - offset += xsnprintf(scanf_fmts[i], total_len - offset, - ref_rev_parse_rules[i], 2, "%s") + 1; - } - } + return refname; /* len set by strip_suffix() */ +} - /* buffer for scanf result, at most refname must fit */ - short_name = xstrdup(refname); +char *refs_shorten_unambiguous_ref(struct ref_store *refs, + const char *refname, int strict) +{ + int i; + struct strbuf resolved_buf = STRBUF_INIT; /* skip first rule, it will always match */ for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) { int j; int rules_to_fail = i; + const char *short_name; size_t short_name_len; - if (1 != sscanf(refname, scanf_fmts[i], short_name)) + short_name = match_parse_rule(refname, ref_rev_parse_rules[i], + &short_name_len); + if (!short_name) continue; - short_name_len = strlen(short_name); - /* * in strict mode, all (except the matched one) rules * must fail to resolve to a valid non-ambiguous ref @@ -1394,12 +1403,11 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs, */ if (j == rules_to_fail) { strbuf_release(&resolved_buf); - return short_name; + return xmemdupz(short_name, short_name_len); } } strbuf_release(&resolved_buf); - free(short_name); return xstrdup(refname); } diff --git a/t/t1401-symbolic-ref.sh b/t/t1401-symbolic-ref.sh index d708acdb81..be23be30c7 100755 --- a/t/t1401-symbolic-ref.sh +++ b/t/t1401-symbolic-ref.sh @@ -189,4 +189,38 @@ test_expect_success 'symbolic-ref pointing at another' ' test_cmp expect actual ' +test_expect_success 'symbolic-ref --short handles complex utf8 case' ' + name="测试-加-增加-加-增加" && + git symbolic-ref TEST_SYMREF "refs/heads/$name" && + # In the real world, we saw problems with this case only + # when the locale includes UTF-8. Set it here to try to make things as + # hard as possible for us to pass, but in practice we should do the + # right thing regardless (and of course some platforms may not even + # have this locale). + LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual && + echo "$name" >expect && + test_cmp expect actual +' + +test_expect_success 'symbolic-ref --short handles name with suffix' ' + git symbolic-ref TEST_SYMREF "refs/remotes/origin/HEAD" && + git symbolic-ref --short TEST_SYMREF >actual && + echo "origin" >expect && + test_cmp expect actual +' + +test_expect_success 'symbolic-ref --short handles almost-matching name' ' + git symbolic-ref TEST_SYMREF "refs/headsXfoo" && + git symbolic-ref --short TEST_SYMREF >actual && + echo "headsXfoo" >expect && + test_cmp expect actual +' + +test_expect_success 'symbolic-ref --short handles name with percent' ' + git symbolic-ref TEST_SYMREF "refs/heads/%foo" && + git symbolic-ref --short TEST_SYMREF >actual && + echo "%foo" >expect && + test_cmp expect actual +' + test_done
To shorten a fully qualified ref (e.g., taking "refs/heads/foo" to just "foo"), we munge the usual lookup rules ("refs/heads/%.*s", etc) to drop the ".*" modifier (so "refs/heads/%s"), and then use sscanf() to match that against the refname, pulling the "%s" content into a separate buffer. This has a few downsides: - sscanf("%s") reportedly misbehaves on macOS with some input and locale combinations, returning a partial or garbled string. See this thread: https://lore.kernel.org/git/CAGF3oAcCi+fG12j-1U0hcrWwkF5K_9WhOi6ZPHBzUUzfkrZDxA@mail.gmail.com/ - scanf's matching of "%s" is greedy. So the "refs/remotes/%s/HEAD" rule would never pull "origin" out of "refs/remotes/origin/HEAD". Instead it always produced "origin/HEAD", which is redundant with the "refs/remotes/%s" rule. - scanf in general is an error-prone interface. For example, scanning for "%s" will copy bytes into a destination string, which must have been correctly sized ahead of time to avoid a buffer overflow. In this case, the code is OK (the buffer is pessimistically sized to match the original string, which should give us a maximum). But in general, we do not want to encourage people to use scanf at all. So instead, let's note that our lookup rules are not arbitrary format strings, but all contain exactly one "%.*s" placeholder. We already rely on this, both for lookup (we feed the lookup format along with exactly one int/ptr combo to snprintf, etc) and for shortening (we munge "%.*s" to "%s", and then insist that sscanf() finds exactly one result). We can parse this manually by just matching the bytes that occur before and after the "%.*s" placeholder. While we have a few extra lines of parsing code, the result is arguably simpler, as can skip the preprocessing step and its tricky memory management entirely. The in-code comments should explain the parsing strategy, but there's one subtle change here. The original code allocated a single buffer, and then overwrote it in each loop iteration, since that's the only option sscanf() gives us. But our parser can actually return a ptr/len combo for the matched string, which is all we need (since we just feed it back to the lookup rules with "%.*s"), and then copy it only when returning to the caller. There are a few new tests here, all using symbolic-ref (the code can be triggered in many ways, but symrefs are convenient in that we don't need to create a real ref, which avoids any complications from the filesystem munging the name): - the first covers the real-world case which misbehaved on macOS. Setting LC_ALL is required to trigger the problem there (since otherwise our tests use LC_ALL=C), and hopefully is at worst simply ignored on other systems (and doesn't cause libc to complain, etc, on systems without that locale). - the second covers the "origin/HEAD" case as discussed above, which is now fixed - the remainder are for "weird" cases that work both before and after this patch, but would be easy to get wrong with off-by-one problems in the parsing (and came out of discussions and earlier iterations of the patch that did get them wrong). - absent here are tests of boring, expected-to-work cases like "refs/heads/foo", etc. Those are covered all over the test suite both explicitly (for-each-ref's refname:short) and implicitly (in the output of git-status, etc). Reported-by: 孟子易 <mengziyi540841@gmail.com> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Jeff King <peff@peff.net> --- refs.c | 78 +++++++++++++++++++++++------------------ t/t1401-symbolic-ref.sh | 34 ++++++++++++++++++ 2 files changed, 77 insertions(+), 35 deletions(-)