Message ID | 7327ac06-d5da-ec53-543e-78e7729e78bb@web.de (mailing list archive) |
---|---|
State | New, archived |
Series | userdiff: support regexec(3) with multi-byte support |
On 06.04.23 at 22:19, René Scharfe wrote:
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep. It supports multi-byte
> strings and rejects invalid multi-byte characters.
>
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
>
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character. The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.

Perhaps the expression should be "[\xc4\x80-\xf7\xbf\xbf\xbf]+", i.e.,
sequences of code points U+0080 to U+10FFFF?

> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then.

This is not equivalent. The original treated a sequence of non-ASCII
characters as a word. The new version treats each individual non-space
character (both ASCII and non-ASCII) as a word.

> Additionally the word regex for tex contains the expression
> "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range. The best
> replacement with only valid characters that I can come up with is
> "([a-zA-Z0-9]|[^\x01-\x7f])+". Unlike the original it matches NUL
> characters, though. Assuming that tex files usually don't contain NUL
> this should be acceptable.

This is acceptable, of course. The replacement range looks sensible.

-- Hannes

On 07.04.23 at 00:35, Johannes Sixt wrote:
> On 06.04.23 at 22:19, René Scharfe wrote:
>> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
>> 2022-08-26) we use the system library for all regular expression
>> matching on macOS, not just for git grep. It supports multi-byte
>> strings and rejects invalid multi-byte characters.
>>
>> This broke all built-in userdiff word regexes in UTF-8 locales because
>> they all include such invalid bytes in expressions that are intended to
>> match multi-byte characters without explicit support for that from the
>> regex engine.
>>
>> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
>> regexes to match a single non-space or multi-byte character. The \xNN
>> characters are invalid if interpreted as UTF-8 because they have their
>> high bit set, which indicates they are part of a multi-byte character,
>> but they are surrounded by single-byte characters.
>
> Perhaps the expression should be "[\xc4\x80-\xf7\xbf\xbf\xbf]+", i.e.,
> sequences of code points U+0080 to U+10FFFF?

regcomp(3) on macOS doesn't like it:

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[Ā-????]

Looks like it objects to U+10FFFF here; "[\xc4\x80-\xf3\xa0\x80\x80]" is
accepted, for example.

\xc4\x80 is U+0100, by the way; U+0080 would be \xc2\x80. And
regcomp(3) doesn't like that either ("[\xc2\x80-\xf3\xa0\x80\x80]"):

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<U+0080>-

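The byte sequences exchanged in this message can be double-checked with a small UTF-8 encoder. The following is a minimal sketch only; `utf8_encode` is a hypothetical helper written for illustration and is not part of Git or of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 into out, which must have
 * room for at least 4 bytes.  Returns the number of bytes written, or
 * 0 if cp is a surrogate or lies beyond U+10FFFF.
 * (Hypothetical helper for illustration; not Git code.)
 */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
	assert(out != NULL);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xc0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0;	/* surrogates are not valid scalar values */
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xe0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp <= 0x10ffff) {
		out[0] = (unsigned char)(0xf0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[3] = (unsigned char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;	/* beyond the Unicode range */
}
```

With this helper, U+0080 encodes as \xc2\x80, U+0100 as \xc4\x80, U+E0000 as \xf3\xa0\x80\x80, and U+10FFFF as \xf4\x8f\xbf\xbf. The sequence \xf7\xbf\xbf\xbf would decode to U+1FFFFF, which lies outside the Unicode range.
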
On 07.04.23 at 09:49, René Scharfe wrote:
> On 07.04.23 at 00:35, Johannes Sixt wrote:
>> This is not equivalent. The original treated a sequence of non-ASCII
>> characters as a word. The new version treats each individual non-space
>> character (both ASCII and non-ASCII) as a word.
>
> I assume you mean "The original treated [a single non-space as well as]
> a sequence of non-ASCII characters [making up a single multi-byte
> character] as a word.". That works as intended by 664d44ee7f (userdiff:
> simplify word-diff safeguard, 2011-01-11).

I misread the original RE. I thought it would lump multiple multi-byte
characters together into one word, but it does not; sorry for that.

It looks like your suggested replacement is behaviorally identical to
the original after all, except perhaps for this one:

> The new one doesn't match multi-byte whitespace anymore.

but I did not find a reference that confirms it. I don't think we need
to bend over backwards to keep this compatibility, though.

-- Hannes

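The corrected reading, that the byte-oriented alternative covers exactly one multi-byte character and not a run of them, can be checked with a short experiment. This is a sketch under assumptions: it relies on the process running in the default "C" locale (no setlocale() call), where regexec(3) is expected to work on single bytes, and `leading_match_length` is a hypothetical helper, not a Git function. Behaviour may differ between C libraries.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return the length in bytes of the match of `pattern` that starts at
 * the beginning of `text`, or -1 if the pattern does not compile or
 * does not match at offset 0.
 * (Hypothetical helper for illustration; not Git code.)
 */
long leading_match_length(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	long len = -1;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED))
		return -1;
	if (!regexec(&re, text, 1, &m, 0) && m.rm_so == 0)
		len = (long)m.rm_eo;
	regfree(&re);
	return len;
}
```

Applied to the two-character string "\xc3\xa4\xc3\xb6" with the pattern "[\xc0-\xff][\x80-\xbf]+", the match should stop after the first character: the second lead byte \xc3 is not in the continuation range \x80-\xbf, so the `+` cannot extend across a character boundary.
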
On Thu, Apr 6, 2023 at 4:19 PM René Scharfe <l.s.r@web.de> wrote:
>
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep. It supports multi-byte
> strings and rejects invalid multi-byte characters.
>
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
>
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character. The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.
>
> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then. Check for that capability at
> runtime, because it depends on the locale and thus on environment
> variables. Construct the full replacement expression at build time
> and just switch it in if necessary to avoid string manipulation and
> allocations at runtime.
>
> Additionally the word regex for tex contains the expression
> "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range. The best
> replacement with only valid characters that I can come up with is
> "([a-zA-Z0-9]|[^\x01-\x7f])+". Unlike the original it matches NUL
> characters, though. Assuming that tex files usually don't contain NUL
> this should be acceptable.
>
> Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: René Scharfe <l.s.r@web.de>

I tested the patch locally on top of ae73b2c8f1 and it solved my
problem. Seems like there's still some further discussion, though.

"D. Ben Knoble" <ben.knoble@gmail.com> writes:
> On Thu, Apr 6, 2023 at 4:19 PM René Scharfe <l.s.r@web.de> wrote:
>>
>> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
>> 2022-08-26) we use the system library for all regular expression
>> matching on macOS, not just for git grep. It supports multi-byte
>> strings and rejects invalid multi-byte characters.
>> ...
>> Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
>> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
>> Helped-by: Junio C Hamano <gitster@pobox.com>
>> Signed-off-by: René Scharfe <l.s.r@web.de>
>
> I tested the patch locally on top of ae73b2c8f1 and it solved my
> problem. Seems like there's still some further discussion, though.

Thanks very much for reporting and testing. Also, thanks all for
working on this solution. Will queue.

On Thu, Apr 6, 2023 at 4:19 PM René Scharfe <l.s.r@web.de> wrote:
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep. It supports multi-byte
> strings and rejects invalid multi-byte characters.
>
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
>
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character. The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.
>
> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then. Check for that capability at
> runtime, because it depends on the locale and thus on environment
> variables. Construct the full replacement expression at build time
> and just switch it in if necessary to avoid string manipulation and
> allocations at runtime.
>
> Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: René Scharfe <l.s.r@web.de>

Thank you, René! This patch resolves the problem I was
experiencing[1]. I'm happy to have --color-words working again.

[1]: https://lore.kernel.org/git/CAPig+cSNmws2b7f7aRA2C56kvQYG3w_g+KhYdqhtmf+XhtAMhQ@mail.gmail.com/

diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
index 15764ee9ac..74586f3813 100755
--- a/t/t4034-diff-words.sh
+++ b/t/t4034-diff-words.sh
@@ -69,6 +69,10 @@ test_language_driver () {
 		echo "* diff='"$lang"'" >.gitattributes &&
 		word_diff --color-words
 	'
+	test_expect_success "diff driver '$lang' in Islandic" '
+		LANG=is_IS.UTF-8 LANGUAGE=is LC_ALL="$is_IS_locale" \
+		word_diff --color-words
+	'
 }
 
 test_expect_success setup '
diff --git a/userdiff.c b/userdiff.c
index 09203fbc35..eaec6ebb5e 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -17,6 +17,7 @@ static int drivers_alloc;
 		.cflags = REG_EXTENDED, \
 	}, \
 	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex_multi_byte = wrx "|[^[:space:]]", \
 }
 #define IPATTERN(lang, rx, wrx) { \
 	.name = lang, \
@@ -26,6 +27,7 @@ static int drivers_alloc;
 		.cflags = REG_EXTENDED | REG_ICASE, \
 	}, \
 	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex_multi_byte = wrx "|[^[:space:]]", \
 }
 
 /*
@@ -294,7 +296,7 @@ PATTERNS("scheme",
 	 /* All other words should be delimited by spaces or parentheses */
 	 "|([^][)(}{[ \t])+"),
 PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
-	 "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+"),
+	 "\\\\[a-zA-Z@]+|\\\\.|([a-zA-Z0-9]|[^\x01-\x7f])+"),
 { "default", NULL, NULL, -1, { NULL, 0 } },
 };
 #undef PATTERNS
@@ -330,6 +332,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
 	return 0;
 }
 
+static int regexec_supports_multi_byte_chars(void)
+{
+	static const char not_space[] = "[^[:space:]]";
+	static const char utf8_multi_byte_char[] = "\xc2\xa3";
+	regex_t re;
+	regmatch_t match;
+	static int result = -1;
+
+	if (result != -1)
+		return result;
+	if (regcomp(&re, not_space, REG_EXTENDED))
+		BUG("invalid regular expression: %s", not_space);
+	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
+		match.rm_so == 0 &&
+		match.rm_eo == strlen(utf8_multi_byte_char);
+	regfree(&re);
+	return result;
+}
+
 static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
 {
 	struct find_by_namelen_data udcbdata = {
@@ -405,7 +426,13 @@ int userdiff_config(const char *k, const char *v)
 struct userdiff_driver *userdiff_find_by_name(const char *name)
 {
 	int len = strlen(name);
-	return userdiff_find_by_namelen(name, len);
+	struct userdiff_driver *driver = userdiff_find_by_namelen(name, len);
+	if (driver && driver->word_regex_multi_byte) {
+		if (regexec_supports_multi_byte_chars())
+			driver->word_regex = driver->word_regex_multi_byte;
+		driver->word_regex_multi_byte = NULL;
+	}
+	return driver;
 }
 
 struct userdiff_driver *userdiff_find_by_path(struct index_state *istate,
diff --git a/userdiff.h b/userdiff.h
index 24419db697..d726804c3e 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -18,6 +18,7 @@ struct userdiff_driver {
 	int binary;
 	struct userdiff_funcname funcname;
 	const char *word_regex;
+	const char *word_regex_multi_byte;
 	const char *textconv;
 	struct notes_cache *textconv_cache;
 	int textconv_want_cache;

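The patch builds both variants of each word regex at compile time through string-literal concatenation in the PATTERNS macros and only swaps a pointer at runtime. The following reduced sketch shows that idea in isolation; `demo_driver`, `DEMO_PATTERNS` and `demo_select_word_regex` are hypothetical stand-ins for the real structures, and the capability flag is passed in as a parameter where the patch probes regexec(3).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The two suffixes used by the patch, as string literals. */
#define BYTE_SUFFIX "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+"
#define MULTI_BYTE_SUFFIX "|[^[:space:]]"

/* Reduced stand-in for struct userdiff_driver (illustration only). */
struct demo_driver {
	const char *word_regex;
	const char *word_regex_multi_byte;
};

/*
 * Adjacent string literals are concatenated by the compiler, so both
 * complete expressions exist in the binary and nothing has to be
 * allocated or assembled at runtime.
 */
#define DEMO_PATTERNS(wrx) { \
	.word_regex = wrx BYTE_SUFFIX, \
	.word_regex_multi_byte = wrx MULTI_BYTE_SUFFIX, \
}

/*
 * Pick the variant once.  Clearing word_regex_multi_byte afterwards
 * marks the driver as resolved, so later calls leave it unchanged.
 */
void demo_select_word_regex(struct demo_driver *driver, int multi_byte_ok)
{
	assert(driver != NULL);
	if (!driver->word_regex_multi_byte)
		return;
	if (multi_byte_ok)
		driver->word_regex = driver->word_regex_multi_byte;
	driver->word_regex_multi_byte = NULL;
}
```

A driver declared as `struct demo_driver d = DEMO_PATTERNS("[a-z]+");` ends up with the short expression when `multi_byte_ok` is non-zero and keeps the byte-oriented one otherwise. The cost of carrying two strings per driver is a few dozen bytes each, which the commit message accepts in exchange for avoiding runtime string handling.
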
Since 1819ad327b (grep: fix multibyte regex handling under macOS,
2022-08-26) we use the system library for all regular expression
matching on macOS, not just for git grep. It supports multi-byte
strings and rejects invalid multi-byte characters.

This broke all built-in userdiff word regexes in UTF-8 locales because
they all include such invalid bytes in expressions that are intended to
match multi-byte characters without explicit support for that from the
regex engine.

"|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
regexes to match a single non-space or multi-byte character. The \xNN
characters are invalid if interpreted as UTF-8 because they have their
high bit set, which indicates they are part of a multi-byte character,
but they are surrounded by single-byte characters.

Replace that expression with "|[^[:space:]]" if the regex engine
supports multi-byte matching, as there is no need to have an explicit
range for multi-byte characters then. Check for that capability at
runtime, because it depends on the locale and thus on environment
variables. Construct the full replacement expression at build time
and just switch it in if necessary to avoid string manipulation and
allocations at runtime.

Additionally the word regex for tex contains the expression
"[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range. The best
replacement with only valid characters that I can come up with is
"([a-zA-Z0-9]|[^\x01-\x7f])+". Unlike the original it matches NUL
characters, though. Assuming that tex files usually don't contain NUL
this should be acceptable.

Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
Reported-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t4034-diff-words.sh |  4 ++++
 userdiff.c            | 31 +++++++++++++++++++++++++++++--
 userdiff.h            |  1 +
 3 files changed, 34 insertions(+), 2 deletions(-)

--
2.40.0
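
The commit message says the capability has to be checked at runtime because it depends on the locale. The sketch below illustrates that dependence outside of Git. It is not the patch's code: `matches_as_one_char` and `probe_in_locale` are hypothetical helpers, no result is cached, and which locale names exist (for example "C.UTF-8") is an assumption that varies by system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if regexec(3) consumes all of `text` as one non-space
 * character under the current LC_CTYPE locale, 0 otherwise.
 * (Hypothetical helper for illustration; not Git code.)
 */
int matches_as_one_char(const char *text)
{
	regex_t re;
	regmatch_t m;
	int ok;

	assert(text != NULL);
	if (regcomp(&re, "[^[:space:]]", REG_EXTENDED))
		return 0;
	ok = !regexec(&re, text, 1, &m, 0) &&
	     m.rm_so == 0 &&
	     (size_t)m.rm_eo == strlen(text);
	regfree(&re);
	return ok;
}

/*
 * Switch LC_CTYPE to `locale_name` and probe with the two-byte UTF-8
 * encoding of U+00A3, the same sample the patch uses.  Returns -1 if
 * the locale is not available on this system.
 */
int probe_in_locale(const char *locale_name)
{
	if (!setlocale(LC_CTYPE, locale_name))
		return -1;
	return matches_as_one_char("\xc2\xa3");
}
```

In the "C" locale the probe is expected to report 0, because the engine matches only the first byte of the sample. In a UTF-8 locale with a multi-byte-aware regexec(3) it should report 1, which is the case in which the patch switches to the shorter expression.
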