Message ID | 20250218153537.16320-2-dhar61595@gmail.com (mailing list archive) |
---|---|
State | New |
Series | userdiff: add built-in pattern for shell scripts |
Moumita <dhar61595@gmail.com> writes:

>  PATTERNS("bash",
> - /* Optional leading indentation */
> + /* Optional leading indentation */

What is this change about?

>   "^[ \t]*"
> - /* Start of captured text */
> + /* Start of captured function name */
>   "("
>   "("
> - /* POSIX identifier with mandatory parentheses */
> - "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\))"
> + /* POSIX identifier with mandatory parentheses (allow spaces inside) */
> + "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\)"

Is the indentation change intended and required for this patch to work
correctly?

>   "|"
> - /* Bashism identifier with optional parentheses */
> - "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
> + /* Bash-style function definitions, allowing optional `function` keyword */
> + "(?:function[ \t]+(?=[a-zA-Z_]))?[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))?"

Ditto. Regular expressions are a write-only language; please make sure
that you do not add any unnecessary changes that distract the eyes of
reviewers from spotting the _real_ changes that improve the current
codebase.

>   ")"
>   /* Optional whitespace */
>   "[ \t]*"
> - /* Compound command starting with `{`, `(`, `((` or `[[` */
> - "(\\{|\\(\\(?|\\[\\[)"
> - /* End of captured text */
> + /* Allow function body to start with `{`, `(` (subshell), `[[` */
> + "(\\{|\\(|\\[\\[)"
> + /* End of captured function name */
>   ")",
>   /* -- */
> - /* Characters not in the default $IFS value */
> - "[^ \t]+"),

We used to pretty much use "a run of non-whitespace characters is a
token". Now we are a bit more picky, which may or may not be good, but
it is hard to tell if it is an improvement.

> + /* Identifiers: variable and function names */
> + "[a-zA-Z_][a-zA-Z0-9_]*"
> + /* Numeric constants: integers and decimals */
> + "|[-+]?[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
> + /* Shell variables: `$VAR`, `${VAR}` */
> + "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
> + /* Logical and comparison operators */
> + "|\\|\\||&&|<<|>>|==|!=|<=|>="
> + /* Assignment and arithmetic operators */
> + "|[-+*/%&|^!=<>]=?"
> + /* Command-line options (to avoid splitting `-option`) */
> + "|--?[a-zA-Z0-9_-]+"
> + /* Brackets and grouping symbols */
> + "|\\(|\\)|\\{|\\}|\\[|\\]"),

The fact that this patch does not have any changes to the "t/" hierarchy
suggests to me that we do not have existing tests to see how sample text
files in the supported languages are tokenized (otherwise the above
changes would require adjusting such existing tests), so I think it
should be left outside of this topic, but I wonder if adding such tests
would give us a good way to demonstrate the effect of these changes to
userdiff patterns.

Thanks.
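To make the "run of non-whitespace" versus "more picky" difference concrete, here is a rough sketch that counts how many tokens each word regex finds on a sample line. It assumes plain POSIX `regcomp()`/`regexec()` with `REG_EXTENDED`; `count_tokens`, `OLD_WORD_REGEX` and `NEW_WORD_REGEX` are hypothetical names made up for this illustration, and the loop only approximates what git's word-diff code does. As far as I know, the `PATTERNS()` macro also appends catch-all alternatives to the word regex, which this sketch leaves out.

```c
#include <assert.h>
#include <regex.h>

/* Word regex before the patch: a run of non-whitespace is one token. */
const char OLD_WORD_REGEX[] = "[^ \t]+";

/* Word regex proposed by the patch, copied from the diff. */
const char NEW_WORD_REGEX[] =
	"[a-zA-Z_][a-zA-Z0-9_]*"
	"|[-+]?[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
	"|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
	"|\\|\\||&&|<<|>>|==|!=|<=|>="
	"|[-+*/%&|^!=<>]=?"
	"|--?[a-zA-Z0-9_-]+"
	"|\\(|\\)|\\{|\\}|\\[|\\]";

/*
 * Hypothetical helper: count successive matches of `pattern` in `line`.
 * Returns -1 if the pattern does not compile as a POSIX ERE.
 */
int count_tokens(const char *pattern, const char *line)
{
	regex_t re;
	regmatch_t m;
	int count = 0;

	assert(pattern && line);
	if (regcomp(&re, pattern, REG_EXTENDED))
		return -1;
	while (*line && !regexec(&re, line, 1, &m, 0)) {
		if (m.rm_eo == m.rm_so)
			break;		/* guard against an empty match */
		count++;
		line += m.rm_eo;	/* continue after this token */
	}
	regfree(&re);
	return count;
}
```

With this sketch, a line such as `foo=$bar` should come out as a single token under the old regex and as three tokens (`foo`, `=`, `$bar`) under the new one, which is the kind of difference a tokenization test under "t/" could pin down.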
Moumita <dhar61595@gmail.com> writes:

> From: Moumita Dhar <dhar61595@gmail.com>
>
> The existing Bash userdiff pattern misses some shell function forms, such as
> `function foo()`, multi-line definitions, and extra whitespace.
>
> Extend the pattern to:
> - Support `function foo()` syntax.
> - Allow spaces in `foo ( )` definitions.
> - Recognize multi-line definitions with backslashes.
> - Broaden function body detection.
>
> Signed-off-by: Moumita Dhar <dhar61595@gmail.com>

Applied to any one of the recent tips of 'master', this seemed to break
tests, and the reproduction seems to be quite easy.

    $ make
    $ cd t && sh t4018-*.sh -i -v
    ...
    test_expect_code: command exited with 128, we wanted 1
    ...
    not ok 6 - builtin bash pattern compiles
    #...
    # test_grep ! fatal msg &&
    # test_grep ! error msg
    #
    $ cat t/trash*.t4018*/msg
    fatal: Invalid regexp to look for hunk header: ^[ ]*(([a-z...

Please make it a habit to run tests after you modify the code, before
sending out patches with the modifications.

Thanks.
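A likely cause of the "Invalid regexp" failure, assuming the hunk-header pattern is compiled with POSIX `regcomp()` and `REG_EXTENDED`, is that the new pattern uses `(?:...)` and `(?=...)`. Those are Perl-style constructs, not part of POSIX extended regular expressions. The sketch below checks a fragment of each version; `ere_compiles`, `OLD_FRAGMENT` and `NEW_FRAGMENT` are hypothetical names used only here, and the old fragment includes the closing parenthesis that sits on the following line of the original source.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Pre-patch "function" alternative: plain ERE grouping only. */
const char OLD_FRAGMENT[] =
	"(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+)))";

/* Post-patch alternative, copied from the diff: uses (?: and (?= . */
const char NEW_FRAGMENT[] =
	"(?:function[ \t]+(?=[a-zA-Z_]))?[a-zA-Z_][a-zA-Z0-9_]*"
	"(([ \t]*\\([ \t]*\\))|([ \t]+))?";

/* Hypothetical helper: 1 if `pattern` compiles as a POSIX ERE, else 0. */
int ere_compiles(const char *pattern)
{
	regex_t re;
	char msg[128];
	int err;

	assert(pattern);
	err = regcomp(&re, pattern, REG_EXTENDED);
	if (err) {
		/* Report why the regex library rejected the pattern. */
		regerror(err, &re, msg, sizeof(msg));
		fprintf(stderr, "regcomp failed: %s\n", msg);
		return 0;
	}
	regfree(&re);
	return 1;
}
```

On a glibc-based system I would expect `ere_compiles(OLD_FRAGMENT)` to succeed and `ere_compiles(NEW_FRAGMENT)` to fail, because a `?` directly after `(` has nothing to repeat. Other regex implementations may report the error differently.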
diff --git a/userdiff.c b/userdiff.c
index 340c4eb4f7..194e28883d 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -53,26 +53,38 @@ IPATTERN("ada",
  "|[-+]?[0-9][0-9#_.aAbBcCdDeEfF]*([eE][+-]?[0-9_]+)?"
  "|=>|\\.\\.|\\*\\*|:=|/=|>=|<=|<<|>>|<>"),
 PATTERNS("bash",
- /* Optional leading indentation */
+ /* Optional leading indentation */
  "^[ \t]*"
- /* Start of captured text */
+ /* Start of captured function name */
  "("
  "("
- /* POSIX identifier with mandatory parentheses */
- "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\))"
+ /* POSIX identifier with mandatory parentheses (allow spaces inside) */
+ "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\)"
  "|"
- /* Bashism identifier with optional parentheses */
- "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
+ /* Bash-style function definitions, allowing optional `function` keyword */
+ "(?:function[ \t]+(?=[a-zA-Z_]))?[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))?"
  ")"
  /* Optional whitespace */
  "[ \t]*"
- /* Compound command starting with `{`, `(`, `((` or `[[` */
- "(\\{|\\(\\(?|\\[\\[)"
- /* End of captured text */
+ /* Allow function body to start with `{`, `(` (subshell), `[[` */
+ "(\\{|\\(|\\[\\[)"
+ /* End of captured function name */
  ")",
  /* -- */
- /* Characters not in the default $IFS value */
- "[^ \t]+"),
+ /* Identifiers: variable and function names */
+ "[a-zA-Z_][a-zA-Z0-9_]*"
+ /* Numeric constants: integers and decimals */
+ "|[-+]?[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
+ /* Shell variables: `$VAR`, `${VAR}` */
+ "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
+ /* Logical and comparison operators */
+ "|\\|\\||&&|<<|>>|==|!=|<=|>="
+ /* Assignment and arithmetic operators */
+ "|[-+*/%&|^!=<>]=?"
+ /* Command-line options (to avoid splitting `-option`) */
+ "|--?[a-zA-Z0-9_-]+"
+ /* Brackets and grouping symbols */
+ "|\\(|\\)|\\{|\\}|\\[|\\]"),
 PATTERNS("bibtex",
  "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
  /* -- */