[v2,1/1] userdiff: extend Bash pattern to cover more shell function forms

Message ID 20250218153537.16320-2-dhar61595@gmail.com (mailing list archive)
State New
Series userdiff: add built-in pattern for shell scripts

Commit Message

Moumita Feb. 18, 2025, 3:35 p.m. UTC
From: Moumita Dhar <dhar61595@gmail.com>

The existing Bash userdiff pattern misses some shell function forms, such as
`function foo()`, multi-line definitions, and extra whitespace.

Extend the pattern to:
- Support `function foo()` syntax.
- Allow spaces in `foo ( )` definitions.
- Recognize multi-line definitions with backslashes.
- Broaden function body detection.

Signed-off-by: Moumita Dhar <dhar61595@gmail.com>
---
 userdiff.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

Comments

Junio C Hamano Feb. 18, 2025, 7:30 p.m. UTC | #1
Moumita <dhar61595@gmail.com> writes:

>  PATTERNS("bash",
> -	 /* Optional leading indentation */
> +     /* Optional leading indentation */

What is this change about?

>  	 "^[ \t]*"
> -	 /* Start of captured text */
> +	 /* Start of captured function name */
>  	 "("
>  	 "("
> -	     /* POSIX identifier with mandatory parentheses */
> -	     "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\))"
> +		 /* POSIX identifier with mandatory parentheses (allow spaces inside) */
> +		 "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\)"

Is the indentation change intended, and is it required for this patch to work correctly?

>  	 "|"
> -	     /* Bashism identifier with optional parentheses */
> -	     "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
> +		 /* Bash-style function definitions, allowing optional `function` keyword */
> +		 "(?:function[ \t]+(?=[a-zA-Z_]))?[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))?"

Ditto.

Regular expressions are a write-only language; please make sure that
you do not add any unnecessary changes that distract the eyes of
reviewers from spotting the _real_ changes that improve the current
codebase.

>  	 ")"
>  	 /* Optional whitespace */
>  	 "[ \t]*"
> -	 /* Compound command starting with `{`, `(`, `((` or `[[` */
> -	 "(\\{|\\(\\(?|\\[\\[)"
> -	 /* End of captured text */
> +	 /* Allow function body to start with `{`, `(` (subshell), `[[` */
> +	 "(\\{|\\(|\\[\\[)"
> +	 /* End of captured function name */
>  	 ")",


>  	 /* -- */
> -	 /* Characters not in the default $IFS value */
> -	 "[^ \t]+"),

We used to pretty much go with "a run of non-whitespace characters is
a token".  Now we are a bit more picky.

Which may or may not be good, but it is hard to tell whether it is an
improvement.

> +	 /* Identifiers: variable and function names */
> +	 "[a-zA-Z_][a-zA-Z0-9_]*"
> +	 /* Numeric constants: integers and decimals */
> +	 "|[-+]?[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
> +	 /* Shell variables: `$VAR`, `${VAR}` */
> +	 "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
> +	 /* Logical and comparison operators */
> +	 "|\\|\\||&&|<<|>>|==|!=|<=|>="
> +	 /* Assignment and arithmetic operators */
> +	 "|[-+*/%&|^!=<>]=?"
> +	 /* Command-line options (to avoid splitting `-option`) */
> +	 "|--?[a-zA-Z0-9_-]+"
> +	 /* Brackets and grouping symbols */
> +	 "|\\(|\\)|\\{|\\}|\\[|\\]"),

The fact that this patch does not touch anything in the "t/" hierarchy
suggests to me that we do not have existing tests that check how sample
text files in the supported languages are tokenized (otherwise the
above changes would require adjusting such existing tests), so I
think it should be left outside of this topic, but I wonder if
adding such tests would give us a good way to demonstrate the effect of
these changes to userdiff patterns.

Thanks.
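
To make the effect of the word-regex change concrete, the two patterns can be run over the same sample text and the resulting tokens counted. The sketch below is only an approximation written for illustration: `count_tokens()` is a hypothetical helper, it is not Git's word-diff code, and compiling with `REG_EXTENDED` is an assumption. The pattern strings are copied from the removed and added lines of the diff.

```c
#include <regex.h>
#include <stddef.h>

/* Word regex removed by the patch: any run of non-blank characters. */
static const char old_bash_words[] = "[^ \t]+";

/* Word regex added by the patch, copied from the added lines of the diff. */
static const char new_bash_words[] =
	"[a-zA-Z_][a-zA-Z0-9_]*"
	"|[-+]?[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
	"|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
	"|\\|\\||&&|<<|>>|==|!=|<=|>="
	"|[-+*/%&|^!=<>]=?"
	"|--?[a-zA-Z0-9_-]+"
	"|\\(|\\)|\\{|\\}|\\[|\\]";

/*
 * Hypothetical helper (not part of Git): count tokens by repeatedly
 * taking the next match in `text`.  Characters that the pattern does
 * not match are skipped.  Returns -1 if the pattern does not compile.
 */
static int count_tokens(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	int n = 0;

	if (regcomp(&re, pattern, REG_EXTENDED))
		return -1;
	while (*text && !regexec(&re, text, 1, &m, 0)) {
		if (m.rm_eo == m.rm_so)
			break; /* guard against an empty match */
		n++;
		text += m.rm_eo; /* continue after the token just found */
	}
	regfree(&re);
	return n;
}
```

With this helper, `a=b` is a single token under `old_bash_words` but three tokens (`a`, `=`, `b`) under `new_bash_words`, so a word diff of an edited assignment would highlight only the side that changed. A test along these lines, fed with sample shell text, is one possible way to demonstrate what the new pattern buys.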
Junio C Hamano Feb. 18, 2025, 11:38 p.m. UTC | #2
Moumita <dhar61595@gmail.com> writes:

> From: Moumita Dhar <dhar61595@gmail.com>
>
> The existing Bash userdiff pattern misses some shell function forms, such as
> `function foo()`, multi-line definitions, and extra whitespace.
>
> Extend the pattern to:
> - Support `function foo()` syntax.
> - Allow spaces in `foo ( )` definitions.
> - Recognize multi-line definitions with backslashes.
> - Broaden function body detection.
>
> Signed-off-by: Moumita Dhar <dhar61595@gmail.com>

Applied to any one of the recent tips of 'master', this seems to break
tests, and the reproduction seems to be quite easy.

    $ make 
    $ cd t && sh t4018-*.sh -i -v
    ...
    test_expect_code: command exited with 128, we wanted 1 ...
    not ok 6 - builtin bash pattern compiles
    #...
    #			test_grep ! fatal msg &&
    #			test_grep ! error msg
    #		
    $ cat t/trash*.t4018*/msg
    fatal: Invalid regexp to look for hunk header: ^[ 	]*(([a-z...

Please make it a habit to run tests after you modify the code and
before you send out patches with the modifications.

Thanks.
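
The thread does not say which construct makes the pattern invalid. One plausible cause, stated here as an assumption, is the Perl-style `(?:...)` and `(?=...)` groups in the new alternative: they are not part of POSIX extended regular expressions, so a `regcomp()` call with `REG_EXTENDED` would be expected to reject them. The sketch below checks this in isolation. `compiles_as_ere()` is a hypothetical helper, and `ere_only` is an illustrative spelling that uses only ERE syntax, not a proposed replacement for the patch.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Git): returns 1 if `pattern`
 * compiles as a POSIX extended regular expression, 0 otherwise.
 */
static int compiles_as_ere(const char *pattern)
{
	regex_t re;

	if (regcomp(&re, pattern, REG_EXTENDED))
		return 0;
	regfree(&re);
	return 1;
}

/* Start of the alternative as written in the patch: (?: ) and (?= ). */
static const char perl_style[] =
	"(?:function[ \t]+(?=[a-zA-Z_]))?[a-zA-Z_][a-zA-Z0-9_]*";

/*
 * Illustrative ERE-only spelling: an ordinary group replaces (?: ),
 * and the lookahead is dropped because the identifier that follows
 * already has to start with a letter or an underscore.
 */
static const char ere_only[] =
	"(function[ \t]+)?[a-zA-Z_][a-zA-Z0-9_]*";
```

On a regex library that implements plain POSIX ERE, `compiles_as_ere(perl_style)` is expected to return 0 and `compiles_as_ere(ere_only)` to return 1. A library that accepts Perl-style extensions may behave differently, which is a reason to run the test suite rather than rely on a single platform.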
Patch

diff --git a/userdiff.c b/userdiff.c
index 340c4eb4f7..194e28883d 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -53,26 +53,38 @@  IPATTERN("ada",
 	 "|[-+]?[0-9][0-9#_.aAbBcCdDeEfF]*([eE][+-]?[0-9_]+)?"
 	 "|=>|\\.\\.|\\*\\*|:=|/=|>=|<=|<<|>>|<>"),
 PATTERNS("bash",
-	 /* Optional leading indentation */
+     /* Optional leading indentation */
 	 "^[ \t]*"
-	 /* Start of captured text */
+	 /* Start of captured function name */
 	 "("
 	 "("
-	     /* POSIX identifier with mandatory parentheses */
-	     "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\))"
+		 /* POSIX identifier with mandatory parentheses (allow spaces inside) */
+		 "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\)"
 	 "|"
-	     /* Bashism identifier with optional parentheses */
-	     "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
+		 /* Bash-style function definitions, allowing optional `function` keyword */
+		 "(?:function[ \t]+(?=[a-zA-Z_]))?[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))?"
 	 ")"
 	 /* Optional whitespace */
 	 "[ \t]*"
-	 /* Compound command starting with `{`, `(`, `((` or `[[` */
-	 "(\\{|\\(\\(?|\\[\\[)"
-	 /* End of captured text */
+	 /* Allow function body to start with `{`, `(` (subshell), `[[` */
+	 "(\\{|\\(|\\[\\[)"
+	 /* End of captured function name */
 	 ")",
 	 /* -- */
-	 /* Characters not in the default $IFS value */
-	 "[^ \t]+"),
+	 /* Identifiers: variable and function names */
+	 "[a-zA-Z_][a-zA-Z0-9_]*"
+	 /* Numeric constants: integers and decimals */
+	 "|[-+]?[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
+	 /* Shell variables: `$VAR`, `${VAR}` */
+	 "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
+	 /* Logical and comparison operators */
+	 "|\\|\\||&&|<<|>>|==|!=|<=|>="
+	 /* Assignment and arithmetic operators */
+	 "|[-+*/%&|^!=<>]=?"
+	 /* Command-line options (to avoid splitting `-option`) */
+	 "|--?[a-zA-Z0-9_-]+"
+	 /* Brackets and grouping symbols */
+	 "|\\(|\\)|\\{|\\}|\\[|\\]"),
 PATTERNS("bibtex",
 	 "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
 	 /* -- */