diff mbox series

[1/1] Added built in function recognition for shell

Message ID 20250211114611.9334-2-dhar61595@gmail.com (mailing list archive)
State Superseded
Headers show
Series userdiff: add built-in pattern for shell scripts | expand

Commit Message

Moumita Feb. 11, 2025, 11:46 a.m. UTC
Introduced a built-in userdiff driver for shell scripts, enabling
accurate function name recognition in `git diff` hunk headers.

Enhancements include:
- Function name detection for both POSIX and Bash/Ksh-style functions:
  - `function_name() { ... }`
  - `function function_name { ... }`
- Exclusion of shell keywords that can resemble function names,
  preventing false matches (e.g., `if`, `for`, `while`, `return`, etc.).
- Improved tokenization support for:
  - Identifiers (variable and function names)
  - Numeric constants (integers and decimals)
  - Shell variables (`$VAR`, `${VAR}`)
  - Logical (`&&`, `||`, `==`, `!=`, `<=`, `>=`) and arithmetic operators
  - Assignment and redirection operators
  - Brackets and grouping symbols

This update improves Git’s diff readability for shell scripts,
bringing it in line with existing built-in userdiff drivers.

Signed-off-by: Moumita <dhar61595@gmail.com>
---
 userdiff.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Johannes Sixt Feb. 15, 2025, 2:37 p.m. UTC | #1
Am 11.02.25 um 12:46 schrieb Moumita:
> Introduced a built-in userdiff driver for shell scripts, enabling
> accurate function name recognition in `git diff` hunk headers.
> 
> Enhancements include:
> - Function name detection for both POSIX and Bash/Ksh-style functions:
>   - `function_name() { ... }`
>   - `function function_name { ... }`
> - Exclusion of shell keywords that can resemble function names,
>   preventing false matches (e.g., `if`, `for`, `while`, `return`, etc.).
> - Improved tokenization support for:
>   - Identifiers (variable and function names)
>   - Numeric constants (integers and decimals)
>   - Shell variables (`$VAR`, `${VAR}`)
>   - Logical (`&&`, `||`, `==`, `!=`, `<=`, `>=`) and arithmetic operators
>   - Assignment and redirection operators
>   - Brackets and grouping symbols
> 
> This update improves Git’s diff readability for shell scripts,
> bringing it in line with existing built-in userdiff drivers.

Please remove the marketing tone from the commit message.

 - "accurate function name recognition": Is not possible because the
tools (regular expressions) don't have sufficient context to allow it.

 - "Enhancements include": Either list *all* enhancements, or focus on
noteworthy ones, but don't make a long list that's still incomplete.

 - The word "improve" is never needed in the context that we see it
here, because you certainly don't want to worsen the code base.

Please study section "Describe your changes well" in
Documentation/SubmittingPatches on how to write the commit message.
(Describe the state before the change in present tense, and the changes
in imperative mood.)

An important point is *why* we want this change. In the case of a new
userdiff driver, however, the answer is usally simply "because we can",
and everybody knows it. Please don't write a long paragraph that dances
around this fact. Just don't write it; but if there are other reason,
please do say so.

> 
> Signed-off-by: Moumita <dhar61595@gmail.com>

The name given in Signed-off-by has legal meaning. If you have a given
name and a surname, please use both names here and then by extension
also as patch author.

> ---
>  userdiff.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/userdiff.c b/userdiff.c
> index 340c4eb4f7..a8c14807c6 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -334,6 +334,26 @@ PATTERNS("scheme",
>  	 "\\|([^\\\\]*)\\|"
>  	 /* All other words should be delimited by spaces or parentheses */
>  	 "|([^][)(}{[ \t])+"),
> +PATTERNS("shell",

Correctly sorted into the list. Good!

HOWEVER! We already have a diffdriver for bash. It would be better to
extend that one than to introduce a new driver.

> +	 /* Negate shell keywords that can look like functions */
> +	 "!^[ \t]*(if|elif|else|fi|for|while|until|case|esac|then|do|done|return|break|continue)\\b\n"

This list looks unnecessary. The userdiff driver can assume that it
operates on syntactically valid text.

   if () {
   }

isn't correct, and that's the case for all listed keywords.

> +	 /* POSIX-style shell functions: function_name() { ... } */
> +	 "^[ \t]*([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\(\\)[ \t]*\\{\n"

Two nitpicks:

- There can be whitespace between the parentheses.
- The function body can also be in parentheses.

   foo ( ) (
      echo
   )

would be a correct shell function definition. Its body always runs in a
sub-shell.

> +	 /* Bash/Ksh-style functions: function function_name { ... } */
> +	 "^[ \t]*function[ \t]+([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\{\n",

Both here and above, just let the regular expression end before the
opening bracket. Then it would also recognize this oddity:

function x \
  { echo x; }

> +	 /* -- */
> +	 /* Identifiers: variable and function names */
> +	 "[a-zA-Z_][a-zA-Z0-9_]*"
> +	 /* Numeric constants: integers and decimals */
> +	 "|[-+]?[0-9]+(\\.[0-9]*)?"

Typically, a - or + isn't part of the number, but a separate operator. I
suggest to drop the leading '[-+]?'. But see below.

> +	 /* Shell variables: $VAR and ${VAR} */
> +	 "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"

The uses of ${name} are rather exceptional. The cases where ${ is used
are the complex cases, for example:

    base=${fname##*/}
    base=${base%."$ext"}

In such cases, I would prefer that the constituents of the expression
are their own tokens and not lumped into a single "variable name" token.

> +	 /* Logical and comparison operators */
> +	 "|\\|\\||&&|<<|>>|==|!=|<=|>="
> +	 /* Assignment and arithmetic operators */
> +	 "|[-+*/%&|^!=<>]=?"

Which makes me think: Text in shell commands has only very little
restrictions. A typical case are command options starting with one or
two dashes. Do we want to separate the dash from the option name? This
regular expression would do so, and I would consider it a deficiency.

> +	 /* Brackets and grouping symbols */
> +	 "|\\(|\\)|\\{|\\}|\\[|\\]"),
>  PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
>  	 "\\\\[a-zA-Z@]+|\\\\.|([a-zA-Z0-9]|[^\x01-\x7f])+"),
>  { .name = "default", .binary = -1 },

It would be nice to have some tests. Have a look at t/t4018/bash-* and
t/t4034/* (no bash there, yet).

-- Hannes
diff mbox series

Patch

diff --git a/userdiff.c b/userdiff.c
index 340c4eb4f7..a8c14807c6 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -334,6 +334,26 @@  PATTERNS("scheme",
 	 "\\|([^\\\\]*)\\|"
 	 /* All other words should be delimited by spaces or parentheses */
 	 "|([^][)(}{[ \t])+"),
+PATTERNS("shell",
+	 /* Negate shell keywords that can look like functions */
+	 "!^[ \t]*(if|elif|else|fi|for|while|until|case|esac|then|do|done|return|break|continue)\\b\n"
+	 /* POSIX-style shell functions: function_name() { ... } */
+	 "^[ \t]*([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\(\\)[ \t]*\\{\n"
+	 /* Bash/Ksh-style functions: function function_name { ... } */
+	 "^[ \t]*function[ \t]+([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\{\n",
+	 /* -- */
+	 /* Identifiers: variable and function names */
+	 "[a-zA-Z_][a-zA-Z0-9_]*"
+	 /* Numeric constants: integers and decimals */
+	 "|[-+]?[0-9]+(\\.[0-9]*)?"
+	 /* Shell variables: $VAR and ${VAR} */
+	 "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
+	 /* Logical and comparison operators */
+	 "|\\|\\||&&|<<|>>|==|!=|<=|>="
+	 /* Assignment and arithmetic operators */
+	 "|[-+*/%&|^!=<>]=?"
+	 /* Brackets and grouping symbols */
+	 "|\\(|\\)|\\{|\\}|\\[|\\]"),
 PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
 	 "\\\\[a-zA-Z@]+|\\\\.|([a-zA-Z0-9]|[^\x01-\x7f])+"),
 { .name = "default", .binary = -1 },