[GSOC] userdiff: add support for Scheme

Message ID	20210327173938.59391-1-raykar.ath@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> From: Atharva Raykar <raykar.ath@gmail.com> To: git@vger.kernel.org Cc: Atharva Raykar <raykar.ath@gmail.com> Subject: [GSOC][PATCH] userdiff: add support for Scheme Date: Sat, 27 Mar 2021 23:09:38 +0530 Message-Id: <20210327173938.59391-1-raykar.ath@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[GSOC] userdiff: add support for Scheme \| expand [GSOC] userdiff: add support for Scheme

Atharva Raykar March 27, 2021, 5:39 p.m. UTC

Add a diff driver for Scheme (R5RS and R6RS) which
recognizes top level and local `define` forms,
whether it is a function definition, binding, syntax
definition or a user-defined `define-xyzzy` form.

The rationale for picking `define` forms for the
hunk headers is because it is usually the only
significant form for defining the structure of the
program, and it is a common pattern for schemers to
have local function definitions to hide their
visibility, so it is not only the top level
`define`'s that are of interest. Schemers also
extend the language with macros to provide their
own define forms (for example, something like a
`define-test-suite`) which is also captured in the
hunk header.

The word regex is a best-effort attempt to conform
to R6RS[1] valid identifiers, symbols and numbers.

[1] http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_chap_4

Signed-off-by: Atharva Raykar <raykar.ath@gmail.com>
---

Hi, first-time contributor here, I wanted to have a go at this as
a microproject.

A few things I had to consider:
 - Going through the mailing list, there have already been two other
   patches that are for lispy languages that have taken slightly
   different approaches: Elisp[1] and Clojure[2]. Would it make any
   sense to have a single userdiff driver for lisp that just captures
   all top level forms in the hunk? I personally felt it's better to
   differentiate the drivers for each language, as they have different
   constructs.

 - It was hard to decide exactly which forms should appear on the hunk
   headers. Having programmed in a Scheme before, I went with the forms I
   would have liked to see when looking at git diffs, which would be the
   nearest `define` along with `define-syntax` and other define forms that
   are created as user-defined macros. I am willing to ask around in
   certain active scheme communities for some kind of consensus, but there
   is no single large consolidated group of schemers (the closest is
   probably comp.lang.scheme?).

 - By best-effort attempt at the wordregex, I mean that it is a little
   more permissive than it has to be, as it accepts a few words that are
   technically invalid in Scheme.
   Making it handle all cases like numbers and identifiers with separate
   regexen would be greatly complicated (Eg: #x#e10.2f3 is a valid number
   but #x#f10.2e3 is not; 10t1 is a valid identifier, but 10s1 is a number
   -- my wordregex just clubs all of these into a generic 'word match' which
   trades of granularity for simplicity, and it usually does the right thing).

[1] http://public-inbox.org/git/20210213192447.6114-1-git@adamspiers.org/
[2] http://public-inbox.org/git/pull.902.git.1615667191368.gitgitgadget@gmail.com/

 Documentation/gitattributes.txt    | 2 ++
 t/t4018-diff-funcname.sh           | 1 +
 t/t4018/scheme-define-syntax       | 8 ++++++++
 t/t4018/scheme-local-define        | 4 ++++
 t/t4018/scheme-top-level-define    | 4 ++++
 t/t4018/scheme-user-defined-define | 6 ++++++
 t/t4034-diff-words.sh              | 1 +
 t/t4034/scheme/expect              | 9 +++++++++
 t/t4034/scheme/post                | 4 ++++
 t/t4034/scheme/pre                 | 4 ++++
 userdiff.c                         | 8 ++++++++
 11 files changed, 51 insertions(+)
 create mode 100644 t/t4018/scheme-define-syntax
 create mode 100644 t/t4018/scheme-local-define
 create mode 100644 t/t4018/scheme-top-level-define
 create mode 100644 t/t4018/scheme-user-defined-define
 create mode 100644 t/t4034/scheme/expect
 create mode 100644 t/t4034/scheme/post
 create mode 100644 t/t4034/scheme/pre

Junio C Hamano March 27, 2021, 10:50 p.m. UTC | #1

Atharva Raykar <raykar.ath@gmail.com> writes:

> diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax
> new file mode 100644
> index 0000000000..603b99cea4
> --- /dev/null
> +++ b/t/t4018/scheme-define-syntax
> @@ -0,0 +1,8 @@
> +(define-syntax define-test-suite RIGHT
> +  (syntax-rules ()
> +    ((_ suite-name (name test) ChangeMe ...)
> +     (define suite-name
> +       (let ((tests
> +              `((name . ,test) ...)))
> +         (lambda ()
> +           (ChangeMe 'suite-name tests)))))))
> \ No newline at end of file

Is there a good reason to leave the final line incomplete?  If there
isn't, complete it (applies to other newly-created files in the patch).

> diff --git a/userdiff.c b/userdiff.c
> index 3f81a2261c..c51a8c98ba 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -191,6 +191,14 @@ PATTERNS("rust",
>  	 "[a-zA-Z_][a-zA-Z0-9_]*"
>  	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
>  	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
> +PATTERNS("scheme",
> +         "^[\t ]*(\\(define-?.*)$",

Didn't "git diff HEAD" before committing (or "git show") highlighted
these whitespace errors?

.git/rebase-apply/patch:183: indent with spaces.
         "^[\t ]*(\\(define-?.*)$",
.git/rebase-apply/patch:184: trailing whitespace, indent with spaces.
         /* 
.git/rebase-apply/patch:185: indent with spaces.
          * Scheme allows symbol names to have any character,
.git/rebase-apply/patch:186: indent with spaces.
          * as long as it is not a form of a parenthesis.
.git/rebase-apply/patch:187: indent with spaces.
          * The spaces must be escaped.
warning: squelched 2 whitespace errors
warning: 7 lines applied after fixing whitespace errors.

> +         /* 
> +          * Scheme allows symbol names to have any character,
> +          * as long as it is not a form of a parenthesis.
> +          * The spaces must be escaped.
> +          */
> +         "(\\.|[^][)(\\}\\{ ])+"),

One or more "dot or anything other than SP or parentheses"?  But
a dot "." is neither a space or any {bra-ce} letter, so would the
above be equivalent to

	"[^][()\\{\\} \t]+"

I wonder...

I am also trying to figure out what you wanted to achieve by
mentioning "The spaces must be escaped.".  Did you mean something
like (string->symbol "a symbol with SP in it") is a symbol?  Even
so, I cannot quite guess the significance of that fact wrt the
regexp you added here?

As we are trying to catch program identifiers (symbols in scheme)
and numeric literals, treating any group of non-whitespace letters
that is delimited by one or more whitespaces as a "word" would be a
good first-order approximation, but in addition, as can be seen in
an example like (a(b(c))), parentheses can also serve as such "word
delimiters" in addition to whitespaces.  So the regexp given above
makes sense to me from that angle, especially if you do not limit
the whitespace to only SP, but include HT (\t) as well.  But was
that how you came up with the regexp?

Thanks.

>  PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
>  	 "[={}\"]|[^={}\" \t]+"),
>  PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",

Junio C Hamano March 27, 2021, 11:09 p.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

> Atharva Raykar <raykar.ath@gmail.com> writes:
> ...
>> +           (ChangeMe 'suite-name tests)))))))
>> \ No newline at end of file
>
> Is there a good reason to leave the final line incomplete?  ...
> I am also trying to figure out what you wanted to achieve ...

Taking all of them together, here is what I hope you may agree as
its improved version.  The only differences from what you posted are
corrections to all the "\ No newline at end of file" and the simplification
of the pattern (remove "a dot" from the alternative and add \t next
to SP).  Without changes, the new tests still pass so ... ;-)

    diff --git c/userdiff.c w/userdiff.c
    index 5fd0eb31ec..685fe712aa 100644
    --- c/userdiff.c
    +++ w/userdiff.c
    @@ -193,12 +193,8 @@ PATTERNS("rust",
             "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
     PATTERNS("scheme",
             "^[\t ]*(\\(define-?.*)$",
    -	 /*
    -	  * Scheme allows symbol names to have any character,
    -	  * as long as it is not a form of a parenthesis.
    -	  * The spaces must be escaped.
    -	  */
    -	 "(\\.|[^][)(\\}\\{ ])+"),
    +	 /* whitespace separated tokens, but parentheses also can delimit words */
    +	 "([^][)(\\}\\{ \t])+"),
     PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
             "[={}\"]|[^={}\" \t]+"),
     PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",

----- >8 ---------- >8 ---------- >8 ---------- >8 ---------- >8 -----
From: Atharva Raykar <raykar.ath@gmail.com>
Date: Sat, 27 Mar 2021 23:09:38 +0530
Subject: [PATCH] userdiff: add support for Scheme

Add a diff driver for Scheme (R5RS and R6RS) which
recognizes top level and local `define` forms,
whether it is a function definition, binding, syntax
definition or a user-defined `define-xyzzy` form.

The rationale for picking `define` forms for the
hunk headers is because it is usually the only
significant form for defining the structure of the
program, and it is a common pattern for schemers to
have local function definitions to hide their
visibility, so it is not only the top level
`define`'s that are of interest. Schemers also
extend the language with macros to provide their
own define forms (for example, something like a
`define-test-suite`) which is also captured in the
hunk header.

Since the identifier syntax is quite forgiving, we start
our word regexp from "words delimited by whitespaces" and
then loosen to include various forms of parentheses characters
to word-delimiters.

Signed-off-by: Atharva Raykar <raykar.ath@gmail.com>
[jc: simplified word regex and its explanation; fixed whitespace errors]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/gitattributes.txt    | 2 ++
 t/t4018-diff-funcname.sh           | 1 +
 t/t4018/scheme-define-syntax       | 8 ++++++++
 t/t4018/scheme-local-define        | 4 ++++
 t/t4018/scheme-top-level-define    | 4 ++++
 t/t4018/scheme-user-defined-define | 6 ++++++
 t/t4034-diff-words.sh              | 1 +
 t/t4034/scheme/expect              | 9 +++++++++
 t/t4034/scheme/post                | 4 ++++
 t/t4034/scheme/pre                 | 4 ++++
 userdiff.c                         | 4 ++++
 11 files changed, 47 insertions(+)
 create mode 100644 t/t4018/scheme-define-syntax
 create mode 100644 t/t4018/scheme-local-define
 create mode 100644 t/t4018/scheme-top-level-define
 create mode 100644 t/t4018/scheme-user-defined-define
 create mode 100644 t/t4034/scheme/expect
 create mode 100644 t/t4034/scheme/post
 create mode 100644 t/t4034/scheme/pre

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 0a60472bb5..cfcfa800c2 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -845,6 +845,8 @@ patterns are available:
 
 - `rust` suitable for source code in the Rust language.
 
+- `scheme` suitable for source code in the Scheme language.
+
 - `tex` suitable for source code for LaTeX documents.
 
 
diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh
index 9675bc17db..823ea96acb 100755
--- a/t/t4018-diff-funcname.sh
+++ b/t/t4018-diff-funcname.sh
@@ -48,6 +48,7 @@ diffpatterns="
 	python
 	ruby
 	rust
+	scheme
 	tex
 	custom1
 	custom2
diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax
new file mode 100644
index 0000000000..33fa50c844
--- /dev/null
+++ b/t/t4018/scheme-define-syntax
@@ -0,0 +1,8 @@
+(define-syntax define-test-suite RIGHT
+  (syntax-rules ()
+    ((_ suite-name (name test) ChangeMe ...)
+     (define suite-name
+       (let ((tests
+              `((name . ,test) ...)))
+         (lambda ()
+           (ChangeMe 'suite-name tests)))))))
diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
new file mode 100644
index 0000000000..bc6d8aebbe
--- /dev/null
+++ b/t/t4018/scheme-local-define
@@ -0,0 +1,4 @@
+(define (higher-order)
+  (define local-function RIGHT
+    (lambda (x)
+     (car "this is" "ChangeMe"))))
diff --git a/t/t4018/scheme-top-level-define b/t/t4018/scheme-top-level-define
new file mode 100644
index 0000000000..624743c22b
--- /dev/null
+++ b/t/t4018/scheme-top-level-define
@@ -0,0 +1,4 @@
+(define (some-func x y z) RIGHT
+  (let ((a x)
+        (b y))
+        (ChangeMe a b)))
diff --git a/t/t4018/scheme-user-defined-define b/t/t4018/scheme-user-defined-define
new file mode 100644
index 0000000000..70e403c5e2
--- /dev/null
+++ b/t/t4018/scheme-user-defined-define
@@ -0,0 +1,6 @@
+(define-test-suite record-case-tests RIGHT
+  (record-case-1 (lambda (fail)
+                   (let ((a (make-foo 1 2)))
+                     (record-case a
+                       ((bar x) (ChangeMe))
+                       ((foo a b) (+ a b)))))))
diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
index 56f1e62a97..ee7721ab91 100755
--- a/t/t4034-diff-words.sh
+++ b/t/t4034-diff-words.sh
@@ -325,6 +325,7 @@ test_language_driver perl
 test_language_driver php
 test_language_driver python
 test_language_driver ruby
+test_language_driver scheme
 test_language_driver tex
 
 test_expect_success 'word-diff with diff.sbe' '
diff --git a/t/t4034/scheme/expect b/t/t4034/scheme/expect
new file mode 100644
index 0000000000..eed21e803c
--- /dev/null
+++ b/t/t4034/scheme/expect
@@ -0,0 +1,9 @@
+<BOLD>diff --git a/pre b/post<RESET>
+<BOLD>index 6a5efba..7c4a6b4 100644<RESET>
+<BOLD>--- a/pre<RESET>
+<BOLD>+++ b/post<RESET>
+<CYAN>@@ -1,4 +1,4 @@<RESET>
+(define (<RED>myfunc a b<RESET><GREEN>my-func first second<RESET>)
+  ; This is a <RED>really<RESET><GREEN>(moderately)<RESET> cool function.
+  (let ((c (<RED>+ a b<RESET><GREEN>add1 first<RESET>)))
+    (format "one more than the total is %d" (<RED>add1<RESET><GREEN>+<RESET> c <GREEN>second<RESET>))))
diff --git a/t/t4034/scheme/post b/t/t4034/scheme/post
new file mode 100644
index 0000000000..28f59c6584
--- /dev/null
+++ b/t/t4034/scheme/post
@@ -0,0 +1,4 @@
+(define (my-func first second)
+  ; This is a (moderately) cool function.
+  (let ((c (add1 first)))
+    (format "one more than the total is %d" (+ c second))))
diff --git a/t/t4034/scheme/pre b/t/t4034/scheme/pre
new file mode 100644
index 0000000000..4bd0069493
--- /dev/null
+++ b/t/t4034/scheme/pre
@@ -0,0 +1,4 @@
+(define (myfunc a b)
+  ; This is a really cool function.
+  (let ((c (+ a b)))
+    (format "one more than the total is %d" (add1 c))))
diff --git a/userdiff.c b/userdiff.c
index 3f81a2261c..685fe712aa 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -191,6 +191,10 @@ PATTERNS("rust",
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
 	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
+PATTERNS("scheme",
+	 "^[\t ]*(\\(define-?.*)$",
+	 /* whitespace separated tokens, but parentheses also can delimit words */
+	 "([^][)(\\}\\{ \t])+"),
 PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
 	 "[={}\"]|[^={}\" \t]+"),
 PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",

Johannes Sixt March 27, 2021, 11:46 p.m. UTC | #3

Am 27.03.21 um 18:39 schrieb Atharva Raykar:
>  - By best-effort attempt at the wordregex, I mean that it is a little
>    more permissive than it has to be, as it accepts a few words that are
>    technically invalid in Scheme.
>    Making it handle all cases like numbers and identifiers with separate
>    regexen would be greatly complicated (Eg: #x#e10.2f3 is a valid number
>    but #x#f10.2e3 is not; 10t1 is a valid identifier, but 10s1 is a number
>    -- my wordregex just clubs all of these into a generic 'word match' which
>    trades of granularity for simplicity, and it usually does the right thing).

It is ok to have regex that capture tokens that are not valid. A
userdiff driver can assume that it operates only text that is valid in
the language.

> diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax
> new file mode 100644
> index 0000000000..603b99cea4
> --- /dev/null
> +++ b/t/t4018/scheme-define-syntax
> @@ -0,0 +1,8 @@
> +(define-syntax define-test-suite RIGHT
> +  (syntax-rules ()
> +    ((_ suite-name (name test) ChangeMe ...)
> +     (define suite-name

This test is suspicious. Notice the "ChangeMe" above? That is sufficient
to let the test case succeed. The "ChangeMe" in the last line below
should be the only one.

But then there is this indented '(define' that is not marked as RIGHT,
and I wonder how is it different from...

> +       (let ((tests
> +              `((name . ,test) ...)))
> +         (lambda ()
> +           (ChangeMe 'suite-name tests)))))))
> \ No newline at end of file
> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
> new file mode 100644
> index 0000000000..90e75dcce8
> --- /dev/null
> +++ b/t/t4018/scheme-local-define
> @@ -0,0 +1,4 @@
> +(define (higher-order)
> +  (define local-function RIGHT

... this one, which is also indented and *is* marked as RIGHT.

BTW, it's good to see test cases for both indented and not-indented
trigger lines.

> +    (lambda (x)
> +     (car "this is" "ChangeMe"))))
> \ No newline at end of file

> diff --git a/userdiff.c b/userdiff.c
> index 3f81a2261c..c51a8c98ba 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -191,6 +191,14 @@ PATTERNS("rust",
>  	 "[a-zA-Z_][a-zA-Z0-9_]*"
>  	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
>  	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
> +PATTERNS("scheme",
> +         "^[\t ]*(\\(define-?.*)$",

This "optional hyphen followed by anything" in the regex is strange.
Wouldn't that also capture a line that looks like, e.g.,

    (defined-foo bar)

Perhaps we want "define[- \t].*" in the regex?

> +         /* 
> +          * Scheme allows symbol names to have any character,
> +          * as long as it is not a form of a parenthesis.
> +          * The spaces must be escaped.
> +          */
> +         "(\\.|[^][)(\\}\\{ ])+"),
>  PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
>  	 "[={}\"]|[^={}\" \t]+"),
>  PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
> 

-- Hannes

Ævar Arnfjörð Bjarmason March 28, 2021, 3:16 a.m. UTC | #4

On Sun, Mar 28 2021, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> Atharva Raykar <raykar.ath@gmail.com> writes:
>> ...
>>> +           (ChangeMe 'suite-name tests)))))))
>>> \ No newline at end of file
>>
>> Is there a good reason to leave the final line incomplete?  ...
>> I am also trying to figure out what you wanted to achieve ...
>
> Taking all of them together, here is what I hope you may agree as
> its improved version.  The only differences from what you posted are
> corrections to all the "\ No newline at end of file" and the simplification
> of the pattern (remove "a dot" from the alternative and add \t next
> to SP).  Without changes, the new tests still pass so ... ;-)
>
>     diff --git c/userdiff.c w/userdiff.c
>     index 5fd0eb31ec..685fe712aa 100644
>     --- c/userdiff.c
>     +++ w/userdiff.c
>     @@ -193,12 +193,8 @@ PATTERNS("rust",
>              "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
>      PATTERNS("scheme",
>              "^[\t ]*(\\(define-?.*)$",
>     -	 /*
>     -	  * Scheme allows symbol names to have any character,
>     -	  * as long as it is not a form of a parenthesis.
>     -	  * The spaces must be escaped.
>     -	  */
>     -	 "(\\.|[^][)(\\}\\{ ])+"),
>     +	 /* whitespace separated tokens, but parentheses also can delimit words */
>     +	 "([^][)(\\}\\{ \t])+"),
>      PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
>              "[={}\"]|[^={}\" \t]+"),
>      PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
>
> ----- >8 ---------- >8 ---------- >8 ---------- >8 ---------- >8 -----
> From: Atharva Raykar <raykar.ath@gmail.com>
> Date: Sat, 27 Mar 2021 23:09:38 +0530
> Subject: [PATCH] userdiff: add support for Scheme
>
> Add a diff driver for Scheme (R5RS and R6RS) which
> recognizes top level and local `define` forms,
> whether it is a function definition, binding, syntax
> definition or a user-defined `define-xyzzy` form.
>
> The rationale for picking `define` forms for the
> hunk headers is because it is usually the only
> significant form for defining the structure of the
> program, and it is a common pattern for schemers to
> have local function definitions to hide their
> visibility, so it is not only the top level
> `define`'s that are of interest. Schemers also
> extend the language with macros to provide their
> own define forms (for example, something like a
> `define-test-suite`) which is also captured in the
> hunk header.
>
> Since the identifier syntax is quite forgiving, we start
> our word regexp from "words delimited by whitespaces" and
> then loosen to include various forms of parentheses characters
> to word-delimiters.
>
> Signed-off-by: Atharva Raykar <raykar.ath@gmail.com>
> [jc: simplified word regex and its explanation; fixed whitespace errors]
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  Documentation/gitattributes.txt    | 2 ++
>  t/t4018-diff-funcname.sh           | 1 +
>  t/t4018/scheme-define-syntax       | 8 ++++++++
>  t/t4018/scheme-local-define        | 4 ++++
>  t/t4018/scheme-top-level-define    | 4 ++++
>  t/t4018/scheme-user-defined-define | 6 ++++++
>  t/t4034-diff-words.sh              | 1 +
>  t/t4034/scheme/expect              | 9 +++++++++
>  t/t4034/scheme/post                | 4 ++++
>  t/t4034/scheme/pre                 | 4 ++++
>  userdiff.c                         | 4 ++++
>  11 files changed, 47 insertions(+)
>  create mode 100644 t/t4018/scheme-define-syntax
>  create mode 100644 t/t4018/scheme-local-define
>  create mode 100644 t/t4018/scheme-top-level-define
>  create mode 100644 t/t4018/scheme-user-defined-define
>  create mode 100644 t/t4034/scheme/expect
>  create mode 100644 t/t4034/scheme/post
>  create mode 100644 t/t4034/scheme/pre
>
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 0a60472bb5..cfcfa800c2 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -845,6 +845,8 @@ patterns are available:
>  
>  - `rust` suitable for source code in the Rust language.
>  
> +- `scheme` suitable for source code in the Scheme language.
> +
>  - `tex` suitable for source code for LaTeX documents.
>  
>  
> diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh
> index 9675bc17db..823ea96acb 100755
> --- a/t/t4018-diff-funcname.sh
> +++ b/t/t4018-diff-funcname.sh
> @@ -48,6 +48,7 @@ diffpatterns="
>  	python
>  	ruby
>  	rust
> +	scheme
>  	tex
>  	custom1
>  	custom2
> diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax
> new file mode 100644
> index 0000000000..33fa50c844
> --- /dev/null
> +++ b/t/t4018/scheme-define-syntax
> @@ -0,0 +1,8 @@
> +(define-syntax define-test-suite RIGHT
> +  (syntax-rules ()
> +    ((_ suite-name (name test) ChangeMe ...)
> +     (define suite-name
> +       (let ((tests
> +              `((name . ,test) ...)))
> +         (lambda ()
> +           (ChangeMe 'suite-name tests)))))))
> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
> new file mode 100644
> index 0000000000..bc6d8aebbe
> --- /dev/null
> +++ b/t/t4018/scheme-local-define
> @@ -0,0 +1,4 @@
> +(define (higher-order)
> +  (define local-function RIGHT
> +    (lambda (x)
> +     (car "this is" "ChangeMe"))))
> diff --git a/t/t4018/scheme-top-level-define b/t/t4018/scheme-top-level-define
> new file mode 100644
> index 0000000000..624743c22b
> --- /dev/null
> +++ b/t/t4018/scheme-top-level-define
> @@ -0,0 +1,4 @@
> +(define (some-func x y z) RIGHT
> +  (let ((a x)
> +        (b y))
> +        (ChangeMe a b)))
> diff --git a/t/t4018/scheme-user-defined-define b/t/t4018/scheme-user-defined-define
> new file mode 100644
> index 0000000000..70e403c5e2
> --- /dev/null
> +++ b/t/t4018/scheme-user-defined-define
> @@ -0,0 +1,6 @@
> +(define-test-suite record-case-tests RIGHT
> +  (record-case-1 (lambda (fail)
> +                   (let ((a (make-foo 1 2)))
> +                     (record-case a
> +                       ((bar x) (ChangeMe))
> +                       ((foo a b) (+ a b)))))))
> diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
> index 56f1e62a97..ee7721ab91 100755
> --- a/t/t4034-diff-words.sh
> +++ b/t/t4034-diff-words.sh
> @@ -325,6 +325,7 @@ test_language_driver perl
>  test_language_driver php
>  test_language_driver python
>  test_language_driver ruby
> +test_language_driver scheme
>  test_language_driver tex
>  
>  test_expect_success 'word-diff with diff.sbe' '
> diff --git a/t/t4034/scheme/expect b/t/t4034/scheme/expect
> new file mode 100644
> index 0000000000..eed21e803c
> --- /dev/null
> +++ b/t/t4034/scheme/expect
> @@ -0,0 +1,9 @@
> +<BOLD>diff --git a/pre b/post<RESET>
> +<BOLD>index 6a5efba..7c4a6b4 100644<RESET>
> +<BOLD>--- a/pre<RESET>
> +<BOLD>+++ b/post<RESET>
> +<CYAN>@@ -1,4 +1,4 @@<RESET>
> +(define (<RED>myfunc a b<RESET><GREEN>my-func first second<RESET>)
> +  ; This is a <RED>really<RESET><GREEN>(moderately)<RESET> cool function.
> +  (let ((c (<RED>+ a b<RESET><GREEN>add1 first<RESET>)))
> +    (format "one more than the total is %d" (<RED>add1<RESET><GREEN>+<RESET> c <GREEN>second<RESET>))))
> diff --git a/t/t4034/scheme/post b/t/t4034/scheme/post
> new file mode 100644
> index 0000000000..28f59c6584
> --- /dev/null
> +++ b/t/t4034/scheme/post
> @@ -0,0 +1,4 @@
> +(define (my-func first second)
> +  ; This is a (moderately) cool function.
> +  (let ((c (add1 first)))
> +    (format "one more than the total is %d" (+ c second))))
> diff --git a/t/t4034/scheme/pre b/t/t4034/scheme/pre
> new file mode 100644
> index 0000000000..4bd0069493
> --- /dev/null
> +++ b/t/t4034/scheme/pre
> @@ -0,0 +1,4 @@
> +(define (myfunc a b)
> +  ; This is a really cool function.
> +  (let ((c (+ a b)))
> +    (format "one more than the total is %d" (add1 c))))
> diff --git a/userdiff.c b/userdiff.c
> index 3f81a2261c..685fe712aa 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -191,6 +191,10 @@ PATTERNS("rust",
>  	 "[a-zA-Z_][a-zA-Z0-9_]*"
>  	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
>  	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
> +PATTERNS("scheme",
> +	 "^[\t ]*(\\(define-?.*)$",

The "define-?.*" can be simplified to just "define.*", but looking at
the tests is that the intent? From the tests it looks like "define[- ]"
is what the author wants, unless this is meant to also match
"(definements".

Has this been tested on some real-world scheme code? E.g. I have guile
installed locally, and it has really large top-level eval-when
blocks. These rules would jump over those to whatever the function above
them is.

> +	 /* whitespace separated tokens, but parentheses also can delimit words */
> +	 "([^][)(\\}\\{ \t])+"),
>  PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
>  	 "[={}\"]|[^={}\" \t]+"),
>  PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",

Junio C Hamano March 28, 2021, 5:37 a.m. UTC | #5

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

[jc: redirecting the question to patch author---I am just a messenger]

>> diff --git a/userdiff.c b/userdiff.c
>> index 3f81a2261c..685fe712aa 100644
>> --- a/userdiff.c
>> +++ b/userdiff.c
>> @@ -191,6 +191,10 @@ PATTERNS("rust",
>>  	 "[a-zA-Z_][a-zA-Z0-9_]*"
>>  	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
>>  	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
>> +PATTERNS("scheme",
>> +	 "^[\t ]*(\\(define-?.*)$",
>
> The "define-?.*" can be simplified to just "define.*", but looking at
> the tests is that the intent? From the tests it looks like "define[- ]"
> is what the author wants, unless this is meant to also match
> "(definements".
>
> Has this been tested on some real-world scheme code? E.g. I have guile
> installed locally, and it has really large top-level eval-when
> blocks. These rules would jump over those to whatever the function above
> them is.
>
>> +	 /* whitespace separated tokens, but parentheses also can delimit words */
>> +	 "([^][)(\\}\\{ \t])+"),
>>  PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
>>  	 "[={}\"]|[^={}\" \t]+"),
>>  PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",

Atharva Raykar March 28, 2021, 11:51 a.m. UTC | #6

On 28-Mar-2021, at 04:20, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Atharva Raykar <raykar.ath@gmail.com> writes:
> 
>> diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax
>> new file mode 100644
>> index 0000000000..603b99cea4
>> --- /dev/null
>> +++ b/t/t4018/scheme-define-syntax
>> @@ -0,0 +1,8 @@
>> +(define-syntax define-test-suite RIGHT
>> +  (syntax-rules ()
>> +    ((_ suite-name (name test) ChangeMe ...)
>> +     (define suite-name
>> +       (let ((tests
>> +              `((name . ,test) ...)))
>> +         (lambda ()
>> +           (ChangeMe 'suite-name tests)))))))
>> \ No newline at end of file
> 
> Is there a good reason to leave the final line incomplete?  If there
> isn't, complete it (applies to other newly-created files in the patch).

Will do.

>> diff --git a/userdiff.c b/userdiff.c
>> index 3f81a2261c..c51a8c98ba 100644
>> --- a/userdiff.c
>> +++ b/userdiff.c
>> @@ -191,6 +191,14 @@ PATTERNS("rust",
>> 	 "[a-zA-Z_][a-zA-Z0-9_]*"
>> 	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
>> 	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
>> +PATTERNS("scheme",
>> +         "^[\t ]*(\\(define-?.*)$",
> 
> Didn't "git diff HEAD" before committing (or "git show") highlighted
> these whitespace errors?
> 
> .git/rebase-apply/patch:183: indent with spaces.
>         "^[\t ]*(\\(define-?.*)$",
> .git/rebase-apply/patch:184: trailing whitespace, indent with spaces.
>         /* 
> .git/rebase-apply/patch:185: indent with spaces.
>          * Scheme allows symbol names to have any character,
> .git/rebase-apply/patch:186: indent with spaces.
>          * as long as it is not a form of a parenthesis.
> .git/rebase-apply/patch:187: indent with spaces.
>          * The spaces must be escaped.
> warning: squelched 2 whitespace errors
> warning: 7 lines applied after fixing whitespace errors.

It did highlight the spaces (which I accidentally overlooked), but I
didn’t receive these warnings. It shows up with the --check flag though.
I'll recheck my configuration. Thanks for pointing this out.

> 
>> +         /* 
>> +          * Scheme allows symbol names to have any character,
>> +          * as long as it is not a form of a parenthesis.
>> +          * The spaces must be escaped.
>> +          */
>> +         "(\\.|[^][)(\\}\\{ ])+"),
> 
> One or more "dot or anything other than SP or parentheses"?  But
> a dot "." is neither a space or any {bra-ce} letter, so would the
> above be equivalent to
> 
> 	"[^][()\\{\\} \t]+"
> 
> I wonder...

A backslash is allowed in scheme identifiers, and I erroneously thought that
the first part handles the case for identifiers such as `component\new` or 
`\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently
than the one git is using, my bad.)

> I am also trying to figure out what you wanted to achieve by
> mentioning "The spaces must be escaped.".  Did you mean something
> like (string->symbol "a symbol with SP in it") is a symbol?  Even
> so, I cannot quite guess the significance of that fact wrt the
> regexp you added here?

I initially tried using identifiers like `space\ separated` and they
seemed to work in my REPL, but turns out space separated identifiers in
scheme do not work with backslashes, and it was working because of the way
my terminal handled escaping. Space separated identifiers are declared like
`|space separated|` and this too only seems to work with Racket, not
the other Scheme implementations. So I stand corrected here, and it's better
to drop this feature altogether.

But somehow, the regexp you suggested, ie:

	"[^][()\\{\\} \t]+"

does not handle the case of make\foo -> make\bar (it will only diff on foo).
I am not too sure why it treats backslashes as delimiters.

This seems to actually do what I was going for:

	"(\\\\|[^][)(\\}\\{ ])+"

> As we are trying to catch program identifiers (symbols in scheme)
> and numeric literals, treating any group of non-whitespace letters
> that is delimited by one or more whitespaces as a "word" would be a
> good first-order approximation, but in addition, as can be seen in
> an example like (a(b(c))), parentheses can also serve as such "word
> delimiters" in addition to whitespaces.  So the regexp given above
> makes sense to me from that angle, especially if you do not limit
> the whitespace to only SP, but include HT (\t) as well.  But was
> that how you came up with the regexp?

Yes, this is exactly what I was trying to express. All words should be
delimited by either whitespace or a parenthesis, and all other special
characters should be accepted as part of the word.

Atharva Raykar March 28, 2021, 12:23 p.m. UTC | #7

On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
>> diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax
>> new file mode 100644
>> index 0000000000..603b99cea4
>> --- /dev/null
>> +++ b/t/t4018/scheme-define-syntax
>> @@ -0,0 +1,8 @@
>> +(define-syntax define-test-suite RIGHT
>> +  (syntax-rules ()
>> +    ((_ suite-name (name test) ChangeMe ...)
>> +     (define suite-name
> 
> This test is suspicious. Notice the "ChangeMe" above? That is sufficient
> to let the test case succeed. The "ChangeMe" in the last line below
> should be the only one.

Thanks for pointing this out. The second "ChangeMe" was not supposed to be
there.

What I wanted to test was the hunk header showing the line for
'(define-syntax ...' and not the internal '(define ...' below it. Thus the
ChangeMe should be located above the internal define so that the hunk header
would show define-syntax and not the local define.

> But then there is this indented '(define' that is not marked as RIGHT,
> and I wonder how is it different from...
> 
>> +       (let ((tests
>> +              `((name . ,test) ...)))
>> +         (lambda ()
>> +           (ChangeMe 'suite-name tests)))))))
>> \ No newline at end of file
>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
>> new file mode 100644
>> index 0000000000..90e75dcce8
>> --- /dev/null
>> +++ b/t/t4018/scheme-local-define
>> @@ -0,0 +1,4 @@
>> +(define (higher-order)
>> +  (define local-function RIGHT
> 
> ... this one, which is also indented and *is* marked as RIGHT.

In this test case, I was explicitly testing for an indented '(define'
whereas in the former, I was testing for the top-level '(define-syntax',
which happened to have an internal define (which will inevitably show up
in a lot of scheme code).

>> +    (lambda (x)
>> +     (car "this is" "ChangeMe"))))
>> \ No newline at end of file
> 
>> diff --git a/userdiff.c b/userdiff.c
>> index 3f81a2261c..c51a8c98ba 100644
>> --- a/userdiff.c
>> +++ b/userdiff.c
>> @@ -191,6 +191,14 @@ PATTERNS("rust",
>> 	 "[a-zA-Z_][a-zA-Z0-9_]*"
>> 	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
>> 	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
>> +PATTERNS("scheme",
>> +         "^[\t ]*(\\(define-?.*)$",
> 
> This "optional hyphen followed by anything" in the regex is strange.
> Wouldn't that also capture a line that looks like, e.g.,
> 
>    (defined-foo bar)
> 
> Perhaps we want "define[- \t].*" in the regex?

Yes, this is what I intended to do, thanks for correcting it.

Atharva Raykar March 28, 2021, 12:40 p.m. UTC | #8

On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> The "define-?.*" can be simplified to just "define.*", but looking at
> the tests is that the intent? From the tests it looks like "define[- ]"
> is what the author wants, unless this is meant to also match
> "(definements".

Yes, you captured my intent correctly. Will fix it.

> Has this been tested on some real-world scheme code? E.g. I have guile
> installed locally, and it has really large top-level eval-when
> blocks. These rules would jump over those to whatever the function above
> them is.

I do not have a large scheme codebase on my own, I usually use Racket,
which is a much larger language with many more forms. Other Schemes like
Guile also extend the language a lot, like in your example, eval-when is
an extension provided by Guile (and Chicken and Chez), but not a part of
the R6RS document when I searched its index.

So the 'define' forms are the only one that I know would reliably be present
across all schemes. But one can also make a case where some of these non-standard
forms may be common enough that they are worth adding in. In that case which
forms to include? Should we consider everything in the SRFI's[1]? Should the
various module definitions of Racket be included? It's a little tricky to know
where to stop.

That being said, I will try to run this through more Scheme codebases that I can
find and see if there are any forms that seem to show up commonly enough that they
are worth including.

[1] https://en.wikipedia.org/wiki/Scheme_Requests_for_Implementation

Atharva Raykar March 28, 2021, 12:45 p.m. UTC | #9

On 28-Mar-2021, at 04:39, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Atharva Raykar <raykar.ath@gmail.com> writes:
>> ...
>>> +           (ChangeMe 'suite-name tests)))))))
>>> \ No newline at end of file
>> 
>> Is there a good reason to leave the final line incomplete?  ...
>> I am also trying to figure out what you wanted to achieve ...
> 
> Taking all of them together, here is what I hope you may agree as
> its improved version.  The only differences from what you posted are
> corrections to all the "\ No newline at end of file" and the simplification
> of the pattern (remove "a dot" from the alternative and add \t next
> to SP).  Without changes, the new tests still pass so ... ;-)
> 
>    diff --git c/userdiff.c w/userdiff.c
>    index 5fd0eb31ec..685fe712aa 100644
>    --- c/userdiff.c
>    +++ w/userdiff.c
>    @@ -193,12 +193,8 @@ PATTERNS("rust",
>             "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
>     PATTERNS("scheme",
>             "^[\t ]*(\\(define-?.*)$",
>    -	 /*
>    -	  * Scheme allows symbol names to have any character,
>    -	  * as long as it is not a form of a parenthesis.
>    -	  * The spaces must be escaped.
>    -	  */
>    -	 "(\\.|[^][)(\\}\\{ ])+"),
>    +	 /* whitespace separated tokens, but parentheses also can delimit words */
>    +	 "([^][)(\\}\\{ \t])+"),
>     PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
>             "[={}\"]|[^={}\" \t]+"),
>     PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",

Thanks for these. I will eventually send another patch with the whitespaces corrected,
and try to see if there is a better way to handle backslashes, other than the regex I
suggested. I will also be writing another test case to check that case properly.

I will also incorporate the other changes suggested by Johannes and Ævar as well,
My regex was not supposed to capture forms like `defined-thing`. And there are a
few rough edges with some of my test cases, which I will correct as well in the next
patch. It is also worth spending some more time and see if there is any other form other
than definitions that a Scheme programmer other than myself may be interested in. I will
consult a few Scheme communities and mailing lists and see what more experienced
programmers have to say.

Junio C Hamano March 28, 2021, 6:06 p.m. UTC | #10

Atharva Raykar <raykar.ath@gmail.com> writes:

>>> +         "(\\.|[^][)(\\}\\{ ])+"),
>> 
>> One or more "dot or anything other than SP or parentheses"?  But
>> a dot "." is neither a space or any {bra-ce} letter, so would the
>> above be equivalent to
>> 
>> 	"[^][()\\{\\} \t]+"
>> 
>> I wonder...
>
> A backslash is allowed in scheme identifiers, and I erroneously thought that
> the first part handles the case for identifiers such as `component\new` or 
> `\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently
> than the one git is using, my bad.)

Ah, perhaps you didn't have enough backslashes.  A half of the
doubled one before the dot is eaten by the C compiler, so the regexp
engine is seeing only a single backslash before the dot, which means
"literally a single dot".  If you meant "literally a single
backslash, followed by any single char", you probably would write 4
backslashes and a dot---half of the backslashes would be eaten by
the compiler, so you'd be passing two backslashes and a dot, which
is probably what you meant.

Having said that, two further points.

 - the "anything but whitespaces and various forms of parentheses"
   set would include backslash, so 'component\new' would be taken as
   a single word with "[^][()\\{\\} \t]+", wouldn't it?

 - how common is the use of backslashes in identifiers?  I am trying
   to see if the additional complexity needed to support them is
   worth the benefit.

> But somehow, the regexp you suggested, ie:
>
> 	"[^][()\\{\\} \t]+"
>
> does not handle the case of make\foo -> make\bar (it will only diff on foo).
> I am not too sure why it treats backslashes as delimiters.

Perhaps because you have included two backslashes inside [] to say
"backslash is not a word character" in the original, and I blindly
copied that?  IOW, do you need to quote {} inside []?

> Yes, this is exactly what I was trying to express. All words should be
> delimited by either whitespace or a parenthesis, and all other special
> characters should be accepted as part of the word.

That sentence after "All words should be..." would be a good comment
to replace what you wrote in the original, then ;-).

Atharva Raykar March 29, 2021, 8:12 a.m. UTC | #11

On 28-Mar-2021, at 23:36, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Atharva Raykar <raykar.ath@gmail.com> writes:
> 
>>>> +         "(\\.|[^][)(\\}\\{ ])+"),
>>> 
>>> One or more "dot or anything other than SP or parentheses"?  But
>>> a dot "." is neither a space or any {bra-ce} letter, so would the
>>> above be equivalent to
>>> 
>>> 	"[^][()\\{\\} \t]+"
>>> 
>>> I wonder...
>> 
>> A backslash is allowed in scheme identifiers, and I erroneously thought that
>> the first part handles the case for identifiers such as `component\new` or 
>> `\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently
>> than the one git is using, my bad.)
> 
> Ah, perhaps you didn't have enough backslashes.  A half of the
> doubled one before the dot is eaten by the C compiler, so the regexp
> engine is seeing only a single backslash before the dot, which means
> "literally a single dot".  If you meant "literally a single
> backslash, followed by any single char", you probably would write 4
> backslashes and a dot---half of the backslashes would be eaten by
> the compiler, so you'd be passing two backslashes and a dot, which
> is probably what you meant.
> 
> Having said that, two further points.
> 
> - the "anything but whitespaces and various forms of parentheses"
>   set would include backslash, so 'component\new' would be taken as
>   a single word with "[^][()\\{\\} \t]+", wouldn't it?
> 
> - how common is the use of backslashes in identifiers?  I am trying
>   to see if the additional complexity needed to support them is
>   worth the benefit.

I have refined the regex, and now it is much simpler and does all of what
I want it to:

	"([^][)(}{[:space:]])+"

I did not have to escape the various parentheses, so I avoided the need to
handle backslashes separately. The "\\t" was causing problems as well because
it took it as a '\' followed by a 't' (Thanks to j416 on #git-devel for
helping me out on this).

>> Yes, this is exactly what I was trying to express. All words should be
>> delimited by either whitespace or a parenthesis, and all other special
>> characters should be accepted as part of the word.
> 
> That sentence after "All words should be..." would be a good comment
> to replace what you wrote in the original, then ;-).

Yes, that should make it a lot more clear.

Phillip Wood March 29, 2021, 10:08 a.m. UTC | #12

Hi Atharva

On 28/03/2021 13:40, Atharva Raykar wrote:
> On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> The "define-?.*" can be simplified to just "define.*", but looking at
>> the tests is that the intent? From the tests it looks like "define[- ]"
>> is what the author wants, unless this is meant to also match
>> "(definements".
> 
> Yes, you captured my intent correctly. Will fix it.
> 
>> Has this been tested on some real-world scheme code? E.g. I have guile
>> installed locally, and it has really large top-level eval-when
>> blocks. These rules would jump over those to whatever the function above
>> them is.
> 
> I do not have a large scheme codebase on my own, I usually use Racket,
> which is a much larger language with many more forms. Other Schemes like
> Guile also extend the language a lot, like in your example, eval-when is
> an extension provided by Guile (and Chicken and Chez), but not a part of
> the R6RS document when I searched its index.
> 
> So the 'define' forms are the only one that I know would reliably be present
> across all schemes. But one can also make a case where some of these non-standard
> forms may be common enough that they are worth adding in. In that case which
> forms to include? Should we consider everything in the SRFI's[1]? Should the
> various module definitions of Racket be included? It's a little tricky to know
> where to stop.

If there are some common forms such as eval-when then it would be good 
to include them, otherwise we end up needing a different rule for each 
scheme implementation as they all seem to tweak something. Gerbil uses 
'def...' e.g def, defsyntax, defstruct, defrules rather than define, 
define-syntax, define-record etc. I'm not user if we want to accommodate 
that or not.

Best Wishes

Phillip


> That being said, I will try to run this through more Scheme codebases that I can
> find and see if there are any forms that seem to show up commonly enough that they
> are worth including.
> 
> [1] https://en.wikipedia.org/wiki/Scheme_Requests_for_Implementation
>

Phillip Wood March 29, 2021, 10:12 a.m. UTC | #13

Hi Atharva

On 28/03/2021 12:51, Atharva Raykar wrote:
> On 28-Mar-2021, at 04:20, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Atharva Raykar <raykar.ath@gmail.com> writes:
>>
>>> +         /*
>>> +          * Scheme allows symbol names to have any character,
>>> +          * as long as it is not a form of a parenthesis.
>>> +          * The spaces must be escaped.
>>> +          */
>>> +         "(\\.|[^][)(\\}\\{ ])+"),
>>
>> One or more "dot or anything other than SP or parentheses"?  But
>> a dot "." is neither a space or any {bra-ce} letter, so would the
>> above be equivalent to
>>
>> 	"[^][()\\{\\} \t]+"
>>
>> I wonder...
> 
> A backslash is allowed in scheme identifiers, and I erroneously thought that
> the first part handles the case for identifiers such as `component\new` or
> `\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently
> than the one git is using, my bad.)
> 
>> I am also trying to figure out what you wanted to achieve by
>> mentioning "The spaces must be escaped.".  Did you mean something
>> like (string->symbol "a symbol with SP in it") is a symbol?  Even
>> so, I cannot quite guess the significance of that fact wrt the
>> regexp you added here?
> 
> I initially tried using identifiers like `space\ separated` and they
> seemed to work in my REPL, but turns out space separated identifiers in
> scheme do not work with backslashes, and it was working because of the way
> my terminal handled escaping. Space separated identifiers are declared like
> `|space separated|` and this too only seems to work with Racket, not
> the other Scheme implementations.

I think the bar notation works with some other such as gambit and 
possibly guile (it's a while since I used the latter)

Best wishes

Phillip

  So I stand corrected here, and it's better
> to drop this feature altogether.
> 
> But somehow, the regexp you suggested, ie:
> 
> 	"[^][()\\{\\} \t]+"
> 
> does not handle the case of make\foo -> make\bar (it will only diff on foo).
> I am not too sure why it treats backslashes as delimiters.
> 
> This seems to actually do what I was going for:
> 
> 	"(\\\\|[^][)(\\}\\{ ])+"
> 
>> As we are trying to catch program identifiers (symbols in scheme)
>> and numeric literals, treating any group of non-whitespace letters
>> that is delimited by one or more whitespaces as a "word" would be a
>> good first-order approximation, but in addition, as can be seen in
>> an example like (a(b(c))), parentheses can also serve as such "word
>> delimiters" in addition to whitespaces.  So the regexp given above
>> makes sense to me from that angle, especially if you do not limit
>> the whitespace to only SP, but include HT (\t) as well.  But was
>> that how you came up with the regexp?
> 
> Yes, this is exactly what I was trying to express. All words should be
> delimited by either whitespace or a parenthesis, and all other special
> characters should be accepted as part of the word.
>

Phillip Wood March 29, 2021, 10:18 a.m. UTC | #14

Hi Atharva

On 28/03/2021 13:23, Atharva Raykar wrote:
> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
 > [...]
>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
>>> new file mode 100644
>>> index 0000000000..90e75dcce8
>>> --- /dev/null
>>> +++ b/t/t4018/scheme-local-define
>>> @@ -0,0 +1,4 @@
>>> +(define (higher-order)
>>> +  (define local-function RIGHT
>>
>> ... this one, which is also indented and *is* marked as RIGHT.
> 
> In this test case, I was explicitly testing for an indented '(define'
> whereas in the former, I was testing for the top-level '(define-syntax',
> which happened to have an internal define (which will inevitably show up
> in a lot of scheme code).

It would be nice to include indented define forms but including them 
means that any change to the body of a function is attributed to the 
last internal definition rather than the actual function. For example

(define (f arg)
   (define (g x)
     (+ 1 x))

   (some-func ...)
   ;;any change here will have '(define (g x)' in the hunk header, not 
'(define (f arg)'

I don't think this can be avoided as we rely on regexs rather than 
parsing the source so it is probably best to only match toplevel defines.

Best Wishes

Phillip

> 
>>> +    (lambda (x)
>>> +     (car "this is" "ChangeMe"))))
>>> \ No newline at end of file
>>
>>> diff --git a/userdiff.c b/userdiff.c
>>> index 3f81a2261c..c51a8c98ba 100644
>>> --- a/userdiff.c
>>> +++ b/userdiff.c
>>> @@ -191,6 +191,14 @@ PATTERNS("rust",
>>> 	 "[a-zA-Z_][a-zA-Z0-9_]*"
>>> 	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
>>> 	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
>>> +PATTERNS("scheme",
>>> +         "^[\t ]*(\\(define-?.*)$",
>>
>> This "optional hyphen followed by anything" in the regex is strange.
>> Wouldn't that also capture a line that looks like, e.g.,
>>
>>     (defined-foo bar)
>>
>> Perhaps we want "define[- \t].*" in the regex?
> 
> Yes, this is what I intended to do, thanks for correcting it.
>

Johannes Sixt March 29, 2021, 10:48 a.m. UTC | #15

Am 29.03.21 um 12:18 schrieb Phillip Wood:
> It would be nice to include indented define forms but including them
> means that any change to the body of a function is attributed to the
> last internal definition rather than the actual function. For example
> 
> (define (f arg)
>   (define (g x)
>     (+ 1 x))
> 
>   (some-func ...)
>   ;;any change here will have '(define (g x)' in the hunk header, not
> '(define (f arg)'
> 
> I don't think this can be avoided as we rely on regexs rather than
> parsing the source so it is probably best to only match toplevel defines.

There can be two rules, one that matches '(define-' that is indented,
and another one that matches all non-indented forms of definitions. If
that is what you mean.

-- Hannes

Ævar Arnfjörð Bjarmason March 29, 2021, 1:12 p.m. UTC | #16

On Mon, Mar 29 2021, Johannes Sixt wrote:

> Am 29.03.21 um 12:18 schrieb Phillip Wood:
>> It would be nice to include indented define forms but including them
>> means that any change to the body of a function is attributed to the
>> last internal definition rather than the actual function. For example
>> 
>> (define (f arg)
>>   (define (g x)
>>     (+ 1 x))
>> 
>>   (some-func ...)
>>   ;;any change here will have '(define (g x)' in the hunk header, not
>> '(define (f arg)'
>> 
>> I don't think this can be avoided as we rely on regexs rather than
>> parsing the source so it is probably best to only match toplevel defines.
>
> There can be two rules, one that matches '(define-' that is indented,
> and another one that matches all non-indented forms of definitions. If
> that is what you mean.

Yes, but that doesn't help in these sorts of cases because what a rule
like that really wants is some version of "don't match this line, but
only if you can reasonably match this other rule".

We can only do rule precedence on a per-line basis via the inverted
matches.

So for languages like cl/elisp/scheme and others where it's common to
have nested function definitions (then -W would like the top-level) *OR*
similarly looking nested function definitions, but the top-level isn't a
function but a (setq) or whatever we're basically stuck with picking one
or the other.

I've pondered how to get around this problem in my userdiff.c hacking
without resorting to supporting some general-purpose Turing machine, and
have so far come up with nothing.

You can see lots of prior art by grepping Emacs's source code for
beginning-of-defun, it solves this problem by exposing a Turing machine
:)

Phillip Wood March 29, 2021, 2:06 p.m. UTC | #17

On 29/03/2021 14:12, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 29 2021, Johannes Sixt wrote:
> 
>> Am 29.03.21 um 12:18 schrieb Phillip Wood:
>>> It would be nice to include indented define forms but including them
>>> means that any change to the body of a function is attributed to the
>>> last internal definition rather than the actual function. For example
>>>
>>> (define (f arg)
>>>    (define (g x)
>>>      (+ 1 x))
>>>
>>>    (some-func ...)
>>>    ;;any change here will have '(define (g x)' in the hunk header, not
>>> '(define (f arg)'
>>>
>>> I don't think this can be avoided as we rely on regexs rather than
>>> parsing the source so it is probably best to only match toplevel defines.
>>
>> There can be two rules, one that matches '(define-' that is indented,
>> and another one that matches all non-indented forms of definitions. If
>> that is what you mean.
> 
> Yes, but that doesn't help in these sorts of cases because what a rule
> like that really wants is some version of "don't match this line, but
> only if you can reasonably match this other rule".
> 
> We can only do rule precedence on a per-line basis via the inverted
> matches.
> 
> So for languages like cl/elisp/scheme and others where it's common to
> have nested function definitions (then -W would like the top-level) *OR*
> similarly looking nested function definitions, but the top-level isn't a
> function but a (setq) or whatever we're basically stuck with picking one
> or the other.

Exactly

> I've pondered how to get around this problem in my userdiff.c hacking
> without resorting to supporting some general-purpose Turing machine, and
> have so far come up with nothing.

I think using an indentation heuristic would probably work quite well 
for most languages - see 
https://public-inbox.org/git/20200923215859.102981-1-rtzoeller@rtzoeller.com/ 
for a discussion from last year (from memory there were some problems 
with the approach in those patches but I think there are some suggestion 
from Peff and me later in the thread on how they could be overcome)

Best Wishes

Phillip


> You can see lots of prior art by grepping Emacs's source code for
> beginning-of-defun, it solves this problem by exposing a Turing machine
> :)
>

Junio C Hamano March 29, 2021, 8:47 p.m. UTC | #18

Atharva Raykar <raykar.ath@gmail.com> writes:

>> Having said that, two further points.
>> 
>> - the "anything but whitespaces and various forms of parentheses"
>>   set would include backslash, so 'component\new' would be taken as
>>   a single word with "[^][()\\{\\} \t]+", wouldn't it?
>> 
>> - how common is the use of backslashes in identifiers?  I am trying
>>   to see if the additional complexity needed to support them is
>>   worth the benefit.
>
> I have refined the regex, and now it is much simpler and does all of what
> I want it to:
>
> 	"([^][)(}{[:space:]])+"

OK, [:space:] is already used elsewhere, so it would be OK.

In practice, the only difference from "[ \t]" (which is used in many
other patterns in the same file) is that [:space:] class includes
form-feed (\Ctrl-L); nobody would write vertical-tab in the code,
and the matching is done one line at a time, so the fact that LF (or
CRLF) is in the [:space:] class does not make a difference anyway.

> I did not have to escape the various parentheses, so I avoided the need to
> handle backslashes separately. The "\\t" was causing problems as well because

If you spelled "\\t" that would have caused a problem of your own
making ;-)

I think what I gave in the message you are responding to was a
single backslash followed by a 't', to let the compiler turn them
into a single HT character, and that wouldn't have had such a
problem---in fact "[ \t]" is used in many other existing rules in
the same file.

Thanks.

Atharva Raykar March 30, 2021, 6:41 a.m. UTC | #19

On 29-Mar-2021, at 15:38, Phillip Wood <phillip.wood123@gmail.com> wrote:
> On 28/03/2021 13:40, Atharva Raykar wrote:
>> On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>> The "define-?.*" can be simplified to just "define.*", but looking at
>>> the tests is that the intent? From the tests it looks like "define[- ]"
>>> is what the author wants, unless this is meant to also match
>>> "(definements".
>> Yes, you captured my intent correctly. Will fix it.
>>> Has this been tested on some real-world scheme code? E.g. I have guile
>>> installed locally, and it has really large top-level eval-when
>>> blocks. These rules would jump over those to whatever the function above
>>> them is.
>> I do not have a large scheme codebase on my own, I usually use Racket,
>> which is a much larger language with many more forms. Other Schemes like
>> Guile also extend the language a lot, like in your example, eval-when is
>> an extension provided by Guile (and Chicken and Chez), but not a part of
>> the R6RS document when I searched its index.
>> So the 'define' forms are the only one that I know would reliably be present
>> across all schemes. But one can also make a case where some of these non-standard
>> forms may be common enough that they are worth adding in. In that case which
>> forms to include? Should we consider everything in the SRFI's[1]? Should the
>> various module definitions of Racket be included? It's a little tricky to know
>> where to stop.
> 
> If there are some common forms such as eval-when then it would be good to include them, otherwise we end up needing a different rule for each scheme implementation as they all seem to tweak something. Gerbil uses 'def...' e.g def, defsyntax, defstruct, defrules rather than define, define-syntax, define-record etc. I'm not user if we want to accommodate that or not.

Yes, this is the part that is hard for me to figure out. I am going by
two heuristics: what Scheme communities in other places would generally
prefer, and what patterns I see happen more often in scheme code.

The former is tricky to do. I posted to a few mailing lists about this,
but they don't seem active enough to garner any responses.

The latter is a little easier to measure quickly. I did a GitHub search,
where I filtered results to only consider Scheme files (language:scheme).

Some armchair stats, just for a broad understanding:

  Total number of scheme files: 529,339
  No. of times a construct is used in those files:
    define and its variants : 431,090 (81.4%)
    def and its variants    :  18,466 ( 3.5%)
    eval-when               :   3,375 ( 0.6%)

There was no way for me to quickly know which of these uses are at the top
level, but either way of the more structural forms that do show up in Scheme
code, define and its variants seem like a clear winner. I am not sure if
it's worth adding more rules to check for def and its variants, given that
they are not nearly as common.

Atharva Raykar March 30, 2021, 7:04 a.m. UTC | #20

> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote:
> 
> Hi Atharva
> 
> On 28/03/2021 13:23, Atharva Raykar wrote:
>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
> > [...]
>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
>>>> new file mode 100644
>>>> index 0000000000..90e75dcce8
>>>> --- /dev/null
>>>> +++ b/t/t4018/scheme-local-define
>>>> @@ -0,0 +1,4 @@
>>>> +(define (higher-order)
>>>> +  (define local-function RIGHT
>>> 
>>> ... this one, which is also indented and *is* marked as RIGHT.
>> In this test case, I was explicitly testing for an indented '(define'
>> whereas in the former, I was testing for the top-level '(define-syntax',
>> which happened to have an internal define (which will inevitably show up
>> in a lot of scheme code).
> 
> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example
> 
> (define (f arg)
>  (define (g x)
>    (+ 1 x))
> 
>  (some-func ...)
>  ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)'

The reason I went for this over the top level forms, is because
I felt it was useful to see the nearest definition for internal
functions that often have a lot of the actual business logic of
the program (at least a lot of SICP seems to follow this pattern).
The disadvantage is as you said, it might also catch trivial inner
functions and the developer might lose context.

Another problem is it may match more trivial bindings, like:

(define (some-func things)
  ...
  (define items '(eggs
                  ham
                  peanut-butter))
  ...)

What I have noticed *anecdotally* is that this is not common enough
to be too much of a problem, and local define bindings seem to be more
favoured in Racket than other Schemes, that use 'let' more often.

> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines.

The other issue with only matching top level defines is that a
lot of scheme programs are library definitions, something like

(library
    (foo bar)
  (export ...)
  (define ...)
  (define ...)
  ;; and a bunch of other definitions...
)

Only matching top level defines will completely ignore matching all
the definitions in these files.

Atharva Raykar March 30, 2021, 10:22 a.m. UTC | #21

> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote:
> 
> 
> 
>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote:
>> 
>> Hi Atharva
>> 
>> On 28/03/2021 13:23, Atharva Raykar wrote:
>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
>>> [...]
>>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
>>>>> new file mode 100644
>>>>> index 0000000000..90e75dcce8
>>>>> --- /dev/null
>>>>> +++ b/t/t4018/scheme-local-define
>>>>> @@ -0,0 +1,4 @@
>>>>> +(define (higher-order)
>>>>> +  (define local-function RIGHT
>>>> 
>>>> ... this one, which is also indented and *is* marked as RIGHT.
>>> In this test case, I was explicitly testing for an indented '(define'
>>> whereas in the former, I was testing for the top-level '(define-syntax',
>>> which happened to have an internal define (which will inevitably show up
>>> in a lot of scheme code).
>> 
>> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example
>> 
>> (define (f arg)
>> (define (g x)
>>   (+ 1 x))
>> 
>> (some-func ...)
>> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)'
> 
> The reason I went for this over the top level forms, is because
> I felt it was useful to see the nearest definition for internal
> functions that often have a lot of the actual business logic of
> the program (at least a lot of SICP seems to follow this pattern).
> The disadvantage is as you said, it might also catch trivial inner
> functions and the developer might lose context.

Never mind this message, I had misunderstood the problem you were trying to
demonstrate. I wholeheartedly agree with what you are trying to say, and
the indentation heuristic discussed does look interesting. I shall have a
glance at the RFC you linked in the other reply.

> The disadvantage is as you said, it might also catch trivial inner
> functions and the developer might lose context.

Feel free to disregard me misquoting you here. You did not say that (:

> Another problem is it may match more trivial bindings, like:
> 
> (define (some-func things)
>  ...
>  (define items '(eggs
>                  ham
>                  peanut-butter))
>  ...)
> 
> What I have noticed *anecdotally* is that this is not common enough
> to be too much of a problem, and local define bindings seem to be more
> favoured in Racket than other Schemes, that use 'let' more often.
> 
>> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines.
> 
> The other issue with only matching top level defines is that a
> lot of scheme programs are library definitions, something like
> 
> (library
>    (foo bar)
>  (export ...)
>  (define ...)
>  (define ...)
>  ;; and a bunch of other definitions...
> )
> 
> Only matching top level defines will completely ignore matching all
> the definitions in these files.

That said, I still stand by the fact that only catching top level defines
will lead to a lot of definitions being ignored. Maybe the occasional
mismatch may be worth the gain in the number of function contexts being
detected?

Ævar Arnfjörð Bjarmason March 30, 2021, 12:56 p.m. UTC | #22

On Tue, Mar 30 2021, Atharva Raykar wrote:

> On 29-Mar-2021, at 15:38, Phillip Wood <phillip.wood123@gmail.com> wrote:
>> On 28/03/2021 13:40, Atharva Raykar wrote:
>>> On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>>> The "define-?.*" can be simplified to just "define.*", but looking at
>>>> the tests is that the intent? From the tests it looks like "define[- ]"
>>>> is what the author wants, unless this is meant to also match
>>>> "(definements".
>>> Yes, you captured my intent correctly. Will fix it.
>>>> Has this been tested on some real-world scheme code? E.g. I have guile
>>>> installed locally, and it has really large top-level eval-when
>>>> blocks. These rules would jump over those to whatever the function above
>>>> them is.
>>> I do not have a large scheme codebase on my own, I usually use Racket,
>>> which is a much larger language with many more forms. Other Schemes like
>>> Guile also extend the language a lot, like in your example, eval-when is
>>> an extension provided by Guile (and Chicken and Chez), but not a part of
>>> the R6RS document when I searched its index.
>>> So the 'define' forms are the only one that I know would reliably be present
>>> across all schemes. But one can also make a case where some of these non-standard
>>> forms may be common enough that they are worth adding in. In that case which
>>> forms to include? Should we consider everything in the SRFI's[1]? Should the
>>> various module definitions of Racket be included? It's a little tricky to know
>>> where to stop.
>> 
>> If there are some common forms such as eval-when then it would be good to include them, otherwise we end up needing a different rule for each scheme implementation as they all seem to tweak something. Gerbil uses 'def...' e.g def, defsyntax, defstruct, defrules rather than define, define-syntax, define-record etc. I'm not user if we want to accommodate that or not.
>
> Yes, this is the part that is hard for me to figure out. I am going by
> two heuristics: what Scheme communities in other places would generally
> prefer, and what patterns I see happen more often in scheme code.
>
> The former is tricky to do. I posted to a few mailing lists about this,
> but they don't seem active enough to garner any responses.
>
> The latter is a little easier to measure quickly. I did a GitHub search,
> where I filtered results to only consider Scheme files (language:scheme).
>
> Some armchair stats, just for a broad understanding:
>
>   Total number of scheme files: 529,339
>   No. of times a construct is used in those files:
>     define and its variants : 431,090 (81.4%)
>     def and its variants    :  18,466 ( 3.5%)
>     eval-when               :   3,375 ( 0.6%)
>
> There was no way for me to quickly know which of these uses are at the top
> level, but either way of the more structural forms that do show up in Scheme
> code, define and its variants seem like a clear winner. I am not sure if
> it's worth adding more rules to check for def and its variants, given that
> they are not nearly as common.

In those cases we should veer on the side of inclusion. The only problem
we'll have is if "eval-when" is a "setq"-like function top-level form in
some other scheme dialect, so we'll have a conflict.

Otherwise it's fine, programs that only use "define" won't be bothered
by an eval-when rule.

Atharva Raykar March 30, 2021, 1:48 p.m. UTC | #23

On 30-Mar-2021, at 18:26, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> 
> On Tue, Mar 30 2021, Atharva Raykar wrote:
> 
>> On 29-Mar-2021, at 15:38, Phillip Wood <phillip.wood123@gmail.com> wrote:
>>> On 28/03/2021 13:40, Atharva Raykar wrote:
>>>> On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>>>> The "define-?.*" can be simplified to just "define.*", but looking at
>>>>> the tests is that the intent? From the tests it looks like "define[- ]"
>>>>> is what the author wants, unless this is meant to also match
>>>>> "(definements".
>>>> Yes, you captured my intent correctly. Will fix it.
>>>>> Has this been tested on some real-world scheme code? E.g. I have guile
>>>>> installed locally, and it has really large top-level eval-when
>>>>> blocks. These rules would jump over those to whatever the function above
>>>>> them is.
>>>> I do not have a large scheme codebase on my own, I usually use Racket,
>>>> which is a much larger language with many more forms. Other Schemes like
>>>> Guile also extend the language a lot, like in your example, eval-when is
>>>> an extension provided by Guile (and Chicken and Chez), but not a part of
>>>> the R6RS document when I searched its index.
>>>> So the 'define' forms are the only one that I know would reliably be present
>>>> across all schemes. But one can also make a case where some of these non-standard
>>>> forms may be common enough that they are worth adding in. In that case which
>>>> forms to include? Should we consider everything in the SRFI's[1]? Should the
>>>> various module definitions of Racket be included? It's a little tricky to know
>>>> where to stop.
>>> 
>>> If there are some common forms such as eval-when then it would be good to include them, otherwise we end up needing a different rule for each scheme implementation as they all seem to tweak something. Gerbil uses 'def...' e.g def, defsyntax, defstruct, defrules rather than define, define-syntax, define-record etc. I'm not user if we want to accommodate that or not.
>> 
>> Yes, this is the part that is hard for me to figure out. I am going by
>> two heuristics: what Scheme communities in other places would generally
>> prefer, and what patterns I see happen more often in scheme code.
>> 
>> The former is tricky to do. I posted to a few mailing lists about this,
>> but they don't seem active enough to garner any responses.
>> 
>> The latter is a little easier to measure quickly. I did a GitHub search,
>> where I filtered results to only consider Scheme files (language:scheme).
>> 
>> Some armchair stats, just for a broad understanding:
>> 
>>  Total number of scheme files: 529,339
>>  No. of times a construct is used in those files:
>>    define and its variants : 431,090 (81.4%)
>>    def and its variants    :  18,466 ( 3.5%)
>>    eval-when               :   3,375 ( 0.6%)
>> 
>> There was no way for me to quickly know which of these uses are at the top
>> level, but either way of the more structural forms that do show up in Scheme
>> code, define and its variants seem like a clear winner. I am not sure if
>> it's worth adding more rules to check for def and its variants, given that
>> they are not nearly as common.
> 
> In those cases we should veer on the side of inclusion. The only problem
> we'll have is if "eval-when" is a "setq"-like function top-level form in
> some other scheme dialect, so we'll have a conflict.
> 
> Otherwise it's fine, programs that only use "define" won't be bothered
> by an eval-when rule.

I would like some clarification, since my knowledge of Common Lisp's setq
and Guile's/Other's eval-when is pretty surface level.

> The only problem we'll have is if "eval-when" is a "setq"-like function
> top-level form in some other scheme dialect, so we'll have a conflict.

I am not sure what you mean when you say if "eval-when" is a "setq"-like
top level form, and exactly what kind of problem it may cause.

I also realized from my understanding of the Guile Documentation[1],
that "eval-when" is used to tell the compiler which expressions should be
made available during the expansion phase.

It does not seem to have anything that may help identify the location of the
hunk, which I understand is the primary purpose of these hunk headers.
All uses of "eval-when" would be some variation of:

	(eval-when (expand load eval) ; no identifier in this form
	  ...)

unlike a "define" which will always name the nearest function, which helps as
a landmark.

Would that be a valid reason to exclude "eval-when"?

Phillip Wood April 5, 2021, 10:04 a.m. UTC | #24

Hi Atharva

On 30/03/2021 11:22, Atharva Raykar wrote:
> 
> 
>> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote:
>>
>>
>>
>>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>
>>> Hi Atharva
>>>
>>> On 28/03/2021 13:23, Atharva Raykar wrote:
>>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
>>>> [...]
>>>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
>>>>>> new file mode 100644
>>>>>> index 0000000000..90e75dcce8
>>>>>> --- /dev/null
>>>>>> +++ b/t/t4018/scheme-local-define
>>>>>> @@ -0,0 +1,4 @@
>>>>>> +(define (higher-order)
>>>>>> +  (define local-function RIGHT
>>>>>
>>>>> ... this one, which is also indented and *is* marked as RIGHT.
>>>> In this test case, I was explicitly testing for an indented '(define'
>>>> whereas in the former, I was testing for the top-level '(define-syntax',
>>>> which happened to have an internal define (which will inevitably show up
>>>> in a lot of scheme code).
>>>
>>> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example
>>>
>>> (define (f arg)
>>> (define (g x)
>>>    (+ 1 x))
>>>
>>> (some-func ...)
>>> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)'
>>
>> The reason I went for this over the top level forms, is because
>> I felt it was useful to see the nearest definition for internal
>> functions that often have a lot of the actual business logic of
>> the program (at least a lot of SICP seems to follow this pattern).
>> The disadvantage is as you said, it might also catch trivial inner
>> functions and the developer might lose context.
> 
> Never mind this message, I had misunderstood the problem you were trying to
> demonstrate. I wholeheartedly agree with what you are trying to say, and
> the indentation heuristic discussed does look interesting. I shall have a
> glance at the RFC you linked in the other reply.
> 
>> The disadvantage is as you said, it might also catch trivial inner
>> functions and the developer might lose context.
> 
> Feel free to disregard me misquoting you here. You did not say that (:
> 
>> Another problem is it may match more trivial bindings, like:
>>
>> (define (some-func things)
>>   ...
>>   (define items '(eggs
>>                   ham
>>                   peanut-butter))
>>   ...)
>>
>> What I have noticed *anecdotally* is that this is not common enough
>> to be too much of a problem, and local define bindings seem to be more
>> favoured in Racket than other Schemes, that use 'let' more often.
>>
>>> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines.
>>
>> The other issue with only matching top level defines is that a
>> lot of scheme programs are library definitions, something like
>>
>> (library
>>     (foo bar)
>>   (export ...)
>>   (define ...)
>>   (define ...)
>>   ;; and a bunch of other definitions...
>> )
>>
>> Only matching top level defines will completely ignore matching all
>> the definitions in these files.
> 
> That said, I still stand by the fact that only catching top level defines
> will lead to a lot of definitions being ignored. Maybe the occasional
> mismatch may be worth the gain in the number of function contexts being
> detected?

I'm not sure that the mismatches will be occasional - every time you 
have an internal definition in a function the hunk header will be wrong 
when you change the main body of the function. This will affect grep 
--function-context and diff -W as well as the normal hunk headers. The 
problem is there is no way to avoid that and provide something useful in 
the library example you have above. It would be useful to find some code 
bases and diff the output of 'git log --patch' with and without the 
leading whitespace match in the function pattern to see how often this 
is a problem (i.e. when the funcnames do not match see which one is 
correct).

Best Wishes

Phillip

Johannes Sixt April 5, 2021, 5:58 p.m. UTC | #25

Am 05.04.21 um 12:04 schrieb Phillip Wood:
> Hi Atharva
> 
> On 30/03/2021 11:22, Atharva Raykar wrote:
>>
>>
>>> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote:
>>>
>>>
>>>
>>>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Atharva
>>>>
>>>> On 28/03/2021 13:23, Atharva Raykar wrote:
>>>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
>>>>> [...]
>>>>>>> diff --git a/t/t4018/scheme-local-define
>>>>>>> b/t/t4018/scheme-local-define
>>>>>>> new file mode 100644
>>>>>>> index 0000000000..90e75dcce8
>>>>>>> --- /dev/null
>>>>>>> +++ b/t/t4018/scheme-local-define
>>>>>>> @@ -0,0 +1,4 @@
>>>>>>> +(define (higher-order)
>>>>>>> +  (define local-function RIGHT
>>>>>>
>>>>>> ... this one, which is also indented and *is* marked as RIGHT.
>>>>> In this test case, I was explicitly testing for an indented '(define'
>>>>> whereas in the former, I was testing for the top-level
>>>>> '(define-syntax',
>>>>> which happened to have an internal define (which will inevitably
>>>>> show up
>>>>> in a lot of scheme code).
>>>>
>>>> It would be nice to include indented define forms but including them
>>>> means that any change to the body of a function is attributed to the
>>>> last internal definition rather than the actual function. For example
>>>>
>>>> (define (f arg)
>>>> (define (g x)
>>>>    (+ 1 x))
>>>>
>>>> (some-func ...)
>>>> ;;any change here will have '(define (g x)' in the hunk header, not
>>>> '(define (f arg)'
>>>
>>> The reason I went for this over the top level forms, is because
>>> I felt it was useful to see the nearest definition for internal
>>> functions that often have a lot of the actual business logic of
>>> the program (at least a lot of SICP seems to follow this pattern).
>>> The disadvantage is as you said, it might also catch trivial inner
>>> functions and the developer might lose context.
>>
>> Never mind this message, I had misunderstood the problem you were
>> trying to
>> demonstrate. I wholeheartedly agree with what you are trying to say, and
>> the indentation heuristic discussed does look interesting. I shall have a
>> glance at the RFC you linked in the other reply.
>>
>>> The disadvantage is as you said, it might also catch trivial inner
>>> functions and the developer might lose context.
>>
>> Feel free to disregard me misquoting you here. You did not say that (:
>>
>>> Another problem is it may match more trivial bindings, like:
>>>
>>> (define (some-func things)
>>>   ...
>>>   (define items '(eggs
>>>                   ham
>>>                   peanut-butter))
>>>   ...)
>>>
>>> What I have noticed *anecdotally* is that this is not common enough
>>> to be too much of a problem, and local define bindings seem to be more
>>> favoured in Racket than other Schemes, that use 'let' more often.
>>>
>>>> I don't think this can be avoided as we rely on regexs rather than
>>>> parsing the source so it is probably best to only match toplevel
>>>> defines.
>>>
>>> The other issue with only matching top level defines is that a
>>> lot of scheme programs are library definitions, something like
>>>
>>> (library
>>>     (foo bar)
>>>   (export ...)
>>>   (define ...)
>>>   (define ...)
>>>   ;; and a bunch of other definitions...
>>> )
>>>
>>> Only matching top level defines will completely ignore matching all
>>> the definitions in these files.
>>
>> That said, I still stand by the fact that only catching top level defines
>> will lead to a lot of definitions being ignored. Maybe the occasional
>> mismatch may be worth the gain in the number of function contexts being
>> detected?
> 
> I'm not sure that the mismatches will be occasional - every time you
> have an internal definition in a function the hunk header will be wrong
> when you change the main body of the function. This will affect grep
> --function-context and diff -W as well as the normal hunk headers. The
> problem is there is no way to avoid that and provide something useful in
> the library example you have above. It would be useful to find some code
> bases and diff the output of 'git log --patch' with and without the
> leading whitespace match in the function pattern to see how often this
> is a problem (i.e. when the funcnames do not match see which one is
> correct).

--function-context is just one application of the function matcher. To
work properly with nested function definitions, it would have to
understand the nesting. But it does not; there is nothing that we can do
about it without a proper language parser. Therefore, the argument that
the matcher does not work well with --function-context for nested
functions is of little relevance.

IMO, the primary concern should be whether the matcher decorates hunk
contexts sufficiently well.

-- Hannes

Atharva Raykar April 6, 2021, 12:29 p.m. UTC | #26

On 05-Apr-2021, at 15:34, Phillip Wood <phillip.wood123@gmail.com> wrote:
> 
> Hi Atharva
> 
> On 30/03/2021 11:22, Atharva Raykar wrote:
>>> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote:
>>> 
>>> 
>>> 
>>>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>> 
>>>> Hi Atharva
>>>> 
>>>> On 28/03/2021 13:23, Atharva Raykar wrote:
>>>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
>>>>> [...]
>>>>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
>>>>>>> new file mode 100644
>>>>>>> index 0000000000..90e75dcce8
>>>>>>> --- /dev/null
>>>>>>> +++ b/t/t4018/scheme-local-define
>>>>>>> @@ -0,0 +1,4 @@
>>>>>>> +(define (higher-order)
>>>>>>> +  (define local-function RIGHT
>>>>>> 
>>>>>> ... this one, which is also indented and *is* marked as RIGHT.
>>>>> In this test case, I was explicitly testing for an indented '(define'
>>>>> whereas in the former, I was testing for the top-level '(define-syntax',
>>>>> which happened to have an internal define (which will inevitably show up
>>>>> in a lot of scheme code).
>>>> 
>>>> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example
>>>> 
>>>> (define (f arg)
>>>> (define (g x)
>>>>   (+ 1 x))
>>>> 
>>>> (some-func ...)
>>>> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)'
>>> 
>>> The reason I went for this over the top level forms, is because
>>> I felt it was useful to see the nearest definition for internal
>>> functions that often have a lot of the actual business logic of
>>> the program (at least a lot of SICP seems to follow this pattern).
>>> The disadvantage is as you said, it might also catch trivial inner
>>> functions and the developer might lose context.
>> Never mind this message, I had misunderstood the problem you were trying to
>> demonstrate. I wholeheartedly agree with what you are trying to say, and
>> the indentation heuristic discussed does look interesting. I shall have a
>> glance at the RFC you linked in the other reply.
>>> The disadvantage is as you said, it might also catch trivial inner
>>> functions and the developer might lose context.
>> Feel free to disregard me misquoting you here. You did not say that (:
>>> Another problem is it may match more trivial bindings, like:
>>> 
>>> (define (some-func things)
>>>  ...
>>>  (define items '(eggs
>>>                  ham
>>>                  peanut-butter))
>>>  ...)
>>> 
>>> What I have noticed *anecdotally* is that this is not common enough
>>> to be too much of a problem, and local define bindings seem to be more
>>> favoured in Racket than other Schemes, that use 'let' more often.
>>> 
>>>> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines.
>>> 
>>> The other issue with only matching top level defines is that a
>>> lot of scheme programs are library definitions, something like
>>> 
>>> (library
>>>    (foo bar)
>>>  (export ...)
>>>  (define ...)
>>>  (define ...)
>>>  ;; and a bunch of other definitions...
>>> )
>>> 
>>> Only matching top level defines will completely ignore matching all
>>> the definitions in these files.
>> That said, I still stand by the fact that only catching top level defines
>> will lead to a lot of definitions being ignored. Maybe the occasional
>> mismatch may be worth the gain in the number of function contexts being
>> detected?
> 
> I'm not sure that the mismatches will be occasional - every time you have an internal definition in a function the hunk header will be wrong when you change the main body of the function. This will affect grep --function-context and diff -W as well as the normal hunk headers. The problem is there is no way to avoid that and provide something useful in the library example you have above. It would be useful to find some code bases and diff the output of 'git log --patch' with and without the leading whitespace match in the function pattern to see how often this is a problem (i.e. when the funcnames do not match see which one is correct).

You are right -- on trying out the function on a two other scheme
codebases, I noticed that there are a lot more wrongly matched functions
than I initially thought. About half of them identify the wrong function
in one of the repositories I tried. However, removing the leading
whitespace in the pattern did not lead to better matching; it just led
to a lot of the hunk headers going blank. I am not sure what causes this
behaviour, but my guess is that the function contexts are shown only if
it is within a certain distance from the function definition?

Even if it did match only the top level defines correctly, the functions
matched would still often be technically wrong -- it will show the outer
function as the context when the user has edited an internal function
(and in Scheme, there is heavy usage of internal functions).

After running 'git grep --function-context' with the leading whitespace
removed, it seems to match too aggressively, as it captures a huge
region to match all the way upto the top level. Especially for files
where all the definitions are in a 'library'.

Overall, I personally felt that there were more downsides to matching
only at the top level. I'd rather the hunk header have the nearest
function to provide the context, than have no function displayed at all.
Even when the match is wrong, it at least helps me locate where the
change was made more easily.


> Best Wishes
> 
> Phillip
> 
>

Phillip Wood April 6, 2021, 7:10 p.m. UTC | #27

Hi Atharva

On 06/04/2021 13:29, Atharva Raykar wrote:
> On 05-Apr-2021, at 15:34, Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> Hi Atharva
>>
>> On 30/03/2021 11:22, Atharva Raykar wrote:
>>>> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>>>
>>>>> Hi Atharva
>>>>>
>>>>> On 28/03/2021 13:23, Atharva Raykar wrote:
>>>>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote:
>>>>>> [...]
>>>>>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
>>>>>>>> new file mode 100644
>>>>>>>> index 0000000000..90e75dcce8
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/t/t4018/scheme-local-define
>>>>>>>> @@ -0,0 +1,4 @@
>>>>>>>> +(define (higher-order)
>>>>>>>> +  (define local-function RIGHT
>>>>>>>
>>>>>>> ... this one, which is also indented and *is* marked as RIGHT.
>>>>>> In this test case, I was explicitly testing for an indented '(define'
>>>>>> whereas in the former, I was testing for the top-level '(define-syntax',
>>>>>> which happened to have an internal define (which will inevitably show up
>>>>>> in a lot of scheme code).
>>>>>
>>>>> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example
>>>>>
>>>>> (define (f arg)
>>>>> (define (g x)
>>>>>    (+ 1 x))
>>>>>
>>>>> (some-func ...)
>>>>> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)'
>>>>
>>>> The reason I went for this over the top level forms, is because
>>>> I felt it was useful to see the nearest definition for internal
>>>> functions that often have a lot of the actual business logic of
>>>> the program (at least a lot of SICP seems to follow this pattern).
>>>> The disadvantage is as you said, it might also catch trivial inner
>>>> functions and the developer might lose context.
>>> Never mind this message, I had misunderstood the problem you were trying to
>>> demonstrate. I wholeheartedly agree with what you are trying to say, and
>>> the indentation heuristic discussed does look interesting. I shall have a
>>> glance at the RFC you linked in the other reply.
>>>> The disadvantage is as you said, it might also catch trivial inner
>>>> functions and the developer might lose context.
>>> Feel free to disregard me misquoting you here. You did not say that (:
>>>> Another problem is it may match more trivial bindings, like:
>>>>
>>>> (define (some-func things)
>>>>   ...
>>>>   (define items '(eggs
>>>>                   ham
>>>>                   peanut-butter))
>>>>   ...)
>>>>
>>>> What I have noticed *anecdotally* is that this is not common enough
>>>> to be too much of a problem, and local define bindings seem to be more
>>>> favoured in Racket than other Schemes, that use 'let' more often.
>>>>
>>>>> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines.
>>>>
>>>> The other issue with only matching top level defines is that a
>>>> lot of scheme programs are library definitions, something like
>>>>
>>>> (library
>>>>     (foo bar)
>>>>   (export ...)
>>>>   (define ...)
>>>>   (define ...)
>>>>   ;; and a bunch of other definitions...
>>>> )
>>>>
>>>> Only matching top level defines will completely ignore matching all
>>>> the definitions in these files.
>>> That said, I still stand by the fact that only catching top level defines
>>> will lead to a lot of definitions being ignored. Maybe the occasional
>>> mismatch may be worth the gain in the number of function contexts being
>>> detected?
>>
>> I'm not sure that the mismatches will be occasional - every time you have an internal definition in a function the hunk header will be wrong when you change the main body of the function. This will affect grep --function-context and diff -W as well as the normal hunk headers. The problem is there is no way to avoid that and provide something useful in the library example you have above. It would be useful to find some code bases and diff the output of 'git log --patch' with and without the leading whitespace match in the function pattern to see how often this is a problem (i.e. when the funcnames do not match see which one is correct).
> 
> You are right -- on trying out the function on a two other scheme
> codebases, I noticed that there are a lot more wrongly matched functions
> than I initially thought. About half of them identify the wrong function
> in one of the repositories I tried. However, removing the leading
> whitespace in the pattern did not lead to better matching; it just led
> to a lot of the hunk headers going blank. I am not sure what causes this
> behaviour, but my guess is that the function contexts are shown only if
> it is within a certain distance from the function definition?
> 
> Even if it did match only the top level defines correctly, the functions
> matched would still often be technically wrong -- it will show the outer
> function as the context when the user has edited an internal function
> (and in Scheme, there is heavy usage of internal functions).
> 
> After running 'git grep --function-context' with the leading whitespace
> removed, it seems to match too aggressively, as it captures a huge
> region to match all the way upto the top level. Especially for files
> where all the definitions are in a 'library'.
> 
> Overall, I personally felt that there were more downsides to matching
> only at the top level. I'd rather the hunk header have the nearest
> function to provide the context, than have no function displayed at all.
> Even when the match is wrong, it at least helps me locate where the
> change was made more easily.

Thanks for taking the time to check the differences between the two 
approaches, as there is no perfect solution I'm happy to go with the one 
that seemed to be best in your investigations

Best Wishes

Phillip

>> Best Wishes
>>
>> Phillip
>>
>>
>

[GSOC] userdiff: add support for Scheme

Commit Message

Comments

Patch