Message ID | 20210327173938.59391-1-raykar.ath@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [GSOC] userdiff: add support for Scheme | expand |
Atharva Raykar <raykar.ath@gmail.com> writes: > diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax > new file mode 100644 > index 0000000000..603b99cea4 > --- /dev/null > +++ b/t/t4018/scheme-define-syntax > @@ -0,0 +1,8 @@ > +(define-syntax define-test-suite RIGHT > + (syntax-rules () > + ((_ suite-name (name test) ChangeMe ...) > + (define suite-name > + (let ((tests > + `((name . ,test) ...))) > + (lambda () > + (ChangeMe 'suite-name tests))))))) > \ No newline at end of file Is there a good reason to leave the final line incomplete? If there isn't, complete it (applies to other newly-created files in the patch). > diff --git a/userdiff.c b/userdiff.c > index 3f81a2261c..c51a8c98ba 100644 > --- a/userdiff.c > +++ b/userdiff.c > @@ -191,6 +191,14 @@ PATTERNS("rust", > "[a-zA-Z_][a-zA-Z0-9_]*" > "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" > "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), > +PATTERNS("scheme", > + "^[\t ]*(\\(define-?.*)$", Didn't "git diff HEAD" before committing (or "git show") highlighted these whitespace errors? .git/rebase-apply/patch:183: indent with spaces. "^[\t ]*(\\(define-?.*)$", .git/rebase-apply/patch:184: trailing whitespace, indent with spaces. /* .git/rebase-apply/patch:185: indent with spaces. * Scheme allows symbol names to have any character, .git/rebase-apply/patch:186: indent with spaces. * as long as it is not a form of a parenthesis. .git/rebase-apply/patch:187: indent with spaces. * The spaces must be escaped. warning: squelched 2 whitespace errors warning: 7 lines applied after fixing whitespace errors. > + /* > + * Scheme allows symbol names to have any character, > + * as long as it is not a form of a parenthesis. > + * The spaces must be escaped. > + */ > + "(\\.|[^][)(\\}\\{ ])+"), One or more "dot or anything other than SP or parentheses"? But a dot "." is neither a space or any {bra-ce} letter, so would the above be equivalent to "[^][()\\{\\} \t]+" I wonder... I am also trying to figure out what you wanted to achieve by mentioning "The spaces must be escaped.". Did you mean something like (string->symbol "a symbol with SP in it") is a symbol? Even so, I cannot quite guess the significance of that fact wrt the regexp you added here? As we are trying to catch program identifiers (symbols in scheme) and numeric literals, treating any group of non-whitespace letters that is delimited by one or more whitespaces as a "word" would be a good first-order approximation, but in addition, as can be seen in an example like (a(b(c))), parentheses can also serve as such "word delimiters" in addition to whitespaces. So the regexp given above makes sense to me from that angle, especially if you do not limit the whitespace to only SP, but include HT (\t) as well. But was that how you came up with the regexp? Thanks. > PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", > "[={}\"]|[^={}\" \t]+"), > PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
Junio C Hamano <gitster@pobox.com> writes: > Atharva Raykar <raykar.ath@gmail.com> writes: > ... >> + (ChangeMe 'suite-name tests))))))) >> \ No newline at end of file > > Is there a good reason to leave the final line incomplete? ... > I am also trying to figure out what you wanted to achieve ... Taking all of them together, here is what I hope you may agree as its improved version. The only differences from what you posted are corrections to all the "\ No newline at end of file" and the simplification of the pattern (remove "a dot" from the alternative and add \t next to SP). Without changes, the new tests still pass so ... ;-) diff --git c/userdiff.c w/userdiff.c index 5fd0eb31ec..685fe712aa 100644 --- c/userdiff.c +++ w/userdiff.c @@ -193,12 +193,8 @@ PATTERNS("rust", "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), PATTERNS("scheme", "^[\t ]*(\\(define-?.*)$", - /* - * Scheme allows symbol names to have any character, - * as long as it is not a form of a parenthesis. - * The spaces must be escaped. - */ - "(\\.|[^][)(\\}\\{ ])+"), + /* whitespace separated tokens, but parentheses also can delimit words */ + "([^][)(\\}\\{ \t])+"), PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", "[={}\"]|[^={}\" \t]+"), PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$", ----- >8 ---------- >8 ---------- >8 ---------- >8 ---------- >8 ----- From: Atharva Raykar <raykar.ath@gmail.com> Date: Sat, 27 Mar 2021 23:09:38 +0530 Subject: [PATCH] userdiff: add support for Scheme Add a diff driver for Scheme (R5RS and R6RS) which recognizes top level and local `define` forms, whether it is a function definition, binding, syntax definition or a user-defined `define-xyzzy` form. The rationale for picking `define` forms for the hunk headers is because it is usually the only significant form for defining the structure of the program, and it is a common pattern for schemers to have local function definitions to hide their visibility, so it is not only the top level `define`'s that are of interest. Schemers also extend the language with macros to provide their own define forms (for example, something like a `define-test-suite`) which is also captured in the hunk header. Since the identifier syntax is quite forgiving, we start our word regexp from "words delimited by whitespaces" and then loosen to include various forms of parentheses characters to word-delimiters. Signed-off-by: Atharva Raykar <raykar.ath@gmail.com> [jc: simplified word regex and its explanation; fixed whitespace errors] Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Documentation/gitattributes.txt | 2 ++ t/t4018-diff-funcname.sh | 1 + t/t4018/scheme-define-syntax | 8 ++++++++ t/t4018/scheme-local-define | 4 ++++ t/t4018/scheme-top-level-define | 4 ++++ t/t4018/scheme-user-defined-define | 6 ++++++ t/t4034-diff-words.sh | 1 + t/t4034/scheme/expect | 9 +++++++++ t/t4034/scheme/post | 4 ++++ t/t4034/scheme/pre | 4 ++++ userdiff.c | 4 ++++ 11 files changed, 47 insertions(+) create mode 100644 t/t4018/scheme-define-syntax create mode 100644 t/t4018/scheme-local-define create mode 100644 t/t4018/scheme-top-level-define create mode 100644 t/t4018/scheme-user-defined-define create mode 100644 t/t4034/scheme/expect create mode 100644 t/t4034/scheme/post create mode 100644 t/t4034/scheme/pre diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index 0a60472bb5..cfcfa800c2 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -845,6 +845,8 @@ patterns are available: - `rust` suitable for source code in the Rust language. +- `scheme` suitable for source code in the Scheme language. + - `tex` suitable for source code for LaTeX documents. diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh index 9675bc17db..823ea96acb 100755 --- a/t/t4018-diff-funcname.sh +++ b/t/t4018-diff-funcname.sh @@ -48,6 +48,7 @@ diffpatterns=" python ruby rust + scheme tex custom1 custom2 diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax new file mode 100644 index 0000000000..33fa50c844 --- /dev/null +++ b/t/t4018/scheme-define-syntax @@ -0,0 +1,8 @@ +(define-syntax define-test-suite RIGHT + (syntax-rules () + ((_ suite-name (name test) ChangeMe ...) + (define suite-name + (let ((tests + `((name . ,test) ...))) + (lambda () + (ChangeMe 'suite-name tests))))))) diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define new file mode 100644 index 0000000000..bc6d8aebbe --- /dev/null +++ b/t/t4018/scheme-local-define @@ -0,0 +1,4 @@ +(define (higher-order) + (define local-function RIGHT + (lambda (x) + (car "this is" "ChangeMe")))) diff --git a/t/t4018/scheme-top-level-define b/t/t4018/scheme-top-level-define new file mode 100644 index 0000000000..624743c22b --- /dev/null +++ b/t/t4018/scheme-top-level-define @@ -0,0 +1,4 @@ +(define (some-func x y z) RIGHT + (let ((a x) + (b y)) + (ChangeMe a b))) diff --git a/t/t4018/scheme-user-defined-define b/t/t4018/scheme-user-defined-define new file mode 100644 index 0000000000..70e403c5e2 --- /dev/null +++ b/t/t4018/scheme-user-defined-define @@ -0,0 +1,6 @@ +(define-test-suite record-case-tests RIGHT + (record-case-1 (lambda (fail) + (let ((a (make-foo 1 2))) + (record-case a + ((bar x) (ChangeMe)) + ((foo a b) (+ a b))))))) diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh index 56f1e62a97..ee7721ab91 100755 --- a/t/t4034-diff-words.sh +++ b/t/t4034-diff-words.sh @@ -325,6 +325,7 @@ test_language_driver perl test_language_driver php test_language_driver python test_language_driver ruby +test_language_driver scheme test_language_driver tex test_expect_success 'word-diff with diff.sbe' ' diff --git a/t/t4034/scheme/expect b/t/t4034/scheme/expect new file mode 100644 index 0000000000..eed21e803c --- /dev/null +++ b/t/t4034/scheme/expect @@ -0,0 +1,9 @@ +<BOLD>diff --git a/pre b/post<RESET> +<BOLD>index 6a5efba..7c4a6b4 100644<RESET> +<BOLD>--- a/pre<RESET> +<BOLD>+++ b/post<RESET> +<CYAN>@@ -1,4 +1,4 @@<RESET> +(define (<RED>myfunc a b<RESET><GREEN>my-func first second<RESET>) + ; This is a <RED>really<RESET><GREEN>(moderately)<RESET> cool function. + (let ((c (<RED>+ a b<RESET><GREEN>add1 first<RESET>))) + (format "one more than the total is %d" (<RED>add1<RESET><GREEN>+<RESET> c <GREEN>second<RESET>)))) diff --git a/t/t4034/scheme/post b/t/t4034/scheme/post new file mode 100644 index 0000000000..28f59c6584 --- /dev/null +++ b/t/t4034/scheme/post @@ -0,0 +1,4 @@ +(define (my-func first second) + ; This is a (moderately) cool function. + (let ((c (add1 first))) + (format "one more than the total is %d" (+ c second)))) diff --git a/t/t4034/scheme/pre b/t/t4034/scheme/pre new file mode 100644 index 0000000000..4bd0069493 --- /dev/null +++ b/t/t4034/scheme/pre @@ -0,0 +1,4 @@ +(define (myfunc a b) + ; This is a really cool function. + (let ((c (+ a b))) + (format "one more than the total is %d" (add1 c)))) diff --git a/userdiff.c b/userdiff.c index 3f81a2261c..685fe712aa 100644 --- a/userdiff.c +++ b/userdiff.c @@ -191,6 +191,10 @@ PATTERNS("rust", "[a-zA-Z_][a-zA-Z0-9_]*" "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), +PATTERNS("scheme", + "^[\t ]*(\\(define-?.*)$", + /* whitespace separated tokens, but parentheses also can delimit words */ + "([^][)(\\}\\{ \t])+"), PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", "[={}\"]|[^={}\" \t]+"), PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
Am 27.03.21 um 18:39 schrieb Atharva Raykar: > - By best-effort attempt at the wordregex, I mean that it is a little > more permissive than it has to be, as it accepts a few words that are > technically invalid in Scheme. > Making it handle all cases like numbers and identifiers with separate > regexen would be greatly complicated (Eg: #x#e10.2f3 is a valid number > but #x#f10.2e3 is not; 10t1 is a valid identifier, but 10s1 is a number > -- my wordregex just clubs all of these into a generic 'word match' which > trades of granularity for simplicity, and it usually does the right thing). It is ok to have regex that capture tokens that are not valid. A userdiff driver can assume that it operates only text that is valid in the language. > diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax > new file mode 100644 > index 0000000000..603b99cea4 > --- /dev/null > +++ b/t/t4018/scheme-define-syntax > @@ -0,0 +1,8 @@ > +(define-syntax define-test-suite RIGHT > + (syntax-rules () > + ((_ suite-name (name test) ChangeMe ...) > + (define suite-name This test is suspicious. Notice the "ChangeMe" above? That is sufficient to let the test case succeed. The "ChangeMe" in the last line below should be the only one. But then there is this indented '(define' that is not marked as RIGHT, and I wonder how is it different from... > + (let ((tests > + `((name . ,test) ...))) > + (lambda () > + (ChangeMe 'suite-name tests))))))) > \ No newline at end of file > diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define > new file mode 100644 > index 0000000000..90e75dcce8 > --- /dev/null > +++ b/t/t4018/scheme-local-define > @@ -0,0 +1,4 @@ > +(define (higher-order) > + (define local-function RIGHT ... this one, which is also indented and *is* marked as RIGHT. BTW, it's good to see test cases for both indented and not-indented trigger lines. > + (lambda (x) > + (car "this is" "ChangeMe")))) > \ No newline at end of file > diff --git a/userdiff.c b/userdiff.c > index 3f81a2261c..c51a8c98ba 100644 > --- a/userdiff.c > +++ b/userdiff.c > @@ -191,6 +191,14 @@ PATTERNS("rust", > "[a-zA-Z_][a-zA-Z0-9_]*" > "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" > "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), > +PATTERNS("scheme", > + "^[\t ]*(\\(define-?.*)$", This "optional hyphen followed by anything" in the regex is strange. Wouldn't that also capture a line that looks like, e.g., (defined-foo bar) Perhaps we want "define[- \t].*" in the regex? > + /* > + * Scheme allows symbol names to have any character, > + * as long as it is not a form of a parenthesis. > + * The spaces must be escaped. > + */ > + "(\\.|[^][)(\\}\\{ ])+"), > PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", > "[={}\"]|[^={}\" \t]+"), > PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$", > -- Hannes
On Sun, Mar 28 2021, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > >> Atharva Raykar <raykar.ath@gmail.com> writes: >> ... >>> + (ChangeMe 'suite-name tests))))))) >>> \ No newline at end of file >> >> Is there a good reason to leave the final line incomplete? ... >> I am also trying to figure out what you wanted to achieve ... > > Taking all of them together, here is what I hope you may agree as > its improved version. The only differences from what you posted are > corrections to all the "\ No newline at end of file" and the simplification > of the pattern (remove "a dot" from the alternative and add \t next > to SP). Without changes, the new tests still pass so ... ;-) > > diff --git c/userdiff.c w/userdiff.c > index 5fd0eb31ec..685fe712aa 100644 > --- c/userdiff.c > +++ w/userdiff.c > @@ -193,12 +193,8 @@ PATTERNS("rust", > "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), > PATTERNS("scheme", > "^[\t ]*(\\(define-?.*)$", > - /* > - * Scheme allows symbol names to have any character, > - * as long as it is not a form of a parenthesis. > - * The spaces must be escaped. > - */ > - "(\\.|[^][)(\\}\\{ ])+"), > + /* whitespace separated tokens, but parentheses also can delimit words */ > + "([^][)(\\}\\{ \t])+"), > PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", > "[={}\"]|[^={}\" \t]+"), > PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$", > > ----- >8 ---------- >8 ---------- >8 ---------- >8 ---------- >8 ----- > From: Atharva Raykar <raykar.ath@gmail.com> > Date: Sat, 27 Mar 2021 23:09:38 +0530 > Subject: [PATCH] userdiff: add support for Scheme > > Add a diff driver for Scheme (R5RS and R6RS) which > recognizes top level and local `define` forms, > whether it is a function definition, binding, syntax > definition or a user-defined `define-xyzzy` form. > > The rationale for picking `define` forms for the > hunk headers is because it is usually the only > significant form for defining the structure of the > program, and it is a common pattern for schemers to > have local function definitions to hide their > visibility, so it is not only the top level > `define`'s that are of interest. Schemers also > extend the language with macros to provide their > own define forms (for example, something like a > `define-test-suite`) which is also captured in the > hunk header. > > Since the identifier syntax is quite forgiving, we start > our word regexp from "words delimited by whitespaces" and > then loosen to include various forms of parentheses characters > to word-delimiters. > > Signed-off-by: Atharva Raykar <raykar.ath@gmail.com> > [jc: simplified word regex and its explanation; fixed whitespace errors] > Signed-off-by: Junio C Hamano <gitster@pobox.com> > --- > Documentation/gitattributes.txt | 2 ++ > t/t4018-diff-funcname.sh | 1 + > t/t4018/scheme-define-syntax | 8 ++++++++ > t/t4018/scheme-local-define | 4 ++++ > t/t4018/scheme-top-level-define | 4 ++++ > t/t4018/scheme-user-defined-define | 6 ++++++ > t/t4034-diff-words.sh | 1 + > t/t4034/scheme/expect | 9 +++++++++ > t/t4034/scheme/post | 4 ++++ > t/t4034/scheme/pre | 4 ++++ > userdiff.c | 4 ++++ > 11 files changed, 47 insertions(+) > create mode 100644 t/t4018/scheme-define-syntax > create mode 100644 t/t4018/scheme-local-define > create mode 100644 t/t4018/scheme-top-level-define > create mode 100644 t/t4018/scheme-user-defined-define > create mode 100644 t/t4034/scheme/expect > create mode 100644 t/t4034/scheme/post > create mode 100644 t/t4034/scheme/pre > > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt > index 0a60472bb5..cfcfa800c2 100644 > --- a/Documentation/gitattributes.txt > +++ b/Documentation/gitattributes.txt > @@ -845,6 +845,8 @@ patterns are available: > > - `rust` suitable for source code in the Rust language. > > +- `scheme` suitable for source code in the Scheme language. > + > - `tex` suitable for source code for LaTeX documents. > > > diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh > index 9675bc17db..823ea96acb 100755 > --- a/t/t4018-diff-funcname.sh > +++ b/t/t4018-diff-funcname.sh > @@ -48,6 +48,7 @@ diffpatterns=" > python > ruby > rust > + scheme > tex > custom1 > custom2 > diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax > new file mode 100644 > index 0000000000..33fa50c844 > --- /dev/null > +++ b/t/t4018/scheme-define-syntax > @@ -0,0 +1,8 @@ > +(define-syntax define-test-suite RIGHT > + (syntax-rules () > + ((_ suite-name (name test) ChangeMe ...) > + (define suite-name > + (let ((tests > + `((name . ,test) ...))) > + (lambda () > + (ChangeMe 'suite-name tests))))))) > diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define > new file mode 100644 > index 0000000000..bc6d8aebbe > --- /dev/null > +++ b/t/t4018/scheme-local-define > @@ -0,0 +1,4 @@ > +(define (higher-order) > + (define local-function RIGHT > + (lambda (x) > + (car "this is" "ChangeMe")))) > diff --git a/t/t4018/scheme-top-level-define b/t/t4018/scheme-top-level-define > new file mode 100644 > index 0000000000..624743c22b > --- /dev/null > +++ b/t/t4018/scheme-top-level-define > @@ -0,0 +1,4 @@ > +(define (some-func x y z) RIGHT > + (let ((a x) > + (b y)) > + (ChangeMe a b))) > diff --git a/t/t4018/scheme-user-defined-define b/t/t4018/scheme-user-defined-define > new file mode 100644 > index 0000000000..70e403c5e2 > --- /dev/null > +++ b/t/t4018/scheme-user-defined-define > @@ -0,0 +1,6 @@ > +(define-test-suite record-case-tests RIGHT > + (record-case-1 (lambda (fail) > + (let ((a (make-foo 1 2))) > + (record-case a > + ((bar x) (ChangeMe)) > + ((foo a b) (+ a b))))))) > diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh > index 56f1e62a97..ee7721ab91 100755 > --- a/t/t4034-diff-words.sh > +++ b/t/t4034-diff-words.sh > @@ -325,6 +325,7 @@ test_language_driver perl > test_language_driver php > test_language_driver python > test_language_driver ruby > +test_language_driver scheme > test_language_driver tex > > test_expect_success 'word-diff with diff.sbe' ' > diff --git a/t/t4034/scheme/expect b/t/t4034/scheme/expect > new file mode 100644 > index 0000000000..eed21e803c > --- /dev/null > +++ b/t/t4034/scheme/expect > @@ -0,0 +1,9 @@ > +<BOLD>diff --git a/pre b/post<RESET> > +<BOLD>index 6a5efba..7c4a6b4 100644<RESET> > +<BOLD>--- a/pre<RESET> > +<BOLD>+++ b/post<RESET> > +<CYAN>@@ -1,4 +1,4 @@<RESET> > +(define (<RED>myfunc a b<RESET><GREEN>my-func first second<RESET>) > + ; This is a <RED>really<RESET><GREEN>(moderately)<RESET> cool function. > + (let ((c (<RED>+ a b<RESET><GREEN>add1 first<RESET>))) > + (format "one more than the total is %d" (<RED>add1<RESET><GREEN>+<RESET> c <GREEN>second<RESET>)))) > diff --git a/t/t4034/scheme/post b/t/t4034/scheme/post > new file mode 100644 > index 0000000000..28f59c6584 > --- /dev/null > +++ b/t/t4034/scheme/post > @@ -0,0 +1,4 @@ > +(define (my-func first second) > + ; This is a (moderately) cool function. > + (let ((c (add1 first))) > + (format "one more than the total is %d" (+ c second)))) > diff --git a/t/t4034/scheme/pre b/t/t4034/scheme/pre > new file mode 100644 > index 0000000000..4bd0069493 > --- /dev/null > +++ b/t/t4034/scheme/pre > @@ -0,0 +1,4 @@ > +(define (myfunc a b) > + ; This is a really cool function. > + (let ((c (+ a b))) > + (format "one more than the total is %d" (add1 c)))) > diff --git a/userdiff.c b/userdiff.c > index 3f81a2261c..685fe712aa 100644 > --- a/userdiff.c > +++ b/userdiff.c > @@ -191,6 +191,10 @@ PATTERNS("rust", > "[a-zA-Z_][a-zA-Z0-9_]*" > "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" > "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), > +PATTERNS("scheme", > + "^[\t ]*(\\(define-?.*)$", The "define-?.*" can be simplified to just "define.*", but looking at the tests is that the intent? From the tests it looks like "define[- ]" is what the author wants, unless this is meant to also match "(definements". Has this been tested on some real-world scheme code? E.g. I have guile installed locally, and it has really large top-level eval-when blocks. These rules would jump over those to whatever the function above them is. > + /* whitespace separated tokens, but parentheses also can delimit words */ > + "([^][)(\\}\\{ \t])+"), > PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", > "[={}\"]|[^={}\" \t]+"), > PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: [jc: redirecting the question to patch author---I am just a messenger] >> diff --git a/userdiff.c b/userdiff.c >> index 3f81a2261c..685fe712aa 100644 >> --- a/userdiff.c >> +++ b/userdiff.c >> @@ -191,6 +191,10 @@ PATTERNS("rust", >> "[a-zA-Z_][a-zA-Z0-9_]*" >> "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" >> "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), >> +PATTERNS("scheme", >> + "^[\t ]*(\\(define-?.*)$", > > The "define-?.*" can be simplified to just "define.*", but looking at > the tests is that the intent? From the tests it looks like "define[- ]" > is what the author wants, unless this is meant to also match > "(definements". > > Has this been tested on some real-world scheme code? E.g. I have guile > installed locally, and it has really large top-level eval-when > blocks. These rules would jump over those to whatever the function above > them is. > >> + /* whitespace separated tokens, but parentheses also can delimit words */ >> + "([^][)(\\}\\{ \t])+"), >> PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", >> "[={}\"]|[^={}\" \t]+"), >> PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
On 28-Mar-2021, at 04:20, Junio C Hamano <gitster@pobox.com> wrote: > > Atharva Raykar <raykar.ath@gmail.com> writes: > >> diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax >> new file mode 100644 >> index 0000000000..603b99cea4 >> --- /dev/null >> +++ b/t/t4018/scheme-define-syntax >> @@ -0,0 +1,8 @@ >> +(define-syntax define-test-suite RIGHT >> + (syntax-rules () >> + ((_ suite-name (name test) ChangeMe ...) >> + (define suite-name >> + (let ((tests >> + `((name . ,test) ...))) >> + (lambda () >> + (ChangeMe 'suite-name tests))))))) >> \ No newline at end of file > > Is there a good reason to leave the final line incomplete? If there > isn't, complete it (applies to other newly-created files in the patch). Will do. >> diff --git a/userdiff.c b/userdiff.c >> index 3f81a2261c..c51a8c98ba 100644 >> --- a/userdiff.c >> +++ b/userdiff.c >> @@ -191,6 +191,14 @@ PATTERNS("rust", >> "[a-zA-Z_][a-zA-Z0-9_]*" >> "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" >> "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), >> +PATTERNS("scheme", >> + "^[\t ]*(\\(define-?.*)$", > > Didn't "git diff HEAD" before committing (or "git show") highlighted > these whitespace errors? > > .git/rebase-apply/patch:183: indent with spaces. > "^[\t ]*(\\(define-?.*)$", > .git/rebase-apply/patch:184: trailing whitespace, indent with spaces. > /* > .git/rebase-apply/patch:185: indent with spaces. > * Scheme allows symbol names to have any character, > .git/rebase-apply/patch:186: indent with spaces. > * as long as it is not a form of a parenthesis. > .git/rebase-apply/patch:187: indent with spaces. > * The spaces must be escaped. > warning: squelched 2 whitespace errors > warning: 7 lines applied after fixing whitespace errors. It did highlight the spaces (which I accidentally overlooked), but I didn’t receive these warnings. It shows up with the --check flag though. I'll recheck my configuration. Thanks for pointing this out. > >> + /* >> + * Scheme allows symbol names to have any character, >> + * as long as it is not a form of a parenthesis. >> + * The spaces must be escaped. >> + */ >> + "(\\.|[^][)(\\}\\{ ])+"), > > One or more "dot or anything other than SP or parentheses"? But > a dot "." is neither a space or any {bra-ce} letter, so would the > above be equivalent to > > "[^][()\\{\\} \t]+" > > I wonder... A backslash is allowed in scheme identifiers, and I erroneously thought that the first part handles the case for identifiers such as `component\new` or `\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently than the one git is using, my bad.) > I am also trying to figure out what you wanted to achieve by > mentioning "The spaces must be escaped.". Did you mean something > like (string->symbol "a symbol with SP in it") is a symbol? Even > so, I cannot quite guess the significance of that fact wrt the > regexp you added here? I initially tried using identifiers like `space\ separated` and they seemed to work in my REPL, but turns out space separated identifiers in scheme do not work with backslashes, and it was working because of the way my terminal handled escaping. Space separated identifiers are declared like `|space separated|` and this too only seems to work with Racket, not the other Scheme implementations. So I stand corrected here, and it's better to drop this feature altogether. But somehow, the regexp you suggested, ie: "[^][()\\{\\} \t]+" does not handle the case of make\foo -> make\bar (it will only diff on foo). I am not too sure why it treats backslashes as delimiters. This seems to actually do what I was going for: "(\\\\|[^][)(\\}\\{ ])+" > As we are trying to catch program identifiers (symbols in scheme) > and numeric literals, treating any group of non-whitespace letters > that is delimited by one or more whitespaces as a "word" would be a > good first-order approximation, but in addition, as can be seen in > an example like (a(b(c))), parentheses can also serve as such "word > delimiters" in addition to whitespaces. So the regexp given above > makes sense to me from that angle, especially if you do not limit > the whitespace to only SP, but include HT (\t) as well. But was > that how you came up with the regexp? Yes, this is exactly what I was trying to express. All words should be delimited by either whitespace or a parenthesis, and all other special characters should be accepted as part of the word.
On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: >> diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax >> new file mode 100644 >> index 0000000000..603b99cea4 >> --- /dev/null >> +++ b/t/t4018/scheme-define-syntax >> @@ -0,0 +1,8 @@ >> +(define-syntax define-test-suite RIGHT >> + (syntax-rules () >> + ((_ suite-name (name test) ChangeMe ...) >> + (define suite-name > > This test is suspicious. Notice the "ChangeMe" above? That is sufficient > to let the test case succeed. The "ChangeMe" in the last line below > should be the only one. Thanks for pointing this out. The second "ChangeMe" was not supposed to be there. What I wanted to test was the hunk header showing the line for '(define-syntax ...' and not the internal '(define ...' below it. Thus the ChangeMe should be located above the internal define so that the hunk header would show define-syntax and not the local define. > But then there is this indented '(define' that is not marked as RIGHT, > and I wonder how is it different from... > >> + (let ((tests >> + `((name . ,test) ...))) >> + (lambda () >> + (ChangeMe 'suite-name tests))))))) >> \ No newline at end of file >> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define >> new file mode 100644 >> index 0000000000..90e75dcce8 >> --- /dev/null >> +++ b/t/t4018/scheme-local-define >> @@ -0,0 +1,4 @@ >> +(define (higher-order) >> + (define local-function RIGHT > > ... this one, which is also indented and *is* marked as RIGHT. In this test case, I was explicitly testing for an indented '(define' whereas in the former, I was testing for the top-level '(define-syntax', which happened to have an internal define (which will inevitably show up in a lot of scheme code). >> + (lambda (x) >> + (car "this is" "ChangeMe")))) >> \ No newline at end of file > >> diff --git a/userdiff.c b/userdiff.c >> index 3f81a2261c..c51a8c98ba 100644 >> --- a/userdiff.c >> +++ b/userdiff.c >> @@ -191,6 +191,14 @@ PATTERNS("rust", >> "[a-zA-Z_][a-zA-Z0-9_]*" >> "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" >> "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), >> +PATTERNS("scheme", >> + "^[\t ]*(\\(define-?.*)$", > > This "optional hyphen followed by anything" in the regex is strange. > Wouldn't that also capture a line that looks like, e.g., > > (defined-foo bar) > > Perhaps we want "define[- \t].*" in the regex? Yes, this is what I intended to do, thanks for correcting it.
On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > The "define-?.*" can be simplified to just "define.*", but looking at > the tests is that the intent? From the tests it looks like "define[- ]" > is what the author wants, unless this is meant to also match > "(definements". Yes, you captured my intent correctly. Will fix it. > Has this been tested on some real-world scheme code? E.g. I have guile > installed locally, and it has really large top-level eval-when > blocks. These rules would jump over those to whatever the function above > them is. I do not have a large scheme codebase on my own, I usually use Racket, which is a much larger language with many more forms. Other Schemes like Guile also extend the language a lot, like in your example, eval-when is an extension provided by Guile (and Chicken and Chez), but not a part of the R6RS document when I searched its index. So the 'define' forms are the only one that I know would reliably be present across all schemes. But one can also make a case where some of these non-standard forms may be common enough that they are worth adding in. In that case which forms to include? Should we consider everything in the SRFI's[1]? Should the various module definitions of Racket be included? It's a little tricky to know where to stop. That being said, I will try to run this through more Scheme codebases that I can find and see if there are any forms that seem to show up commonly enough that they are worth including. [1] https://en.wikipedia.org/wiki/Scheme_Requests_for_Implementation
On 28-Mar-2021, at 04:39, Junio C Hamano <gitster@pobox.com> wrote: > > Junio C Hamano <gitster@pobox.com> writes: > >> Atharva Raykar <raykar.ath@gmail.com> writes: >> ... >>> + (ChangeMe 'suite-name tests))))))) >>> \ No newline at end of file >> >> Is there a good reason to leave the final line incomplete? ... >> I am also trying to figure out what you wanted to achieve ... > > Taking all of them together, here is what I hope you may agree as > its improved version. The only differences from what you posted are > corrections to all the "\ No newline at end of file" and the simplification > of the pattern (remove "a dot" from the alternative and add \t next > to SP). Without changes, the new tests still pass so ... ;-) > > diff --git c/userdiff.c w/userdiff.c > index 5fd0eb31ec..685fe712aa 100644 > --- c/userdiff.c > +++ w/userdiff.c > @@ -193,12 +193,8 @@ PATTERNS("rust", > "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), > PATTERNS("scheme", > "^[\t ]*(\\(define-?.*)$", > - /* > - * Scheme allows symbol names to have any character, > - * as long as it is not a form of a parenthesis. > - * The spaces must be escaped. > - */ > - "(\\.|[^][)(\\}\\{ ])+"), > + /* whitespace separated tokens, but parentheses also can delimit words */ > + "([^][)(\\}\\{ \t])+"), > PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", > "[={}\"]|[^={}\" \t]+"), > PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$", Thanks for these. I will eventually send another patch with the whitespaces corrected, and try to see if there is a better way to handle backslashes, other than the regex I suggested. I will also be writing another test case to check that case properly. I will also incorporate the other changes suggested by Johannes and Ævar as well, My regex was not supposed to capture forms like `defined-thing`. And there are a few rough edges with some of my test cases, which I will correct as well in the next patch. It is also worth spending some more time and see if there is any other form other than definitions that a Scheme programmer other than myself may be interested in. I will consult a few Scheme communities and mailing lists and see what more experienced programmers have to say.
Atharva Raykar <raykar.ath@gmail.com> writes: >>> + "(\\.|[^][)(\\}\\{ ])+"), >> >> One or more "dot or anything other than SP or parentheses"? But >> a dot "." is neither a space or any {bra-ce} letter, so would the >> above be equivalent to >> >> "[^][()\\{\\} \t]+" >> >> I wonder... > > A backslash is allowed in scheme identifiers, and I erroneously thought that > the first part handles the case for identifiers such as `component\new` or > `\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently > than the one git is using, my bad.) Ah, perhaps you didn't have enough backslashes. A half of the doubled one before the dot is eaten by the C compiler, so the regexp engine is seeing only a single backslash before the dot, which means "literally a single dot". If you meant "literally a single backslash, followed by any single char", you probably would write 4 backslashes and a dot---half of the backslashes would be eaten by the compiler, so you'd be passing two backslashes and a dot, which is probably what you meant. Having said that, two further points. - the "anything but whitespaces and various forms of parentheses" set would include backslash, so 'component\new' would be taken as a single word with "[^][()\\{\\} \t]+", wouldn't it? - how common is the use of backslashes in identifiers? I am trying to see if the additional complexity needed to support them is worth the benefit. > But somehow, the regexp you suggested, ie: > > "[^][()\\{\\} \t]+" > > does not handle the case of make\foo -> make\bar (it will only diff on foo). > I am not too sure why it treats backslashes as delimiters. Perhaps because you have included two backslashes inside [] to say "backslash is not a word character" in the original, and I blindly copied that? IOW, do you need to quote {} inside []? > Yes, this is exactly what I was trying to express. All words should be > delimited by either whitespace or a parenthesis, and all other special > characters should be accepted as part of the word. That sentence after "All words should be..." would be a good comment to replace what you wrote in the original, then ;-).
On 28-Mar-2021, at 23:36, Junio C Hamano <gitster@pobox.com> wrote: > > Atharva Raykar <raykar.ath@gmail.com> writes: > >>>> + "(\\.|[^][)(\\}\\{ ])+"), >>> >>> One or more "dot or anything other than SP or parentheses"? But >>> a dot "." is neither a space or any {bra-ce} letter, so would the >>> above be equivalent to >>> >>> "[^][()\\{\\} \t]+" >>> >>> I wonder... >> >> A backslash is allowed in scheme identifiers, and I erroneously thought that >> the first part handles the case for identifiers such as `component\new` or >> `\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently >> than the one git is using, my bad.) > > Ah, perhaps you didn't have enough backslashes. A half of the > doubled one before the dot is eaten by the C compiler, so the regexp > engine is seeing only a single backslash before the dot, which means > "literally a single dot". If you meant "literally a single > backslash, followed by any single char", you probably would write 4 > backslashes and a dot---half of the backslashes would be eaten by > the compiler, so you'd be passing two backslashes and a dot, which > is probably what you meant. > > Having said that, two further points. > > - the "anything but whitespaces and various forms of parentheses" > set would include backslash, so 'component\new' would be taken as > a single word with "[^][()\\{\\} \t]+", wouldn't it? > > - how common is the use of backslashes in identifiers? I am trying > to see if the additional complexity needed to support them is > worth the benefit. I have refined the regex, and now it is much simpler and does all of what I want it to: "([^][)(}{[:space:]])+" I did not have to escape the various parentheses, so I avoided the need to handle backslashes separately. The "\\t" was causing problems as well because it took it as a '\' followed by a 't' (Thanks to j416 on #git-devel for helping me out on this). >> Yes, this is exactly what I was trying to express. All words should be >> delimited by either whitespace or a parenthesis, and all other special >> characters should be accepted as part of the word. > > That sentence after "All words should be..." would be a good comment > to replace what you wrote in the original, then ;-). Yes, that should make it a lot more clear.
Hi Atharva On 28/03/2021 13:40, Atharva Raykar wrote: > On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: >> The "define-?.*" can be simplified to just "define.*", but looking at >> the tests is that the intent? From the tests it looks like "define[- ]" >> is what the author wants, unless this is meant to also match >> "(definements". > > Yes, you captured my intent correctly. Will fix it. > >> Has this been tested on some real-world scheme code? E.g. I have guile >> installed locally, and it has really large top-level eval-when >> blocks. These rules would jump over those to whatever the function above >> them is. > > I do not have a large scheme codebase on my own, I usually use Racket, > which is a much larger language with many more forms. Other Schemes like > Guile also extend the language a lot, like in your example, eval-when is > an extension provided by Guile (and Chicken and Chez), but not a part of > the R6RS document when I searched its index. > > So the 'define' forms are the only one that I know would reliably be present > across all schemes. But one can also make a case where some of these non-standard > forms may be common enough that they are worth adding in. In that case which > forms to include? Should we consider everything in the SRFI's[1]? Should the > various module definitions of Racket be included? It's a little tricky to know > where to stop. If there are some common forms such as eval-when then it would be good to include them, otherwise we end up needing a different rule for each scheme implementation as they all seem to tweak something. Gerbil uses 'def...' e.g def, defsyntax, defstruct, defrules rather than define, define-syntax, define-record etc. I'm not user if we want to accommodate that or not. Best Wishes Phillip > That being said, I will try to run this through more Scheme codebases that I can > find and see if there are any forms that seem to show up commonly enough that they > are worth including. > > [1] https://en.wikipedia.org/wiki/Scheme_Requests_for_Implementation >
Hi Atharva On 28/03/2021 12:51, Atharva Raykar wrote: > On 28-Mar-2021, at 04:20, Junio C Hamano <gitster@pobox.com> wrote: >> >> Atharva Raykar <raykar.ath@gmail.com> writes: >> >>> + /* >>> + * Scheme allows symbol names to have any character, >>> + * as long as it is not a form of a parenthesis. >>> + * The spaces must be escaped. >>> + */ >>> + "(\\.|[^][)(\\}\\{ ])+"), >> >> One or more "dot or anything other than SP or parentheses"? But >> a dot "." is neither a space or any {bra-ce} letter, so would the >> above be equivalent to >> >> "[^][()\\{\\} \t]+" >> >> I wonder... > > A backslash is allowed in scheme identifiers, and I erroneously thought that > the first part handles the case for identifiers such as `component\new` or > `\"id-with-quotes\"`. (I tested it with a regex engine that behaves differently > than the one git is using, my bad.) > >> I am also trying to figure out what you wanted to achieve by >> mentioning "The spaces must be escaped.". Did you mean something >> like (string->symbol "a symbol with SP in it") is a symbol? Even >> so, I cannot quite guess the significance of that fact wrt the >> regexp you added here? > > I initially tried using identifiers like `space\ separated` and they > seemed to work in my REPL, but turns out space separated identifiers in > scheme do not work with backslashes, and it was working because of the way > my terminal handled escaping. Space separated identifiers are declared like > `|space separated|` and this too only seems to work with Racket, not > the other Scheme implementations. I think the bar notation works with some other such as gambit and possibly guile (it's a while since I used the latter) Best wishes Phillip So I stand corrected here, and it's better > to drop this feature altogether. > > But somehow, the regexp you suggested, ie: > > "[^][()\\{\\} \t]+" > > does not handle the case of make\foo -> make\bar (it will only diff on foo). > I am not too sure why it treats backslashes as delimiters. > > This seems to actually do what I was going for: > > "(\\\\|[^][)(\\}\\{ ])+" > >> As we are trying to catch program identifiers (symbols in scheme) >> and numeric literals, treating any group of non-whitespace letters >> that is delimited by one or more whitespaces as a "word" would be a >> good first-order approximation, but in addition, as can be seen in >> an example like (a(b(c))), parentheses can also serve as such "word >> delimiters" in addition to whitespaces. So the regexp given above >> makes sense to me from that angle, especially if you do not limit >> the whitespace to only SP, but include HT (\t) as well. But was >> that how you came up with the regexp? > > Yes, this is exactly what I was trying to express. All words should be > delimited by either whitespace or a parenthesis, and all other special > characters should be accepted as part of the word. >
Hi Atharva On 28/03/2021 13:23, Atharva Raykar wrote: > On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: > [...] >>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define >>> new file mode 100644 >>> index 0000000000..90e75dcce8 >>> --- /dev/null >>> +++ b/t/t4018/scheme-local-define >>> @@ -0,0 +1,4 @@ >>> +(define (higher-order) >>> + (define local-function RIGHT >> >> ... this one, which is also indented and *is* marked as RIGHT. > > In this test case, I was explicitly testing for an indented '(define' > whereas in the former, I was testing for the top-level '(define-syntax', > which happened to have an internal define (which will inevitably show up > in a lot of scheme code). It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example (define (f arg) (define (g x) (+ 1 x)) (some-func ...) ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)' I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines. Best Wishes Phillip > >>> + (lambda (x) >>> + (car "this is" "ChangeMe")))) >>> \ No newline at end of file >> >>> diff --git a/userdiff.c b/userdiff.c >>> index 3f81a2261c..c51a8c98ba 100644 >>> --- a/userdiff.c >>> +++ b/userdiff.c >>> @@ -191,6 +191,14 @@ PATTERNS("rust", >>> "[a-zA-Z_][a-zA-Z0-9_]*" >>> "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" >>> "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), >>> +PATTERNS("scheme", >>> + "^[\t ]*(\\(define-?.*)$", >> >> This "optional hyphen followed by anything" in the regex is strange. >> Wouldn't that also capture a line that looks like, e.g., >> >> (defined-foo bar) >> >> Perhaps we want "define[- \t].*" in the regex? > > Yes, this is what I intended to do, thanks for correcting it. >
Am 29.03.21 um 12:18 schrieb Phillip Wood: > It would be nice to include indented define forms but including them > means that any change to the body of a function is attributed to the > last internal definition rather than the actual function. For example > > (define (f arg) > (define (g x) > (+ 1 x)) > > (some-func ...) > ;;any change here will have '(define (g x)' in the hunk header, not > '(define (f arg)' > > I don't think this can be avoided as we rely on regexs rather than > parsing the source so it is probably best to only match toplevel defines. There can be two rules, one that matches '(define-' that is indented, and another one that matches all non-indented forms of definitions. If that is what you mean. -- Hannes
On Mon, Mar 29 2021, Johannes Sixt wrote: > Am 29.03.21 um 12:18 schrieb Phillip Wood: >> It would be nice to include indented define forms but including them >> means that any change to the body of a function is attributed to the >> last internal definition rather than the actual function. For example >> >> (define (f arg) >> (define (g x) >> (+ 1 x)) >> >> (some-func ...) >> ;;any change here will have '(define (g x)' in the hunk header, not >> '(define (f arg)' >> >> I don't think this can be avoided as we rely on regexs rather than >> parsing the source so it is probably best to only match toplevel defines. > > There can be two rules, one that matches '(define-' that is indented, > and another one that matches all non-indented forms of definitions. If > that is what you mean. Yes, but that doesn't help in these sorts of cases because what a rule like that really wants is some version of "don't match this line, but only if you can reasonably match this other rule". We can only do rule precedence on a per-line basis via the inverted matches. So for languages like cl/elisp/scheme and others where it's common to have nested function definitions (then -W would like the top-level) *OR* similarly looking nested function definitions, but the top-level isn't a function but a (setq) or whatever we're basically stuck with picking one or the other. I've pondered how to get around this problem in my userdiff.c hacking without resorting to supporting some general-purpose Turing machine, and have so far come up with nothing. You can see lots of prior art by grepping Emacs's source code for beginning-of-defun, it solves this problem by exposing a Turing machine :)
On 29/03/2021 14:12, Ævar Arnfjörð Bjarmason wrote: > > On Mon, Mar 29 2021, Johannes Sixt wrote: > >> Am 29.03.21 um 12:18 schrieb Phillip Wood: >>> It would be nice to include indented define forms but including them >>> means that any change to the body of a function is attributed to the >>> last internal definition rather than the actual function. For example >>> >>> (define (f arg) >>> (define (g x) >>> (+ 1 x)) >>> >>> (some-func ...) >>> ;;any change here will have '(define (g x)' in the hunk header, not >>> '(define (f arg)' >>> >>> I don't think this can be avoided as we rely on regexs rather than >>> parsing the source so it is probably best to only match toplevel defines. >> >> There can be two rules, one that matches '(define-' that is indented, >> and another one that matches all non-indented forms of definitions. If >> that is what you mean. > > Yes, but that doesn't help in these sorts of cases because what a rule > like that really wants is some version of "don't match this line, but > only if you can reasonably match this other rule". > > We can only do rule precedence on a per-line basis via the inverted > matches. > > So for languages like cl/elisp/scheme and others where it's common to > have nested function definitions (then -W would like the top-level) *OR* > similarly looking nested function definitions, but the top-level isn't a > function but a (setq) or whatever we're basically stuck with picking one > or the other. Exactly > I've pondered how to get around this problem in my userdiff.c hacking > without resorting to supporting some general-purpose Turing machine, and > have so far come up with nothing. I think using an indentation heuristic would probably work quite well for most languages - see https://public-inbox.org/git/20200923215859.102981-1-rtzoeller@rtzoeller.com/ for a discussion from last year (from memory there were some problems with the approach in those patches but I think there are some suggestion from Peff and me later in the thread on how they could be overcome) Best Wishes Phillip > You can see lots of prior art by grepping Emacs's source code for > beginning-of-defun, it solves this problem by exposing a Turing machine > :) >
Atharva Raykar <raykar.ath@gmail.com> writes: >> Having said that, two further points. >> >> - the "anything but whitespaces and various forms of parentheses" >> set would include backslash, so 'component\new' would be taken as >> a single word with "[^][()\\{\\} \t]+", wouldn't it? >> >> - how common is the use of backslashes in identifiers? I am trying >> to see if the additional complexity needed to support them is >> worth the benefit. > > I have refined the regex, and now it is much simpler and does all of what > I want it to: > > "([^][)(}{[:space:]])+" OK, [:space:] is already used elsewhere, so it would be OK. In practice, the only difference from "[ \t]" (which is used in many other patterns in the same file) is that [:space:] class includes form-feed (\Ctrl-L); nobody would write vertical-tab in the code, and the matching is done one line at a time, so the fact that LF (or CRLF) is in the [:space:] class does not make a difference anyway. > I did not have to escape the various parentheses, so I avoided the need to > handle backslashes separately. The "\\t" was causing problems as well because If you spelled "\\t" that would have caused a problem of your own making ;-) I think what I gave in the message you are responding to was a single backslash followed by a 't', to let the compiler turn them into a single HT character, and that wouldn't have had such a problem---in fact "[ \t]" is used in many other existing rules in the same file. Thanks.
On 29-Mar-2021, at 15:38, Phillip Wood <phillip.wood123@gmail.com> wrote: > On 28/03/2021 13:40, Atharva Raykar wrote: >> On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: >>> The "define-?.*" can be simplified to just "define.*", but looking at >>> the tests is that the intent? From the tests it looks like "define[- ]" >>> is what the author wants, unless this is meant to also match >>> "(definements". >> Yes, you captured my intent correctly. Will fix it. >>> Has this been tested on some real-world scheme code? E.g. I have guile >>> installed locally, and it has really large top-level eval-when >>> blocks. These rules would jump over those to whatever the function above >>> them is. >> I do not have a large scheme codebase on my own, I usually use Racket, >> which is a much larger language with many more forms. Other Schemes like >> Guile also extend the language a lot, like in your example, eval-when is >> an extension provided by Guile (and Chicken and Chez), but not a part of >> the R6RS document when I searched its index. >> So the 'define' forms are the only one that I know would reliably be present >> across all schemes. But one can also make a case where some of these non-standard >> forms may be common enough that they are worth adding in. In that case which >> forms to include? Should we consider everything in the SRFI's[1]? Should the >> various module definitions of Racket be included? It's a little tricky to know >> where to stop. > > If there are some common forms such as eval-when then it would be good to include them, otherwise we end up needing a different rule for each scheme implementation as they all seem to tweak something. Gerbil uses 'def...' e.g def, defsyntax, defstruct, defrules rather than define, define-syntax, define-record etc. I'm not user if we want to accommodate that or not. Yes, this is the part that is hard for me to figure out. I am going by two heuristics: what Scheme communities in other places would generally prefer, and what patterns I see happen more often in scheme code. The former is tricky to do. I posted to a few mailing lists about this, but they don't seem active enough to garner any responses. The latter is a little easier to measure quickly. I did a GitHub search, where I filtered results to only consider Scheme files (language:scheme). Some armchair stats, just for a broad understanding: Total number of scheme files: 529,339 No. of times a construct is used in those files: define and its variants : 431,090 (81.4%) def and its variants : 18,466 ( 3.5%) eval-when : 3,375 ( 0.6%) There was no way for me to quickly know which of these uses are at the top level, but either way of the more structural forms that do show up in Scheme code, define and its variants seem like a clear winner. I am not sure if it's worth adding more rules to check for def and its variants, given that they are not nearly as common.
> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote: > > Hi Atharva > > On 28/03/2021 13:23, Atharva Raykar wrote: >> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: > > [...] >>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define >>>> new file mode 100644 >>>> index 0000000000..90e75dcce8 >>>> --- /dev/null >>>> +++ b/t/t4018/scheme-local-define >>>> @@ -0,0 +1,4 @@ >>>> +(define (higher-order) >>>> + (define local-function RIGHT >>> >>> ... this one, which is also indented and *is* marked as RIGHT. >> In this test case, I was explicitly testing for an indented '(define' >> whereas in the former, I was testing for the top-level '(define-syntax', >> which happened to have an internal define (which will inevitably show up >> in a lot of scheme code). > > It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example > > (define (f arg) > (define (g x) > (+ 1 x)) > > (some-func ...) > ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)' The reason I went for this over the top level forms, is because I felt it was useful to see the nearest definition for internal functions that often have a lot of the actual business logic of the program (at least a lot of SICP seems to follow this pattern). The disadvantage is as you said, it might also catch trivial inner functions and the developer might lose context. Another problem is it may match more trivial bindings, like: (define (some-func things) ... (define items '(eggs ham peanut-butter)) ...) What I have noticed *anecdotally* is that this is not common enough to be too much of a problem, and local define bindings seem to be more favoured in Racket than other Schemes, that use 'let' more often. > I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines. The other issue with only matching top level defines is that a lot of scheme programs are library definitions, something like (library (foo bar) (export ...) (define ...) (define ...) ;; and a bunch of other definitions... ) Only matching top level defines will completely ignore matching all the definitions in these files.
> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote: > > > >> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote: >> >> Hi Atharva >> >> On 28/03/2021 13:23, Atharva Raykar wrote: >>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: >>> [...] >>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define >>>>> new file mode 100644 >>>>> index 0000000000..90e75dcce8 >>>>> --- /dev/null >>>>> +++ b/t/t4018/scheme-local-define >>>>> @@ -0,0 +1,4 @@ >>>>> +(define (higher-order) >>>>> + (define local-function RIGHT >>>> >>>> ... this one, which is also indented and *is* marked as RIGHT. >>> In this test case, I was explicitly testing for an indented '(define' >>> whereas in the former, I was testing for the top-level '(define-syntax', >>> which happened to have an internal define (which will inevitably show up >>> in a lot of scheme code). >> >> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example >> >> (define (f arg) >> (define (g x) >> (+ 1 x)) >> >> (some-func ...) >> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)' > > The reason I went for this over the top level forms, is because > I felt it was useful to see the nearest definition for internal > functions that often have a lot of the actual business logic of > the program (at least a lot of SICP seems to follow this pattern). > The disadvantage is as you said, it might also catch trivial inner > functions and the developer might lose context. Never mind this message, I had misunderstood the problem you were trying to demonstrate. I wholeheartedly agree with what you are trying to say, and the indentation heuristic discussed does look interesting. I shall have a glance at the RFC you linked in the other reply. > The disadvantage is as you said, it might also catch trivial inner > functions and the developer might lose context. Feel free to disregard me misquoting you here. You did not say that (: > Another problem is it may match more trivial bindings, like: > > (define (some-func things) > ... > (define items '(eggs > ham > peanut-butter)) > ...) > > What I have noticed *anecdotally* is that this is not common enough > to be too much of a problem, and local define bindings seem to be more > favoured in Racket than other Schemes, that use 'let' more often. > >> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines. > > The other issue with only matching top level defines is that a > lot of scheme programs are library definitions, something like > > (library > (foo bar) > (export ...) > (define ...) > (define ...) > ;; and a bunch of other definitions... > ) > > Only matching top level defines will completely ignore matching all > the definitions in these files. That said, I still stand by the fact that only catching top level defines will lead to a lot of definitions being ignored. Maybe the occasional mismatch may be worth the gain in the number of function contexts being detected?
On Tue, Mar 30 2021, Atharva Raykar wrote: > On 29-Mar-2021, at 15:38, Phillip Wood <phillip.wood123@gmail.com> wrote: >> On 28/03/2021 13:40, Atharva Raykar wrote: >>> On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: >>>> The "define-?.*" can be simplified to just "define.*", but looking at >>>> the tests is that the intent? From the tests it looks like "define[- ]" >>>> is what the author wants, unless this is meant to also match >>>> "(definements". >>> Yes, you captured my intent correctly. Will fix it. >>>> Has this been tested on some real-world scheme code? E.g. I have guile >>>> installed locally, and it has really large top-level eval-when >>>> blocks. These rules would jump over those to whatever the function above >>>> them is. >>> I do not have a large scheme codebase on my own, I usually use Racket, >>> which is a much larger language with many more forms. Other Schemes like >>> Guile also extend the language a lot, like in your example, eval-when is >>> an extension provided by Guile (and Chicken and Chez), but not a part of >>> the R6RS document when I searched its index. >>> So the 'define' forms are the only one that I know would reliably be present >>> across all schemes. But one can also make a case where some of these non-standard >>> forms may be common enough that they are worth adding in. In that case which >>> forms to include? Should we consider everything in the SRFI's[1]? Should the >>> various module definitions of Racket be included? It's a little tricky to know >>> where to stop. >> >> If there are some common forms such as eval-when then it would be good to include them, otherwise we end up needing a different rule for each scheme implementation as they all seem to tweak something. Gerbil uses 'def...' e.g def, defsyntax, defstruct, defrules rather than define, define-syntax, define-record etc. I'm not user if we want to accommodate that or not. > > Yes, this is the part that is hard for me to figure out. I am going by > two heuristics: what Scheme communities in other places would generally > prefer, and what patterns I see happen more often in scheme code. > > The former is tricky to do. I posted to a few mailing lists about this, > but they don't seem active enough to garner any responses. > > The latter is a little easier to measure quickly. I did a GitHub search, > where I filtered results to only consider Scheme files (language:scheme). > > Some armchair stats, just for a broad understanding: > > Total number of scheme files: 529,339 > No. of times a construct is used in those files: > define and its variants : 431,090 (81.4%) > def and its variants : 18,466 ( 3.5%) > eval-when : 3,375 ( 0.6%) > > There was no way for me to quickly know which of these uses are at the top > level, but either way of the more structural forms that do show up in Scheme > code, define and its variants seem like a clear winner. I am not sure if > it's worth adding more rules to check for def and its variants, given that > they are not nearly as common. In those cases we should veer on the side of inclusion. The only problem we'll have is if "eval-when" is a "setq"-like function top-level form in some other scheme dialect, so we'll have a conflict. Otherwise it's fine, programs that only use "define" won't be bothered by an eval-when rule.
On 30-Mar-2021, at 18:26, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Tue, Mar 30 2021, Atharva Raykar wrote: > >> On 29-Mar-2021, at 15:38, Phillip Wood <phillip.wood123@gmail.com> wrote: >>> On 28/03/2021 13:40, Atharva Raykar wrote: >>>> On 28-Mar-2021, at 08:46, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: >>>>> The "define-?.*" can be simplified to just "define.*", but looking at >>>>> the tests is that the intent? From the tests it looks like "define[- ]" >>>>> is what the author wants, unless this is meant to also match >>>>> "(definements". >>>> Yes, you captured my intent correctly. Will fix it. >>>>> Has this been tested on some real-world scheme code? E.g. I have guile >>>>> installed locally, and it has really large top-level eval-when >>>>> blocks. These rules would jump over those to whatever the function above >>>>> them is. >>>> I do not have a large scheme codebase on my own, I usually use Racket, >>>> which is a much larger language with many more forms. Other Schemes like >>>> Guile also extend the language a lot, like in your example, eval-when is >>>> an extension provided by Guile (and Chicken and Chez), but not a part of >>>> the R6RS document when I searched its index. >>>> So the 'define' forms are the only one that I know would reliably be present >>>> across all schemes. But one can also make a case where some of these non-standard >>>> forms may be common enough that they are worth adding in. In that case which >>>> forms to include? Should we consider everything in the SRFI's[1]? Should the >>>> various module definitions of Racket be included? It's a little tricky to know >>>> where to stop. >>> >>> If there are some common forms such as eval-when then it would be good to include them, otherwise we end up needing a different rule for each scheme implementation as they all seem to tweak something. Gerbil uses 'def...' e.g def, defsyntax, defstruct, defrules rather than define, define-syntax, define-record etc. I'm not user if we want to accommodate that or not. >> >> Yes, this is the part that is hard for me to figure out. I am going by >> two heuristics: what Scheme communities in other places would generally >> prefer, and what patterns I see happen more often in scheme code. >> >> The former is tricky to do. I posted to a few mailing lists about this, >> but they don't seem active enough to garner any responses. >> >> The latter is a little easier to measure quickly. I did a GitHub search, >> where I filtered results to only consider Scheme files (language:scheme). >> >> Some armchair stats, just for a broad understanding: >> >> Total number of scheme files: 529,339 >> No. of times a construct is used in those files: >> define and its variants : 431,090 (81.4%) >> def and its variants : 18,466 ( 3.5%) >> eval-when : 3,375 ( 0.6%) >> >> There was no way for me to quickly know which of these uses are at the top >> level, but either way of the more structural forms that do show up in Scheme >> code, define and its variants seem like a clear winner. I am not sure if >> it's worth adding more rules to check for def and its variants, given that >> they are not nearly as common. > > In those cases we should veer on the side of inclusion. The only problem > we'll have is if "eval-when" is a "setq"-like function top-level form in > some other scheme dialect, so we'll have a conflict. > > Otherwise it's fine, programs that only use "define" won't be bothered > by an eval-when rule. I would like some clarification, since my knowledge of Common Lisp's setq and Guile's/Other's eval-when is pretty surface level. > The only problem we'll have is if "eval-when" is a "setq"-like function > top-level form in some other scheme dialect, so we'll have a conflict. I am not sure what you mean when you say if "eval-when" is a "setq"-like top level form, and exactly what kind of problem it may cause. I also realized from my understanding of the Guile Documentation[1], that "eval-when" is used to tell the compiler which expressions should be made available during the expansion phase. It does not seem to have anything that may help identify the location of the hunk, which I understand is the primary purpose of these hunk headers. All uses of "eval-when" would be some variation of: (eval-when (expand load eval) ; no identifier in this form ...) unlike a "define" which will always name the nearest function, which helps as a landmark. Would that be a valid reason to exclude "eval-when"?
Hi Atharva On 30/03/2021 11:22, Atharva Raykar wrote: > > >> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote: >> >> >> >>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote: >>> >>> Hi Atharva >>> >>> On 28/03/2021 13:23, Atharva Raykar wrote: >>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: >>>> [...] >>>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define >>>>>> new file mode 100644 >>>>>> index 0000000000..90e75dcce8 >>>>>> --- /dev/null >>>>>> +++ b/t/t4018/scheme-local-define >>>>>> @@ -0,0 +1,4 @@ >>>>>> +(define (higher-order) >>>>>> + (define local-function RIGHT >>>>> >>>>> ... this one, which is also indented and *is* marked as RIGHT. >>>> In this test case, I was explicitly testing for an indented '(define' >>>> whereas in the former, I was testing for the top-level '(define-syntax', >>>> which happened to have an internal define (which will inevitably show up >>>> in a lot of scheme code). >>> >>> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example >>> >>> (define (f arg) >>> (define (g x) >>> (+ 1 x)) >>> >>> (some-func ...) >>> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)' >> >> The reason I went for this over the top level forms, is because >> I felt it was useful to see the nearest definition for internal >> functions that often have a lot of the actual business logic of >> the program (at least a lot of SICP seems to follow this pattern). >> The disadvantage is as you said, it might also catch trivial inner >> functions and the developer might lose context. > > Never mind this message, I had misunderstood the problem you were trying to > demonstrate. I wholeheartedly agree with what you are trying to say, and > the indentation heuristic discussed does look interesting. I shall have a > glance at the RFC you linked in the other reply. > >> The disadvantage is as you said, it might also catch trivial inner >> functions and the developer might lose context. > > Feel free to disregard me misquoting you here. You did not say that (: > >> Another problem is it may match more trivial bindings, like: >> >> (define (some-func things) >> ... >> (define items '(eggs >> ham >> peanut-butter)) >> ...) >> >> What I have noticed *anecdotally* is that this is not common enough >> to be too much of a problem, and local define bindings seem to be more >> favoured in Racket than other Schemes, that use 'let' more often. >> >>> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines. >> >> The other issue with only matching top level defines is that a >> lot of scheme programs are library definitions, something like >> >> (library >> (foo bar) >> (export ...) >> (define ...) >> (define ...) >> ;; and a bunch of other definitions... >> ) >> >> Only matching top level defines will completely ignore matching all >> the definitions in these files. > > That said, I still stand by the fact that only catching top level defines > will lead to a lot of definitions being ignored. Maybe the occasional > mismatch may be worth the gain in the number of function contexts being > detected? I'm not sure that the mismatches will be occasional - every time you have an internal definition in a function the hunk header will be wrong when you change the main body of the function. This will affect grep --function-context and diff -W as well as the normal hunk headers. The problem is there is no way to avoid that and provide something useful in the library example you have above. It would be useful to find some code bases and diff the output of 'git log --patch' with and without the leading whitespace match in the function pattern to see how often this is a problem (i.e. when the funcnames do not match see which one is correct). Best Wishes Phillip
Am 05.04.21 um 12:04 schrieb Phillip Wood: > Hi Atharva > > On 30/03/2021 11:22, Atharva Raykar wrote: >> >> >>> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote: >>> >>> >>> >>>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> >>>> wrote: >>>> >>>> Hi Atharva >>>> >>>> On 28/03/2021 13:23, Atharva Raykar wrote: >>>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: >>>>> [...] >>>>>>> diff --git a/t/t4018/scheme-local-define >>>>>>> b/t/t4018/scheme-local-define >>>>>>> new file mode 100644 >>>>>>> index 0000000000..90e75dcce8 >>>>>>> --- /dev/null >>>>>>> +++ b/t/t4018/scheme-local-define >>>>>>> @@ -0,0 +1,4 @@ >>>>>>> +(define (higher-order) >>>>>>> + (define local-function RIGHT >>>>>> >>>>>> ... this one, which is also indented and *is* marked as RIGHT. >>>>> In this test case, I was explicitly testing for an indented '(define' >>>>> whereas in the former, I was testing for the top-level >>>>> '(define-syntax', >>>>> which happened to have an internal define (which will inevitably >>>>> show up >>>>> in a lot of scheme code). >>>> >>>> It would be nice to include indented define forms but including them >>>> means that any change to the body of a function is attributed to the >>>> last internal definition rather than the actual function. For example >>>> >>>> (define (f arg) >>>> (define (g x) >>>> (+ 1 x)) >>>> >>>> (some-func ...) >>>> ;;any change here will have '(define (g x)' in the hunk header, not >>>> '(define (f arg)' >>> >>> The reason I went for this over the top level forms, is because >>> I felt it was useful to see the nearest definition for internal >>> functions that often have a lot of the actual business logic of >>> the program (at least a lot of SICP seems to follow this pattern). >>> The disadvantage is as you said, it might also catch trivial inner >>> functions and the developer might lose context. >> >> Never mind this message, I had misunderstood the problem you were >> trying to >> demonstrate. I wholeheartedly agree with what you are trying to say, and >> the indentation heuristic discussed does look interesting. I shall have a >> glance at the RFC you linked in the other reply. >> >>> The disadvantage is as you said, it might also catch trivial inner >>> functions and the developer might lose context. >> >> Feel free to disregard me misquoting you here. You did not say that (: >> >>> Another problem is it may match more trivial bindings, like: >>> >>> (define (some-func things) >>> ... >>> (define items '(eggs >>> ham >>> peanut-butter)) >>> ...) >>> >>> What I have noticed *anecdotally* is that this is not common enough >>> to be too much of a problem, and local define bindings seem to be more >>> favoured in Racket than other Schemes, that use 'let' more often. >>> >>>> I don't think this can be avoided as we rely on regexs rather than >>>> parsing the source so it is probably best to only match toplevel >>>> defines. >>> >>> The other issue with only matching top level defines is that a >>> lot of scheme programs are library definitions, something like >>> >>> (library >>> (foo bar) >>> (export ...) >>> (define ...) >>> (define ...) >>> ;; and a bunch of other definitions... >>> ) >>> >>> Only matching top level defines will completely ignore matching all >>> the definitions in these files. >> >> That said, I still stand by the fact that only catching top level defines >> will lead to a lot of definitions being ignored. Maybe the occasional >> mismatch may be worth the gain in the number of function contexts being >> detected? > > I'm not sure that the mismatches will be occasional - every time you > have an internal definition in a function the hunk header will be wrong > when you change the main body of the function. This will affect grep > --function-context and diff -W as well as the normal hunk headers. The > problem is there is no way to avoid that and provide something useful in > the library example you have above. It would be useful to find some code > bases and diff the output of 'git log --patch' with and without the > leading whitespace match in the function pattern to see how often this > is a problem (i.e. when the funcnames do not match see which one is > correct). --function-context is just one application of the function matcher. To work properly with nested function definitions, it would have to understand the nesting. But it does not; there is nothing that we can do about it without a proper language parser. Therefore, the argument that the matcher does not work well with --function-context for nested functions is of little relevance. IMO, the primary concern should be whether the matcher decorates hunk contexts sufficiently well. -- Hannes
On 05-Apr-2021, at 15:34, Phillip Wood <phillip.wood123@gmail.com> wrote: > > Hi Atharva > > On 30/03/2021 11:22, Atharva Raykar wrote: >>> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote: >>> >>> >>> >>>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote: >>>> >>>> Hi Atharva >>>> >>>> On 28/03/2021 13:23, Atharva Raykar wrote: >>>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: >>>>> [...] >>>>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define >>>>>>> new file mode 100644 >>>>>>> index 0000000000..90e75dcce8 >>>>>>> --- /dev/null >>>>>>> +++ b/t/t4018/scheme-local-define >>>>>>> @@ -0,0 +1,4 @@ >>>>>>> +(define (higher-order) >>>>>>> + (define local-function RIGHT >>>>>> >>>>>> ... this one, which is also indented and *is* marked as RIGHT. >>>>> In this test case, I was explicitly testing for an indented '(define' >>>>> whereas in the former, I was testing for the top-level '(define-syntax', >>>>> which happened to have an internal define (which will inevitably show up >>>>> in a lot of scheme code). >>>> >>>> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example >>>> >>>> (define (f arg) >>>> (define (g x) >>>> (+ 1 x)) >>>> >>>> (some-func ...) >>>> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)' >>> >>> The reason I went for this over the top level forms, is because >>> I felt it was useful to see the nearest definition for internal >>> functions that often have a lot of the actual business logic of >>> the program (at least a lot of SICP seems to follow this pattern). >>> The disadvantage is as you said, it might also catch trivial inner >>> functions and the developer might lose context. >> Never mind this message, I had misunderstood the problem you were trying to >> demonstrate. I wholeheartedly agree with what you are trying to say, and >> the indentation heuristic discussed does look interesting. I shall have a >> glance at the RFC you linked in the other reply. >>> The disadvantage is as you said, it might also catch trivial inner >>> functions and the developer might lose context. >> Feel free to disregard me misquoting you here. You did not say that (: >>> Another problem is it may match more trivial bindings, like: >>> >>> (define (some-func things) >>> ... >>> (define items '(eggs >>> ham >>> peanut-butter)) >>> ...) >>> >>> What I have noticed *anecdotally* is that this is not common enough >>> to be too much of a problem, and local define bindings seem to be more >>> favoured in Racket than other Schemes, that use 'let' more often. >>> >>>> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines. >>> >>> The other issue with only matching top level defines is that a >>> lot of scheme programs are library definitions, something like >>> >>> (library >>> (foo bar) >>> (export ...) >>> (define ...) >>> (define ...) >>> ;; and a bunch of other definitions... >>> ) >>> >>> Only matching top level defines will completely ignore matching all >>> the definitions in these files. >> That said, I still stand by the fact that only catching top level defines >> will lead to a lot of definitions being ignored. Maybe the occasional >> mismatch may be worth the gain in the number of function contexts being >> detected? > > I'm not sure that the mismatches will be occasional - every time you have an internal definition in a function the hunk header will be wrong when you change the main body of the function. This will affect grep --function-context and diff -W as well as the normal hunk headers. The problem is there is no way to avoid that and provide something useful in the library example you have above. It would be useful to find some code bases and diff the output of 'git log --patch' with and without the leading whitespace match in the function pattern to see how often this is a problem (i.e. when the funcnames do not match see which one is correct). You are right -- on trying out the function on a two other scheme codebases, I noticed that there are a lot more wrongly matched functions than I initially thought. About half of them identify the wrong function in one of the repositories I tried. However, removing the leading whitespace in the pattern did not lead to better matching; it just led to a lot of the hunk headers going blank. I am not sure what causes this behaviour, but my guess is that the function contexts are shown only if it is within a certain distance from the function definition? Even if it did match only the top level defines correctly, the functions matched would still often be technically wrong -- it will show the outer function as the context when the user has edited an internal function (and in Scheme, there is heavy usage of internal functions). After running 'git grep --function-context' with the leading whitespace removed, it seems to match too aggressively, as it captures a huge region to match all the way upto the top level. Especially for files where all the definitions are in a 'library'. Overall, I personally felt that there were more downsides to matching only at the top level. I'd rather the hunk header have the nearest function to provide the context, than have no function displayed at all. Even when the match is wrong, it at least helps me locate where the change was made more easily. > Best Wishes > > Phillip > >
Hi Atharva On 06/04/2021 13:29, Atharva Raykar wrote: > On 05-Apr-2021, at 15:34, Phillip Wood <phillip.wood123@gmail.com> wrote: >> >> Hi Atharva >> >> On 30/03/2021 11:22, Atharva Raykar wrote: >>>> On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@gmail.com> wrote: >>>> >>>> >>>> >>>>> On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@gmail.com> wrote: >>>>> >>>>> Hi Atharva >>>>> >>>>> On 28/03/2021 13:23, Atharva Raykar wrote: >>>>>> On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@kdbg.org> wrote: >>>>>> [...] >>>>>>>> diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define >>>>>>>> new file mode 100644 >>>>>>>> index 0000000000..90e75dcce8 >>>>>>>> --- /dev/null >>>>>>>> +++ b/t/t4018/scheme-local-define >>>>>>>> @@ -0,0 +1,4 @@ >>>>>>>> +(define (higher-order) >>>>>>>> + (define local-function RIGHT >>>>>>> >>>>>>> ... this one, which is also indented and *is* marked as RIGHT. >>>>>> In this test case, I was explicitly testing for an indented '(define' >>>>>> whereas in the former, I was testing for the top-level '(define-syntax', >>>>>> which happened to have an internal define (which will inevitably show up >>>>>> in a lot of scheme code). >>>>> >>>>> It would be nice to include indented define forms but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example >>>>> >>>>> (define (f arg) >>>>> (define (g x) >>>>> (+ 1 x)) >>>>> >>>>> (some-func ...) >>>>> ;;any change here will have '(define (g x)' in the hunk header, not '(define (f arg)' >>>> >>>> The reason I went for this over the top level forms, is because >>>> I felt it was useful to see the nearest definition for internal >>>> functions that often have a lot of the actual business logic of >>>> the program (at least a lot of SICP seems to follow this pattern). >>>> The disadvantage is as you said, it might also catch trivial inner >>>> functions and the developer might lose context. >>> Never mind this message, I had misunderstood the problem you were trying to >>> demonstrate. I wholeheartedly agree with what you are trying to say, and >>> the indentation heuristic discussed does look interesting. I shall have a >>> glance at the RFC you linked in the other reply. >>>> The disadvantage is as you said, it might also catch trivial inner >>>> functions and the developer might lose context. >>> Feel free to disregard me misquoting you here. You did not say that (: >>>> Another problem is it may match more trivial bindings, like: >>>> >>>> (define (some-func things) >>>> ... >>>> (define items '(eggs >>>> ham >>>> peanut-butter)) >>>> ...) >>>> >>>> What I have noticed *anecdotally* is that this is not common enough >>>> to be too much of a problem, and local define bindings seem to be more >>>> favoured in Racket than other Schemes, that use 'let' more often. >>>> >>>>> I don't think this can be avoided as we rely on regexs rather than parsing the source so it is probably best to only match toplevel defines. >>>> >>>> The other issue with only matching top level defines is that a >>>> lot of scheme programs are library definitions, something like >>>> >>>> (library >>>> (foo bar) >>>> (export ...) >>>> (define ...) >>>> (define ...) >>>> ;; and a bunch of other definitions... >>>> ) >>>> >>>> Only matching top level defines will completely ignore matching all >>>> the definitions in these files. >>> That said, I still stand by the fact that only catching top level defines >>> will lead to a lot of definitions being ignored. Maybe the occasional >>> mismatch may be worth the gain in the number of function contexts being >>> detected? >> >> I'm not sure that the mismatches will be occasional - every time you have an internal definition in a function the hunk header will be wrong when you change the main body of the function. This will affect grep --function-context and diff -W as well as the normal hunk headers. The problem is there is no way to avoid that and provide something useful in the library example you have above. It would be useful to find some code bases and diff the output of 'git log --patch' with and without the leading whitespace match in the function pattern to see how often this is a problem (i.e. when the funcnames do not match see which one is correct). > > You are right -- on trying out the function on a two other scheme > codebases, I noticed that there are a lot more wrongly matched functions > than I initially thought. About half of them identify the wrong function > in one of the repositories I tried. However, removing the leading > whitespace in the pattern did not lead to better matching; it just led > to a lot of the hunk headers going blank. I am not sure what causes this > behaviour, but my guess is that the function contexts are shown only if > it is within a certain distance from the function definition? > > Even if it did match only the top level defines correctly, the functions > matched would still often be technically wrong -- it will show the outer > function as the context when the user has edited an internal function > (and in Scheme, there is heavy usage of internal functions). > > After running 'git grep --function-context' with the leading whitespace > removed, it seems to match too aggressively, as it captures a huge > region to match all the way upto the top level. Especially for files > where all the definitions are in a 'library'. > > Overall, I personally felt that there were more downsides to matching > only at the top level. I'd rather the hunk header have the nearest > function to provide the context, than have no function displayed at all. > Even when the match is wrong, it at least helps me locate where the > change was made more easily. Thanks for taking the time to check the differences between the two approaches, as there is no perfect solution I'm happy to go with the one that seemed to be best in your investigations Best Wishes Phillip >> Best Wishes >> >> Phillip >> >> >
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index 0a60472bb5..cfcfa800c2 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -845,6 +845,8 @@ patterns are available: - `rust` suitable for source code in the Rust language. +- `scheme` suitable for source code in the Scheme language. + - `tex` suitable for source code for LaTeX documents. diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh index 9675bc17db..823ea96acb 100755 --- a/t/t4018-diff-funcname.sh +++ b/t/t4018-diff-funcname.sh @@ -48,6 +48,7 @@ diffpatterns=" python ruby rust + scheme tex custom1 custom2 diff --git a/t/t4018/scheme-define-syntax b/t/t4018/scheme-define-syntax new file mode 100644 index 0000000000..603b99cea4 --- /dev/null +++ b/t/t4018/scheme-define-syntax @@ -0,0 +1,8 @@ +(define-syntax define-test-suite RIGHT + (syntax-rules () + ((_ suite-name (name test) ChangeMe ...) + (define suite-name + (let ((tests + `((name . ,test) ...))) + (lambda () + (ChangeMe 'suite-name tests))))))) \ No newline at end of file diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define new file mode 100644 index 0000000000..90e75dcce8 --- /dev/null +++ b/t/t4018/scheme-local-define @@ -0,0 +1,4 @@ +(define (higher-order) + (define local-function RIGHT + (lambda (x) + (car "this is" "ChangeMe")))) \ No newline at end of file diff --git a/t/t4018/scheme-top-level-define b/t/t4018/scheme-top-level-define new file mode 100644 index 0000000000..03acdc628d --- /dev/null +++ b/t/t4018/scheme-top-level-define @@ -0,0 +1,4 @@ +(define (some-func x y z) RIGHT + (let ((a x) + (b y)) + (ChangeMe a b))) \ No newline at end of file diff --git a/t/t4018/scheme-user-defined-define b/t/t4018/scheme-user-defined-define new file mode 100644 index 0000000000..401093bac3 --- /dev/null +++ b/t/t4018/scheme-user-defined-define @@ -0,0 +1,6 @@ +(define-test-suite record-case-tests RIGHT + (record-case-1 (lambda (fail) + (let ((a (make-foo 1 2))) + (record-case a + ((bar x) (ChangeMe)) + ((foo a b) (+ a b))))))) \ No newline at end of file diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh index 56f1e62a97..ee7721ab91 100755 --- a/t/t4034-diff-words.sh +++ b/t/t4034-diff-words.sh @@ -325,6 +325,7 @@ test_language_driver perl test_language_driver php test_language_driver python test_language_driver ruby +test_language_driver scheme test_language_driver tex test_expect_success 'word-diff with diff.sbe' ' diff --git a/t/t4034/scheme/expect b/t/t4034/scheme/expect new file mode 100644 index 0000000000..eed21e803c --- /dev/null +++ b/t/t4034/scheme/expect @@ -0,0 +1,9 @@ +<BOLD>diff --git a/pre b/post<RESET> +<BOLD>index 6a5efba..7c4a6b4 100644<RESET> +<BOLD>--- a/pre<RESET> +<BOLD>+++ b/post<RESET> +<CYAN>@@ -1,4 +1,4 @@<RESET> +(define (<RED>myfunc a b<RESET><GREEN>my-func first second<RESET>) + ; This is a <RED>really<RESET><GREEN>(moderately)<RESET> cool function. + (let ((c (<RED>+ a b<RESET><GREEN>add1 first<RESET>))) + (format "one more than the total is %d" (<RED>add1<RESET><GREEN>+<RESET> c <GREEN>second<RESET>)))) diff --git a/t/t4034/scheme/post b/t/t4034/scheme/post new file mode 100644 index 0000000000..7c4a6b4f3d --- /dev/null +++ b/t/t4034/scheme/post @@ -0,0 +1,4 @@ +(define (my-func first second) + ; This is a (moderately) cool function. + (let ((c (add1 first))) + (format "one more than the total is %d" (+ c second)))) \ No newline at end of file diff --git a/t/t4034/scheme/pre b/t/t4034/scheme/pre new file mode 100644 index 0000000000..6a5efbae61 --- /dev/null +++ b/t/t4034/scheme/pre @@ -0,0 +1,4 @@ +(define (myfunc a b) + ; This is a really cool function. + (let ((c (+ a b))) + (format "one more than the total is %d" (add1 c)))) \ No newline at end of file diff --git a/userdiff.c b/userdiff.c index 3f81a2261c..c51a8c98ba 100644 --- a/userdiff.c +++ b/userdiff.c @@ -191,6 +191,14 @@ PATTERNS("rust", "[a-zA-Z_][a-zA-Z0-9_]*" "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?" "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"), +PATTERNS("scheme", + "^[\t ]*(\\(define-?.*)$", + /* + * Scheme allows symbol names to have any character, + * as long as it is not a form of a parenthesis. + * The spaces must be escaped. + */ + "(\\.|[^][)(\\}\\{ ])+"), PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$", "[={}\"]|[^={}\" \t]+"), PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
Add a diff driver for Scheme (R5RS and R6RS) which recognizes top level and local `define` forms, whether it is a function definition, binding, syntax definition or a user-defined `define-xyzzy` form. The rationale for picking `define` forms for the hunk headers is because it is usually the only significant form for defining the structure of the program, and it is a common pattern for schemers to have local function definitions to hide their visibility, so it is not only the top level `define`'s that are of interest. Schemers also extend the language with macros to provide their own define forms (for example, something like a `define-test-suite`) which is also captured in the hunk header. The word regex is a best-effort attempt to conform to R6RS[1] valid identifiers, symbols and numbers. [1] http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_chap_4 Signed-off-by: Atharva Raykar <raykar.ath@gmail.com> --- Hi, first-time contributor here, I wanted to have a go at this as a microproject. A few things I had to consider: - Going through the mailing list, there have already been two other patches that are for lispy languages that have taken slightly different approaches: Elisp[1] and Clojure[2]. Would it make any sense to have a single userdiff driver for lisp that just captures all top level forms in the hunk? I personally felt it's better to differentiate the drivers for each language, as they have different constructs. - It was hard to decide exactly which forms should appear on the hunk headers. Having programmed in a Scheme before, I went with the forms I would have liked to see when looking at git diffs, which would be the nearest `define` along with `define-syntax` and other define forms that are created as user-defined macros. I am willing to ask around in certain active scheme communities for some kind of consensus, but there is no single large consolidated group of schemers (the closest is probably comp.lang.scheme?). - By best-effort attempt at the wordregex, I mean that it is a little more permissive than it has to be, as it accepts a few words that are technically invalid in Scheme. Making it handle all cases like numbers and identifiers with separate regexen would be greatly complicated (Eg: #x#e10.2f3 is a valid number but #x#f10.2e3 is not; 10t1 is a valid identifier, but 10s1 is a number -- my wordregex just clubs all of these into a generic 'word match' which trades of granularity for simplicity, and it usually does the right thing). [1] http://public-inbox.org/git/20210213192447.6114-1-git@adamspiers.org/ [2] http://public-inbox.org/git/pull.902.git.1615667191368.gitgitgadget@gmail.com/ Documentation/gitattributes.txt | 2 ++ t/t4018-diff-funcname.sh | 1 + t/t4018/scheme-define-syntax | 8 ++++++++ t/t4018/scheme-local-define | 4 ++++ t/t4018/scheme-top-level-define | 4 ++++ t/t4018/scheme-user-defined-define | 6 ++++++ t/t4034-diff-words.sh | 1 + t/t4034/scheme/expect | 9 +++++++++ t/t4034/scheme/post | 4 ++++ t/t4034/scheme/pre | 4 ++++ userdiff.c | 8 ++++++++ 11 files changed, 51 insertions(+) create mode 100644 t/t4018/scheme-define-syntax create mode 100644 t/t4018/scheme-local-define create mode 100644 t/t4018/scheme-top-level-define create mode 100644 t/t4018/scheme-user-defined-define create mode 100644 t/t4034/scheme/expect create mode 100644 t/t4034/scheme/post create mode 100644 t/t4034/scheme/pre