diff mbox series

userdiff: support Markdown

Message ID 20200421010035.13915-1-ash@sorrel.sh (mailing list archive)
State New, archived
Headers show
Series userdiff: support Markdown | expand

Commit Message

Ash Holland April 21, 2020, 1 a.m. UTC
It's typical to find Markdown documentation alongside source code, and
having better context for documentation changes is useful; see also
commit 69f9c87d4 (userdiff: add support for Fountain documents,
2015-07-21).

The pattern is based on the CommonMark specification 0.29, section 4.2:
https://spec.commonmark.org/

Only ATX headings are supported, as detecting setext headings would
require printing the line before a pattern matches, or matching a
multiline pattern.

Signed-off-by: Ash Holland <ash@sorrel.sh>
---

If it is indeed possible to match multiline patterns, let me know! I
would love to support setext (underlined) headings with this.

I would also appreciate feedback on the word-diff pattern here, I have
no real idea what should constitute a word in a Markdown document, apart
from that it should probably be similar to the definition given for
Fountain, given that Fountain appears to have somewhat similar inline
syntax to Markdown.

 Documentation/gitattributes.txt       |  2 ++
 t/t4018-diff-funcname.sh              |  1 +
 t/t4018/markdown-heading-indented     |  6 ++++++
 t/t4018/markdown-heading-non-headings | 17 +++++++++++++++++
 userdiff.c                            |  3 +++
 5 files changed, 29 insertions(+)
 create mode 100644 t/t4018/markdown-heading-indented
 create mode 100644 t/t4018/markdown-heading-non-headings

Comments

Emma Brooks April 21, 2020, 2:22 a.m. UTC | #1
On 2020-04-21 02:00:35+0100, Ash Holland wrote:
> I would also appreciate feedback on the word-diff pattern here, I have
> no real idea what should constitute a word in a Markdown document, apart
> from that it should probably be similar to the definition given for
> Fountain, given that Fountain appears to have somewhat similar inline
> syntax to Markdown.

Since Markdown can have raw HTML tags in many variants, it may make
sense to extend the word pattern to "[^<>= \t]+" like HTML's pattern so
tags starting/ending will not be considered part of a word.
Johannes Sixt April 23, 2020, 6:17 p.m. UTC | #2
Am 21.04.20 um 03:00 schrieb Ash Holland:
> It's typical to find Markdown documentation alongside source code, and
> having better context for documentation changes is useful; see also
> commit 69f9c87d4 (userdiff: add support for Fountain documents,
> 2015-07-21).
> 
> The pattern is based on the CommonMark specification 0.29, section 4.2:
> https://spec.commonmark.org/
> 
> Only ATX headings are supported, as detecting setext headings would
> require printing the line before a pattern matches, or matching a
> multiline pattern.

The patch looks good. I have one question about the patthern, though
(see below).

> Signed-off-by: Ash Holland <ash@sorrel.sh>
> ---
> 
> If it is indeed possible to match multiline patterns, let me know! I
> would love to support setext (underlined) headings with this.

We don't have multi-line matching, unfortunately.

> I would also appreciate feedback on the word-diff pattern here, I have
> no real idea what should constitute a word in a Markdown document, apart
> from that it should probably be similar to the definition given for
> Fountain, given that Fountain appears to have somewhat similar inline
> syntax to Markdown.
> 
>  Documentation/gitattributes.txt       |  2 ++
>  t/t4018-diff-funcname.sh              |  1 +
>  t/t4018/markdown-heading-indented     |  6 ++++++
>  t/t4018/markdown-heading-non-headings | 17 +++++++++++++++++
>  userdiff.c                            |  3 +++
>  5 files changed, 29 insertions(+)
>  create mode 100644 t/t4018/markdown-heading-indented
>  create mode 100644 t/t4018/markdown-heading-non-headings
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 508fe713c..2d0a03715 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -824,6 +824,8 @@ patterns are available:
>  
>  - `java` suitable for source code in the Java language.
>  
> +- `markdown` suitable for Markdown documents.
> +
>  - `matlab` suitable for source code in the MATLAB and Octave languages.
>  
>  - `objc` suitable for source code in the Objective-C language.
> diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh
> index 02255a08b..9d0779757 100755
> --- a/t/t4018-diff-funcname.sh
> +++ b/t/t4018-diff-funcname.sh
> @@ -38,6 +38,7 @@ diffpatterns="
>  	golang
>  	html
>  	java
> +	markdown
>  	matlab
>  	objc
>  	pascal
> diff --git a/t/t4018/markdown-heading-indented b/t/t4018/markdown-heading-indented
> new file mode 100644
> index 000000000..1991c2bd4
> --- /dev/null
> +++ b/t/t4018/markdown-heading-indented
> @@ -0,0 +1,6 @@
> +Indented headings are allowed, as long as the indent is no more than 3 spaces.
> +
> +   ### RIGHT
> +
> +- something
> +- ChangeMe
> diff --git a/t/t4018/markdown-heading-non-headings b/t/t4018/markdown-heading-non-headings
> new file mode 100644
> index 000000000..1f19b91d6
> --- /dev/null
> +++ b/t/t4018/markdown-heading-non-headings
> @@ -0,0 +1,17 @@
> +Headings can be right next to other lines of the file:
> +# RIGHT
> +Indents of more than four spaces make a code block:
> +
> +    # code comment, not heading
> +
> +If there's no space after the final hash, it's not a heading:
> +
> +#hashtag
> +
> +Sequences of more than 6 hashes don't make a heading:
> +
> +####### over-enthusiastic heading
> +
> +So the detected heading should be right up at the start of this file.
> +
> +ChangeMe

Nicely done!

> diff --git a/userdiff.c b/userdiff.c
> index efbe05e5a..f79adb3a3 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -79,6 +79,9 @@ PATTERNS("java",
>  	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
>  	 "|[-+*/<>%&^|=!]="
>  	 "|--|\\+\\+|<<=?|>>>?=?|&&|\\|\\|"),
> +PATTERNS("markdown",
> +	 "^ {0,3}#{1,6}( .*)?$",

What is the purpose of making the heading text optional? Why would you
want to match a sequence of hash marks without any text following it?

> +	 "[^ \t-]+"),
>  PATTERNS("matlab",
>  	 /*
>  	  * Octave pattern is mostly the same as matlab, except that '%%%' and
> 

-- Hannes
Ash Holland April 23, 2020, 11:32 p.m. UTC | #3
On Tue Apr 21, 2020 at 2:22 AM, Emma Brooks wrote:
> Since Markdown can have raw HTML tags in many variants, it may make
> sense to extend the word pattern to "[^<>= \t]+" like HTML's pattern so
> tags starting/ending will not be considered part of a word.

Good point, I'll update the pattern to that, thanks!
Ash Holland April 23, 2020, 11:42 p.m. UTC | #4
On Thu Apr 23, 2020 at 9:17 PM PST, Johannes Sixt wrote:
> Am 21.04.20 um 03:00 schrieb Ash Holland:
> > diff --git a/userdiff.c b/userdiff.c
> > index efbe05e5a..f79adb3a3 100644
> > --- a/userdiff.c
> > +++ b/userdiff.c
> > @@ -79,6 +79,9 @@ PATTERNS("java",
> >  	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
> >  	 "|[-+*/<>%&^|=!]="
> >  	 "|--|\\+\\+|<<=?|>>>?=?|&&|\\|\\|"),
> > +PATTERNS("markdown",
> > +	 "^ {0,3}#{1,6}( .*)?$",
>
> What is the purpose of making the heading text optional? Why would you
> want to match a sequence of hash marks without any text following it?

Strictly speaking, a markdown heading is allowed to be empty -- see for
example https://spec.commonmark.org/0.29/#example-49. I'm happy to
change it if you think it's more useful to show a previous heading which
contains text than an empty one, though.
Johannes Sixt April 24, 2020, 5:21 p.m. UTC | #5
Am 24.04.20 um 01:42 schrieb Ash Holland:
> On Thu Apr 23, 2020 at 9:17 PM PST, Johannes Sixt wrote:
>> Am 21.04.20 um 03:00 schrieb Ash Holland:
>>> diff --git a/userdiff.c b/userdiff.c
>>> index efbe05e5a..f79adb3a3 100644
>>> --- a/userdiff.c
>>> +++ b/userdiff.c
>>> @@ -79,6 +79,9 @@ PATTERNS("java",
>>>  	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
>>>  	 "|[-+*/<>%&^|=!]="
>>>  	 "|--|\\+\\+|<<=?|>>>?=?|&&|\\|\\|"),
>>> +PATTERNS("markdown",
>>> +	 "^ {0,3}#{1,6}( .*)?$",
>>
>> What is the purpose of making the heading text optional? Why would you
>> want to match a sequence of hash marks without any text following it?
> 
> Strictly speaking, a markdown heading is allowed to be empty -- see for
> example https://spec.commonmark.org/0.29/#example-49. I'm happy to
> change it if you think it's more useful to show a previous heading which
> contains text than an empty one, though.

I don't know what makes sense, I don't write markdown regularly. A quick
check shows that the sequence of hashmarks appears in the hunk header.
Is that useful? (A genuine question!)

-- Hannes
Junio C Hamano April 28, 2020, 9:57 p.m. UTC | #6
"Ash Holland" <ash@sorrel.sh> writes:

> On Tue Apr 21, 2020 at 2:22 AM, Emma Brooks wrote:
>> Since Markdown can have raw HTML tags in many variants, it may make
>> sense to extend the word pattern to "[^<>= \t]+" like HTML's pattern so
>> tags starting/ending will not be considered part of a word.
>
> Good point, I'll update the pattern to that, thanks!

I just marked the topic as "expecting a reroll" in the "What's
cooking" report I have been preparing, but has something happened
after this exchange?

No need to rush, but we'll be closing the acceptance of new features
in three weeks for this cycle, so we won't have infinite amount of
time, either.

Thanks.
Ash Holland April 29, 2020, 12:12 p.m. UTC | #7
On Tue Apr 28, 2020 at 3:57 PM BST, Junio C Hamano wrote:
> "Ash Holland" <ash@sorrel.sh> writes:
>
> > On Tue Apr 21, 2020 at 2:22 AM, Emma Brooks wrote:
> >> Since Markdown can have raw HTML tags in many variants, it may make
> >> sense to extend the word pattern to "[^<>= \t]+" like HTML's pattern so
> >> tags starting/ending will not be considered part of a word.
> >
> > Good point, I'll update the pattern to that, thanks!
>
> I just marked the topic as "expecting a reroll" in the "What's
> cooking" report I have been preparing, but has something happened
> after this exchange?

You've not missed anything, sorry, I've been busy with exams -- I'll
prepare a v2 later today.
Ash Holland April 29, 2020, 12:21 p.m. UTC | #8
On Fri Apr 24, 2020 at 8:21 PM BST, Johannes Sixt wrote:
> Am 24.04.20 um 01:42 schrieb Ash Holland:
> > On Thu Apr 23, 2020 at 9:17 PM PST, Johannes Sixt wrote:
> >> Am 21.04.20 um 03:00 schrieb Ash Holland:
> >>> diff --git a/userdiff.c b/userdiff.c
> >>> index efbe05e5a..f79adb3a3 100644
> >>> --- a/userdiff.c
> >>> +++ b/userdiff.c
> >>> @@ -79,6 +79,9 @@ PATTERNS("java",
> >>>  	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
> >>>  	 "|[-+*/<>%&^|=!]="
> >>>  	 "|--|\\+\\+|<<=?|>>>?=?|&&|\\|\\|"),
> >>> +PATTERNS("markdown",
> >>> +	 "^ {0,3}#{1,6}( .*)?$",
> >>
> >> What is the purpose of making the heading text optional? Why would you
> >> want to match a sequence of hash marks without any text following it?
> > 
> > Strictly speaking, a markdown heading is allowed to be empty -- see for
> > example https://spec.commonmark.org/0.29/#example-49. I'm happy to
> > change it if you think it's more useful to show a previous heading which
> > contains text than an empty one, though.
>
> I don't know what makes sense, I don't write markdown regularly. A quick
> check shows that the sequence of hashmarks appears in the hunk header.
> Is that useful? (A genuine question!)

I think probably it would be more confusing to have Git silently ignore
empty headings, having occasionally written documents with empty
headings in the past (e.g. when I know I want some different sections,
but I don't know what to call them yet). Probably not many people would
ever run into this situation either way, though.
diff mbox series

Patch

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 508fe713c..2d0a03715 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -824,6 +824,8 @@  patterns are available:
 
 - `java` suitable for source code in the Java language.
 
+- `markdown` suitable for Markdown documents.
+
 - `matlab` suitable for source code in the MATLAB and Octave languages.
 
 - `objc` suitable for source code in the Objective-C language.
diff --git a/t/t4018-diff-funcname.sh b/t/t4018-diff-funcname.sh
index 02255a08b..9d0779757 100755
--- a/t/t4018-diff-funcname.sh
+++ b/t/t4018-diff-funcname.sh
@@ -38,6 +38,7 @@  diffpatterns="
 	golang
 	html
 	java
+	markdown
 	matlab
 	objc
 	pascal
diff --git a/t/t4018/markdown-heading-indented b/t/t4018/markdown-heading-indented
new file mode 100644
index 000000000..1991c2bd4
--- /dev/null
+++ b/t/t4018/markdown-heading-indented
@@ -0,0 +1,6 @@ 
+Indented headings are allowed, as long as the indent is no more than 3 spaces.
+
+   ### RIGHT
+
+- something
+- ChangeMe
diff --git a/t/t4018/markdown-heading-non-headings b/t/t4018/markdown-heading-non-headings
new file mode 100644
index 000000000..1f19b91d6
--- /dev/null
+++ b/t/t4018/markdown-heading-non-headings
@@ -0,0 +1,17 @@ 
+Headings can be right next to other lines of the file:
+# RIGHT
+Indents of more than four spaces make a code block:
+
+    # code comment, not heading
+
+If there's no space after the final hash, it's not a heading:
+
+#hashtag
+
+Sequences of more than 6 hashes don't make a heading:
+
+####### over-enthusiastic heading
+
+So the detected heading should be right up at the start of this file.
+
+ChangeMe
diff --git a/userdiff.c b/userdiff.c
index efbe05e5a..f79adb3a3 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -79,6 +79,9 @@  PATTERNS("java",
 	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
 	 "|[-+*/<>%&^|=!]="
 	 "|--|\\+\\+|<<=?|>>>?=?|&&|\\|\\|"),
+PATTERNS("markdown",
+	 "^ {0,3}#{1,6}( .*)?$",
+	 "[^ \t-]+"),
 PATTERNS("matlab",
 	 /*
 	  * Octave pattern is mostly the same as matlab, except that '%%%' and