userdiff: two simplifications of patterns for rust
diff mbox series

Message ID 2d32b107-9278-faa0-4fea-afe662031272@kdbg.org
State New
Headers show
Series
  • userdiff: two simplifications of patterns for rust
Related show

Commit Message

Johannes Sixt May 30, 2019, 4:44 p.m. UTC
- Do not enforce (but assume) syntactic correctness of language
  constructs that go into hunk headers: we only want to ensure that
  the keywords actually are words and not just the initial part of
  some identifier.

- In the word regex, match numbers only when they begin with a digit,
  but then be liberal in what follows, assuming that the text that is
  matched is syntactially correct.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
Am 28.05.19 um 23:01 schrieb Marc-André Lureau:
> Both of these changes were based on your recommendations. Would you
> mind sending a follow-up patch yourself?
> 
> I can send a seperate patch for the 3 extra tests.

So, here it is. Looking forward to seeing a patch with the tests.

 userdiff.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Ævar Arnfjörð Bjarmason May 30, 2019, 6:59 p.m. UTC | #1
On Thu, May 30 2019, Johannes Sixt wrote:

> - Do not enforce (but assume) syntactic correctness of language
>   constructs that go into hunk headers: we only want to ensure that
>   the keywords actually are words and not just the initial part of
>   some identifier.
>
> - In the word regex, match numbers only when they begin with a digit,
>   but then be liberal in what follows, assuming that the text that is
>   matched is syntactially correct.

I don't know if this is possible for Rust (but very much suspect so...),
but I think that in general we should aim to be more forgiving than not
with these patterns.

Because, as the history of userdiff.c shows, new keywords get introduced
into these languages, and old git versions survive for a long time. If
the syntax is otherwise fairly regular perhaps we don't need to hardcode
the list of existing keywords?
Johannes Sixt May 30, 2019, 8:32 p.m. UTC | #2
Am 30.05.19 um 20:59 schrieb Ævar Arnfjörð Bjarmason:
> 
> On Thu, May 30 2019, Johannes Sixt wrote:
> 
>> - Do not enforce (but assume) syntactic correctness of language
>>   constructs that go into hunk headers: we only want to ensure that
>>   the keywords actually are words and not just the initial part of
>>   some identifier.
>>
>> - In the word regex, match numbers only when they begin with a digit,
>>   but then be liberal in what follows, assuming that the text that is
>>   matched is syntactially correct.
> 
> I don't know if this is possible for Rust (but very much suspect so...),
> but I think that in general we should aim to be more forgiving than not
> with these patterns.

The C/C++ pattern is actually very forgiving in the hunk header pattern:
It takes every line that begins with an un-indented letter. That works
very well in in C because C does not have nested functions and it is
typical that the function definition lines are not indented. But that
breaks down with C++: indented function definitions are very common;
they happen inside class and namespace definitions. Such functions are
not picked up, and we live with that so far (at least, I do).

> Because, as the history of userdiff.c shows, new keywords get introduced
> into these languages, and old git versions survive for a long time. If
> the syntax is otherwise fairly regular perhaps we don't need to hardcode
> the list of existing keywords?

We are talking about (1) hunk header lines (not something really
important) and (2) programming languages: new keywords don't pop up
every month. Granted, inventing new languages is en vogue these days.
But really, I mean, WTH?

Having available keywords to recognize hunk header candidates helps a
lot. I thought long about a possible pattern for C++, but I gave up,
because the language is so rich and there are no suitable keywords.

-- Hannes

Patch
diff mbox series

diff --git a/userdiff.c b/userdiff.c
index 8d7e62e2a5..2bcf105caf 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -131,11 +131,10 @@  PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
 	 "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+|\\?(\\\\C-)?(\\\\M-)?."
 	 "|//=?|[-+*/<>%&^|=!]=|<<=?|>>=?|===|\\.{1,3}|::|[!=]~"),
 PATTERNS("rust",
-	 "^[\t ]*((pub(\\([^\\)]+\\))?[\t ]+)?((async|const|unsafe|extern([\t ]+\"[^\"]+\"))[\t ]+)?(struct|enum|union|mod|trait|fn|impl(<.+>)?)[ \t]+[^;]*)$",
+	 "^[\t ]*((pub(\\([^\\)]+\\))?[\t ]+)?((async|const|unsafe|extern([\t ]+\"[^\"]+\"))[\t ]+)?(struct|enum|union|mod|trait|fn|impl)[< \t]+[^;]*)$",
 	 /* -- */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+_0-9.eE]+(f32|f64|u8|u16|u32|u64|u128|usize|i8|i16|i32|i64|i128|isize)?"
-	 "|0[box]?[0-9a-fA-F_]+(u8|u16|u32|u64|u128|usize|i8|i16|i32|i64|i128|isize)?"
+	 "|[0-9][0-9_a-fA-Fiosuxz]*(\\.([0-9]*[eE][+-]?)?[0-9_fF]*)?"
 	 "|[-+*\\/<>%&^|=!:]=|<<=?|>>=?|&&|\\|\\||->|=>|\\.{2}=|\\.{3}|::"),
 PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
 	 "[={}\"]|[^={}\" \t]+"),