Message ID | 20181004113015.GA30901@manohar-ssh (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [Outreachy] git/userdiff.c fix regex pattern error | expand |
On Thu, 4 Oct 2018 at 19:56, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Ananya, > > thank you for taking the time to write this patch! > > On Thu, 4 Oct 2018, Ananya Krishna Maram wrote: > > > the forward slash character should be escaped with backslash. Fix > > Unescaped forward slash error in Python regex statements. > > > > Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com> > > That explains pretty well what you did, but I wonder why the forward slash > needs to be escaped? I would understand if we enclosed the pattern in > `/<regex>/`, as it is done e.g. in Javascript, but we do not... You are correct, the code would execute either ways. But when I came across this line, I didn't get it's meaning instantly because as per standards, forward slash has to be escaped. In fact when open source code is written according to standards, the code will be reachable to more people. Thanks, Ananya. > Thanks, > Johannes > > > --- > > userdiff.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/userdiff.c b/userdiff.c > > index f565f6731..f4ff9b9e5 100644 > > --- a/userdiff.c > > +++ b/userdiff.c > > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$", > > /* -- */ > > "[a-zA-Z_][a-zA-Z0-9_]*" > > "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?" > > - "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"), > > + "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"), > > /* -- */ > > PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$", > > /* -- */ > > -- > > 2.17.1 > > > >
On Thu, 4 Oct 2018 at 20:56, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Ananya, > > On Thu, 4 Oct 2018, Ananya Krishna Maram wrote: > > > On Thu, 4 Oct 2018 at 19:56, Johannes Schindelin > > <Johannes.Schindelin@gmx.de> wrote: > > > > > > Hi Ananya, > > > > > > thank you for taking the time to write this patch! > > > > > > On Thu, 4 Oct 2018, Ananya Krishna Maram wrote: > > > > > > > the forward slash character should be escaped with backslash. Fix > > > > Unescaped forward slash error in Python regex statements. > > > > > > > > Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com> > > > > > > That explains pretty well what you did, but I wonder why the forward slash > > > needs to be escaped? I would understand if we enclosed the pattern in > > > `/<regex>/`, as it is done e.g. in Javascript, but we do not... > > > > You are correct, the code would execute either ways. But when I came across > > this line, I didn't get it's meaning instantly because as per standards, forward > > slash has to be escaped. In fact when open source code is written according to > > standards, the code will be reachable to more people. > > I am afraid that I do not follow... Regular expressions have quite a few > special characters, but forward slashes are not among them. > > Meaning: if we had specified the regular expression thusly (in any > language that supports them to be specified in this way): > > /|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?/ > > then I would agree that this is a bug, and needs to be fixed. But we > specify it as a regular C string: > > "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?" > > In this context, the backslash has an additional, nested meaning: it > escapes special characters in a C string, too. So writing > > "\\" > > will actually result in a string consisting of a single backslash. And > > "\n" > > would specify a string consisting of a single line feed character. Some C > compilers ignore incorrectly-escaped characters, i.e. "\/" would simply > expand to the forward slash. > > However, you wanted to escape the forward slash in the regular expression. > To do that, you would have to write > > "\\/" > > i.e. specifying, in a C string, a backslash character, followed by a > forward slash character. > > However, the regular expression would then be interpreting the backslash > character as escape character in its own right, seeing the forward slash, > and replacing this sequence by a forward slash. > > But it does not need to be escaped, when you specify the regular > expression the way we do. And the way we specified it is really the > standard when specifying regular expressions in C code, i.e. *without* the > suggested backslash. Aha!. this makes total sense. I was thinking from a general regular expression point of view. But I should be thinking from C point of view and how C might interpret this newly submitted string. This explanation is very clear. Thanks for taking time to reply to my patch. From next time on, I will try to think from git project's point of view. Thanks, Ananya. > Ciao, > Johannes > > > > > Thanks, > > Ananya. > > > > > Thanks, > > > Johannes > > > > > > > --- > > > > userdiff.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/userdiff.c b/userdiff.c > > > > index f565f6731..f4ff9b9e5 100644 > > > > --- a/userdiff.c > > > > +++ b/userdiff.c > > > > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$", > > > > /* -- */ > > > > "[a-zA-Z_][a-zA-Z0-9_]*" > > > > "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?" > > > > - "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"), > > > > + "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"), > > > > /* -- */ > > > > PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$", > > > > /* -- */ > > > > -- > > > > 2.17.1 > > > > > > > > > >
Hi Ananya, thank you for taking the time to write this patch! On Thu, 4 Oct 2018, Ananya Krishna Maram wrote: > the forward slash character should be escaped with backslash. Fix > Unescaped forward slash error in Python regex statements. > > Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com> That explains pretty well what you did, but I wonder why the forward slash needs to be escaped? I would understand if we enclosed the pattern in `/<regex>/`, as it is done e.g. in Javascript, but we do not... Thanks, Johannes > --- > userdiff.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/userdiff.c b/userdiff.c > index f565f6731..f4ff9b9e5 100644 > --- a/userdiff.c > +++ b/userdiff.c > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$", > /* -- */ > "[a-zA-Z_][a-zA-Z0-9_]*" > "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?" > - "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"), > + "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"), > /* -- */ > PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$", > /* -- */ > -- > 2.17.1 > >
Hi Ananya, On Thu, 4 Oct 2018, Ananya Krishna Maram wrote: > On Thu, 4 Oct 2018 at 19:56, Johannes Schindelin > <Johannes.Schindelin@gmx.de> wrote: > > > > Hi Ananya, > > > > thank you for taking the time to write this patch! > > > > On Thu, 4 Oct 2018, Ananya Krishna Maram wrote: > > > > > the forward slash character should be escaped with backslash. Fix > > > Unescaped forward slash error in Python regex statements. > > > > > > Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com> > > > > That explains pretty well what you did, but I wonder why the forward slash > > needs to be escaped? I would understand if we enclosed the pattern in > > `/<regex>/`, as it is done e.g. in Javascript, but we do not... > > You are correct, the code would execute either ways. But when I came across > this line, I didn't get it's meaning instantly because as per standards, forward > slash has to be escaped. In fact when open source code is written according to > standards, the code will be reachable to more people. I am afraid that I do not follow... Regular expressions have quite a few special characters, but forward slashes are not among them. Meaning: if we had specified the regular expression thusly (in any language that supports them to be specified in this way): /|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?/ then I would agree that this is a bug, and needs to be fixed. But we specify it as a regular C string: "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?" In this context, the backslash has an additional, nested meaning: it escapes special characters in a C string, too. So writing "\\" will actually result in a string consisting of a single backslash. And "\n" would specify a string consisting of a single line feed character. Some C compilers ignore incorrectly-escaped characters, i.e. "\/" would simply expand to the forward slash. However, you wanted to escape the forward slash in the regular expression. To do that, you would have to write "\\/" i.e. specifying, in a C string, a backslash character, followed by a forward slash character. However, the regular expression would then be interpreting the backslash character as escape character in its own right, seeing the forward slash, and replacing this sequence by a forward slash. But it does not need to be escaped, when you specify the regular expression the way we do. And the way we specified it is really the standard when specifying regular expressions in C code, i.e. *without* the suggested backslash. Ciao, Johannes > > Thanks, > Ananya. > > > Thanks, > > Johannes > > > > > --- > > > userdiff.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/userdiff.c b/userdiff.c > > > index f565f6731..f4ff9b9e5 100644 > > > --- a/userdiff.c > > > +++ b/userdiff.c > > > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$", > > > /* -- */ > > > "[a-zA-Z_][a-zA-Z0-9_]*" > > > "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?" > > > - "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"), > > > + "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"), > > > /* -- */ > > > PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$", > > > /* -- */ > > > -- > > > 2.17.1 > > > > > > >
Hi Ananya, On Thu, 4 Oct 2018, Ananya Krishna Maram wrote: > On Thu, 4 Oct 2018 at 20:56, Johannes Schindelin > <Johannes.Schindelin@gmx.de> wrote: > > > > [... talking about the reason why a slash does not need to be escaped > > in a C string specifying a regular expression...] > > > > But it does not need to be escaped, when you specify the regular > > expression the way we do. And the way we specified it is really the > > standard when specifying regular expressions in C code, i.e. *without* the > > suggested backslash. > > Aha!. this makes total sense. I was thinking from a general regular expression > point of view. But I should be thinking from C point of view and how C > might interpret this newly submitted string. > This explanation is very clear. Thanks for taking time to reply to my > patch. From next time on, I will try to think from > git project's point of view. Of course! Thank you for taking the time to contribute this patch. Maybe you have another idea for a micro-project? Maybe there is something in Git that you wish was more convenient? Or maybe https://public-inbox.org/git/?q=leftoverbits has something that you would like to implement? Ciao, Johannes > > Thanks, > Ananya. > > > Ciao, > > Johannes > > > > > > > > Thanks, > > > Ananya. > > > > > > > Thanks, > > > > Johannes > > > > > > > > > --- > > > > > userdiff.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/userdiff.c b/userdiff.c > > > > > index f565f6731..f4ff9b9e5 100644 > > > > > --- a/userdiff.c > > > > > +++ b/userdiff.c > > > > > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$", > > > > > /* -- */ > > > > > "[a-zA-Z_][a-zA-Z0-9_]*" > > > > > "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?" > > > > > - "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"), > > > > > + "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"), > > > > > /* -- */ > > > > > PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$", > > > > > /* -- */ > > > > > -- > > > > > 2.17.1 > > > > > > > > > > > > > >
Ananya Krishna Maram <ananyakittu1997@gmail.com> writes: >> But it does not need to be escaped, when you specify the regular >> expression the way we do. And the way we specified it is really the >> standard when specifying regular expressions in C code, i.e. *without* the >> suggested backslash. > > Aha!. this makes total sense. I was thinking from a general regular expression > point of view. But I should be thinking from C point of view and how C > might interpret this newly submitted string. If you were thinking from a general regex point of view, you would never have treated '/' as anything special, though. Historically, some languages (like sed and perl) had an regexp match operator, which is denoted by enclosing a regexp inside a pair of slashes. In these languages, if you use /<regexp>/ operator, and if you want to match slash with the pattern, you somehow need a way to write an regexp that has slash in it. E.g. if you want a pattern that would match 'a' followed by '/' followed by 'b', your regexp would look like "a/b", but the regexp match operation you would write in these languages would be like /a\/b/, so that '/<regexp>/' parser can tell that the second slash is not a slash that signals the end of the match operator. And then there is an unnamed misdesigned language that has a regmatch() function, which takes a string that contains a regular expression, but somehow requires that string to begin and end with a slash for no justifiable reason ;-). If you were thinking about regexp from that brain-dead languages' point of view, yes, you should unlearn it and what Dscho gave you would make sense. C's regexp(3) library does not share such a misdesign and just takes a regular expression as a string. You would still need to follow the quoting rules of C string literals (e.g. write a literal double-quote or backslash after an escaping backslash), but of course slash is not special there.
Hi Junio, On Sun, 7 Oct 2018, Junio C Hamano wrote: > And then there is an unnamed misdesigned language that has a > regmatch() function, which takes a string that contains a regular > expression, but somehow requires that string to begin and end with a > slash for no justifiable reason ;-). It makes more sense once you realize that /<regexp>/ is a very nice syntactic construct to obtain an instance of a regular expression object, rather than a string describing one. In Perl it is not as obvious as in Javascript. But in C, we do not have objects, so the way to describe a regular expression is a string (which you have to compile into an opaque data structure before applying it). Ciao, Johannes
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > On Sun, 7 Oct 2018, Junio C Hamano wrote: > >> And then there is an unnamed misdesigned language that has a >> regmatch() function, which takes a string that contains a regular >> expression, but somehow requires that string to begin and end with a >> slash for no justifiable reason ;-). > > It makes more sense once you realize that /<regexp>/ is a very nice > syntactic construct to obtain an instance of a regular expression object, > rather than a string describing one. In Perl, qr/<regexp>/ is often how you use that syntactic construct. The unnamed misdesigned language literally uses a string whose content is /<regexp>/; iow, preg_match('/a.*b/', ...) is what you see, and there is no "regex object involved. As far as the function is concerned, it is receiving a string, that represents a regexp. There is no reason to require the pattern _string_ to begin and end with slash, but somehow they do. > But in C, we do not have objects, so the way to describe a regular > expression is a string (which you have to compile into an opaque data > structure before applying it). Absolutely. C does not have regexp object literal and the literal you write to represent a regexp is just a string that you would pass to functions like regcomp(). But at least C is not so brain-dead to require slashes on either end of that string ;-)
diff --git a/userdiff.c b/userdiff.c index f565f6731..f4ff9b9e5 100644 --- a/userdiff.c +++ b/userdiff.c @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$", /* -- */ "[a-zA-Z_][a-zA-Z0-9_]*" "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?" - "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"), + "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"), /* -- */ PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$", /* -- */
the forward slash character should be escaped with backslash. Fix Unescaped forward slash error in Python regex statements. Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com> --- userdiff.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)