[Outreachy] git/userdiff.c fix regex pattern error
diff mbox series

Message ID 20181004113015.GA30901@manohar-ssh
State New
Headers show
Series
  • [Outreachy] git/userdiff.c fix regex pattern error
Related show

Commit Message

Ananya Krishna Maram Oct. 4, 2018, 11:30 a.m. UTC
the forward slash character should be escaped with backslash. Fix
Unescaped forward slash error in Python regex statements.

Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com>
---
 userdiff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Ananya Krishna Maram Oct. 4, 2018, 9:35 a.m. UTC | #1
On Thu, 4 Oct 2018 at 19:56, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Ananya,
>
> thank you for taking the time to write this patch!
>
> On Thu, 4 Oct 2018, Ananya Krishna Maram wrote:
>
> > the forward slash character should be escaped with backslash. Fix
> > Unescaped forward slash error in Python regex statements.
> >
> > Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com>
>
> That explains pretty well what you did, but I wonder why the forward slash
> needs to be escaped? I would understand if we enclosed the pattern in
> `/<regex>/`, as it is done e.g. in Javascript, but we do not...

You are correct, the code would execute either ways. But when I came across
this line, I didn't get it's meaning instantly because as per standards, forward
slash has to be escaped. In fact when open source code is written according to
standards, the code will be reachable to more people.

Thanks,
Ananya.

> Thanks,
> Johannes
>
> > ---
> >  userdiff.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/userdiff.c b/userdiff.c
> > index f565f6731..f4ff9b9e5 100644
> > --- a/userdiff.c
> > +++ b/userdiff.c
> > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$",
> >        /* -- */
> >        "[a-zA-Z_][a-zA-Z0-9_]*"
> >        "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
> > -      "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
> > +      "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"),
> >        /* -- */
> >  PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
> >        /* -- */
> > --
> > 2.17.1
> >
> >
Ananya Krishna Maram Oct. 4, 2018, 10:39 a.m. UTC | #2
On Thu, 4 Oct 2018 at 20:56, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Ananya,
>
> On Thu, 4 Oct 2018, Ananya Krishna Maram wrote:
>
> > On Thu, 4 Oct 2018 at 19:56, Johannes Schindelin
> > <Johannes.Schindelin@gmx.de> wrote:
> > >
> > > Hi Ananya,
> > >
> > > thank you for taking the time to write this patch!
> > >
> > > On Thu, 4 Oct 2018, Ananya Krishna Maram wrote:
> > >
> > > > the forward slash character should be escaped with backslash. Fix
> > > > Unescaped forward slash error in Python regex statements.
> > > >
> > > > Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com>
> > >
> > > That explains pretty well what you did, but I wonder why the forward slash
> > > needs to be escaped? I would understand if we enclosed the pattern in
> > > `/<regex>/`, as it is done e.g. in Javascript, but we do not...
> >
> > You are correct, the code would execute either ways. But when I came across
> > this line, I didn't get it's meaning instantly because as per standards, forward
> > slash has to be escaped. In fact when open source code is written according to
> > standards, the code will be reachable to more people.
>
> I am afraid that I do not follow... Regular expressions have quite a few
> special characters, but forward slashes are not among them.
>
> Meaning: if we had specified the regular expression thusly (in any
> language that supports them to be specified in this way):
>
>         /|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?/
>
> then I would agree that this is a bug, and needs to be fixed. But we
> specify it as a regular C string:
>
>         "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"
>
> In this context, the backslash has an additional, nested meaning: it
> escapes special characters in a C string, too. So writing
>
>         "\\"
>
> will actually result in a string consisting of a single backslash. And
>
>         "\n"
>
> would specify a string consisting of a single line feed character. Some C
> compilers ignore incorrectly-escaped characters, i.e. "\/" would simply
> expand to the forward slash.
>
> However, you wanted to escape the forward slash in the regular expression.
> To do that, you would have to write
>
>         "\\/"
>
> i.e. specifying, in a C string, a backslash character, followed by a
> forward slash character.
>
> However, the regular expression would then be interpreting the backslash
> character as escape character in its own right, seeing the forward slash,
> and replacing this sequence by a forward slash.
>
> But it does not need to be escaped, when you specify the regular
> expression the way we do. And the way we specified it is really the
> standard when specifying regular expressions in C code, i.e. *without* the
> suggested backslash.

Aha!. this makes total sense. I was thinking from a general regular expression
point of view. But I should be thinking from C point of view and how C
might interpret this newly submitted string.
This explanation is very clear. Thanks for taking time to reply to my
patch. From next time on, I will try to think from
git project's point of view.

Thanks,
Ananya.

> Ciao,
> Johannes
>
> >
> > Thanks,
> > Ananya.
> >
> > > Thanks,
> > > Johannes
> > >
> > > > ---
> > > >  userdiff.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/userdiff.c b/userdiff.c
> > > > index f565f6731..f4ff9b9e5 100644
> > > > --- a/userdiff.c
> > > > +++ b/userdiff.c
> > > > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$",
> > > >        /* -- */
> > > >        "[a-zA-Z_][a-zA-Z0-9_]*"
> > > >        "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
> > > > -      "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
> > > > +      "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"),
> > > >        /* -- */
> > > >  PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
> > > >        /* -- */
> > > > --
> > > > 2.17.1
> > > >
> > > >
> >
Johannes Schindelin Oct. 4, 2018, 2:26 p.m. UTC | #3
Hi Ananya,

thank you for taking the time to write this patch!

On Thu, 4 Oct 2018, Ananya Krishna Maram wrote:

> the forward slash character should be escaped with backslash. Fix
> Unescaped forward slash error in Python regex statements.
> 
> Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com>

That explains pretty well what you did, but I wonder why the forward slash
needs to be escaped? I would understand if we enclosed the pattern in
`/<regex>/`, as it is done e.g. in Javascript, but we do not...

Thanks,
Johannes

> ---
>  userdiff.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/userdiff.c b/userdiff.c
> index f565f6731..f4ff9b9e5 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$",
>  	 /* -- */
>  	 "[a-zA-Z_][a-zA-Z0-9_]*"
>  	 "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
> -	 "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
> +	 "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"),
>  	 /* -- */
>  PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
>  	 /* -- */
> -- 
> 2.17.1
> 
>
Johannes Schindelin Oct. 4, 2018, 3:26 p.m. UTC | #4
Hi Ananya,

On Thu, 4 Oct 2018, Ananya Krishna Maram wrote:

> On Thu, 4 Oct 2018 at 19:56, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > Hi Ananya,
> >
> > thank you for taking the time to write this patch!
> >
> > On Thu, 4 Oct 2018, Ananya Krishna Maram wrote:
> >
> > > the forward slash character should be escaped with backslash. Fix
> > > Unescaped forward slash error in Python regex statements.
> > >
> > > Signed-off-by: Ananya Krishna Maram<ananyakittu1997@gmail.com>
> >
> > That explains pretty well what you did, but I wonder why the forward slash
> > needs to be escaped? I would understand if we enclosed the pattern in
> > `/<regex>/`, as it is done e.g. in Javascript, but we do not...
> 
> You are correct, the code would execute either ways. But when I came across
> this line, I didn't get it's meaning instantly because as per standards, forward
> slash has to be escaped. In fact when open source code is written according to
> standards, the code will be reachable to more people.

I am afraid that I do not follow... Regular expressions have quite a few
special characters, but forward slashes are not among them.

Meaning: if we had specified the regular expression thusly (in any
language that supports them to be specified in this way):

	/|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?/

then I would agree that this is a bug, and needs to be fixed. But we
specify it as a regular C string:

	"|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"

In this context, the backslash has an additional, nested meaning: it
escapes special characters in a C string, too. So writing

	"\\"

will actually result in a string consisting of a single backslash. And

	"\n"

would specify a string consisting of a single line feed character. Some C
compilers ignore incorrectly-escaped characters, i.e. "\/" would simply
expand to the forward slash.

However, you wanted to escape the forward slash in the regular expression.
To do that, you would have to write

	"\\/"

i.e. specifying, in a C string, a backslash character, followed by a
forward slash character.

However, the regular expression would then be interpreting the backslash
character as escape character in its own right, seeing the forward slash,
and replacing this sequence by a forward slash.

But it does not need to be escaped, when you specify the regular
expression the way we do. And the way we specified it is really the
standard when specifying regular expressions in C code, i.e. *without* the
suggested backslash.

Ciao,
Johannes

> 
> Thanks,
> Ananya.
> 
> > Thanks,
> > Johannes
> >
> > > ---
> > >  userdiff.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/userdiff.c b/userdiff.c
> > > index f565f6731..f4ff9b9e5 100644
> > > --- a/userdiff.c
> > > +++ b/userdiff.c
> > > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$",
> > >        /* -- */
> > >        "[a-zA-Z_][a-zA-Z0-9_]*"
> > >        "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
> > > -      "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
> > > +      "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"),
> > >        /* -- */
> > >  PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
> > >        /* -- */
> > > --
> > > 2.17.1
> > >
> > >
>
Johannes Schindelin Oct. 4, 2018, 7:42 p.m. UTC | #5
Hi Ananya,

On Thu, 4 Oct 2018, Ananya Krishna Maram wrote:

> On Thu, 4 Oct 2018 at 20:56, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > [... talking about the reason why a slash does not need to be escaped
> > in a C string specifying a regular expression...]
> >
> > But it does not need to be escaped, when you specify the regular
> > expression the way we do. And the way we specified it is really the
> > standard when specifying regular expressions in C code, i.e. *without* the
> > suggested backslash.
> 
> Aha!. this makes total sense. I was thinking from a general regular expression
> point of view. But I should be thinking from C point of view and how C
> might interpret this newly submitted string.
> This explanation is very clear. Thanks for taking time to reply to my
> patch. From next time on, I will try to think from
> git project's point of view.

Of course! Thank you for taking the time to contribute this patch.

Maybe you have another idea for a micro-project? Maybe there is something
in Git that you wish was more convenient? Or maybe
https://public-inbox.org/git/?q=leftoverbits has something that you would
like to implement?

Ciao,
Johannes

> 
> Thanks,
> Ananya.
> 
> > Ciao,
> > Johannes
> >
> > >
> > > Thanks,
> > > Ananya.
> > >
> > > > Thanks,
> > > > Johannes
> > > >
> > > > > ---
> > > > >  userdiff.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/userdiff.c b/userdiff.c
> > > > > index f565f6731..f4ff9b9e5 100644
> > > > > --- a/userdiff.c
> > > > > +++ b/userdiff.c
> > > > > @@ -123,7 +123,7 @@ PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$",
> > > > >        /* -- */
> > > > >        "[a-zA-Z_][a-zA-Z0-9_]*"
> > > > >        "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
> > > > > -      "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
> > > > > +      "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"),
> > > > >        /* -- */
> > > > >  PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
> > > > >        /* -- */
> > > > > --
> > > > > 2.17.1
> > > > >
> > > > >
> > >
>
Junio C Hamano Oct. 6, 2018, 11:41 p.m. UTC | #6
Ananya Krishna Maram <ananyakittu1997@gmail.com> writes:

>> But it does not need to be escaped, when you specify the regular
>> expression the way we do. And the way we specified it is really the
>> standard when specifying regular expressions in C code, i.e. *without* the
>> suggested backslash.
>
> Aha!. this makes total sense. I was thinking from a general regular expression
> point of view. But I should be thinking from C point of view and how C
> might interpret this newly submitted string.

If you were thinking from a general regex point of view, you would
never have treated '/' as anything special, though.

Historically, some languages (like sed and perl) had an regexp match
operator, which is denoted by enclosing a regexp inside a pair of
slashes.  In these languages, if you use /<regexp>/ operator, and if
you want to match slash with the pattern, you somehow need a way to
write an regexp that has slash in it.  E.g. if you want a pattern
that would match 'a' followed by '/' followed by 'b', your regexp
would look like "a/b", but the regexp match operation you would
write in these languages would be like /a\/b/, so that '/<regexp>/'
parser can tell that the second slash is not a slash that signals
the end of the match operator.

And then there is an unnamed misdesigned language that has a
regmatch() function, which takes a string that contains a regular
expression, but somehow requires that string to begin and end with a
slash for no justifiable reason ;-).  If you were thinking about
regexp from that brain-dead languages' point of view, yes, you
should unlearn it and what Dscho gave you would make sense.

C's regexp(3) library does not share such a misdesign and just takes
a regular expression as a string.  You would still need to follow
the quoting rules of C string literals (e.g. write a literal
double-quote or backslash after an escaping backslash), but of
course slash is not special there.
Johannes Schindelin Oct. 12, 2018, 7:57 a.m. UTC | #7
Hi Junio,

On Sun, 7 Oct 2018, Junio C Hamano wrote:

> And then there is an unnamed misdesigned language that has a
> regmatch() function, which takes a string that contains a regular
> expression, but somehow requires that string to begin and end with a
> slash for no justifiable reason ;-).

It makes more sense once you realize that /<regexp>/ is a very nice
syntactic construct to obtain an instance of a regular expression object,
rather than a string describing one.

In Perl it is not as obvious as in Javascript.

But in C, we do not have objects, so the way to describe a regular
expression is a string (which you have to compile into an opaque data
structure before applying it).

Ciao,
Johannes
Junio C Hamano Oct. 12, 2018, 8:41 a.m. UTC | #8
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Sun, 7 Oct 2018, Junio C Hamano wrote:
>
>> And then there is an unnamed misdesigned language that has a
>> regmatch() function, which takes a string that contains a regular
>> expression, but somehow requires that string to begin and end with a
>> slash for no justifiable reason ;-).
>
> It makes more sense once you realize that /<regexp>/ is a very nice
> syntactic construct to obtain an instance of a regular expression object,
> rather than a string describing one.

In Perl, qr/<regexp>/ is often how you use that syntactic construct.

The unnamed misdesigned language literally uses a string whose
content is /<regexp>/; iow, 

	preg_match('/a.*b/', ...)

is what you see, and there is no "regex object involved.  As far as
the function is concerned, it is receiving a string, that represents
a regexp.  There is no reason to require the pattern _string_ to
begin and end with slash, but somehow they do.

> But in C, we do not have objects, so the way to describe a regular
> expression is a string (which you have to compile into an opaque data
> structure before applying it).

Absolutely.  C does not have regexp object literal and the literal
you write to represent a regexp is just a string that you would pass
to functions like regcomp().  But at least C is not so brain-dead to
require slashes on either end of that string ;-)

Patch
diff mbox series

diff --git a/userdiff.c b/userdiff.c
index f565f6731..f4ff9b9e5 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -123,7 +123,7 @@  PATTERNS("python", "^[ \t]*((class|def)[ \t].*)$",
 	 /* -- */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
-	 "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"),
+	 "|[-+*\/<>%&^|=!]=|\/\/=?|<<=?|>>=?|\\*\\*=?"),
 	 /* -- */
 PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
 	 /* -- */