diff mbox series

[1/5] grep: stop modifying buffer in strip_timestamp

Message ID YUlVsLkFGRfRqpKG@coredump.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series const-correctness in grep.c | expand

Commit Message

Jeff King Sept. 21, 2021, 3:46 a.m. UTC
When grepping for headers in commit objects, we receive individual
lines (e.g., "author Name <email> 1234 -0000"), and then strip off the
timestamp to do our match. We do so by writing a NUL byte over the
whitespace separator, and then remembering to restore it later.

We had to do it this way when this was added back in a4d7d2c6db (log
--author/--committer: really match only with name part, 2008-09-04),
because we fed the result directly to regexec(), which expects a
NUL-terminated string. But since b7d36ffca0 (regex: use regexec_buf(),
2016-09-21), we have a function which can match part of a buffer.

So instead of modifying the string, we can instead just move the "eol"
pointer, and the rest of the code will do the right thing. This will let
further patches mark more buffers as "const".

Signed-off-by: Jeff King <peff@peff.net>
---
 grep.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

Comments

Carlo Marcelo Arenas Belón Sept. 21, 2021, 5:18 a.m. UTC | #1
On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>                 switch (p->field) {
>                 case GREP_HEADER_AUTHOR:
>                 case GREP_HEADER_COMMITTER:
> -                       saved_ch = strip_timestamp(bol, &eol);
> +                       strip_timestamp(bol, &eol);

Why not something like (plus added error handling, even if it seems
the original didn't have them)?

  eol = strrchr(bol, '>');

Carlo
Eric Sunshine Sept. 21, 2021, 5:24 a.m. UTC | #2
On Tue, Sep 21, 2021 at 1:18 AM Carlo Arenas <carenas@gmail.com> wrote:
> On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> > @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> >                 switch (p->field) {
> >                 case GREP_HEADER_AUTHOR:
> >                 case GREP_HEADER_COMMITTER:
> > -                       saved_ch = strip_timestamp(bol, &eol);
> > +                       strip_timestamp(bol, &eol);
>
> Why not something like (plus added error handling, even if it seems
> the original didn't have them)?
>
>   eol = strrchr(bol, '>');

strrchr() would search backward from the NUL, not from `eol`, thus
would not be a faithful conversion (and might not be safe, though I
didn't dig through all the callers).
Carlo Marcelo Arenas Belón Sept. 21, 2021, 5:40 a.m. UTC | #3
On Mon, Sep 20, 2021 at 10:24 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Tue, Sep 21, 2021 at 1:18 AM Carlo Arenas <carenas@gmail.com> wrote:
> > On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> > > @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> > >                 switch (p->field) {
> > >                 case GREP_HEADER_AUTHOR:
> > >                 case GREP_HEADER_COMMITTER:
> > > -                       saved_ch = strip_timestamp(bol, &eol);
> > > +                       strip_timestamp(bol, &eol);
> >
> > Why not something like (plus added error handling, even if it seems
> > the original didn't have them)?
> >
> >   eol = strrchr(bol, '>');
>
> strrchr() would search backward from the NUL, not from `eol`, thus
> would not be a faithful conversion (and might not be safe, though I
> didn't dig through all the callers).

of course; I meant memrchr, which I guess will need to have also a
compat version[1]
this is the only caller for strip_timestamp()

[1] https://www.gnu.org/software/gnulib/manual/html_node/memrchr.html
Jeff King Sept. 21, 2021, 5:43 a.m. UTC | #4
On Tue, Sep 21, 2021 at 01:24:41AM -0400, Eric Sunshine wrote:

> On Tue, Sep 21, 2021 at 1:18 AM Carlo Arenas <carenas@gmail.com> wrote:
> > On Mon, Sep 20, 2021 at 9:09 PM Jeff King <peff@peff.net> wrote:
> > > @@ -971,7 +966,7 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> > >                 switch (p->field) {
> > >                 case GREP_HEADER_AUTHOR:
> > >                 case GREP_HEADER_COMMITTER:
> > > -                       saved_ch = strip_timestamp(bol, &eol);
> > > +                       strip_timestamp(bol, &eol);
> >
> > Why not something like (plus added error handling, even if it seems
> > the original didn't have them)?
> >
> >   eol = strrchr(bol, '>');
> 
> strrchr() would search backward from the NUL, not from `eol`, thus
> would not be a faithful conversion (and might not be safe, though I
> didn't dig through all the callers).

Right. The point is that we should be respecting "eol" here in the first
place. Plus the final result is the character _after_ the '>'. Plus when
it returns NULL, you'd want to leave "eol" untouched (stripping
nothing).

So yeah, I'm sure it could be rewritten around memrchr() or something,
but I doubt it would be much shorter, and the chance of introducing an
off-by-one seems non-trivial. :)

-Peff
Carlo Marcelo Arenas Belón Sept. 21, 2021, 6:42 a.m. UTC | #5
On Tue, Sep 21, 2021 at 01:43:05AM -0400, Jeff King wrote:
> 
> So yeah, I'm sure it could be rewritten around memrchr() or something,
> but I doubt it would be much shorter, and the chance of introducing an
> off-by-one seems non-trivial. :)

Considering I am writing it, it is most likely warranted ;)

but it doesn't look that bad IMHO

Carlo

PS. I tested it in macOS with the compatibility layer that will be needed
------ 8> -------
Subject: [PATCH] grep: retire strip_timestamp()

After recent changes, the name is no longer valid, as the function
doesn't strip anything.

Having the code in the main function also helps with readability

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/grep.c b/grep.c
index 5b1f2da4d3..56fd86a7d8 100644
--- a/grep.c
+++ b/grep.c
@@ -922,18 +922,6 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
 	return hit;
 }
 
-static void strip_timestamp(char *bol, char **eol_p)
-{
-	char *eol = *eol_p;
-
-	while (bol < --eol) {
-		if (*eol != '>')
-			continue;
-		*eol_p = ++eol;
-		break;
-	}
-}
-
 static struct {
 	const char *field;
 	size_t len;
@@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 		bol += len;
 		switch (p->field) {
 		case GREP_HEADER_AUTHOR:
-		case GREP_HEADER_COMMITTER:
-			strip_timestamp(bol, &eol);
+		case GREP_HEADER_COMMITTER: {
+			char *em = memrchr(bol, '>', eol - bol);
+			if (em)
+				eol = em + 1;
 			break;
+		}
 		default:
 			break;
 		}
René Scharfe Sept. 21, 2021, 7:37 a.m. UTC | #6
Am 21.09.21 um 08:42 schrieb Carlo Marcelo Arenas Belón:
> On Tue, Sep 21, 2021 at 01:43:05AM -0400, Jeff King wrote:
>>
>> So yeah, I'm sure it could be rewritten around memrchr() or something,
>> but I doubt it would be much shorter, and the chance of introducing an
>> off-by-one seems non-trivial. :)
>
> Considering I am writing it, it is most likely warranted ;)
>
> but it doesn't look that bad IMHO
>
> Carlo
>
> PS. I tested it in macOS with the compatibility layer that will be needed

Right; memrchr is a GNU extension.  We'd need a compat/ implementation to
be able to use it.

> ------ 8> -------
> Subject: [PATCH] grep: retire strip_timestamp()
>
> After recent changes, the name is no longer valid, as the function
> doesn't strip anything.

It still does; the input string slice between bol and eol contains a
trailing timestamp and the output slice doesn't.

>
> Having the code in the main function also helps with readability
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
>  grep.c | 19 +++++--------------
>  1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index 5b1f2da4d3..56fd86a7d8 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -922,18 +922,6 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
>  	return hit;
>  }
>
> -static void strip_timestamp(char *bol, char **eol_p)
> -{
> -	char *eol = *eol_p;
> -
> -	while (bol < --eol) {
> -		if (*eol != '>')
> -			continue;
> -		*eol_p = ++eol;
> -		break;
> -	}
> -}
> -
>  static struct {
>  	const char *field;
>  	size_t len;
> @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>  		bol += len;
>  		switch (p->field) {
>  		case GREP_HEADER_AUTHOR:
> -		case GREP_HEADER_COMMITTER:
> -			strip_timestamp(bol, &eol);
> +		case GREP_HEADER_COMMITTER: {
> +			char *em = memrchr(bol, '>', eol - bol);
> +			if (em)
> +				eol = em + 1;

The old code documents the intent via the function name.  The new one
goes into the nitty-gritty without further explanation, which I find
harder to read.

>  			break;
> +		}
>  		default:
>  			break;
>  		}
>
Jeff King Sept. 21, 2021, 2:24 p.m. UTC | #7
On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:

> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> >  		bol += len;
> >  		switch (p->field) {
> >  		case GREP_HEADER_AUTHOR:
> > -		case GREP_HEADER_COMMITTER:
> > -			strip_timestamp(bol, &eol);
> > +		case GREP_HEADER_COMMITTER: {
> > +			char *em = memrchr(bol, '>', eol - bol);
> > +			if (em)
> > +				eol = em + 1;
> 
> The old code documents the intent via the function name.  The new one
> goes into the nitty-gritty without further explanation, which I find
> harder to read.

Agreed. I do think the conversion is functionally correct, but it
doesn't strike me as worth the change.

-Peff
Ævar Arnfjörð Bjarmason Sept. 21, 2021, 9:02 p.m. UTC | #8
On Tue, Sep 21 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
>
>> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>> >  		bol += len;
>> >  		switch (p->field) {
>> >  		case GREP_HEADER_AUTHOR:
>> > -		case GREP_HEADER_COMMITTER:
>> > -			strip_timestamp(bol, &eol);
>> > +		case GREP_HEADER_COMMITTER: {
>> > +			char *em = memrchr(bol, '>', eol - bol);
>> > +			if (em)
>> > +				eol = em + 1;
>> 
>> The old code documents the intent via the function name.  The new one
>> goes into the nitty-gritty without further explanation, which I find
>> harder to read.
>
> Agreed. I do think the conversion is functionally correct, but it
> doesn't strike me as worth the change.

As far as some general improvement in thish area it seems to me that
this whole subthread is losing the forest for the trees.

We're operating a buffer that has "\n"'s in it, and then we loop and
split it up one line at a time, often just to match author OR
committer.

We then have things like commit_rewrite_person() in revision.c that's
preparing a "fake" commit just for grep.c's use, just for it to strip
those headers down again.

A real improvement in this area IMO would be trying to bridge the gap
between parse_commit_buffer() and grep.c's match_one_pattern().

I.e. we've even parsed this once before in commit.c's
parse_commit_date() by the time it gets to grep.c, now we're just doing
it again.

It probably makes sense to split up that commit.c code into something
that can give you structured output, i.e. headers with types and
start/end points for interesting data, then in grep.c we won't need a
strip_anything(), or strrchr() or memrchr() or whatever.

It would also be a lot faster for grepping if we could offload more of
this work to the regex engine, particularly if we've got a more capable
engine like PCREv2.

In many cases we're splitting lines ourselves, when we could have the
engine work in a multi-line mode, or translate the user's --author match
into something that can match the raw commit header. So have an implicit
/^author /m anchor if we're matching author headers, instead of having
grep.c string-twiddle that around one line at a time.

Just my 0.02.
Jeff King Sept. 22, 2021, 8:20 p.m. UTC | #9
On Tue, Sep 21, 2021 at 11:02:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Tue, Sep 21 2021, Jeff King wrote:
> 
> > On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
> >
> >> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> >> >  		bol += len;
> >> >  		switch (p->field) {
> >> >  		case GREP_HEADER_AUTHOR:
> >> > -		case GREP_HEADER_COMMITTER:
> >> > -			strip_timestamp(bol, &eol);
> >> > +		case GREP_HEADER_COMMITTER: {
> >> > +			char *em = memrchr(bol, '>', eol - bol);
> >> > +			if (em)
> >> > +				eol = em + 1;
> >> 
> >> The old code documents the intent via the function name.  The new one
> >> goes into the nitty-gritty without further explanation, which I find
> >> harder to read.
> >
> > Agreed. I do think the conversion is functionally correct, but it
> > doesn't strike me as worth the change.
> 
> As far as some general improvement in thish area it seems to me that
> this whole subthread is losing the forest for the trees.

I'd definitely agree with that. :)

> It probably makes sense to split up that commit.c code into something
> that can give you structured output, i.e. headers with types and
> start/end points for interesting data, then in grep.c we won't need a
> strip_anything(), or strrchr() or memrchr() or whatever.

I don't disagree with any of this, either, but I think it's a separate
(and much more complicated) topic than what this series is dealing with.
So my preference would be to take this as an immediate improvement, and
let anything like that get built on top.

> It would also be a lot faster for grepping if we could offload more of
> this work to the regex engine, particularly if we've got a more capable
> engine like PCREv2.
> 
> In many cases we're splitting lines ourselves, when we could have the
> engine work in a multi-line mode, or translate the user's --author match
> into something that can match the raw commit header. So have an implicit
> /^author /m anchor if we're matching author headers, instead of having
> grep.c string-twiddle that around one line at a time.

Ditto here. Multi-line matching may make things a lot more efficient,
but I think is out of scope for this series. That seems like an
interesting area for future work.

-Peff
Ævar Arnfjörð Bjarmason Sept. 23, 2021, 12:53 a.m. UTC | #10
On Wed, Sep 22 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 11:02:31PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> 
>> On Tue, Sep 21 2021, Jeff King wrote:
>> 
>> > On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
>> >
>> >> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>> >> >  		bol += len;
>> >> >  		switch (p->field) {
>> >> >  		case GREP_HEADER_AUTHOR:
>> >> > -		case GREP_HEADER_COMMITTER:
>> >> > -			strip_timestamp(bol, &eol);
>> >> > +		case GREP_HEADER_COMMITTER: {
>> >> > +			char *em = memrchr(bol, '>', eol - bol);
>> >> > +			if (em)
>> >> > +				eol = em + 1;
>> >> 
>> >> The old code documents the intent via the function name.  The new one
>> >> goes into the nitty-gritty without further explanation, which I find
>> >> harder to read.
>> >
>> > Agreed. I do think the conversion is functionally correct, but it
>> > doesn't strike me as worth the change.
>> 
>> As far as some general improvement in thish area it seems to me that
>> this whole subthread is losing the forest for the trees.
>
> I'd definitely agree with that. :)
>
>> It probably makes sense to split up that commit.c code into something
>> that can give you structured output, i.e. headers with types and
>> start/end points for interesting data, then in grep.c we won't need a
>> strip_anything(), or strrchr() or memrchr() or whatever.
>
> I don't disagree with any of this, either, but I think it's a separate
> (and much more complicated) topic than what this series is dealing with.
> So my preference would be to take this as an immediate improvement, and
> let anything like that get built on top.
>
>> It would also be a lot faster for grepping if we could offload more of
>> this work to the regex engine, particularly if we've got a more capable
>> engine like PCREv2.
>> 
>> In many cases we're splitting lines ourselves, when we could have the
>> engine work in a multi-line mode, or translate the user's --author match
>> into something that can match the raw commit header. So have an implicit
>> /^author /m anchor if we're matching author headers, instead of having
>> grep.c string-twiddle that around one line at a time.
>
> Ditto here. Multi-line matching may make things a lot more efficient,
> but I think is out of scope for this series. That seems like an
> interesting area for future work.

Indeed, *way* outside this series. I'm I think the change you suggested
here makes sense for this change, just pointing out that for any
follow-up changes it's much more worthwhile to consider a few more
stackframes than just the 1-2 that involve those strings in that
particular form.
diff mbox series

Patch

diff --git a/grep.c b/grep.c
index 79598f245f..5b1f2da4d3 100644
--- a/grep.c
+++ b/grep.c
@@ -922,20 +922,16 @@  static int patmatch(struct grep_pat *p, char *line, char *eol,
 	return hit;
 }
 
-static int strip_timestamp(char *bol, char **eol_p)
+static void strip_timestamp(char *bol, char **eol_p)
 {
 	char *eol = *eol_p;
-	int ch;
 
 	while (bol < --eol) {
 		if (*eol != '>')
 			continue;
 		*eol_p = ++eol;
-		ch = *eol;
-		*eol = '\0';
-		return ch;
+		break;
 	}
-	return 0;
 }
 
 static struct {
@@ -952,7 +948,6 @@  static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 			     regmatch_t *pmatch, int eflags)
 {
 	int hit = 0;
-	int saved_ch = 0;
 	const char *start = bol;
 
 	if ((p->token != GREP_PATTERN) &&
@@ -971,7 +966,7 @@  static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 		switch (p->field) {
 		case GREP_HEADER_AUTHOR:
 		case GREP_HEADER_COMMITTER:
-			saved_ch = strip_timestamp(bol, &eol);
+			strip_timestamp(bol, &eol);
 			break;
 		default:
 			break;
@@ -1021,8 +1016,6 @@  static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 				goto again;
 		}
 	}
-	if (p->token == GREP_PATTERN_HEAD && saved_ch)
-		*eol = saved_ch;
 	if (hit) {
 		pmatch[0].rm_so += bol - start;
 		pmatch[0].rm_eo += bol - start;