diff mbox series

[v2] grep: provide sane default to grep_source struct

Message ID 20190516214444.191743-1-emilyshaffer@google.com (mailing list archive)
State New, archived
Headers show
Series [v2] grep: provide sane default to grep_source struct | expand

Commit Message

Emily Shaffer May 16, 2019, 9:44 p.m. UTC
grep_buffer creates a struct grep_source gs and calls grep_source()
with it. However, gs.name is null, which eventually produces a
segmentation fault in
grep_source()->grep_source_1()->show_line() when grep_opt.status_only is
not set.

This seems to be unreachable from existing commands but is reachable in
the event that someone rolls their own revision walk and neglects to set
rev_info->grep_filter->status_only. Conceivably, someone might want to
print all changed lines of all commits which match some regex.

To futureproof, give developers a heads-up if grep_source() is called
and a valid name field is expected but not given. This path should be
unreachable now, but in the future may become reachable if developers
want to add the ability to use grep_buffer() in prepared in-core data
and provide hints to the user about where the data came from.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---
I've added the BUG() line to grep_source(). I considered adding it to
grep_source_1() but didn't see much difference. Looks like
grep_source_1() exists purely as a helper to grep_source() and is never
called by anybody else... but it's the function where the crash would
happen. So I'm not sure.

I also modified the wording for the BUG("") statement a little from
jrn's suggestion to hopefully make it more explicit, but I welcome
wordsmithing.

 grep.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Jeff King May 16, 2019, 10:02 p.m. UTC | #1
On Thu, May 16, 2019 at 02:44:44PM -0700, Emily Shaffer wrote:

> grep_buffer creates a struct grep_source gs and calls grep_source()
> with it. However, gs.name is null, which eventually produces a
> segmentation fault in
> grep_source()->grep_source_1()->show_line() when grep_opt.status_only is
> not set.

Funny wrapping here.

> This seems to be unreachable from existing commands but is reachable in
> the event that someone rolls their own revision walk and neglects to set
> rev_info->grep_filter->status_only. Conceivably, someone might want to
> print all changed lines of all commits which match some regex.
> 
> To futureproof, give developers a heads-up if grep_source() is called
> and a valid name field is expected but not given. This path should be
> unreachable now, but in the future may become reachable if developers
> want to add the ability to use grep_buffer() in prepared in-core data
> and provide hints to the user about where the data came from.

So I guess this is probably my fault, as I was the one who factored out
the grep_source bits long ago (though I would also not be surprised if
it could be triggered before, too).

I think adding a BUG() is the right thing to do.

> I've added the BUG() line to grep_source(). I considered adding it to
> grep_source_1() but didn't see much difference. Looks like
> grep_source_1() exists purely as a helper to grep_source() and is never
> called by anybody else... but it's the function where the crash would
> happen. So I'm not sure.

I agree it probably doesn't matter. I'd probably stick at at the top of
grep_source_1(), just with the rationale that it could possibly catch
more cases if somebody ever refactors grep_source() to have two
different callers (e.g., imagine we later add a variant that takes more
options, but leave the old one to avoid disrupting existing callers).

> I also modified the wording for the BUG("") statement a little from
> jrn's suggestion to hopefully make it more explicit, but I welcome
> wordsmithing.
> [...]
> +		BUG("grep call which could print a name requires "
> +		    "grep_source.name be non-NULL");

It looks fine to me. The point is that nobody should ever see this. And
if they do, we should already get a file/line number telling us where
the problem is (and a coredump to poke at). So as long as it's enough to
convince a regular user that they should make a report to the mailing
list, it will have done its job.

> diff --git a/grep.c b/grep.c
> index 0d50598acd..6ceb880a8c 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -2021,6 +2021,9 @@ static int chk_hit_marker(struct grep_expr *x)
>  
>  int grep_source(struct grep_opt *opt, struct grep_source *gs)
>  {
> +	if (!opt->status_only && gs->name == NULL)
> +		BUG("grep call which could print a name requires "
> +		    "grep_source.name be non-NULL");
>  	/*
>  	 * we do not have to do the two-pass grep when we do not check
>  	 * buffer-wide "all-match".

Minor nit, but can we put a blank line between the new block and the
comment?

> @@ -2045,7 +2048,11 @@ int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
>  	struct grep_source gs;
>  	int r;
>  
> -	grep_source_init(&gs, GREP_SOURCE_BUF, NULL, NULL, NULL);
> +	/* TODO: In the future it may become desirable to pass in the name as
> +	 * an argument to grep_buffer(). At that time, "(in core)" should be
> +	 * replaced.
> +	 */
> +	grep_source_init(&gs, GREP_SOURCE_BUF, _("(in core)"), NULL, NULL);

Hmm. I don't see much point in this one, as it would just avoid
triggering our BUG(). If somebody is adding new grep_buffer() callers
that don't use status_only, wouldn't we want them to see the BUG() to
know that they need to refactor grep_buffer() to provide a name?

-Peff
Emily Shaffer May 21, 2019, 11:52 p.m. UTC | #2
On Thu, May 16, 2019 at 06:02:54PM -0400, Jeff King wrote:
> On Thu, May 16, 2019 at 02:44:44PM -0700, Emily Shaffer wrote:
> 
> > grep_buffer creates a struct grep_source gs and calls grep_source()
> > with it. However, gs.name is null, which eventually produces a
> > segmentation fault in
> > grep_source()->grep_source_1()->show_line() when grep_opt.status_only is
> > not set.
> 
> Funny wrapping here.
> 
> > This seems to be unreachable from existing commands but is reachable in
> > the event that someone rolls their own revision walk and neglects to set
> > rev_info->grep_filter->status_only. Conceivably, someone might want to
> > print all changed lines of all commits which match some regex.
> > 
> > To futureproof, give developers a heads-up if grep_source() is called
> > and a valid name field is expected but not given. This path should be
> > unreachable now, but in the future may become reachable if developers
> > want to add the ability to use grep_buffer() in prepared in-core data
> > and provide hints to the user about where the data came from.
> 
> So I guess this is probably my fault, as I was the one who factored out
> the grep_source bits long ago (though I would also not be surprised if
> it could be triggered before, too).
> 
> I think adding a BUG() is the right thing to do.
> 
> > I've added the BUG() line to grep_source(). I considered adding it to
> > grep_source_1() but didn't see much difference. Looks like
> > grep_source_1() exists purely as a helper to grep_source() and is never
> > called by anybody else... but it's the function where the crash would
> > happen. So I'm not sure.
> 
> I agree it probably doesn't matter. I'd probably stick at at the top of
> grep_source_1(), just with the rationale that it could possibly catch
> more cases if somebody ever refactors grep_source() to have two
> different callers (e.g., imagine we later add a variant that takes more
> options, but leave the old one to avoid disrupting existing callers).

I think this combined with needing a history lesson to know not to call
grep_source_1() makes a pretty strong case for moving it. I'll put the
BUG() line at the top of grep_source_1() instead.

> 
> > I also modified the wording for the BUG("") statement a little from
> > jrn's suggestion to hopefully make it more explicit, but I welcome
> > wordsmithing.
> > [...]
> > +		BUG("grep call which could print a name requires "
> > +		    "grep_source.name be non-NULL");
> 
> It looks fine to me. The point is that nobody should ever see this. And
> if they do, we should already get a file/line number telling us where
> the problem is (and a coredump to poke at). So as long as it's enough to
> convince a regular user that they should make a report to the mailing
> list, it will have done its job.
> 
> > diff --git a/grep.c b/grep.c
> > index 0d50598acd..6ceb880a8c 100644
> > --- a/grep.c
> > +++ b/grep.c
> > @@ -2021,6 +2021,9 @@ static int chk_hit_marker(struct grep_expr *x)
> >  
> >  int grep_source(struct grep_opt *opt, struct grep_source *gs)
> >  {
> > +	if (!opt->status_only && gs->name == NULL)
> > +		BUG("grep call which could print a name requires "
> > +		    "grep_source.name be non-NULL");
> >  	/*
> >  	 * we do not have to do the two-pass grep when we do not check
> >  	 * buffer-wide "all-match".
> 
> Minor nit, but can we put a blank line between the new block and the
> comment?

I made sure to include blank lines surrounding this block in its new
home.

> 
> > @@ -2045,7 +2048,11 @@ int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
> >  	struct grep_source gs;
> >  	int r;
> >  
> > -	grep_source_init(&gs, GREP_SOURCE_BUF, NULL, NULL, NULL);
> > +	/* TODO: In the future it may become desirable to pass in the name as
> > +	 * an argument to grep_buffer(). At that time, "(in core)" should be
> > +	 * replaced.
> > +	 */
> > +	grep_source_init(&gs, GREP_SOURCE_BUF, _("(in core)"), NULL, NULL);
> 
> Hmm. I don't see much point in this one, as it would just avoid
> triggering our BUG(). If somebody is adding new grep_buffer() callers
> that don't use status_only, wouldn't we want them to see the BUG() to
> know that they need to refactor grep_buffer() to provide a name?

For me, I'd rather have *some* functionality available without doing a
large refactor. The line now isn't inaccurate, and `git grep --fixed
"(in core)"` shows it's the only instance of that string in the
codebase, so it's easy to track down where it's happening. Though the
BUG() line is pretty easy to track down too.

Having a default field like this might help folks move quickly if they
just want proof of concept that their idea works without adding a
bunch of plumbing. But then, they could add it on their own if they're
that early in their feature....

Can we think of a reason anybody would want to be able to use it this
way with the placeholder string?

> 
> -Peff

Thanks for your comments Peff.

- Emily
Jonathan Nieder May 22, 2019, 12:17 a.m. UTC | #3
Hi,

Emily Shaffer wrote:
> On Thu, May 16, 2019 at 06:02:54PM -0400, Jeff King wrote:
>> On Thu, May 16, 2019 at 02:44:44PM -0700, Emily Shaffer wrote:

>>> +	/* TODO: In the future it may become desirable to pass in the name as
>>> +	 * an argument to grep_buffer(). At that time, "(in core)" should be
>>> +	 * replaced.
>>> +	 */

(micronit, likely moot: Git's multi-line comments start with "/*" on
its own line:

	/*
	 * NEEDSWORK: Passing the name in as an argument would allow
	 * "(in core)" to be replaced.
	 */

.)

>>> +	grep_source_init(&gs, GREP_SOURCE_BUF, _("(in core)"), NULL, NULL);
>>
>> Hmm. I don't see much point in this one, as it would just avoid
>> triggering our BUG(). If somebody is adding new grep_buffer() callers
>> that don't use status_only, wouldn't we want them to see the BUG() to
>> know that they need to refactor grep_buffer() to provide a name?
[...]
> Can we think of a reason anybody would want to be able to use it this
> way with the placeholder string?

I agree with Peff here: using NULL puts this in a better place with
respect to Rusty's API design manifesto[1].

With the "(in core)" default, I may end up triggering the "(in core)"
behavior in production, because there is not a clear enough signal
that my code path is making a mistake.  That's problematic because it
gives the end user a confusing experience: the end user cares where
the line comes from, not that it spent a second or two in core.

With the NULL default, *especially* after this patch, such usage would
instead trigger a BUG: line in output, meaning

- if it gets exercised in tests, the test will fail, prompting the
  patch auther to pass in a more appropriate label

- if it gets missed in tests and gets triggered in production, the error
  message makes it clear that this is a mistake so the user is likely
  to report a bug instead of assuming this is deliberate but confusing
  behavior

In that vein, this patch is very helpful, since the BUG would trip
*consistently*, not only when the grep pattern matches.  Failing
consistently like this is a huge improvement in API usability.

It would be even better if we could catch the problem at compile time,
but one thing at a time.

Thanks,
Jonathan

[1] https://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html,
https://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
diff mbox series

Patch

diff --git a/grep.c b/grep.c
index 0d50598acd..6ceb880a8c 100644
--- a/grep.c
+++ b/grep.c
@@ -2021,6 +2021,9 @@  static int chk_hit_marker(struct grep_expr *x)
 
 int grep_source(struct grep_opt *opt, struct grep_source *gs)
 {
+	if (!opt->status_only && gs->name == NULL)
+		BUG("grep call which could print a name requires "
+		    "grep_source.name be non-NULL");
 	/*
 	 * we do not have to do the two-pass grep when we do not check
 	 * buffer-wide "all-match".
@@ -2045,7 +2048,11 @@  int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
 	struct grep_source gs;
 	int r;
 
-	grep_source_init(&gs, GREP_SOURCE_BUF, NULL, NULL, NULL);
+	/* TODO: In the future it may become desirable to pass in the name as
+	 * an argument to grep_buffer(). At that time, "(in core)" should be
+	 * replaced.
+	 */
+	grep_source_init(&gs, GREP_SOURCE_BUF, _("(in core)"), NULL, NULL);
 	gs.buf = buf;
 	gs.size = size;