diff mbox series

[3/3] sequencer: use read_author_script()

Message ID 20180912101029.28052-4-phillip.wood@talktalk.net (mailing list archive)
State New, archived
Headers show
Series am/rebase: share read_author_script() | expand

Commit Message

Phillip Wood Sept. 12, 2018, 10:10 a.m. UTC
From: Phillip Wood <phillip.wood@dunelm.org.uk>

Use the new function to read the author script, updating
read_env_script() and read_author_ident(). This means we now have a
single code path that reads the author script and uses sq_dequote() to
dequote it. This fixes potential problems with user edited scripts
as read_env_script() which did not track quotes properly.

This commit also removes the fallback code for checking for a broken
author script after git is upgraded when a rebase is stopped. Now that
the parsing uses sq_dequote() it will reliably return an error if the
quoting is broken and the user will have to abort the rebase and
restart. This isn't ideal but it's a corner case and the detection of
the broken quoting could be confused by user edited author scripts.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 sequencer.c | 108 +++++++++++++++-------------------------------------
 1 file changed, 30 insertions(+), 78 deletions(-)

Comments

Eric Sunshine Sept. 12, 2018, 12:14 p.m. UTC | #1
On Wed, Sep 12, 2018 at 6:11 AM Phillip Wood <phillip.wood@talktalk.net> wrote:
> Use the new function to read the author script, updating
> read_env_script() and read_author_ident(). This means we now have a
> single code path that reads the author script and uses sq_dequote() to
> dequote it. This fixes potential problems with user edited scripts
> as read_env_script() which did not track quotes properly.
> [...]
> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
> diff --git a/sequencer.c b/sequencer.c
> @@ -723,54 +723,35 @@ int read_author_script(const char *path, char **name, char **email, char **date,
>  static int read_env_script(struct argv_array *env)
>  {
> +       strbuf_addstr(&script, "GIT_AUTHOR_NAME=");
> +       strbuf_addstr(&script, name);
> +       argv_array_push(env, script.buf);
> +       strbuf_reset(&script);
> +       strbuf_addstr(&script, "GIT_AUTHOR_EMAIL=");
> +       strbuf_addstr(&script, email);
> +       argv_array_push(env, script.buf);
> +       strbuf_reset(&script);
> +       strbuf_addstr(&script, "GIT_AUTHOR_DATE=");
> +       strbuf_addstr(&script, date);
> +       argv_array_push(env, script.buf);
> +       strbuf_release(&script);

I haven't read this series closely yet, but this caught my eye while
scanning it quickly. The above noisy code can all be collapsed to the
simpler:

    argv_array_pushf(env, "GIT_AUTHOR_NAME=%s", name);
    argv_array_pushf(env, "GIT_AUTHOR_EMAIL =%s", email);
    argv_array_pushf(env, "GIT_AUTHOR_DATE =%s", date);
Eric Sunshine Sept. 14, 2018, 12:31 a.m. UTC | #2
On Wed, Sep 12, 2018 at 6:11 AM Phillip Wood <phillip.wood@talktalk.net> wrote:
> Use the new function to read the author script, updating
> read_env_script() and read_author_ident(). This means we now have a
> single code path that reads the author script and uses sq_dequote() to
> dequote it. This fixes potential problems with user edited scripts
> as read_env_script() which did not track quotes properly.
> [...]
> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
>  /*
>   * Read a list of environment variable assignments (such as the author-script
>   * file) into an environment block. Returns -1 on error, 0 otherwise.
>   */

According to this comment, this function is capable of parsing a file
of arbitrary "NAME=Value" lines, and indeed the original code does
just that, but...

>  static int read_env_script(struct argv_array *env)
>  {
> +       char *name, *email, *date;
>
> -       if (strbuf_read_file(&script, rebase_path_author_script(), 256) <= 0)
> +       if (read_author_script(rebase_path_author_script(),
> +                              &name, &email, &date, 0))

...the new implementation is able to handle only GIT_AUTHOR_NAME,
GIT_AUTHOR_EMAIL, and GIT_AUTHOR_DATE, in exactly that order.

This seems like a pretty serious (and possibly buggy) change of
behavior, and makes the function much less useful (in general). Is it
true that it will only ever be used for files containing that limited
set of names? If so, the behavior change deserves mention in the
commit message, the function comment needs updating, and the function
itself probably ought to be renamed.

> +       strbuf_addstr(&script, "GIT_AUTHOR_NAME=");
> +       strbuf_addstr(&script, name);
> +       argv_array_push(env, script.buf);
> +       strbuf_reset(&script);
> +       strbuf_addstr(&script, "GIT_AUTHOR_EMAIL=");
> +       strbuf_addstr(&script, email);
> +       argv_array_push(env, script.buf);
> +       strbuf_reset(&script);
> +       strbuf_addstr(&script, "GIT_AUTHOR_DATE=");
> +       strbuf_addstr(&script, date);
> +       argv_array_push(env, script.buf);
> +       strbuf_release(&script);

Mentioned earlier[1], this can all collapse down to:

argv_array_pushf(env, "GIT_AUTHOR_NAME=%s", name);
argv_array_pushf(env, "GIT_AUTHOR_EMAIL=%s", email);
argv_array_pushf(env, "GIT_AUTHOR_DATE=%s", date);

However, it's unfortunate that this manual and hard-coded
reconstruction is needed at all. If you restructure the factoring of
this patch series, such ugliness can be avoided altogether. For
instance, the series could be structured like this:

1. Introduce a general-purpose function for reading a file containing
arbitrary "NAME=Value" lines (not carrying about specific key names or
their order) and returning them in some data structure (perhaps via
'string_list' as parse_key_value_squoted() in patch 2/3 does).

2. Build read_author_script() atop #1, making it expect and extract
GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, and GIT_AUTHOR_DATE (possibly in
that order, or possibly not if we don't care).

3. Retrofit existing parsers to call one of those two functions (this
step may happen over several patches). So, for instance,
read_env_script() would call the generic parser from #1, whereas
sequencer.c:read_author_ident() would call the more specific parser
from #2.

More below...

> @@ -790,54 +771,25 @@ static char *get_author(const char *message)
>  /* Read author-script and return an ident line (author <email> timestamp) */
>  static const char *read_author_ident(struct strbuf *buf)
>  {
> -       const char *keys[] = {
> -               "GIT_AUTHOR_NAME=", "GIT_AUTHOR_EMAIL=", "GIT_AUTHOR_DATE="
> -       };
> -       if (strbuf_read_file(buf, rebase_path_author_script(), 256) <= 0)
> +       if (read_author_script(rebase_path_author_script(),
> +                              &name, &email, &date, 0))
>                 return NULL;
> -       /* dequote values and construct ident line in-place */
> -       for (in = buf->buf; i < 3 && in - buf->buf < buf->len; i++) {
> -               if (!skip_prefix(in, keys[i], (const char **)&in)) {
> -                       warning(_("could not parse '%s' (looking for '%s')"),
> -                               rebase_path_author_script(), keys[i]);
> -                       return NULL;
> -               }
> -               if (!sq_dequote(in)) {
> -                       warning(_("bad quoting on %s value in '%s'"),
> -                               keys[i], rebase_path_author_script());
> -                       return NULL;
> -               }
> -       if (i < 3) {
> -               warning(_("could not parse '%s' (looking for '%s')"),
> -                       rebase_path_author_script(), keys[i]);
> -               return NULL;
> -       }

The parsing code being thrown away here does a better job of
diagnosing problems (thus helping the user figure out what went wrong)
than the new shared parser introduced by patch 2/3. The shared
function only ever reports a generic "unable to parse", whereas the
above code gets specific, saying that it was looking for a particular
key or that quoting was broken. I'd have expected the new shared
parser to encompass the best features of the existing parsers (such as
presenting better error messages).

>         /* validate date since fmt_ident() will die() on bad value */
> -       if (parse_date(val[2], &out)){
> +       if (parse_date(date, buf)){

Re-purposing the strbuf 'buf', which is passed into this function,
binds this function too tightly with its caller by assuming that the
caller will never need the original content of 'buf' anymore. Thus, it
would be better for this code continue using its own local strbuf
'out' rather than re-purposing the incoming 'buf'.

>                 warning(_("invalid date format '%s' in '%s'"),
> -                       val[2], rebase_path_author_script());
> -               strbuf_release(&out);
> +                       date, rebase_path_author_script());
> +               strbuf_release(buf);

Likewise, it's doubly odd to see this function releasing 'buf' which
it does not own.

>                 return NULL;
>         }

[1]: https://public-inbox.org/git/CAPig+cRvUr26GZyW6ecYhpwABueBqaEfZH1+JjLaqZo8+RTD6Q@mail.gmail.com/
Phillip Wood Oct. 10, 2018, 10:35 a.m. UTC | #3
On 14/09/2018 01:31, Eric Sunshine wrote:
> On Wed, Sep 12, 2018 at 6:11 AM Phillip Wood <phillip.wood@talktalk.net> wrote:
>> Use the new function to read the author script, updating
>> read_env_script() and read_author_ident(). This means we now have a
>> single code path that reads the author script and uses sq_dequote() to
>> dequote it. This fixes potential problems with user edited scripts
>> as read_env_script() which did not track quotes properly.
>> [...]
>> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
>> ---
>>  /*
>>   * Read a list of environment variable assignments (such as the author-script
>>   * file) into an environment block. Returns -1 on error, 0 otherwise.
>>   */
> 
> According to this comment, this function is capable of parsing a file
> of arbitrary "NAME=Value" lines, and indeed the original code does
> just that, but...
> 
>>  static int read_env_script(struct argv_array *env)
>>  {
>> +       char *name, *email, *date;
>>
>> -       if (strbuf_read_file(&script, rebase_path_author_script(), 256) <= 0)
>> +       if (read_author_script(rebase_path_author_script(),
>> +                              &name, &email, &date, 0))
> 
> ...the new implementation is able to handle only GIT_AUTHOR_NAME,
> GIT_AUTHOR_EMAIL, and GIT_AUTHOR_DATE, in exactly that order.
> 
> This seems like a pretty serious (and possibly buggy) change of
> behavior, and makes the function much less useful (in general). Is it
> true that it will only ever be used for files containing that limited
> set of names? If so, the behavior change deserves mention in the
> commit message, the function comment needs updating, and the function
> itself probably ought to be renamed.

You're right the change in behavior should be mentioned explicitly, I'd
not thought about it in those terms. I'm not sure if the change is
buggy, this code is what am uses for its author script handling. To me
the point of the author-script file is to set the author details, not to
set arbitrary environment variables. We have already significantly
reduced what someone can do with this file in the transition from shell
to C as we no longer support arbitrary shell code in the file. I'd
rather try and reuse the existing code from am unless someone can
demonstrate an active use for something more general. (I'm still not
sure what use editing the author-script is - it is only of use if the
rebase stops for conflicts, it cannot be used to change the author of an
arbitrary set of commits)

>> +       strbuf_addstr(&script, "GIT_AUTHOR_NAME=");
>> +       strbuf_addstr(&script, name);
>> +       argv_array_push(env, script.buf);
>> +       strbuf_reset(&script);
>> +       strbuf_addstr(&script, "GIT_AUTHOR_EMAIL=");
>> +       strbuf_addstr(&script, email);
>> +       argv_array_push(env, script.buf);
>> +       strbuf_reset(&script);
>> +       strbuf_addstr(&script, "GIT_AUTHOR_DATE=");
>> +       strbuf_addstr(&script, date);
>> +       argv_array_push(env, script.buf);
>> +       strbuf_release(&script);
> 
> Mentioned earlier[1], this can all collapse down to:
> 
> argv_array_pushf(env, "GIT_AUTHOR_NAME=%s", name);
> argv_array_pushf(env, "GIT_AUTHOR_EMAIL=%s", email);
> argv_array_pushf(env, "GIT_AUTHOR_DATE=%s", date);
> 
> However, it's unfortunate that this manual and hard-coded
> reconstruction is needed at all. If you restructure the factoring of
> this patch series, such ugliness can be avoided altogether. For
> instance, the series could be structured like this:
> 
> 1. Introduce a general-purpose function for reading a file containing
> arbitrary "NAME=Value" lines (not carrying about specific key names or
> their order) and returning them in some data structure (perhaps via
> 'string_list' as parse_key_value_squoted() in patch 2/3 does).
> 
> 2. Build read_author_script() atop #1, making it expect and extract
> GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, and GIT_AUTHOR_DATE (possibly in
> that order, or possibly not if we don't care).
> 
> 3. Retrofit existing parsers to call one of those two functions (this
> step may happen over several patches). So, for instance,
> read_env_script() would call the generic parser from #1, whereas
> sequencer.c:read_author_ident() would call the more specific parser
> from #2.

That plan requires all new code rather than reusing tried and tested
code from am. Furthermore I'm not sure it has any advantage as far as
users are concerned. My aim with this series was to try and do something
fairly simple that reused the parsing from am, rather than build a whole
new system with its own bugs.

> More below...
> 
>> @@ -790,54 +771,25 @@ static char *get_author(const char *message)
>>  /* Read author-script and return an ident line (author <email> timestamp) */
>>  static const char *read_author_ident(struct strbuf *buf)
>>  {
>> -       const char *keys[] = {
>> -               "GIT_AUTHOR_NAME=", "GIT_AUTHOR_EMAIL=", "GIT_AUTHOR_DATE="
>> -       };
>> -       if (strbuf_read_file(buf, rebase_path_author_script(), 256) <= 0)
>> +       if (read_author_script(rebase_path_author_script(),
>> +                              &name, &email, &date, 0))
>>                 return NULL;
>> -       /* dequote values and construct ident line in-place */
>> -       for (in = buf->buf; i < 3 && in - buf->buf < buf->len; i++) {
>> -               if (!skip_prefix(in, keys[i], (const char **)&in)) {
>> -                       warning(_("could not parse '%s' (looking for '%s')"),
>> -                               rebase_path_author_script(), keys[i]);
>> -                       return NULL;
>> -               }
>> -               if (!sq_dequote(in)) {
>> -                       warning(_("bad quoting on %s value in '%s'"),
>> -                               keys[i], rebase_path_author_script());
>> -                       return NULL;
>> -               }
>> -       if (i < 3) {
>> -               warning(_("could not parse '%s' (looking for '%s')"),
>> -                       rebase_path_author_script(), keys[i]);
>> -               return NULL;
>> -       }
> 
> The parsing code being thrown away here does a better job of
> diagnosing problems (thus helping the user figure out what went wrong)
> than the new shared parser introduced by patch 2/3. The shared
> function only ever reports a generic "unable to parse", whereas the
> above code gets specific, saying that it was looking for a particular
> key or that quoting was broken. I'd have expected the new shared
> parser to encompass the best features of the existing parsers (such as
> presenting better error messages).

You're right but the context is that this function is only used when the
root commit is rebased (and then only for the root commit). Everything
else goes through read_env_script() which doesn't even bother to check
if all the variables have been set or report any parsing errors.

> 
>>         /* validate date since fmt_ident() will die() on bad value */
>> -       if (parse_date(val[2], &out)){
>> +       if (parse_date(date, buf)){
> 
> Re-purposing the strbuf 'buf', which is passed into this function,
> binds this function too tightly with its caller by assuming that the
> caller will never need the original content of 'buf' anymore. Thus, it
> would be better for this code continue using its own local strbuf
> 'out' rather than re-purposing the incoming 'buf'.

That's a good point, I'll fix it.

>>                 warning(_("invalid date format '%s' in '%s'"),
>> -                       val[2], rebase_path_author_script());
>> -               strbuf_release(&out);
>> +                       date, rebase_path_author_script());
>> +               strbuf_release(buf);
> 
> Likewise, it's doubly odd to see this function releasing 'buf' which
> it does not own.
> 
>>                 return NULL;
>>         }
> 
> [1]: https://public-inbox.org/git/CAPig+cRvUr26GZyW6ecYhpwABueBqaEfZH1+JjLaqZo8+RTD6Q@mail.gmail.com/
> 

Thanks for looking at this, I'm keen to keep things simple and reuse the
am author-script parsing if possible. It is more restrictive but I'm not
sure that anyone is actually taking advantage of the flexibility offered
by the current setup and it fixes the de-quoting bugs in read_env_script().

Best Wishes

Phillip
diff mbox series

Patch

diff --git a/sequencer.c b/sequencer.c
index 5d0ff8f1c1..630741cfe0 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -723,54 +723,35 @@  int read_author_script(const char *path, char **name, char **email, char **date,
 	return retval;
 }
 
-/*
- * write_author_script() used to fail to terminate the last line with a "'" and
- * also escaped "'" incorrectly as "'\\\\''" rather than "'\\''". We check for
- * the terminating "'" on the last line to see how "'" has been escaped in case
- * git was upgraded while rebase was stopped.
- */
-static int quoting_is_broken(const char *s, size_t n)
-{
-	/* Skip any empty lines in case the file was hand edited */
-	while (n > 0 && s[--n] == '\n')
-		; /* empty */
-	if (n > 0 && s[n] != '\'')
-		return 1;
-
-	return 0;
-}
-
 /*
  * Read a list of environment variable assignments (such as the author-script
  * file) into an environment block. Returns -1 on error, 0 otherwise.
  */
 static int read_env_script(struct argv_array *env)
 {
 	struct strbuf script = STRBUF_INIT;
-	int i, count = 0, sq_bug;
-	const char *p2;
-	char *p;
+	char *name, *email, *date;
 
-	if (strbuf_read_file(&script, rebase_path_author_script(), 256) <= 0)
+	if (read_author_script(rebase_path_author_script(),
+			       &name, &email, &date, 0))
 		return -1;
-	/* write_author_script() used to quote incorrectly */
-	sq_bug = quoting_is_broken(script.buf, script.len);
-	for (p = script.buf; *p; p++)
-		if (sq_bug && skip_prefix(p, "'\\\\''", &p2))
-			strbuf_splice(&script, p - script.buf, p2 - p, "'", 1);
-		else if (skip_prefix(p, "'\\''", &p2))
-			strbuf_splice(&script, p - script.buf, p2 - p, "'", 1);
-		else if (*p == '\'')
-			strbuf_splice(&script, p-- - script.buf, 1, "", 0);
-		else if (*p == '\n') {
-			*p = '\0';
-			count++;
-		}
 
-	for (i = 0, p = script.buf; i < count; i++) {
-		argv_array_push(env, p);
-		p += strlen(p) + 1;
-	}
+	strbuf_addstr(&script, "GIT_AUTHOR_NAME=");
+	strbuf_addstr(&script, name);
+	argv_array_push(env, script.buf);
+	strbuf_reset(&script);
+	strbuf_addstr(&script, "GIT_AUTHOR_EMAIL=");
+	strbuf_addstr(&script, email);
+	argv_array_push(env, script.buf);
+	strbuf_reset(&script);
+	strbuf_addstr(&script, "GIT_AUTHOR_DATE=");
+	strbuf_addstr(&script, date);
+	argv_array_push(env, script.buf);
+	strbuf_release(&script);
+
+	free(name);
+	free(email);
+	free(date);
 
 	return 0;
 }
@@ -790,54 +771,25 @@  static char *get_author(const char *message)
 /* Read author-script and return an ident line (author <email> timestamp) */
 static const char *read_author_ident(struct strbuf *buf)
 {
-	const char *keys[] = {
-		"GIT_AUTHOR_NAME=", "GIT_AUTHOR_EMAIL=", "GIT_AUTHOR_DATE="
-	};
-	struct strbuf out = STRBUF_INIT;
-	char *in, *eol;
-	const char *val[3];
-	int i = 0;
+	char *name, *email, *date;
 
-	if (strbuf_read_file(buf, rebase_path_author_script(), 256) <= 0)
+	if (read_author_script(rebase_path_author_script(),
+			       &name, &email, &date, 0))
 		return NULL;
 
-	/* dequote values and construct ident line in-place */
-	for (in = buf->buf; i < 3 && in - buf->buf < buf->len; i++) {
-		if (!skip_prefix(in, keys[i], (const char **)&in)) {
-			warning(_("could not parse '%s' (looking for '%s')"),
-				rebase_path_author_script(), keys[i]);
-			return NULL;
-		}
-
-		eol = strchrnul(in, '\n');
-		*eol = '\0';
-		if (!sq_dequote(in)) {
-			warning(_("bad quoting on %s value in '%s'"),
-				keys[i], rebase_path_author_script());
-			return NULL;
-		}
-		val[i] = in;
-		in = eol + 1;
-	}
-
-	if (i < 3) {
-		warning(_("could not parse '%s' (looking for '%s')"),
-			rebase_path_author_script(), keys[i]);
-		return NULL;
-	}
-
 	/* validate date since fmt_ident() will die() on bad value */
-	if (parse_date(val[2], &out)){
+	if (parse_date(date, buf)){
 		warning(_("invalid date format '%s' in '%s'"),
-			val[2], rebase_path_author_script());
-		strbuf_release(&out);
+			date, rebase_path_author_script());
+		strbuf_release(buf);
 		return NULL;
 	}
 
-	strbuf_reset(&out);
-	strbuf_addstr(&out, fmt_ident(val[0], val[1], val[2], 0));
-	strbuf_swap(buf, &out);
-	strbuf_release(&out);
+	strbuf_reset(buf);
+	strbuf_addstr(buf, fmt_ident(name, email, date, 0));
+	free(name);
+	free(email);
+	free(date);
 	return buf->buf;
 }