diff mbox series

[11/15] find multi-byte comment chars in unterminated buffers

Message ID 20240307092638.GK2080210@coredump.intra.peff.net (mailing list archive)
State Superseded
Headers show
Series allow multi-byte core.commentChar | expand

Commit Message

Jeff King March 7, 2024, 9:26 a.m. UTC
As with the previous patch, we need to swap out single-byte matching for
something like starts_with() to match all bytes of a multi-byte comment
character. But for cases where the buffer is not NUL-terminated (and we
instead have an explicit size or end pointer), it's not safe to use
starts_with(), as it might walk off the end of the buffer.

Let's introduce a new starts_with_mem() that does the same thing but
also accepts the length of the "haystack" str and makes sure not to walk
past it.

Note that in most cases the existing code did not need a length check at
all, since it was written in a way that knew we had at least one byte
available (and that was all we checked). So I had to read each one to
find the appropriate bounds. The one exception is sequencer.c's
add_commented_lines(), where we can actually get rid of the length
check. Just like starts_with(), our starts_with_mem() handles an empty
haystack variable by not matching (assuming a non-empty prefix).

A few notes on the implementation of starts_with_mem():

  - it would be equally correct to take an "end" pointer (and indeed,
    many of the callers have this and have to subtract to come up with
    the length). I think taking a ptr/size combo is a more usual
    interface for our codebase, though, and has the added benefit that
    the function signature makes it harder to mix up the three
    parameters.

  - we could obviously build starts_with() on top of this by passing
    strlen(str) as the length. But it's possible that starts_with() is a
    relatively hot code path, and it should not pay that penalty (it can
    generally return an answer proportional to the size of the prefix,
    not the whole string).

  - it naively feels like xstrncmpz() should be able to do the same
    thing, but that's not quite true. If you pass the length of the
    haystack buffer, then strncmp() finds that a shorter prefix string
    is "less than" than the haystack, even if the haystack starts with
    the prefix. If you pass the length of the prefix, then you risk
    reading past the end of the haystack if it is shorter than the
    prefix. So I think we really do need a new function.

Signed-off-by: Jeff King <peff@peff.net>
---
Arguably starts_with() and this new function should both be inlined,
like we do for skip_prefix(), but I think that's out of scope for this
series.

And it's possible I was simply too dumb to figure out xstrncmpz() here.
I'm waiting for René to show up and tell me how to do it. ;)

IMHO this is the trickiest commit of the whole series, as it would be
easy to get the length computations subtly wrong.

 commit.c    |  3 ++-
 sequencer.c |  4 ++--
 strbuf.c    | 11 +++++++++++
 strbuf.h    |  1 +
 trailer.c   |  4 ++--
 5 files changed, 18 insertions(+), 5 deletions(-)

Comments

Jeff King March 7, 2024, 11:08 a.m. UTC | #1
On Thu, Mar 07, 2024 at 04:26:38AM -0500, Jeff King wrote:

> IMHO this is the trickiest commit of the whole series, as it would be
> easy to get the length computations subtly wrong.

And sure enough...

> diff --git a/trailer.c b/trailer.c
> index fe18faf6c5..f59c90b4b5 100644
> --- a/trailer.c
> +++ b/trailer.c
> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>  
>  	/* The first paragraph is the title and cannot be trailers */
>  	for (s = buf; s < buf + len; s = next_line(s)) {
> -		if (s[0] == comment_line_char)
> +		if (starts_with_mem(s, buf + len - s, comment_line_str))
>  			continue;
>  		if (is_blank_line(s))
>  			break;
> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>  		const char **p;
>  		ssize_t separator_pos;
>  
> -		if (bol[0] == comment_line_char) {
> +		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {
>  			non_trailer_lines += possible_continuation_lines;
>  			possible_continuation_lines = 0;
>  			continue;

This second hunk needs:

diff --git a/trailer.c b/trailer.c
index f59c90b4b5..fdb0b8137e 100644
--- a/trailer.c
+++ b/trailer.c
@@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
 		const char **p;
 		ssize_t separator_pos;
 
-		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {
+		if (starts_with_mem(bol, buf + len - bol, comment_line_str)) {
 			non_trailer_lines += possible_continuation_lines;
 			possible_continuation_lines = 0;
 			continue;

I was trying to bound the size based on the loop, which is:

          for (l = last_line(buf, len);
               l >= end_of_title;
               l = last_line(buf, l)) {
                  const char *bol = buf + l;

but I misread "end_of_title" as an upper bound, not a lower one. Which
makes sense because we're iterating backwards over the lines. So I
suppose we could bound it by the previous "bol" value. But in practice,
your prefix won't cross such a boundary anyway, as it won't have a
newline in it (maybe that's something we should enforce? I guess you
could set core.commentChar to '\n' even before my series, which would be
slightly insane).

So just bounding ourselves to "buf + len" seems reasonable, as that
makes sure we don't step outside the buffer passed into the function.

Curiously, this was found by the sanitizer job in CI, where UBSan
complains of integer overflow in the pointer computation. I had run with
both ASan/UBSan locally, but just using gcc, which doesn't seem to find
it (the CI job uses clang). So I'll that to my mental tally of "clang
seems to be better with sanitizers".

-Peff
René Scharfe March 7, 2024, 7:41 p.m. UTC | #2
Am 07.03.24 um 12:08 schrieb Jeff King:
> On Thu, Mar 07, 2024 at 04:26:38AM -0500, Jeff King wrote:
>
>> IMHO this is the trickiest commit of the whole series, as it would be
>> easy to get the length computations subtly wrong.
>
> And sure enough...
>
>> diff --git a/trailer.c b/trailer.c
>> index fe18faf6c5..f59c90b4b5 100644
>> --- a/trailer.c
>> +++ b/trailer.c
>> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>>
>>  	/* The first paragraph is the title and cannot be trailers */
>>  	for (s = buf; s < buf + len; s = next_line(s)) {
>> -		if (s[0] == comment_line_char)
>> +		if (starts_with_mem(s, buf + len - s, comment_line_str))
>>  			continue;
>>  		if (is_blank_line(s))
>>  			break;
>> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>>  		const char **p;
>>  		ssize_t separator_pos;
>>
>> -		if (bol[0] == comment_line_char) {
>> +		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {
>>  			non_trailer_lines += possible_continuation_lines;
>>  			possible_continuation_lines = 0;
>>  			continue;
>
> This second hunk needs:
>
> diff --git a/trailer.c b/trailer.c
> index f59c90b4b5..fdb0b8137e 100644
> --- a/trailer.c
> +++ b/trailer.c
> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>  		const char **p;
>  		ssize_t separator_pos;
>
> -		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {
> +		if (starts_with_mem(bol, buf + len - bol, comment_line_str)) {
>  			non_trailer_lines += possible_continuation_lines;
>  			possible_continuation_lines = 0;
>  			continue;
>
> I was trying to bound the size based on the loop, which is:
>
>           for (l = last_line(buf, len);
>                l >= end_of_title;
>                l = last_line(buf, l)) {
>                   const char *bol = buf + l;
>
> but I misread "end_of_title" as an upper bound, not a lower one. Which
> makes sense because we're iterating backwards over the lines. So I
> suppose we could bound it by the previous "bol" value. But in practice,
> your prefix won't cross such a boundary anyway, as it won't have a
> newline in it (maybe that's something we should enforce? I guess you
> could set core.commentChar to '\n' even before my series, which would be
> slightly insane).
>
> So just bounding ourselves to "buf + len" seems reasonable, as that
> makes sure we don't step outside the buffer passed into the function.
>
> Curiously, this was found by the sanitizer job in CI, where UBSan
> complains of integer overflow in the pointer computation. I had run with
> both ASan/UBSan locally, but just using gcc, which doesn't seem to find
> it (the CI job uses clang). So I'll that to my mental tally of "clang
> seems to be better with sanitizers".
>
> -Peff
René Scharfe March 7, 2024, 7:42 p.m. UTC | #3
Am 07.03.24 um 10:26 schrieb Jeff King:
> As with the previous patch, we need to swap out single-byte matching for
> something like starts_with() to match all bytes of a multi-byte comment
> character. But for cases where the buffer is not NUL-terminated (and we
> instead have an explicit size or end pointer), it's not safe to use
> starts_with(), as it might walk off the end of the buffer.
>
> Let's introduce a new starts_with_mem() that does the same thing but
> also accepts the length of the "haystack" str and makes sure not to walk
> past it.
>
> Note that in most cases the existing code did not need a length check at
> all, since it was written in a way that knew we had at least one byte
> available (and that was all we checked). So I had to read each one to
> find the appropriate bounds. The one exception is sequencer.c's
> add_commented_lines(), where we can actually get rid of the length
> check. Just like starts_with(), our starts_with_mem() handles an empty
> haystack variable by not matching (assuming a non-empty prefix).
>
> A few notes on the implementation of starts_with_mem():
>
>   - it would be equally correct to take an "end" pointer (and indeed,
>     many of the callers have this and have to subtract to come up with
>     the length). I think taking a ptr/size combo is a more usual
>     interface for our codebase, though, and has the added benefit that
>     the function signature makes it harder to mix up the three
>     parameters.
>
>   - we could obviously build starts_with() on top of this by passing
>     strlen(str) as the length. But it's possible that starts_with() is a
>     relatively hot code path, and it should not pay that penalty (it can
>     generally return an answer proportional to the size of the prefix,
>     not the whole string).
>
>   - it naively feels like xstrncmpz() should be able to do the same
>     thing, but that's not quite true. If you pass the length of the
>     haystack buffer, then strncmp() finds that a shorter prefix string
>     is "less than" than the haystack, even if the haystack starts with
>     the prefix. If you pass the length of the prefix, then you risk
>     reading past the end of the haystack if it is shorter than the
>     prefix. So I think we really do need a new function.

Yes.  xstrncmpz() compares a NUL-terminated string and a length-limited
string.  If you want to check whether the former is a prefix of the
latter then you need to stop comparing when reaching its NUL, and also
after exhausting the latter.  So you need to take both lengths into
account:

int starts_with_mem(const char *str, size_t len, const char *prefix)
{
	size_t prefixlen = strlen(prefix);
	return prefixlen <= len && !xstrncmpz(prefix, str, prefixlen);
}

Using memcmp() here is equivalent and simpler:

int starts_with_mem(const char *str, size_t len, const char *prefix)
{
	size_t prefixlen = strlen(prefix);
	return prefixlen <= len && !memcmp(str, prefix, prefixlen);
}

And your version below avoids function calls and avoids traversing the
strings beyond their common prefix, of course.

>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Arguably starts_with() and this new function should both be inlined,
> like we do for skip_prefix(), but I think that's out of scope for this
> series.

Inlining would allow the compiler to unroll the loop for string
constants.  I doubt it would do that for variables, as in the code
below.

Inlining the strlen()+memcmp() version above might allow the compiler
to push the strlen() call out of a loop.

Would any of that improve performance noticeably?  For the call sites
below I doubt it.  But it would probably increase the object text size.

> And it's possible I was simply too dumb to figure out xstrncmpz() here.
> I'm waiting for René to show up and tell me how to do it. ;)

Nah, it's not a good fit, as it requires the two strings to have the
same length.

>
> IMHO this is the trickiest commit of the whole series, as it would be
> easy to get the length computations subtly wrong.
>
>  commit.c    |  3 ++-
>  sequencer.c |  4 ++--
>  strbuf.c    | 11 +++++++++++
>  strbuf.h    |  1 +
>  trailer.c   |  4 ++--
>  5 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/commit.c b/commit.c
> index ef679a0b93..531a666cba 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1796,7 +1796,8 @@ size_t ignored_log_message_bytes(const char *buf, size_t len)
>  		else
>  			next_line++;
>
> -		if (buf[bol] == comment_line_char || buf[bol] == '\n') {
> +		if (starts_with_mem(buf + bol, cutoff - bol, comment_line_str) ||
> +		    buf[bol] == '\n') {
>  			/* is this the first of the run of comments? */
>  			if (!boc)
>  				boc = bol;
> diff --git a/sequencer.c b/sequencer.c
> index 991a2dbe96..664986e3b2 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -1840,7 +1840,7 @@ static int is_fixup_flag(enum todo_command command, unsigned flag)
>  static void add_commented_lines(struct strbuf *buf, const void *str, size_t len)
>  {
>  	const char *s = str;
> -	while (len > 0 && s[0] == comment_line_char) {
> +	while (starts_with_mem(s, len, comment_line_str)) {
>  		size_t count;
>  		const char *n = memchr(s, '\n', len);
>  		if (!n)
> @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item,
>  	/* left-trim */
>  	bol += strspn(bol, " \t");
>
> -	if (bol == eol || *bol == '\r' || *bol == comment_line_char) {
> +	if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) {

If the strspn() call is safe (which it is, as the caller expects the
string to be NUL-terminated) then you could use starts_with() here and
avoid the length calculation.  But that would also match
comment_line_str values that contain LF, which the _mem version does not
and that's better.

Not sure why lines that start with CR are considered comment lines,
though.

>  		item->command = TODO_COMMENT;
>  		item->commit = NULL;
>  		item->arg_offset = bol - buf;
> diff --git a/strbuf.c b/strbuf.c
> index 7c8f582127..291bdc2a65 100644
> --- a/strbuf.c
> +++ b/strbuf.c
> @@ -24,6 +24,17 @@ int istarts_with(const char *str, const char *prefix)
>  			return 0;
>  }
>
> +int starts_with_mem(const char *str, size_t len, const char *prefix)
> +{
> +	const char *end = str + len;
> +	for (; ; str++, prefix++) {
> +		if (!*prefix)
> +			return 1;
> +		else if (str == end || *str != *prefix)
> +			return 0;
> +	}
> +}

So this checks whether a length-limited string has a prefix given as a
NUL-terminated string.  I'd have called it mem_starts_with() and have
expected starts_with_mem() to check a NUL-terminated string for a
length-limited prefix (think !strncmp(str, prefix, prefixlen)).

> +
>  int skip_to_optional_arg_default(const char *str, const char *prefix,
>  				 const char **arg, const char *def)
>  {
> diff --git a/strbuf.h b/strbuf.h
> index 58dddf2777..3156d6ea8c 100644
> --- a/strbuf.h
> +++ b/strbuf.h
> @@ -673,6 +673,7 @@ char *xstrfmt(const char *fmt, ...);
>
>  int starts_with(const char *str, const char *prefix);
>  int istarts_with(const char *str, const char *prefix);
> +int starts_with_mem(const char *str, size_t len, const char *prefix);
>
>  /*
>   * If the string "str" is the same as the string in "prefix", then the "arg"
> diff --git a/trailer.c b/trailer.c
> index fe18faf6c5..f59c90b4b5 100644
> --- a/trailer.c
> +++ b/trailer.c
> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>
>  	/* The first paragraph is the title and cannot be trailers */
>  	for (s = buf; s < buf + len; s = next_line(s)) {
> -		if (s[0] == comment_line_char)
> +		if (starts_with_mem(s, buf + len - s, comment_line_str))
>  			continue;
>  		if (is_blank_line(s))

Another case where starts_with() would be safe to use, as
is_blank_line() expects (and gets) a NUL-terminated string, but it would
allow matching comment_line_str values that contain LF.

>  			break;
> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>  		const char **p;
>  		ssize_t separator_pos;
>
> -		if (bol[0] == comment_line_char) {
> +		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {

We're in the same buffer, so the above comment applies here as well.

>  			non_trailer_lines += possible_continuation_lines;
>  			possible_continuation_lines = 0;
>  			continue;
René Scharfe March 7, 2024, 7:47 p.m. UTC | #4
Am 07.03.24 um 20:41 schrieb René Scharfe:

Sorry, sent too early.

> Am 07.03.24 um 12:08 schrieb Jeff King:
>> On Thu, Mar 07, 2024 at 04:26:38AM -0500, Jeff King wrote:
>>
>>> IMHO this is the trickiest commit of the whole series, as it would be
>>> easy to get the length computations subtly wrong.
>>
>> And sure enough...
>>
>>> diff --git a/trailer.c b/trailer.c
>>> index fe18faf6c5..f59c90b4b5 100644
>>> --- a/trailer.c
>>> +++ b/trailer.c
>>> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>>>
>>>  	/* The first paragraph is the title and cannot be trailers */
>>>  	for (s = buf; s < buf + len; s = next_line(s)) {
>>> -		if (s[0] == comment_line_char)
>>> +		if (starts_with_mem(s, buf + len - s, comment_line_str))
>>>  			continue;
>>>  		if (is_blank_line(s))
>>>  			break;
>>> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>>>  		const char **p;
>>>  		ssize_t separator_pos;
>>>
>>> -		if (bol[0] == comment_line_char) {
>>> +		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {
>>>  			non_trailer_lines += possible_continuation_lines;
>>>  			possible_continuation_lines = 0;
>>>  			continue;
>>
>> This second hunk needs:
>>
>> diff --git a/trailer.c b/trailer.c
>> index f59c90b4b5..fdb0b8137e 100644
>> --- a/trailer.c
>> +++ b/trailer.c
>> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>>  		const char **p;
>>  		ssize_t separator_pos;
>>
>> -		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {
>> +		if (starts_with_mem(bol, buf + len - bol, comment_line_str)) {
>>  			non_trailer_lines += possible_continuation_lines;
>>  			possible_continuation_lines = 0;
>>  			continue;
>>
>> I was trying to bound the size based on the loop, which is:
>>
>>           for (l = last_line(buf, len);
>>                l >= end_of_title;
>>                l = last_line(buf, l)) {
>>                   const char *bol = buf + l;
>>
>> but I misread "end_of_title" as an upper bound, not a lower one. Which
>> makes sense because we're iterating backwards over the lines. So I
>> suppose we could bound it by the previous "bol" value. But in practice,
>> your prefix won't cross such a boundary anyway, as it won't have a
>> newline in it (maybe that's something we should enforce? I guess you
>> could set core.commentChar to '\n' even before my series, which would be
>> slightly insane).
>>
>> So just bounding ourselves to "buf + len" seems reasonable, as that
>> makes sure we don't step outside the buffer passed into the function.

If you don't want or expect LF in comment_line_str, better check it.
And if you do that, most callers of starts_with_mem() -- including this
one -- can use starts_with() instead, as mentioned in my reply to your
patch.  Less calculations, less errors..

René
Phillip Wood March 8, 2024, 10:17 a.m. UTC | #5
Hi Peff and René

On 07/03/2024 19:42, René Scharfe wrote:
> Am 07.03.24 um 10:26 schrieb Jeff King:
>> diff --git a/sequencer.c b/sequencer.c
>> index 991a2dbe96..664986e3b2 100644
>> --- a/sequencer.c
>> +++ b/sequencer.c
>> @@ -1840,7 +1840,7 @@ static int is_fixup_flag(enum todo_command command, unsigned flag)
>>   static void add_commented_lines(struct strbuf *buf, const void *str, size_t len)
>>   {
>>   	const char *s = str;
>> -	while (len > 0 && s[0] == comment_line_char) {
>> +	while (starts_with_mem(s, len, comment_line_str)) {
>>   		size_t count;
>>   		const char *n = memchr(s, '\n', len);
>>   		if (!n)
>> @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item,
>>   	/* left-trim */
>>   	bol += strspn(bol, " \t");
>>
>> -	if (bol == eol || *bol == '\r' || *bol == comment_line_char) {
>> +	if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) {
> 
> If the strspn() call is safe (which it is, as the caller expects the
> string to be NUL-terminated) then you could use starts_with() here and
> avoid the length calculation.  But that would also match
> comment_line_str values that contain LF, which the _mem version does not > and that's better.

I agree with your analysis. I do wonder though if we should reject 
whitespace and control characters when parsing core.commentChar, it 
feels like accepting them is a bug waiting to happen. If 
comment_line_char starts with ' ' or '\t' that part will be eaten by the 
strspn() above and so starts_with_mem() wont match. Also we will never 
match a comment if comment_line_str contains '\n'.

> Not sure why lines that start with CR are considered comment lines,
> though.

I think it is a lazy way of looking for an empty line ending in CR LF, 
it should really be

	|| (bol[0] == '\r' && bol[1] == '\n') ||

Best Wishes

Phillip
Junio C Hamano March 8, 2024, 3:58 p.m. UTC | #6
Phillip Wood <phillip.wood123@gmail.com> writes:

> I agree with your analysis. I do wonder though if we should reject
> whitespace and control characters when parsing core.commentChar, it
> feels like accepting them is a bug waiting to happen. If
> comment_line_char starts with ' ' or '\t' that part will be eaten by
> the strspn() above and so starts_with_mem() wont match. Also we will
> never match a comment if comment_line_str contains '\n'.

Another thing I was wondering is what we want to do a random
byte-sequence that may match from the middle of a multi-byte UTF-8
character.

The reason I haven't mentioned these "nonsense input" is because
they will at worst only lead to self-denial-of-service to those who
are too curious, and will fall into "don't do it then" category.

Also, what exactly is the definition of "nonsense" will become can
of worms.  I can sympathise if somebody wants to use "#\t" to give
themselves a bit more room than usual on the left for visibility,
for example, so there might be a case to want whitespace characters.

>> Not sure why lines that start with CR are considered comment lines,
>> though.
>
> I think it is a lazy way of looking for an empty line ending in CR LF,
> it should really be
>
> 	|| (bol[0] == '\r' && bol[1] == '\n') ||

My recollection matches your speculation. 

IIRC the lazy persono was probably me but I didn't run "git blame".
Phillip Wood March 8, 2024, 4:20 p.m. UTC | #7
On 08/03/2024 15:58, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> I agree with your analysis. I do wonder though if we should reject
>> whitespace and control characters when parsing core.commentChar, it
>> feels like accepting them is a bug waiting to happen. If
>> comment_line_char starts with ' ' or '\t' that part will be eaten by
>> the strspn() above and so starts_with_mem() wont match. Also we will
>> never match a comment if comment_line_str contains '\n'.
> 
> Another thing I was wondering is what we want to do a random
> byte-sequence that may match from the middle of a multi-byte UTF-8
> character.
> 
> The reason I haven't mentioned these "nonsense input" is because
> they will at worst only lead to self-denial-of-service to those who
> are too curious, and will fall into "don't do it then" category.

We could certainly leave it as-is and tell users they are only hurting 
themselves if they complain when it does not work.

> Also, what exactly is the definition of "nonsense" will become can
> of worms.  I can sympathise if somebody wants to use "#\t" to give
> themselves a bit more room than usual on the left for visibility,
> for example, so there might be a case to want whitespace characters.

That's fair, maybe we could just ban leading whitespace if we do decide 
to restrict core.commentChar

Best Wishes

Phillip

>>> Not sure why lines that start with CR are considered comment lines,
>>> though.
>>
>> I think it is a lazy way of looking for an empty line ending in CR LF,
>> it should really be
>>
>> 	|| (bol[0] == '\r' && bol[1] == '\n') ||
> 
> My recollection matches your speculation.
> 
> IIRC the lazy persono was probably me but I didn't run "git blame".
Jeff King March 12, 2024, 8:05 a.m. UTC | #8
On Thu, Mar 07, 2024 at 08:42:22PM +0100, René Scharfe wrote:

> > Arguably starts_with() and this new function should both be inlined,
> > like we do for skip_prefix(), but I think that's out of scope for this
> > series.
> 
> Inlining would allow the compiler to unroll the loop for string
> constants.  I doubt it would do that for variables, as in the code
> below.
> 
> Inlining the strlen()+memcmp() version above might allow the compiler
> to push the strlen() call out of a loop.
> 
> Would any of that improve performance noticeably?  For the call sites
> below I doubt it.  But it would probably increase the object text size.

Good point. With non-constant prefixes in these cases, it probably
wouldn't buy much. There are a lot of other cases with actual string
constants. A compiler in theory could turn starts_with(str, "foo") into
a few instructions. But it's not even clear that it's in very many hot
paths. It would definitely be something we'd have to measure.

> > And it's possible I was simply too dumb to figure out xstrncmpz() here.
> > I'm waiting for René to show up and tell me how to do it. ;)
> 
> Nah, it's not a good fit, as it requires the two strings to have the
> same length.

Thanks for confirming I wasn't missing anything. :)

> > @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item,
> >  	/* left-trim */
> >  	bol += strspn(bol, " \t");
> >
> > -	if (bol == eol || *bol == '\r' || *bol == comment_line_char) {
> > +	if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) {
> 
> If the strspn() call is safe (which it is, as the caller expects the
> string to be NUL-terminated) then you could use starts_with() here and
> avoid the length calculation.  But that would also match
> comment_line_str values that contain LF, which the _mem version does not
> and that's better.

I try not to read too much into the use of string functions on what
otherwise appears to be an unterminated buffer. While in Git it is quite
often terminated at allocation time (coming from a strbuf, etc) I feel
like I've fixed a number of out-of-bounds reads simply due to sloppy
practices. And even if something is correct today, it is easy for it to
change, since the assumption is made far away from allocation.

So I dunno. Like you said, fewer computations is fewer opportunity to
mess things up. I don't like the idea of introducing a new hand-grenade
that might blow up later, but maybe if it's right next to a strspn()
call that's already a problem, it's not materially making anything
worse.

> > +int starts_with_mem(const char *str, size_t len, const char *prefix)
> > +{
> > +	const char *end = str + len;
> > +	for (; ; str++, prefix++) {
> > +		if (!*prefix)
> > +			return 1;
> > +		else if (str == end || *str != *prefix)
> > +			return 0;
> > +	}
> > +}
> 
> So this checks whether a length-limited string has a prefix given as a
> NUL-terminated string.  I'd have called it mem_starts_with() and have
> expected starts_with_mem() to check a NUL-terminated string for a
> length-limited prefix (think !strncmp(str, prefix, prefixlen)).

I was going for consistency with skip_prefix_mem() and strip_suffix_mem().
To be fair, I probably also named those ones, but I think it's pretty
established. We've never needed the length-limited prefix variant yet,
so I don't know that we're squatting on anything too valuable.

> > @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
> >
> >  	/* The first paragraph is the title and cannot be trailers */
> >  	for (s = buf; s < buf + len; s = next_line(s)) {
> > -		if (s[0] == comment_line_char)
> > +		if (starts_with_mem(s, buf + len - s, comment_line_str))
> >  			continue;
> >  		if (is_blank_line(s))
> 
> Another case where starts_with() would be safe to use, as
> is_blank_line() expects (and gets) a NUL-terminated string, but it would
> allow matching comment_line_str values that contain LF.

Hmm. Yes, it is a NUL-terminated string always, but the caller has told
us not to look past end_of_log_message(). I suspect that if there is no
newline in comment_line_str() it's probably impossible to go past "len"
(just because the end of the log surely ends with either a NUL or a
newline). But it feels iffy to me. I dunno.

-Peff
Jeff King March 12, 2024, 8:19 a.m. UTC | #9
On Fri, Mar 08, 2024 at 04:20:12PM +0000, Phillip Wood wrote:

> On 08/03/2024 15:58, Junio C Hamano wrote:
> > Phillip Wood <phillip.wood123@gmail.com> writes:
> > 
> > > I agree with your analysis. I do wonder though if we should reject
> > > whitespace and control characters when parsing core.commentChar, it
> > > feels like accepting them is a bug waiting to happen. If
> > > comment_line_char starts with ' ' or '\t' that part will be eaten by
> > > the strspn() above and so starts_with_mem() wont match. Also we will
> > > never match a comment if comment_line_str contains '\n'.
> > 
> > Another thing I was wondering is what we want to do a random
> > byte-sequence that may match from the middle of a multi-byte UTF-8
> > character.
> > 
> > The reason I haven't mentioned these "nonsense input" is because
> > they will at worst only lead to self-denial-of-service to those who
> > are too curious, and will fall into "don't do it then" category.
> 
> We could certainly leave it as-is and tell users they are only hurting
> themselves if they complain when it does not work.

That was mostly my plan. To some degree I think this is orthogonal to my
series. You can already set core.commentChar to space or newline, and
I'm sure the results are not very good. Actually, I guess it is easy to
try:

  git -c core.commentChar=$'\n' commit --allow-empty

treats everything as not-a-comment.

Maybe it's worth forbidding this at the start of the series, and then
carrying it through. I really do think newline is the most special
character here, just because it's obviously going to be meaningful to
all of our line-oriented parsing. So you'll get weird results, as
opposed to broken multibyte characters, where things would still work if
you choose to consistently use them (and arguably we cannot even define
"broken" as the user can use a different encoding).

Likewise, I guess people might complain that their core.commentChar is
NFD and their editor writes out NFC characters or something, and we
don't match. I was hoping we could just punt on that and nobody would
ever notice (certainly I think it is OK to punt for now and somebody who
truly cares can make a utf8_starts_with() or similar).

> > Also, what exactly is the definition of "nonsense" will become can
> > of worms.  I can sympathise if somebody wants to use "#\t" to give
> > themselves a bit more room than usual on the left for visibility,
> > for example, so there might be a case to want whitespace characters.
> 
> That's fair, maybe we could just ban leading whitespace if we do decide to
> restrict core.commentChar

Leading whitespace actually does work, though I think you'd be slightly
insane to use it.

I'm currently using "! COMMENT !" (after using a unicode char for a few
days). It's horribly ugly, but I wanted to see if any bugs cropped up
(and vim's built-in git syntax highlighting colors it correctly ;) ).

-Peff
Phillip Wood March 12, 2024, 2:36 p.m. UTC | #10
Hi Peff

On 12/03/2024 08:19, Jeff King wrote:
> On Fri, Mar 08, 2024 at 04:20:12PM +0000, Phillip Wood wrote:
>> We could certainly leave it as-is and tell users they are only hurting
>> themselves if they complain when it does not work.
> 
> That was mostly my plan. To some degree I think this is orthogonal to my
> series. You can already set core.commentChar to space or newline, and
> I'm sure the results are not very good. Actually, I guess it is easy to
> try:
> 
>    git -c core.commentChar=$'\n' commit --allow-empty
> 
> treats everything as not-a-comment.
> 
> Maybe it's worth forbidding this at the start of the series, and then
> carrying it through. I really do think newline is the most special
> character here, just because it's obviously going to be meaningful to
> all of our line-oriented parsing. So you'll get weird results, as
> opposed to broken multibyte characters, where things would still work if
> you choose to consistently use them (and arguably we cannot even define
> "broken" as the user can use a different encoding).

I agree newline is a special case compared to broken multibyte 
characters, I see you've disallowed it in v2 which seems like a good idea.

> Likewise, I guess people might complain that their core.commentChar is
> NFD and their editor writes out NFC characters or something, and we
> don't match. I was hoping we could just punt on that and nobody would
> ever notice (certainly I think it is OK to punt for now and somebody who
> truly cares can make a utf8_starts_with() or similar).
> 
>>> Also, what exactly is the definition of "nonsense" will become can
>>> of worms.  I can sympathise if somebody wants to use "#\t" to give
>>> themselves a bit more room than usual on the left for visibility,
>>> for example, so there might be a case to want whitespace characters.
>>
>> That's fair, maybe we could just ban leading whitespace if we do decide to
>> restrict core.commentChar
> 
> Leading whitespace actually does work, though I think you'd be slightly
> insane to use it.

For "git rebase" in only works if you edit the todo list with "git 
rebase --edit-todo" which calls strbuf_stripspace() and therefore 
parse_insn_line() never sees the comments. If you edit the todo list 
directly then it will error out. You can see this with

     git -c core.commentChar=' ' rebase -x 'echo " this is a comment" 
 >>"$(git rev-parse --git-path rebase-merge/git-rebase-todo)"' HEAD^

which successfully picks HEAD but then gives

     error: invalid command 'this'

when it tries to parse the todo list after the exec command is run. 
Given it is broken already I'm not sure we should worry about it here. 
In any case it is not clear how much we should worry about problems 
caused by users editing the todo list without using "git rebase 
--edit-todo". There is code in parse_insn_line() which is clearly there 
to handle direct editing of the file but I don't think it is tested and 
directly editing the file probably bypasses the 
rebase.missingCommitsCheck checks as well.

Best Wishes

Phillip

> I'm currently using "! COMMENT !" (after using a unicode char for a few
> days). It's horribly ugly, but I wanted to see if any bugs cropped up
> (and vim's built-in git syntax highlighting colors it correctly ;) ).
> 
> -Peff
Jeff King March 13, 2024, 6:23 a.m. UTC | #11
On Tue, Mar 12, 2024 at 02:36:36PM +0000, phillip.wood123@gmail.com wrote:

> > Leading whitespace actually does work, though I think you'd be slightly
> > insane to use it.
> 
> For "git rebase" in only works if you edit the todo list with "git rebase
> --edit-todo" which calls strbuf_stripspace() and therefore parse_insn_line()
> never sees the comments. If you edit the todo list directly then it will
> error out. You can see this with
> 
>     git -c core.commentChar=' ' rebase -x 'echo " this is a comment"
> >>"$(git rev-parse --git-path rebase-merge/git-rebase-todo)"' HEAD^
> 
> which successfully picks HEAD but then gives
> 
>     error: invalid command 'this'
> 
> when it tries to parse the todo list after the exec command is run. Given it
> is broken already I'm not sure we should worry about it here. In any case it
> is not clear how much we should worry about problems caused by users editing
> the todo list without using "git rebase --edit-todo". There is code in
> parse_insn_line() which is clearly there to handle direct editing of the
> file but I don't think it is tested and directly editing the file probably
> bypasses the rebase.missingCommitsCheck checks as well.

Ah, thanks for the example. I guess it's not too surprising that it can
cause confusion. Given that it's an existing issue, I think my
preference would be to leave it out of the series under discussion
(given how long and complicated it is already), but I'd have no
objection to tightening things further on top as a separate series.

-Peff
René Scharfe March 14, 2024, 7:37 p.m. UTC | #12
Am 12.03.24 um 09:05 schrieb Jeff King:
> On Thu, Mar 07, 2024 at 08:42:22PM +0100, René Scharfe wrote:
>
>>> @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item,
>>>  	/* left-trim */
>>>  	bol += strspn(bol, " \t");
>>>
>>> -	if (bol == eol || *bol == '\r' || *bol == comment_line_char) {
>>> +	if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) {
>>
>> If the strspn() call is safe (which it is, as the caller expects the
>> string to be NUL-terminated) then you could use starts_with() here and
>> avoid the length calculation.  But that would also match
>> comment_line_str values that contain LF, which the _mem version does not
>> and that's better.
>
> I try not to read too much into the use of string functions on what
> otherwise appears to be an unterminated buffer. While in Git it is quite
> often terminated at allocation time (coming from a strbuf, etc) I feel
> like I've fixed a number of out-of-bounds reads simply due to sloppy
> practices. And even if something is correct today, it is easy for it to
> change, since the assumption is made far away from allocation.
>
> So I dunno. Like you said, fewer computations is fewer opportunity to
> mess things up. I don't like the idea of introducing a new hand-grenade
> that might blow up later, but maybe if it's right next to a strspn()
> call that's already a problem, it's not materially making anything
> worse.

Yeah, and my logic was flawed: If the caller somehow guarantees that a
space or tab occurs before eol then the strspn() call is safe.  Its
presence doesn't guarantee NUL termination.  parse_insn_line() would
not be safe to use without that prerequisite, but that's a different
matter..

>>> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len)
>>>
>>>  	/* The first paragraph is the title and cannot be trailers */
>>>  	for (s = buf; s < buf + len; s = next_line(s)) {
>>> -		if (s[0] == comment_line_char)
>>> +		if (starts_with_mem(s, buf + len - s, comment_line_str))
>>>  			continue;
>>>  		if (is_blank_line(s))
>>
>> Another case where starts_with() would be safe to use, as
>> is_blank_line() expects (and gets) a NUL-terminated string, but it would
>> allow matching comment_line_str values that contain LF.
>
> Hmm. Yes, it is a NUL-terminated string always, but the caller has told
> us not to look past end_of_log_message(). I suspect that if there is no
> newline in comment_line_str() it's probably impossible to go past "len"
> (just because the end of the log surely ends with either a NUL or a
> newline). But it feels iffy to me. I dunno.

Same flawed thinking on my part: As long as we're guaranteed a blank
line in the buffer we won't walk past its end.  That doesn't mean we can
assume a NUL is present.  But that's fragile.  The code should use
memchr() instead of strchrnul().

That's not the problem you set out to solve in your series, though, and
you avoid making it worse by respecting the length limit in the code
you change.  #leftoverbits

Keeping track of the remaining length increases code size and adds
opportunities for mistakes.  Not sure how to avoid it, however.  Using
eol instead of len at least avoids subtractions.

tl;dr: Good patch (in v2).

René
diff mbox series

Patch

diff --git a/commit.c b/commit.c
index ef679a0b93..531a666cba 100644
--- a/commit.c
+++ b/commit.c
@@ -1796,7 +1796,8 @@  size_t ignored_log_message_bytes(const char *buf, size_t len)
 		else
 			next_line++;
 
-		if (buf[bol] == comment_line_char || buf[bol] == '\n') {
+		if (starts_with_mem(buf + bol, cutoff - bol, comment_line_str) ||
+		    buf[bol] == '\n') {
 			/* is this the first of the run of comments? */
 			if (!boc)
 				boc = bol;
diff --git a/sequencer.c b/sequencer.c
index 991a2dbe96..664986e3b2 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -1840,7 +1840,7 @@  static int is_fixup_flag(enum todo_command command, unsigned flag)
 static void add_commented_lines(struct strbuf *buf, const void *str, size_t len)
 {
 	const char *s = str;
-	while (len > 0 && s[0] == comment_line_char) {
+	while (starts_with_mem(s, len, comment_line_str)) {
 		size_t count;
 		const char *n = memchr(s, '\n', len);
 		if (!n)
@@ -2562,7 +2562,7 @@  static int parse_insn_line(struct repository *r, struct todo_item *item,
 	/* left-trim */
 	bol += strspn(bol, " \t");
 
-	if (bol == eol || *bol == '\r' || *bol == comment_line_char) {
+	if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) {
 		item->command = TODO_COMMENT;
 		item->commit = NULL;
 		item->arg_offset = bol - buf;
diff --git a/strbuf.c b/strbuf.c
index 7c8f582127..291bdc2a65 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -24,6 +24,17 @@  int istarts_with(const char *str, const char *prefix)
 			return 0;
 }
 
+int starts_with_mem(const char *str, size_t len, const char *prefix)
+{
+	const char *end = str + len;
+	for (; ; str++, prefix++) {
+		if (!*prefix)
+			return 1;
+		else if (str == end || *str != *prefix)
+			return 0;
+	}
+}
+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 				 const char **arg, const char *def)
 {
diff --git a/strbuf.h b/strbuf.h
index 58dddf2777..3156d6ea8c 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -673,6 +673,7 @@  char *xstrfmt(const char *fmt, ...);
 
 int starts_with(const char *str, const char *prefix);
 int istarts_with(const char *str, const char *prefix);
+int starts_with_mem(const char *str, size_t len, const char *prefix);
 
 /*
  * If the string "str" is the same as the string in "prefix", then the "arg"
diff --git a/trailer.c b/trailer.c
index fe18faf6c5..f59c90b4b5 100644
--- a/trailer.c
+++ b/trailer.c
@@ -882,7 +882,7 @@  static size_t find_trailer_block_start(const char *buf, size_t len)
 
 	/* The first paragraph is the title and cannot be trailers */
 	for (s = buf; s < buf + len; s = next_line(s)) {
-		if (s[0] == comment_line_char)
+		if (starts_with_mem(s, buf + len - s, comment_line_str))
 			continue;
 		if (is_blank_line(s))
 			break;
@@ -902,7 +902,7 @@  static size_t find_trailer_block_start(const char *buf, size_t len)
 		const char **p;
 		ssize_t separator_pos;
 
-		if (bol[0] == comment_line_char) {
+		if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) {
 			non_trailer_lines += possible_continuation_lines;
 			possible_continuation_lines = 0;
 			continue;