diff mbox series

[2/4] config: really keep value-internal whitespace verbatim

Message ID cd890a7015733e311e0e0a4939c539d2894e31cf.1710508691.git.dsimic@manjaro.org (mailing list archive)
State Superseded
Headers show
Series Fix a bug in configuration parsing, and improve tests and documentation | expand

Commit Message

Dragan Simic March 15, 2024, 1:22 p.m. UTC
Fix a bug in function parse_value() that prevented whitespace characters
(i.e. spaces and horizontal tabs) found inside configuration option values
from being parsed and returned in their original form.  The bug caused any
number of consecutive whitespace characters to be wrongly "squashed" into
the same number of space characters.

This bug was introduced back in July 2009, in commit ebdaae372b46 ("config:
Keep inner whitespace verbatim").

Further investigation showed that setting a configuration value, by invoking
git-config(1), converts value-internal horizontal tabs into "\t" escape
sequences, which the buggy value-parsing logic in function parse_value()
didn't "squash" into spaces.  That's why the test included in the ebdaae37
commit passed, which presumably made the bug remain undetected for this long.
On the other hand, value-internal literal horizontal tab characters, found in
a configuration file edited by hand, do get "squashed" by the value-parsing
logic, so the right choice was to fix this bug by making the value-internal
whitespace characters preserved verbatim.

Fixes: ebdaae372b46 ("config: Keep inner whitespace verbatim")
Signed-off-by: Dragan Simic <dsimic@manjaro.org>
---
 config.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Comments

Junio C Hamano March 15, 2024, 5:46 p.m. UTC | #1
Dragan Simic <dsimic@manjaro.org> writes:

> Fix a bug in function parse_value() that prevented whitespace characters
> (i.e. spaces and horizontal tabs) found inside configuration option values
> from being parsed and returned in their original form.  The bug caused any
> number of consecutive whitespace characters to be wrongly "squashed" into
> the same number of space characters.
>
> This bug was introduced back in July 2009, in commit ebdaae372b46 ("config:
> Keep inner whitespace verbatim").
>
> Further investigation showed that setting a configuration value, by invoking
> git-config(1), converts value-internal horizontal tabs into "\t" escape
> sequences, which the buggy value-parsing logic in function parse_value()
> didn't "squash" into spaces.  That's why the test included in the ebdaae37
> commit passed, which presumably made the bug remain undetected for this long.
> On the other hand, value-internal literal horizontal tab characters, found in
> a configuration file edited by hand, do get "squashed" by the value-parsing
> logic, so the right choice was to fix this bug by making the value-internal
> whitespace characters preserved verbatim.

OK.

> Fixes: ebdaae372b46 ("config: Keep inner whitespace verbatim")

No need for this line.  You mentioned it in the text already, and
more importantly, grepping for trailers is not the right thing to do
because we may think we fixed all bugs in ebdaae372b46 did with this
patch, which may in 6 months turn out to be false but we cannot undo
the trailer.  On the other hand, the discussion of the problem in
the proposed log message gives readers a richer context and the
future developers can understand that this fixed one thing in
ebdaae372b46 but didn't even know about other bugs introduced by
that old commit and make more intelligent decision based on that
better understanding.

> Signed-off-by: Dragan Simic <dsimic@manjaro.org>
> ---
>  config.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/config.c b/config.c
> index a86a20cdf5cb..5072f12e62e4 100644
> --- a/config.c
> +++ b/config.c
> @@ -817,33 +817,38 @@ static int get_next_char(struct config_source *cs)
>  
>  static char *parse_value(struct config_source *cs)
>  {
> -	int quote = 0, comment = 0, space = 0;
> +	int quote = 0, comment = 0;
> +	size_t trim_len = 0;


>  	strbuf_reset(&cs->value);
>  	for (;;) {
>  		int c = get_next_char(cs);
>  		if (c == '\n') {
>  			if (quote) {
>  				cs->linenr--;
>  				return NULL;
>  			}
> +			if (trim_len)
> +				strbuf_setlen(&cs->value, trim_len);
>  			return cs->value.buf;

So the idea is that we copy everything we read verbatim in cs->value
but keep track of the beginning of the run of whitespace characters
at the end of the line in trim_len so that we can rtrim it?  That
should be the most straight-forward implementation.

>  		}
>  		if (comment)
>  			continue;
>  		if (isspace(c) && !quote) {
> +			if (!trim_len)
> +				trim_len = cs->value.len;
>  			if (cs->value.len)
> +				strbuf_addch(&cs->value, c);
>  			continue;

While we are not inside a dq-pair, we keep ignoring a whitespace
character at the beginning (i.e. cs->value.len == 0).

If we have some value in cs->value, however, we add the whitespace
character to cs->value verbatim.  trim_len==0 signals that the last
character we processed was not a whitespace character, and we copy
the current length of cs->value there---this is so that we can rtrim
away the run of whitespaces at the end of the input, including the
byte we are adding here, if it turns out that we are looking at the
first whitespace character of such trailing blanks.

>  		}
>  		if (!quote) {
>  			if (c == ';' || c == '#') {
>  				comment = 1;
>  				continue;
>  			}
>  		}

> +		if (trim_len)
> +			trim_len = 0;

If we are outside a dq-pair, we reach here only when we are looking
at a non-whitespace character.  If we are counting a run of unquoted
whitespaces, we can reset trim_len here to record that we do not
have to trim.

But can we be seeing a whitespace that is inside quote here?  Is
resetting trim_len to zero what we want in such a case?  Let's see.
Inside dq, we'd want to accumulate bytes, possibly after unwrapping
their backslash escapes, to cs->value, and these bytes include the
whitespace characters---we want to keep them literally without
trimming.  OK.

Looking good.  We should already demonstrate that this works well
with a new test in the same patch, can't we?  If we can, we should.

Thanks.

>  		if (c == '\\') {
>  			c = get_next_char(cs);
>  			switch (c) {
Dragan Simic March 15, 2024, 7:50 p.m. UTC | #2
On 2024-03-15 18:46, Junio C Hamano wrote:
> Dragan Simic <dsimic@manjaro.org> writes:
> 
>> Fixes: ebdaae372b46 ("config: Keep inner whitespace verbatim")
> 
> No need for this line.  You mentioned it in the text already, and
> more importantly, grepping for trailers is not the right thing to do
> because we may think we fixed all bugs in ebdaae372b46 did with this
> patch, which may in 6 months turn out to be false but we cannot undo
> the trailer.  On the other hand, the discussion of the problem in
> the proposed log message gives readers a richer context and the
> future developers can understand that this fixed one thing in
> ebdaae372b46 but didn't even know about other bugs introduced by
> that old commit and make more intelligent decision based on that
> better understanding.

Got it, and I agree that the discussions linked in the patch description
provide much better ways for understanding what brought us to the 
current
situation, and what led to the proposed bugfixes.

It's simply that I'm kind of used to providing "Fixes" tags for patches
in other projects;  I'll make a mental note not to do that for git.

I'll also send the v2 with no "Fixes" tag, after the v1 spends a few 
days
on the mailing list.

>> Signed-off-by: Dragan Simic <dsimic@manjaro.org>
>> ---
>>  config.c | 13 +++++++++----
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>> 
>> diff --git a/config.c b/config.c
>> index a86a20cdf5cb..5072f12e62e4 100644
>> --- a/config.c
>> +++ b/config.c
>> @@ -817,33 +817,38 @@ static int get_next_char(struct config_source 
>> *cs)
>> 
>>  static char *parse_value(struct config_source *cs)
>>  {
>> -	int quote = 0, comment = 0, space = 0;
>> +	int quote = 0, comment = 0;
>> +	size_t trim_len = 0;
> 
> 
>>  	strbuf_reset(&cs->value);
>>  	for (;;) {
>>  		int c = get_next_char(cs);
>>  		if (c == '\n') {
>>  			if (quote) {
>>  				cs->linenr--;
>>  				return NULL;
>>  			}
>> +			if (trim_len)
>> +				strbuf_setlen(&cs->value, trim_len);
>>  			return cs->value.buf;
> 
> So the idea is that we copy everything we read verbatim in cs->value
> but keep track of the beginning of the run of whitespace characters
> at the end of the line in trim_len so that we can rtrim it?  That
> should be the most straight-forward implementation.

Yes, that's it.  It's as simple as it can be.  A different, somewhat
more complex approach of putting the encountered whitespace in a 
separate
buffer may be warranted if we expected _a lot_ of whitespace, i.e. much
more whitespace than the other characters, but that shouldn't be the 
case,
which makes the simple approach a good choice.

>>  		}
>>  		if (comment)
>>  			continue;
>>  		if (isspace(c) && !quote) {
>> +			if (!trim_len)
>> +				trim_len = cs->value.len;
>>  			if (cs->value.len)
>> +				strbuf_addch(&cs->value, c);
>>  			continue;
> 
> While we are not inside a dq-pair, we keep ignoring a whitespace
> character at the beginning (i.e. cs->value.len == 0).

Yes, that "eats up" any leading whitespace.

> If we have some value in cs->value, however, we add the whitespace
> character to cs->value verbatim.  trim_len==0 signals that the last
> character we processed was not a whitespace character, and we copy
> the current length of cs->value there---this is so that we can rtrim
> away the run of whitespaces at the end of the input, including the
> byte we are adding here, if it turns out that we are looking at the
> first whitespace character of such trailing blanks.

Exactly.  By the way, please know that I _really_ appreciate your
highly detailed patch reviews!

>>  		}
>>  		if (!quote) {
>>  			if (c == ';' || c == '#') {
>>  				comment = 1;
>>  				continue;
>>  			}
>>  		}
> 
>> +		if (trim_len)
>> +			trim_len = 0;
> 
> If we are outside a dq-pair, we reach here only when we are looking
> at a non-whitespace character.  If we are counting a run of unquoted
> whitespaces, we can reset trim_len here to record that we do not
> have to trim.

Exactly.  It resets the "need to rtrim flag", so to speak.

> But can we be seeing a whitespace that is inside quote here?  Is
> resetting trim_len to zero what we want in such a case?  Let's see.
> Inside dq, we'd want to accumulate bytes, possibly after unwrapping
> their backslash escapes, to cs->value, and these bytes include the
> whitespace characters---we want to keep them literally without
> trimming.  OK.

Yes, any whitespace inside a quoted option value is to be taken
verbatim, so we basically reset the "need to rtrim flag" in that case,
because value-internal whitespace isn't to be rtrimmed.

> Looking good.  We should already demonstrate that this works well
> with a new test in the same patch, can't we?  If we can, we should.

Sorry, this sounds a bit confusing to me?  I think there are already
more than enough new tests in the 3/4 patch I sent?

Could you, please, clarify a bit?

>>  		if (c == '\\') {
>>  			c = get_next_char(cs);
>>  			switch (c) {
diff mbox series

Patch

diff --git a/config.c b/config.c
index a86a20cdf5cb..5072f12e62e4 100644
--- a/config.c
+++ b/config.c
@@ -817,33 +817,38 @@  static int get_next_char(struct config_source *cs)
 
 static char *parse_value(struct config_source *cs)
 {
-	int quote = 0, comment = 0, space = 0;
+	int quote = 0, comment = 0;
+	size_t trim_len = 0;
 
 	strbuf_reset(&cs->value);
 	for (;;) {
 		int c = get_next_char(cs);
 		if (c == '\n') {
 			if (quote) {
 				cs->linenr--;
 				return NULL;
 			}
+			if (trim_len)
+				strbuf_setlen(&cs->value, trim_len);
 			return cs->value.buf;
 		}
 		if (comment)
 			continue;
 		if (isspace(c) && !quote) {
+			if (!trim_len)
+				trim_len = cs->value.len;
 			if (cs->value.len)
-				space++;
+				strbuf_addch(&cs->value, c);
 			continue;
 		}
 		if (!quote) {
 			if (c == ';' || c == '#') {
 				comment = 1;
 				continue;
 			}
 		}
-		for (; space; space--)
-			strbuf_addch(&cs->value, ' ');
+		if (trim_len)
+			trim_len = 0;
 		if (c == '\\') {
 			c = get_next_char(cs);
 			switch (c) {