diff mbox series

[1/3] config.txt: describe whitespace characters further and more accurately

Message ID 1c670101fc29a9ccc71cf4d213545a564e14aa05.1710258538.git.dsimic@manjaro.org (mailing list archive)
State New
Headers show
Series Improve the documentation and test coverage for whitespace and comments | expand

Commit Message

Dragan Simic March 12, 2024, 3:55 p.m. UTC
Make it more clear what the whitespace characters are in the context of git
configuration, improve the description of the trailing whitespace handling,
and correct the description of the value-internal whitespace handling.

Signed-off-by: Dragan Simic <dsimic@manjaro.org>
---
 Documentation/config.txt | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

Comments

Junio C Hamano March 14, 2024, 1:18 a.m. UTC | #1
Dragan Simic <dsimic@manjaro.org> writes:

> Make it more clear what the whitespace characters are in the context of git
> configuration, improve the description of the trailing whitespace handling,
> and correct the description of the value-internal whitespace handling.
>
> Signed-off-by: Dragan Simic <dsimic@manjaro.org>
> ---
>  Documentation/config.txt | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 782c2bab906c..4480bb44203b 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -22,9 +22,10 @@ multivalued.
>  Syntax
>  ~~~~~~
>  
> -The syntax is fairly flexible and permissive; whitespaces are mostly
> -ignored.  The '#' and ';' characters begin comments to the end of line,
> -blank lines are ignored.
> +The syntax is fairly flexible and permissive.  Whitespace characters,
> +which in this context are the space character (SP) and the horizontal
> +tabulation (HT), are mostly ignored.  The '#' and ';' characters begin
> +comments to the end of line.  Blank lines are ignored.

OK, except for "whitespace characters"---do we need to say
"whitespace characters", after we already listed HT and SP are the
ones, instead of just "whitespaces"?

> @@ -64,12 +65,14 @@ The variable names are case-insensitive, allow only alphanumeric characters
>  and `-`, and must start with an alphabetic character.
>  
>  A line that defines a value can be continued to the next line by
> +ending it with a `\`; the backslash and the end-of-line are stripped.
> +Leading whitespace characters after 'name =', the remainder of the
>  line after the first comment character '#' or ';', and trailing
> +whitespace characters of the line are discarded unless they are enclosed
> +in double quotes.  This discarding of the trailing whitespace characters
> +also applies after the remainder of the line after the comment character
> +is discarded.

"also" makes it sound as if we do it twice, once to remove trailing
whitespaces after the remainder of the line after '#", and then trim
the trailing whitespaces after we removed the comment.

I wonder if we can make it clearer by following the step-by-step
nature of the earlier part of the paragraph through.  We already say
the folded line processing is done first, so break things down in
conceptual phases/steps, perhaps like

 * The backslash at the end-of-line is removed, together with the
   end-of-line, to form a single long line.

 * Anything that come after the first unquoted comment character,
   either '#' or ';', are discarded.

 * The leading and trailing whitespaces around the value part
   (i.e. what follows 'name =') are discarded.

 * Remaining unquoted whitespaces inside the value part are munged.

> +Any number of internal whitespace characters found within
> +the value are converted to the same number of space (SP) characters.

The last one sounds like a bug to me, by the way.

At least the very original 17712991 (Add ".git/config" file parser,
2005-10-10) squashed a run of whitespace characters into a single
SP, which makes sense as a "clean-up".

But ebdaae37 (config: Keep inner whitespace verbatim, 2009-07-30),
while claiming to "Keep" inner whitespaces, broke it by replacing
any isspace() bytes that are not SP with SP, contradicting its
stated purpose.

As the latest change by the author of that change is from more than
10 years ago, I do not expect that he is still interested in this
part of the codebase, but thanks to a very clearly written log
message, we can read what the motivation behind that change was, and
seeing that what the code does contradicts with the stated
motivation we can safely declare that this is an ancient bug.

Fixing that bug can of course be left outside the series.  For those
who are looking for microproject ideas who discovered this message
by searching for the #leftoverbits keyword, one possible fix would
be to revert ebdaae37, make sure a value with any whitespace in it
gets quoted, and document clearly that an unquoted run of
whitespaces is squashed into a single SP.  Another way that is
milder is to finish what ebdaae37 wanted to do and retain the
whitespaces "verbatim".

Thanks.
Dragan Simic March 14, 2024, 6:20 a.m. UTC | #2
On 2024-03-14 02:18, Junio C Hamano wrote:
> Dragan Simic <dsimic@manjaro.org> writes:
>> -The syntax is fairly flexible and permissive; whitespaces are mostly
>> -ignored.  The '#' and ';' characters begin comments to the end of 
>> line,
>> -blank lines are ignored.
>> +The syntax is fairly flexible and permissive.  Whitespace characters,
>> +which in this context are the space character (SP) and the horizontal
>> +tabulation (HT), are mostly ignored.  The '#' and ';' characters 
>> begin
>> +comments to the end of line.  Blank lines are ignored.
> 
> OK, except for "whitespace characters"---do we need to say
> "whitespace characters", after we already listed HT and SP are the
> ones, instead of just "whitespaces"?

I also spent some time thinking about that.  To me, the plural form, 
i.e.
"whitespaces", simply doesn't sound very good, because "whitespace" 
feels
to me more like a mass noun, and I really haven't seen it used in plural
form in other projects.

>>  A line that defines a value can be continued to the next line by
>> +ending it with a `\`; the backslash and the end-of-line are stripped.
>> +Leading whitespace characters after 'name =', the remainder of the
>>  line after the first comment character '#' or ';', and trailing
>> +whitespace characters of the line are discarded unless they are 
>> enclosed
>> +in double quotes.  This discarding of the trailing whitespace 
>> characters
>> +also applies after the remainder of the line after the comment 
>> character
>> +is discarded.
> 
> "also" makes it sound as if we do it twice, once to remove trailing
> whitespaces after the remainder of the line after '#", and then trim
> the trailing whitespaces after we removed the comment.

Good point, I also felt it the same way, but went with such wording 
simply
because I thought it should be more understandable to the users, despite
being technically a bit incorrect.

> I wonder if we can make it clearer by following the step-by-step
> nature of the earlier part of the paragraph through.  We already say
> the folded line processing is done first, so break things down in
> conceptual phases/steps, perhaps like
> 
>  * The backslash at the end-of-line is removed, together with the
>    end-of-line, to form a single long line.
> 
>  * Anything that come after the first unquoted comment character,
>    either '#' or ';', are discarded.
> 
>  * The leading and trailing whitespaces around the value part
>    (i.e. what follows 'name =') are discarded.
> 
>  * Remaining unquoted whitespaces inside the value part are munged.

Hmm, I'm not really sure that such a description would be more clear
to the users, despite being technically more correct.  I'll think a bit
more about it.

>> +Any number of internal whitespace characters found within
>> +the value are converted to the same number of space (SP) characters.
> 
> The last one sounds like a bug to me, by the way.
> 
> At least the very original 17712991 (Add ".git/config" file parser,
> 2005-10-10) squashed a run of whitespace characters into a single
> SP, which makes sense as a "clean-up".
> 
> But ebdaae37 (config: Keep inner whitespace verbatim, 2009-07-30),
> while claiming to "Keep" inner whitespaces, broke it by replacing
> any isspace() bytes that are not SP with SP, contradicting its
> stated purpose.

Thank you for the investigation.  The ebdaae37 commit certainly
introduced a bug to the value parsing, which presumably has remained
undected because the included test passes.

The way I see it, fixing the bug may actually be a breaking change,
because some user configurations may actually rely on the current
(mis)behavior.  This makes me somewhat afraid that fixing this bug,
which I already thought about, may actually do more harm than good.
However, fixing this bug seems to be only right thing to do, which
I'll explain further below.

> As the latest change by the author of that change is from more than
> 10 years ago, I do not expect that he is still interested in this
> part of the codebase, but thanks to a very clearly written log
> message, we can read what the motivation behind that change was, and
> seeing that what the code does contradicts with the stated
> motivation we can safely declare that this is an ancient bug.

Agreed, the evidence is clear.

> Fixing that bug can of course be left outside the series.  For those
> who are looking for microproject ideas who discovered this message
> by searching for the #leftoverbits keyword, one possible fix would
> be to revert ebdaae37, make sure a value with any whitespace in it
> gets quoted, and document clearly that an unquoted run of
> whitespaces is squashed into a single SP.  Another way that is
> milder is to finish what ebdaae37 wanted to do and retain the
> whitespaces "verbatim".

I already though about fixing the bug so the value parser actually does
what git-config(1) currently says, but as I already noted above, I'm
afraid a bit that fixing this bug may actually do more harm than good.

Though, further investigation shows that setting a configuration value,
by invoking git-config(1), converts value-internal tabs into "\t" escape
sequences, which the value-parsing logic doesn't "squash" into spaces.
That's why the test included in the ebdaae37 commit passes.  On the 
other
hand, value-internal literal tab characters, found in a configuration
file, do get "squashed" by the value-parsing logic, so I'd say that the
only right thing to do is to fix this bug by making the value-internal
whitespace characters preserved verbatim.

I'd be happy to include the bugfix into this series, if my 
above-mentioned
fears prove to be unnecessary.
Junio C Hamano March 14, 2024, 4:45 p.m. UTC | #3
Dragan Simic <dsimic@manjaro.org> writes:

> Though, further investigation shows that setting a configuration value,
> by invoking git-config(1), converts value-internal tabs into "\t" escape
> sequences, which the value-parsing logic doesn't "squash" into spaces.

Correct.  It would have been nicer to just quote values that had
whitespaces in them, but replacing HT to SP while turning HT that
comes from our tool into "\t" would still let the value round-trip,
while breaking anything written manually in editors.  If you stay
within Git without using any editor, what ebdaae37 (config: Keep
inner whitespace verbatim, 2009-07-30) left us is at least
internally consistent.

> I'd be happy to include the bugfix into this series, if my
> above-mentioned
> fears prove to be unnecessary.

Documenting status quo is a good place to stop for now.  I do not
know if it is a good idea to add too many tests to etch the current
behaviour that we know is wrong and we'll need to update when we fix
the bug, though.

Thanks.
Dragan Simic March 14, 2024, 6:48 p.m. UTC | #4
On 2024-03-14 17:45, Junio C Hamano wrote:
> Dragan Simic <dsimic@manjaro.org> writes:
> 
>> Though, further investigation shows that setting a configuration 
>> value,
>> by invoking git-config(1), converts value-internal tabs into "\t" 
>> escape
>> sequences, which the value-parsing logic doesn't "squash" into spaces.
> 
> Correct.  It would have been nicer to just quote values that had
> whitespaces in them, but replacing HT to SP while turning HT that
> comes from our tool into "\t" would still let the value round-trip,
> while breaking anything written manually in editors.  If you stay
> within Git without using any editor, what ebdaae37 (config: Keep
> inner whitespace verbatim, 2009-07-30) left us is at least
> internally consistent.

Yes, but we already support unquoted values that contain whitespace
characters, and people use editors to configure variables.  For example,
I never use git-config(1) to make changes to my ~/.gitconfig file.

>> I'd be happy to include the bugfix into this series, if my
>> above-mentioned
>> fears prove to be unnecessary.
> 
> Documenting status quo is a good place to stop for now.  I do not
> know if it is a good idea to add too many tests to etch the current
> behaviour that we know is wrong and we'll need to update when we fix
> the bug, though.

But I already started to work on a bugfix?  I'm pretty much close to
completing the bugfix and doing some testing.
diff mbox series

Patch

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 782c2bab906c..4480bb44203b 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -22,9 +22,10 @@  multivalued.
 Syntax
 ~~~~~~
 
-The syntax is fairly flexible and permissive; whitespaces are mostly
-ignored.  The '#' and ';' characters begin comments to the end of line,
-blank lines are ignored.
+The syntax is fairly flexible and permissive.  Whitespace characters,
+which in this context are the space character (SP) and the horizontal
+tabulation (HT), are mostly ignored.  The '#' and ';' characters begin
+comments to the end of line.  Blank lines are ignored.
 
 The file consists of sections and variables.  A section begins with
 the name of the section in square brackets and continues until the next
@@ -64,12 +65,14 @@  The variable names are case-insensitive, allow only alphanumeric characters
 and `-`, and must start with an alphabetic character.
 
 A line that defines a value can be continued to the next line by
-ending it with a `\`; the backslash and the end-of-line are
-stripped.  Leading whitespaces after 'name =', the remainder of the
+ending it with a `\`; the backslash and the end-of-line are stripped.
+Leading whitespace characters after 'name =', the remainder of the
 line after the first comment character '#' or ';', and trailing
-whitespaces of the line are discarded unless they are enclosed in
-double quotes.  Internal whitespaces within the value are retained
-verbatim.
+whitespace characters of the line are discarded unless they are enclosed
+in double quotes.  This discarding of the trailing whitespace characters
+also applies after the remainder of the line after the comment character
+is discarded.  Any number of internal whitespace characters found within
+the value are converted to the same number of space (SP) characters.
 
 Inside double quotes, double quote `"` and backslash `\` characters
 must be escaped: use `\"` for `"` and `\\` for `\`.