diff mbox series

[4/4] doc add: renormalize is not idempotent for CRCRLF

Message ID d3b8ed97a105ea1d7e656c964b7eee378e11ede6.1657385781.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Add some Glossary terms, and extra renormalize information. | expand

Commit Message

Philip Oakley July 9, 2022, 4:56 p.m. UTC
From: Philip Oakley <philipoakley@iee.email>

Bug report
 https://lore.kernel.org/git/AM0PR02MB56357CC96B702244F3271014E8DC9@AM0PR02MB5635.eurprd02.prod.outlook.com/
noted that a file containing /r/r/n needed renormalising twice.

This is by design. Lone CR characters, not paired with an LF, are left
unchanged. Note the lack of idempotentness of the "clean" filter in the
documentation.

Renormalize was introduced at 9472935d81e (add: introduce "--renormalize",
Torsten Bögershausen, 2017-11-16)

Signed-off-by: Philip Oakley <philipoakley@iee.email>
---
 Documentation/git-add.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Junio C Hamano July 9, 2022, 9:06 p.m. UTC | #1
"Philip Oakley via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Philip Oakley <philipoakley@iee.email>
>
> Bug report
>  https://lore.kernel.org/git/AM0PR02MB56357CC96B702244F3271014E8DC9@AM0PR02MB5635.eurprd02.prod.outlook.com/
> noted that a file containing /r/r/n needed renormalising twice.

Did you mean backslash, not forward?

> This is by design. Lone CR characters, not paired with an LF, are left
> unchanged. Note the lack of idempotentness of the "clean" filter in the
> documentation.

OK.


> Renormalize was introduced at 9472935d81e (add: introduce "--renormalize",
> Torsten Bögershausen, 2017-11-16)

Does this need to be said "HERE", rather than leaving it to run "git
blame" for those who became curious?

> Signed-off-by: Philip Oakley <philipoakley@iee.email>
> ---
>  Documentation/git-add.txt | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
> index 11eb70f16c7..c4a5ad11a6b 100644
> --- a/Documentation/git-add.txt
> +++ b/Documentation/git-add.txt
> @@ -188,7 +188,8 @@ for "git add --no-all <pathspec>...", i.e. ignored removed files.
>  	forcibly add them again to the index.  This is useful after
>  	changing `core.autocrlf` configuration or the `text` attribute
>  	in order to correct files added with wrong CRLF/LF line endings.
> -	This option implies `-u`.
> +	This option implies `-u`. Lone CR characters are untouched, so
> +	cleaning not idempotent. A CRCRLF sequence cleans to CRLF.

Lack of verb BE somewhere.

Do we expect our readers all understand the math-y word?  It is not
too hard to explain it to math-uninitiated, e.g.

    This option implies `-u`.  Note that running renormalize again
    on the result of running renormalize may make it even "more
    normal".  A CR-CR-LF sequence would first renormalize to CR-LF
    (the first CR, a lone CR, is left intact, and CR-LF that follows
    normalizes to LF).  If you run renormalize again, the resulting
    CR-LF will normalize down to LF.
Torsten Bögershausen July 10, 2022, 7:48 a.m. UTC | #2
On Sat, Jul 09, 2022 at 04:56:21PM +0000, Philip Oakley via GitGitGadget wrote:
> From: Philip Oakley <philipoakley@iee.email>
>
> Bug report
>  https://lore.kernel.org/git/AM0PR02MB56357CC96B702244F3271014E8DC9@AM0PR02MB5635.eurprd02.prod.outlook.com/
> noted that a file containing /r/r/n needed renormalising twice.
>
> This is by design. Lone CR characters, not paired with an LF, are left
> unchanged.

This is all fine.

> Note the lack of idempotentness of the "clean" filter in the
> documentation.

The clean filter is idempotent, I would claim, see below.
You can run it, and re-run, and re-run, there will no other changes.
CRLF in the worktree will become LF in the repo,
'lone CR' stay as they are.
In that sense, CRCRLF in the worktree will become CRLF in the repo.
You can the renormalize again and again.

The "trick" is that the user has to decide what CRCRLF mean and what
should happen in the repo:
CRCRLF in the worktree becomes one line ending (one LF in the repo)
or
CRCRLF in the worktree becomes two line endings ( LFLF in the repo)

For a) you can use dos2unix twice.
Or run `git add --renormalize` followed by
`rm git.bdf`
`git restore .`

The thing is that we used a combination of different commands
$ git add --renormalize .
$ git commit -m "Renormalize bdf.txt"
$ rm git.bdf
$ git restore .
$ git add --renormalize .
$ git commit -m "Renormalize a second time bdf.txt"

... to clean up this very situation.

And, if CRCRLF should have become LFLF instead ?
Probably a python script is needed to fix this.
(or some other script/program in the language of your choice)

We could argue that
`git add --renormalize` is idempotent, but a series of carefully crafted
commands is not.
In short, what is missing is the documentation how CRCRLF is handled by
Git.

>
> Renormalize was introduced at 9472935d81e (add: introduce "--renormalize",
> Torsten Bögershausen, 2017-11-16)
>
> Signed-off-by: Philip Oakley <philipoakley@iee.email>
> ---
>  Documentation/git-add.txt | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
> index 11eb70f16c7..c4a5ad11a6b 100644
> --- a/Documentation/git-add.txt
> +++ b/Documentation/git-add.txt
> @@ -188,7 +188,8 @@ for "git add --no-all <pathspec>...", i.e. ignored removed files.
>  	forcibly add them again to the index.  This is useful after
>  	changing `core.autocrlf` configuration or the `text` attribute
>  	in order to correct files added with wrong CRLF/LF line endings.
> -	This option implies `-u`.

> +	This option implies `-u`. Lone CR characters are untouched, so
> +	cleaning not idempotent. A CRCRLF sequence cleans to CRLF.

How about this:

This option implies `-u`. Lone CR characters are untouched. CRCRLF cleans to CRLF.
Philip Oakley July 10, 2022, 9:52 p.m. UTC | #3
On 09/07/2022 22:06, Junio C Hamano wrote:
> "Philip Oakley via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Philip Oakley <philipoakley@iee.email>
>>
>> Bug report
>>  https://lore.kernel.org/git/AM0PR02MB56357CC96B702244F3271014E8DC9@AM0PR02MB5635.eurprd02.prod.outlook.com/
>> noted that a file containing /r/r/n needed renormalising twice.
> Did you mean backslash, not forward?

Correct. Too many years of Windows.
>
>> This is by design. Lone CR characters, not paired with an LF, are left
>> unchanged. Note the lack of idempotentness of the "clean" filter in the
>> documentation.
> OK.
>
>
>> Renormalize was introduced at 9472935d81e (add: introduce "--renormalize",
>> Torsten Bögershausen, 2017-11-16)
> Does this need to be said "HERE", rather than leaving it to run "git
> blame" for those who became curious?

It was a misguided reminder to cc Torsten about his recollection of the
CRCRLF issue. I'll remove it. I see Torsten has also commented.
>
>> Signed-off-by: Philip Oakley <philipoakley@iee.email>
>> ---
>>  Documentation/git-add.txt | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
>> index 11eb70f16c7..c4a5ad11a6b 100644
>> --- a/Documentation/git-add.txt
>> +++ b/Documentation/git-add.txt
>> @@ -188,7 +188,8 @@ for "git add --no-all <pathspec>...", i.e. ignored removed files.
>>  	forcibly add them again to the index.  This is useful after
>>  	changing `core.autocrlf` configuration or the `text` attribute
>>  	in order to correct files added with wrong CRLF/LF line endings.
>> -	This option implies `-u`.
>> +	This option implies `-u`. Lone CR characters are untouched, so
>> +	cleaning *^* not idempotent. A CRCRLF sequence cleans to CRLF.
> Lack of verb BE somewhere. 
'^' It took me three re-reads to see my mistyping as my head knew what
I'd meant to write, I've marked above as a note to self.
Aside: Are there any guides / suggestions / how-to's for on-line
reviewing that you can recommend o
> Do we expect our readers all understand the math-y word? 
Ok. It's mainly used in the test directory, and fsmonitor.h, but not in
the user docs.

>  It is not
> too hard to explain it to math-uninitiated, e.g.
>
>     This option implies `-u`.  Note that running renormalize again
>     on the result of running renormalize may make it even "more
>     normal".  A CR-CR-LF sequence would first renormalize to CR-LF
>     (the first CR, a lone CR, is left intact, and CR-LF that follows
>     normalizes to LF).  If you run renormalize again, the resulting
>     CR-LF will normalize down to LF.
>
Torsten had a shorter suggestion I'll also look at.

Philip
Junio C Hamano July 10, 2022, 10:04 p.m. UTC | #4
Philip Oakley <philipoakley@iee.email> writes:

>>> +	This option implies `-u`. Lone CR characters are untouched, so
>>> +	cleaning *^* not idempotent. A CRCRLF sequence cleans to CRLF.
>> Lack of verb BE somewhere. 
> '^' It took me three re-reads to see my mistyping as my head knew what
> I'd meant to write, I've marked above as a note to self.
> Aside: Are there any guides / suggestions / how-to's for on-line
> reviewing that you can recommend o

Sorry, but I do not know of any good "trick" to fight against our
common tendency to easily miss trivial typoes and thinkos in what we
ourselves wrote.  We can be surprisingly blind to what a colleague
can spot immediately, and that is why it helps to have a thorough
read-through by a reviewer with fresh eyes.  When I was a more
prolific contributor, I sometimes tried to read aloud what I wrote
to myself, both docs and code, and caught silly mistakes before
sending them out to the list, but I do not recommend it to others.
Philip Oakley July 10, 2022, 10:09 p.m. UTC | #5
Hi Tortsen,
Thanks for the reply and comments.

On 10/07/2022 08:48, Torsten Bögershausen wrote:
> On Sat, Jul 09, 2022 at 04:56:21PM +0000, Philip Oakley via GitGitGadget wrote:
>> From: Philip Oakley <philipoakley@iee.email>
>>
>> Bug report
>>  https://lore.kernel.org/git/AM0PR02MB56357CC96B702244F3271014E8DC9@AM0PR02MB5635.eurprd02.prod.outlook.com/
>> noted that a file containing /r/r/n needed renormalising twice.
>>
>> This is by design. Lone CR characters, not paired with an LF, are left
>> unchanged.
> This is all fine.
>
>> Note the lack of idempotentness of the "clean" filter in the
>> documentation.
> The clean filter is idempotent, I would claim, see below.

I'd disagree, on the basis that any second 'idempotent' cleaning should
not change the file content at all. The need for a second clean was the
surprise the user had.
> You can run it, and re-run, and re-run, there will no other changes.
> CRLF in the worktree will become LF in the repo,
> 'lone CR' stay as they are.
> In that sense, CRCRLF in the worktree will become CRLF in the repo.
So  the output isn't normalised, and warning messages ensue (if enabled,
etc)
> You can the renormalize again and again.
>
> The "trick" is that the user has to decide what CRCRLF mean and what
> should happen in the repo
.. for which they should be forewarned of the issue.
In this case it was a large repository transfer of legacy data, so
little knowledge of how the double CRs occured, but it was a real issue
for them.


> :
> CRCRLF in the worktree becomes one line ending (one LF in the repo)
> or
> CRCRLF in the worktree becomes two line endings ( LFLF in the repo)
>
> For a) you can use dos2unix twice.
> Or run `git add --renormalize` followed by
> `rm git.bdf`
> `git restore .`
>
> The thing is that we used a combination of different commands
> $ git add --renormalize .
> $ git commit -m "Renormalize bdf.txt"
> $ rm git.bdf
> $ git restore .
> $ git add --renormalize .
> $ git commit -m "Renormalize a second time bdf.txt"
>
> ... to clean up this very situation.
>
> And, if CRCRLF should have become LFLF instead ?
> Probably a python script is needed to fix this.
> (or some other script/program in the language of your choice)
>
> We could argue that
> `git add --renormalize` is idempotent, but a series of carefully crafted
> commands is not.

> In short, what is missing is the documentation how CRCRLF is handled by
> Git.
*nod*
>
>> Renormalize was introduced at 9472935d81e (add: introduce "--renormalize",
>> Torsten Bögershausen, 2017-11-16)
>>
>> Signed-off-by: Philip Oakley <philipoakley@iee.email>
>> ---
>>  Documentation/git-add.txt | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
>> index 11eb70f16c7..c4a5ad11a6b 100644
>> --- a/Documentation/git-add.txt
>> +++ b/Documentation/git-add.txt
>> @@ -188,7 +188,8 @@ for "git add --no-all <pathspec>...", i.e. ignored removed files.
>>  	forcibly add them again to the index.  This is useful after
>>  	changing `core.autocrlf` configuration or the `text` attribute
>>  	in order to correct files added with wrong CRLF/LF line endings.
>> -	This option implies `-u`.
>> +	This option implies `-u`. Lone CR characters are untouched, so
>> +	cleaning not idempotent. A CRCRLF sequence cleans to CRLF.
> How about this:
>
> This option implies `-u`. Lone CR characters are untouched. CRCRLF cleans to CRLF.
That is probably sufficient. It drops the awkward 'idempotent'. And
indicates this edge case, though doesn't highlight that the resultant
CRLF still leaves the file only partially renormalised.

I'll reword.
>
Philip Oakley July 10, 2022, 10:25 p.m. UTC | #6
On 10/07/2022 23:04, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>>>> +	This option implies `-u`. Lone CR characters are untouched, so
>>>> +	cleaning *^* not idempotent. A CRCRLF sequence cleans to CRLF.
>>> Lack of verb BE somewhere. 
>> '^' It took me three re-reads to see my mistyping as my head knew what
>> I'd meant to write, I've marked above as a note to self.
>> Aside: Are there any guides / suggestions / how-to's for on-line
>> reviewing that you can recommend o
> Sorry, but I do not know of any good "trick" to fight against our
> common tendency to easily miss trivial typoes and thinkos in what we
> ourselves wrote.  We can be surprisingly blind to what a colleague
> can spot immediately, and that is why it helps to have a thorough
> read-through by a reviewer with fresh eyes.  When I was a more
> prolific contributor, I sometimes tried to read aloud what I wrote
> to myself, both docs and code, and caught silly mistakes before
> sending them out to the list, but I do not recommend it to others.

Thanks. There does appear to be a lack of literature or articles in this
area of on-list reviewing

I've not even seen an list of snippets collated from email advice. 
Other than the email etiquette's starter for ten on don't top post ;-)

--
Philip
Junio C Hamano Aug. 5, 2022, 10:26 p.m. UTC | #7
Philip Oakley <philipoakley@iee.email> writes:

>> How about this:
>>
>> This option implies `-u`. Lone CR characters are untouched. CRCRLF cleans to CRLF.
> That is probably sufficient. It drops the awkward 'idempotent'. And
> indicates this edge case, though doesn't highlight that the resultant
> CRLF still leaves the file only partially renormalised.
>
> I'll reword.

It's been a few weeks since the last activity on this topic.
Anything you guys need unblocked to move forward?

Thanks.
Torsten Bögershausen Aug. 6, 2022, 7:22 p.m. UTC | #8
On Fri, Aug 05, 2022 at 03:26:10PM -0700, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
> >> How about this:
> >>
> >> This option implies `-u`. Lone CR characters are untouched. CRCRLF cleans to CRLF.
> > That is probably sufficient. It drops the awkward 'idempotent'. And
> > indicates this edge case, though doesn't highlight that the resultant
> > CRLF still leaves the file only partially renormalised.
> >
> > I'll reword.
>
> It's been a few weeks since the last activity on this topic.
> Anything you guys need unblocked to move forward?
>
> Thanks.
>

Not from my point of view. My understanding is, that the short version is OK for
everybody:

This option implies `-u`. Lone CR characters are untouched. CRCRLF cleans to CRLF.

Is it OK to ask you for a local ammend to push this further ?
Philip Oakley Aug. 8, 2022, 2:32 p.m. UTC | #9
I've unfortunately had some family issue which prevented me doing any work.

If I haven't managed anything by the end of the week. I'd be happy if
others took it forward.

On 05/08/2022 23:26, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>>> How about this:
>>>
>>> This option implies `-u`. Lone CR characters are untouched. CRCRLF cleans to CRLF.
>> That is probably sufficient. It drops the awkward 'idempotent'. And
>> indicates this edge case, though doesn't highlight that the resultant
>> CRLF still leaves the file only partially renormalised.
>>
>> I'll reword.
> It's been a few weeks since the last activity on this topic.
> Anything you guys need unblocked to move forward?
>
> Thanks.
>
Junio C Hamano Aug. 8, 2022, 4:21 p.m. UTC | #10
Philip Oakley <philipoakley@iee.email> writes:

> I've unfortunately had some family issue which prevented me doing any work.

I hope everything will be well on your side.

> If I haven't managed anything by the end of the week. I'd be happy if
> others took it forward.

Thanks for letting us know.
Torsten Bögershausen Aug. 9, 2022, 6:44 p.m. UTC | #11
On Mon, Aug 08, 2022 at 03:32:35PM +0100, Philip Oakley wrote:
> I've unfortunately had some family issue which prevented me doing any work.

That is sad to here. I hope that things are getting better in one way or another.

>
> If I haven't managed anything by the end of the week. I'd be happy if
> others took it forward.

I can certainly have a look, after the weekend, and continue your work.
diff mbox series

Patch

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index 11eb70f16c7..c4a5ad11a6b 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -188,7 +188,8 @@  for "git add --no-all <pathspec>...", i.e. ignored removed files.
 	forcibly add them again to the index.  This is useful after
 	changing `core.autocrlf` configuration or the `text` attribute
 	in order to correct files added with wrong CRLF/LF line endings.
-	This option implies `-u`.
+	This option implies `-u`. Lone CR characters are untouched, so
+	cleaning not idempotent. A CRCRLF sequence cleans to CRLF.
 
 --chmod=(+|-)x::
 	Override the executable bit of the added files.  The executable