diff mbox series

tests: drop dependency on `git diff` in check-chainlint

Message ID 20231214032248.1615-1-ericsunshine@charter.net (mailing list archive)
State New, archived
Headers show
Series tests: drop dependency on `git diff` in check-chainlint | expand

Commit Message

Eric Sunshine Dec. 14, 2023, 3:22 a.m. UTC
From: Eric Sunshine <sunshine@sunshineco.com>

The "check-chainlint" target runs automatically when running tests and
performs self-checks to verify that the chainlinter itself produces the
expected output. Originally, the chainlinter was implemented via sed,
but the infrastructure has been rewritten in fb41727b7e (t: retire
unused chainlint.sed, 2022-09-01) to use a Perl script instead.

The rewrite caused some slight whitespace changes in the output that are
ultimately not of much importance. In order to be able to assert that
the actual chainlinter errors match our expectations we thus have to
ignore whitespace characters when diffing them. As the `-w` flag is not
in POSIX we try to use `git diff -w --no-index` before we fall back to
non-standard `diff -w -u`.

To accommodate for cases where the host system has no Git installation
we use the locally-compiled version of Git. This can result in problems
though when the Git project's repository is using extensions that the
locally-compiled version of Git doesn't understand, in which case `git`
may refuse to run and thus cause the checks to fail.

Work around this issue by normalizing whitespace via sed before invoking
diff, which allows any platform diff implementation to be used, thus
eliminating the dependency upon `git diff` and the non-POSIX `-w` flag.

Reported-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
---

This is an alternative solution to the issue Patrick's patch[1]
addresses. Hopefully, this approach should avoid the sort of push-back
Patrick's patch received[2].

I shamelessly stole most of Patrick's commit message.

The sed expressions for normalizing whitespace prior to `diff` may look
a bit hairy, but they are simple enough in concept:

* collapse runs of whitespace to a single SP
* drop blank lines (this step is not new)
* fold out possible SP at beginning and end of each line
* fold out SP surrounding common punctuation characters used in shell
  scripts, such as `>`, `|`, `;`, etc.

By the way, I'm somewhat surprised that this issue crops up at all
considering that --no-index is being used with git-diff. As such, I
would have thought that the local repository's format would not have
been interrogated at all. If that's a bug in `git diff --no-index`, then
fixing that could be considered yet another alternative solution to the
issue raised here.

[1]: https://lore.kernel.org/git/4112adbe467c14a8f22a87ea41aa4705f8760cf6.1702380646.git.ps@pks.im/
[2]: https://lore.kernel.org/git/xmqqr0jqnnmn.fsf@gitster.g/

 t/Makefile | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

Comments

Patrick Steinhardt Dec. 14, 2023, 8:05 a.m. UTC | #1
On Wed, Dec 13, 2023 at 10:22:48PM -0500, Eric Sunshine wrote:
> From: Eric Sunshine <sunshine@sunshineco.com>
> 
> The "check-chainlint" target runs automatically when running tests and
> performs self-checks to verify that the chainlinter itself produces the
> expected output. Originally, the chainlinter was implemented via sed,
> but the infrastructure has been rewritten in fb41727b7e (t: retire
> unused chainlint.sed, 2022-09-01) to use a Perl script instead.
> 
> The rewrite caused some slight whitespace changes in the output that are
> ultimately not of much importance. In order to be able to assert that
> the actual chainlinter errors match our expectations we thus have to
> ignore whitespace characters when diffing them. As the `-w` flag is not
> in POSIX we try to use `git diff -w --no-index` before we fall back to
> non-standard `diff -w -u`.
> 
> To accommodate for cases where the host system has no Git installation
> we use the locally-compiled version of Git. This can result in problems
> though when the Git project's repository is using extensions that the
> locally-compiled version of Git doesn't understand, in which case `git`
> may refuse to run and thus cause the checks to fail.
> 
> Work around this issue by normalizing whitespace via sed before invoking
> diff, which allows any platform diff implementation to be used, thus
> eliminating the dependency upon `git diff` and the non-POSIX `-w` flag.
> 
> Reported-by: Patrick Steinhardt <ps@pks.im>
> Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
> ---
> 
> This is an alternative solution to the issue Patrick's patch[1]
> addresses. Hopefully, this approach should avoid the sort of push-back
> Patrick's patch received[2].

Thanks for chiming in!

> I shamelessly stole most of Patrick's commit message.
> 
> The sed expressions for normalizing whitespace prior to `diff` may look
> a bit hairy, but they are simple enough in concept:
> 
> * collapse runs of whitespace to a single SP
> * drop blank lines (this step is not new)
> * fold out possible SP at beginning and end of each line
> * fold out SP surrounding common punctuation characters used in shell
>   scripts, such as `>`, `|`, `;`, etc.
> 
> By the way, I'm somewhat surprised that this issue crops up at all
> considering that --no-index is being used with git-diff. As such, I
> would have thought that the local repository's format would not have
> been interrogated at all. If that's a bug in `git diff --no-index`, then
> fixing that could be considered yet another alternative solution to the
> issue raised here.

This strongly reminds me of the thread at [1], where a similar issue was
discussed for git-grep(1). Quoting Junio: 

> I actually do not think these "we are allowing Git tools to be used
> on random garbage" is a good idea to begin with X-<.  If we invented
> something nice for our variant in "git grep" and wish we can use it
> outside the repository, contributing the feature to implementations
> of "grep" would have been the right way to move forward, instead of
> contaminating the codebase with things that are not related to Git.

So this might not be the best way to go.

> [1]: https://lore.kernel.org/git/4112adbe467c14a8f22a87ea41aa4705f8760cf6.1702380646.git.ps@pks.im/
> [2]: https://lore.kernel.org/git/xmqqr0jqnnmn.fsf@gitster.g/
> 
>  t/Makefile | 14 +++-----------
>  1 file changed, 3 insertions(+), 11 deletions(-)
> 
> diff --git a/t/Makefile b/t/Makefile
> index 225aaf78ed..656ff10afa 100644
> --- a/t/Makefile
> +++ b/t/Makefile
> @@ -103,20 +103,12 @@ check-chainlint:
>  		echo "# chainlint: $(CHAINLINTTMP_SQ)/tests" && \
>  		for i in $(CHAINLINTTESTS); do \
>  			echo "# chainlint: $$i" && \
> -			sed -e '/^[ 	]*$$/d' chainlint/$$i.expect; \
> +			sed -e 's/[ 	][ 	]*/ /g;/^ *$$/d;s/^ //;s/ $$//;s/\([<>|();&]\) /\1/g;s/ \([<>|();&]\)/\1/g' chainlint/$$i.expect; \

These sed expressions do look hairy indeed. I have to wonder: all that
we're doing here is to munge the expected files we already have in our
tree. Can't we fix those to look exactly like the actual results instead
and then avoid any kind of post processing altogether? If I understand
correctly the only reason we do this post processing is because the
original implementation of the chainlinter produced slightly different
whitespace.

Patrick

[1]: https://lore.kernel.org/git/xmqq7cnnpy3z.fsf@gitster.g/
Eric Sunshine Dec. 14, 2023, 8:31 a.m. UTC | #2
On Thu, Dec 14, 2023 at 3:05 AM Patrick Steinhardt <ps@pks.im> wrote:
> On Wed, Dec 13, 2023 at 10:22:48PM -0500, Eric Sunshine wrote:
> > This is an alternative solution to the issue Patrick's patch[1]
> > addresses. Hopefully, this approach should avoid the sort of push-back
> > Patrick's patch received[2].
> >
> > By the way, I'm somewhat surprised that this issue crops up at all
> > considering that --no-index is being used with git-diff. As such, I
> > would have thought that the local repository's format would not have
> > been interrogated at all. If that's a bug in `git diff --no-index`, then
> > fixing that could be considered yet another alternative solution to the
> > issue raised here.
>
> This strongly reminds me of the thread at [1], where a similar issue was
> discussed for git-grep(1). Quoting Junio:
>
> > I actually do not think these "we are allowing Git tools to be used
> > on random garbage" is a good idea to begin with X-<.  If we invented
> > something nice for our variant in "git grep" and wish we can use it
> > outside the repository, contributing the feature to implementations
> > of "grep" would have been the right way to move forward, instead of
> > contaminating the codebase with things that are not related to Git.
>
> So this might not be the best way to go.

I recall Junio mentioning that, and I'm fine with the conclusion that
"fixing" --no-index is counter to the project's goals.

> > -                     sed -e '/^[     ]*$$/d' chainlint/$$i.expect; \
> > +                     sed -e 's/[     ][      ]*/ /g;/^ *$$/d;s/^ //;s/ $$//;s/\([<>|();&]\) /\1/g;s/ \([<>|();&]\)/\1/g' chainlint/$$i.expect; \
> >
> > The sed expressions for normalizing whitespace prior to `diff` may look
> > a bit hairy, but they are simple enough in concept:
> >
> > * collapse runs of whitespace to a single SP
> > * drop blank lines (this step is not new)
> > * fold out possible SP at beginning and end of each line
> > * fold out SP surrounding common punctuation characters used in shell
> >   scripts, such as `>`, `|`, `;`, etc.
>
> These sed expressions do look hairy indeed. I have to wonder: all that
> we're doing here is to munge the expected files we already have in our
> tree. Can't we fix those to look exactly like the actual results instead
> and then avoid any kind of post processing altogether? If I understand
> correctly the only reason we do this post processing is because the
> original implementation of the chainlinter produced slightly different
> whitespace.

Yes and no. It's not just whitespace.

I did strongly consider submitting patches to fix all the whitespace
differences in the "expect" files when chainlint.pl replaced
chainlint.sed, but I particularly didn't want to plague the mailing
list with such noise. It's really just unnecessary churn since it's so
easy to work around it with minor sed magic.

And time tells me that that was probably the correct decision since
the output of chainlint.pl has changed multiple times. Even the output
of chainlint.sed wasn't necessarily stable[1]. Then, of course
chainlint.pl replaced[2] chainlint.sed. The original implementation or
chainlint.pl just dumped out the parsed token stream, but [3] improved
it to preserve the original formatting of the test snippet, and [4]
annotated the output with line numbers of the original test snippet.
Had those changes been accompanied by extra patches to "fix" the
"expect" files to suit, it would have been just that much more noise
both in terms of patches to review and in terms of churn in the actual
history.

And, who knows, the output of chainlint.pl might change/improve again
some day. So, I still favor using sed to smooth over these minor
differences rather than "fixing" the "expect" file repeatedly to
adjust them for changes which are not significant to what is actually
being tested.

[1]: d73f5cfa89 (chainlint.sed: stop splitting "(..." into separate
lines "(" and "...", 2021-12-13)
[2]: d00113ec34 (t/Makefile: apply chainlint.pl to existing
self-tests, 2022-09-01)
[3]: 73c768dae9 (chainlint: annotate original test definition rather
than token stream, 2022-11-08)
[4]: 48d69d8f2f (chainlint: prefix annotated test definition with line
numbers, 2022-11-11)
Junio C Hamano Dec. 14, 2023, 4:49 p.m. UTC | #3
Patrick Steinhardt <ps@pks.im> writes:

> This strongly reminds me of the thread at [1], where a similar issue was
> discussed for git-grep(1). Quoting Junio: 
>
>> I actually do not think these "we are allowing Git tools to be used
>> on random garbage" is a good idea to begin with X-<.  If we invented
>> something nice for our variant in "git grep" and wish we can use it
>> outside the repository, contributing the feature to implementations
>> of "grep" would have been the right way to move forward, instead of
>> contaminating the codebase with things that are not related to Git.
>
> So this might not be the best way to go.

That is not a conclusion I want people to draw.

Like it or not, "git diff --no-index" will be with us to stay, and
"--no-index" being "we have abused the rest of Git code to implement
'diff' that works _outside_ a Git repository---now go and do your
thing", we would eventually want to correct it, if it is misbehaving
when a repository it finds is in a shape it does not like, no?

We should have what you quoted in mind as a general principle, and
think twice when we are tempted to hoard useful features for another
tool we initially wrote for Git and allow them to be used with the
"--no-index" option, instead of contributing them to the tool that
does not know or care "git" repositories (like "diff" and "grep").
Patrick Steinhardt Dec. 15, 2023, 5:36 a.m. UTC | #4
On Thu, Dec 14, 2023 at 08:49:49AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > This strongly reminds me of the thread at [1], where a similar issue was
> > discussed for git-grep(1). Quoting Junio: 
> >
> >> I actually do not think these "we are allowing Git tools to be used
> >> on random garbage" is a good idea to begin with X-<.  If we invented
> >> something nice for our variant in "git grep" and wish we can use it
> >> outside the repository, contributing the feature to implementations
> >> of "grep" would have been the right way to move forward, instead of
> >> contaminating the codebase with things that are not related to Git.
> >
> > So this might not be the best way to go.
> 
> That is not a conclusion I want people to draw.
> 
> Like it or not, "git diff --no-index" will be with us to stay, and
> "--no-index" being "we have abused the rest of Git code to implement
> 'diff' that works _outside_ a Git repository---now go and do your
> thing", we would eventually want to correct it, if it is misbehaving
> when a repository it finds is in a shape it does not like, no?
> 
> We should have what you quoted in mind as a general principle, and
> think twice when we are tempted to hoard useful features for another
> tool we initially wrote for Git and allow them to be used with the
> "--no-index" option, instead of contributing them to the tool that
> does not know or care "git" repositories (like "diff" and "grep").

Okay, thanks for clarifying!

Patrick
diff mbox series

Patch

diff --git a/t/Makefile b/t/Makefile
index 225aaf78ed..656ff10afa 100644
--- a/t/Makefile
+++ b/t/Makefile
@@ -103,20 +103,12 @@  check-chainlint:
 		echo "# chainlint: $(CHAINLINTTMP_SQ)/tests" && \
 		for i in $(CHAINLINTTESTS); do \
 			echo "# chainlint: $$i" && \
-			sed -e '/^[ 	]*$$/d' chainlint/$$i.expect; \
+			sed -e 's/[ 	][ 	]*/ /g;/^ *$$/d;s/^ //;s/ $$//;s/\([<>|();&]\) /\1/g;s/ \([<>|();&]\)/\1/g' chainlint/$$i.expect; \
 		done \
 	} >'$(CHAINLINTTMP_SQ)'/expect && \
 	$(CHAINLINT) --emit-all '$(CHAINLINTTMP_SQ)'/tests | \
-		sed -e 's/^[1-9][0-9]* //;/^[ 	]*$$/d' >'$(CHAINLINTTMP_SQ)'/actual && \
-	if test -f ../GIT-BUILD-OPTIONS; then \
-		. ../GIT-BUILD-OPTIONS; \
-	fi && \
-	if test -x ../git$$X; then \
-		DIFFW="../git$$X --no-pager diff -w --no-index"; \
-	else \
-		DIFFW="diff -w -u"; \
-	fi && \
-	$$DIFFW '$(CHAINLINTTMP_SQ)'/expect '$(CHAINLINTTMP_SQ)'/actual
+		sed -e 's/^[1-9][0-9]* //;s/[ 	][ 	]*/ /g;/^ *$$/d;s/^ //;s/ $$//;s/\([<>|();&]\) /\1/g;s/ \([<>|();&]\)/\1/g' >'$(CHAINLINTTMP_SQ)'/actual && \
+	diff -u '$(CHAINLINTTMP_SQ)'/expect '$(CHAINLINTTMP_SQ)'/actual
 
 test-lint: test-lint-duplicates test-lint-executable test-lint-shell-syntax \
 	test-lint-filenames