[RFC] Add new build target 'check-spelling'

Message ID	20221031074317.377366-1-sw@weilnetz.de (mailing list archive)
State	New, archived
Headers	To: qemu-devel@nongnu.org
	Cc: Thomas Huth <thuth@redhat.com>, Peter Maydell <peter.maydell@linaro.org>, Stefan Weil <sw@weilnetz.de>
	Subject: [RFC PATCH] Add new build target 'check-spelling'
	Date: Mon, 31 Oct 2022 08:43:17 +0100
	Reply-to: Stefan Weil <sw@weilnetz.de>

Stefan Weil Oct. 31, 2022, 7:43 a.m. UTC

`make check-spelling` can now be used to get a list of spelling errors.
It uses the latest version of codespell, a spell checker implemented in Python.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
---

This RFC can already be used for manual tests, but still reports false
positives, mostly because some variable names are interpreted as words.
These words can either be ignored in the check, or in some cases the code
might be changed to use different variable names.

The check currently only skips a few directories and files, so for example
checked out submodules are also checked.

The rule can be extended to allow user provided ignore and skip lists,
for example by introducing Makefile variables CODESPELL_SKIP=userfile
or CODESPELL_IGNORE=userfile. A limited check could be implemented by
providing a base directory CODESPELL_START=basedirectory, for example
CODESPELL_START=docs.
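
A rough sketch of how these variables could drive the codespell call, written
as a shell function rather than a Makefile rule. The variable names are the
ones proposed above. Everything else is an assumption: the function name
run_codespell is hypothetical, the user skip file is assumed to hold one
pattern per line, and the function is assumed to run from the top of the
source tree.

```shell
# Sketch only: run codespell with the variables proposed above.
# Defaults mirror the options used in the patch.
run_codespell() {
    start=${CODESPELL_START:-.}
    ignore=${CODESPELL_IGNORE:-tests/codespell/ignore-words}
    skip='./.git,./bin,./build,./linux-headers,*.patch,nohup.out'
    if [ -n "${CODESPELL_SKIP:-}" ]; then
        # Assumption: one pattern per line; join with commas for --skip.
        skip="$skip,$(paste -s -d, "$CODESPELL_SKIP")"
    fi
    codespell -s "$start" \
        --exclude-file=tests/codespell/exclude-file \
        --ignore-words="$ignore" \
        --skip="$skip"
}
```

With this, `CODESPELL_START=docs run_codespell` would limit the check to
docs, and an unset variable falls back to the behaviour of the patch.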

Regards,
Stefan

 tests/Makefile.include       | 10 ++++++++++
 tests/codespell/README.rst   | 18 ++++++++++++++++++
 tests/codespell/exclude-file |  3 +++
 tests/codespell/ignore-words | 19 +++++++++++++++++++
 tests/requirements.txt       |  1 +
 5 files changed, 51 insertions(+)
 create mode 100644 tests/codespell/README.rst
 create mode 100644 tests/codespell/exclude-file
 create mode 100644 tests/codespell/ignore-words

Thomas Huth Oct. 31, 2022, 7:52 a.m. UTC | #1

On 31/10/2022 08.43, Stefan Weil wrote:
> `make check-spelling` can now be used to get a list of spelling errors.
> It uses the latest version of codespell, a spell checker implemented in Python.
> 
> Signed-off-by: Stefan Weil <sw@weilnetz.de>
> ---
> 
> This RFC can already be used for manual tests, but still reports false
> positives, mostly because some variable names are interpreted as words.
> These words can either be ignored in the check, or in some cases the code
> might be changed to use different variable names.
> 
> The check currently only skips a few directories and files, so for example
> checked out submodules are also checked.
> 
> The rule can be extended to allow user provided ignore and skip lists,
> for example by introducing Makefile variables CODESPELL_SKIP=userfile
> or CODESPELL_IGNORE=userfile. A limited check could be implemented by
> providing a base directory CODESPELL_START=basedirectory, for example
> CODESPELL_START=docs.
> 
> Regards,
> Stefan
> 
>   tests/Makefile.include       | 10 ++++++++++
>   tests/codespell/README.rst   | 18 ++++++++++++++++++
>   tests/codespell/exclude-file |  3 +++
>   tests/codespell/ignore-words | 19 +++++++++++++++++++
>   tests/requirements.txt       |  1 +
>   5 files changed, 51 insertions(+)
>   create mode 100644 tests/codespell/README.rst
>   create mode 100644 tests/codespell/exclude-file
>   create mode 100644 tests/codespell/ignore-words
> 
> diff --git a/tests/Makefile.include b/tests/Makefile.include
> index 9422ddaece..b9daeda932 100644
> --- a/tests/Makefile.include
> +++ b/tests/Makefile.include
> @@ -155,6 +155,16 @@ check-acceptance-deprecated-warning:
>   
>   check-acceptance: check-acceptance-deprecated-warning | check-avocado
>   
> +.PHONY: check-spelling
> +CODESPELL_DIR=tests/codespell
> +check-spelling: check-venv
> +	source $(TESTS_VENV_DIR)/bin/activate && \
> +	cd "$(SRC_PATH)" && \
> +	codespell -s . \
> +	  --exclude-file=$(CODESPELL_DIR)/exclude-file \
> +	  --ignore-words=$(CODESPELL_DIR)/ignore-words \
> +	  --skip="./.git,./bin,./build,./linux-headers,*.patch,nohup.out"

I like the idea, but I think it's unlikely that we can make this work for 
the whole source tree any time soon. So maybe it makes more sense to start 
with a few directories first (e.g. docs/), and then the maintainers can 
opt in by cleaning up their directories first and then adding their 
directories to this target here?

  Thomas

Stefan Weil Oct. 31, 2022, 10:44 a.m. UTC | #2

Am 31.10.22 um 08:52 schrieb Thomas Huth:

> On 31/10/2022 08.43, Stefan Weil wrote:
>> `make check-spelling` can now be used to get a list of spelling errors.
>> It uses the latest version of codespell, a spell checker implemented 
>> in Python.
>>
>> Signed-off-by: Stefan Weil <sw@weilnetz.de>
>> ---
>>
>> This RFC can already be used for manual tests, but still reports false
>> positives, mostly because some variable names are interpreted as words.
>> These words can either be ignored in the check, or in some cases the 
>> code
>> might be changed to use different variable names.
>>
>> The check currently only skips a few directories and files, so for 
>> example
>> checked out submodules are also checked.
>>
>> The rule can be extended to allow user provided ignore and skip lists,
>> for example by introducing Makefile variables CODESPELL_SKIP=userfile
>> or CODESPELL_IGNORE=userfile. A limited check could be implemented by
>> providing a base directory CODESPELL_START=basedirectory, for example
>> CODESPELL_START=docs.
>>
>> Regards,
>> Stefan
[...]
>> I like the idea, but I think it's unlikely that we can make this work 
>> for the whole source tree any time soon. So maybe it makes more sense 
>> to start with some few directories first (e.g. docs/ ) and then the 
>> maintainers can opt-in by cleaning up their directories first and 
>> then by adding their directories to this target here?
>
>  Thomas


Even without implementing CODESPELL_START as described above, the script 
can already be used and integrated into CI scripts.

It takes about 60 seconds to check the whole source tree including 
submodules on my (slow) virtual machine.

The resulting output has about 20000 lines or 1272 KiB. It can be 
filtered for relevant parts of the source tree or used for a summary.

Sample script: grep "^[.]" spellcheck.log | sed s/^..// | sed 's/\/.*//' 
| sed s/:.*// | sort | uniq -c

This produces a summary for the top level hierarchy of files and 
directories:

       3 accel
       1 audio
       1 backends
      77 block
       7 block.c
      20 bsd-user
     386 capstone
      12 chardev
       1 configure
       8 contrib
       6 crypto
      64 disas
      32 docs
      31 dtc
       8 fpu
       1 gdbstub
       1 gdb-xml
       1 .github
     537 hw
       7 inc
     114 include
       1 libdecnumber
      33 linux-user
       1 MAINTAINERS
     150 meson
       6 meson.build
      16 migration
       1 nbd
       5 net
      12 pc-bios
       7 python
       3 qapi
       2 qemu
       5 qemu-options.hx
      22 qga
   14175 roms
      43 scripts
       3 semihosting
      18 slirp
       2 softmmu
      59 subprojects
     504 target
       6 tcg
       3 test.rb
     175 tests
       6 tools
      20 ui
       8 util

It shows that "roms" contributes by far the most typos. Omitting it 
would cut the required time to 22 seconds and greatly reduce the number 
of typos found (2947 lines of output).
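
The per-directory summary can also be produced with cut instead of the chain
of sed calls, which may be easier to read. This is a sketch, assuming that
each finding in the log starts with "./" followed by the path and a colon.
The function name summarize_by_toplevel is made up for this example.

```shell
# Count codespell findings per top-level file or directory.
# Reads the log on stdin; assumed line format: ./path/file:LINE: typo ==> fix
summarize_by_toplevel() {
    grep '^\./' |       # keep only lines that report a finding in a file
        cut -d/ -f2 |   # first path component after "./"
        cut -d: -f1 |   # top-level files: drop ":LINE: ..." suffix
        sort |
        uniq -c
}
```

Usage: `summarize_by_toplevel < spellcheck.log`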

"capstone" (which has no entry in MAINTAINERS), "target" and "hw" also 
contribute more than 300 hits each, therefore cc'ing Richard.

Stefan

Thomas Huth Oct. 31, 2022, 10:50 a.m. UTC | #3

On 31/10/2022 11.44, Stefan Weil wrote:
> Am 31.10.22 um 08:52 schrieb Thomas Huth:
> 
>> On 31/10/2022 08.43, Stefan Weil wrote:
>>> `make check-spelling` can now be used to get a list of spelling errors.
>>> It uses the latest version of codespell, a spell checker implemented in 
>>> Python.
>>>
>>> Signed-off-by: Stefan Weil <sw@weilnetz.de>
>>> ---
>>>
>>> This RFC can already be used for manual tests, but still reports false
>>> positives, mostly because some variable names are interpreted as words.
>>> These words can either be ignored in the check, or in some cases the code
>>> might be changed to use different variable names.
>>>
>>> The check currently only skips a few directories and files, so for example
>>> checked out submodules are also checked.
>>>
>>> The rule can be extended to allow user provided ignore and skip lists,
>>> for example by introducing Makefile variables CODESPELL_SKIP=userfile
>>> or CODESPELL_IGNORE=userfile. A limited check could be implemented by
>>> providing a base directory CODESPELL_START=basedirectory, for example
>>> CODESPELL_START=docs.
>>>
>>> Regards,
>>> Stefan
> [...]
>>> I like the idea, but I think it's unlikely that we can make this work for 
>>> the whole source tree any time soon. So maybe it makes more sense to 
>>> start with some few directories first (e.g. docs/ ) and then the 
>>> maintainers can opt-in by cleaning up their directories first and then by 
>>> adding their directories to this target here?
>>
>>  Thomas
> 
> 
> Even without implementing CODESPELL_START as described above, the script can 
> already be used and integrated into CI scripts.
> 
> It takes about 60 seconds to check the whole source tree including 
> submodules on my (slow) virtual machine.
> 
> The resulting output has about 20000 lines or 1272 KiB. It can be filtered 
> for relevant parts of the source tree or used for a summary.
> 
> Sample script: grep "^[.]" spellcheck.log | sed s/^..// | sed 's/\/.*//' | 
> sed s/:.*// | sort | uniq -c
> 
> This produces a summary for the top level hierarchy of files and directories:
> 
>        3 accel
>        1 audio
>        1 backends
>       77 block
>        7 block.c
>       20 bsd-user
>      386 capstone
>       12 chardev
>        1 configure
>        8 contrib
>        6 crypto
>       64 disas
>       32 docs
>       31 dtc
>        8 fpu
>        1 gdbstub
>        1 gdb-xml
>        1 .github
>      537 hw
>        7 inc
>      114 include
>        1 libdecnumber
>       33 linux-user
>        1 MAINTAINERS
>      150 meson
>        6 meson.build
>       16 migration
>        1 nbd
>        5 net
>       12 pc-bios
>        7 python
>        3 qapi
>        2 qemu
>        5 qemu-options.hx
>       22 qga
>    14175 roms
>       43 scripts
>        3 semihosting
>       18 slirp
>        2 softmmu
>       59 subprojects
>      504 target
>        6 tcg
>        3 test.rb
>      175 tests
>        6 tools
>       20 ui
>        8 util
> 
> It shows that "roms" contributes by far the most typos. Omitting it would 
> reduce the required time to 22 seconds and the number of typos found (2947 
> lines in output) very much.

"roms" mostly consists of third-party submodules that we do not have direct 
control of. I think this should definitely be omitted.

> "capstone" (which has no entry in MAINTAINERS)

That's likely because it was a submodule that was removed a while 
ago. "rm -rf capstone" should solve that issue in your local build tree ;-)

(yes, that's another nuisance of submodules - the checked out files don't go 
away when the submodule gets removed)

  Thomas

Daniel P. Berrangé Oct. 31, 2022, 10:52 a.m. UTC | #4

On Mon, Oct 31, 2022 at 11:44:48AM +0100, Stefan Weil via wrote:
> Am 31.10.22 um 08:52 schrieb Thomas Huth:
> 
> > On 31/10/2022 08.43, Stefan Weil wrote:
> > > `make check-spelling` can now be used to get a list of spelling errors.
> > > It uses the latest version of codespell, a spell checker implemented
> > > in Python.
> > > 
> > > Signed-off-by: Stefan Weil <sw@weilnetz.de>
> > > ---
> > > 
> > > This RFC can already be used for manual tests, but still reports false
> > > positives, mostly because some variable names are interpreted as words.
> > > These words can either be ignored in the check, or in some cases the
> > > code
> > > might be changed to use different variable names.
> > > 
> > > The check currently only skips a few directories and files, so for
> > > example
> > > checked out submodules are also checked.
> > > 
> > > The rule can be extended to allow user provided ignore and skip lists,
> > > for example by introducing Makefile variables CODESPELL_SKIP=userfile
> > > or CODESPELL_IGNORE=userfile. A limited check could be implemented by
> > > providing a base directory CODESPELL_START=basedirectory, for example
> > > CODESPELL_START=docs.
> > > 
> > > Regards,
> > > Stefan
> [...]
> > > I like the idea, but I think it's unlikely that we can make this
> > > work for the whole source tree any time soon. So maybe it makes more
> > > sense to start with some few directories first (e.g. docs/ ) and
> > > then the maintainers can opt-in by cleaning up their directories
> > > first and then by adding their directories to this target here?
> > 
> >  Thomas
> 
> 
> Even without implementing CODESPELL_START as described above, the script can
> already be used and integrated into CI scripts.

To get the most value from CI, we strongly prefer the test to be a clear
pass/fail.

We do have some jobs that are marked non-gating, since they have
false failures and need manual inspection of results. The effect
is those jobs are largely ignored by everyone, so not really of
significant benefit.

So I'd agree with Thomas about starting with a config that can
get a clear pass/fail, and expanding from there if we can't get
the full-tree clean from the start.
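
A minimal sketch of such a pass/fail gate, assuming that codespell exits
with a non-zero status whenever it reports a misspelling. CODESPELL_DIRS
and check_spelling_gate are hypothetical names, not part of the patch; the
list would start with docs and grow as maintainers opt in.

```shell
# Hypothetical opt-in list of directories that are expected to be clean.
CODESPELL_DIRS=${CODESPELL_DIRS:-docs}

# Check only the opted-in directories and report a plain pass/fail.
check_spelling_gate() {
    # $CODESPELL_DIRS is left unquoted on purpose so it splits into words.
    if codespell --ignore-words=tests/codespell/ignore-words $CODESPELL_DIRS
    then
        echo "check-spelling: PASS"
    else
        echo "check-spelling: FAIL" >&2
        return 1
    fi
}
```

A CI job could call check_spelling_gate as its last command, so the job
status follows the return value directly.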

> It shows that "roms" contributes by far the most typos. Omitting it would
> reduce the required time to 22 seconds and the number of typos found (2947
> lines in output) very much.

We should not look at 'roms' at all since it is just a place for
git submodules and build system integration. No interesting code
lives there.

> "capstone" (which has no entry in MAINTAINERS), "target" and "hw" also
> contribute more than 300 hits each, therefore cc'ing Richard.

We should completely ignore 'capstone' and any other git submodule,
as those are 3rd party codebases we don't maintain ourselves.
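
One possible way to do that without maintaining a skip list by hand is to
let git provide the file list. The sketch below filters the output of
`git ls-files -s`, where submodules appear as entries with mode 160000.
The function name tracked_without_submodules is made up for this example,
and paths containing whitespace or special characters would need extra care.

```shell
# Print tracked paths, leaving out submodule entries (mode 160000).
# Input: output of `git ls-files -s`, i.e. "MODE OBJECT STAGE<TAB>PATH".
tracked_without_submodules() {
    awk -F '\t' '{
        split($1, meta, " ")        # meta[1] is the file mode
        if (meta[1] != "160000")    # 160000 marks a submodule (gitlink)
            print $2
    }'
}

# Possible use from the top of a checkout:
#   git ls-files -s | tracked_without_submodules | xargs codespell
```

Untracked leftovers such as an old capstone checkout never show up in the
list, since git only prints tracked paths.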

With regards,
Daniel

Philippe Mathieu-Daudé Oct. 31, 2022, 3:40 p.m. UTC | #5

On 31/10/22 08:43, Stefan Weil via wrote:
> `make check-spelling` can now be used to get a list of spelling errors.
> It uses the latest version of codespell, a spell checker implemented in Python.
> 
> Signed-off-by: Stefan Weil <sw@weilnetz.de>
> ---
> 
> This RFC can already be used for manual tests, but still reports false
> positives, mostly because some variable names are interpreted as words.
> These words can either be ignored in the check, or in some cases the code
> might be changed to use different variable names.
> 
> The check currently only skips a few directories and files, so for example
> checked out submodules are also checked.
> 
> The rule can be extended to allow user provided ignore and skip lists,
> for example by introducing Makefile variables CODESPELL_SKIP=userfile
> or CODESPELL_IGNORE=userfile. A limited check could be implemented by
> providing a base directory CODESPELL_START=basedirectory, for example
> CODESPELL_START=docs.
> 
> Regards,
> Stefan
> 
>   tests/Makefile.include       | 10 ++++++++++
>   tests/codespell/README.rst   | 18 ++++++++++++++++++
>   tests/codespell/exclude-file |  3 +++
>   tests/codespell/ignore-words | 19 +++++++++++++++++++
>   tests/requirements.txt       |  1 +
>   5 files changed, 51 insertions(+)
>   create mode 100644 tests/codespell/README.rst
>   create mode 100644 tests/codespell/exclude-file
>   create mode 100644 tests/codespell/ignore-words

Just wondering about this list...

> +++ b/tests/codespell/ignore-words
> @@ -0,0 +1,19 @@
> +buid

What is 'buid'? PPC-specific apparently.

> +busses
> +dout
> +falt
> +fpr
> +hace
> +hax
> +hda
> +nd

Apparently 'NIC info'...

> +ot

Is 'ot' MemOp?

> +pard
> +parm
> +ptd
> +ser
> +som
> +synopsys
> +te

Is that 'target endianness'?

> +toke

Where is 'toke'?

> +ue
Where is 'ue'?

Stefan Weil Oct. 31, 2022, 4:45 p.m. UTC | #6

Below I added some examples for the words which are currently ignored by 
codespell.

Am 31.10.22 um 16:40 schrieb Philippe Mathieu-Daudé:

> On 31/10/22 08:43, Stefan Weil via wrote:
>> `make check-spelling` can now be used to get a list of spelling errors.
>> It uses the latest version of codespell, a spell checker implemented 
>> in Python.
>>
>> Signed-off-by: Stefan Weil <sw@weilnetz.de>
>> ---
>>
>> This RFC can already be used for manual tests, but still reports false
>> positives, mostly because some variable names are interpreted as words.
>> These words can either be ignored in the check, or in some cases the 
>> code
>> might be changed to use different variable names.
>>
>> The check currently only skips a few directories and files, so for 
>> example
>> checked out submodules are also checked.
>>
>> The rule can be extended to allow user provided ignore and skip lists,
>> for example by introducing Makefile variables CODESPELL_SKIP=userfile
>> or CODESPELL_IGNORE=userfile. A limited check could be implemented by
>> providing a base directory CODESPELL_START=basedirectory, for example
>> CODESPELL_START=docs.
>>
>> Regards,
>> Stefan
>>
>>   tests/Makefile.include       | 10 ++++++++++
>>   tests/codespell/README.rst   | 18 ++++++++++++++++++
>>   tests/codespell/exclude-file |  3 +++
>>   tests/codespell/ignore-words | 19 +++++++++++++++++++
>>   tests/requirements.txt       |  1 +
>>   5 files changed, 51 insertions(+)
>>   create mode 100644 tests/codespell/README.rst
>>   create mode 100644 tests/codespell/exclude-file
>>   create mode 100644 tests/codespell/ignore-words
>
> Just wondering about this list...
>
>> +++ b/tests/codespell/ignore-words
>> @@ -0,0 +1,19 @@
>> +buid
>
> What is 'buid'? PPC-specific apparently.

hw/ppc/spapr_pci.c:SpaprPhbState *spapr_pci_find_phb(SpaprMachineState 
*spapr, uint64_t buid)
include/hw/ppc/xics.h: * We currently only support one BUID which is our 
interrupt base
[...]


>> +busses
>> +dout
>> +falt
>> +fpr
>> +hace
>> +hax
>> +hda
>> +nd
>
> Apparently 'NIC info'...
hw/arm/aspeed.c:    NICInfo *nd = &nd_table[0];
hw/display/macfb.c:    NubusDevice *nd = NUBUS_DEVICE(s);
[...]


>> +ot
>
> Is 'ot' MemOp?

target/i386/tcg/decode-new.c.inc:static bool decode_op_size(DisasContext 
*s, X86OpEntry *e, X86OpSize size, MemOp *ot)
[...]


>> +pard
>> +parm
>> +ptd
>> +ser
>> +som
>> +synopsys
>> +te
>
> Is that 'target endianness'?

accel/tcg/cputlb.c: * @te: pointer to CPUTLBEntry
hw/audio/cs4231a.c:#define TE (1 << 6)
[...]


>> +toke
>
> Where is 'toke'?

This one is no longer needed. It was used in the old capstone code which 
I still had in my local sources.


>> +ue
> Where is 'ue'?
tests/tcg/i386/test-i386-fp-exceptions.c:#define UE (1 << 4)
tests/unit/test-keyval.c:    qdict = keyval_parse("val,,ue", "implied", 
NULL, &err);
[...]

I had simply added some examples of "words" which occur often and 
which codespell reports as typos. The following "typos" occur at least 
10 times (list produced with `grep "^[a-z]" codespell.log | sort -n -k 2`):

statics      10
regiser      11
usig         11
inh          12
tne          12
overriden    13
inactivate   15
upto         15
hsa          16
useable      17
daa          18
crate        21
endianess    22
olt          22
sring        23
vill         25
keypairs     35
gir          46
sav          47
asign       120
inflight    191

Some of them are real typos, others like aSign or statics are variable 
names and should be ignored, too.
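
For reference, a sketch of how such a list could be cut off at a minimum
count automatically. It assumes that the summary lines in the log consist
of a lowercase word followed by its count, as in the list above. The
function name frequent_typos is made up for this example.

```shell
# Print summary lines ("word count") with at least MIN occurrences,
# sorted by ascending count. MIN is the first argument, default 10.
frequent_typos() {
    awk -v min="${1:-10}" '/^[a-z]/ && $2 + 0 >= min' |
        sort -k 2,2n
}
```

Usage: `frequent_typos 10 < codespell.log`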

Stefan
