diff mbox series

[RFC] kselftest: devices: Add test to detect missing devices

Message ID 20240724-kselftest-dev-exist-v1-1-9bc21aa761b5@collabora.com (mailing list archive)
State New
Headers show
Series [RFC] kselftest: devices: Add test to detect missing devices | expand

Commit Message

Nícolas F. R. A. Prado July 24, 2024, 9:40 p.m. UTC
Introduce a new test to identify regressions causing devices to go
missing on the system.

For each bus and class on the system the test checks the number of
devices present against a reference file, which needs to have been
generated by the program at a previous point on a known-good kernel, and
if there are missing devices they are reported.

Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
---
Hi,

Key points about this test:
* Goal: Identify regressions causing devices to go missing on the system
* Focus:
  * Ease of maintenance: the reference file is generated programatically
  * Minimum of false-positives: the script makes as few assumptions as possible
    about the stability of device identifiers to ensure renames/refactors don't
    trigger false-positives
* How it works: For each bus and class on the system the test checks the number
  of devices present against a reference file, which needs to have been
  generated by the program at a previous point on a known-good kernel, and if
  there are missing devices they are reported.
* Comparison to other tests: It might be possible(*) to replace the discoverable
  devices test [1] with this. The benefits of this test is that it's easier
  to setup and maintain and has wider coverage of devices.

Additional detail:
* Having more devices on the running system than the reference does not cause a
  failure, but a warning is printed in that case to suggest that the reference
  be updated.
* Missing devices are detected per bus/class based on the number of devices.
  When the test fails, the known metadata for each of the expected and detected
  devices is printed and some simple similitarity comparison is done to suggest
  the devices that are the most likely to be missing.
* The proposed place to store the generated reference files is the
  'platform-test-parameters' repository in KernelCI [2].

Example output: This is an example of a failing test case when one of the two
devices in the nvmem bus went missing:

  # Missing devices for subsystem 'nvmem': 1 (Expected 2, found 1)
  # =================
  # Devices expected:
  #
  #   uevent:
  #     OF_NAME=efuse
  #     OF_FULLNAME=/soc/efuse@11c10000
  #     OF_COMPATIBLE_0=mediatek,mt8195-efuse
  #     OF_COMPATIBLE_1=mediatek,efuse
  #     OF_COMPATIBLE_N=2
  #
  #   uevent:
  #     OF_NAME=flash
  #     OF_FULLNAME=/soc/spi@1132c000/flash@0
  #     OF_COMPATIBLE_0=jedec,spi-nor
  #     OF_COMPATIBLE_N=1
  #
  # -----------------
  # Devices found:
  #
  #   uevent:
  #     OF_NAME=efuse
  #     OF_FULLNAME=/soc/efuse@11c10000
  #     OF_COMPATIBLE_0=mediatek,mt8195-efuse
  #     OF_COMPATIBLE_1=mediatek,efuse
  #     OF_COMPATIBLE_N=2
  #
  # -----------------
  # Devices missing (best guess):
  #
  #   uevent:
  #     OF_NAME=flash
  #     OF_FULLNAME=/soc/spi@1132c000/flash@0
  #     OF_COMPATIBLE_0=jedec,spi-nor
  #     OF_COMPATIBLE_N=1
  #
  # =================
  not ok 19 bus.nvmem

Example of how the data for these devices is encoded in the reference file:

  bus:
  ...
    nvmem:
      count: 2
      devices:
      - info:
          uevent: 'OF_NAME=efuse
  
            OF_FULLNAME=/soc/efuse@11c10000
  
            OF_COMPATIBLE_0=mediatek,mt8195-efuse
  
            OF_COMPATIBLE_1=mediatek,efuse
  
            OF_COMPATIBLE_N=2
  
            '
      - info:
          uevent: 'OF_NAME=flash
  
            OF_FULLNAME=/soc/spi@1132c000/flash@0
  
            OF_COMPATIBLE_0=jedec,spi-nor
  
            OF_COMPATIBLE_N=1
  
            '

(Full reference file: http://0x0.st/Xp60.yaml;)

Caveat: Relying only on the count of devices in a subsystem makes the test
susceptible to false-negatives eg. if a device goes missing and another in the
same subsystem is added the count will be the same so this regression won't be
reported. In order to avoid this we may include properties that must match
individual devices, but we must be very careful (and it's why I haven't done it)
since matching against properties that aren't guaranteed to be stable will
introduce false-positives (ie. detecting false regressions) due to eventual
renames.

Some things to improve in the near future / gather feedback on:
* (*): Currently this test only checks for the existence of devices. We could
  extend it to also encode into the reference which devices are bound to drivers
  to be able to completely replace the discoverable devices probe kselftest [1].
* Expanding identifying properties: Currently the properties that are stored
  (when present) in the reference for each device to be used for identification
  in the result output are uevent, device/uevent, firmware_node/uevent and name.
  Suggestions of others properties to add are welcome.
* Adding more filtering to reduce noise:
  * Ignoring buses/classes: Currently the devlink class is ignored by the test
    since it seems like a kernel internal detail that userspace doesn't actually
    care about. We should add others that are similar.
  * Ignoring non-devices: There can be entries in /sys/class/ that aren't
    devices. For now we're filtering down to only symlinks, but there might be a
    better way.
* As mentioned in the caveat section above we may want to add actual matching
  of devices based on properties to avoid false-negatives if we identify
  suitable properties.
* It would be nice to have an option in the program to compare a newer reference
  to an older one to make it easier for the user to see the differences and
  decide if the new reference is ok.
* Since the reference file is not supposed to be manually edited, JSON might be
  a better choice than YAML since it is included in the python standard library.

Let me know your thoughts.

Thanks,
Nícolas

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/devices/probe/test_discoverable_devices.py?id=0debb20c5c812f8750c20c3406bc94a1e8ea4742
[2] https://github.com/kernelci/platform-test-parameters
---
 tools/testing/selftests/Makefile               |   1 +
 tools/testing/selftests/devices/exist/Makefile |   3 +
 tools/testing/selftests/devices/exist/exist.py | 268 +++++++++++++++++++++++++
 3 files changed, 272 insertions(+)


---
base-commit: 73399b58e5e5a1b28a04baf42e321cfcfc663c2f
change-id: 20240724-kselftest-dev-exist-bb1bcf884654

Best regards,

Comments

Shuah Khan July 31, 2024, 11:19 p.m. UTC | #1
On 7/24/24 15:40, Nícolas F. R. A. Prado wrote:
> Introduce a new test to identify regressions causing devices to go
> missing on the system.
> 
> For each bus and class on the system the test checks the number of
> devices present against a reference file, which needs to have been
> generated by the program at a previous point on a known-good kernel, and
> if there are missing devices they are reported.

Can you elaborate on how to generate reference file? It isn't clear.

> 
> Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
> ---
> Hi,
> 
> Key points about this test:
> * Goal: Identify regressions causing devices to go missing on the system
> * Focus:
>    * Ease of maintenance: the reference file is generated programatically
>    * Minimum of false-positives: the script makes as few assumptions as possible
>      about the stability of device identifiers to ensure renames/refactors don't
>      trigger false-positives
> * How it works: For each bus and class on the system the test checks the number
>    of devices present against a reference file, which needs to have been
>    generated by the program at a previous point on a known-good kernel, and if
>    there are missing devices they are reported.
> * Comparison to other tests: It might be possible(*) to replace the discoverable
>    devices test [1] with this. The benefits of this test is that it's easier
>    to setup and maintain and has wider coverage of devices.
> 
> Additional detail:
> * Having more devices on the running system than the reference does not cause a
>    failure, but a warning is printed in that case to suggest that the reference
>    be updated.
> * Missing devices are detected per bus/class based on the number of devices.
>    When the test fails, the known metadata for each of the expected and detected
>    devices is printed and some simple similitarity comparison is done to suggest
>    the devices that are the most likely to be missing.
> * The proposed place to store the generated reference files is the
>    'platform-test-parameters' repository in KernelCI [2].

How would a user run this on their systems - do they need to access
this repository in KernelCI?

This is what I see when I run the test on my system:

make -C tools/testing/selftests/devices/exist/ run_tests
make: Entering directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'
TAP version 13
1..1
# timeout set to 45
# selftests: devices/exist: exist.py
# TAP version 13
# # No matching reference file found (tried './LENOVO,20XH005JUS.yaml')
# # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: devices/exist: exist.py # exit=1
make: Leaving directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'

thanks,
-- Shuah
Nícolas F. R. A. Prado Aug. 1, 2024, 7:15 p.m. UTC | #2
On Wed, Jul 31, 2024 at 05:19:45PM -0600, Shuah Khan wrote:
> On 7/24/24 15:40, Nícolas F. R. A. Prado wrote:
> > Introduce a new test to identify regressions causing devices to go
> > missing on the system.
> > 
> > For each bus and class on the system the test checks the number of
> > devices present against a reference file, which needs to have been
> > generated by the program at a previous point on a known-good kernel, and
> > if there are missing devices they are reported.
> 
> Can you elaborate on how to generate reference file? It isn't clear.

Indeed, I'll make that information clearer in future versions.

The reference file is generated by passing the --generate-reference flag to the
test:

./exist.py --generate-reference

It will be printed as standard output.

> 
> > 
> > Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
> > ---
> > Hi,
> > 
> > Key points about this test:
> > * Goal: Identify regressions causing devices to go missing on the system
> > * Focus:
> >    * Ease of maintenance: the reference file is generated programatically
> >    * Minimum of false-positives: the script makes as few assumptions as possible
> >      about the stability of device identifiers to ensure renames/refactors don't
> >      trigger false-positives
> > * How it works: For each bus and class on the system the test checks the number
> >    of devices present against a reference file, which needs to have been
> >    generated by the program at a previous point on a known-good kernel, and if
> >    there are missing devices they are reported.
> > * Comparison to other tests: It might be possible(*) to replace the discoverable
> >    devices test [1] with this. The benefits of this test is that it's easier
> >    to setup and maintain and has wider coverage of devices.
> > 
> > Additional detail:
> > * Having more devices on the running system than the reference does not cause a
> >    failure, but a warning is printed in that case to suggest that the reference
> >    be updated.
> > * Missing devices are detected per bus/class based on the number of devices.
> >    When the test fails, the known metadata for each of the expected and detected
> >    devices is printed and some simple similitarity comparison is done to suggest
> >    the devices that are the most likely to be missing.
> > * The proposed place to store the generated reference files is the
> >    'platform-test-parameters' repository in KernelCI [2].
> 
> How would a user run this on their systems - do they need to access
> this repository in KernelCI?

No, that repository would just be a place where people could find pre-generated
reference files (which we'll be using when running this test in KernelCI), but
anyone can always generate their own reference files and store them wherever
they want.

> 
> This is what I see when I run the test on my system:
> 
> make -C tools/testing/selftests/devices/exist/ run_tests
> make: Entering directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'
> TAP version 13
> 1..1
> # timeout set to 45
> # selftests: devices/exist: exist.py
> # TAP version 13
> # # No matching reference file found (tried './LENOVO,20XH005JUS.yaml')

First generate the reference file for your system like so:

tools/testing/selftests/devices/exist/exist.py --generate-reference > tools/testing/selftests/devices/exist/LENOVO,20XH005JUS.yaml

Then you can run the test and it should work.

Thanks,
Nícolas
Shuah Khan Aug. 1, 2024, 8:13 p.m. UTC | #3
On 8/1/24 13:15, Nícolas F. R. A. Prado wrote:
> On Wed, Jul 31, 2024 at 05:19:45PM -0600, Shuah Khan wrote:
>> On 7/24/24 15:40, Nícolas F. R. A. Prado wrote:
>>> Introduce a new test to identify regressions causing devices to go
>>> missing on the system.
>>>
>>> For each bus and class on the system the test checks the number of
>>> devices present against a reference file, which needs to have been
>>> generated by the program at a previous point on a known-good kernel, and
>>> if there are missing devices they are reported.
>>
>> Can you elaborate on how to generate reference file? It isn't clear.
> 
> Indeed, I'll make that information clearer in future versions.
> 
> The reference file is generated by passing the --generate-reference flag to the
> test:
> 
> ./exist.py --generate-reference
> 
> It will be printed as standard output.

How about adding an option to generate file taking filename?
Makes it easier to use.

> 
>>
>>>
>>> Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
>>> ---
>>> Hi,
>>>
>>> Key points about this test:
>>> * Goal: Identify regressions causing devices to go missing on the system
>>> * Focus:
>>>     * Ease of maintenance: the reference file is generated programatically
>>>     * Minimum of false-positives: the script makes as few assumptions as possible
>>>       about the stability of device identifiers to ensure renames/refactors don't
>>>       trigger false-positives
>>> * How it works: For each bus and class on the system the test checks the number
>>>     of devices present against a reference file, which needs to have been
>>>     generated by the program at a previous point on a known-good kernel, and if
>>>     there are missing devices they are reported.
>>> * Comparison to other tests: It might be possible(*) to replace the discoverable
>>>     devices test [1] with this. The benefits of this test is that it's easier
>>>     to setup and maintain and has wider coverage of devices.
>>>
>>> Additional detail:
>>> * Having more devices on the running system than the reference does not cause a
>>>     failure, but a warning is printed in that case to suggest that the reference
>>>     be updated.
>>> * Missing devices are detected per bus/class based on the number of devices.
>>>     When the test fails, the known metadata for each of the expected and detected
>>>     devices is printed and some simple similitarity comparison is done to suggest
>>>     the devices that are the most likely to be missing.
>>> * The proposed place to store the generated reference files is the
>>>     'platform-test-parameters' repository in KernelCI [2].
>>
>> How would a user run this on their systems - do they need to access
>> this repository in KernelCI?
> 
> No, that repository would just be a place where people could find pre-generated
> reference files (which we'll be using when running this test in KernelCI), but
> anyone can always generate their own reference files and store them wherever
> they want.
> 

Thanks for the clarification. This might be good addition to the document.
I think this test could benefit from a README or howto

>>
>> This is what I see when I run the test on my system:
>>
>> make -C tools/testing/selftests/devices/exist/ run_tests
>> make: Entering directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'
>> TAP version 13
>> 1..1
>> # timeout set to 45
>> # selftests: devices/exist: exist.py
>> # TAP version 13
>> # # No matching reference file found (tried './LENOVO,20XH005JUS.yaml')
> 
> First generate the reference file for your system like so:
> 
> tools/testing/selftests/devices/exist/exist.py --generate-reference > tools/testing/selftests/devices/exist/LENOVO,20XH005JUS.yaml
> 

Worked - I see

TAP version 13
# Using reference file: ./LENOVO,20XH005JUS.yaml
1..76

---
# Totals: pass:76 fail:0 xfail:0 xpass:0 skip:0 error:0


Things to improve:

- Have the script take a file instead of assuming that the reference file
   is in the current directory.
   e.g: exist.py -f reference_file

thanks,
-- Shuah
Nícolas F. R. A. Prado Aug. 1, 2024, 9:03 p.m. UTC | #4
On Thu, Aug 01, 2024 at 02:13:05PM -0600, Shuah Khan wrote:
> On 8/1/24 13:15, Nícolas F. R. A. Prado wrote:
> > On Wed, Jul 31, 2024 at 05:19:45PM -0600, Shuah Khan wrote:
> > > On 7/24/24 15:40, Nícolas F. R. A. Prado wrote:
> > > > Introduce a new test to identify regressions causing devices to go
> > > > missing on the system.
> > > > 
> > > > For each bus and class on the system the test checks the number of
> > > > devices present against a reference file, which needs to have been
> > > > generated by the program at a previous point on a known-good kernel, and
> > > > if there are missing devices they are reported.
> > > 
> > > Can you elaborate on how to generate reference file? It isn't clear.
> > 
> > Indeed, I'll make that information clearer in future versions.
> > 
> > The reference file is generated by passing the --generate-reference flag to the
> > test:
> > 
> > ./exist.py --generate-reference
> > 
> > It will be printed as standard output.
> 
> How about adding an option to generate file taking filename?
> Makes it easier to use.

Sure, we can do that. Another option would be to write it to the filename that
would be looked for by default. So for your machine just calling

  ./exist.py --generate-reference

could write the reference to ./LENOVO,20XH005JUS.yaml.

> 
> > 
> > > 
> > > > 
> > > > Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
> > > > ---
> > > > Hi,
> > > > 
> > > > Key points about this test:
> > > > * Goal: Identify regressions causing devices to go missing on the system
> > > > * Focus:
> > > >     * Ease of maintenance: the reference file is generated programatically
> > > >     * Minimum of false-positives: the script makes as few assumptions as possible
> > > >       about the stability of device identifiers to ensure renames/refactors don't
> > > >       trigger false-positives
> > > > * How it works: For each bus and class on the system the test checks the number
> > > >     of devices present against a reference file, which needs to have been
> > > >     generated by the program at a previous point on a known-good kernel, and if
> > > >     there are missing devices they are reported.
> > > > * Comparison to other tests: It might be possible(*) to replace the discoverable
> > > >     devices test [1] with this. The benefits of this test is that it's easier
> > > >     to setup and maintain and has wider coverage of devices.
> > > > 
> > > > Additional detail:
> > > > * Having more devices on the running system than the reference does not cause a
> > > >     failure, but a warning is printed in that case to suggest that the reference
> > > >     be updated.
> > > > * Missing devices are detected per bus/class based on the number of devices.
> > > >     When the test fails, the known metadata for each of the expected and detected
> > > >     devices is printed and some simple similitarity comparison is done to suggest
> > > >     the devices that are the most likely to be missing.
> > > > * The proposed place to store the generated reference files is the
> > > >     'platform-test-parameters' repository in KernelCI [2].
> > > 
> > > How would a user run this on their systems - do they need to access
> > > this repository in KernelCI?
> > 
> > No, that repository would just be a place where people could find pre-generated
> > reference files (which we'll be using when running this test in KernelCI), but
> > anyone can always generate their own reference files and store them wherever
> > they want.
> > 
> 
> Thanks for the clarification. This might be good addition to the document.
> I think this test could benefit from a README or howto

Sure, I can add a README in the next revision.

> 
> > > 
> > > This is what I see when I run the test on my system:
> > > 
> > > make -C tools/testing/selftests/devices/exist/ run_tests
> > > make: Entering directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'
> > > TAP version 13
> > > 1..1
> > > # timeout set to 45
> > > # selftests: devices/exist: exist.py
> > > # TAP version 13
> > > # # No matching reference file found (tried './LENOVO,20XH005JUS.yaml')
> > 
> > First generate the reference file for your system like so:
> > 
> > tools/testing/selftests/devices/exist/exist.py --generate-reference > tools/testing/selftests/devices/exist/LENOVO,20XH005JUS.yaml
> > 
> 
> Worked - I see
> 
> TAP version 13
> # Using reference file: ./LENOVO,20XH005JUS.yaml
> 1..76
> 
> ---
> # Totals: pass:76 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> 
> Things to improve:
> 
> - Have the script take a file instead of assuming that the reference file
>   is in the current directory.
>   e.g: exist.py -f reference_file

The script also has another parameter to specify a different directory to look
for the reference file: --reference-dir

But the file name is currently fixed and determined from the system's ID (DMI or
Devicetree compatible).

We can definitely have another flag to force a different file name if that's
useful. In theory it shouldn't be needed given the machine name is used as
filename, but might come in handy if there are machine name clashes or if you
want to have references for different kernel stable versions for the same
machine in the same directory.

Thanks,
Nícolas
Shuah Khan Aug. 1, 2024, 9:58 p.m. UTC | #5
On 8/1/24 15:03, Nícolas F. R. A. Prado wrote:
> On Thu, Aug 01, 2024 at 02:13:05PM -0600, Shuah Khan wrote:
>> On 8/1/24 13:15, Nícolas F. R. A. Prado wrote:
>>> On Wed, Jul 31, 2024 at 05:19:45PM -0600, Shuah Khan wrote:
>>>> On 7/24/24 15:40, Nícolas F. R. A. Prado wrote:
>>>>> Introduce a new test to identify regressions causing devices to go
>>>>> missing on the system.
>>>>>
>>>>> For each bus and class on the system the test checks the number of
>>>>> devices present against a reference file, which needs to have been
>>>>> generated by the program at a previous point on a known-good kernel, and
>>>>> if there are missing devices they are reported.
>>>>
>>>> Can you elaborate on how to generate reference file? It isn't clear.
>>>
>>> Indeed, I'll make that information clearer in future versions.
>>>
>>> The reference file is generated by passing the --generate-reference flag to the
>>> test:
>>>
>>> ./exist.py --generate-reference
>>>
>>> It will be printed as standard output.
>>
>> How about adding an option to generate file taking filename?
>> Makes it easier to use.
> 
> Sure, we can do that. Another option would be to write it to the filename that
> would be looked for by default. So for your machine just calling
> 
>    ./exist.py --generate-reference
> 
> could write the reference to ./LENOVO,20XH005JUS.yaml.

You could. Do mention this as the default option and to the
help message.

>
>>>
>>> No, that repository would just be a place where people could find pre-generated
>>> reference files (which we'll be using when running this test in KernelCI), but
>>> anyone can always generate their own reference files and store them wherever
>>> they want.
>>>
>>
>> Thanks for the clarification. This might be good addition to the document.
>> I think this test could benefit from a README or howto
> 
> Sure, I can add a README in the next revision.
> 
>>
>>>>
>>>> This is what I see when I run the test on my system:
>>>>
>>>> make -C tools/testing/selftests/devices/exist/ run_tests
>>>> make: Entering directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'
>>>> TAP version 13
>>>> 1..1
>>>> # timeout set to 45
>>>> # selftests: devices/exist: exist.py
>>>> # TAP version 13
>>>> # # No matching reference file found (tried './LENOVO,20XH005JUS.yaml')
>>>
>>> First generate the reference file for your system like so:
>>>
>>> tools/testing/selftests/devices/exist/exist.py --generate-reference > tools/testing/selftests/devices/exist/LENOVO,20XH005JUS.yaml
>>>
>>
>> Worked - I see
>>
>> TAP version 13
>> # Using reference file: ./LENOVO,20XH005JUS.yaml
>> 1..76
>>
>> ---
>> # Totals: pass:76 fail:0 xfail:0 xpass:0 skip:0 error:0
>>
>>
>> Things to improve:
>>
>> - Have the script take a file instead of assuming that the reference file
>>    is in the current directory.
>>    e.g: exist.py -f reference_file
> 
> The script also has another parameter to specify a different directory to look
> for the reference file: --reference-dir
> 
> But the file name is currently fixed and determined from the system's ID (DMI or
> Devicetree compatible).
> 
> We can definitely have another flag to force a different file name if that's
> useful. In theory it shouldn't be needed given the machine name is used as
> filename, but might come in handy if there are machine name clashes or if you
> want to have references for different kernel stable versions for the same
> machine in the same directory.

Providing an option to force is good.

thanks,
-- Shuah
Bird, Tim Aug. 1, 2024, 10:04 p.m. UTC | #6
> -----Original Message-----
> From: Nícolas F. R. A. Prado <nfraprado@collabora.com>
> On Thu, Aug 01, 2024 at 02:13:05PM -0600, Shuah Khan wrote:
> > On 8/1/24 13:15, Nícolas F. R. A. Prado wrote:
> > > On Wed, Jul 31, 2024 at 05:19:45PM -0600, Shuah Khan wrote:
> > > > On 7/24/24 15:40, Nícolas F. R. A. Prado wrote:
> > > > > Introduce a new test to identify regressions causing devices to go
> > > > > missing on the system.
> > > > >
> > > > > For each bus and class on the system the test checks the number of
> > > > > devices present against a reference file, which needs to have been
> > > > > generated by the program at a previous point on a known-good kernel, and
> > > > > if there are missing devices they are reported.
> > > >
> > > > Can you elaborate on how to generate reference file? It isn't clear.
> > >
> > > Indeed, I'll make that information clearer in future versions.
> > >
> > > The reference file is generated by passing the --generate-reference flag to the
> > > test:
> > >
> > > ./exist.py --generate-reference
> > >
> > > It will be printed as standard output.
> >
> > How about adding an option to generate file taking filename?
> > Makes it easier to use.
> 
> Sure, we can do that. Another option would be to write it to the filename that
> would be looked for by default. So for your machine just calling
> 
>   ./exist.py --generate-reference
> 
> could write the reference to ./LENOVO,20XH005JUS.yaml.
> 
> >
> > >
> > > >
> > > > >
> > > > > Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
> > > > > ---
> > > > > Hi,
> > > > >
> > > > > Key points about this test:
> > > > > * Goal: Identify regressions causing devices to go missing on the system
> > > > > * Focus:
> > > > >     * Ease of maintenance: the reference file is generated programatically
> > > > >     * Minimum of false-positives: the script makes as few assumptions as possible
> > > > >       about the stability of device identifiers to ensure renames/refactors don't
> > > > >       trigger false-positives
> > > > > * How it works: For each bus and class on the system the test checks the number
> > > > >     of devices present against a reference file, which needs to have been
> > > > >     generated by the program at a previous point on a known-good kernel, and if
> > > > >     there are missing devices they are reported.
> > > > > * Comparison to other tests: It might be possible(*) to replace the discoverable
> > > > >     devices test [1] with this. The benefits of this test is that it's easier
> > > > >     to setup and maintain and has wider coverage of devices.
> > > > >
> > > > > Additional detail:
> > > > > * Having more devices on the running system than the reference does not cause a
> > > > >     failure, but a warning is printed in that case to suggest that the reference
> > > > >     be updated.
> > > > > * Missing devices are detected per bus/class based on the number of devices.
> > > > >     When the test fails, the known metadata for each of the expected and detected
> > > > >     devices is printed and some simple similitarity comparison is done to suggest
> > > > >     the devices that are the most likely to be missing.
> > > > > * The proposed place to store the generated reference files is the
> > > > >     'platform-test-parameters' repository in KernelCI [2].
> > > >
> > > > How would a user run this on their systems - do they need to access
> > > > this repository in KernelCI?
> > >
> > > No, that repository would just be a place where people could find pre-generated
> > > reference files (which we'll be using when running this test in KernelCI), but
> > > anyone can always generate their own reference files and store them wherever
> > > they want.
> > >
> >
> > Thanks for the clarification. This might be good addition to the document.
> > I think this test could benefit from a README or howto
> 
> Sure, I can add a README in the next revision.
> 
> >
> > > >
> > > > This is what I see when I run the test on my system:
> > > >
> > > > make -C tools/testing/selftests/devices/exist/ run_tests
> > > > make: Entering directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'
> > > > TAP version 13
> > > > 1..1
> > > > # timeout set to 45
> > > > # selftests: devices/exist: exist.py
> > > > # TAP version 13
> > > > # # No matching reference file found (tried './LENOVO,20XH005JUS.yaml')
> > >
> > > First generate the reference file for your system like so:
> > >
> > > tools/testing/selftests/devices/exist/exist.py --generate-reference > tools/testing/selftests/devices/exist/LENOVO,20XH005JUS.yaml
> > >
> >
> > Worked - I see
> >
> > TAP version 13
> > # Using reference file: ./LENOVO,20XH005JUS.yaml
> > 1..76
> >
> > ---
> > # Totals: pass:76 fail:0 xfail:0 xpass:0 skip:0 error:0
> >
> >
> > Things to improve:
> >
> > - Have the script take a file instead of assuming that the reference file
> >   is in the current directory.
> >   e.g: exist.py -f reference_file
> 
> The script also has another parameter to specify a different directory to look
> for the reference file: --reference-dir
> 
> But the file name is currently fixed and determined from the system's ID (DMI or
> Devicetree compatible).
> 
> We can definitely have another flag to force a different file name if that's
> useful. In theory it shouldn't be needed given the machine name is used as
> filename, but might come in handy if there are machine name clashes or if you
> want to have references for different kernel stable versions for the same
> machine in the same directory.

I would recommend having a flag that allows specifying the filename (which overrides
the automatically determined filename).  I am watching this thread with great interest,
as I am planning on proposing, at Plumbers, another system which will utilize reference
values (the 'adding benchmark results support to KTAP/kselftest' topic that I plan to
give at the testing MC.)

Since this test uses reference values, it has all the same issues as my proposal:
 - naming of the file where the reference values live
 - some method to automatically select an appropriate reference values file
This patch uses the  system's ID, but other things (like kernel config) will likely
affect the set of devices on the system.  So a more complex selection mechanism
may eventually be needed.

 - generating or updating the reference value file
 - where to store the reference file
    - e.g. when would it be appropriate to store it upstream and when elsewhere?

I don't want to gum up the acceptance of this patch by raising all these issues
now.  I think it's acceptable to have a tester generate their own reference file
prior to first use of the test, as the simplest way to use this test.  But I think
that having pre-generated reference files that testers can use, with
known-good values, will be valuable in the long term.

These issues can be discussed at Plumbers, IMHO.

Regards,
 -- Tim
Nícolas F. R. A. Prado Aug. 2, 2024, 3:48 p.m. UTC | #7
On Thu, Aug 01, 2024 at 10:04:36PM +0000, Bird, Tim wrote:
> 
> 
> > -----Original Message-----
> > From: Nícolas F. R. A. Prado <nfraprado@collabora.com>
> > On Thu, Aug 01, 2024 at 02:13:05PM -0600, Shuah Khan wrote:
> > > On 8/1/24 13:15, Nícolas F. R. A. Prado wrote:
> > > > On Wed, Jul 31, 2024 at 05:19:45PM -0600, Shuah Khan wrote:
> > > > > On 7/24/24 15:40, Nícolas F. R. A. Prado wrote:
> > > > > > Introduce a new test to identify regressions causing devices to go
> > > > > > missing on the system.
> > > > > >
> > > > > > For each bus and class on the system the test checks the number of
> > > > > > devices present against a reference file, which needs to have been
> > > > > > generated by the program at a previous point on a known-good kernel, and
> > > > > > if there are missing devices they are reported.
> > > > >
> > > > > Can you elaborate on how to generate reference file? It isn't clear.
> > > >
> > > > Indeed, I'll make that information clearer in future versions.
> > > >
> > > > The reference file is generated by passing the --generate-reference flag to the
> > > > test:
> > > >
> > > > ./exist.py --generate-reference
> > > >
> > > > It will be printed as standard output.
> > >
> > > How about adding an option to generate file taking filename?
> > > Makes it easier to use.
> > 
> > Sure, we can do that. Another option would be to write it to the filename that
> > would be looked for by default. So for your machine just calling
> > 
> >   ./exist.py --generate-reference
> > 
> > could write the reference to ./LENOVO,20XH005JUS.yaml.
> > 
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
> > > > > > ---
> > > > > > Hi,
> > > > > >
> > > > > > Key points about this test:
> > > > > > * Goal: Identify regressions causing devices to go missing on the system
> > > > > > * Focus:
> > > > > >     * Ease of maintenance: the reference file is generated programatically
> > > > > >     * Minimum of false-positives: the script makes as few assumptions as possible
> > > > > >       about the stability of device identifiers to ensure renames/refactors don't
> > > > > >       trigger false-positives
> > > > > > * How it works: For each bus and class on the system the test checks the number
> > > > > >     of devices present against a reference file, which needs to have been
> > > > > >     generated by the program at a previous point on a known-good kernel, and if
> > > > > >     there are missing devices they are reported.
> > > > > > * Comparison to other tests: It might be possible(*) to replace the discoverable
> > > > > >     devices test [1] with this. The benefits of this test is that it's easier
> > > > > >     to setup and maintain and has wider coverage of devices.
> > > > > >
> > > > > > Additional detail:
> > > > > > * Having more devices on the running system than the reference does not cause a
> > > > > >     failure, but a warning is printed in that case to suggest that the reference
> > > > > >     be updated.
> > > > > > * Missing devices are detected per bus/class based on the number of devices.
> > > > > >     When the test fails, the known metadata for each of the expected and detected
> > > > > >     devices is printed and some simple similitarity comparison is done to suggest
> > > > > >     the devices that are the most likely to be missing.
> > > > > > * The proposed place to store the generated reference files is the
> > > > > >     'platform-test-parameters' repository in KernelCI [2].
> > > > >
> > > > > How would a user run this on their systems - do they need to access
> > > > > this repository in KernelCI?
> > > >
> > > > No, that repository would just be a place where people could find pre-generated
> > > > reference files (which we'll be using when running this test in KernelCI), but
> > > > anyone can always generate their own reference files and store them wherever
> > > > they want.
> > > >
> > >
> > > Thanks for the clarification. This might be good addition to the document.
> > > I think this test could benefit from a README or howto
> > 
> > Sure, I can add a README in the next revision.
> > 
> > >
> > > > >
> > > > > This is what I see when I run the test on my system:
> > > > >
> > > > > make -C tools/testing/selftests/devices/exist/ run_tests
> > > > > make: Entering directory '/linux/linux_6.11/tools/testing/selftests/devices/exist'
> > > > > TAP version 13
> > > > > 1..1
> > > > > # timeout set to 45
> > > > > # selftests: devices/exist: exist.py
> > > > > # TAP version 13
> > > > > # # No matching reference file found (tried './LENOVO,20XH005JUS.yaml')
> > > >
> > > > First generate the reference file for your system like so:
> > > >
> > > > tools/testing/selftests/devices/exist/exist.py --generate-reference > tools/testing/selftests/devices/exist/LENOVO,20XH005JUS.yaml
> > > >
> > >
> > > Worked - I see
> > >
> > > TAP version 13
> > > # Using reference file: ./LENOVO,20XH005JUS.yaml
> > > 1..76
> > >
> > > ---
> > > # Totals: pass:76 fail:0 xfail:0 xpass:0 skip:0 error:0
> > >
> > >
> > > Things to improve:
> > >
> > > - Have the script take a file instead of assuming that the reference file
> > >   is in the current directory.
> > >   e.g: exist.py -f reference_file
> > 
> > The script also has another parameter to specify a different directory to look
> > for the reference file: --reference-dir
> > 
> > But the file name is currently fixed and determined from the system's ID (DMI or
> > Devicetree compatible).
> > 
> > We can definitely have another flag to force a different file name if that's
> > useful. In theory it shouldn't be needed given the machine name is used as
> > filename, but might come in handy if there are machine name clashes or if you
> > want to have references for different kernel stable versions for the same
> > machine in the same directory.
> 
> I would recommend having a flag that allows specifying the filename (which overrides
> the automatically determined filename).

Alright, will do in v2.

> I am watching this thread with great interest,
> as I am planning on proposing, at Plumbers, another system which will utilize reference
> values (the 'adding benchmark results support to KTAP/kselftest' topic that I plan to
> give at the testing MC.)
> 
> Since this test uses reference values, it has all the same issues as my proposal:
>  - naming of the file where the reference values live
>  - some method to automatically select an appropriate reference values file
> This patch uses the  system's ID, but other things (like kernel config) will likely
> affect the set of devices on the system.  So a more complex selection mechanism
> may eventually be needed.

Right... So far my thinking has been to name the reference files with just the
system ID, and let the user/CI specify a different directory if there are other
parameters. So you might have a different directory for every stable kernel you
want to test, or for each specific kernel configuration.

But I'm definitely open to exploring methods to choose the appropriate file
considering multiple parameters, and if that can be shared across different
kselftests, all the better.

It would be simple to incorporate the kernel version into such a method: look
for the reference with closest version that is lesser or equal. But for the
kernel configuration I'm not so sure (maybe taking a similarity hash of the
.config...?).

> 
>  - generating or updating the reference value file

One thing that I do in this test is to detect whether, despite all expected
devices being present, there are additional devices, busses or classes in the
system. In this case, even though test passes, a warning is printed:

  Warning: The current system contains more devices and/or subsystems than the reference. Updating the reference is recommended.

Ultimately the responsbility of updating the test is still left to the users. In
this kind of test that relies on reference values, the test is only as good as
the reference, so while fully automating the generation and update of the
reference file would remove a burden, I think it's important to have a human
validate it.

>  - where to store the reference file
>     - e.g. when would it be appropriate to store it upstream and when elsewhere?
> 
> I don't want to gum up the acceptance of this patch by raising all these issues
> now.  I think it's acceptable to have a tester generate their own reference file
> prior to first use of the test, as the simplest way to use this test.  But I think
> that having pre-generated reference files that testers can use, with
> known-good values, will be valuable in the long term.
> 
> These issues can be discussed at Plumbers, IMHO.

Yes, I'll be there at your session so I can join in on the discussion. I also
have a session at the testing MC where we can discuss this test and other topics
relating to generic device testing:

https://lpc.events/event/18/contributions/1794/

Thanks,
Nícolas
diff mbox series

Patch

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index bc8fe9e8f7f2..9c49b5ec5bef 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -14,6 +14,7 @@  TARGETS += cpufreq
 TARGETS += cpu-hotplug
 TARGETS += damon
 TARGETS += devices/error_logs
+TARGETS += devices/exist
 TARGETS += devices/probe
 TARGETS += dmabuf-heaps
 TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/devices/exist/Makefile b/tools/testing/selftests/devices/exist/Makefile
new file mode 100644
index 000000000000..3075cac32092
--- /dev/null
+++ b/tools/testing/selftests/devices/exist/Makefile
@@ -0,0 +1,3 @@ 
+TEST_PROGS := exist.py
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/devices/exist/exist.py b/tools/testing/selftests/devices/exist/exist.py
new file mode 100755
index 000000000000..8241b2fabc8e
--- /dev/null
+++ b/tools/testing/selftests/devices/exist/exist.py
@@ -0,0 +1,268 @@ 
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024 Collabora Ltd
+
+# * Goal: Identify regressions causing devices to go missing on the system
+# * Focus:
+#   * Ease of maintenance: the reference file is generated programatically
+#   * Minimum of false-positives: the script makes as few assumptions as
+#     possible about the stability of device identifiers to ensure
+#     renames/refactors don't trigger false-positives
+# * How it works: For each bus and class on the system the test checks the
+#   number of devices present against a reference file, which needs to have been
+#   generated by the program at a previous point on a known-good kernel, and if
+#   there are missing devices they are reported.
+
+import os
+import sys
+import argparse
+
+import yaml
+
+# Allow ksft module to be imported from different directory
+this_dir = os.path.dirname(os.path.realpath(__file__))
+sys.path.append(os.path.join(this_dir, "../../kselftest/"))
+
+import ksft
+
+
+def generate_devs_obj():
+    obj = {}
+
+    device_sources = [
+        {
+            "base_dir": "/sys/class",
+            "add_path": "",
+            "key_name": "class",
+            "ignored": ["devlink"],
+        },
+        {
+            "base_dir": "/sys/bus",
+            "add_path": "devices",
+            "key_name": "bus",
+            "ignored": [],
+        },
+    ]
+
+    properties = sorted(["uevent", "device/uevent", "firmware_node/uevent", "name"])
+
+    for source in device_sources:
+        source_subsystems = {}
+        for subsystem in sorted(os.listdir(source["base_dir"])):
+            if subsystem in source["ignored"]:
+                continue
+
+            devs_path = os.path.join(source["base_dir"], subsystem, source["add_path"])
+            dev_dirs = [dev for dev in os.scandir(devs_path) if dev.is_symlink()]
+            devs_data = []
+            for dev_dir in dev_dirs:
+                dev_path = os.path.join(devs_path, dev_dir)
+                dev_data = {"info": {}}
+                for prop in properties:
+                    if os.path.isfile(os.path.join(dev_path, prop)):
+                        with open(os.path.join(dev_path, prop)) as f:
+                            dev_data["info"][prop] = f.read()
+                devs_data.append(dev_data)
+            if len(dev_dirs):
+                source_subsystems[subsystem] = {
+                    "count": len(dev_dirs),
+                    "devices": devs_data,
+                }
+        obj[source["key_name"]] = source_subsystems
+
+    return obj
+
+
+def commented(s):
+    return s.replace("\n", "\n# ")
+
+
+def indented(s, n):
+    return " " * n + s.replace("\n", "\n" + " " * n)
+
+
+def stripped(s):
+    return s.strip("\n")
+
+
+def devices_difference(dev1, dev2):
+    difference = 0
+
+    for prop in dev1["info"].keys():
+        for l1, l2 in zip(
+            dev1["info"].get(prop, "").split("\n"),
+            dev2["info"].get(prop, "").split("\n"),
+        ):
+            if l1 != l2:
+                difference += 1
+    return difference
+
+
+def guess_missing_devices(cur_devs_subsystem, ref_devs_subsystem):
+    # Detect what devices on the current system are the most similar to devices
+    # on the reference one by one until the leftovers are the most dissimilar
+    # devices and therefore most likely the missing ones.
+    found_count = cur_devs_subsystem["count"]
+    expected_count = ref_devs_subsystem["count"]
+    missing_count = found_count - expected_count
+
+    diffs = []
+    for cur_d in cur_devs_subsystem["devices"]:
+        for ref_d in ref_devs_subsystem["devices"]:
+            diffs.append((devices_difference(cur_d, ref_d), cur_d, ref_d))
+
+    diffs.sort(key=lambda x: x[0])
+
+    assigned_ref_devs = []
+    assigned_cur_devs = []
+    for diff in diffs:
+        if len(assigned_ref_devs) >= expected_count - missing_count:
+            break
+        if diff[1] in assigned_cur_devs or diff[2] in assigned_ref_devs:
+            continue
+        assigned_cur_devs.append(diff[1])
+        assigned_ref_devs.append(diff[2])
+
+    missing_devices = []
+    for d in ref_devs_subsystem["devices"]:
+        if d not in assigned_ref_devs:
+            missing_devices.append(d)
+
+    return missing_devices
+
+
+def dump_devices_info(cur_devs_subsystem, ref_devs_subsystem):
+    def dump_device_info(dev):
+        for name, val in dev["info"].items():
+            ksft.print_msg(indented(name + ":", 2))
+            val = stripped(val)
+            if val:
+                ksft.print_msg(commented(indented(val, 4)))
+        ksft.print_msg("")
+
+    ksft.print_msg("=================")
+    ksft.print_msg("Devices expected:")
+    ksft.print_msg("")
+    for d in ref_devs_subsystem["devices"]:
+        dump_device_info(d)
+    ksft.print_msg("-----------------")
+    ksft.print_msg("Devices found:")
+    ksft.print_msg("")
+    for d in cur_devs_subsystem["devices"]:
+        dump_device_info(d)
+    ksft.print_msg("-----------------")
+    ksft.print_msg("Devices missing (best guess):")
+    ksft.print_msg("")
+    missing_devices = guess_missing_devices(cur_devs_subsystem, ref_devs_subsystem)
+    for d in missing_devices:
+        dump_device_info(d)
+    ksft.print_msg("=================")
+
+
+def run_test(ref_filename):
+    ksft.print_msg(f"Using reference file: {ref_filename}")
+
+    with open(ref_filename) as f:
+        ref_devs_obj = yaml.safe_load(f)
+
+    num_tests = 0
+    for dev_source in ref_devs_obj.values():
+        num_tests += len(dev_source)
+    ksft.set_plan(num_tests)
+
+    cur_devs_obj = generate_devs_obj()
+
+    reference_outdated = False
+
+    for source, ref_devs_source_obj in ref_devs_obj.items():
+        for subsystem, ref_devs_subsystem_obj in ref_devs_source_obj.items():
+            test_name = f"{source}.{subsystem}"
+            if not (
+                cur_devs_obj.get(source) and cur_devs_obj.get(source).get(subsystem)
+            ):
+                ksft.print_msg(f"Device subsystem '{subsystem}' missing")
+                ksft.test_result_fail(test_name)
+                continue
+            cur_devs_subsystem_obj = cur_devs_obj[source][subsystem]
+
+            found_count = cur_devs_subsystem_obj["count"]
+            expected_count = ref_devs_subsystem_obj["count"]
+            if found_count < expected_count:
+                ksft.print_msg(
+                    f"Missing devices for subsystem '{subsystem}': {expected_count - found_count} (Expected {expected_count}, found {found_count})"
+                )
+                dump_devices_info(cur_devs_subsystem_obj, ref_devs_subsystem_obj)
+                ksft.test_result_fail(test_name)
+            else:
+                ksft.test_result_pass(test_name)
+                if found_count > expected_count:
+                    reference_outdated = True
+
+        if len(cur_devs_obj[source]) > len(ref_devs_source_obj):
+            reference_outdated = True
+
+    if reference_outdated:
+        ksft.print_msg(
+            "Warning: The current system contains more devices and/or subsystems than the reference. Updating the reference is recommended."
+        )
+
+
+def get_possible_ref_filenames():
+    filenames = []
+
+    dt_board_compatible_file = "/proc/device-tree/compatible"
+    if os.path.exists(dt_board_compatible_file):
+        with open(dt_board_compatible_file) as f:
+            for line in f:
+                compatibles = [compat for compat in line.split("\0") if compat]
+                filenames.extend(compatibles)
+    else:
+        dmi_id_dir = "/sys/devices/virtual/dmi/id"
+        vendor_dmi_file = os.path.join(dmi_id_dir, "sys_vendor")
+        product_dmi_file = os.path.join(dmi_id_dir, "product_name")
+
+        with open(vendor_dmi_file) as f:
+            vendor = f.read().replace("\n", "")
+        with open(product_dmi_file) as f:
+            product = f.read().replace("\n", "")
+
+        filenames = [vendor + "," + product]
+
+    return filenames
+
+
+def get_ref_filename(ref_dir):
+    chosen_ref_filename = ""
+    full_ref_paths = [os.path.join(ref_dir, f + ".yaml") for f in get_possible_ref_filenames()]
+    for path in full_ref_paths:
+        if os.path.exists(path):
+            chosen_ref_filename = path
+            break
+
+    if not chosen_ref_filename:
+        tried_paths = ",".join(["'" + p + "'" for p in full_ref_paths])
+        ksft.print_msg(f"No matching reference file found (tried {tried_paths})")
+        ksft.exit_fail()
+
+    return chosen_ref_filename
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+    "--reference-dir", default=".", help="Directory containing the reference files"
+)
+parser.add_argument("--generate-reference", action="store_true", help="Generate a reference file with the devices on the running system")
+args = parser.parse_args()
+
+if args.generate_reference:
+    print(f"# Kernel version: {os.uname().release}")
+    print(yaml.dump(generate_devs_obj()))
+    sys.exit(0)
+
+ksft.print_header()
+
+ref_filename = get_ref_filename(args.reference_dir)
+
+run_test(ref_filename)
+
+ksft.finished()