[0/4] testing/next (aarch64 virt gpu tests)

Message ID 20250219150009.1662688-1-alex.bennee@linaro.org (mailing list archive)

Message

Alex Bennée Feb. 19, 2025, 3 p.m. UTC
Hi,

As I was looking at the native context patches I realised our existing
GPU testing is a little sparse. I took the opportunity to split the
test from the main virt test and then extend it to exercise the 3
current display modes (virgl, virgl+blobs, vulkan).

I've added some additional validation to ensure we have the devices we
expect before we start. It doesn't currently address the reported
clang issues but hopefully it will help narrow down what fails and
what works.

Once I've built some new buildroot images I'll re-spin with a whole
bunch of additional test binaries available.

Alex.

Alex Bennée (4):
  tests/functional: move aarch64 GPU test into own file
  tests/functional: factor out common code in gpu test
  tests/functional: ensure we have a GPU device for tests
  tests/functional: expand tests to cover virgl

 tests/functional/meson.build              |   2 +
 tests/functional/test_aarch64_virt.py     |  71 -------------
 tests/functional/test_aarch64_virt_gpu.py | 123 ++++++++++++++++++++++
 3 files changed, 125 insertions(+), 71 deletions(-)
 create mode 100755 tests/functional/test_aarch64_virt_gpu.py

Comments

Peter Maydell Feb. 20, 2025, 11:29 a.m. UTC | #1
On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Hi,
>
> As I was looking at the native context patches I realised our existing
> GPU testing is a little sparse. I took the opportunity to split the
> test from the main virt test and then extend it to exercise the 3
> current display modes (virgl, virgl+blobs, vulkan).
>
> I've added some additional validation to ensure we have the devices we
> expect before we start. It doesn't currently address the reported
> clang issues but hopefully it will help narrow down what fails and
> what works.

Running on my setup with a clang sanitizer build I get

ok 1 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_blobs_gpu
ok 2 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_gpu

and then the third test timed out.

For the timing out case, the console prints


2025-02-20 11:12:55,208: # weston -B headless --renderer gl --shell
kiosk -- vkmark -b:duration=1.0
2025-02-20 11:12:55,288: Date: 2025-02-20 UTC
2025-02-20 11:12:55,288: [11:12:54.841] weston 14.0.0
2025-02-20 11:12:55,289: https://wayland.freedesktop.org
2025-02-20 11:12:55,289: Bug reports to:
https://gitlab.freedesktop.org/wayland/weston/issues/
2025-02-20 11:12:55,289: Build: 14.0.0
2025-02-20 11:12:55,291: [11:12:54.847] Command line: weston -B
headless --renderer gl --shell kiosk -- vkmark -b:duration=1.0
2025-02-20 11:12:55,298: [11:12:54.850] OS: Linux, 6.11.10, #2 SMP Thu
Dec  5 16:27:12 GMT 2024, aarch64
2025-02-20 11:12:55,299: [11:12:54.855] Flight recorder: enabled
2025-02-20 11:12:55,300: [11:12:54.857] warning: XDG_RUNTIME_DIR
"/tmp" is not configured
2025-02-20 11:12:55,301: correctly.  Unix access mode must be 0700
(current mode is 0777),
2025-02-20 11:12:55,301: and must be owned by the user UID 0 (current
owner is UID 0).
2025-02-20 11:12:55,302: Refer to your distribution on how to get it, or
2025-02-20 11:12:55,302:
http://www.freedesktop.org/wiki/Specifications/basedir-spec
2025-02-20 11:12:55,302: on how to implement it.
2025-02-20 11:12:55,308: [11:12:54.865] Starting with no config file.
2025-02-20 11:12:55,322: [11:12:54.879] Output repaint window is 7 ms maximum.
2025-02-20 11:12:55,333: [11:12:54.890] Loading module
'/usr/lib/libweston-14/headless-backend.so'
2025-02-20 11:12:55,407: [11:12:54.963] Loading module
'/usr/lib/libweston-14/gl-renderer.so'
2025-02-20 11:13:06,936: [11:13:06.491] Using rendering device:
/dev/dri/renderD128
2025-02-20 11:13:07,083: [11:13:06.640] EGL version: 1.5
2025-02-20 11:13:07,084: [11:13:06.641] EGL vendor: Mesa Project
2025-02-20 11:13:07,085: [11:13:06.641] EGL client APIs: OpenGL OpenGL_ES
2025-02-20 11:13:07,088: [11:13:06.645] EGL features:
2025-02-20 11:13:07,089: EGL Wayland extension: yes
2025-02-20 11:13:07,089: context priority: no
2025-02-20 11:13:07,089: buffer age: no
2025-02-20 11:13:07,089: partial update: no
2025-02-20 11:13:07,090: swap buffers with damage: no
2025-02-20 11:13:07,090: configless context: yes
2025-02-20 11:13:07,090: surfaceless context: yes
2025-02-20 11:13:07,090: dmabuf support: modifiers
2025-02-20 11:13:07,206: [11:13:06.763] GL version: OpenGL ES 3.2 Mesa 24.3.0
2025-02-20 11:13:07,207: [11:13:06.764] GLSL version: OpenGL ES GLSL ES 3.20
2025-02-20 11:13:07,207: [11:13:06.764] GL vendor: Mesa
2025-02-20 11:13:07,208: [11:13:06.764] GL renderer: virgl (Quadro
P400/PCIe/SSE2)
2025-02-20 11:13:08,201: [11:13:07.757] GL ES 3.2 - renderer features:
2025-02-20 11:13:08,202: read-back format: ARGB8888
2025-02-20 11:13:08,203: glReadPixels supports y-flip: yes
2025-02-20 11:13:08,203: glReadPixels supports PBO: yes
2025-02-20 11:13:08,204: wl_shm 10 bpc formats: yes
2025-02-20 11:13:08,204: wl_shm 16 bpc formats: yes
2025-02-20 11:13:08,205: wl_shm half-float formats: yes
2025-02-20 11:13:08,206: internal R and RG formats: yes
2025-02-20 11:13:08,209: OES_EGL_image_external: yes
2025-02-20 11:13:08,210: [11:13:07.767] Using GL renderer
2025-02-20 11:13:08,211: [11:13:07.768] Registered plugin API
'weston_windowed_output_api_headless_v2' of size 16
2025-02-20 11:13:08,215: [11:13:07.772] Color manager: no-op
2025-02-20 11:13:08,216: protocol support: no
2025-02-20 11:13:08,226: [11:13:07.782] Output 'headless' attempts
EOTF mode SDR and colorimetry mode default.
2025-02-20 11:13:08,227: [11:13:07.784] Output 'headless' using color
profile: stock sRGB color profile

and that's the last thing it outputs.

The sanitizer reports that when the framework sends the SIGTERM
because of the timeout we get a write to a NULL pointer (but
interestingly not this time in an atexit callback):

UndefinedBehaviorSanitizer:DEADLYSIGNAL
==471863==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address
0x000000000000 (pc 0x7a18ceaafe80 bp 0x000000000000 sp 0x7ffe8e3ff6d0
T471863)
==471863==The signal is caused by a WRITE memory access.
==471863==Hint: address points to the zero page.
    #0 0x7a18ceaafe80
(/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x16afe80)
(BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #1 0x7a18ce9e72c0
(/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x15e72c0)
(BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #2 0x7a18ce9f11bb
(/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x15f11bb)
(BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #3 0x7a18ce6dc9d1
(/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x12dc9d1)
(BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #4 0x7a18e7d15326 in vrend_renderer_create_fence
/usr/src/virglrenderer-1.0.0-1ubuntu2/obj-x86_64-linux-gnu/../src/vrend_renderer.c:10883:26
    #5 0x55bfb6621871 in virtio_gpu_virgl_process_cmd
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../hw/display/virtio-gpu-virgl.c:973:5
    #6 0x55bfb66086ba in virtio_gpu_process_cmdq
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../hw/display/virtio-gpu.c:1048:9
    #7 0x55bfb661b69b in virtio_gpu_gl_handle_ctrl
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../hw/display/virtio-gpu-gl.c:100:5
    #8 0x55bfb74a7782 in aio_bh_call
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/async.c:172:5
    #9 0x55bfb74a7b58 in aio_bh_poll
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/async.c:219:13
    #10 0x55bfb74625ea in aio_dispatch
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/aio-posix.c:424:5
    #11 0x55bfb74aaaea in aio_ctx_dispatch
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/async.c:361:5
    #12 0x7a18e8dc15b4 in g_main_dispatch
/usr/src/glib2.0-2.80.0-6ubuntu3.2/debian/build/deb/../../../glib/gmain.c:3344:28
    #13 0x7a18e8dc16ff in g_main_context_dispatch_unlocked
/usr/src/glib2.0-2.80.0-6ubuntu3.2/debian/build/deb/../../../glib/gmain.c:4152:7
    #14 0x7a18e8dc16ff in g_main_context_dispatch
/usr/src/glib2.0-2.80.0-6ubuntu3.2/debian/build/deb/../../../glib/gmain.c:4140:3
    #15 0x55bfb74ab96b in glib_pollfds_poll
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/main-loop.c:287:9
    #16 0x55bfb74ab96b in os_host_main_loop_wait
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/main-loop.c:310:5
    #17 0x55bfb74ab96b in main_loop_wait
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/main-loop.c:589:11
    #18 0x55bfb64799e6 in qemu_main_loop
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../system/runstate.c:835:9
    #19 0x55bfb7340356 in qemu_default_main
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../system/main.c:48:14
    #20 0x55bfb734032e in main
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../system/main.c:76:9
    #21 0x7a18e6a2a1c9 in __libc_start_call_main
csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #22 0x7a18e6a2a28a in __libc_start_main csu/../csu/libc-start.c:360:3
    #23 0x55bfb59b6554 in _start
(/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/qemu-system-aarch64+0x15dd554)
(BuildId: df0d680785eeda685de951dbbbbd220f5c9e773d)

UndefinedBehaviorSanitizer can not provide additional info.
SUMMARY: UndefinedBehaviorSanitizer: SEGV
(/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x16afe80)
(BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
==471863==ABORTING



-- PMM
Peter Maydell Feb. 20, 2025, 1:37 p.m. UTC | #2
On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Hi,
>
> As I was looking at the native context patches I realised our existing
> GPU testing is a little sparse. I took the opportunity to split the
> test from the main virt test and then extend it to exercise the 3
> current display modes (virgl, virgl+blobs, vulkan).
>
> I've added some additional validation to ensure we have the devices we
> expect before we start. It doesn't currently address the reported
> clang issues but hopefully it will help narrow down what fails and
> what works.
>
> Once I've built some new buildroot images I'll re-spin with a whole
> bunch of additional test binaries available.

Running on a non-sanitizer debug build, I found that
aarch64_virt_with_virgl_gpu hit the timeout. Looking at the
output the last thing printed is
2025-02-20 11:46:36,864: [shadow] <default>: FPS: 45 FrameTime: 22.585 ms
That timestamp is 4 minutes into the test run, and the same
[shadow] test takes over 2 minutes on the with_virgl_blobs_gpu
test, so it looks like it just hit the 360s timeout and might
well have finished OK if it had been allowed to keep running.

Actually I'm surprised the other one didn't hit a timeout,
because its log timestamps show it running from 11:35:03,896
to 11:42:26,468 which is definitely more than 360s.
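
As a quick check of the arithmetic above, here is a minimal sketch that computes the elapsed time between the two quoted log timestamps and compares it with the 360s timeout. The helper name and the assumption that both timestamps fall on the same day are mine, not part of the test framework.

```python
from datetime import datetime

# Time-of-day format of the log timestamps quoted above (HH:MM:SS,mmm).
LOG_TS_FORMAT = "%H:%M:%S,%f"

# Timeout mentioned in this thread.
TIMEOUT_S = 360


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two log timestamps (assumed same day)."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


run_time = elapsed_seconds("11:35:03,896", "11:42:26,468")
print(f"{run_time:.3f}s elapsed, {run_time - TIMEOUT_S:.3f}s over the timeout")
```

The difference is 7 minutes 22.572 seconds, i.e. 442.572s, so that run went about 82s past the nominal timeout.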

Is there a less time-intensive test of the virgl code
we can use? check-functional already has way too many
tests that take minutes to run...

-- PMM
Peter Maydell Feb. 20, 2025, 1:47 p.m. UTC | #3
On Thu, 20 Feb 2025 at 11:29, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > Hi,
> >
> > As I was looking at the native context patches I realised our existing
> > GPU testing is a little sparse. I took the opportunity to split the
> > test from the main virt test and then extend it to exercise the 3
> > current display modes (virgl, virgl+blobs, vulkan).
> >
> > I've added some additional validation to ensure we have the devices we
> > expect before we start. It doesn't currently address the reported
> > clang issues but hopefully it will help narrow down what fails and
> > what works.
>
> Running on my setup with a clang sanitizer build I get
>
> ok 1 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_blobs_gpu
> ok 2 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_gpu
>
> and then the third test timed out.

vulkaninfo --summary as requested on irc:


==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.275


Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_headless_surface                : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 4
--------------------------
VK_LAYER_INTEL_nullhw       INTEL NULL HW                1.1.73   version 1
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211  version 1
VK_LAYER_MESA_overlay       Mesa Overlay layer           1.3.211  version 1
VK_LAYER_NV_optimus         NVIDIA Optimus layer         1.3.242  version 1

Devices:
========
GPU0:
        apiVersion         = 1.3.242
        driverVersion      = 535.183.1.0
        vendorID           = 0x10de
        deviceID           = 0x1cb3
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = Quadro P400
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 535.183.01
        conformanceVersion = 1.3.5.0
        deviceUUID         = 0a44d8af-913b-892f-1603-e76ce29ac9b5
        driverUUID         = 526ab2c8-1f4a-5dd0-9559-81dab18f1e08
GPU1:
        apiVersion         = 1.3.289
        driverVersion      = 0.0.1
        vendorID           = 0x10005
        deviceID           = 0x0000
        deviceType         = PHYSICAL_DEVICE_TYPE_CPU
        deviceName         = llvmpipe (LLVM 19.1.1, 256 bits)
        driverID           = DRIVER_ID_MESA_LLVMPIPE
        driverName         = llvmpipe
        driverInfo         = Mesa 24.2.8-1ubuntu1~24.04.1 (LLVM 19.1.1)
        conformanceVersion = 1.3.1.1
        deviceUUID         = 6d657361-3234-2e32-2e38-2d3175627500
        driverUUID         = 6c6c766d-7069-7065-5555-494400000000

-- PMM
Alex Bennée Feb. 20, 2025, 3:47 p.m. UTC | #4
Peter Maydell <peter.maydell@linaro.org> writes:

> On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Hi,
>>
>> As I was looking at the native context patches I realised our existing
>> GPU testing is a little sparse. I took the opportunity to split the
>> test from the main virt test and then extend it to exercise the 3
>> current display modes (virgl, virgl+blobs, vulkan).
>>
>> I've added some additional validation to ensure we have the devices we
>> expect before we start. It doesn't currently address the reported
>> clang issues but hopefully it will help narrow down what fails and
>> what works.
>>
>> Once I've built some new buildroot images I'll re-spin with a whole
>> bunch of additional test binaries available.
>
> Running on a non-sanitizer debug build, I found that
> aarch64_virt_with_virgl_gpu hit the timeout. Looking at the
> output the last thing printed is
> 2025-02-20 11:46:36,864: [shadow] <default>: FPS: 45 FrameTime: 22.585 ms
> That timestamp is 4 minutes into the test run, and the same
> [shadow] test takes over 2 minutes on the with_virgl_blobs_gpu
> test, so it looks like it just hit the 360s timeout and might
> well have finished OK if it had been allowed to keep running.

On my system it takes ~43s to run the plain virgl_gpu test: about 2.5s
to boot the kernel and set up the env, and approx 40s to run through each
test. The -b:duration=1.0 limits each of the 33 scenes to 1s of runtime.
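
To make the timing budget concrete, here is a rough sketch using only the figures quoted above (2.5s boot and setup, 33 scenes, 1s per scene). The function names are hypothetical and the model assumes every scene costs the same, which is a simplification.

```python
BOOT_AND_SETUP_S = 2.5   # kernel boot plus environment setup, as quoted
SCENES = 33              # number of scenes, as quoted
SCENE_DURATION_S = 1.0   # from -b:duration=1.0


def expected_runtime(per_scene_overhead_s: float = 0.0) -> float:
    """Total runtime if each scene adds a fixed overhead to its 1s limit."""
    return BOOT_AND_SETUP_S + SCENES * (SCENE_DURATION_S + per_scene_overhead_s)


def per_scene_time(total_runtime_s: float) -> float:
    """Average wall-clock time per scene implied by a total runtime."""
    return (total_runtime_s - BOOT_AND_SETUP_S) / SCENES


print(f"lower bound: {expected_runtime():.1f}s")
print(f"per scene at ~43s total: {per_scene_time(43.0):.2f}s")
print(f"per scene at the 360s timeout: {per_scene_time(360.0):.2f}s")
```

Under these assumptions a ~43s run means roughly 1.2s per scene, whereas a run that reaches the 360s timeout would need to average well over 10s per scene.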

I'm guessing something in your setup is stalling each scene, so that
instead of finishing at its 1s time limit it takes longer than that to recover.

> Actually I'm surprised the other one didn't hit a timeout,
> because its log timestamps show it running from 11:35:03,896
> to 11:42:26,468 which is definitely more than 360s.
>
> Is there a less time-intensive test of the virgl code
> we can use? check-functional already has way too many
> tests that take minutes to run...

I am building a newer rootfs with more testing tools on it so we could
preface with simpler tests and bail early if say the drm device node
can't be seen.
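
A bail-early check of that sort could look roughly like the sketch below. It is only an illustration: the function names are hypothetical, it inspects a local directory, and in the real functional test the check would have to run inside the guest rather than on the host.

```python
import glob
import os


def find_render_nodes(dev_root: str = "/dev/dri") -> list[str]:
    """Return the sorted DRM render node paths (renderD*) under dev_root."""
    return sorted(glob.glob(os.path.join(dev_root, "renderD*")))


def should_bail_early(dev_root: str = "/dev/dri") -> bool:
    """True when no render node is visible, so the GPU benchmark can be skipped."""
    return not find_render_nodes(dev_root)


if should_bail_early():
    print("no DRM render node found, skipping the GPU benchmark")
else:
    print("render nodes:", ", ".join(find_render_nodes()))
```

The weston log earlier in the thread shows /dev/dri/renderD128 as the rendering device, which is the kind of node this would look for.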

That said I worked quite hard on keeping the runtime below 60s and the
benefit of the glmark/vkmark tests is they run through a number of
different scenarios so hopefully exercise a range of the API. It also
has the benefit of easily detecting the end from stdout whereas the simpler
tests tend to draw a triangle and then loop forever until you hit
Ctrl-C.
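
For illustration, detecting the end from stdout could be sketched as below. The per-scene pattern is taken from the "[shadow] <default>: FPS: 45 FrameTime: 22.585 ms" line quoted earlier; the final "Score:" line format is my assumption about the benchmark output, and the second sample line is made up for the example.

```python
import re

# Per-scene result line, as seen in the console log quoted in this thread.
SCENE_RE = re.compile(
    r"\[(?P<scene>[\w-]+)\] (?P<opts>.*?): "
    r"FPS: (?P<fps>\d+) FrameTime: (?P<ms>[\d.]+) ms"
)
# Assumed format of the final summary line that marks the end of the run.
SCORE_RE = re.compile(r"(?:glmark2|vkmark) Score: (?P<score>\d+)")


def scan_console(lines):
    """Collect (scene, fps, frame_ms) tuples until a score line is seen.

    Returns (scenes, score); score is None if the run never finished.
    """
    scenes = []
    for line in lines:
        m = SCENE_RE.search(line)
        if m:
            scenes.append((m["scene"], int(m["fps"]), float(m["ms"])))
            continue
        m = SCORE_RE.search(line)
        if m:
            return scenes, int(m["score"])
    return scenes, None


sample = [
    "2025-02-20 11:46:36,864: [shadow] <default>: FPS: 45 FrameTime: 22.585 ms",
    "2025-02-20 11:46:40,000: glmark2 Score: 45",  # hypothetical line
]
scenes, score = scan_console(sample)
print(scenes, score)
```

A score of None together with the last parsed scene name would show where a timed-out run got stuck.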


>
> -- PMM