Message ID | 20221028145319.1.I87b119c576d486ad139faf1b7278d97e158aabe4@changeid
---|---
State | New, archived
Series | drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
On Fri, Oct 28, 2022 at 5:24 PM Patchwork <patchwork@emeril.freedesktop.org> wrote:
>
> Patch Details
> Series: drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
> URL: https://patchwork.freedesktop.org/series/110277/
> State: failure
> Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
>
> CI Bug Log - changes from CI_DRM_12317 -> Patchwork_110277v1
>
> Summary
>
> FAILURE
>
> Serious unknown changes coming with Patchwork_110277v1 absolutely need to be
> verified manually.
>
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_110277v1, please notify your bug team to allow them
> to document this new failure mode, which will reduce false positives in CI.
>
> External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html

For the record, I have almost zero idea what to do with this. From what I
can tell, most (all?) of these failures are flaky(?) already and are
probably not related to my change.

But if someone believes this to actually be a blocking issue with my
patch, let me know, and I can see if I can figure out a better answer
than that.

Thanks,
Brian
On Tue, 1 Nov 2022 at 21:58, Brian Norris <briannorris@chromium.org> wrote:
>
> On Fri, Oct 28, 2022 at 5:24 PM Patchwork
> <patchwork@emeril.freedesktop.org> wrote:
> >
> > Patch Details
> > Series: drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
> > URL: https://patchwork.freedesktop.org/series/110277/
> > State: failure
> > Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> > [...]
>
> For the record, I have almost zero idea what to do with this. From
> what I can tell, most (all?) of these failures are flaky(?) already
> and are probably not related to my change.

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html

According to that link, this change appears to break every platform
when running the live selftests (looking at the purple squares).
Running the selftests normally involves loading and unloading the
module. Looking at the logs there is scary stuff like:

<3> [371.938060] INFO: task kworker/u12:1:46 blocked for more than 61 seconds.
<3> [371.944992] Tainted: G U W 6.1.0-rc2-Patchwork_110277v1-g0f9cbc285f7d+ #1
<3> [371.953546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6> [371.961428] task:kworker/u12:1 state:D stack:11512 pid:46 ppid:2 flags:0x00004000
<6> [371.961433] Workqueue: events_unbound async_run_entry_fn
<6> [371.961452] Call Trace:
<6> [371.961454]  <TASK>
<6> [371.961457]  __schedule+0x30e/0xa70
<6> [371.961481]  ? usleep_range_state+0xb0/0xb0
<6> [371.961484]  ? __wait_for_common+0xd3/0x1d0
<6> [371.961487]  schedule+0x4e/0xd0
<6> [371.961489]  schedule_timeout+0x237/0x2e0
<6> [371.961493]  ? usleep_range_state+0xb0/0xb0
<6> [371.961495]  ? __wait_for_common+0xd3/0x1d0
<6> [371.961497]  ? mark_held_locks+0x48/0x80
<6> [371.961500]  ? _raw_spin_unlock_irq+0x1f/0x50
<6> [371.961504]  ? usleep_range_state+0xb0/0xb0
<6> [371.961506]  __wait_for_common+0xf5/0x1d0
<6> [371.961526]  __flush_work+0x2c4/0x4e0
<6> [371.961530]  ? flush_workqueue_prep_pwqs+0x120/0x120
<6> [371.961553]  ? get_work_pool+0x90/0x90
<6> [371.961556]  __cancel_work_timer+0x14e/0x1f0
<6> [371.961559]  ? _raw_spin_unlock_irqrestore+0x54/0x70
<6> [371.961562]  ? lockdep_hardirqs_on+0xbf/0x140
<6> [371.961567]  intel_display_driver_unregister+0x27/0x40 [i915]
<6> [371.961722]  i915_driver_remove+0x50/0x100 [i915]
<6> [371.961862]  i915_pci_probe+0x123/0x240 [i915]
<6> [371.961996]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
<6> [371.962001]  pci_device_probe+0x95/0x110
<6> [371.962005]  really_probe+0xd6/0x350
<6> [371.962008]  ? pm_runtime_barrier+0x4b/0xa0
<6> [371.962013]  __driver_probe_device+0x73/0x170
<6> [371.962016]  driver_probe_device+0x1a/0x90
<6> [371.962019]  __driver_attach_async_helper+0x31/0x80
<6> [371.962023]  async_run_entry_fn+0x28/0x130
<6> [371.962026]  process_one_work+0x272/0x5b0
<6> [371.962032]  worker_thread+0x37/0x370
<6> [371.962036]  ? process_one_work+0x5b0/0x5b0
<6> [371.962038]  kthread+0xed/0x120
<6> [371.962040]  ? kthread_complete_and_exit+0x20/0x20
<6> [371.962044]  ret_from_fork+0x1f/0x30
<6> [371.962053]  </TASK>
<3> [371.962055] INFO: task kworker/3:1:57 blocked for more than 61 seconds.
<3> [371.968696] Tainted: G U W 6.1.0-rc2-Patchwork_110277v1-g0f9cbc285f7d+ #1
<3> [371.977219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6> [371.985049] task:kworker/3:1 state:D stack:12488 pid:57 ppid:2 flags:0x00004000
<6> [371.985053] Workqueue: events output_poll_execute [drm_kms_helper]
<6> [371.985080] Call Trace:
<6> [371.985081]  <TASK>
<6> [371.985102]  __schedule+0x30e/0xa70
<6> [371.985108]  schedule+0x4e/0xd0
<6> [371.985110]  async_synchronize_cookie_domain+0xe7/0x120
<6> [371.985113]  ? finish_swait+0x90/0x90
<6> [371.985117]  intel_fbdev_output_poll_changed+0x82/0x90 [i915]
<6> [371.985281]  drm_kms_helper_hotplug_event+0x1e/0x30 [drm_kms_helper]
<6> [371.985291]  output_poll_execute+0x1f5/0x200 [drm_kms_helper]
<6> [371.985303]  process_one_work+0x272/0x5b0
<6> [371.985310]  worker_thread+0x37/0x370
<6> [371.985313]  ? process_one_work+0x5b0/0x5b0
<6> [371.985316]  kthread+0xed/0x120
<6> [371.985317]  ? kthread_complete_and_exit+0x20/0x20
<6> [371.985321]  ret_from_fork+0x1f/0x30
<6> [371.985329]  </TASK>
<3> [371.985348] INFO: task i915_selftest:6017 blocked for more than 61 seconds.

> But if someone believes this to actually be a blocking issue with my
> patch, let me know, and I can see if I can figure out a better answer
> than that.
>
> Thanks,
> Brian
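Reading those two stacks together: the async probe worker (here apparently
unwinding a failed probe via i915_driver_remove()) does a synchronous cancel
of the output-poll work, while that very work item is blocked waiting for
all async probing to complete. Below is a minimal sketch of that generic
wait cycle; the names and structure are illustrative only (the real paths
are intel_display_driver_unregister() and intel_fbdev_output_poll_changed(),
and initialization/scheduling of the work is elided):

#include <linux/async.h>
#include <linux/workqueue.h>

/* Stand-in for drm_kms_helper's output poll work (init elided). */
static struct delayed_work poll_work;

static void poll_fn(struct work_struct *w)
{
	/*
	 * Waits for all pending async tasks -- including the async probe
	 * worker, which is itself waiting (below) for this function to
	 * return. Neither side can make progress.
	 */
	async_synchronize_full();
}

/* Teardown path, here reached from inside the async probe worker. */
static void teardown(void)
{
	/* Blocks until poll_fn() finishes: the second half of the cycle. */
	cancel_delayed_work_sync(&poll_work);
}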
On Wed, Nov 02, 2022 at 12:18:37PM +0000, Matthew Auld wrote:
> On Tue, 1 Nov 2022 at 21:58, Brian Norris <briannorris@chromium.org> wrote:
> >
> > [...]
> >
> > For the record, I have almost zero idea what to do with this. From
> > what I can tell, most (all?) of these failures are flaky(?) already
> > and are probably not related to my change.
>
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
>
> According to that link, this change appears to break every platform
> when running the live selftests (looking at the purple squares).
> Running the selftests normally involves loading and unloading the
> module. Looking at the logs there is scary stuff like:
> [...]

Ah, thanks. I'm not sure what made me think the tests were failing the
same way on drm-tip, but maybe just chalk that up to my unfamiliarity
with this particular dashboard... (There are a few isolated failures
and/or flakes on drm-tip, but they don't look like this.)

Anyway, I think I managed to run some of these tests on my own platform
[1], and I don't reproduce those failures. I do see other failures
(crashes) though, like in i915_gem_mman_live_selftests/igt_mmap, where
igt_mmap_offset() (selftest-only code) -> vm_mmap() assumes we have a
valid |current->mm|. But that's borrowing the modprobe process's memory
map, and with async probe, the selftest sequence happens in a kernel
worker instead (and current->mm is NULL). So that clearly won't work.

I suppose I could disable async probe when built as a module (I believe
async probe doesn't really have any value there, since the module load
task just waits for the async task anyway). I'm not familiar enough with
MM to know what the vm_mmap() alternatives are, but this particular bit
of code does feel odd.

Additionally, I think this implies that live_selftests will break if
i915 is built in (i.e., =y, not =m), as we'll again run in a
kernel-thread context at boot time. But I would hope nobody is trying to
run them that way? I guess this gets even hairier, because even if the
driver is built into the kernel, it's possible to kick them off from a
process context by tweaking the module parameters later and then
re-binding the device... So all in all, this bug leaves an ugly
situation, with or without my patch.

I'm still curious about the reported failures, but maybe they require
some particular sequence of tests? I also don't have the full
igt-gpu-tools suite running, so maybe it does something a little
differently than my steps in [1]?

Brian

[1] I have a GLK system, if it matters. I figured I can run some of
these with any one of the following:

modprobe i915 live_selftests=1
modprobe i915 live_selftests=1 igt__20__live_workarounds=Y
modprobe i915 live_selftests=1 igt__19__live_uncore=Y
modprobe i915 live_selftests=1 igt__18__live_sanitycheck=Y
...
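To make the igt_mmap_offset() failure mode above concrete: vm_mmap()
installs a mapping in current->mm, which only exists for a real process.
A minimal sketch follows; the helper name, flags, and error handling are
illustrative assumptions, not the actual selftest code:

#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/sched.h>

/*
 * Illustrative sketch, not the real igt_mmap_offset(). With synchronous
 * probe, "current" is the modprobe process, so vm_mmap() has an address
 * space to map into. With async probe, the selftest runs in a kworker
 * whose current->mm is NULL, so there is nothing to borrow.
 */
static unsigned long selftest_mmap_sketch(struct file *file, unsigned long len)
{
	if (!current->mm)	/* kernel thread: no user address space */
		return -EINVAL;

	return vm_mmap(file, 0, len, PROT_READ | PROT_WRITE, MAP_SHARED, 0);
}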
On Thu, 3 Nov 2022 at 00:14, Brian Norris <briannorris@chromium.org> wrote:
>
> [...]
>
> I'm still curious about the reported failures, but maybe they require
> some particular sequence of tests? I also don't have the full
> igt-gpu-tools suite running, so maybe it does something a little
> differently than my steps in [1]?
>
> Brian
>
> [1] I have a GLK system, if it matters. I figured I can run some of
> these with any one of the following:
>
> modprobe i915 live_selftests=1
> modprobe i915 live_selftests=1 igt__20__live_workarounds=Y
> modprobe i915 live_selftests=1 igt__19__live_uncore=Y
> modprobe i915 live_selftests=1 igt__18__live_sanitycheck=Y
> ...

CI should be using the IGT wrapper to run them, AFAIK. So something like:

./build/tests/i915_selftest

Or to just run the live, mock or perf selftests:

./build/tests/i915_selftest --run-subtest live
./build/tests/i915_selftest --run-subtest mock
./build/tests/i915_selftest --run-subtest perf

Or if you want to run some particular selftest, like the live mman tests:

./build/tests/i915_selftest --run-subtest live --dyn mman
On Thu, 3 Nov 2022 at 00:14, Brian Norris <briannorris@chromium.org> wrote:
>
> [...]
>
> Anyway, I think I managed to run some of these tests on my own platform
> [1], and I don't reproduce those failures. I do see other failures
> (crashes) though, like in i915_gem_mman_live_selftests/igt_mmap, where
> igt_mmap_offset() (selftest-only code) -> vm_mmap() assumes we have a
> valid |current->mm|. But that's borrowing the modprobe process's memory
> map, and with async probe, the selftest sequence happens in a kernel
> worker instead (and current->mm is NULL). So that clearly won't work.

Semi related:
https://lore.kernel.org/intel-gfx/20221104134703.3770b371@maurocar-mobl2/T/#m888972bb1ffb0a913e3db8b4099dffdc2ec7a0dc

Sounds like a similar issue when trying to convert the live selftests
over to kunit.

> I suppose I could disable async probe when built as a module (I believe
> async probe doesn't really have any value there, since the module load
> task just waits for the async task anyway). I'm not familiar enough with
> MM to know what the vm_mmap() alternatives are, but this particular bit
> of code does feel odd.
>
> Additionally, I think this implies that live_selftests will break if
> i915 is built in (i.e., =y, not =m), as we'll again run in a
> kernel-thread context at boot time. But I would hope nobody is trying to
> run them that way? I guess this gets even hairier, because even if the
> driver is built into the kernel, it's possible to kick them off from a
> process context by tweaking the module parameters later and then
> re-binding the device...
> [...]
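Brian's idea of disabling async probe for the modular build, quoted just
above, could look something like the sketch below. This is an illustration
of the suggestion rather than a patch posted in this thread, and whether
PROBE_FORCE_SYNCHRONOUS or PROBE_DEFAULT_STRATEGY is the better choice for
the module case is left open:

static struct pci_driver i915_pci_driver = {
	.name = DRIVER_NAME,
	.id_table = pciidlist,
	.probe = i915_pci_probe,
	.remove = i915_pci_remove,
	.shutdown = i915_pci_shutdown,
	.driver = {
		.pm = &i915_pm_ops,
#ifdef MODULE
		/*
		 * Module load waits for outstanding async probes anyway,
		 * and the live selftests need the modprobe task's mm, so
		 * async probing buys nothing for i915.ko.
		 */
		.probe_type = PROBE_FORCE_SYNCHRONOUS,
#else
		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
#endif
	},
};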
On Fri, Nov 04, 2022 at 02:38:03PM +0000, Matthew Auld wrote:
> On Thu, 3 Nov 2022 at 00:14, Brian Norris <briannorris@chromium.org> wrote:
> > I'm still curious about the reported failures, but maybe they require
> > some particular sequence of tests? I also don't have the full
> > igt-gpu-tools suite running, so maybe it does something a little
> > differently than my steps in [1]?
> >
> > Brian
> >
> > [1] I have a GLK system, if it matters. I figured I can run some of
> > these with any one of the following:
> >
> > modprobe i915 live_selftests=1
> > modprobe i915 live_selftests=1 igt__20__live_workarounds=Y
> > modprobe i915 live_selftests=1 igt__19__live_uncore=Y
> > modprobe i915 live_selftests=1 igt__18__live_sanitycheck=Y
> > ...
>
> CI should be using the IGT wrapper to run them, AFAIK. So something like:
>
> ./build/tests/i915_selftest
>
> Or to just run the live, mock or perf selftests:
>
> ./build/tests/i915_selftest --run-subtest live
> ./build/tests/i915_selftest --run-subtest mock
> ./build/tests/i915_selftest --run-subtest perf
>
> Or if you want to run some particular selftest, like the live mman tests:
>
> ./build/tests/i915_selftest --run-subtest live --dyn mman

Thanks. I'm running through those now, and it seems like I'm doing
something closer to what the CI logs show [1], but I'm still not
reproducing on my GLK. (I've now managed to run it with drm-tip; still
no luck.)

So far, I've only managed to reproduce *different* known problems:
https://lore.kernel.org/all/Y2WfpLbX1SeDtk+7@google.com/

But after working around those, I run without any similar lockup
failures. I might poke around some more next week, but I've probably
spent more time than reasonable on this already.

Anyway, thanks for the help!

Regards,
Brian

[1] For one, I've run through a test list, in order, based on this:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/fi-glk-j4005/testlist0.txt
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 38460a0bd7cb..1cb1f87aea86 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1371,7 +1371,10 @@ static struct pci_driver i915_pci_driver = {
 	.probe = i915_pci_probe,
 	.remove = i915_pci_remove,
 	.shutdown = i915_pci_shutdown,
-	.driver.pm = &i915_pm_ops,
+	.driver = {
+		.pm = &i915_pm_ops,
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
+	},
 };
 
 int i915_pci_register_driver(void)
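For reference, .probe_type is a field of struct device_driver; its possible
values come from the kernel's include/linux/device/driver.h (the comments
here are condensed summaries, not the original kerneldoc):

enum probe_type {
	PROBE_DEFAULT_STRATEGY,		/* follow the kernel-wide default */
	PROBE_PREFER_ASYNCHRONOUS,	/* probe in a worker; don't serialize boot */
	PROBE_FORCE_SYNCHRONOUS,	/* always probe in the registering thread */
};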
This driver often takes a good 100ms to start, but in some particularly
bad cases takes more than 1 second.

In surveying risk for this driver, I poked around for cross-device
shared state, which can be a source of race conditions. GVT support
(intel_gvt_devices) seems potentially suspect, but it has an appropriate
mutex, and doesn't seem to care about ordering -- if devices are present
when the kvmgt module loads, they'll get picked up; and if they probe
later than kvmgt, they'll attach then.

Additionally, I see past discussions about this patch [1], which
concluded that there were problems at the time due to the way
hdac_i915.c called request_module("i915") and expected it to complete
probing [2]. Work has since been merged [3] to fix that problem.

This driver was pinpointed as part of a survey of drivers that take more
than 100ms in their initcall (i.e., are built in, and probing
synchronously) on a lab of ChromeOS systems.

[1] [RFC] i915: make the probe asynchronous
    https://lore.kernel.org/all/20180604053219.2040-1-feng.tang@intel.com/
[2] https://lore.kernel.org/all/s5hin4d1e3f.wl-tiwai@suse.de/
[3] Commit f9b54e1961c7 ("ALSA: hda/i915: Allow delayed i915 audio
    component binding")

Signed-off-by: Brian Norris <briannorris@chromium.org>
---
 drivers/gpu/drm/i915/i915_pci.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)