diff mbox series

drm/i915: Set PROBE_PREFER_ASYNCHRONOUS

Message ID 20221028145319.1.I87b119c576d486ad139faf1b7278d97e158aabe4@changeid (mailing list archive)
State New, archived
Headers show
Series drm/i915: Set PROBE_PREFER_ASYNCHRONOUS | expand

Commit Message

Brian Norris Oct. 28, 2022, 9:53 p.m. UTC
This driver often takes a good 100ms to start, but in some particularly
bad cases takes more than 1 second.

In surveying risk for this driver, I poked around for cross-device
shared state, which can be a source of race conditions. GVT support
(intel_gvt_devices) seems potentially suspect, but it has an appropriate
mutex, and doesn't seem to care about ordering -- if devices are present
when the kvmgt module loads, they'll get picked up; and if they probe
later than kvmgt, they'll attach then.

Additionally, I see past discussions about this patch [1], which
concluded that there were problems at the time due to the way
hdac_i915.c called request_module("i915") and expected it to complete
probing [2]. Work has since been merged [3] to fix that problem.

This driver was pinpointed as part of a survey of drivers that take more
than 100ms in their initcall (i.e., are built in, and probing
synchronously) on a lab of ChromeOS systems.

[1] [RFC] i915: make the probe asynchronous
    https://lore.kernel.org/all/20180604053219.2040-1-feng.tang@intel.com/

[2] https://lore.kernel.org/all/s5hin4d1e3f.wl-tiwai@suse.de/

[3] Commit f9b54e1961c7 ("ALSA: hda/i915: Allow delayed i915 audio
    component binding")

Signed-off-by: Brian Norris <briannorris@chromium.org>
---

 drivers/gpu/drm/i915/i915_pci.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Brian Norris Nov. 1, 2022, 9:58 p.m. UTC | #1
On Fri, Oct 28, 2022 at 5:24 PM Patchwork
<patchwork@emeril.freedesktop.org> wrote:
>
> Patch Details
> Series:drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
> URL:https://patchwork.freedesktop.org/series/110277/
> State:failure
> Details:https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
>
> CI Bug Log - changes from CI_DRM_12317 -> Patchwork_110277v1
>
> Summary
>
> FAILURE
>
> Serious unknown changes coming with Patchwork_110277v1 absolutely need to be
> verified manually.
>
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_110277v1, please notify your bug team to allow them
> to document this new failure mode, which will reduce false positives in CI.
>
> External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html

For the record, I have almost zero idea what to do with this. From
what I can tell, most (all?) of these failures are flaky(?) already
and are probably not related to my change.

But if someone believes to actually be a blocking issue with my patch,
let me know, and I can see if I can figure out a better answer than
that.

Thanks,
Brian
Matthew Auld Nov. 2, 2022, 12:18 p.m. UTC | #2
On Tue, 1 Nov 2022 at 21:58, Brian Norris <briannorris@chromium.org> wrote:
>
> On Fri, Oct 28, 2022 at 5:24 PM Patchwork
> <patchwork@emeril.freedesktop.org> wrote:
> >
> > Patch Details
> > Series:drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
> > URL:https://patchwork.freedesktop.org/series/110277/
> > State:failure
> > Details:https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> >
> > CI Bug Log - changes from CI_DRM_12317 -> Patchwork_110277v1
> >
> > Summary
> >
> > FAILURE
> >
> > Serious unknown changes coming with Patchwork_110277v1 absolutely need to be
> > verified manually.
> >
> > If you think the reported changes have nothing to do with the changes
> > introduced in Patchwork_110277v1, please notify your bug team to allow them
> > to document this new failure mode, which will reduce false positives in CI.
> >
> > External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
>
> For the record, I have almost zero idea what to do with this. From
> what I can tell, most (all?) of these failures are flaky(?) already
> and are probably not related to my change.

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html

According to that link, this change appears to break every platform
when running the live selftests (looking at the purple squares).
Running the selftests normally involves loading and unloading the
module. Looking at the logs there is scary stuff like:

<3> [371.938060] INFO: task kworker/u12:1:46 blocked for more than 61 seconds.
<3> [371.944992] Tainted: G U W 6.1.0-rc2-Patchwork_110277v1-g0f9cbc285f7d+ #1
<3> [371.953546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
<6> [371.961428] task:kworker/u12:1 state:D stack:11512 pid:46 ppid:2
flags:0x00004000
<6> [371.961433] Workqueue: events_unbound async_run_entry_fn
<6> [371.961452] Call Trace:
<6> [371.961454] <TASK>
<6> [371.961457] __schedule+0x30e/0xa70
<6> [371.961481] ? usleep_range_state+0xb0/0xb0
<6> [371.961484] ? __wait_for_common+0xd3/0x1d0
<6> [371.961487] schedule+0x4e/0xd0
<6> [371.961489] schedule_timeout+0x237/0x2e0
<6> [371.961493] ? usleep_range_state+0xb0/0xb0
<6> [371.961495] ? __wait_for_common+0xd3/0x1d0
<6> [371.961497] ? mark_held_locks+0x48/0x80
<6> [371.961500] ? _raw_spin_unlock_irq+0x1f/0x50
<6> [371.961504] ? usleep_range_state+0xb0/0xb0
<6> [371.961506] __wait_for_common+0xf5/0x1d0
<6> [371.961526] __flush_work+0x2c4/0x4e0
<6> [371.961530] ? flush_workqueue_prep_pwqs+0x120/0x120
<6> [371.961553] ? get_work_pool+0x90/0x90
<6> [371.961556] __cancel_work_timer+0x14e/0x1f0
<6> [371.961559] ? _raw_spin_unlock_irqrestore+0x54/0x70
<6> [371.961562] ? lockdep_hardirqs_on+0xbf/0x140
<6> [371.961567] intel_display_driver_unregister+0x27/0x40 [i915]
<6> [371.961722] i915_driver_remove+0x50/0x100 [i915]
<6> [371.961862] i915_pci_probe+0x123/0x240 [i915]
<6> [371.961996] ? _raw_spin_unlock_irqrestore+0x3d/0x70
<6> [371.962001] pci_device_probe+0x95/0x110
<6> [371.962005] really_probe+0xd6/0x350
<6> [371.962008] ? pm_runtime_barrier+0x4b/0xa0
<6> [371.962013] __driver_probe_device+0x73/0x170
<6> [371.962016] driver_probe_device+0x1a/0x90
<6> [371.962019] __driver_attach_async_helper+0x31/0x80
<6> [371.962023] async_run_entry_fn+0x28/0x130
<6> [371.962026] process_one_work+0x272/0x5b0
<6> [371.962032] worker_thread+0x37/0x370
<6> [371.962036] ? process_one_work+0x5b0/0x5b0
<6> [371.962038] kthread+0xed/0x120
<6> [371.962040] ? kthread_complete_and_exit+0x20/0x20
<6> [371.962044] ret_from_fork+0x1f/0x30
<6> [371.962053] </TASK>
<3> [371.962055] INFO: task kworker/3:1:57 blocked for more than 61 seconds.
<3> [371.968696] Tainted: G U W 6.1.0-rc2-Patchwork_110277v1-g0f9cbc285f7d+ #1
<3> [371.977219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
<6> [371.985049] task:kworker/3:1 state:D stack:12488 pid:57 ppid:2
flags:0x00004000
<6> [371.985053] Workqueue: events output_poll_execute [drm_kms_helper]
<6> [371.985080] Call Trace:
<6> [371.985081] <TASK>
<6> [371.985102] __schedule+0x30e/0xa70
<6> [371.985108] schedule+0x4e/0xd0
<6> [371.985110] async_synchronize_cookie_domain+0xe7/0x120
<6> [371.985113] ? finish_swait+0x90/0x90
<6> [371.985117] intel_fbdev_output_poll_changed+0x82/0x90 [i915]
<6> [371.985281] drm_kms_helper_hotplug_event+0x1e/0x30 [drm_kms_helper]
<6> [371.985291] output_poll_execute+0x1f5/0x200 [drm_kms_helper]
<6> [371.985303] process_one_work+0x272/0x5b0
<6> [371.985310] worker_thread+0x37/0x370
<6> [371.985313] ? process_one_work+0x5b0/0x5b0
<6> [371.985316] kthread+0xed/0x120
<6> [371.985317] ? kthread_complete_and_exit+0x20/0x20
<6> [371.985321] ret_from_fork+0x1f/0x30
<6> [371.985329] </TASK>
<3> [371.985348] INFO: task i915_selftest:6017 blocked for more than 61 seconds.


>
> But if someone believes to actually be a blocking issue with my patch,
> let me know, and I can see if I can figure out a better answer than
> that.
>
> Thanks,
> Brian
Brian Norris Nov. 3, 2022, 12:14 a.m. UTC | #3
On Wed, Nov 02, 2022 at 12:18:37PM +0000, Matthew Auld wrote:
> On Tue, 1 Nov 2022 at 21:58, Brian Norris <briannorris@chromium.org> wrote:
> >
> > On Fri, Oct 28, 2022 at 5:24 PM Patchwork
> > <patchwork@emeril.freedesktop.org> wrote:
> > >
> > > Patch Details
> > > Series:drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
> > > URL:https://patchwork.freedesktop.org/series/110277/
> > > State:failure
> > > Details:https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> > >
> > > CI Bug Log - changes from CI_DRM_12317 -> Patchwork_110277v1
> > >
> > > Summary
> > >
> > > FAILURE
> > >
> > > Serious unknown changes coming with Patchwork_110277v1 absolutely need to be
> > > verified manually.
> > >
> > > If you think the reported changes have nothing to do with the changes
> > > introduced in Patchwork_110277v1, please notify your bug team to allow them
> > > to document this new failure mode, which will reduce false positives in CI.
> > >
> > > External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> >
> > For the record, I have almost zero idea what to do with this. From
> > what I can tell, most (all?) of these failures are flaky(?) already
> > and are probably not related to my change.
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> 
> According to that link, this change appears to break every platform
> when running the live selftests (looking at the purple squares).
> Running the selftests normally involves loading and unloading the
> module. Looking at the logs there is scary stuff like:
> 
[...]

Ah, thanks. I'm not sure what made me think the tests were failing the
same way on drm-tip, but maybe just chalk that up to my unfamiliarity
with this particular dashboard... (There are a few isolated failure
and/or flakes on drm-tip, but they don't look like this.)

Anyway, I think I managed to run some of these tests on my own platforms
[1], and I don't reproduce those failures. I do see other failures
(crashes) though, like in i915_gem_mman_live_selftests/igt_mmap, where
igt_mmap_offset() (selftest-only code) -> vm_mmap() assumes we have a
valid |current->mm|. But that's borrowing the modprobe process's memory
map, and with async probe, the selftest sequence happens in a kernel
worker instead (and current->mm is NULL). So that clearly won't work.

I suppose I could disable async probe when built as a module (I believe
it doesn't really have any value, since the module load task just waits
for the async task anyway). I'm not familiar enough with MM to know what
the vm_mmap() alternatives are, but this particular bit of code does
feel odd.

Additionally, I think this implies that live_selftests will break if
i915 is built-in (i.e., =y, not =m), as we'll again run in a
kernel-thread context at boot time. But I would hope nobody is trying to
run them that way? I guess this gets even hairier, because even if the
driver is built into the kernel, it's possible to kick them off from a
process context by tweaking the module parameters later, and then
re-binding the device... So all in all, this bug leaves an ugly
situation, with or without my patch.

I'm still curious about the reported failures, but maybe they require
some particular sequence of tests? I also don't have the full
igt-gpu-tools set running, so maybe they do something a little
differently than my steps in [1]?

Brian

[1] I have a GLk system, if it matters. I figured I can run some of
these with any one of the following:

  modprobe i915 live_selftests=1
  modprobe i915 live_selftests=1 igt__20__live_workarounds=Y
  modprobe i915 live_selftests=1 igt__19__live_uncore=Y
  modprobe i915 live_selftests=1 igt__18__live_sanitycheck=Y
  ...
Matthew Auld Nov. 4, 2022, 2:38 p.m. UTC | #4
On Thu, 3 Nov 2022 at 00:14, Brian Norris <briannorris@chromium.org> wrote:
>
> On Wed, Nov 02, 2022 at 12:18:37PM +0000, Matthew Auld wrote:
> > On Tue, 1 Nov 2022 at 21:58, Brian Norris <briannorris@chromium.org> wrote:
> > >
> > > On Fri, Oct 28, 2022 at 5:24 PM Patchwork
> > > <patchwork@emeril.freedesktop.org> wrote:
> > > >
> > > > Patch Details
> > > > Series:drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
> > > > URL:https://patchwork.freedesktop.org/series/110277/
> > > > State:failure
> > > > Details:https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> > > >
> > > > CI Bug Log - changes from CI_DRM_12317 -> Patchwork_110277v1
> > > >
> > > > Summary
> > > >
> > > > FAILURE
> > > >
> > > > Serious unknown changes coming with Patchwork_110277v1 absolutely need to be
> > > > verified manually.
> > > >
> > > > If you think the reported changes have nothing to do with the changes
> > > > introduced in Patchwork_110277v1, please notify your bug team to allow them
> > > > to document this new failure mode, which will reduce false positives in CI.
> > > >
> > > > External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> > >
> > > For the record, I have almost zero idea what to do with this. From
> > > what I can tell, most (all?) of these failures are flaky(?) already
> > > and are probably not related to my change.
> >
> > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> >
> > According to that link, this change appears to break every platform
> > when running the live selftests (looking at the purple squares).
> > Running the selftests normally involves loading and unloading the
> > module. Looking at the logs there is scary stuff like:
> >
> [...]
>
> Ah, thanks. I'm not sure what made me think the tests were failing the
> same way on drm-tip, but maybe just chalk that up to my unfamiliarity
> with this particular dashboard... (There are a few isolated failure
> and/or flakes on drm-tip, but they don't look like this.)
>
> Anyway, I think I managed to run some of these tests on my own platforms
> [1], and I don't reproduce those failures. I do see other failures
> (crashes) though, like in i915_gem_mman_live_selftests/igt_mmap, where
> igt_mmap_offset() (selftest-only code) -> vm_mmap() assumes we have a
> valid |current->mm|. But that's borrowing the modprobe process's memory
> map, and with async probe, the selftest sequence happens in a kernel
> worker instead (and current->mm is NULL). So that clearly won't work.
>
> I suppose I could disable async probe when built as a module (I believe
> it doesn't really have any value, since the module load task just waits
> for the async task anyway). I'm not familiar enough with MM to know what
> the vm_mmap() alternatives are, but this particular bit of code does
> feel odd.
>
> Additionally, I think this implies that live_selftests will break if
> i915 is built-in (i.e., =y, not =m), as we'll again run in a
> kernel-thread context at boot time. But I would hope nobody is trying to
> run them that way? I guess this gets even hairier, because even if the
> driver is built into the kernel, it's possible to kick them off from a
> process context by tweaking the module parameters later, and then
> re-binding the device... So all in all, this bug leaves an ugly
> situation, with or without my patch.
>
> I'm still curious about the reported failures, but maybe they require
> some particular sequence of tests? I also don't have the full
> igt-gpu-tools set running, so maybe they do something a little
> differently than my steps in [1]?
>
> Brian
>
> [1] I have a GLk system, if it matters. I figured I can run some of
> these with any one of the following:
>
>   modprobe i915 live_selftests=1
>   modprobe i915 live_selftests=1 igt__20__live_workarounds=Y
>   modprobe i915 live_selftests=1 igt__19__live_uncore=Y
>   modprobe i915 live_selftests=1 igt__18__live_sanitycheck=Y
>   ...

CI should be using the IGT wrapper to run them, AFAIK. So something like:

./build/tests/i915_selftest

Or to just run the live, mock or perf:

./build/tests/i915_selftest --run-subtest live
./build/tests/i915_selftest --run-subtest mock
./build/tests/i915_selftest --run-subtest perf

Or if you want to run some particular selftest, like live mman tests:

./build/tests/i915_selftest --run-subtest live --dyn mman
Matthew Auld Nov. 4, 2022, 3:20 p.m. UTC | #5
On Thu, 3 Nov 2022 at 00:14, Brian Norris <briannorris@chromium.org> wrote:
>
> On Wed, Nov 02, 2022 at 12:18:37PM +0000, Matthew Auld wrote:
> > On Tue, 1 Nov 2022 at 21:58, Brian Norris <briannorris@chromium.org> wrote:
> > >
> > > On Fri, Oct 28, 2022 at 5:24 PM Patchwork
> > > <patchwork@emeril.freedesktop.org> wrote:
> > > >
> > > > Patch Details
> > > > Series:drm/i915: Set PROBE_PREFER_ASYNCHRONOUS
> > > > URL:https://patchwork.freedesktop.org/series/110277/
> > > > State:failure
> > > > Details:https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> > > >
> > > > CI Bug Log - changes from CI_DRM_12317 -> Patchwork_110277v1
> > > >
> > > > Summary
> > > >
> > > > FAILURE
> > > >
> > > > Serious unknown changes coming with Patchwork_110277v1 absolutely need to be
> > > > verified manually.
> > > >
> > > > If you think the reported changes have nothing to do with the changes
> > > > introduced in Patchwork_110277v1, please notify your bug team to allow them
> > > > to document this new failure mode, which will reduce false positives in CI.
> > > >
> > > > External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> > >
> > > For the record, I have almost zero idea what to do with this. From
> > > what I can tell, most (all?) of these failures are flaky(?) already
> > > and are probably not related to my change.
> >
> > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/index.html
> >
> > According to that link, this change appears to break every platform
> > when running the live selftests (looking at the purple squares).
> > Running the selftests normally involves loading and unloading the
> > module. Looking at the logs there is scary stuff like:
> >
> [...]
>
> Ah, thanks. I'm not sure what made me think the tests were failing the
> same way on drm-tip, but maybe just chalk that up to my unfamiliarity
> with this particular dashboard... (There are a few isolated failure
> and/or flakes on drm-tip, but they don't look like this.)
>
> Anyway, I think I managed to run some of these tests on my own platforms
> [1], and I don't reproduce those failures. I do see other failures
> (crashes) though, like in i915_gem_mman_live_selftests/igt_mmap, where
> igt_mmap_offset() (selftest-only code) -> vm_mmap() assumes we have a
> valid |current->mm|. But that's borrowing the modprobe process's memory
> map, and with async probe, the selftest sequence happens in a kernel
> worker instead (and current->mm is NULL). So that clearly won't work.

Semi related:
https://lore.kernel.org/intel-gfx/20221104134703.3770b371@maurocar-mobl2/T/#m888972bb1ffb0a913e3db8b4099dffdc2ec7a0dc

Sounds like a similar issue when trying to convert the live selftests
over to kunit.

>
> I suppose I could disable async probe when built as a module (I believe
> it doesn't really have any value, since the module load task just waits
> for the async task anyway). I'm not familiar enough with MM to know what
> the vm_mmap() alternatives are, but this particular bit of code does
> feel odd.
>
> Additionally, I think this implies that live_selftests will break if
> i915 is built-in (i.e., =y, not =m), as we'll again run in a
> kernel-thread context at boot time. But I would hope nobody is trying to
> run them that way? I guess this gets even hairier, because even if the
> driver is built into the kernel, it's possible to kick them off from a
> process context by tweaking the module parameters later, and then
> re-binding the device... So all in all, this bug leaves an ugly
> situation, with or without my patch.
>
> I'm still curious about the reported failures, but maybe they require
> some particular sequence of tests? I also don't have the full
> igt-gpu-tools set running, so maybe they do something a little
> differently than my steps in [1]?
>
> Brian
>
> [1] I have a GLk system, if it matters. I figured I can run some of
> these with any one of the following:
>
>   modprobe i915 live_selftests=1
>   modprobe i915 live_selftests=1 igt__20__live_workarounds=Y
>   modprobe i915 live_selftests=1 igt__19__live_uncore=Y
>   modprobe i915 live_selftests=1 igt__18__live_sanitycheck=Y
>   ...
Brian Norris Nov. 5, 2022, 1:29 a.m. UTC | #6
On Fri, Nov 04, 2022 at 02:38:03PM +0000, Matthew Auld wrote:
> On Thu, 3 Nov 2022 at 00:14, Brian Norris <briannorris@chromium.org> wrote:
> > I'm still curious about the reported failures, but maybe they require
> > some particular sequence of tests? I also don't have the full
> > igt-gpu-tools set running, so maybe they do something a little
> > differently than my steps in [1]?
> >
> > Brian
> >
> > [1] I have a GLk system, if it matters. I figured I can run some of
> > these with any one of the following:
> >
> >   modprobe i915 live_selftests=1
> >   modprobe i915 live_selftests=1 igt__20__live_workarounds=Y
> >   modprobe i915 live_selftests=1 igt__19__live_uncore=Y
> >   modprobe i915 live_selftests=1 igt__18__live_sanitycheck=Y
> >   ...
> 
> CI should be using the IGT wrapper to run them, AFAIK. So something like:
> 
> ./build/tests/i915_selftest
> 
> Or to just run the live, mock or perf:
> 
> ./build/tests/i915_selftest --run-subtest live
> ./build/tests/i915_selftest --run-subtest mock
> ./build/tests/i915_selftest --run-subtest perf
> 
> Or if you want to run some particular selftest, like live mman tests:
> 
> ./build/tests/i915_selftest --run-subtest live --dyn mman

Thanks. I'm running through those now, and it seems like I'm doing
closer to what the CI logs show [1], but I'm still not reproducing on my
GLK. (I've now managed to run it with drm-tip; still no luck.)

So far, now I've managed to just reproduced *different* known problems:

https://lore.kernel.org/all/Y2WfpLbX1SeDtk+7@google.com/

But after working around those, I run without any similar lockup
failures.

I might poke around some more next week, but I've probably spent more
time than reasonable on this already.

Anyway, thanks for the help!

Regards,
Brian

[1] For one, I've run through a test list, in order, based on this:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110277v1/fi-glk-j4005/testlist0.txt
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 38460a0bd7cb..1cb1f87aea86 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1371,7 +1371,10 @@  static struct pci_driver i915_pci_driver = {
 	.probe = i915_pci_probe,
 	.remove = i915_pci_remove,
 	.shutdown = i915_pci_shutdown,
-	.driver.pm = &i915_pm_ops,
+	.driver = {
+		.pm = &i915_pm_ops,
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
+	},
 };
 
 int i915_pci_register_driver(void)