Message ID | cover.1657639152.git.mchehab@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | Fix TLB invalidate issues with Broadwell | expand |
[[PATCH v5 0/2] Fix TLB invalidate issues with Broadwell] On 12/07/2022 (Tue 16:21) Mauro Carvalho Chehab wrote: > i915 selftest hangcheck is causing the i915 driver timeouts, as reported > by Intel CI bot: > > http://gfx-ci.fi.intel.com/cibuglog-ng/issuefilterassoc/24297?query_key=42a999f48fa6ecce068bc8126c069be7c31153b4 [...] > After that, the machine just silently hangs. > > Bisecting the issue, the patch that introduced the regression is: > > 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store") > > Reverting it fix the issues, but introduce other problems, as TLB > won't be invalidated anymore. So, instead, let's fix the root cause. > > It turns that the TLB flush logic ends conflicting with i915 reset, > which is called during selftest hangcheck. So, the TLB cache should > be serialized together with i915 reset. > > Tested on an Intel NUC5i7RYB with an i7-5557U Broadwell CPU. It turns out that this breaks PM-suspend operations on preempt-rt, on multiple versions, due to all the linux-stable backports. This happens because the uncore->lock is now used in atomic contexts. As the uncore->lock is widely used, conversion to a raw lock seems inappropriate at 1st glance, and hence some alternate solution will likely be required. Below is an example of the regression on v5.15-rt, with backport: commit 0ee5874dad61d2b154a9e3db196fc33e8208ce1b Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jul 12 16:21:32 2022 +0100 drm/i915/gt: Serialize GRDOM access between multiple engine resets [ Upstream commit b24dcf1dc507f69ed3b5c66c2b6a0209ae80d4d4 ] Reverting the engine reset serialization change avoids the PM-suspend regression and is a temporary workaround for -rt users, but of course leaves this original TLB issue exposed. BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 45092, name: kworker/u8:4 preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 INFO: lockdep is turned off. Preemption disabled at: [<ffffffffc0636522>] __intel_gt_reset+0x92/0x100 [i915] CPU: 3 PID: 45092 Comm: kworker/u8:4 Tainted: G W O 5.15.59-rt48-preempt-rt #1 Hardware name: Intel(R) Client Systems NUC7i5DNKE/NUC7i5DNB, BIOS DNKBLi5v.86A.0064.2019.0523.1933 05/23/2019 Workqueue: events_unbound async_run_entry_fn Call Trace: <TASK> show_stack+0x52/0x5c dump_stack_lvl+0x5b/0x86 dump_stack+0x10/0x16 __might_resched.cold+0xf7/0x12f ? __gen6_reset_engines.constprop.0+0x80/0x80 [i915] rt_spin_lock+0x4e/0xf0 ? gen8_reset_engines+0x2e/0x1e0 [i915] gen8_reset_engines+0x2e/0x1e0 [i915] ? __gen6_reset_engines.constprop.0+0x80/0x80 [i915] __intel_gt_reset+0x9d/0x100 [i915] gt_sanitize+0x16c/0x190 [i915] intel_gt_suspend_late+0x3d/0xc0 [i915] i915_gem_suspend_late+0x57/0x130 [i915] i915_drm_suspend_late+0x38/0x110 [i915] i915_pm_suspend_late+0x1d/0x30 [i915] pm_generic_suspend_late+0x28/0x40 pci_pm_suspend_late+0x37/0x50 ? pci_pm_poweroff_late+0x50/0x50 dpm_run_callback.cold+0x3c/0xa8 __device_suspend_late+0xa4/0x1e0 async_suspend_late+0x20/0xa0 async_run_entry_fn+0x28/0xc0 process_one_work+0x239/0x6c0 worker_thread+0x58/0x3e0 kthread+0x1a9/0x1d0 ? process_one_work+0x6c0/0x6c0 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x1f/0x30 </TASK> PM: late suspend of devices complete after 26.497 msecs Paul. -- > > v5: > - Added a missing SoB on patch 2. > - No other changes. > > v4: > - No functional changes. All changes are at the patch descriptions: > - collected acked-by/reviewed-by; > - use the same e-mail on Author and SoB on patch 1. > > v3: > - Removed the logic that would check if the engine is awake before doing > TLB flush invalidation as backporting PM logic up to Kernel 4.x could be > too painful. After getting this one merged, I'll submit a separate patch > with the PM awake logic. > > v2: > > - Reduced to bare minimum fixes, as this shoud be backported deeply > into stable. > > Chris Wilson (2): > drm/i915/gt: Serialize GRDOM access between multiple engine resets > drm/i915/gt: Serialize TLB invalidates with GT resets > > drivers/gpu/drm/i915/gt/intel_gt.c | 15 ++++++++++- > drivers/gpu/drm/i915/gt/intel_reset.c | 37 ++++++++++++++++++++------- > 2 files changed, 42 insertions(+), 10 deletions(-) > > -- > 2.36.1 > >