[v4,6/6] drm/i915: fix context/engine cleanup order

Message ID 1454095171-22475-7-git-send-email-david.s.gordon@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dave Gordon Jan. 29, 2016, 7:19 p.m. UTC
From: Nick Hoath <nicholas.hoath@intel.com>

Swap the order of context & engine cleanup, so that contexts are cleaned
up first, and *then* engines. This is a more sensible order anyway, but
in particular has become necessary since the 'intel_ring_initialized()
must be simple and inline' patch, which now uses ring->dev as an
'initialised' flag, so it can now be NULL after engine teardown. This in
turn can cause a problem in the context code, which (used to) check the
ring->dev->struct_mutex -- causing a fault if ring->dev was NULL.

Also rename the cleanup function to reflect what it actually does
(cleanup engines, not a ringbuffer), and fix an annoying whitespace
issue.

v2: Also make the fix in i915_load_modeset_init, not just in
    i915_driver_unload (Chris Wilson)
v3: Had extra stuff in it.
v4: Reverted extra stuff (so we're back to v2).
    Rebased and updated commentary above (Dave Gordon).

Signed-off-by: Nick Hoath <nicholas.hoath@intel.com>
Signed-off-by: David Gordon <david.s.gordon@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c |  4 ++--
 drivers/gpu/drm/i915/i915_drv.h |  2 +-
 drivers/gpu/drm/i915/i915_gem.c | 23 ++++++++++++-----------
 3 files changed, 15 insertions(+), 14 deletions(-)

Comments

Chris Wilson Jan. 30, 2016, 11:17 a.m. UTC | #1
On Fri, Jan 29, 2016 at 07:19:31PM +0000, Dave Gordon wrote:
> From: Nick Hoath <nicholas.hoath@intel.com>
> 
> Swap the order of context & engine cleanup, so that contexts are cleaned
> up first, and *then* engines. This is a more sensible order anyway, but
> in particular has become necessary since the 'intel_ring_initialized()
> must be simple and inline' patch, which now uses ring->dev as an
> 'initialised' flag, so it can now be NULL after engine teardown. This in
> turn can cause a problem in the context code, which (used to) check the
> ring->dev->struct_mutex -- causing a fault if ring->dev was NULL.
> 
> Also rename the cleanup function to reflect what it actually does
> (cleanup engines, not a ringbuffer), and fix an annoying whitespace
> issue.
> 
> v2: Also make the fix in i915_load_modeset_init, not just in
>     i915_driver_unload (Chris Wilson)
> v3: Had extra stuff in it.
> v4: Reverted extra stuff (so we're back to v2).
>     Rebased and updated commentary above (Dave Gordon).
> 
> Signed-off-by: Nick Hoath <nicholas.hoath@intel.com>
> Signed-off-by: David Gordon <david.s.gordon@intel.com>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Have to drop that, with the recent context changes.

You have to move the gpu-reset now for execlists.

Basically pull it out into:

static void i915_unload_gem(struct drm_device *dev)
{
       /*
        * Neither the BIOS, ourselves or any other kernel
        * expects the system to be in execlists mode on startup,
        * so we need to reset the GPU back to legacy mode. And the only
        * known way to disable logical contexts is through a GPU reset.
        *
        * So in order to leave the system in a known default configuration,
        * always reset the GPU upon unload. This also cleans up the GEM
        * state tracking, flushing off the requests and leaving the system
        * idle.
        *
        * Note that it is of the utmost importance that the GPU is idle and
        * all stray writes are flushed *before* we dismantle the backing
        * storage for the pinned objects.
        */
       i915_reset(dev);

       mutex_lock(&dev->struct_mutex);
       i915_gem_context_fini(dev);
       i915_gem_cleanup_ringbuffer(dev);
       mutex_unlock(&dev->struct_mutex);
}

and then kill the intel_gpu_reset along both the cleanup paths.
-Chris
Chris Wilson Feb. 11, 2016, 1:36 p.m. UTC | #2
On Sat, Jan 30, 2016 at 11:17:07AM +0000, Chris Wilson wrote:
> On Fri, Jan 29, 2016 at 07:19:31PM +0000, Dave Gordon wrote:
> > [snip]
> 
> [snip: suggested pulling the reset and teardown into i915_unload_gem()]

It appears this patch was applied without dropping my r-b for the issue
I pointed out above.
-Chris
Mika Kuoppala Feb. 11, 2016, 3:02 p.m. UTC | #3
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Sat, Jan 30, 2016 at 11:17:07AM +0000, Chris Wilson wrote:
>> On Fri, Jan 29, 2016 at 07:19:31PM +0000, Dave Gordon wrote:
>> > [snip]
>> 
>> [snip: suggested i915_unload_gem() rework]
>
> It appears this patch was applied without dropping my r-b for the issue
> I pointed out above.

Now causes a splat in intel_logical_ring_cleanup when unloading module.

Best to revert and rework on top of Dave's cleanup set?
-Mika



[   58.170369] BUG: unable to handle kernel NULL pointer dereference at (null)
[   58.170374] IP: [<ffffffffa00e04d3>] intel_logical_ring_cleanup+0x83/0x100 [i915]
[   58.170389] PGD 0 
[   58.170391] Oops: 0000 [#1] PREEMPT SMP 
[   58.170394] Modules linked in: x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i915(-) mei_me mei i2c_hid e1000e ptp pps_core
[   58.170404] CPU: 0 PID: 5766 Comm: rmmod Not tainted 4.5.0-rc3+ #43
[   58.170407] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 0505 11/16/2015
[   58.170410] task: ffff880036aab380 ti: ffff8802374c0000 task.ti: ffff8802374c0000
[   58.170412] RIP: 0010:[<ffffffffa00e04d3>]  [<ffffffffa00e04d3>] intel_logical_ring_cleanup+0x83/0x100 [i915]
[   58.170424] RSP: 0018:ffff8802374c3d30  EFLAGS: 00010282
[   58.170425] RAX: 0000000000000000 RBX: ffff880237203788 RCX: 0000000000000001
[   58.170428] RDX: 0000000087654321 RSI: 000000000000000d RDI: ffff8802372037c0
[   58.170430] RBP: ffff8802374c3d40 R08: 0000000000000000 R09: 0000000ad856c000
[   58.170432] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880237200000
[   58.170434] R13: ffff880237209638 R14: ffff880237323000 R15: 0000558a770bd090
[   58.170438] FS:  00007f8b439ff700(0000) GS:ffff880239800000(0000) knlGS:0000000000000000
[   58.170442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   58.170444] CR2: 0000000000000000 CR3: 000000023532d000 CR4: 00000000003406f0
[   58.170447] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   58.170450] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
[   58.170453] Stack:
[   58.170455]  ffff880237203788 ffff880237200000 ffff8802374c3d70 ffffffffa00d0ed4
[   58.170460]  ffff880237200000 ffff880237323000 ffff880237323060 ffffffffa0194360
[   58.170465]  ffff8802374c3d98 ffffffffa0154520 ffff880237323000 ffff880237323000
[   58.170469] Call Trace:
[   58.170479]  [<ffffffffa00d0ed4>] i915_gem_cleanup_engines+0x34/0x60 [i915]
[   58.170493]  [<ffffffffa0154520>] i915_driver_unload+0x140/0x220 [i915]
[   58.170497]  [<ffffffff8154a4f4>] drm_dev_unregister+0x24/0xa0
[   58.170501]  [<ffffffff8154aace>] drm_put_dev+0x1e/0x60
[   58.170506]  [<ffffffffa00912a0>] i915_pci_remove+0x10/0x20 [i915]
[   58.170510]  [<ffffffff814766e4>] pci_device_remove+0x34/0xb0
[   58.170514]  [<ffffffff8156e7d5>] __device_release_driver+0x95/0x140
[   58.170518]  [<ffffffff8156e97c>] driver_detach+0xbc/0xc0
[   58.170521]  [<ffffffff8156d883>] bus_remove_driver+0x53/0xd0
[   58.170525]  [<ffffffff8156f3a7>] driver_unregister+0x27/0x50
[   58.170528]  [<ffffffff81475725>] pci_unregister_driver+0x25/0x70
[   58.170531]  [<ffffffff8154c274>] drm_pci_exit+0x74/0x90
[   58.170543]  [<ffffffffa0154cb0>] i915_exit+0x20/0x1aa [i915]
[   58.170548]  [<ffffffff8111846f>] SyS_delete_module+0x18f/0x1f0


Dave Gordon Feb. 15, 2016, noon UTC | #4
On 11/02/16 15:02, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
>> On Sat, Jan 30, 2016 at 11:17:07AM +0000, Chris Wilson wrote:
>>> On Fri, Jan 29, 2016 at 07:19:31PM +0000, Dave Gordon wrote:
>>>> [snip]
>>>
>>> [snip: suggested i915_unload_gem() rework]
>>
>> It appears this patch was applied without dropping my r-b for the issue
>> I pointed out above.
>
> Now causes a splat in intel_logical_ring_cleanup when unloading module.
>
> Best to revert and rework on top of Dave's cleanup set?
> -Mika

This whole patchset was already superseded by a newer version before 
Daniel merged it. The newer version doesn't have Chris's R-B on this 
(context/engine) patch, only on two others that are still as they were
when he reviewed them.

Please go and look instead at [v5, 11 patches] posted 2016-02-05.

.Dave.
Dave Gordon Feb. 15, 2016, 1:43 p.m. UTC | #5
On 11/02/16 13:36, Chris Wilson wrote:
> On Sat, Jan 30, 2016 at 11:17:07AM +0000, Chris Wilson wrote:
>> On Fri, Jan 29, 2016 at 07:19:31PM +0000, Dave Gordon wrote:
>>> From: Nick Hoath <nicholas.hoath@intel.com>
>>>
>>> Swap the order of context & engine cleanup, so that contexts are cleaned
>>> up first, and *then* engines. This is a more sensible order anyway, but
>>> in particular has become necessary since the 'intel_ring_initialized()
>>> must be simple and inline' patch, which now uses ring->dev as an
>>> 'initialised' flag, so it can now be NULL after engine teardown. This in
>>> turn can cause a problem in the context code, which (used to) check the
>>> ring->dev->struct_mutex -- causing a fault if ring->dev was NULL.

[snip]

> It appears this patch was applied without dropping my r-b for the issue
> I pointed out above.
> -Chris

Not only that, but it was plucked from the end of the set of 6 patches 
without the earlier ones in the sequence, despite the cover letter [0/6] 
saying this:

     Patches reordered again; the bug fixes are now in patches 3 and 5.
     The former could stand alone; the latter depends on patch 4 and is
     a prerequisite for Nick's patch [6/6] to function.

So there's no chance that this one alone could be expected to work :(

.Dave.

Patch

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index db9b0c6..3eb6844 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -444,8 +444,8 @@  static int i915_load_modeset_init(struct drm_device *dev)
 
 cleanup_gem:
 	mutex_lock(&dev->struct_mutex);
-	i915_gem_cleanup_ringbuffer(dev);
 	i915_gem_context_fini(dev);
+	i915_gem_cleanup_engines(dev);
 	mutex_unlock(&dev->struct_mutex);
 cleanup_irq:
 	intel_guc_ucode_fini(dev);
@@ -1220,8 +1220,8 @@  int i915_driver_unload(struct drm_device *dev)
 
 	intel_guc_ucode_fini(dev);
 	mutex_lock(&dev->struct_mutex);
-	i915_gem_cleanup_ringbuffer(dev);
 	i915_gem_context_fini(dev);
+	i915_gem_cleanup_engines(dev);
 	mutex_unlock(&dev->struct_mutex);
 	intel_fbc_cleanup_cfb(dev_priv);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 905e90f..8baaca7 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3020,7 +3020,7 @@  int i915_gem_init_rings(struct drm_device *dev);
 int __must_check i915_gem_init_hw(struct drm_device *dev);
 int i915_gem_l3_remap(struct drm_i915_gem_request *req, int slice);
 void i915_gem_init_swizzling(struct drm_device *dev);
-void i915_gem_cleanup_ringbuffer(struct drm_device *dev);
+void i915_gem_cleanup_engines(struct drm_device *dev);
 int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_suspend(struct drm_device *dev);
 void __i915_add_request(struct drm_i915_gem_request *req,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5a4d468..4ee6df9 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4912,7 +4912,7 @@  int i915_gem_init_rings(struct drm_device *dev)
 		req = i915_gem_request_alloc(ring, NULL);
 		if (IS_ERR(req)) {
 			ret = PTR_ERR(req);
-			i915_gem_cleanup_ringbuffer(dev);
+			i915_gem_cleanup_engines(dev);
 			goto out;
 		}
 
@@ -4925,7 +4925,7 @@  int i915_gem_init_rings(struct drm_device *dev)
 		if (ret && ret != -EIO) {
 			DRM_ERROR("PPGTT enable ring #%d failed %d\n", i, ret);
 			i915_gem_request_cancel(req);
-			i915_gem_cleanup_ringbuffer(dev);
+			i915_gem_cleanup_engines(dev);
 			goto out;
 		}
 
@@ -4933,7 +4933,7 @@  int i915_gem_init_rings(struct drm_device *dev)
 		if (ret && ret != -EIO) {
 			DRM_ERROR("Context enable ring #%d failed %d\n", i, ret);
 			i915_gem_request_cancel(req);
-			i915_gem_cleanup_ringbuffer(dev);
+			i915_gem_cleanup_engines(dev);
 			goto out;
 		}
 
@@ -5011,7 +5011,7 @@  int i915_gem_init(struct drm_device *dev)
 }
 
 void
-i915_gem_cleanup_ringbuffer(struct drm_device *dev)
+i915_gem_cleanup_engines(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_engine_cs *ring;
@@ -5020,13 +5020,14 @@  int i915_gem_init(struct drm_device *dev)
 	for_each_ring(ring, dev_priv, i)
 		dev_priv->gt.cleanup_ring(ring);
 
-    if (i915.enable_execlists)
-            /*
-             * Neither the BIOS, ourselves or any other kernel
-             * expects the system to be in execlists mode on startup,
-             * so we need to reset the GPU back to legacy mode.
-             */
-            intel_gpu_reset(dev);
+	if (i915.enable_execlists) {
+		/*
+		 * Neither the BIOS, ourselves or any other kernel
+		 * expects the system to be in execlists mode on startup,
+		 * so we need to reset the GPU back to legacy mode.
+		 */
+		intel_gpu_reset(dev);
+	}
 }
 
 static void