diff mbox

[08/10] drm/i915/guc: Fix GuC HW/SW state cleanup in unload path

Message ID 1505650654-11091-8-git-send-email-sagar.a.kamble@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

sagar.a.kamble@intel.com Sept. 17, 2017, 12:17 p.m. UTC
Teardown of GuC HW/SW state was not properly done in unload path.
guc_submission_disable was called as part of intel_uc_fini_hw which
happens post gem_unload in the i915_driver_unload path.
s/i915_gem_fini/i915_gem_cleanup as it looks more suitable as that
function does cleanup.
To differentiate the tasks during suspend and unload w.r.t GuC this
patch introduces new function i915_gem_fini which in addition to
disabling GuC interfaces also disables GuC submission during which
communication with GuC is needed for destroying doorbell.
i915_gem_fini is copy of i915_gem_suspend with difference w.r.t
GuC operations. To achieve this, new helpers i915_gem_context_suspend
and i915_gem_suspend_complete are prepared.

This patch updates the functions responsibilities as follows:
1. intel_uc_fini_hw: Disable all things that involve communication with
   GuC and internal operation of GuC firmware, currently submission.
   Post this disable all other state like guc_ggtt_invalidate,
   guc_communication and guc_interrupts.
2. intel_uc_cleanup: Free up all UC related memory and other data.

v2: Prepared i915_gem_unload. (Michal)

v3: Moved guc_free_load_err_log past i915.enable_guc_loading check in
intel_uc_cleanup_hw. Prepared i915_gem_contexts_suspend and
i915_gem_suspend_complete that are used in the two paths
i915_gem_suspend and i915_gem_unload. Unload specific actions like
disabling GuC submission and GuC communication are being done in
i915_gem_unload. Commit message update. (Michał Winiarski)

v4: prepared i915_gem_cleanup and intel_uc_cleanup.

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c  | 12 +++++-----
 drivers/gpu/drm/i915/i915_drv.h  |  1 +
 drivers/gpu/drm/i915/i915_gem.c  | 52 ++++++++++++++++++++++++++++++++++++----
 drivers/gpu/drm/i915/intel_guc.c | 41 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc.h |  3 +++
 drivers/gpu/drm/i915/intel_uc.c  | 39 +++++-------------------------
 drivers/gpu/drm/i915/intel_uc.h  |  1 +
 7 files changed, 105 insertions(+), 44 deletions(-)

Comments

oscar.mateo@intel.com Sept. 21, 2017, 6:33 p.m. UTC | #1
<SNIP>


On 09/17/2017 05:17 AM, Sagar Arun Kamble wrote:
> Teardown of GuC HW/SW state was not properly done in unload path.
> guc_submission_disable was called as part of intel_uc_fini_hw which
> happens post gem_unload in the i915_driver_unload path.
> s/i915_gem_fini/i915_gem_cleanup as it looks more suitable as that
> function does cleanup.
> To differentiate the tasks during suspend and unload w.r.t GuC this
> patch introduces new function i915_gem_fini which in addition to
> disabling GuC interfaces also disables GuC submission during which
> communication with GuC is needed for destroying doorbell.
> i915_gem_fini is copy of i915_gem_suspend with difference w.r.t
> GuC operations. To achieve this, new helpers i915_gem_context_suspend
> and i915_gem_suspend_complete are prepared.
>

Sagar, I just realized this patch would make this comment in 
guc_client_free superfluous, right?:

     /* FIXME: in many cases, by the time we get here the GuC has been
      * reset, so we cannot destroy the doorbell properly. Ignore the
      * error message for now */

Can you make sure you remove it in this patch?

Thanks!
sagar.a.kamble@intel.com Sept. 21, 2017, 7:09 p.m. UTC | #2
On 9/22/2017 12:03 AM, Oscar Mateo wrote:
> <SNIP>
>
>
> On 09/17/2017 05:17 AM, Sagar Arun Kamble wrote:
>> Teardown of GuC HW/SW state was not properly done in unload path.
>> guc_submission_disable was called as part of intel_uc_fini_hw which
>> happens post gem_unload in the i915_driver_unload path.
>> s/i915_gem_fini/i915_gem_cleanup as it looks more suitable as that
>> function does cleanup.
>> To differentiate the tasks during suspend and unload w.r.t GuC this
>> patch introduces new function i915_gem_fini which in addition to
>> disabling GuC interfaces also disables GuC submission during which
>> communication with GuC is needed for destroying doorbell.
>> i915_gem_fini is copy of i915_gem_suspend with difference w.r.t
>> GuC operations. To achieve this, new helpers i915_gem_context_suspend
>> and i915_gem_suspend_complete are prepared.
>>
>
> Sagar, I just realized this patch would make this comment in 
> guc_client_free superfluous, right?:
>
>     /* FIXME: in many cases, by the time we get here the GuC has been
>      * reset, so we cannot destroy the doorbell properly. Ignore the
>      * error message for now */
>
> Can you make sure you remove it in this patch?
>
> Thanks!
Yes. Will need to remove this comment :)
Will make this change as part of 
https://patchwork.freedesktop.org/patch/178189/
How about another comment before it? i915_gem_suspend is ensuring no 
outstanding submissions before coming to
uc_suspend. So shall we remove that as well?

         /*
          * XXX: wait for any outstanding submissions before freeing memory.
          * Be sure to drop any locks
          */
oscar.mateo@intel.com Sept. 21, 2017, 7:49 p.m. UTC | #3
On 09/21/2017 12:09 PM, Sagar Arun Kamble wrote:
>
>
> On 9/22/2017 12:03 AM, Oscar Mateo wrote:
>> <SNIP>
>>
>>
>> On 09/17/2017 05:17 AM, Sagar Arun Kamble wrote:
>>> Teardown of GuC HW/SW state was not properly done in unload path.
>>> guc_submission_disable was called as part of intel_uc_fini_hw which
>>> happens post gem_unload in the i915_driver_unload path.
>>> s/i915_gem_fini/i915_gem_cleanup as it looks more suitable as that
>>> function does cleanup.
>>> To differentiate the tasks during suspend and unload w.r.t GuC this
>>> patch introduces new function i915_gem_fini which in addition to
>>> disabling GuC interfaces also disables GuC submission during which
>>> communication with GuC is needed for destroying doorbell.
>>> i915_gem_fini is copy of i915_gem_suspend with difference w.r.t
>>> GuC operations. To achieve this, new helpers i915_gem_context_suspend
>>> and i915_gem_suspend_complete are prepared.
>>>
>>
>> Sagar, I just realized this patch would make this comment in 
>> guc_client_free superfluous, right?:
>>
>>     /* FIXME: in many cases, by the time we get here the GuC has been
>>      * reset, so we cannot destroy the doorbell properly. Ignore the
>>      * error message for now */
>>
>> Can you make sure you remove it in this patch?
>>
>> Thanks!
> Yes. Will need to remove this comment :)
> Will make this change as part of 
> https://patchwork.freedesktop.org/patch/178189/
> How about another comment before it? i915_gem_suspend is ensuring no 
> outstanding submissions before coming to
> uc_suspend. So shall we remove that as well?
>
>         /*
>          * XXX: wait for any outstanding submissions before freeing 
> memory.
>          * Be sure to drop any locks
>          */

I have the impression that this comment was here as a future design 
guide for Direct Submission. It seems to be the case for almost all the 
"XXX" comments in that file, like:

     /* XXX: wait for any interrupts */
     /* XXX: wait for workqueue to drain */

Since Direct Submission is not being considered at the moment, maybe we 
can just drop of all them (or transform them into something more 
innocuous than a "XXX").
What do people think?
sagar.a.kamble@intel.com Sept. 22, 2017, 4:13 a.m. UTC | #4
On 9/22/2017 1:19 AM, Oscar Mateo wrote:
>
>
> On 09/21/2017 12:09 PM, Sagar Arun Kamble wrote:
>>
>>
>> On 9/22/2017 12:03 AM, Oscar Mateo wrote:
>>> <SNIP>
>>>
>>>
>>> On 09/17/2017 05:17 AM, Sagar Arun Kamble wrote:
>>>> Teardown of GuC HW/SW state was not properly done in unload path.
>>>> guc_submission_disable was called as part of intel_uc_fini_hw which
>>>> happens post gem_unload in the i915_driver_unload path.
>>>> s/i915_gem_fini/i915_gem_cleanup as it looks more suitable as that
>>>> function does cleanup.
>>>> To differentiate the tasks during suspend and unload w.r.t GuC this
>>>> patch introduces new function i915_gem_fini which in addition to
>>>> disabling GuC interfaces also disables GuC submission during which
>>>> communication with GuC is needed for destroying doorbell.
>>>> i915_gem_fini is copy of i915_gem_suspend with difference w.r.t
>>>> GuC operations. To achieve this, new helpers i915_gem_context_suspend
>>>> and i915_gem_suspend_complete are prepared.
>>>>
>>>
>>> Sagar, I just realized this patch would make this comment in 
>>> guc_client_free superfluous, right?:
>>>
>>>     /* FIXME: in many cases, by the time we get here the GuC has been
>>>      * reset, so we cannot destroy the doorbell properly. Ignore the
>>>      * error message for now */
>>>
>>> Can you make sure you remove it in this patch?
>>>
>>> Thanks!
>> Yes. Will need to remove this comment :)
>> Will make this change as part of 
>> https://patchwork.freedesktop.org/patch/178189/
>> How about another comment before it? i915_gem_suspend is ensuring no 
>> outstanding submissions before coming to
>> uc_suspend. So shall we remove that as well?
>>
>>         /*
>>          * XXX: wait for any outstanding submissions before freeing 
>> memory.
>>          * Be sure to drop any locks
>>          */
>
> I have the impression that this comment was here as a future design 
> guide for Direct Submission. It seems to be the case for almost all 
> the "XXX" comments in that file, like:
>
>     /* XXX: wait for any interrupts */
>     /* XXX: wait for workqueue to drain */
>
> Since Direct Submission is not being considered at the moment, maybe 
> we can just drop of all them (or transform them into something more 
> innocuous than a "XXX").
> What do people think?
>
I feel it makes sense to keep for future changes then. If needed, let us 
update these in separate patch.
Thanks for the review Oscar.
sagar.a.kamble@intel.com Sept. 22, 2017, 4:42 a.m. UTC | #5
This was reply to older series hence I replied patch there. Will update 
w.r.t new series.


On 9/22/2017 12:03 AM, Oscar Mateo wrote:
> <SNIP>
>
>
> On 09/17/2017 05:17 AM, Sagar Arun Kamble wrote:
>> Teardown of GuC HW/SW state was not properly done in unload path.
>> guc_submission_disable was called as part of intel_uc_fini_hw which
>> happens post gem_unload in the i915_driver_unload path.
>> s/i915_gem_fini/i915_gem_cleanup as it looks more suitable as that
>> function does cleanup.
>> To differentiate the tasks during suspend and unload w.r.t GuC this
>> patch introduces new function i915_gem_fini which in addition to
>> disabling GuC interfaces also disables GuC submission during which
>> communication with GuC is needed for destroying doorbell.
>> i915_gem_fini is copy of i915_gem_suspend with difference w.r.t
>> GuC operations. To achieve this, new helpers i915_gem_context_suspend
>> and i915_gem_suspend_complete are prepared.
>>
>
> Sagar, I just realized this patch would make this comment in 
> guc_client_free superfluous, right?:
>
>     /* FIXME: in many cases, by the time we get here the GuC has been
>      * reset, so we cannot destroy the doorbell properly. Ignore the
>      * error message for now */
>
> Can you make sure you remove it in this patch?
>
> Thanks!
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 4751679..f9622fc 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -596,13 +596,13 @@  static bool i915_switcheroo_can_switch(struct pci_dev *pdev)
 	.can_switch = i915_switcheroo_can_switch,
 };
 
-static void i915_gem_fini(struct drm_i915_private *dev_priv)
+static void i915_gem_cleanup(struct drm_i915_private *dev_priv)
 {
 	/* Flush any outstanding unpin_work. */
 	i915_gem_drain_workqueue(dev_priv);
 
 	mutex_lock(&dev_priv->drm.struct_mutex);
-	intel_uc_fini_hw(dev_priv);
+	intel_uc_cleanup(dev_priv);
 	i915_gem_cleanup_engines(dev_priv);
 	i915_gem_contexts_fini(dev_priv);
 	i915_gem_cleanup_userptr(dev_priv);
@@ -683,9 +683,9 @@  static int i915_load_modeset_init(struct drm_device *dev)
 	return 0;
 
 cleanup_gem:
-	if (i915_gem_suspend(dev_priv))
+	if (i915_gem_fini(dev_priv))
 		DRM_ERROR("failed to idle hardware; continuing to unload!\n");
-	i915_gem_fini(dev_priv);
+	i915_gem_cleanup(dev_priv);
 cleanup_uc:
 	intel_uc_fini_fw(dev_priv);
 cleanup_irq:
@@ -1376,7 +1376,7 @@  void i915_driver_unload(struct drm_device *dev)
 
 	i915_driver_unregister(dev_priv);
 
-	if (i915_gem_suspend(dev_priv))
+	if (i915_gem_fini(dev_priv))
 		DRM_ERROR("failed to idle hardware; continuing to unload!\n");
 
 	intel_display_power_get(dev_priv, POWER_DOMAIN_INIT);
@@ -1410,7 +1410,7 @@  void i915_driver_unload(struct drm_device *dev)
 	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
 	i915_reset_error_state(dev_priv);
 
-	i915_gem_fini(dev_priv);
+	i915_gem_cleanup(dev_priv);
 	intel_uc_fini_fw(dev_priv);
 	intel_fbc_cleanup_cfb(dev_priv);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a3b7435..341f2ba 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3671,6 +3671,7 @@  void i915_gem_reset_engine(struct intel_engine_cs *engine,
 int i915_gem_wait_for_idle(struct drm_i915_private *dev_priv,
 			   unsigned int flags);
 int __must_check i915_gem_suspend(struct drm_i915_private *dev_priv);
+int __must_check i915_gem_fini(struct drm_i915_private *dev_priv);
 void i915_gem_resume(struct drm_i915_private *dev_priv);
 int i915_gem_fault(struct vm_fault *vmf);
 int i915_gem_object_wait(struct drm_i915_gem_object *obj,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7354307..99f55bd 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4545,12 +4545,11 @@  void i915_gem_sanitize(struct drm_i915_private *i915)
 	}
 }
 
-int i915_gem_suspend(struct drm_i915_private *dev_priv)
+static int i915_gem_contexts_suspend(struct drm_i915_private *dev_priv)
 {
 	struct drm_device *dev = &dev_priv->drm;
 	int ret;
 
-	intel_runtime_pm_get(dev_priv);
 	intel_suspend_gt_powersave(dev_priv);
 
 	mutex_lock(&dev->struct_mutex);
@@ -4577,8 +4576,15 @@  int i915_gem_suspend(struct drm_i915_private *dev_priv)
 	i915_gem_contexts_lost(dev_priv);
 	mutex_unlock(&dev->struct_mutex);
 
-	intel_uc_system_suspend(dev_priv);
+	return 0;
+
+err_unlock:
+	mutex_unlock(&dev->struct_mutex);
+	return ret;
+}
 
+static void i915_gem_suspend_complete(struct drm_i915_private *dev_priv)
+{
 	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
 	cancel_delayed_work_sync(&dev_priv->gt.retire_work);
 
@@ -4615,12 +4621,48 @@  int i915_gem_suspend(struct drm_i915_private *dev_priv)
 	 * machine in an unusable condition.
 	 */
 	i915_gem_sanitize(dev_priv);
+}
+
+int i915_gem_suspend(struct drm_i915_private *dev_priv)
+{
+	int ret;
+
+	intel_runtime_pm_get(dev_priv);
+
+	ret = i915_gem_contexts_suspend(dev_priv);
+	if (ret && ret != -EIO)
+		goto err_suspend;
+
+	intel_uc_system_suspend(dev_priv);
+
+	i915_gem_suspend_complete(dev_priv);
 
 	intel_runtime_pm_put(dev_priv);
 	return 0;
 
-err_unlock:
-	mutex_unlock(&dev->struct_mutex);
+err_suspend:
+	intel_runtime_pm_put(dev_priv);
+	return ret;
+}
+
+int i915_gem_fini(struct drm_i915_private *dev_priv)
+{
+	int ret;
+
+	intel_runtime_pm_get(dev_priv);
+
+	ret = i915_gem_contexts_suspend(dev_priv);
+	if (ret && ret != -EIO)
+		goto err_suspend;
+
+	intel_uc_fini_hw(dev_priv);
+
+	i915_gem_suspend_complete(dev_priv);
+
+	intel_runtime_pm_put(dev_priv);
+	return 0;
+
+err_suspend:
 	intel_runtime_pm_put(dev_priv);
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/intel_guc.c b/drivers/gpu/drm/i915/intel_guc.c
index 904cc58..a288c70 100644
--- a/drivers/gpu/drm/i915/intel_guc.c
+++ b/drivers/gpu/drm/i915/intel_guc.c
@@ -292,3 +292,44 @@  int intel_guc_system_resume(struct intel_guc *guc)
 	 */
 	return ret;
 }
+
+void intel_guc_fini_hw(struct intel_guc *guc)
+{
+	struct drm_i915_private *dev_priv = guc_to_i915(guc);
+	struct drm_device *dev = &dev_priv->drm;
+
+	if (i915.enable_guc_submission) {
+		mutex_lock(&dev->struct_mutex);
+		i915_guc_submission_disable(dev_priv);
+		mutex_unlock(&dev->struct_mutex);
+		intel_guc_reset_prepare(guc);
+	}
+}
+
+void intel_guc_capture_load_err_log(struct intel_guc *guc)
+{
+	if (!guc->log.vma || i915.guc_log_level < 0)
+		return;
+
+	if (!guc->load_err_log)
+		guc->load_err_log = i915_gem_object_get(guc->log.vma->obj);
+}
+
+static void guc_free_load_err_log(struct intel_guc *guc)
+{
+	if (guc->load_err_log)
+		i915_gem_object_put(guc->load_err_log);
+}
+
+void intel_guc_cleanup(struct intel_guc *guc)
+{
+	struct drm_i915_private *dev_priv = guc_to_i915(guc);
+
+	if (!i915.enable_guc_loading)
+		return;
+
+	guc_free_load_err_log(guc);
+
+	if (i915.enable_guc_submission)
+		i915_guc_submission_fini(dev_priv);
+}
diff --git a/drivers/gpu/drm/i915/intel_guc.h b/drivers/gpu/drm/i915/intel_guc.h
index e912667..4d7561c 100644
--- a/drivers/gpu/drm/i915/intel_guc.h
+++ b/drivers/gpu/drm/i915/intel_guc.h
@@ -153,6 +153,9 @@  static inline u32 guc_ggtt_offset(struct i915_vma *vma)
 int intel_guc_runtime_resume(struct intel_guc *guc);
 int intel_guc_system_suspend(struct intel_guc *guc);
 int intel_guc_system_resume(struct intel_guc *guc);
+void intel_guc_fini_hw(struct intel_guc *guc);
+void intel_guc_capture_load_err_log(struct intel_guc *guc);
+void intel_guc_cleanup(struct intel_guc *guc);
 
 /* intel_guc_loader.c */
 int intel_guc_select_fw(struct intel_guc *guc);
diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c
index 759d8f4e..52e55e1 100644
--- a/drivers/gpu/drm/i915/intel_uc.c
+++ b/drivers/gpu/drm/i915/intel_uc.c
@@ -250,23 +250,6 @@  void intel_uc_fini_fw(struct drm_i915_private *dev_priv)
 	__intel_uc_fw_fini(&dev_priv->huc.fw);
 }
 
-static void guc_capture_load_err_log(struct intel_guc *guc)
-{
-	if (!guc->log.vma || i915.guc_log_level < 0)
-		return;
-
-	if (!guc->load_err_log)
-		guc->load_err_log = i915_gem_object_get(guc->log.vma->obj);
-
-	return;
-}
-
-static void guc_free_load_err_log(struct intel_guc *guc)
-{
-	if (guc->load_err_log)
-		i915_gem_object_put(guc->load_err_log);
-}
-
 static int guc_enable_communication(struct intel_guc *guc)
 {
 	struct drm_i915_private *dev_priv = guc_to_i915(guc);
@@ -377,7 +360,7 @@  int intel_uc_init_hw(struct drm_i915_private *dev_priv)
 	guc_disable_communication(guc);
 	gen9_disable_guc_interrupts(dev_priv);
 err_log_capture:
-	guc_capture_load_err_log(guc);
+	intel_guc_capture_load_err_log(guc);
 err_submission:
 	if (i915.enable_guc_submission)
 		i915_guc_submission_fini(dev_priv);
@@ -403,22 +386,12 @@  int intel_uc_init_hw(struct drm_i915_private *dev_priv)
 
 void intel_uc_fini_hw(struct drm_i915_private *dev_priv)
 {
-	guc_free_load_err_log(&dev_priv->guc);
-
-	if (!i915.enable_guc_loading)
-		return;
-
-	if (i915.enable_guc_submission)
-		i915_guc_submission_disable(dev_priv);
-
-	guc_disable_communication(&dev_priv->guc);
-
-	if (i915.enable_guc_submission) {
-		gen9_disable_guc_interrupts(dev_priv);
-		i915_guc_submission_fini(dev_priv);
-	}
+	intel_guc_fini_hw(&dev_priv->guc);
+}
 
-	i915_ggtt_disable_guc(dev_priv);
+void intel_uc_cleanup(struct drm_i915_private *dev_priv)
+{
+	intel_guc_cleanup(&dev_priv->guc);
 }
 
 void intel_uc_reset_prepare(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index 4d49a19..9fd8626 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -105,6 +105,7 @@  struct intel_uc_fw {
 void intel_uc_fini_fw(struct drm_i915_private *dev_priv);
 int intel_uc_init_hw(struct drm_i915_private *dev_priv);
 void intel_uc_fini_hw(struct drm_i915_private *dev_priv);
+void intel_uc_cleanup(struct drm_i915_private *dev_priv);
 void intel_uc_reset_prepare(struct drm_i915_private *dev_priv);
 void intel_uc_runtime_suspend(struct drm_i915_private *dev_priv);
 int intel_uc_runtime_resume(struct drm_i915_private *dev_priv);