diff mbox series

[1/2] drm/i915/guc: fix GuC suspend/resume

Message ID 20181015221009.3710-1-daniele.ceraolospurio@intel.com (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/i915/guc: fix GuC suspend/resume | expand

Commit Message

Daniele Ceraolo Spurio Oct. 15, 2018, 10:10 p.m. UTC
The ENTER/EXIT_S_STATE actions queue the save/restore operation in GuC
FW and then return, so waiting on the H2H is not enough to guarantee
GuC is done.
When all the processing is done, GuC writes 0 to scratch register 14,
so we can poll on that. Note that GuC does not ensure that the value
in the register is different from 0 while the action is in progress
so we need to take care of that ourselves as well.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/intel_guc.c      | 28 +++++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_guc_fwif.h |  6 ++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

Comments

Daniele Ceraolo Spurio Oct. 15, 2018, 11:43 p.m. UTC | #1
On 15/10/18 15:47, Patchwork wrote:
> == Series Details ==
> 
> Series: series starting with [1/2] drm/i915/guc: fix GuC suspend/resume
> URL   : https://patchwork.freedesktop.org/series/51033/
> State : failure
> 
> == Summary ==
> 
> = CI Bug Log - changes from CI_DRM_4984 -> Patchwork_10464 =
> 
> == Summary - FAILURE ==
> 
>    Serious unknown changes coming with Patchwork_10464 absolutely need to be
>    verified manually.
>    
>    If you think the reported changes have nothing to do with the changes
>    introduced in Patchwork_10464, please notify your bug team to allow them
>    to document this new failure mode, which will reduce false positives in CI.
> 
>    External URL: https://patchwork.freedesktop.org/api/1.0/series/51033/revisions/1/mbox/
> 
> == Possible new issues ==
> 
>    Here are the unknown changes that may have been introduced in Patchwork_10464:
> 
>    === IGT changes ===
> 
>      ==== Possible regressions ====
> 
>      igt@drv_selftest@live_execlists:
>        fi-skl-6700hq:      PASS -> INCOMPLETE
> 

Log seem to be cut for this one. Since it is stopping inside 
live_preempt_smoke it is probably a known issue that Chris mentioned.
Can't reproduce on my skylake even with the test in a loop.

>      igt@drv_selftest@live_guc:
>        fi-kbl-7567u:       PASS -> DMESG-WARN
>        fi-skl-6600u:       PASS -> DMESG-WARN
>        fi-skl-gvtdvm:      PASS -> DMESG-WARN
>        fi-skl-iommu:       PASS -> DMESG-WARN
>        fi-skl-6260u:       PASS -> DMESG-WARN
>        fi-bxt-dsi:         PASS -> DMESG-WARN
>        fi-skl-6700k2:      PASS -> DMESG-WARN
>        fi-whl-u:           PASS -> DMESG-WARN
>        fi-skl-6770hq:      PASS -> DMESG-WARN
>        fi-kbl-7560u:       PASS -> DMESG-WARN
>        fi-kbl-8809g:       PASS -> DMESG-WARN
>        fi-kbl-r:           PASS -> DMESG-WARN
>        fi-kbl-x1275:       PASS -> DMESG-WARN
>        fi-bxt-j4205:       PASS -> DMESG-WARN
>        fi-cfl-s3:          PASS -> DMESG-WARN
>        fi-cfl-8109u:       PASS -> DMESG-WARN
>        fi-kbl-7500u:       PASS -> DMESG-WARN
>        fi-cfl-8700k:       PASS -> DMESG-WARN

These are all:

[drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x10 failed 
with error -5 0xf000f000

Which is not a real failure since the test is triggering it on purpose

> 
>      igt@drv_selftest@live_hangcheck:
>        fi-skl-gvtdvm:      PASS -> DMESG-FAIL
> 

<7> [464.966238] [drm:guc_fw_xfer [i915]] GuC status 0x20
<3> [464.966361] [drm:guc_fw_xfer [i915]] *ERROR* GuC firmware xfer 
error -110

This looks like GuC is stuck very early in the boot flow (even before 
the RSA check). On SKL there are known issues that could cause this and 
we should reset GuC and retry, but we aren't. Looks like we indirectly 
stopped applying  WaEnableuKernelHeaderValidFix and 
WaEnableGuCBootHashCheckNotSet by not returning -EAGAIN from 
intel_guc_fw_upload in any case. Michal?

Thanks,
Daniele

>      
> == Known issues ==
> 
>    Here are the changes found in Patchwork_10464 that come from known issues:
> 
>    === IGT changes ===
> 
>      ==== Issues hit ====
> 
>      igt@drv_selftest@live_guc:
>        {fi-apl-guc}:       NOTRUN -> DMESG-WARN (fdo#107258)
> 
>      igt@gem_exec_suspend@basic-s4-devices:
>        fi-blb-e6850:       PASS -> INCOMPLETE (fdo#107718)
> 
>      igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
>        fi-snb-2520m:       PASS -> DMESG-FAIL (fdo#103713)
> 
>      igt@kms_setmode@basic-clone-single-crtc:
>        fi-snb-2520m:       PASS -> DMESG-WARN (fdo#103713)
> 
>      igt@pm_backlight@basic-brightness:
>        fi-snb-2520m:       PASS -> INCOMPLETE (fdo#103713)
> 
>      
>      ==== Possible fixes ====
> 
>      igt@drv_selftest@live_gem:
>        {fi-apl-guc}:       INCOMPLETE (fdo#106693) -> PASS
> 
>      igt@kms_frontbuffer_tracking@basic:
>        fi-byt-clapper:     FAIL (fdo#103167) -> PASS
> 
>      
>    {name}: This element is suppressed. This means it is ignored when computing
>            the status of the difference (SUCCESS, WARNING, or FAILURE).
> 
>    fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
>    fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
>    fdo#106693 https://bugs.freedesktop.org/show_bug.cgi?id=106693
>    fdo#107258 https://bugs.freedesktop.org/show_bug.cgi?id=107258
>    fdo#107718 https://bugs.freedesktop.org/show_bug.cgi?id=107718
> 
> 
> == Participating hosts (53 -> 47) ==
> 
>    Missing    (6): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600
> 
> 
> == Build changes ==
> 
>      * Linux: CI_DRM_4984 -> Patchwork_10464
> 
>    CI_DRM_4984: 90b59df999a13a6405f8d7ece08a69120a9b361a @ git://anongit.freedesktop.org/gfx-ci/linux
>    IGT_4678: 9310a1265ceabeec736bdf0a76e1e0357c76c0b1 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
>    Patchwork_10464: c88fb110ee8261c636d63f4f6d9fa9440891b3a6 @ git://anongit.freedesktop.org/gfx-ci/linux
> 
> 
> == Linux commits ==
> 
> c88fb110ee82 HAX enable GuC for CI
> 4454d4d05ce3 drm/i915/guc: fix GuC suspend/resume
> 
> == Logs ==
> 
> For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10464/issues.html
>
Chris Wilson Oct. 16, 2018, 8:49 a.m. UTC | #2
Quoting Daniele Ceraolo Spurio (2018-10-16 00:43:59)
> 
> 
> On 15/10/18 15:47, Patchwork wrote:
> > == Series Details ==
> > 
> > Series: series starting with [1/2] drm/i915/guc: fix GuC suspend/resume
> > URL   : https://patchwork.freedesktop.org/series/51033/
> > State : failure
> > 
> > == Summary ==
> > 
> > = CI Bug Log - changes from CI_DRM_4984 -> Patchwork_10464 =
> > 
> > == Summary - FAILURE ==
> > 
> >    Serious unknown changes coming with Patchwork_10464 absolutely need to be
> >    verified manually.
> >    
> >    If you think the reported changes have nothing to do with the changes
> >    introduced in Patchwork_10464, please notify your bug team to allow them
> >    to document this new failure mode, which will reduce false positives in CI.
> > 
> >    External URL: https://patchwork.freedesktop.org/api/1.0/series/51033/revisions/1/mbox/
> > 
> > == Possible new issues ==
> > 
> >    Here are the unknown changes that may have been introduced in Patchwork_10464:
> > 
> >    === IGT changes ===
> > 
> >      ==== Possible regressions ====
> > 
> >      igt@drv_selftest@live_execlists:
> >        fi-skl-6700hq:      PASS -> INCOMPLETE
> > 
> 
> Log seem to be cut for this one. Since it is stopping inside 
> live_preempt_smoke it is probably a known issue that Chris mentioned.
> Can't reproduce on my skylake even with the test in a loop.

Yeah, if we have the oops report for that one it might have been
clearer.

> > 
> >      igt@drv_selftest@live_hangcheck:
> >        fi-skl-gvtdvm:      PASS -> DMESG-FAIL
> > 
> 
> <7> [464.966238] [drm:guc_fw_xfer [i915]] GuC status 0x20
> <3> [464.966361] [drm:guc_fw_xfer [i915]] *ERROR* GuC firmware xfer 
> error -110
> 
> This looks like GuC is stuck very early in the boot flow (even before 
> the RSA check). On SKL there are known issues that could cause this and 
> we should reset GuC and retry, but we aren't. Looks like we indirectly 
> stopped applying  WaEnableuKernelHeaderValidFix and 
> WaEnableGuCBootHashCheckNotSet by not returning -EAGAIN from 
> intel_guc_fw_upload in any case. Michal?

That would be useful to fix, as it does fail regularly enough to be a
nuisance.

More importantly, no live_gem failures across the board so the patch does
what it says on the tin. Nice work.
-Chris
Chris Wilson Oct. 16, 2018, 8:53 a.m. UTC | #3
Quoting Daniele Ceraolo Spurio (2018-10-15 23:10:08)
> The ENTER/EXIT_S_STATE actions queue the save/restore operation in GuC
> FW and then return, so waiting on the H2H is not enough to guarantee
> GuC is done.
> When all the processing is done, GuC writes 0 to scratch register 14,
> so we can poll on that. Note that GuC does not ensure that the value
> in the register is different from 0 while the action is in progress
> so we need to take care of that ourselves as well.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Explains itself very well,
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Michal Wajdeczko Oct. 16, 2018, 9:12 a.m. UTC | #4
On Tue, 16 Oct 2018 10:49:54 +0200, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Daniele Ceraolo Spurio (2018-10-16 00:43:59)
...
>> >
>> >      igt@drv_selftest@live_hangcheck:
>> >        fi-skl-gvtdvm:      PASS -> DMESG-FAIL
>> >
>>
>> <7> [464.966238] [drm:guc_fw_xfer [i915]] GuC status 0x20
>> <3> [464.966361] [drm:guc_fw_xfer [i915]] *ERROR* GuC firmware xfer
>> error -110
>>
>> This looks like GuC is stuck very early in the boot flow (even before
>> the RSA check). On SKL there are known issues that could cause this and
>> we should reset GuC and retry, but we aren't. Looks like we indirectly
>> stopped applying  WaEnableuKernelHeaderValidFix and
>> WaEnableGuCBootHashCheckNotSet by not returning -EAGAIN from
>> intel_guc_fw_upload in any case. Michal?
>
> That would be useful to fix, as it does fail regularly enough to be a
> nuisance.

Proposed fix is here [1]. Note that earlier patch was trying to avoid the
case when we were repeating loading of the GuC firmware that simply didn't
pass signature verification (not fixable that way ;)

Michal

[1] https://patchwork.freedesktop.org/patch/256913/
Daniel Vetter Oct. 16, 2018, 9:21 a.m. UTC | #5
On Tue, Oct 16, 2018 at 1:44 AM Daniele Ceraolo Spurio
<daniele.ceraolospurio@intel.com> wrote:
>
>
>
> On 15/10/18 15:47, Patchwork wrote:
> > == Series Details ==
> >
> > Series: series starting with [1/2] drm/i915/guc: fix GuC suspend/resume
> > URL   : https://patchwork.freedesktop.org/series/51033/
> > State : failure
> >
> > == Summary ==
> >
> > = CI Bug Log - changes from CI_DRM_4984 -> Patchwork_10464 =
> >
> > == Summary - FAILURE ==
> >
> >    Serious unknown changes coming with Patchwork_10464 absolutely need to be
> >    verified manually.
> >
> >    If you think the reported changes have nothing to do with the changes
> >    introduced in Patchwork_10464, please notify your bug team to allow them
> >    to document this new failure mode, which will reduce false positives in CI.
> >
> >    External URL: https://patchwork.freedesktop.org/api/1.0/series/51033/revisions/1/mbox/
> >
> > == Possible new issues ==
> >
> >    Here are the unknown changes that may have been introduced in Patchwork_10464:
> >
> >    === IGT changes ===
> >
> >      ==== Possible regressions ====
> >
> >      igt@drv_selftest@live_execlists:
> >        fi-skl-6700hq:      PASS -> INCOMPLETE
> >
>
> Log seem to be cut for this one. Since it is stopping inside
> live_preempt_smoke it is probably a known issue that Chris mentioned.
> Can't reproduce on my skylake even with the test in a loop.
>
> >      igt@drv_selftest@live_guc:
> >        fi-kbl-7567u:       PASS -> DMESG-WARN
> >        fi-skl-6600u:       PASS -> DMESG-WARN
> >        fi-skl-gvtdvm:      PASS -> DMESG-WARN
> >        fi-skl-iommu:       PASS -> DMESG-WARN
> >        fi-skl-6260u:       PASS -> DMESG-WARN
> >        fi-bxt-dsi:         PASS -> DMESG-WARN
> >        fi-skl-6700k2:      PASS -> DMESG-WARN
> >        fi-whl-u:           PASS -> DMESG-WARN
> >        fi-skl-6770hq:      PASS -> DMESG-WARN
> >        fi-kbl-7560u:       PASS -> DMESG-WARN
> >        fi-kbl-8809g:       PASS -> DMESG-WARN
> >        fi-kbl-r:           PASS -> DMESG-WARN
> >        fi-kbl-x1275:       PASS -> DMESG-WARN
> >        fi-bxt-j4205:       PASS -> DMESG-WARN
> >        fi-cfl-s3:          PASS -> DMESG-WARN
> >        fi-cfl-8109u:       PASS -> DMESG-WARN
> >        fi-kbl-7500u:       PASS -> DMESG-WARN
> >        fi-cfl-8700k:       PASS -> DMESG-WARN
>
> These are all:
>
> [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x10 failed
> with error -5 0xf000f000
>
> Which is not a real failure since the test is triggering it on purpose

You still need to make them shut up. dmesg errors should only be used
for stuff we really don't expect. E.g. gpu hangs provoked by igt also
don't result in dmesg errors/warnings and failed tests.
-Daniel

> >
> >      igt@drv_selftest@live_hangcheck:
> >        fi-skl-gvtdvm:      PASS -> DMESG-FAIL
> >
>
> <7> [464.966238] [drm:guc_fw_xfer [i915]] GuC status 0x20
> <3> [464.966361] [drm:guc_fw_xfer [i915]] *ERROR* GuC firmware xfer
> error -110
>
> This looks like GuC is stuck very early in the boot flow (even before
> the RSA check). On SKL there are known issues that could cause this and
> we should reset GuC and retry, but we aren't. Looks like we indirectly
> stopped applying  WaEnableuKernelHeaderValidFix and
> WaEnableGuCBootHashCheckNotSet by not returning -EAGAIN from
> intel_guc_fw_upload in any case. Michal?
>
> Thanks,
> Daniele
>
> >
> > == Known issues ==
> >
> >    Here are the changes found in Patchwork_10464 that come from known issues:
> >
> >    === IGT changes ===
> >
> >      ==== Issues hit ====
> >
> >      igt@drv_selftest@live_guc:
> >        {fi-apl-guc}:       NOTRUN -> DMESG-WARN (fdo#107258)
> >
> >      igt@gem_exec_suspend@basic-s4-devices:
> >        fi-blb-e6850:       PASS -> INCOMPLETE (fdo#107718)
> >
> >      igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
> >        fi-snb-2520m:       PASS -> DMESG-FAIL (fdo#103713)
> >
> >      igt@kms_setmode@basic-clone-single-crtc:
> >        fi-snb-2520m:       PASS -> DMESG-WARN (fdo#103713)
> >
> >      igt@pm_backlight@basic-brightness:
> >        fi-snb-2520m:       PASS -> INCOMPLETE (fdo#103713)
> >
> >
> >      ==== Possible fixes ====
> >
> >      igt@drv_selftest@live_gem:
> >        {fi-apl-guc}:       INCOMPLETE (fdo#106693) -> PASS
> >
> >      igt@kms_frontbuffer_tracking@basic:
> >        fi-byt-clapper:     FAIL (fdo#103167) -> PASS
> >
> >
> >    {name}: This element is suppressed. This means it is ignored when computing
> >            the status of the difference (SUCCESS, WARNING, or FAILURE).
> >
> >    fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
> >    fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
> >    fdo#106693 https://bugs.freedesktop.org/show_bug.cgi?id=106693
> >    fdo#107258 https://bugs.freedesktop.org/show_bug.cgi?id=107258
> >    fdo#107718 https://bugs.freedesktop.org/show_bug.cgi?id=107718
> >
> >
> > == Participating hosts (53 -> 47) ==
> >
> >    Missing    (6): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600
> >
> >
> > == Build changes ==
> >
> >      * Linux: CI_DRM_4984 -> Patchwork_10464
> >
> >    CI_DRM_4984: 90b59df999a13a6405f8d7ece08a69120a9b361a @ git://anongit.freedesktop.org/gfx-ci/linux
> >    IGT_4678: 9310a1265ceabeec736bdf0a76e1e0357c76c0b1 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
> >    Patchwork_10464: c88fb110ee8261c636d63f4f6d9fa9440891b3a6 @ git://anongit.freedesktop.org/gfx-ci/linux
> >
> >
> > == Linux commits ==
> >
> > c88fb110ee82 HAX enable GuC for CI
> > 4454d4d05ce3 drm/i915/guc: fix GuC suspend/resume
> >
> > == Logs ==
> >
> > For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10464/issues.html
> >
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Michal Wajdeczko Oct. 16, 2018, 10:26 a.m. UTC | #6
On Tue, 16 Oct 2018 00:10:08 +0200, Daniele Ceraolo Spurio  
<daniele.ceraolospurio@intel.com> wrote:

> The ENTER/EXIT_S_STATE actions queue the save/restore operation in GuC
> FW and then return, so waiting on the H2H is not enough to guarantee
> GuC is done.
> When all the processing is done, GuC writes 0 to scratch register 14,
> so we can poll on that. Note that GuC does not ensure that the value
> in the register is different from 0 while the action is in progress
> so we need to take care of that ourselves as well.
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_guc.c      | 28 +++++++++++++++++++++++++--
>  drivers/gpu/drm/i915/intel_guc_fwif.h |  6 ++++++
>  2 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_guc.c  
> b/drivers/gpu/drm/i915/intel_guc.c
> index 230aea69385d..f238cd7a9dcf 100644
> --- a/drivers/gpu/drm/i915/intel_guc.c
> +++ b/drivers/gpu/drm/i915/intel_guc.c
> @@ -521,6 +521,30 @@ int intel_guc_auth_huc(struct intel_guc *guc, u32  
> rsa_offset)
>  	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>  }
> +/*
> + * The ENTER/EXIT_S_STATE actions queue the save/restore operation in  
> GuC FW and
> + * then return, so waiting on the H2H is not enough to guarantee GuC is  
> done.
> + * When all the processing is done, GuC writes 0 to scratch register  
> 14, so we

s/writes 0/writes INTEL_GUC_SLEEP_STATE_SUCCESS ?

> + * can poll on that. Note that GuC does not ensure that the value in the
> + * register is different from 0 while the action is in progress so we  
> need to
> + * take care of that ourselves as well.
> + */
> +static int guc_sleep_state_action(struct intel_guc *guc,
> +				  const u32 *action, u32 len)
> +{
> +	struct drm_i915_private *dev_priv = guc_to_i915(guc);
> +	int ret;
> +
> +	I915_WRITE(SOFT_SCRATCH(14), ~0x0);

Do we want to add dedicated name for that scratch register?

Also, as GuC is using scratch[14] then we need to remove it from our send
register pool - patch is here [1]

> +
> +	ret = intel_guc_send(guc, action, len);
> +	if (ret)
> +		return ret;
> +
> +	return intel_wait_for_register(dev_priv, SOFT_SCRATCH(14), ~0x0,
> +				       INTEL_GUC_SLEEP_STATE_SUCCESS, 10);

Maybe it is worth to use __intel_wait_for_register() and print sleep
state error code in case of failure ? And do we really need to wait
full 10ms for SUCCESS if GuC already has reported PREEMPT_TO_IDLE_FAILED
or ENGINE_RESET_FAILED ?

	u32 state;

	ret = __intel_wait_for_register(dev_priv, SOFT_SCRATCH(14), 0x80000000,
				        0, 10, &state);
	if (ret)
		return ret;
	if (status != INTEL_GUC_SLEEP_STATE_SUCCESS) {
		DRM_ERROR("... %u\n", state);
		return -EIO;
	}
	return 0;

> +}
> +
>  /**
>   * intel_guc_suspend() - notify GuC entering suspend state
>   * @guc:	the guc
> @@ -533,7 +557,7 @@ int intel_guc_suspend(struct intel_guc *guc)
>  		intel_guc_ggtt_offset(guc, guc->shared_data)
>  	};
> -	return intel_guc_send(guc, data, ARRAY_SIZE(data));
> +	return guc_sleep_state_action(guc, data, ARRAY_SIZE(data));
>  }
> /**
> @@ -571,7 +595,7 @@ int intel_guc_resume(struct intel_guc *guc)
>  		intel_guc_ggtt_offset(guc, guc->shared_data)
>  	};
> -	return intel_guc_send(guc, data, ARRAY_SIZE(data));
> +	return guc_sleep_state_action(guc, data, ARRAY_SIZE(data));
>  }
> /**
> diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h  
> b/drivers/gpu/drm/i915/intel_guc_fwif.h
> index 8382d591c784..b0eb5aabe0a7 100644
> --- a/drivers/gpu/drm/i915/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
> @@ -687,6 +687,12 @@ enum intel_guc_report_status {
>  	INTEL_GUC_REPORT_STATUS_COMPLETE = 0x4,
>  };
> +enum intel_guc_sleep_state_status {
> +	INTEL_GUC_SLEEP_STATE_SUCCESS = 0x0,
> +	INTEL_GUC_SLEEP_STATE_PREEMPT_TO_IDLE_FAILED = 0x1,
> +	INTEL_GUC_SLEEP_STATE_ENGINE_RESET_FAILED = 0x2

as we waiting for state change, maybe we should explicitly define
	INTEL_GUC_SLEEP_STATE_INVALID_MASK = 0x80000000,
and use it in __intel_wait_for_register ?

> +};
> +
>  #define GUC_LOG_CONTROL_LOGGING_ENABLED	(1 << 0)
>  #define GUC_LOG_CONTROL_VERBOSITY_SHIFT	4
>  #define GUC_LOG_CONTROL_VERBOSITY_MASK	(0xF <<  
> GUC_LOG_CONTROL_VERBOSITY_SHIFT)

Thanks,
Michal

[1] https://patchwork.freedesktop.org/patch/256921/
Daniele Ceraolo Spurio Oct. 16, 2018, 4:35 p.m. UTC | #7
On 10/16/2018 3:26 AM, Michal Wajdeczko wrote:
> On Tue, 16 Oct 2018 00:10:08 +0200, Daniele Ceraolo Spurio 
> <daniele.ceraolospurio@intel.com> wrote:
>
>> The ENTER/EXIT_S_STATE actions queue the save/restore operation in GuC
>> FW and then return, so waiting on the H2H is not enough to guarantee
>> GuC is done.
>> When all the processing is done, GuC writes 0 to scratch register 14,
>> so we can poll on that. Note that GuC does not ensure that the value
>> in the register is different from 0 while the action is in progress
>> so we need to take care of that ourselves as well.
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_guc.c      | 28 +++++++++++++++++++++++++--
>>  drivers/gpu/drm/i915/intel_guc_fwif.h |  6 ++++++
>>  2 files changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_guc.c 
>> b/drivers/gpu/drm/i915/intel_guc.c
>> index 230aea69385d..f238cd7a9dcf 100644
>> --- a/drivers/gpu/drm/i915/intel_guc.c
>> +++ b/drivers/gpu/drm/i915/intel_guc.c
>> @@ -521,6 +521,30 @@ int intel_guc_auth_huc(struct intel_guc *guc, 
>> u32 rsa_offset)
>>      return intel_guc_send(guc, action, ARRAY_SIZE(action));
>>  }
>> +/*
>> + * The ENTER/EXIT_S_STATE actions queue the save/restore operation 
>> in GuC FW and
>> + * then return, so waiting on the H2H is not enough to guarantee GuC 
>> is done.
>> + * When all the processing is done, GuC writes 0 to scratch register 
>> 14, so we
>
> s/writes 0/writes INTEL_GUC_SLEEP_STATE_SUCCESS ?
>
>> + * can poll on that. Note that GuC does not ensure that the value in 
>> the
>> + * register is different from 0 while the action is in progress so 
>> we need to
>> + * take care of that ourselves as well.
>> + */
>> +static int guc_sleep_state_action(struct intel_guc *guc,
>> +                  const u32 *action, u32 len)
>> +{
>> +    struct drm_i915_private *dev_priv = guc_to_i915(guc);
>> +    int ret;
>> +
>> +    I915_WRITE(SOFT_SCRATCH(14), ~0x0);
>
> Do we want to add dedicated name for that scratch register?

As I replied on your patch, the register is used for other purposes in 
other actions so I don't think it is a good idea to rename it.

>
> Also, as GuC is using scratch[14] then we need to remove it from our send
> register pool - patch is here [1]
>
>> +
>> +    ret = intel_guc_send(guc, action, len);
>> +    if (ret)
>> +        return ret;
>> +
>> +    return intel_wait_for_register(dev_priv, SOFT_SCRATCH(14), ~0x0,
>> +                       INTEL_GUC_SLEEP_STATE_SUCCESS, 10);
>
> Maybe it is worth to use __intel_wait_for_register() and print sleep
> state error code in case of failure ? And do we really need to wait
> full 10ms for SUCCESS if GuC already has reported PREEMPT_TO_IDLE_FAILED
> or ENGINE_RESET_FAILED ?
>
>     u32 state;
>
>     ret = __intel_wait_for_register(dev_priv, SOFT_SCRATCH(14), 
> 0x80000000,
>                         0, 10, &state);
>     if (ret)
>         return ret;
>     if (status != INTEL_GUC_SLEEP_STATE_SUCCESS) {
>         DRM_ERROR("... %u\n", state);
>         return -EIO;
>     }
>     return 0;
>

It should be impossible for us to get PREEMPT_TO_IDLE_FAILED or 
ENGINE_RESET_FAILED because those only come in play if guc_suspend() is 
called while there are outstanding request inside GuC. However, no real 
downsides in going with your solution either so I'll do the changes ;)

>> +}
>> +
>>  /**
>>   * intel_guc_suspend() - notify GuC entering suspend state
>>   * @guc:    the guc
>> @@ -533,7 +557,7 @@ int intel_guc_suspend(struct intel_guc *guc)
>>          intel_guc_ggtt_offset(guc, guc->shared_data)
>>      };
>> -    return intel_guc_send(guc, data, ARRAY_SIZE(data));
>> +    return guc_sleep_state_action(guc, data, ARRAY_SIZE(data));
>>  }
>> /**
>> @@ -571,7 +595,7 @@ int intel_guc_resume(struct intel_guc *guc)
>>          intel_guc_ggtt_offset(guc, guc->shared_data)
>>      };
>> -    return intel_guc_send(guc, data, ARRAY_SIZE(data));
>> +    return guc_sleep_state_action(guc, data, ARRAY_SIZE(data));
>>  }
>> /**
>> diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h 
>> b/drivers/gpu/drm/i915/intel_guc_fwif.h
>> index 8382d591c784..b0eb5aabe0a7 100644
>> --- a/drivers/gpu/drm/i915/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
>> @@ -687,6 +687,12 @@ enum intel_guc_report_status {
>>      INTEL_GUC_REPORT_STATUS_COMPLETE = 0x4,
>>  };
>> +enum intel_guc_sleep_state_status {
>> +    INTEL_GUC_SLEEP_STATE_SUCCESS = 0x0,
>> +    INTEL_GUC_SLEEP_STATE_PREEMPT_TO_IDLE_FAILED = 0x1,
>> +    INTEL_GUC_SLEEP_STATE_ENGINE_RESET_FAILED = 0x2
>
> as we waiting for state change, maybe we should explicitly define
>     INTEL_GUC_SLEEP_STATE_INVALID_MASK = 0x80000000,
> and use it in __intel_wait_for_register ?

ack

>
>> +};
>> +
>>  #define GUC_LOG_CONTROL_LOGGING_ENABLED    (1 << 0)
>>  #define GUC_LOG_CONTROL_VERBOSITY_SHIFT    4
>>  #define GUC_LOG_CONTROL_VERBOSITY_MASK    (0xF << 
>> GUC_LOG_CONTROL_VERBOSITY_SHIFT)
>

Note for you while I'm it: I've been told that the gen11 GuC FW has a 
simplified and improved flow for suspend/resume, so some changes will be 
required in your series. Not sure about the details.

Thanks,
Daniele

> Thanks,
> Michal
>
> [1] https://patchwork.freedesktop.org/patch/256921/
Daniele Ceraolo Spurio Oct. 16, 2018, 5:15 p.m. UTC | #8
On 10/16/2018 2:21 AM, Daniel Vetter wrote:
> On Tue, Oct 16, 2018 at 1:44 AM Daniele Ceraolo Spurio
> <daniele.ceraolospurio@intel.com> wrote:
>>
>>
>> On 15/10/18 15:47, Patchwork wrote:
>>> == Series Details ==
>>>
>>> Series: series starting with [1/2] drm/i915/guc: fix GuC suspend/resume
>>> URL   : https://patchwork.freedesktop.org/series/51033/
>>> State : failure
>>>
>>> == Summary ==
>>>
>>> = CI Bug Log - changes from CI_DRM_4984 -> Patchwork_10464 =
>>>
>>> == Summary - FAILURE ==
>>>
>>>     Serious unknown changes coming with Patchwork_10464 absolutely need to be
>>>     verified manually.
>>>
>>>     If you think the reported changes have nothing to do with the changes
>>>     introduced in Patchwork_10464, please notify your bug team to allow them
>>>     to document this new failure mode, which will reduce false positives in CI.
>>>
>>>     External URL: https://patchwork.freedesktop.org/api/1.0/series/51033/revisions/1/mbox/
>>>
>>> == Possible new issues ==
>>>
>>>     Here are the unknown changes that may have been introduced in Patchwork_10464:
>>>
>>>     === IGT changes ===
>>>
>>>       ==== Possible regressions ====
>>>
>>>       igt@drv_selftest@live_execlists:
>>>         fi-skl-6700hq:      PASS -> INCOMPLETE
>>>
>> Log seem to be cut for this one. Since it is stopping inside
>> live_preempt_smoke it is probably a known issue that Chris mentioned.
>> Can't reproduce on my skylake even with the test in a loop.
>>
>>>       igt@drv_selftest@live_guc:
>>>         fi-kbl-7567u:       PASS -> DMESG-WARN
>>>         fi-skl-6600u:       PASS -> DMESG-WARN
>>>         fi-skl-gvtdvm:      PASS -> DMESG-WARN
>>>         fi-skl-iommu:       PASS -> DMESG-WARN
>>>         fi-skl-6260u:       PASS -> DMESG-WARN
>>>         fi-bxt-dsi:         PASS -> DMESG-WARN
>>>         fi-skl-6700k2:      PASS -> DMESG-WARN
>>>         fi-whl-u:           PASS -> DMESG-WARN
>>>         fi-skl-6770hq:      PASS -> DMESG-WARN
>>>         fi-kbl-7560u:       PASS -> DMESG-WARN
>>>         fi-kbl-8809g:       PASS -> DMESG-WARN
>>>         fi-kbl-r:           PASS -> DMESG-WARN
>>>         fi-kbl-x1275:       PASS -> DMESG-WARN
>>>         fi-bxt-j4205:       PASS -> DMESG-WARN
>>>         fi-cfl-s3:          PASS -> DMESG-WARN
>>>         fi-cfl-8109u:       PASS -> DMESG-WARN
>>>         fi-kbl-7500u:       PASS -> DMESG-WARN
>>>         fi-cfl-8700k:       PASS -> DMESG-WARN
>> These are all:
>>
>> [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x10 failed
>> with error -5 0xf000f000
>>
>> Which is not a real failure since the test is triggering it on purpose
> You still need to make them shut up. dmesg errors should only be used
> for stuff we really don't expect. E.g. gpu hangs provoked by igt also
> don't result in dmesg errors/warnings and failed tests.
> -Daniel

I wasn't trying to imply that we don't care that we have a failure or 
that we shouldn't make it shut up, just that it is not a regression 
introduced by this patch, because it doesn't even get near that code. I 
recall that there was a small discussion in the past about how to 
silence this, I'll try to dig it up and see if there was an agreed solution.

Daniele

>>>       igt@drv_selftest@live_hangcheck:
>>>         fi-skl-gvtdvm:      PASS -> DMESG-FAIL
>>>
>> <7> [464.966238] [drm:guc_fw_xfer [i915]] GuC status 0x20
>> <3> [464.966361] [drm:guc_fw_xfer [i915]] *ERROR* GuC firmware xfer
>> error -110
>>
>> This looks like GuC is stuck very early in the boot flow (even before
>> the RSA check). On SKL there are known issues that could cause this and
>> we should reset GuC and retry, but we aren't. Looks like we indirectly
>> stopped applying  WaEnableuKernelHeaderValidFix and
>> WaEnableGuCBootHashCheckNotSet by not returning -EAGAIN from
>> intel_guc_fw_upload in any case. Michal?
>>
>> Thanks,
>> Daniele
>>
>>> == Known issues ==
>>>
>>>     Here are the changes found in Patchwork_10464 that come from known issues:
>>>
>>>     === IGT changes ===
>>>
>>>       ==== Issues hit ====
>>>
>>>       igt@drv_selftest@live_guc:
>>>         {fi-apl-guc}:       NOTRUN -> DMESG-WARN (fdo#107258)
>>>
>>>       igt@gem_exec_suspend@basic-s4-devices:
>>>         fi-blb-e6850:       PASS -> INCOMPLETE (fdo#107718)
>>>
>>>       igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
>>>         fi-snb-2520m:       PASS -> DMESG-FAIL (fdo#103713)
>>>
>>>       igt@kms_setmode@basic-clone-single-crtc:
>>>         fi-snb-2520m:       PASS -> DMESG-WARN (fdo#103713)
>>>
>>>       igt@pm_backlight@basic-brightness:
>>>         fi-snb-2520m:       PASS -> INCOMPLETE (fdo#103713)
>>>
>>>
>>>       ==== Possible fixes ====
>>>
>>>       igt@drv_selftest@live_gem:
>>>         {fi-apl-guc}:       INCOMPLETE (fdo#106693) -> PASS
>>>
>>>       igt@kms_frontbuffer_tracking@basic:
>>>         fi-byt-clapper:     FAIL (fdo#103167) -> PASS
>>>
>>>
>>>     {name}: This element is suppressed. This means it is ignored when computing
>>>             the status of the difference (SUCCESS, WARNING, or FAILURE).
>>>
>>>     fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
>>>     fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
>>>     fdo#106693 https://bugs.freedesktop.org/show_bug.cgi?id=106693
>>>     fdo#107258 https://bugs.freedesktop.org/show_bug.cgi?id=107258
>>>     fdo#107718 https://bugs.freedesktop.org/show_bug.cgi?id=107718
>>>
>>>
>>> == Participating hosts (53 -> 47) ==
>>>
>>>     Missing    (6): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600
>>>
>>>
>>> == Build changes ==
>>>
>>>       * Linux: CI_DRM_4984 -> Patchwork_10464
>>>
>>>     CI_DRM_4984: 90b59df999a13a6405f8d7ece08a69120a9b361a @ git://anongit.freedesktop.org/gfx-ci/linux
>>>     IGT_4678: 9310a1265ceabeec736bdf0a76e1e0357c76c0b1 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
>>>     Patchwork_10464: c88fb110ee8261c636d63f4f6d9fa9440891b3a6 @ git://anongit.freedesktop.org/gfx-ci/linux
>>>
>>>
>>> == Linux commits ==
>>>
>>> c88fb110ee82 HAX enable GuC for CI
>>> 4454d4d05ce3 drm/i915/guc: fix GuC suspend/resume
>>>
>>> == Logs ==
>>>
>>> For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10464/issues.html
>>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
>
Michal Wajdeczko Oct. 16, 2018, 6:58 p.m. UTC | #9
On Tue, 16 Oct 2018 19:15:19 +0200, Daniele Ceraolo Spurio  
<daniele.ceraolospurio@intel.com> wrote:

>
>
> On 10/16/2018 2:21 AM, Daniel Vetter wrote:
>> On Tue, Oct 16, 2018 at 1:44 AM Daniele Ceraolo Spurio
>> <daniele.ceraolospurio@intel.com> wrote:
>>>
>>>
>>> On 15/10/18 15:47, Patchwork wrote:
>>>> == Series Details ==
>>>>
>>>> Series: series starting with [1/2] drm/i915/guc: fix GuC  
>>>> suspend/resume
>>>> URL   : https://patchwork.freedesktop.org/series/51033/
>>>> State : failure
>>>>
>>>> == Summary ==
>>>>
>>>> = CI Bug Log - changes from CI_DRM_4984 -> Patchwork_10464 =
>>>>
>>>> == Summary - FAILURE ==
>>>>
>>>>     Serious unknown changes coming with Patchwork_10464 absolutely  
>>>> need to be
>>>>     verified manually.
>>>>
>>>>     If you think the reported changes have nothing to do with the  
>>>> changes
>>>>     introduced in Patchwork_10464, please notify your bug team to  
>>>> allow them
>>>>     to document this new failure mode, which will reduce false  
>>>> positives in CI.
>>>>
>>>>     External URL:  
>>>> https://patchwork.freedesktop.org/api/1.0/series/51033/revisions/1/mbox/
>>>>
>>>> == Possible new issues ==
>>>>
>>>>     Here are the unknown changes that may have been introduced in  
>>>> Patchwork_10464:
>>>>
>>>>     === IGT changes ===
>>>>
>>>>       ==== Possible regressions ====
>>>>
>>>>       igt@drv_selftest@live_execlists:
>>>>         fi-skl-6700hq:      PASS -> INCOMPLETE
>>>>
>>> Log seem to be cut for this one. Since it is stopping inside
>>> live_preempt_smoke it is probably a known issue that Chris mentioned.
>>> Can't reproduce on my skylake even with the test in a loop.
>>>
>>>>       igt@drv_selftest@live_guc:
>>>>         fi-kbl-7567u:       PASS -> DMESG-WARN
>>>>         fi-skl-6600u:       PASS -> DMESG-WARN
>>>>         fi-skl-gvtdvm:      PASS -> DMESG-WARN
>>>>         fi-skl-iommu:       PASS -> DMESG-WARN
>>>>         fi-skl-6260u:       PASS -> DMESG-WARN
>>>>         fi-bxt-dsi:         PASS -> DMESG-WARN
>>>>         fi-skl-6700k2:      PASS -> DMESG-WARN
>>>>         fi-whl-u:           PASS -> DMESG-WARN
>>>>         fi-skl-6770hq:      PASS -> DMESG-WARN
>>>>         fi-kbl-7560u:       PASS -> DMESG-WARN
>>>>         fi-kbl-8809g:       PASS -> DMESG-WARN
>>>>         fi-kbl-r:           PASS -> DMESG-WARN
>>>>         fi-kbl-x1275:       PASS -> DMESG-WARN
>>>>         fi-bxt-j4205:       PASS -> DMESG-WARN
>>>>         fi-cfl-s3:          PASS -> DMESG-WARN
>>>>         fi-cfl-8109u:       PASS -> DMESG-WARN
>>>>         fi-kbl-7500u:       PASS -> DMESG-WARN
>>>>         fi-cfl-8700k:       PASS -> DMESG-WARN
>>> These are all:
>>>
>>> [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x10 failed
>>> with error -5 0xf000f000
>>>
>>> Which is not a real failure since the test is triggering it on purpose
>> You still need to make them shut up. dmesg errors should only be used
>> for stuff we really don't expect. E.g. gpu hangs provoked by igt also
>> don't result in dmesg errors/warnings and failed tests.
>> -Daniel
>
> I wasn't trying to imply that we don't care that we have a failure or  
> that we shouldn't make it shut up, just that it is not a regression  
> introduced by this patch, because it doesn't even get near that code. I  
> recall that there was a small discussion in the past about how to  
> silence this, I'll try to dig it up and see if there was an agreed  
> solution.

Preferred solution was to remove negative GuC tests from i915 selftests.

Note that this "low level" error message is our guard that we are always
correctly communicating with the GuC, no silent drop of unexpected GuC  
errors.

GuC negative testing shall be done by the fw team.

Michal

>
> Daniele
>
>>>>       igt@drv_selftest@live_hangcheck:
>>>>         fi-skl-gvtdvm:      PASS -> DMESG-FAIL
>>>>
>>> <7> [464.966238] [drm:guc_fw_xfer [i915]] GuC status 0x20
>>> <3> [464.966361] [drm:guc_fw_xfer [i915]] *ERROR* GuC firmware xfer
>>> error -110
>>>
>>> This looks like GuC is stuck very early in the boot flow (even before
>>> the RSA check). On SKL there are known issues that could cause this and
>>> we should reset GuC and retry, but we aren't. Looks like we indirectly
>>> stopped applying  WaEnableuKernelHeaderValidFix and
>>> WaEnableGuCBootHashCheckNotSet by not returning -EAGAIN from
>>> intel_guc_fw_upload in any case. Michal?
>>>
>>> Thanks,
>>> Daniele
>>>
>>>> == Known issues ==
>>>>
>>>>     Here are the changes found in Patchwork_10464 that come from  
>>>> known issues:
>>>>
>>>>     === IGT changes ===
>>>>
>>>>       ==== Issues hit ====
>>>>
>>>>       igt@drv_selftest@live_guc:
>>>>         {fi-apl-guc}:       NOTRUN -> DMESG-WARN (fdo#107258)
>>>>
>>>>       igt@gem_exec_suspend@basic-s4-devices:
>>>>         fi-blb-e6850:       PASS -> INCOMPLETE (fdo#107718)
>>>>
>>>>       igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
>>>>         fi-snb-2520m:       PASS -> DMESG-FAIL (fdo#103713)
>>>>
>>>>       igt@kms_setmode@basic-clone-single-crtc:
>>>>         fi-snb-2520m:       PASS -> DMESG-WARN (fdo#103713)
>>>>
>>>>       igt@pm_backlight@basic-brightness:
>>>>         fi-snb-2520m:       PASS -> INCOMPLETE (fdo#103713)
>>>>
>>>>
>>>>       ==== Possible fixes ====
>>>>
>>>>       igt@drv_selftest@live_gem:
>>>>         {fi-apl-guc}:       INCOMPLETE (fdo#106693) -> PASS
>>>>
>>>>       igt@kms_frontbuffer_tracking@basic:
>>>>         fi-byt-clapper:     FAIL (fdo#103167) -> PASS
>>>>
>>>>
>>>>     {name}: This element is suppressed. This means it is ignored when  
>>>> computing
>>>>             the status of the difference (SUCCESS, WARNING, or  
>>>> FAILURE).
>>>>
>>>>     fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
>>>>     fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
>>>>     fdo#106693 https://bugs.freedesktop.org/show_bug.cgi?id=106693
>>>>     fdo#107258 https://bugs.freedesktop.org/show_bug.cgi?id=107258
>>>>     fdo#107718 https://bugs.freedesktop.org/show_bug.cgi?id=107718
>>>>
>>>>
>>>> == Participating hosts (53 -> 47) ==
>>>>
>>>>     Missing    (6): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u  
>>>> fi-byt-squawks fi-bsw-cyan fi-ctg-p8600
>>>>
>>>>
>>>> == Build changes ==
>>>>
>>>>       * Linux: CI_DRM_4984 -> Patchwork_10464
>>>>
>>>>     CI_DRM_4984: 90b59df999a13a6405f8d7ece08a69120a9b361a @  
>>>> git://anongit.freedesktop.org/gfx-ci/linux
>>>>     IGT_4678: 9310a1265ceabeec736bdf0a76e1e0357c76c0b1 @  
>>>> git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
>>>>     Patchwork_10464: c88fb110ee8261c636d63f4f6d9fa9440891b3a6 @  
>>>> git://anongit.freedesktop.org/gfx-ci/linux
>>>>
>>>>
>>>> == Linux commits ==
>>>>
>>>> c88fb110ee82 HAX enable GuC for CI
>>>> 4454d4d05ce3 drm/i915/guc: fix GuC suspend/resume
>>>>
>>>> == Logs ==
>>>>
>>>> For more details see:  
>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_10464/issues.html
>>>>
>>> _______________________________________________
>>> Intel-gfx mailing list
>>> Intel-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>
>>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/intel_guc.c b/drivers/gpu/drm/i915/intel_guc.c
index 230aea69385d..f238cd7a9dcf 100644
--- a/drivers/gpu/drm/i915/intel_guc.c
+++ b/drivers/gpu/drm/i915/intel_guc.c
@@ -521,6 +521,30 @@  int intel_guc_auth_huc(struct intel_guc *guc, u32 rsa_offset)
 	return intel_guc_send(guc, action, ARRAY_SIZE(action));
 }
 
+/*
+ * The ENTER/EXIT_S_STATE actions queue the save/restore operation in GuC FW and
+ * then return, so waiting on the H2H is not enough to guarantee GuC is done.
+ * When all the processing is done, GuC writes 0 to scratch register 14, so we
+ * can poll on that. Note that GuC does not ensure that the value in the
+ * register is different from 0 while the action is in progress so we need to
+ * take care of that ourselves as well.
+ */
+static int guc_sleep_state_action(struct intel_guc *guc,
+				  const u32 *action, u32 len)
+{
+	struct drm_i915_private *dev_priv = guc_to_i915(guc);
+	int ret;
+
+	I915_WRITE(SOFT_SCRATCH(14), ~0x0);
+
+	ret = intel_guc_send(guc, action, len);
+	if (ret)
+		return ret;
+
+	return intel_wait_for_register(dev_priv, SOFT_SCRATCH(14), ~0x0,
+				       INTEL_GUC_SLEEP_STATE_SUCCESS, 10);
+}
+
 /**
  * intel_guc_suspend() - notify GuC entering suspend state
  * @guc:	the guc
@@ -533,7 +557,7 @@  int intel_guc_suspend(struct intel_guc *guc)
 		intel_guc_ggtt_offset(guc, guc->shared_data)
 	};
 
-	return intel_guc_send(guc, data, ARRAY_SIZE(data));
+	return guc_sleep_state_action(guc, data, ARRAY_SIZE(data));
 }
 
 /**
@@ -571,7 +595,7 @@  int intel_guc_resume(struct intel_guc *guc)
 		intel_guc_ggtt_offset(guc, guc->shared_data)
 	};
 
-	return intel_guc_send(guc, data, ARRAY_SIZE(data));
+	return guc_sleep_state_action(guc, data, ARRAY_SIZE(data));
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 8382d591c784..b0eb5aabe0a7 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -687,6 +687,12 @@  enum intel_guc_report_status {
 	INTEL_GUC_REPORT_STATUS_COMPLETE = 0x4,
 };
 
+enum intel_guc_sleep_state_status {
+	INTEL_GUC_SLEEP_STATE_SUCCESS = 0x0,
+	INTEL_GUC_SLEEP_STATE_PREEMPT_TO_IDLE_FAILED = 0x1,
+	INTEL_GUC_SLEEP_STATE_ENGINE_RESET_FAILED = 0x2
+};
+
 #define GUC_LOG_CONTROL_LOGGING_ENABLED	(1 << 0)
 #define GUC_LOG_CONTROL_VERBOSITY_SHIFT	4
 #define GUC_LOG_CONTROL_VERBOSITY_MASK	(0xF << GUC_LOG_CONTROL_VERBOSITY_SHIFT)