diff mbox

[DMC_BUGFIX_SKL_V2,1/5] drm/i915/skl: Added a check for the hardware status of csr fw before loading.

Message ID 1440533169-32265-2-git-send-email-animesh.manna@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Manna, Animesh Aug. 25, 2015, 8:06 p.m. UTC
Dmc will restore the csr program except DC9, cold boot,
warm reset, PCI function level reset, and hibernate/suspend.

intel_csr_load_program() function is used to load the firmware
data from kernel memory to csr address space.

All values of csr address space will be zero if it got reset and
the first byte of csr program is always a non-zero if firmware
is loaded successfuly. Based on hardware status will load the
firmware.

Without this condition check if we overwrite the firmware data the
counters exposed for dc5/dc6 (help for debugging) will be nullified.

v1: Initial version.

v2: Based on review comments from Daniel,
- Added a check to know hardware status and load the firmware if not loaded.

Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Damien Lespiau <damien.lespiau@intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Sunil Kamath <sunil.kamath@intel.com>
Signed-off-by: Animesh Manna <animesh.manna@intel.com>
Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
---
 drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Daniel Vetter Aug. 26, 2015, 1:10 p.m. UTC | #1
On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> Dmc will restore the csr program except DC9, cold boot,
> warm reset, PCI function level reset, and hibernate/suspend.
> 
> intel_csr_load_program() function is used to load the firmware
> data from kernel memory to csr address space.
> 
> All values of csr address space will be zero if it got reset and
> the first byte of csr program is always a non-zero if firmware
> is loaded successfuly. Based on hardware status will load the
> firmware.
> 
> Without this condition check if we overwrite the firmware data the
> counters exposed for dc5/dc6 (help for debugging) will be nullified.
> 
> v1: Initial version.
> 
> v2: Based on review comments from Daniel,
> - Added a check to know hardware status and load the firmware if not loaded.
> 
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Damien Lespiau <damien.lespiau@intel.com>
> Cc: Imre Deak <imre.deak@intel.com>
> Cc: Sunil Kamath <sunil.kamath@intel.com>
> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> index ba1ae03..682cc26 100644
> --- a/drivers/gpu/drm/i915/intel_csr.c
> +++ b/drivers/gpu/drm/i915/intel_csr.c
> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>  		return;
>  	}
>  
> +	/*
> +	 * Dmc will restore the csr the program except DC9, cold boot,
> +	 * warm reset, PCI function level reset, and hibernate/suspend.
> +	 * This condition will help to check if csr address space is reset/
> +	 * not loaded.
> +	 */

Atm we call this from driver load and resume, which doesn seem to cover
all the cases you mention in the comment. Should this be a WARN_ON
instead? Or do we have troubles in our init sequence where we load too
many times?

Either way I can't reconcile your commit message with the comment here.
-Daniel

> +	if (I915_READ(CSR_PROGRAM_BASE))
> +		return;
> +
>  	mutex_lock(&dev_priv->csr_lock);
>  	fw_size = dev_priv->csr.dmc_fw_size;
>  	for (i = 0; i < fw_size; i++)
> -- 
> 2.0.2
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Manna, Animesh Aug. 26, 2015, 2:10 p.m. UTC | #2
On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>> Dmc will restore the csr program except DC9, cold boot,
>> warm reset, PCI function level reset, and hibernate/suspend.
>>
>> intel_csr_load_program() function is used to load the firmware
>> data from kernel memory to csr address space.
>>
>> All values of csr address space will be zero if it got reset and
>> the first byte of csr program is always a non-zero if firmware
>> is loaded successfuly. Based on hardware status will load the
>> firmware.
>>
>> Without this condition check if we overwrite the firmware data the
>> counters exposed for dc5/dc6 (help for debugging) will be nullified.

Bacause of the above reason mentioned just above we need to block firmware loading again.
So only WARN_ON will not help.


>>
>> v1: Initial version.
>>
>> v2: Based on review comments from Daniel,
>> - Added a check to know hardware status and load the firmware if not loaded.
>>
>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>> Cc: Imre Deak <imre.deak@intel.com>
>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>> ---
>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>> index ba1ae03..682cc26 100644
>> --- a/drivers/gpu/drm/i915/intel_csr.c
>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>   		return;
>>   	}
>>   
>> +	/*
>> +	 * Dmc will restore the csr the program except DC9, cold boot,
>> +	 * warm reset, PCI function level reset, and hibernate/suspend.
>> +	 * This condition will help to check if csr address space is reset/
>> +	 * not loaded.
>> +	 */
> Atm we call this from driver load and resume, which doesn seem to cover
> all the cases you mention in the comment. Should this be a WARN_ON
> instead? Or do we have troubles in our init sequence where we load too
> many times?

Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
and cold boot(not loaded).

Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.

Can the below comment more clear to you.

	/*
	 * Dmc will restore the csr the program except DC9, cold boot,
	 * warm reset, PCI function level reset, and hibernate/suspend.
	 * If firmware is restored by dmc then no need to load again which
	 * will keep the dc5/dc6 counter exposed by firmware.
	 */

No issue in init sequence.

-Animesh


>
> Either way I can't reconcile your commit message with the comment here.
> -Daniel
>
>> +	if (I915_READ(CSR_PROGRAM_BASE))
>> +		return;
>> +
>>   	mutex_lock(&dev_priv->csr_lock);
>>   	fw_size = dev_priv->csr.dmc_fw_size;
>>   	for (i = 0; i < fw_size; i++)
>> -- 
>> 2.0.2
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Daniel Vetter Sept. 2, 2015, 8:54 a.m. UTC | #3
On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> 
> 
> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>Dmc will restore the csr program except DC9, cold boot,
> >>warm reset, PCI function level reset, and hibernate/suspend.
> >>
> >>intel_csr_load_program() function is used to load the firmware
> >>data from kernel memory to csr address space.
> >>
> >>All values of csr address space will be zero if it got reset and
> >>the first byte of csr program is always a non-zero if firmware
> >>is loaded successfuly. Based on hardware status will load the
> >>firmware.
> >>
> >>Without this condition check if we overwrite the firmware data the
> >>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> 
> Bacause of the above reason mentioned just above we need to block firmware loading again.
> So only WARN_ON will not help.
> 
> 
> >>
> >>v1: Initial version.
> >>
> >>v2: Based on review comments from Daniel,
> >>- Added a check to know hardware status and load the firmware if not loaded.
> >>
> >>Cc: Daniel Vetter <daniel.vetter@intel.com>
> >>Cc: Damien Lespiau <damien.lespiau@intel.com>
> >>Cc: Imre Deak <imre.deak@intel.com>
> >>Cc: Sunil Kamath <sunil.kamath@intel.com>
> >>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >>---
> >>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>  1 file changed, 9 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>index ba1ae03..682cc26 100644
> >>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>  		return;
> >>  	}
> >>+	/*
> >>+	 * Dmc will restore the csr the program except DC9, cold boot,
> >>+	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>+	 * This condition will help to check if csr address space is reset/
> >>+	 * not loaded.
> >>+	 */
> >Atm we call this from driver load and resume, which doesn seem to cover
> >all the cases you mention in the comment. Should this be a WARN_ON
> >instead? Or do we have troubles in our init sequence where we load too
> >many times?
> 
> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> and cold boot(not loaded).
> 
> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> 
> Can the below comment more clear to you.
> 
> 	/*
> 	 * Dmc will restore the csr the program except DC9, cold boot,
> 	 * warm reset, PCI function level reset, and hibernate/suspend.
> 	 * If firmware is restored by dmc then no need to load again which
> 	 * will keep the dc5/dc6 counter exposed by firmware.
> 	 */
> 
> No issue in init sequence.

That seems to still cover all the callers of the function afaics - we do
pci resets over suspend resume unconditionally. So I still don't
understand where exactly we try to load the dmc firmware in i915.ko when
it's already loaded.
-Daniel
Kamath, Sunil Sept. 7, 2015, 11:04 a.m. UTC | #4
On Wednesday 26 August 2015 01:36 AM, Animesh Manna wrote:
> Dmc will restore the csr program except DC9, cold boot,
> warm reset, PCI function level reset, and hibernate/suspend.
>
> intel_csr_load_program() function is used to load the firmware
> data from kernel memory to csr address space.
>
> All values of csr address space will be zero if it got reset and
> the first byte of csr program is always a non-zero if firmware
> is loaded successfuly. Based on hardware status will load the
> firmware.
>
> Without this condition check if we overwrite the firmware data the
> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>
> v1: Initial version.
>
> v2: Based on review comments from Daniel,
> - Added a check to know hardware status and load the firmware if not loaded.
>
> Cc: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Damien Lespiau <damien.lespiau@intel.com>
> Cc: Imre Deak <imre.deak@intel.com>
> Cc: Sunil Kamath <sunil.kamath@intel.com>
> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> ---
>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> index ba1ae03..682cc26 100644
> --- a/drivers/gpu/drm/i915/intel_csr.c
> +++ b/drivers/gpu/drm/i915/intel_csr.c
> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>   		return;
>   	}
>   
> +	/*
> +	 * Dmc will restore the csr the program except DC9, cold boot,
> +	 * warm reset, PCI function level reset, and hibernate/suspend.
> +	 * This condition will help to check if csr address space is reset/
> +	 * not loaded.
> +	 */
> +	if (I915_READ(CSR_PROGRAM_BASE))
> +		return;
> +
>   	mutex_lock(&dev_priv->csr_lock);
>   	fw_size = dev_priv->csr.dmc_fw_size;
>   	for (i = 0; i < fw_size; i++)

Valid fix and patch is ready for merge now.

Reviewed-by: A.Sunil Kamath <sunil.kamath@intel.com> 
<mailto:sunil.kamath@intel.com>
Daniel Vetter Sept. 7, 2015, 4:22 p.m. UTC | #5
On Mon, Sep 07, 2015 at 04:34:30PM +0530, Sunil Kamath wrote:
> On Wednesday 26 August 2015 01:36 AM, Animesh Manna wrote:
> >Dmc will restore the csr program except DC9, cold boot,
> >warm reset, PCI function level reset, and hibernate/suspend.
> >
> >intel_csr_load_program() function is used to load the firmware
> >data from kernel memory to csr address space.
> >
> >All values of csr address space will be zero if it got reset and
> >the first byte of csr program is always a non-zero if firmware
> >is loaded successfuly. Based on hardware status will load the
> >firmware.
> >
> >Without this condition check if we overwrite the firmware data the
> >counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >
> >v1: Initial version.
> >
> >v2: Based on review comments from Daniel,
> >- Added a check to know hardware status and load the firmware if not loaded.
> >
> >Cc: Daniel Vetter <daniel.vetter@intel.com>
> >Cc: Damien Lespiau <damien.lespiau@intel.com>
> >Cc: Imre Deak <imre.deak@intel.com>
> >Cc: Sunil Kamath <sunil.kamath@intel.com>
> >Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >---
> >  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> >diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >index ba1ae03..682cc26 100644
> >--- a/drivers/gpu/drm/i915/intel_csr.c
> >+++ b/drivers/gpu/drm/i915/intel_csr.c
> >@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >  		return;
> >  	}
> >+	/*
> >+	 * Dmc will restore the csr the program except DC9, cold boot,
> >+	 * warm reset, PCI function level reset, and hibernate/suspend.
> >+	 * This condition will help to check if csr address space is reset/
> >+	 * not loaded.
> >+	 */
> >+	if (I915_READ(CSR_PROGRAM_BASE))
> >+		return;
> >+
> >  	mutex_lock(&dev_priv->csr_lock);
> >  	fw_size = dev_priv->csr.dmc_fw_size;
> >  	for (i = 0; i < fw_size; i++)
> 
> Valid fix and patch is ready for merge now.

I still have a question open so will wait until that's anserwered ...

Thanks, Daniel
Manna, Animesh Sept. 9, 2015, 8:28 p.m. UTC | #6
On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>
>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>> Dmc will restore the csr program except DC9, cold boot,
>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>
>>>> intel_csr_load_program() function is used to load the firmware
>>>> data from kernel memory to csr address space.
>>>>
>>>> All values of csr address space will be zero if it got reset and
>>>> the first byte of csr program is always a non-zero if firmware
>>>> is loaded successfuly. Based on hardware status will load the
>>>> firmware.
>>>>
>>>> Without this condition check if we overwrite the firmware data the
>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>> So only WARN_ON will not help.
>>
>>
>>>> v1: Initial version.
>>>>
>>>> v2: Based on review comments from Daniel,
>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>
>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>   1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>> index ba1ae03..682cc26 100644
>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>   		return;
>>>>   	}
>>>> +	/*
>>>> +	 * Dmc will restore the csr the program except DC9, cold boot,
>>>> +	 * warm reset, PCI function level reset, and hibernate/suspend.
>>>> +	 * This condition will help to check if csr address space is reset/
>>>> +	 * not loaded.
>>>> +	 */
>>> Atm we call this from driver load and resume, which doesn seem to cover
>>> all the cases you mention in the comment. Should this be a WARN_ON
>>> instead? Or do we have troubles in our init sequence where we load too
>>> many times?
>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>> and cold boot(not loaded).
>>
>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>
>> Can the below comment more clear to you.
>>
>> 	/*
>> 	 * Dmc will restore the csr the program except DC9, cold boot,
>> 	 * warm reset, PCI function level reset, and hibernate/suspend.
>> 	 * If firmware is restored by dmc then no need to load again which
>> 	 * will keep the dc5/dc6 counter exposed by firmware.
>> 	 */
>>
>> No issue in init sequence.
> That seems to still cover all the callers of the function afaics - we do
> pci resets over suspend resume unconditionally. So I still don't
> understand where exactly we try to load the dmc firmware in i915.ko when
> it's already loaded.

During resume intel_csr_load_program() will be called from
intel_runtime_resume().

intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()

During Pc10 entry testing I can see dmc is restoring back the firmware always,
but as you mentioned pci-reset can happen unconditionally, but still then
also during resume intel_runtime_resume() will be called and based on
register read of csr-base-address firmware loading will happen.

-Animesh

> -Daniel
Manna, Animesh Sept. 9, 2015, 8:33 p.m. UTC | #7
On 9/7/2015 9:52 PM, Daniel Vetter wrote:
> On Mon, Sep 07, 2015 at 04:34:30PM +0530, Sunil Kamath wrote:
>> On Wednesday 26 August 2015 01:36 AM, Animesh Manna wrote:
>>> Dmc will restore the csr program except DC9, cold boot,
>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>
>>> intel_csr_load_program() function is used to load the firmware
>>> data from kernel memory to csr address space.
>>>
>>> All values of csr address space will be zero if it got reset and
>>> the first byte of csr program is always a non-zero if firmware
>>> is loaded successfuly. Based on hardware status will load the
>>> firmware.
>>>
>>> Without this condition check if we overwrite the firmware data the
>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>
>>> v1: Initial version.
>>>
>>> v2: Based on review comments from Daniel,
>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>
>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>> Cc: Imre Deak <imre.deak@intel.com>
>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>   1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>> index ba1ae03..682cc26 100644
>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>   		return;
>>>   	}
>>> +	/*
>>> +	 * Dmc will restore the csr the program except DC9, cold boot,
>>> +	 * warm reset, PCI function level reset, and hibernate/suspend.
>>> +	 * This condition will help to check if csr address space is reset/
>>> +	 * not loaded.
>>> +	 */
>>> +	if (I915_READ(CSR_PROGRAM_BASE))
>>> +		return;
>>> +
>>>   	mutex_lock(&dev_priv->csr_lock);
>>>   	fw_size = dev_priv->csr.dmc_fw_size;
>>>   	for (i = 0; i < fw_size; i++)
>> Valid fix and patch is ready for merge now.
> I still have a question open so will wait until that's anserwered ...
>
> Thanks, Daniel

Shared my view on other thread.

-Animesh
Daniel Vetter Sept. 10, 2015, 2:45 p.m. UTC | #8
On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
> 
> 
> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> >On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> >>
> >>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>>>Dmc will restore the csr program except DC9, cold boot,
> >>>>warm reset, PCI function level reset, and hibernate/suspend.
> >>>>
> >>>>intel_csr_load_program() function is used to load the firmware
> >>>>data from kernel memory to csr address space.
> >>>>
> >>>>All values of csr address space will be zero if it got reset and
> >>>>the first byte of csr program is always a non-zero if firmware
> >>>>is loaded successfuly. Based on hardware status will load the
> >>>>firmware.
> >>>>
> >>>>Without this condition check if we overwrite the firmware data the
> >>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >>Bacause of the above reason mentioned just above we need to block firmware loading again.
> >>So only WARN_ON will not help.
> >>
> >>
> >>>>v1: Initial version.
> >>>>
> >>>>v2: Based on review comments from Daniel,
> >>>>- Added a check to know hardware status and load the firmware if not loaded.
> >>>>
> >>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
> >>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
> >>>>Cc: Imre Deak <imre.deak@intel.com>
> >>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
> >>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >>>>---
> >>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>>>  1 file changed, 9 insertions(+)
> >>>>
> >>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>>>index ba1ae03..682cc26 100644
> >>>>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>>>  		return;
> >>>>  	}
> >>>>+	/*
> >>>>+	 * Dmc will restore the csr the program except DC9, cold boot,
> >>>>+	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>+	 * This condition will help to check if csr address space is reset/
> >>>>+	 * not loaded.
> >>>>+	 */
> >>>Atm we call this from driver load and resume, which doesn seem to cover
> >>>all the cases you mention in the comment. Should this be a WARN_ON
> >>>instead? Or do we have troubles in our init sequence where we load too
> >>>many times?
> >>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> >>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> >>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> >>and cold boot(not loaded).
> >>
> >>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> >>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> >>
> >>Can the below comment more clear to you.
> >>
> >>	/*
> >>	 * Dmc will restore the csr the program except DC9, cold boot,
> >>	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>	 * If firmware is restored by dmc then no need to load again which
> >>	 * will keep the dc5/dc6 counter exposed by firmware.
> >>	 */
> >>
> >>No issue in init sequence.
> >That seems to still cover all the callers of the function afaics - we do
> >pci resets over suspend resume unconditionally. So I still don't
> >understand where exactly we try to load the dmc firmware in i915.ko when
> >it's already loaded.
> 
> During resume intel_csr_load_program() will be called from
> intel_runtime_resume().
> 
> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
> 
> During Pc10 entry testing I can see dmc is restoring back the firmware always,
> but as you mentioned pci-reset can happen unconditionally, but still then
> also during resume intel_runtime_resume() will be called and based on
> register read of csr-base-address firmware loading will happen.

But in your comment you're saying it won't get restored in case of dc9 and
suspend. So that seems to mismatch what you're saying here (and what the
commit message says) and what the code does. And this function here is
called for resume after suspend/hibernate only.

So I still see a mismatch between what you're saying here and what the
code does vs. what the comment above says. It would be clear if there's
just no comment. And we don't call intel_csr_load_program from random
places just for luck, if there's unecessary calls of that its kinda a bug
in the code (and we should just remove it).
-Daniel
Manna, Animesh Sept. 10, 2015, 7:05 p.m. UTC | #9
On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>
>> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>>>> Dmc will restore the csr program except DC9, cold boot,
>>>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>
>>>>>> intel_csr_load_program() function is used to load the firmware
>>>>>> data from kernel memory to csr address space.
>>>>>>
>>>>>> All values of csr address space will be zero if it got reset and
>>>>>> the first byte of csr program is always a non-zero if firmware
>>>>>> is loaded successfuly. Based on hardware status will load the
>>>>>> firmware.
>>>>>>
>>>>>> Without this condition check if we overwrite the firmware data the
>>>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>> So only WARN_ON will not help.
>>>>
>>>>
>>>>>> v1: Initial version.
>>>>>>
>>>>>> v2: Based on review comments from Daniel,
>>>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>>>
>>>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>>>   1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>>>> index ba1ae03..682cc26 100644
>>>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>>>   		return;
>>>>>>   	}
>>>>>> +	/*
>>>>>> +	 * Dmc will restore the csr the program except DC9, cold boot,
>>>>>> +	 * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>> +	 * This condition will help to check if csr address space is reset/
>>>>>> +	 * not loaded.
>>>>>> +	 */
>>>>> Atm we call this from driver load and resume, which doesn seem to cover
>>>>> all the cases you mention in the comment. Should this be a WARN_ON
>>>>> instead? Or do we have troubles in our init sequence where we load too
>>>>> many times?
>>>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>> and cold boot(not loaded).
>>>>
>>>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>>
>>>> Can the below comment more clear to you.
>>>>
>>>> 	/*
>>>> 	 * Dmc will restore the csr the program except DC9, cold boot,
>>>> 	 * warm reset, PCI function level reset, and hibernate/suspend.
>>>> 	 * If firmware is restored by dmc then no need to load again which
>>>> 	 * will keep the dc5/dc6 counter exposed by firmware.
>>>> 	 */
>>>>
>>>> No issue in init sequence.
>>> That seems to still cover all the callers of the function afaics - we do
>>> pci resets over suspend resume unconditionally. So I still don't
>>> understand where exactly we try to load the dmc firmware in i915.ko when
>>> it's already loaded.
>> During resume intel_csr_load_program() will be called from
>> intel_runtime_resume().
>>
>> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>
>> During Pc10 entry testing I can see dmc is restoring back the firmware always,
>> but as you mentioned pci-reset can happen unconditionally, but still then
>> also during resume intel_runtime_resume() will be called and based on
>> register read of csr-base-address firmware loading will happen.
> But in your comment you're saying it won't get restored in case of dc9 and
> suspend. So that seems to mismatch what you're saying here (and what the
> commit message says) and what the code does. And this function here is
> called for resume after suspend/hibernate only.

pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
I think you are confusing between dc6 and dc9. Pc10 can be achieved by
entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
for broxton which is not present for skylake.

Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
execution flow will be different in case of suspend/resume which I think is confusing
you.

I am ready explain you in detail. It will be good if we discuss specific use-case scenario
and itz software design for specific platform. Another point - as dmc related code for
broxton is not merged better first we close design for skylake. Now, I have added dc9
description in comment thinking of future. If you want I can remove for now and later
can add in bxt patch series for enabling dmc. Will wait for your reply.

-Animesh

>
> So I still see a mismatch between what you're saying here and what the
> code does vs. what the comment above says. It would be clear if there's
> just no comment. And we don't call intel_csr_load_program from random
> places just for luck, if there's unecessary calls of that its kinda a bug
> in the code (and we should just remove it).
> -Daniel
Manna, Animesh Sept. 10, 2015, 7:06 p.m. UTC | #10
On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>
>> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>>>> Dmc will restore the csr program except DC9, cold boot,
>>>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>
>>>>>> intel_csr_load_program() function is used to load the firmware
>>>>>> data from kernel memory to csr address space.
>>>>>>
>>>>>> All values of csr address space will be zero if it got reset and
>>>>>> the first byte of csr program is always a non-zero if firmware
>>>>>> is loaded successfuly. Based on hardware status will load the
>>>>>> firmware.
>>>>>>
>>>>>> Without this condition check if we overwrite the firmware data the
>>>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>> So only WARN_ON will not help.
>>>>
>>>>
>>>>>> v1: Initial version.
>>>>>>
>>>>>> v2: Based on review comments from Daniel,
>>>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>>>
>>>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>>>   1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>>>> index ba1ae03..682cc26 100644
>>>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>>>   		return;
>>>>>>   	}
>>>>>> +	/*
>>>>>> +	 * Dmc will restore the csr the program except DC9, cold boot,
>>>>>> +	 * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>> +	 * This condition will help to check if csr address space is reset/
>>>>>> +	 * not loaded.
>>>>>> +	 */
>>>>> Atm we call this from driver load and resume, which doesn seem to cover
>>>>> all the cases you mention in the comment. Should this be a WARN_ON
>>>>> instead? Or do we have troubles in our init sequence where we load too
>>>>> many times?
>>>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>> and cold boot(not loaded).
>>>>
>>>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>>
>>>> Can the below comment more clear to you.
>>>>
>>>> 	/*
>>>> 	 * Dmc will restore the csr the program except DC9, cold boot,
>>>> 	 * warm reset, PCI function level reset, and hibernate/suspend.
>>>> 	 * If firmware is restored by dmc then no need to load again which
>>>> 	 * will keep the dc5/dc6 counter exposed by firmware.
>>>> 	 */
>>>>
>>>> No issue in init sequence.
>>> That seems to still cover all the callers of the function afaics - we do
>>> pci resets over suspend resume unconditionally. So I still don't
>>> understand where exactly we try to load the dmc firmware in i915.ko when
>>> it's already loaded.
>> During resume intel_csr_load_program() will be called from
>> intel_runtime_resume().
>>
>> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>
>> During Pc10 entry testing I can see dmc is restoring back the firmware always,
>> but as you mentioned pci-reset can happen unconditionally, but still then
>> also during resume intel_runtime_resume() will be called and based on
>> register read of csr-base-address firmware loading will happen.
> But in your comment you're saying it won't get restored in case of dc9 and
> suspend. So that seems to mismatch what you're saying here (and what the
> commit message says) and what the code does. And this function here is
> called for resume after suspend/hibernate only.

pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
I think you are confusing between dc6 and dc9. Pc10 can be achieved by
entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
for broxton which is not present for skylake.

Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
execution flow will be different in case of suspend/resume which I think is confusing
you.

I am ready explain you in detail. It will be good if we discuss specific use-case scenario
and itz software design for specific platform. Another point - as dmc related code for
broxton is not merged better first we close design for skylake. Now, I have added dc9
description in comment thinking of future. If you want I can remove for now and later
can add in bxt patch series for enabling dmc. Will wait for your reply.

-Animesh

>
> So I still see a mismatch between what you're saying here and what the
> code does vs. what the comment above says. It would be clear if there's
> just no comment. And we don't call intel_csr_load_program from random
> places just for luck, if there's unecessary calls of that its kinda a bug
> in the code (and we should just remove it).
> -Daniel
Daniel Vetter Sept. 14, 2015, 7:46 a.m. UTC | #11
On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
> 
> 
> On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> >On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
> >>
> >>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> >>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> >>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>>>>>Dmc will restore the csr program except DC9, cold boot,
> >>>>>>warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>
> >>>>>>intel_csr_load_program() function is used to load the firmware
> >>>>>>data from kernel memory to csr address space.
> >>>>>>
> >>>>>>All values of csr address space will be zero if it got reset and
> >>>>>>the first byte of csr program is always a non-zero if firmware
> >>>>>>is loaded successfuly. Based on hardware status will load the
> >>>>>>firmware.
> >>>>>>
> >>>>>>Without this condition check if we overwrite the firmware data the
> >>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
> >>>>So only WARN_ON will not help.
> >>>>
> >>>>
> >>>>>>v1: Initial version.
> >>>>>>
> >>>>>>v2: Based on review comments from Daniel,
> >>>>>>- Added a check to know hardware status and load the firmware if not loaded.
> >>>>>>
> >>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
> >>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
> >>>>>>Cc: Imre Deak <imre.deak@intel.com>
> >>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
> >>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >>>>>>---
> >>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>>>>>  1 file changed, 9 insertions(+)
> >>>>>>
> >>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>index ba1ae03..682cc26 100644
> >>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>>>>>  		return;
> >>>>>>  	}
> >>>>>>+	/*
> >>>>>>+	 * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>+	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>+	 * This condition will help to check if csr address space is reset/
> >>>>>>+	 * not loaded.
> >>>>>>+	 */
> >>>>>Atm we call this from driver load and resume, which doesn seem to cover
> >>>>>all the cases you mention in the comment. Should this be a WARN_ON
> >>>>>instead? Or do we have troubles in our init sequence where we load too
> >>>>>many times?
> >>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> >>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> >>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> >>>>and cold boot(not loaded).
> >>>>
> >>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> >>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> >>>>
> >>>>Can the below comment more clear to you.
> >>>>
> >>>>	/*
> >>>>	 * Dmc will restore the csr the program except DC9, cold boot,
> >>>>	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>	 * If firmware is restored by dmc then no need to load again which
> >>>>	 * will keep the dc5/dc6 counter exposed by firmware.
> >>>>	 */
> >>>>
> >>>>No issue in init sequence.
> >>>That seems to still cover all the callers of the function afaics - we do
> >>>pci resets over suspend resume unconditionally. So I still don't
> >>>understand where exactly we try to load the dmc firmware in i915.ko when
> >>>it's already loaded.
> >>During resume intel_csr_load_program() will be called from
> >>intel_runtime_resume().
> >>
> >>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
> >>
> >>During Pc10 entry testing I can see dmc is restoring back the firmware always,
> >>but as you mentioned pci-reset can happen unconditionally, but still then
> >>also during resume intel_runtime_resume() will be called and based on
> >>register read of csr-base-address firmware loading will happen.
> >But in your comment you're saying it won't get restored in case of dc9 and
> >suspend. So that seems to mismatch what you're saying here (and what the
> >commit message says) and what the code does. And this function here is
> >called for resume after suspend/hibernate only.
> 
> pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
> I think you are confusing between dc6 and dc9. Pc10 can be achieved by
> entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
> for broxton which is not present for skylake.

I have no idea at all about different pc levels on skl. What I'm talking
about is system suspend/resume and driver load, which are the places this
function gets called. At least afaics.

> Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
> execution flow will be different in case of suspend/resume which I think is confusing
> you.

That seems like really important information. What's different on bxt?
These are the kind of details you should explain in the commit message ...

> I am ready explain you in detail. It will be good if we discuss specific use-case scenario
> and itz software design for specific platform. Another point - as dmc related code for
> broxton is not merged better first we close design for skylake. Now, I have added dc9
> description in comment thinking of future. If you want I can remove for now and later
> can add in bxt patch series for enabling dmc. Will wait for your reply.

This question here isn't about the overall design and how to handle power
wells in skl/bxt. That's a separate discussion and tracked somewhere else.
I'm really just confused about when exactly we need to reload to firmware,
and why we need a runtime check for that. Normally we should know when to
reload the firmware and just either reload or not, without checking hw
state. And I don't like checking for hw state since at least in the past
that kind of code ended up being fragile - it's an illusion that it does
the right thing no matter what, since often there's other tricky ordering
constraints. And if you have automatic duct-tape like then no one will
ever spot those other, harder to spot issues, until an expensive customer
escalation happens.

So what I want to know here is:
- When exactly do we need to reload dmc firmware.
- What exactly is the reason why we can't make that decision statically in
  the code (by calling csr_load at the right spots).

Thanks, Daniel
Manna, Animesh Sept. 16, 2015, 7:23 p.m. UTC | #12
On 9/14/2015 1:16 PM, Daniel Vetter wrote:
> On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>>
>> On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>>> On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>>> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>>>> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>>>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>>>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>>>>>> Dmc will restore the csr program except DC9, cold boot,
>>>>>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>
>>>>>>>> intel_csr_load_program() function is used to load the firmware
>>>>>>>> data from kernel memory to csr address space.
>>>>>>>>
>>>>>>>> All values of csr address space will be zero if it got reset and
>>>>>>>> the first byte of csr program is always a non-zero if firmware
>>>>>>>> is loaded successfuly. Based on hardware status will load the
>>>>>>>> firmware.
>>>>>>>>
>>>>>>>> Without this condition check if we overwrite the firmware data the
>>>>>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>>>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>>>> So only WARN_ON will not help.
>>>>>>
>>>>>>
>>>>>>>> v1: Initial version.
>>>>>>>>
>>>>>>>> v2: Based on review comments from Daniel,
>>>>>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>>>>>
>>>>>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>>>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>>>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>>>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>>>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>>>>>   1 file changed, 9 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>> index ba1ae03..682cc26 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>>>>>   		return;
>>>>>>>>   	}
>>>>>>>> +	/*
>>>>>>>> +	 * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>> +	 * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>> +	 * This condition will help to check if csr address space is reset/
>>>>>>>> +	 * not loaded.
>>>>>>>> +	 */
>>>>>>> Atm we call this from driver load and resume, which doesn seem to cover
>>>>>>> all the cases you mention in the comment. Should this be a WARN_ON
>>>>>>> instead? Or do we have troubles in our init sequence where we load too
>>>>>>> many times?
>>>>>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>>>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>>>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>>>> and cold boot(not loaded).
>>>>>>
>>>>>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>>>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>>>>
>>>>>> Can the below comment more clear to you.
>>>>>>
>>>>>> 	/*
>>>>>> 	 * Dmc will restore the csr the program except DC9, cold boot,
>>>>>> 	 * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>> 	 * If firmware is restored by dmc then no need to load again which
>>>>>> 	 * will keep the dc5/dc6 counter exposed by firmware.
>>>>>> 	 */
>>>>>>
>>>>>> No issue in init sequence.
>>>>> That seems to still cover all the callers of the function afaics - we do
>>>>> pci resets over suspend resume unconditionally. So I still don't
>>>>> understand where exactly we try to load the dmc firmware in i915.ko when
>>>>> it's already loaded.
>>>> During resume intel_csr_load_program() will be called from
>>>> intel_runtime_resume().
>>>>
>>>> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>>>
>>>> During Pc10 entry testing I can see dmc is restoring back the firmware always,
>>>> but as you mentioned pci-reset can happen unconditionally, but still then
>>>> also during resume intel_runtime_resume() will be called and based on
>>>> register read of csr-base-address firmware loading will happen.
>>> But in your comment you're saying it won't get restored in case of dc9 and
>>> suspend. So that seems to mismatch what you're saying here (and what the
>>> commit message says) and what the code does. And this function here is
>>> called for resume after suspend/hibernate only.
>> pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>> I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>> entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>> for broxton which is not present for skylake.
> I have no idea at all about different pc levels on skl. What I'm talking
> about is system suspend/resume and driver load, which are the places this
> function gets called. At least afaics.
>
>> Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>> execution flow will be different in case of suspend/resume which I think is confusing
>> you.
> That seems like really important information. What's different on bxt?
> These are the kind of details you should explain in the commit message ...
>
>> I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>> and itz software design for specific platform. Another point - as dmc related code for
>> broxton is not merged better first we close design for skylake. Now, I have added dc9
>> description in comment thinking of future. If you want I can remove for now and later
>> can add in bxt patch series for enabling dmc. Will wait for your reply.
> This question here isn't about the overall design and how to handle power
> wells in skl/bxt. That's a separate discussion and tracked somewhere else.
> I'm really just confused about when exactly we need to reload to firmware,
> and why we need a runtime check for that. Normally we should know when to
> reload the firmware and just either reload or not, without checking hw
> state. And I don't like checking for hw state since at least in the past
> that kind of code ended up being fragile - it's an illusion that it does
> the right thing no matter what, since often there's other tricky ordering
> constraints. And if you have automatic duct-tape like then no one will
> ever spot those other, harder to spot issues, until an expensive customer
> escalation happens.
>
> So what I want to know here is:
> - When exactly do we need to reload dmc firmware.
In skl, during driver load first time we load the firmware, during 
normal suspend-resume (dc6 entry/exit)
no need to reload the firmware again as dmc will take care of it. But 
during suspend/hibernation
dmc will not restore the firmware. In that case driver need to reload it 
again. I do not know
how to differentiate pm-suspend and suspend-hibernation and thought both 
the cases
intel_runtime_resume() will be called where we can check the h/w state 
and reload the
firmware if dmc is not restored.

In bxt, during driver load first time we load the firmware, during 
normal suspend-resume
display engine will enter into dc9 and dmc will not restore the 
firmware. So every
suspend-resume we need to reload the firmware.
> - What exactly is the reason why we can't make that decision statically in
>    the code (by calling csr_load at the right spots).
As I mentioned before in case of skylake can we differentiate between
"resume from pm-suspend" with "resume from suspend-hibernation" inside 
driver?

In case of broxton, every time we need to reload, so we can decide 
statically.

-Animesh
>
> Thanks, Daniel
Daniel Vetter Sept. 23, 2015, 7:57 a.m. UTC | #13
On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
> 
> 
> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
> >On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
> >>
> >>On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> >>>On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
> >>>>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> >>>>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> >>>>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >>>>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>>>>>>>Dmc will restore the csr program except DC9, cold boot,
> >>>>>>>>warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>
> >>>>>>>>intel_csr_load_program() function is used to load the firmware
> >>>>>>>>data from kernel memory to csr address space.
> >>>>>>>>
> >>>>>>>>All values of csr address space will be zero if it got reset and
> >>>>>>>>the first byte of csr program is always a non-zero if firmware
> >>>>>>>>is loaded successfuly. Based on hardware status will load the
> >>>>>>>>firmware.
> >>>>>>>>
> >>>>>>>>Without this condition check if we overwrite the firmware data the
> >>>>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >>>>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
> >>>>>>So only WARN_ON will not help.
> >>>>>>
> >>>>>>
> >>>>>>>>v1: Initial version.
> >>>>>>>>
> >>>>>>>>v2: Based on review comments from Daniel,
> >>>>>>>>- Added a check to know hardware status and load the firmware if not loaded.
> >>>>>>>>
> >>>>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
> >>>>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
> >>>>>>>>Cc: Imre Deak <imre.deak@intel.com>
> >>>>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
> >>>>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >>>>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >>>>>>>>---
> >>>>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>>>>>>>  1 file changed, 9 insertions(+)
> >>>>>>>>
> >>>>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>index ba1ae03..682cc26 100644
> >>>>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>>>>>>>  		return;
> >>>>>>>>  	}
> >>>>>>>>+	/*
> >>>>>>>>+	 * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>>>+	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>+	 * This condition will help to check if csr address space is reset/
> >>>>>>>>+	 * not loaded.
> >>>>>>>>+	 */
> >>>>>>>Atm we call this from driver load and resume, which doesn seem to cover
> >>>>>>>all the cases you mention in the comment. Should this be a WARN_ON
> >>>>>>>instead? Or do we have troubles in our init sequence where we load too
> >>>>>>>many times?
> >>>>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> >>>>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> >>>>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> >>>>>>and cold boot(not loaded).
> >>>>>>
> >>>>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> >>>>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> >>>>>>
> >>>>>>Can the below comment more clear to you.
> >>>>>>
> >>>>>>	/*
> >>>>>>	 * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>	 * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>	 * If firmware is restored by dmc then no need to load again which
> >>>>>>	 * will keep the dc5/dc6 counter exposed by firmware.
> >>>>>>	 */
> >>>>>>
> >>>>>>No issue in init sequence.
> >>>>>That seems to still cover all the callers of the function afaics - we do
> >>>>>pci resets over suspend resume unconditionally. So I still don't
> >>>>>understand where exactly we try to load the dmc firmware in i915.ko when
> >>>>>it's already loaded.
> >>>>During resume intel_csr_load_program() will be called from
> >>>>intel_runtime_resume().
> >>>>
> >>>>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
> >>>>
> >>>>During Pc10 entry testing I can see dmc is restoring back the firmware always,
> >>>>but as you mentioned pci-reset can happen unconditionally, but still then
> >>>>also during resume intel_runtime_resume() will be called and based on
> >>>>register read of csr-base-address firmware loading will happen.
> >>>But in your comment you're saying it won't get restored in case of dc9 and
> >>>suspend. So that seems to mismatch what you're saying here (and what the
> >>>commit message says) and what the code does. And this function here is
> >>>called for resume after suspend/hibernate only.
> >>pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
> >>I think you are confusing between dc6 and dc9. Pc10 can be achieved by
> >>entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
> >>for broxton which is not present for skylake.
> >I have no idea at all about different pc levels on skl. What I'm talking
> >about is system suspend/resume and driver load, which are the places this
> >function gets called. At least afaics.
> >
> >>Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
> >>execution flow will be different in case of suspend/resume which I think is confusing
> >>you.
> >That seems like really important information. What's different on bxt?
> >These are the kind of details you should explain in the commit message ...
> >
> >>I am ready explain you in detail. It will be good if we discuss specific use-case scenario
> >>and itz software design for specific platform. Another point - as dmc related code for
> >>broxton is not merged better first we close design for skylake. Now, I have added dc9
> >>description in comment thinking of future. If you want I can remove for now and later
> >>can add in bxt patch series for enabling dmc. Will wait for your reply.
> >This question here isn't about the overall design and how to handle power
> >wells in skl/bxt. That's a separate discussion and tracked somewhere else.
> >I'm really just confused about when exactly we need to reload to firmware,
> >and why we need a runtime check for that. Normally we should know when to
> >reload the firmware and just either reload or not, without checking hw
> >state. And I don't like checking for hw state since at least in the past
> >that kind of code ended up being fragile - it's an illusion that it does
> >the right thing no matter what, since often there's other tricky ordering
> >constraints. And if you have automatic duct-tape like then no one will
> >ever spot those other, harder to spot issues, until an expensive customer
> >escalation happens.
> >
> >So what I want to know here is:
> >- When exactly do we need to reload dmc firmware.
> In skl, during driver load first time we load the firmware, during normal
> suspend-resume (dc6 entry/exit)
> no need to reload the firmware again as dmc will take care of it. But during
> suspend/hibernation
> dmc will not restore the firmware. In that case driver need to reload it
> again. I do not know
> how to differentiate pm-suspend and suspend-hibernation and thought both the
> cases
> intel_runtime_resume() will be called where we can check the h/w state and
> reload the
> firmware if dmc is not restored.
> 
> In bxt, during driver load first time we load the firmware, during normal
> suspend-resume
> display engine will enter into dc9 and dmc will not restore the firmware. So
> every
> suspend-resume we need to reload the firmware.
> >- What exactly is the reason why we can't make that decision statically in
> >   the code (by calling csr_load at the right spots).
> As I mentioned before in case of skylake can we differentiate between
> "resume from pm-suspend" with "resume from suspend-hibernation" inside
> driver?
> 
> In case of broxton, every time we need to reload, so we can decide
> statically.

Of course we can differentiate between all the different resume paths, and
we also have a per-platform split to take care of bxt vs. skl. And there
are actually 3 different resume paths:

- runtime PM resume. This calls the runtime_resume hook. It sounds like on
  skl we should _not_ load the csr firmware, but on bxt we should load it.
  This can be fixed by removing the intel_csr_load_program call from
  skl_resume_prepare.
- resume from hibernate-to-disk (i.e. system completely off, state stored
  on the swap partition) is done by calling the thaw callbacks.
- resume from suspend-to-mem (i.e. system in low-power with only memory
  in self-refresh, all state stored in memory) is done by calling the
  resume callbacks.

For i915 we use unified handlers in our dev_pm_ops for both thaw and
resume, but it sounds like that won't be a problem for skl/bxt since we
need to reload the csr firmware in all cases. Although I'm not perfectly
sure since you don't explain what kind of resume you mean exactly (since
you don't use the linux names for them).

Anyway it sounds like we can replace this patch by one where we remove
that errornous csr load call from skl runtime pm resume and that's all.
But I suggest to make sure we get this right we keep the check you're
adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
this is going wrong again. Like this:

	if (WARN_ON(csr_loaded_already()))
		return;

Also when redoing the commits please explain in detail what exactly are
the requirements like you've done above, but please use the standard linux
names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".

Thanks, Daniel
Daniel Vetter Sept. 23, 2015, 4:27 p.m. UTC | #14
On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
>>
>>
>> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
>> >On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>> >>
>> >>On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>> >>>On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>> >>>>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>> >>>>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>> >>>>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>> >>>>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>> >>>>>>>>Dmc will restore the csr program except DC9, cold boot,
>> >>>>>>>>warm reset, PCI function level reset, and hibernate/suspend.
>> >>>>>>>>
>> >>>>>>>>intel_csr_load_program() function is used to load the firmware
>> >>>>>>>>data from kernel memory to csr address space.
>> >>>>>>>>
>> >>>>>>>>All values of csr address space will be zero if it got reset and
>> >>>>>>>>the first byte of csr program is always a non-zero if firmware
>> >>>>>>>>is loaded successfuly. Based on hardware status will load the
>> >>>>>>>>firmware.
>> >>>>>>>>
>> >>>>>>>>Without this condition check if we overwrite the firmware data the
>> >>>>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
>> >>>>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
>> >>>>>>So only WARN_ON will not help.
>> >>>>>>
>> >>>>>>
>> >>>>>>>>v1: Initial version.
>> >>>>>>>>
>> >>>>>>>>v2: Based on review comments from Daniel,
>> >>>>>>>>- Added a check to know hardware status and load the firmware if not loaded.
>> >>>>>>>>
>> >>>>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
>> >>>>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
>> >>>>>>>>Cc: Imre Deak <imre.deak@intel.com>
>> >>>>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
>> >>>>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>> >>>>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>> >>>>>>>>---
>> >>>>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>> >>>>>>>>  1 file changed, 9 insertions(+)
>> >>>>>>>>
>> >>>>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>> >>>>>>>>index ba1ae03..682cc26 100644
>> >>>>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
>> >>>>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
>> >>>>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>> >>>>>>>>              return;
>> >>>>>>>>      }
>> >>>>>>>>+     /*
>> >>>>>>>>+      * Dmc will restore the csr the program except DC9, cold boot,
>> >>>>>>>>+      * warm reset, PCI function level reset, and hibernate/suspend.
>> >>>>>>>>+      * This condition will help to check if csr address space is reset/
>> >>>>>>>>+      * not loaded.
>> >>>>>>>>+      */
>> >>>>>>>Atm we call this from driver load and resume, which doesn seem to cover
>> >>>>>>>all the cases you mention in the comment. Should this be a WARN_ON
>> >>>>>>>instead? Or do we have troubles in our init sequence where we load too
>> >>>>>>>many times?
>> >>>>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>> >>>>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>> >>>>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>> >>>>>>and cold boot(not loaded).
>> >>>>>>
>> >>>>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>> >>>>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>> >>>>>>
>> >>>>>>Can the below comment more clear to you.
>> >>>>>>
>> >>>>>>        /*
>> >>>>>>         * Dmc will restore the csr the program except DC9, cold boot,
>> >>>>>>         * warm reset, PCI function level reset, and hibernate/suspend.
>> >>>>>>         * If firmware is restored by dmc then no need to load again which
>> >>>>>>         * will keep the dc5/dc6 counter exposed by firmware.
>> >>>>>>         */
>> >>>>>>
>> >>>>>>No issue in init sequence.
>> >>>>>That seems to still cover all the callers of the function afaics - we do
>> >>>>>pci resets over suspend resume unconditionally. So I still don't
>> >>>>>understand where exactly we try to load the dmc firmware in i915.ko when
>> >>>>>it's already loaded.
>> >>>>During resume intel_csr_load_program() will be called from
>> >>>>intel_runtime_resume().
>> >>>>
>> >>>>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>> >>>>
>> >>>>During Pc10 entry testing I can see dmc is restoring back the firmware always,
>> >>>>but as you mentioned pci-reset can happen unconditionally, but still then
>> >>>>also during resume intel_runtime_resume() will be called and based on
>> >>>>register read of csr-base-address firmware loading will happen.
>> >>>But in your comment you're saying it won't get restored in case of dc9 and
>> >>>suspend. So that seems to mismatch what you're saying here (and what the
>> >>>commit message says) and what the code does. And this function here is
>> >>>called for resume after suspend/hibernate only.
>> >>pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>> >>I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>> >>entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>> >>for broxton which is not present for skylake.
>> >I have no idea at all about different pc levels on skl. What I'm talking
>> >about is system suspend/resume and driver load, which are the places this
>> >function gets called. At least afaics.
>> >
>> >>Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>> >>execution flow will be different in case of suspend/resume which I think is confusing
>> >>you.
>> >That seems like really important information. What's different on bxt?
>> >These are the kind of details you should explain in the commit message ...
>> >
>> >>I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>> >>and itz software design for specific platform. Another point - as dmc related code for
>> >>broxton is not merged better first we close design for skylake. Now, I have added dc9
>> >>description in comment thinking of future. If you want I can remove for now and later
>> >>can add in bxt patch series for enabling dmc. Will wait for your reply.
>> >This question here isn't about the overall design and how to handle power
>> >wells in skl/bxt. That's a separate discussion and tracked somewhere else.
>> >I'm really just confused about when exactly we need to reload to firmware,
>> >and why we need a runtime check for that. Normally we should know when to
>> >reload the firmware and just either reload or not, without checking hw
>> >state. And I don't like checking for hw state since at least in the past
>> >that kind of code ended up being fragile - it's an illusion that it does
>> >the right thing no matter what, since often there's other tricky ordering
>> >constraints. And if you have automatic duct-tape like then no one will
>> >ever spot those other, harder to spot issues, until an expensive customer
>> >escalation happens.
>> >
>> >So what I want to know here is:
>> >- When exactly do we need to reload dmc firmware.
>> In skl, during driver load first time we load the firmware, during normal
>> suspend-resume (dc6 entry/exit)
>> no need to reload the firmware again as dmc will take care of it. But during
>> suspend/hibernation
>> dmc will not restore the firmware. In that case driver need to reload it
>> again. I do not know
>> how to differentiate pm-suspend and suspend-hibernation and thought both the
>> cases
>> intel_runtime_resume() will be called where we can check the h/w state and
>> reload the
>> firmware if dmc is not restored.
>>
>> In bxt, during driver load first time we load the firmware, during normal
>> suspend-resume
>> display engine will enter into dc9 and dmc will not restore the firmware. So
>> every
>> suspend-resume we need to reload the firmware.
>> >- What exactly is the reason why we can't make that decision statically in
>> >   the code (by calling csr_load at the right spots).
>> As I mentioned before in case of skylake can we differentiate between
>> "resume from pm-suspend" with "resume from suspend-hibernation" inside
>> driver?
>>
>> In case of broxton, every time we need to reload, so we can decide
>> statically.
>
> Of course we can differentiate between all the different resume paths, and
> we also have a per-platform split to take care of bxt vs. skl. And there
> are actually 3 different resume paths:
>
> - runtime PM resume. This calls the runtime_resume hook. It sounds like on
>   skl we should _not_ load the csr firmware, but on bxt we should load it.
>   This can be fixed by removing the intel_csr_load_program call from
>   skl_resume_prepare.
> - resume from hibernate-to-disk (i.e. system completely off, state stored
>   on the swap partition) is done by calling the thaw callbacks.
> - resume from suspend-to-mem (i.e. system in low-power with only memory
>   in self-refresh, all state stored in memory) is done by calling the
>   resume callbacks.
>
> For i915 we use unified handlers in our dev_pm_ops for both thaw and
> resume, but it sounds like that won't be a problem for skl/bxt since we
> need to reload the csr firmware in all cases. Although I'm not perfectly
> sure since you don't explain what kind of resume you mean exactly (since
> you don't use the linux names for them).
>
> Anyway it sounds like we can replace this patch by one where we remove
> that errornous csr load call from skl runtime pm resume and that's all.
> But I suggest to make sure we get this right we keep the check you're
> adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
> this is going wrong again. Like this:
>
>         if (WARN_ON(csr_loaded_already()))
>                 return;
>
> Also when redoing the commits please explain in detail what exactly are
> the requirements like you've done above, but please use the standard linux
> names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".

Ok hooray there's more suspend-to-something things I've totally missed:
- suspend-to-idle (done by cat freeze > /sys/power/state) and
- suspend (done by cat suspend > /sys/power/state)

And apparently there's really no way to drivers to tell them apart.
Rafael, is there really no way for drivers to take different paths for
these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
and didn't spot anything.

Also we're completely missing test coverage for that in igt. That is
something that needs to be fixed asap (yet another case of
combinatorial explosion in igt tests, yay). And at least one of those
suspend-to-idle testcase better be in the BAT.
-Daniel
Daniel Vetter Sept. 23, 2015, 4:28 p.m. UTC | #15
Actually add Rafael this time around ...
-Daniel

On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
>>>
>>>
>>> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
>>> >On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>>> >>
>>> >>On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>>> >>>On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>> >>>>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>> >>>>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>> >>>>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>> >>>>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>> >>>>>>>>Dmc will restore the csr program except DC9, cold boot,
>>> >>>>>>>>warm reset, PCI function level reset, and hibernate/suspend.
>>> >>>>>>>>
>>> >>>>>>>>intel_csr_load_program() function is used to load the firmware
>>> >>>>>>>>data from kernel memory to csr address space.
>>> >>>>>>>>
>>> >>>>>>>>All values of csr address space will be zero if it got reset and
>>> >>>>>>>>the first byte of csr program is always a non-zero if firmware
>>> >>>>>>>>is loaded successfuly. Based on hardware status will load the
>>> >>>>>>>>firmware.
>>> >>>>>>>>
>>> >>>>>>>>Without this condition check if we overwrite the firmware data the
>>> >>>>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>> >>>>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
>>> >>>>>>So only WARN_ON will not help.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>>>v1: Initial version.
>>> >>>>>>>>
>>> >>>>>>>>v2: Based on review comments from Daniel,
>>> >>>>>>>>- Added a check to know hardware status and load the firmware if not loaded.
>>> >>>>>>>>
>>> >>>>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
>>> >>>>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
>>> >>>>>>>>Cc: Imre Deak <imre.deak@intel.com>
>>> >>>>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
>>> >>>>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>> >>>>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>> >>>>>>>>---
>>> >>>>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>> >>>>>>>>  1 file changed, 9 insertions(+)
>>> >>>>>>>>
>>> >>>>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>> >>>>>>>>index ba1ae03..682cc26 100644
>>> >>>>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
>>> >>>>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
>>> >>>>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>> >>>>>>>>              return;
>>> >>>>>>>>      }
>>> >>>>>>>>+     /*
>>> >>>>>>>>+      * Dmc will restore the csr the program except DC9, cold boot,
>>> >>>>>>>>+      * warm reset, PCI function level reset, and hibernate/suspend.
>>> >>>>>>>>+      * This condition will help to check if csr address space is reset/
>>> >>>>>>>>+      * not loaded.
>>> >>>>>>>>+      */
>>> >>>>>>>Atm we call this from driver load and resume, which doesn seem to cover
>>> >>>>>>>all the cases you mention in the comment. Should this be a WARN_ON
>>> >>>>>>>instead? Or do we have troubles in our init sequence where we load too
>>> >>>>>>>many times?
>>> >>>>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>> >>>>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>> >>>>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>> >>>>>>and cold boot(not loaded).
>>> >>>>>>
>>> >>>>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>> >>>>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>> >>>>>>
>>> >>>>>>Can the below comment more clear to you.
>>> >>>>>>
>>> >>>>>>        /*
>>> >>>>>>         * Dmc will restore the csr the program except DC9, cold boot,
>>> >>>>>>         * warm reset, PCI function level reset, and hibernate/suspend.
>>> >>>>>>         * If firmware is restored by dmc then no need to load again which
>>> >>>>>>         * will keep the dc5/dc6 counter exposed by firmware.
>>> >>>>>>         */
>>> >>>>>>
>>> >>>>>>No issue in init sequence.
>>> >>>>>That seems to still cover all the callers of the function afaics - we do
>>> >>>>>pci resets over suspend resume unconditionally. So I still don't
>>> >>>>>understand where exactly we try to load the dmc firmware in i915.ko when
>>> >>>>>it's already loaded.
>>> >>>>During resume intel_csr_load_program() will be called from
>>> >>>>intel_runtime_resume().
>>> >>>>
>>> >>>>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>> >>>>
>>> >>>>During Pc10 entry testing I can see dmc is restoring back the firmware always,
>>> >>>>but as you mentioned pci-reset can happen unconditionally, but still then
>>> >>>>also during resume intel_runtime_resume() will be called and based on
>>> >>>>register read of csr-base-address firmware loading will happen.
>>> >>>But in your comment you're saying it won't get restored in case of dc9 and
>>> >>>suspend. So that seems to mismatch what you're saying here (and what the
>>> >>>commit message says) and what the code does. And this function here is
>>> >>>called for resume after suspend/hibernate only.
>>> >>pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>>> >>I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>>> >>entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>>> >>for broxton which is not present for skylake.
>>> >I have no idea at all about different pc levels on skl. What I'm talking
>>> >about is system suspend/resume and driver load, which are the places this
>>> >function gets called. At least afaics.
>>> >
>>> >>Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>>> >>execution flow will be different in case of suspend/resume which I think is confusing
>>> >>you.
>>> >That seems like really important information. What's different on bxt?
>>> >These are the kind of details you should explain in the commit message ...
>>> >
>>> >>I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>>> >>and itz software design for specific platform. Another point - as dmc related code for
>>> >>broxton is not merged better first we close design for skylake. Now, I have added dc9
>>> >>description in comment thinking of future. If you want I can remove for now and later
>>> >>can add in bxt patch series for enabling dmc. Will wait for your reply.
>>> >This question here isn't about the overall design and how to handle power
>>> >wells in skl/bxt. That's a separate discussion and tracked somewhere else.
>>> >I'm really just confused about when exactly we need to reload to firmware,
>>> >and why we need a runtime check for that. Normally we should know when to
>>> >reload the firmware and just either reload or not, without checking hw
>>> >state. And I don't like checking for hw state since at least in the past
>>> >that kind of code ended up being fragile - it's an illusion that it does
>>> >the right thing no matter what, since often there's other tricky ordering
>>> >constraints. And if you have automatic duct-tape like then no one will
>>> >ever spot those other, harder to spot issues, until an expensive customer
>>> >escalation happens.
>>> >
>>> >So what I want to know here is:
>>> >- When exactly do we need to reload dmc firmware.
>>> In skl, during driver load first time we load the firmware, during normal
>>> suspend-resume (dc6 entry/exit)
>>> no need to reload the firmware again as dmc will take care of it. But during
>>> suspend/hibernation
>>> dmc will not restore the firmware. In that case driver need to reload it
>>> again. I do not know
>>> how to differentiate pm-suspend and suspend-hibernation and thought both the
>>> cases
>>> intel_runtime_resume() will be called where we can check the h/w state and
>>> reload the
>>> firmware if dmc is not restored.
>>>
>>> In bxt, during driver load first time we load the firmware, during normal
>>> suspend-resume
>>> display engine will enter into dc9 and dmc will not restore the firmware. So
>>> every
>>> suspend-resume we need to reload the firmware.
>>> >- What exactly is the reason why we can't make that decision statically in
>>> >   the code (by calling csr_load at the right spots).
>>> As I mentioned before in case of skylake can we differentiate between
>>> "resume from pm-suspend" with "resume from suspend-hibernation" inside
>>> driver?
>>>
>>> In case of broxton, every time we need to reload, so we can decide
>>> statically.
>>
>> Of course we can differentiate between all the different resume paths, and
>> we also have a per-platform split to take care of bxt vs. skl. And there
>> are actually 3 different resume paths:
>>
>> - runtime PM resume. This calls the runtime_resume hook. It sounds like on
>>   skl we should _not_ load the csr firmware, but on bxt we should load it.
>>   This can be fixed by removing the intel_csr_load_program call from
>>   skl_resume_prepare.
>> - resume from hibernate-to-disk (i.e. system completely off, state stored
>>   on the swap partition) is done by calling the thaw callbacks.
>> - resume from suspend-to-mem (i.e. system in low-power with only memory
>>   in self-refresh, all state stored in memory) is done by calling the
>>   resume callbacks.
>>
>> For i915 we use unified handlers in our dev_pm_ops for both thaw and
>> resume, but it sounds like that won't be a problem for skl/bxt since we
>> need to reload the csr firmware in all cases. Although I'm not perfectly
>> sure since you don't explain what kind of resume you mean exactly (since
>> you don't use the linux names for them).
>>
>> Anyway it sounds like we can replace this patch by one where we remove
>> that errornous csr load call from skl runtime pm resume and that's all.
>> But I suggest to make sure we get this right we keep the check you're
>> adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
>> this is going wrong again. Like this:
>>
>>         if (WARN_ON(csr_loaded_already()))
>>                 return;
>>
>> Also when redoing the commits please explain in detail what exactly are
>> the requirements like you've done above, but please use the standard linux
>> names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
>
> Ok hooray there's more suspend-to-something things I've totally missed:
> - suspend-to-idle (done by cat freeze > /sys/power/state) and
> - suspend (done by cat suspend > /sys/power/state)
>
> And apparently there's really no way to drivers to tell them apart.
> Rafael, is there really no way for drivers to take different paths for
> these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
> and didn't spot anything.
>
> Also we're completely missing test coverage for that in igt. That is
> something that needs to be fixed asap (yet another case of
> combinatorial explosion in igt tests, yay). And at least one of those
> suspend-to-idle testcase better be in the BAT.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Daniel Vetter Sept. 23, 2015, 5:17 p.m. UTC | #16
acpi_target_system_state() seems to be almost the thing we're looking
for, except that it's only valid in the suspend callbacks since it
gets reset to ACPI_STATE_S0 when resuming. So probably we want
something else ...
-Daniel

On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> Actually add Rafael this time around ...
> -Daniel
>
> On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
>>>>
>>>>
>>>> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
>>>> >On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>>>> >>
>>>> >>On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>>>> >>>On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>>> >>>>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>>> >>>>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>> >>>>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>> >>>>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>> >>>>>>>>Dmc will restore the csr program except DC9, cold boot,
>>>> >>>>>>>>warm reset, PCI function level reset, and hibernate/suspend.
>>>> >>>>>>>>
>>>> >>>>>>>>intel_csr_load_program() function is used to load the firmware
>>>> >>>>>>>>data from kernel memory to csr address space.
>>>> >>>>>>>>
>>>> >>>>>>>>All values of csr address space will be zero if it got reset and
>>>> >>>>>>>>the first byte of csr program is always a non-zero if firmware
>>>> >>>>>>>>is loaded successfuly. Based on hardware status will load the
>>>> >>>>>>>>firmware.
>>>> >>>>>>>>
>>>> >>>>>>>>Without this condition check if we overwrite the firmware data the
>>>> >>>>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>> >>>>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>> >>>>>>So only WARN_ON will not help.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>>>v1: Initial version.
>>>> >>>>>>>>
>>>> >>>>>>>>v2: Based on review comments from Daniel,
>>>> >>>>>>>>- Added a check to know hardware status and load the firmware if not loaded.
>>>> >>>>>>>>
>>>> >>>>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>> >>>>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>> >>>>>>>>Cc: Imre Deak <imre.deak@intel.com>
>>>> >>>>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>> >>>>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>> >>>>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>> >>>>>>>>---
>>>> >>>>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>> >>>>>>>>  1 file changed, 9 insertions(+)
>>>> >>>>>>>>
>>>> >>>>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>> >>>>>>>>index ba1ae03..682cc26 100644
>>>> >>>>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
>>>> >>>>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
>>>> >>>>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>> >>>>>>>>              return;
>>>> >>>>>>>>      }
>>>> >>>>>>>>+     /*
>>>> >>>>>>>>+      * Dmc will restore the csr the program except DC9, cold boot,
>>>> >>>>>>>>+      * warm reset, PCI function level reset, and hibernate/suspend.
>>>> >>>>>>>>+      * This condition will help to check if csr address space is reset/
>>>> >>>>>>>>+      * not loaded.
>>>> >>>>>>>>+      */
>>>> >>>>>>>Atm we call this from driver load and resume, which doesn seem to cover
>>>> >>>>>>>all the cases you mention in the comment. Should this be a WARN_ON
>>>> >>>>>>>instead? Or do we have troubles in our init sequence where we load too
>>>> >>>>>>>many times?
>>>> >>>>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>> >>>>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>> >>>>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>> >>>>>>and cold boot(not loaded).
>>>> >>>>>>
>>>> >>>>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>> >>>>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>> >>>>>>
>>>> >>>>>>Can the below comment more clear to you.
>>>> >>>>>>
>>>> >>>>>>        /*
>>>> >>>>>>         * Dmc will restore the csr the program except DC9, cold boot,
>>>> >>>>>>         * warm reset, PCI function level reset, and hibernate/suspend.
>>>> >>>>>>         * If firmware is restored by dmc then no need to load again which
>>>> >>>>>>         * will keep the dc5/dc6 counter exposed by firmware.
>>>> >>>>>>         */
>>>> >>>>>>
>>>> >>>>>>No issue in init sequence.
>>>> >>>>>That seems to still cover all the callers of the function afaics - we do
>>>> >>>>>pci resets over suspend resume unconditionally. So I still don't
>>>> >>>>>understand where exactly we try to load the dmc firmware in i915.ko when
>>>> >>>>>it's already loaded.
>>>> >>>>During resume intel_csr_load_program() will be called from
>>>> >>>>intel_runtime_resume().
>>>> >>>>
>>>> >>>>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>>> >>>>
>>>> >>>>During Pc10 entry testing I can see dmc is restoring back the firmware always,
>>>> >>>>but as you mentioned pci-reset can happen unconditionally, but still then
>>>> >>>>also during resume intel_runtime_resume() will be called and based on
>>>> >>>>register read of csr-base-address firmware loading will happen.
>>>> >>>But in your comment you're saying it won't get restored in case of dc9 and
>>>> >>>suspend. So that seems to mismatch what you're saying here (and what the
>>>> >>>commit message says) and what the code does. And this function here is
>>>> >>>called for resume after suspend/hibernate only.
>>>> >>pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>>>> >>I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>>>> >>entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>>>> >>for broxton which is not present for skylake.
>>>> >I have no idea at all about different pc levels on skl. What I'm talking
>>>> >about is system suspend/resume and driver load, which are the places this
>>>> >function gets called. At least afaics.
>>>> >
>>>> >>Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>>>> >>execution flow will be different in case of suspend/resume which I think is confusing
>>>> >>you.
>>>> >That seems like really important information. What's different on bxt?
>>>> >These are the kind of details you should explain in the commit message ...
>>>> >
>>>> >>I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>>>> >>and itz software design for specific platform. Another point - as dmc related code for
>>>> >>broxton is not merged better first we close design for skylake. Now, I have added dc9
>>>> >>description in comment thinking of future. If you want I can remove for now and later
>>>> >>can add in bxt patch series for enabling dmc. Will wait for your reply.
>>>> >This question here isn't about the overall design and how to handle power
>>>> >wells in skl/bxt. That's a separate discussion and tracked somewhere else.
>>>> >I'm really just confused about when exactly we need to reload to firmware,
>>>> >and why we need a runtime check for that. Normally we should know when to
>>>> >reload the firmware and just either reload or not, without checking hw
>>>> >state. And I don't like checking for hw state since at least in the past
>>>> >that kind of code ended up being fragile - it's an illusion that it does
>>>> >the right thing no matter what, since often there's other tricky ordering
>>>> >constraints. And if you have automatic duct-tape like then no one will
>>>> >ever spot those other, harder to spot issues, until an expensive customer
>>>> >escalation happens.
>>>> >
>>>> >So what I want to know here is:
>>>> >- When exactly do we need to reload dmc firmware.
>>>> In skl, during driver load first time we load the firmware, during normal
>>>> suspend-resume (dc6 entry/exit)
>>>> no need to reload the firmware again as dmc will take care of it. But during
>>>> suspend/hibernation
>>>> dmc will not restore the firmware. In that case driver need to reload it
>>>> again. I do not know
>>>> how to differentiate pm-suspend and suspend-hibernation and thought both the
>>>> cases
>>>> intel_runtime_resume() will be called where we can check the h/w state and
>>>> reload the
>>>> firmware if dmc is not restored.
>>>>
>>>> In bxt, during driver load first time we load the firmware, during normal
>>>> suspend-resume
>>>> display engine will enter into dc9 and dmc will not restore the firmware. So
>>>> every
>>>> suspend-resume we need to reload the firmware.
>>>> >- What exactly is the reason why we can't make that decision statically in
>>>> >   the code (by calling csr_load at the right spots).
>>>> As I mentioned before in case of skylake can we differentiate between
>>>> "resume from pm-suspend" with "resume from suspend-hibernation" inside
>>>> driver?
>>>>
>>>> In case of broxton, every time we need to reload, so we can decide
>>>> statically.
>>>
>>> Of course we can differentiate between all the different resume paths, and
>>> we also have a per-platform split to take care of bxt vs. skl. And there
>>> are actually 3 different resume paths:
>>>
>>> - runtime PM resume. This calls the runtime_resume hook. It sounds like on
>>>   skl we should _not_ load the csr firmware, but on bxt we should load it.
>>>   This can be fixed by removing the intel_csr_load_program call from
>>>   skl_resume_prepare.
>>> - resume from hibernate-to-disk (i.e. system completely off, state stored
>>>   on the swap partition) is done by calling the thaw callbacks.
>>> - resume from suspend-to-mem (i.e. system in low-power with only memory
>>>   in self-refresh, all state stored in memory) is done by calling the
>>>   resume callbacks.
>>>
>>> For i915 we use unified handlers in our dev_pm_ops for both thaw and
>>> resume, but it sounds like that won't be a problem for skl/bxt since we
>>> need to reload the csr firmware in all cases. Although I'm not perfectly
>>> sure since you don't explain what kind of resume you mean exactly (since
>>> you don't use the linux names for them).
>>>
>>> Anyway it sounds like we can replace this patch by one where we remove
>>> that errornous csr load call from skl runtime pm resume and that's all.
>>> But I suggest to make sure we get this right we keep the check you're
>>> adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
>>> this is going wrong again. Like this:
>>>
>>>         if (WARN_ON(csr_loaded_already()))
>>>                 return;
>>>
>>> Also when redoing the commits please explain in detail what exactly are
>>> the requirements like you've done above, but please use the standard linux
>>> names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
>>
>> Ok hooray there's more suspend-to-something things I've totally missed:
>> - suspend-to-idle (done by cat freeze > /sys/power/state) and
>> - suspend (done by cat suspend > /sys/power/state)
>>
>> And apparently there's really no way to drivers to tell them apart.
>> Rafael, is there really no way for drivers to take different paths for
>> these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
>> and didn't spot anything.
>>
>> Also we're completely missing test coverage for that in igt. That is
>> something that needs to be fixed asap (yet another case of
>> combinatorial explosion in igt tests, yay). And at least one of those
>> suspend-to-idle testcase better be in the BAT.
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Wysocki, Rafael J Sept. 23, 2015, 8:49 p.m. UTC | #17
On 9/23/2015 7:17 PM, Daniel Vetter wrote:
> acpi_target_system_state() seems to be almost the thing we're looking
> for, except that it's only valid in the suspend callbacks since it
> gets reset to ACPI_STATE_S0 when resuming. So probably we want
> something else ...

Right.

The idea is to add a way for drivers to check if
(a) suspend is going to enter the BIOS
(b) resume has been triggered by the BIOS
and that's really what drivers need to know.

For suspend-to-idle those two will return false and for S3 they'll 
return true.

Would that help?

Thanks,
Rafael


> On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> Actually add Rafael this time around ...
>> -Daniel
>>
>> On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
>>>>>
>>>>> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
>>>>>> On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>>>>>>> On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>>>>>>>> On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>>>>>>>> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>>>>>>>>> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>>>>>>>>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>>>>>>>>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>>>>>>>>>>> Dmc will restore the csr program except DC9, cold boot,
>>>>>>>>>>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>
>>>>>>>>>>>>> intel_csr_load_program() function is used to load the firmware
>>>>>>>>>>>>> data from kernel memory to csr address space.
>>>>>>>>>>>>>
>>>>>>>>>>>>> All values of csr address space will be zero if it got reset and
>>>>>>>>>>>>> the first byte of csr program is always a non-zero if firmware
>>>>>>>>>>>>> is loaded successfuly. Based on hardware status will load the
>>>>>>>>>>>>> firmware.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Without this condition check if we overwrite the firmware data the
>>>>>>>>>>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>>>>>>>>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>>>>>>>>> So only WARN_ON will not help.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> v1: Initial version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> v2: Based on review comments from Daniel,
>>>>>>>>>>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>>>>>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>>>>>>>>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>>>>>>>>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>>>>>>>>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>>>>>>>>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>>>>>>>>>>   1 file changed, 9 insertions(+)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>> index ba1ae03..682cc26 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>>>>>>>>>>               return;
>>>>>>>>>>>>>       }
>>>>>>>>>>>>> +     /*
>>>>>>>>>>>>> +      * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>>> +      * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>> +      * This condition will help to check if csr address space is reset/
>>>>>>>>>>>>> +      * not loaded.
>>>>>>>>>>>>> +      */
>>>>>>>>>>>> Atm we call this from driver load and resume, which doesn seem to cover
>>>>>>>>>>>> all the cases you mention in the comment. Should this be a WARN_ON
>>>>>>>>>>>> instead? Or do we have troubles in our init sequence where we load too
>>>>>>>>>>>> many times?
>>>>>>>>>>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>>>>>>>>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>>>>>>>>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>>>>>>>>> and cold boot(not loaded).
>>>>>>>>>>>
>>>>>>>>>>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>>>>>>>>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>>>>>>>>>
>>>>>>>>>>> Can the below comment more clear to you.
>>>>>>>>>>>
>>>>>>>>>>>         /*
>>>>>>>>>>>          * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>          * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>          * If firmware is restored by dmc then no need to load again which
>>>>>>>>>>>          * will keep the dc5/dc6 counter exposed by firmware.
>>>>>>>>>>>          */
>>>>>>>>>>>
>>>>>>>>>>> No issue in init sequence.
>>>>>>>>>> That seems to still cover all the callers of the function afaics - we do
>>>>>>>>>> pci resets over suspend resume unconditionally. So I still don't
>>>>>>>>>> understand where exactly we try to load the dmc firmware in i915.ko when
>>>>>>>>>> it's already loaded.
>>>>>>>>> During resume intel_csr_load_program() will be called from
>>>>>>>>> intel_runtime_resume().
>>>>>>>>>
>>>>>>>>> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>>>>>>>>
>>>>>>>>> During Pc10 entry testing I can see dmc is restoring back the firmware always,
>>>>>>>>> but as you mentioned pci-reset can happen unconditionally, but still then
>>>>>>>>> also during resume intel_runtime_resume() will be called and based on
>>>>>>>>> register read of csr-base-address firmware loading will happen.
>>>>>>>> But in your comment you're saying it won't get restored in case of dc9 and
>>>>>>>> suspend. So that seems to mismatch what you're saying here (and what the
>>>>>>>> commit message says) and what the code does. And this function here is
>>>>>>>> called for resume after suspend/hibernate only.
>>>>>>> pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>>>>>>> I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>>>>>>> entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>>>>>>> for broxton which is not present for skylake.
>>>>>> I have no idea at all about different pc levels on skl. What I'm talking
>>>>>> about is system suspend/resume and driver load, which are the places this
>>>>>> function gets called. At least afaics.
>>>>>>
>>>>>>> Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>>>>>>> execution flow will be different in case of suspend/resume which I think is confusing
>>>>>>> you.
>>>>>> That seems like really important information. What's different on bxt?
>>>>>> These are the kind of details you should explain in the commit message ...
>>>>>>
>>>>>>> I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>>>>>>> and itz software design for specific platform. Another point - as dmc related code for
>>>>>>> broxton is not merged better first we close design for skylake. Now, I have added dc9
>>>>>>> description in comment thinking of future. If you want I can remove for now and later
>>>>>>> can add in bxt patch series for enabling dmc. Will wait for your reply.
>>>>>> This question here isn't about the overall design and how to handle power
>>>>>> wells in skl/bxt. That's a separate discussion and tracked somewhere else.
>>>>>> I'm really just confused about when exactly we need to reload to firmware,
>>>>>> and why we need a runtime check for that. Normally we should know when to
>>>>>> reload the firmware and just either reload or not, without checking hw
>>>>>> state. And I don't like checking for hw state since at least in the past
>>>>>> that kind of code ended up being fragile - it's an illusion that it does
>>>>>> the right thing no matter what, since often there's other tricky ordering
>>>>>> constraints. And if you have automatic duct-tape like then no one will
>>>>>> ever spot those other, harder to spot issues, until an expensive customer
>>>>>> escalation happens.
>>>>>>
>>>>>> So what I want to know here is:
>>>>>> - When exactly do we need to reload dmc firmware.
>>>>> In skl, during driver load first time we load the firmware, during normal
>>>>> suspend-resume (dc6 entry/exit)
>>>>> no need to reload the firmware again as dmc will take care of it. But during
>>>>> suspend/hibernation
>>>>> dmc will not restore the firmware. In that case driver need to reload it
>>>>> again. I do not know
>>>>> how to differentiate pm-suspend and suspend-hibernation and thought both the
>>>>> cases
>>>>> intel_runtime_resume() will be called where we can check the h/w state and
>>>>> reload the
>>>>> firmware if dmc is not restored.
>>>>>
>>>>> In bxt, during driver load first time we load the firmware, during normal
>>>>> suspend-resume
>>>>> display engine will enter into dc9 and dmc will not restore the firmware. So
>>>>> every
>>>>> suspend-resume we need to reload the firmware.
>>>>>> - What exactly is the reason why we can't make that decision statically in
>>>>>>    the code (by calling csr_load at the right spots).
>>>>> As I mentioned before in case of skylake can we differentiate between
>>>>> "resume from pm-suspend" with "resume from suspend-hibernation" inside
>>>>> driver?
>>>>>
>>>>> In case of broxton, every time we need to reload, so we can decide
>>>>> statically.
>>>> Of course we can differentiate between all the different resume paths, and
>>>> we also have a per-platform split to take care of bxt vs. skl. And there
>>>> are actually 3 different resume paths:
>>>>
>>>> - runtime PM resume. This calls the runtime_resume hook. It sounds like on
>>>>    skl we should _not_ load the csr firmware, but on bxt we should load it.
>>>>    This can be fixed by removing the intel_csr_load_program call from
>>>>    skl_resume_prepare.
>>>> - resume from hibernate-to-disk (i.e. system completely off, state stored
>>>>    on the swap partition) is done by calling the thaw callbacks.
>>>> - resume from suspend-to-mem (i.e. system in low-power with only memory
>>>>    in self-refresh, all state stored in memory) is done by calling the
>>>>    resume callbacks.
>>>>
>>>> For i915 we use unified handlers in our dev_pm_ops for both thaw and
>>>> resume, but it sounds like that won't be a problem for skl/bxt since we
>>>> need to reload the csr firmware in all cases. Although I'm not perfectly
>>>> sure since you don't explain what kind of resume you mean exactly (since
>>>> you don't use the linux names for them).
>>>>
>>>> Anyway it sounds like we can replace this patch by one where we remove
>>>> that errornous csr load call from skl runtime pm resume and that's all.
>>>> But I suggest to make sure we get this right we keep the check you're
>>>> adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
>>>> this is going wrong again. Like this:
>>>>
>>>>          if (WARN_ON(csr_loaded_already()))
>>>>                  return;
>>>>
>>>> Also when redoing the commits please explain in detail what exactly are
>>>> the requirements like you've done above, but please use the standard linux
>>>> names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
>>> Ok hooray there's more suspend-to-something things I've totally missed:
>>> - suspend-to-idle (done by cat freeze > /sys/power/state) and
>>> - suspend (done by cat suspend > /sys/power/state)
>>>
>>> And apparently there's really no way to drivers to tell them apart.
>>> Rafael, is there really no way for drivers to take different paths for
>>> these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
>>> and didn't spot anything.
>>>
>>> Also we're completely missing test coverage for that in igt. That is
>>> something that needs to be fixed asap (yet another case of
>>> combinatorial explosion in igt tests, yay). And at least one of those
>>> suspend-to-idle testcase better be in the BAT.
>>> -Daniel
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
>
Daniel Vetter Sept. 28, 2015, 6:52 a.m. UTC | #18
On Wed, Sep 23, 2015 at 10:49:36PM +0200, Rafael J. Wysocki wrote:
> On 9/23/2015 7:17 PM, Daniel Vetter wrote:
> >acpi_target_system_state() seems to be almost the thing we're looking
> >for, except that it's only valid in the suspend callbacks since it
> >gets reset to ACPI_STATE_S0 when resuming. So probably we want
> >something else ...
> 
> Right.
> 
> The idea is to add a way for drivers to check if
> (a) suspend is going to enter the BIOS
> (b) resume has been triggered by the BIOS
> and that's really what drivers need to know.
> 
> For suspend-to-idle those two will return false and for S3 they'll return
> true.
> 
> Would that help?

Not sure that matches exaxtly what we'd need here ... Essentially we need
to know whether we've been in S3/S4 (firmware has been eaten) or in one of
the higher suspend-to-idle/standby states (firmware still alive, don't
disturb it). Additional fun that just crossed my mind is that if the
suspend-to-mem is aborted (some other driver failed) then that function
should _not_ indicate that we've been in S3. So maybe something like

acpi_source_system_state() which usually is S0 and only when acpi
successfully went into the suspend state in platform_suspend_ops->enter it
gets set to the value of acpi_target_system_state. And then reset once the
resume has completed. I think that would be what we'd want here.

Anyway I'll pull in Animesh series meanwhile, amended with a FIXME
comment.
-Daniel

> 
> Thanks,
> Rafael
> 
> 
> >On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>Actually add Rafael this time around ...
> >>-Daniel
> >>
> >>On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
> >>>>>
> >>>>>On 9/14/2015 1:16 PM, Daniel Vetter wrote:
> >>>>>>On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
> >>>>>>>On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> >>>>>>>>On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
> >>>>>>>>>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> >>>>>>>>>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> >>>>>>>>>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >>>>>>>>>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>>>>>>>>>>>>Dmc will restore the csr program except DC9, cold boot,
> >>>>>>>>>>>>>warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>intel_csr_load_program() function is used to load the firmware
> >>>>>>>>>>>>>data from kernel memory to csr address space.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>All values of csr address space will be zero if it got reset and
> >>>>>>>>>>>>>the first byte of csr program is always a non-zero if firmware
> >>>>>>>>>>>>>is loaded successfuly. Based on hardware status will load the
> >>>>>>>>>>>>>firmware.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Without this condition check if we overwrite the firmware data the
> >>>>>>>>>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >>>>>>>>>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
> >>>>>>>>>>>So only WARN_ON will not help.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>v1: Initial version.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>v2: Based on review comments from Daniel,
> >>>>>>>>>>>>>- Added a check to know hardware status and load the firmware if not loaded.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
> >>>>>>>>>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
> >>>>>>>>>>>>>Cc: Imre Deak <imre.deak@intel.com>
> >>>>>>>>>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
> >>>>>>>>>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >>>>>>>>>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >>>>>>>>>>>>>---
> >>>>>>>>>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>>>>>>>>>>>>  1 file changed, 9 insertions(+)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>index ba1ae03..682cc26 100644
> >>>>>>>>>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>>>>>>>>>>>>              return;
> >>>>>>>>>>>>>      }
> >>>>>>>>>>>>>+     /*
> >>>>>>>>>>>>>+      * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>>>>>>>>+      * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>+      * This condition will help to check if csr address space is reset/
> >>>>>>>>>>>>>+      * not loaded.
> >>>>>>>>>>>>>+      */
> >>>>>>>>>>>>Atm we call this from driver load and resume, which doesn seem to cover
> >>>>>>>>>>>>all the cases you mention in the comment. Should this be a WARN_ON
> >>>>>>>>>>>>instead? Or do we have troubles in our init sequence where we load too
> >>>>>>>>>>>>many times?
> >>>>>>>>>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> >>>>>>>>>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> >>>>>>>>>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> >>>>>>>>>>>and cold boot(not loaded).
> >>>>>>>>>>>
> >>>>>>>>>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> >>>>>>>>>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> >>>>>>>>>>>
> >>>>>>>>>>>Can the below comment more clear to you.
> >>>>>>>>>>>
> >>>>>>>>>>>        /*
> >>>>>>>>>>>         * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>>>>>>         * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>         * If firmware is restored by dmc then no need to load again which
> >>>>>>>>>>>         * will keep the dc5/dc6 counter exposed by firmware.
> >>>>>>>>>>>         */
> >>>>>>>>>>>
> >>>>>>>>>>>No issue in init sequence.
> >>>>>>>>>>That seems to still cover all the callers of the function afaics - we do
> >>>>>>>>>>pci resets over suspend resume unconditionally. So I still don't
> >>>>>>>>>>understand where exactly we try to load the dmc firmware in i915.ko when
> >>>>>>>>>>it's already loaded.
> >>>>>>>>>During resume intel_csr_load_program() will be called from
> >>>>>>>>>intel_runtime_resume().
> >>>>>>>>>
> >>>>>>>>>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
> >>>>>>>>>
> >>>>>>>>>During Pc10 entry testing I can see dmc is restoring back the firmware always,
> >>>>>>>>>but as you mentioned pci-reset can happen unconditionally, but still then
> >>>>>>>>>also during resume intel_runtime_resume() will be called and based on
> >>>>>>>>>register read of csr-base-address firmware loading will happen.
> >>>>>>>>But in your comment you're saying it won't get restored in case of dc9 and
> >>>>>>>>suspend. So that seems to mismatch what you're saying here (and what the
> >>>>>>>>commit message says) and what the code does. And this function here is
> >>>>>>>>called for resume after suspend/hibernate only.
> >>>>>>>pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
> >>>>>>>I think you are confusing between dc6 and dc9. Pc10 can be achieved by
> >>>>>>>entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
> >>>>>>>for broxton which is not present for skylake.
> >>>>>>I have no idea at all about different pc levels on skl. What I'm talking
> >>>>>>about is system suspend/resume and driver load, which are the places this
> >>>>>>function gets called. At least afaics.
> >>>>>>
> >>>>>>>Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
> >>>>>>>execution flow will be different in case of suspend/resume which I think is confusing
> >>>>>>>you.
> >>>>>>That seems like really important information. What's different on bxt?
> >>>>>>These are the kind of details you should explain in the commit message ...
> >>>>>>
> >>>>>>>I am ready explain you in detail. It will be good if we discuss specific use-case scenario
> >>>>>>>and itz software design for specific platform. Another point - as dmc related code for
> >>>>>>>broxton is not merged better first we close design for skylake. Now, I have added dc9
> >>>>>>>description in comment thinking of future. If you want I can remove for now and later
> >>>>>>>can add in bxt patch series for enabling dmc. Will wait for your reply.
> >>>>>>This question here isn't about the overall design and how to handle power
> >>>>>>wells in skl/bxt. That's a separate discussion and tracked somewhere else.
> >>>>>>I'm really just confused about when exactly we need to reload to firmware,
> >>>>>>and why we need a runtime check for that. Normally we should know when to
> >>>>>>reload the firmware and just either reload or not, without checking hw
> >>>>>>state. And I don't like checking for hw state since at least in the past
> >>>>>>that kind of code ended up being fragile - it's an illusion that it does
> >>>>>>the right thing no matter what, since often there's other tricky ordering
> >>>>>>constraints. And if you have automatic duct-tape like then no one will
> >>>>>>ever spot those other, harder to spot issues, until an expensive customer
> >>>>>>escalation happens.
> >>>>>>
> >>>>>>So what I want to know here is:
> >>>>>>- When exactly do we need to reload dmc firmware.
> >>>>>In skl, during driver load first time we load the firmware, during normal
> >>>>>suspend-resume (dc6 entry/exit)
> >>>>>no need to reload the firmware again as dmc will take care of it. But during
> >>>>>suspend/hibernation
> >>>>>dmc will not restore the firmware. In that case driver need to reload it
> >>>>>again. I do not know
> >>>>>how to differentiate pm-suspend and suspend-hibernation and thought both the
> >>>>>cases
> >>>>>intel_runtime_resume() will be called where we can check the h/w state and
> >>>>>reload the
> >>>>>firmware if dmc is not restored.
> >>>>>
> >>>>>In bxt, during driver load first time we load the firmware, during normal
> >>>>>suspend-resume
> >>>>>display engine will enter into dc9 and dmc will not restore the firmware. So
> >>>>>every
> >>>>>suspend-resume we need to reload the firmware.
> >>>>>>- What exactly is the reason why we can't make that decision statically in
> >>>>>>   the code (by calling csr_load at the right spots).
> >>>>>As I mentioned before in case of skylake can we differentiate between
> >>>>>"resume from pm-suspend" with "resume from suspend-hibernation" inside
> >>>>>driver?
> >>>>>
> >>>>>In case of broxton, every time we need to reload, so we can decide
> >>>>>statically.
> >>>>Of course we can differentiate between all the different resume paths, and
> >>>>we also have a per-platform split to take care of bxt vs. skl. And there
> >>>>are actually 3 different resume paths:
> >>>>
> >>>>- runtime PM resume. This calls the runtime_resume hook. It sounds like on
> >>>>   skl we should _not_ load the csr firmware, but on bxt we should load it.
> >>>>   This can be fixed by removing the intel_csr_load_program call from
> >>>>   skl_resume_prepare.
> >>>>- resume from hibernate-to-disk (i.e. system completely off, state stored
> >>>>   on the swap partition) is done by calling the thaw callbacks.
> >>>>- resume from suspend-to-mem (i.e. system in low-power with only memory
> >>>>   in self-refresh, all state stored in memory) is done by calling the
> >>>>   resume callbacks.
> >>>>
> >>>>For i915 we use unified handlers in our dev_pm_ops for both thaw and
> >>>>resume, but it sounds like that won't be a problem for skl/bxt since we
> >>>>need to reload the csr firmware in all cases. Although I'm not perfectly
> >>>>sure since you don't explain what kind of resume you mean exactly (since
> >>>>you don't use the linux names for them).
> >>>>
> >>>>Anyway it sounds like we can replace this patch by one where we remove
> >>>>that errornous csr load call from skl runtime pm resume and that's all.
> >>>>But I suggest to make sure we get this right we keep the check you're
> >>>>adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
> >>>>this is going wrong again. Like this:
> >>>>
> >>>>         if (WARN_ON(csr_loaded_already()))
> >>>>                 return;
> >>>>
> >>>>Also when redoing the commits please explain in detail what exactly are
> >>>>the requirements like you've done above, but please use the standard linux
> >>>>names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
> >>>Ok hooray there's more suspend-to-something things I've totally missed:
> >>>- suspend-to-idle (done by cat freeze > /sys/power/state) and
> >>>- suspend (done by cat suspend > /sys/power/state)
> >>>
> >>>And apparently there's really no way to drivers to tell them apart.
> >>>Rafael, is there really no way for drivers to take different paths for
> >>>these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
> >>>and didn't spot anything.
> >>>
> >>>Also we're completely missing test coverage for that in igt. That is
> >>>something that needs to be fixed asap (yet another case of
> >>>combinatorial explosion in igt tests, yay). And at least one of those
> >>>suspend-to-idle testcase better be in the BAT.
> >>>-Daniel
> >>>--
> >>>Daniel Vetter
> >>>Software Engineer, Intel Corporation
> >>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>
> >>
> >>--
> >>Daniel Vetter
> >>Software Engineer, Intel Corporation
> >>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >
> >
>
Daniel Vetter Sept. 28, 2015, 7:03 a.m. UTC | #19
On Mon, Sep 07, 2015 at 04:34:30PM +0530, Sunil Kamath wrote:
> On Wednesday 26 August 2015 01:36 AM, Animesh Manna wrote:
> >Dmc will restore the csr program except DC9, cold boot,
> >warm reset, PCI function level reset, and hibernate/suspend.
> >
> >intel_csr_load_program() function is used to load the firmware
> >data from kernel memory to csr address space.
> >
> >All values of csr address space will be zero if it got reset and
> >the first byte of csr program is always a non-zero if firmware
> >is loaded successfuly. Based on hardware status will load the
> >firmware.
> >
> >Without this condition check if we overwrite the firmware data the
> >counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >
> >v1: Initial version.
> >
> >v2: Based on review comments from Daniel,
> >- Added a check to know hardware status and load the firmware if not loaded.
> >
> >Cc: Daniel Vetter <daniel.vetter@intel.com>
> >Cc: Damien Lespiau <damien.lespiau@intel.com>
> >Cc: Imre Deak <imre.deak@intel.com>
> >Cc: Sunil Kamath <sunil.kamath@intel.com>
> >Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >---
> >  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> >diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >index ba1ae03..682cc26 100644
> >--- a/drivers/gpu/drm/i915/intel_csr.c
> >+++ b/drivers/gpu/drm/i915/intel_csr.c
> >@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >  		return;
> >  	}
> >+	/*
> >+	 * Dmc will restore the csr the program except DC9, cold boot,
> >+	 * warm reset, PCI function level reset, and hibernate/suspend.
> >+	 * This condition will help to check if csr address space is reset/
> >+	 * not loaded.
> >+	 */
> >+	if (I915_READ(CSR_PROGRAM_BASE))
> >+		return;
> >+
> >  	mutex_lock(&dev_priv->csr_lock);
> >  	fw_size = dev_priv->csr.dmc_fw_size;
> >  	for (i = 0; i < fw_size; i++)
> 
> Valid fix and patch is ready for merge now.
> 
> Reviewed-by: A.Sunil Kamath <sunil.kamath@intel.com>

Ok I changed the code comment to a FIXME to explain what's going on (DC9
is totally irrelevant if I understand this correctly now) and amended the
commit message with a note about what I think is really going on. Please
double-check.

Also I realized that we have 0 test coverage for suspend-to-idle (and
oversight for merging this feature a few years back). Please add a new igt
testcase (maybe subtest of drv_suspend) to exercise this feature.

Making sure that we have sufficient igt coverage is part of the review and
especially something that _must_ be checked for bugfixes like this.

Thanks, Daniel
Wysocki, Rafael J Sept. 28, 2015, 11:54 p.m. UTC | #20
On 9/28/2015 8:52 AM, Daniel Vetter wrote:
> On Wed, Sep 23, 2015 at 10:49:36PM +0200, Rafael J. Wysocki wrote:
>> On 9/23/2015 7:17 PM, Daniel Vetter wrote:
>>> acpi_target_system_state() seems to be almost the thing we're looking
>>> for, except that it's only valid in the suspend callbacks since it
>>> gets reset to ACPI_STATE_S0 when resuming. So probably we want
>>> something else ...
>> Right.
>>
>> The idea is to add a way for drivers to check if
>> (a) suspend is going to enter the BIOS
>> (b) resume has been triggered by the BIOS
>> and that's really what drivers need to know.
>>
>> For suspend-to-idle those two will return false and for S3 they'll return
>> true.
>>
>> Would that help?
> Not sure that matches exaxtly what we'd need here ... Essentially we need
> to know whether we've been in S3/S4 (firmware has been eaten) or in one of
> the higher suspend-to-idle/standby states (firmware still alive, don't
> disturb it). Additional fun that just crossed my mind is that if the
> suspend-to-mem is aborted (some other driver failed) then that function
> should _not_ indicate that we've been in S3. So maybe something like

So it really looks like the interface I was talking about would be suitable:

pm_suspend_via_firmware() == true -> firmware is going to be eaten (use 
that during suspend if needed)
pm_resume_via_firmware() == true -> firmware was eaten

The latter will only return 'true' if we really have entered the BIOS 
(platform firmware).

> acpi_source_system_state() which usually is S0 and only when acpi
> successfully went into the suspend state in platform_suspend_ops->enter it
> gets set to the value of acpi_target_system_state. And then reset once the
> resume has completed. I think that would be what we'd want here.

We need to new functions like the above, because some things already 
depend on acpi_target_system_state working the way it does currently.

I see no reason to make that ACPI-specific, though, in principle.

> Anyway I'll pull in Animesh series meanwhile, amended with a FIXME comment.

Fine by me.

Thanks,
Rafael


>>> On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> Actually add Rafael this time around ...
>>>> -Daniel
>>>>
>>>> On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>> On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>> On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
>>>>>>> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
>>>>>>>> On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>>>>>>>>> On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>>>>>>>>>> On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>>>>>>>>>> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>>>>>>>>>>> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>>>>>>>>>>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>>>>>>>>>>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>>>>>>>>>>>>> Dmc will restore the csr program except DC9, cold boot,
>>>>>>>>>>>>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> intel_csr_load_program() function is used to load the firmware
>>>>>>>>>>>>>>> data from kernel memory to csr address space.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> All values of csr address space will be zero if it got reset and
>>>>>>>>>>>>>>> the first byte of csr program is always a non-zero if firmware
>>>>>>>>>>>>>>> is loaded successfuly. Based on hardware status will load the
>>>>>>>>>>>>>>> firmware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Without this condition check if we overwrite the firmware data the
>>>>>>>>>>>>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>>>>>>>>>>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>>>>>>>>>>> So only WARN_ON will not help.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> v1: Initial version.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> v2: Based on review comments from Daniel,
>>>>>>>>>>>>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>>>>>>>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>>>>>>>>>>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>>>>>>>>>>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>>>>>>>>>>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>>>>>>>>>>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>>>>>>>>>>>>   1 file changed, 9 insertions(+)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>> index ba1ae03..682cc26 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>>>>>>>>>>>>               return;
>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>> +     /*
>>>>>>>>>>>>>>> +      * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>>>>> +      * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>> +      * This condition will help to check if csr address space is reset/
>>>>>>>>>>>>>>> +      * not loaded.
>>>>>>>>>>>>>>> +      */
>>>>>>>>>>>>>> Atm we call this from driver load and resume, which doesn seem to cover
>>>>>>>>>>>>>> all the cases you mention in the comment. Should this be a WARN_ON
>>>>>>>>>>>>>> instead? Or do we have troubles in our init sequence where we load too
>>>>>>>>>>>>>> many times?
>>>>>>>>>>>>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>>>>>>>>>>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>>>>>>>>>>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>>>>>>>>>>> and cold boot(not loaded).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>>>>>>>>>>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can the below comment more clear to you.
>>>>>>>>>>>>>
>>>>>>>>>>>>>         /*
>>>>>>>>>>>>>          * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>>>          * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>          * If firmware is restored by dmc then no need to load again which
>>>>>>>>>>>>>          * will keep the dc5/dc6 counter exposed by firmware.
>>>>>>>>>>>>>          */
>>>>>>>>>>>>>
>>>>>>>>>>>>> No issue in init sequence.
>>>>>>>>>>>> That seems to still cover all the callers of the function afaics - we do
>>>>>>>>>>>> pci resets over suspend resume unconditionally. So I still don't
>>>>>>>>>>>> understand where exactly we try to load the dmc firmware in i915.ko when
>>>>>>>>>>>> it's already loaded.
>>>>>>>>>>> During resume intel_csr_load_program() will be called from
>>>>>>>>>>> intel_runtime_resume().
>>>>>>>>>>>
>>>>>>>>>>> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>>>>>>>>>>
>>>>>>>>>>> During Pc10 entry testing I can see dmc is restoring back the firmware always,
>>>>>>>>>>> but as you mentioned pci-reset can happen unconditionally, but still then
>>>>>>>>>>> also during resume intel_runtime_resume() will be called and based on
>>>>>>>>>>> register read of csr-base-address firmware loading will happen.
>>>>>>>>>> But in your comment you're saying it won't get restored in case of dc9 and
>>>>>>>>>> suspend. So that seems to mismatch what you're saying here (and what the
>>>>>>>>>> commit message says) and what the code does. And this function here is
>>>>>>>>>> called for resume after suspend/hibernate only.
>>>>>>>>> pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>>>>>>>>> I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>>>>>>>>> entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>>>>>>>>> for broxton which is not present for skylake.
>>>>>>>> I have no idea at all about different pc levels on skl. What I'm talking
>>>>>>>> about is system suspend/resume and driver load, which are the places this
>>>>>>>> function gets called. At least afaics.
>>>>>>>>
>>>>>>>>> Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>>>>>>>>> execution flow will be different in case of suspend/resume which I think is confusing
>>>>>>>>> you.
>>>>>>>> That seems like really important information. What's different on bxt?
>>>>>>>> These are the kind of details you should explain in the commit message ...
>>>>>>>>
>>>>>>>>> I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>>>>>>>>> and itz software design for specific platform. Another point - as dmc related code for
>>>>>>>>> broxton is not merged better first we close design for skylake. Now, I have added dc9
>>>>>>>>> description in comment thinking of future. If you want I can remove for now and later
>>>>>>>>> can add in bxt patch series for enabling dmc. Will wait for your reply.
>>>>>>>> This question here isn't about the overall design and how to handle power
>>>>>>>> wells in skl/bxt. That's a separate discussion and tracked somewhere else.
>>>>>>>> I'm really just confused about when exactly we need to reload to firmware,
>>>>>>>> and why we need a runtime check for that. Normally we should know when to
>>>>>>>> reload the firmware and just either reload or not, without checking hw
>>>>>>>> state. And I don't like checking for hw state since at least in the past
>>>>>>>> that kind of code ended up being fragile - it's an illusion that it does
>>>>>>>> the right thing no matter what, since often there's other tricky ordering
>>>>>>>> constraints. And if you have automatic duct-tape like then no one will
>>>>>>>> ever spot those other, harder to spot issues, until an expensive customer
>>>>>>>> escalation happens.
>>>>>>>>
>>>>>>>> So what I want to know here is:
>>>>>>>> - When exactly do we need to reload dmc firmware.
>>>>>>> In skl, during driver load first time we load the firmware, during normal
>>>>>>> suspend-resume (dc6 entry/exit)
>>>>>>> no need to reload the firmware again as dmc will take care of it. But during
>>>>>>> suspend/hibernation
>>>>>>> dmc will not restore the firmware. In that case driver need to reload it
>>>>>>> again. I do not know
>>>>>>> how to differentiate pm-suspend and suspend-hibernation and thought both the
>>>>>>> cases
>>>>>>> intel_runtime_resume() will be called where we can check the h/w state and
>>>>>>> reload the
>>>>>>> firmware if dmc is not restored.
>>>>>>>
>>>>>>> In bxt, during driver load first time we load the firmware, during normal
>>>>>>> suspend-resume
>>>>>>> display engine will enter into dc9 and dmc will not restore the firmware. So
>>>>>>> every
>>>>>>> suspend-resume we need to reload the firmware.
>>>>>>>> - What exactly is the reason why we can't make that decision statically in
>>>>>>>>    the code (by calling csr_load at the right spots).
>>>>>>> As I mentioned before in case of skylake can we differentiate between
>>>>>>> "resume from pm-suspend" with "resume from suspend-hibernation" inside
>>>>>>> driver?
>>>>>>>
>>>>>>> In case of broxton, every time we need to reload, so we can decide
>>>>>>> statically.
>>>>>> Of course we can differentiate between all the different resume paths, and
>>>>>> we also have a per-platform split to take care of bxt vs. skl. And there
>>>>>> are actually 3 different resume paths:
>>>>>>
>>>>>> - runtime PM resume. This calls the runtime_resume hook. It sounds like on
>>>>>>    skl we should _not_ load the csr firmware, but on bxt we should load it.
>>>>>>    This can be fixed by removing the intel_csr_load_program call from
>>>>>>    skl_resume_prepare.
>>>>>> - resume from hibernate-to-disk (i.e. system completely off, state stored
>>>>>>    on the swap partition) is done by calling the thaw callbacks.
>>>>>> - resume from suspend-to-mem (i.e. system in low-power with only memory
>>>>>>    in self-refresh, all state stored in memory) is done by calling the
>>>>>>    resume callbacks.
>>>>>>
>>>>>> For i915 we use unified handlers in our dev_pm_ops for both thaw and
>>>>>> resume, but it sounds like that won't be a problem for skl/bxt since we
>>>>>> need to reload the csr firmware in all cases. Although I'm not perfectly
>>>>>> sure since you don't explain what kind of resume you mean exactly (since
>>>>>> you don't use the linux names for them).
>>>>>>
>>>>>> Anyway it sounds like we can replace this patch by one where we remove
>>>>>> that errornous csr load call from skl runtime pm resume and that's all.
>>>>>> But I suggest to make sure we get this right we keep the check you're
>>>>>> adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
>>>>>> this is going wrong again. Like this:
>>>>>>
>>>>>>          if (WARN_ON(csr_loaded_already()))
>>>>>>                  return;
>>>>>>
>>>>>> Also when redoing the commits please explain in detail what exactly are
>>>>>> the requirements like you've done above, but please use the standard linux
>>>>>> names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
>>>>> Ok hooray there's more suspend-to-something things I've totally missed:
>>>>> - suspend-to-idle (done by cat freeze > /sys/power/state) and
>>>>> - suspend (done by cat suspend > /sys/power/state)
>>>>>
>>>>> And apparently there's really no way to drivers to tell them apart.
>>>>> Rafael, is there really no way for drivers to take different paths for
>>>>> these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
>>>>> and didn't spot anything.
>>>>>
>>>>> Also we're completely missing test coverage for that in igt. That is
>>>>> something that needs to be fixed asap (yet another case of
>>>>> combinatorial explosion in igt tests, yay). And at least one of those
>>>>> suspend-to-idle testcase better be in the BAT.
>>>>> -Daniel
>>>>> --
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>
>>>> --
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>
Daniel Vetter Sept. 29, 2015, 8:51 a.m. UTC | #21
On Tue, Sep 29, 2015 at 01:54:53AM +0200, Rafael J. Wysocki wrote:
> On 9/28/2015 8:52 AM, Daniel Vetter wrote:
> >On Wed, Sep 23, 2015 at 10:49:36PM +0200, Rafael J. Wysocki wrote:
> >>On 9/23/2015 7:17 PM, Daniel Vetter wrote:
> >>>acpi_target_system_state() seems to be almost the thing we're looking
> >>>for, except that it's only valid in the suspend callbacks since it
> >>>gets reset to ACPI_STATE_S0 when resuming. So probably we want
> >>>something else ...
> >>Right.
> >>
> >>The idea is to add a way for drivers to check if
> >>(a) suspend is going to enter the BIOS
> >>(b) resume has been triggered by the BIOS
> >>and that's really what drivers need to know.
> >>
> >>For suspend-to-idle those two will return false and for S3 they'll return
> >>true.
> >>
> >>Would that help?
> >Not sure that matches exaxtly what we'd need here ... Essentially we need
> >to know whether we've been in S3/S4 (firmware has been eaten) or in one of
> >the higher suspend-to-idle/standby states (firmware still alive, don't
> >disturb it). Additional fun that just crossed my mind is that if the
> >suspend-to-mem is aborted (some other driver failed) then that function
> >should _not_ indicate that we've been in S3. So maybe something like
> 
> So it really looks like the interface I was talking about would be suitable:
> 
> pm_suspend_via_firmware() == true -> firmware is going to be eaten (use that
> during suspend if needed)
> pm_resume_via_firmware() == true -> firmware was eaten
> 
> The latter will only return 'true' if we really have entered the BIOS
> (platform firmware).

Yeah that seems to fit the bill. We already have a check in our suspend
paths to figure out whether we'll suspend to idle or go into full S3, so
i915 could use them both. And making them generic would make sense I
guess.
-Daniel

> >acpi_source_system_state() which usually is S0 and only when acpi
> >successfully went into the suspend state in platform_suspend_ops->enter it
> >gets set to the value of acpi_target_system_state. And then reset once the
> >resume has completed. I think that would be what we'd want here.
> 
> We need to new functions like the above, because some things already depend
> on acpi_target_system_state working the way it does currently.
> 
> I see no reason to make that ACPI-specific, though, in principle.
> 
> >Anyway I'll pull in Animesh series meanwhile, amended with a FIXME comment.
> 
> Fine by me.
> 
> Thanks,
> Rafael
> 
> 
> >>>On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>Actually add Rafael this time around ...
> >>>>-Daniel
> >>>>
> >>>>On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>>On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>>>On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
> >>>>>>>On 9/14/2015 1:16 PM, Daniel Vetter wrote:
> >>>>>>>>On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
> >>>>>>>>>On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> >>>>>>>>>>On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
> >>>>>>>>>>>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> >>>>>>>>>>>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> >>>>>>>>>>>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >>>>>>>>>>>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>>>>>>>>>>>>>>Dmc will restore the csr program except DC9, cold boot,
> >>>>>>>>>>>>>>>warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>intel_csr_load_program() function is used to load the firmware
> >>>>>>>>>>>>>>>data from kernel memory to csr address space.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>All values of csr address space will be zero if it got reset and
> >>>>>>>>>>>>>>>the first byte of csr program is always a non-zero if firmware
> >>>>>>>>>>>>>>>is loaded successfuly. Based on hardware status will load the
> >>>>>>>>>>>>>>>firmware.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Without this condition check if we overwrite the firmware data the
> >>>>>>>>>>>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >>>>>>>>>>>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
> >>>>>>>>>>>>>So only WARN_ON will not help.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>>v1: Initial version.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>v2: Based on review comments from Daniel,
> >>>>>>>>>>>>>>>- Added a check to know hardware status and load the firmware if not loaded.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
> >>>>>>>>>>>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
> >>>>>>>>>>>>>>>Cc: Imre Deak <imre.deak@intel.com>
> >>>>>>>>>>>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
> >>>>>>>>>>>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >>>>>>>>>>>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >>>>>>>>>>>>>>>---
> >>>>>>>>>>>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>>>>>>>>>>>>>>  1 file changed, 9 insertions(+)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>>>index ba1ae03..682cc26 100644
> >>>>>>>>>>>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>>>>>>>>>>>>>>              return;
> >>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>+     /*
> >>>>>>>>>>>>>>>+      * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>>>>>>>>>>+      * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>>>+      * This condition will help to check if csr address space is reset/
> >>>>>>>>>>>>>>>+      * not loaded.
> >>>>>>>>>>>>>>>+      */
> >>>>>>>>>>>>>>Atm we call this from driver load and resume, which doesn seem to cover
> >>>>>>>>>>>>>>all the cases you mention in the comment. Should this be a WARN_ON
> >>>>>>>>>>>>>>instead? Or do we have troubles in our init sequence where we load too
> >>>>>>>>>>>>>>many times?
> >>>>>>>>>>>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> >>>>>>>>>>>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> >>>>>>>>>>>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> >>>>>>>>>>>>>and cold boot(not loaded).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> >>>>>>>>>>>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Can the below comment more clear to you.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>        /*
> >>>>>>>>>>>>>         * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>>>>>>>>         * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>         * If firmware is restored by dmc then no need to load again which
> >>>>>>>>>>>>>         * will keep the dc5/dc6 counter exposed by firmware.
> >>>>>>>>>>>>>         */
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>No issue in init sequence.
> >>>>>>>>>>>>That seems to still cover all the callers of the function afaics - we do
> >>>>>>>>>>>>pci resets over suspend resume unconditionally. So I still don't
> >>>>>>>>>>>>understand where exactly we try to load the dmc firmware in i915.ko when
> >>>>>>>>>>>>it's already loaded.
> >>>>>>>>>>>During resume intel_csr_load_program() will be called from
> >>>>>>>>>>>intel_runtime_resume().
> >>>>>>>>>>>
> >>>>>>>>>>>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
> >>>>>>>>>>>
> >>>>>>>>>>>During Pc10 entry testing I can see dmc is restoring back the firmware always,
> >>>>>>>>>>>but as you mentioned pci-reset can happen unconditionally, but still then
> >>>>>>>>>>>also during resume intel_runtime_resume() will be called and based on
> >>>>>>>>>>>register read of csr-base-address firmware loading will happen.
> >>>>>>>>>>But in your comment you're saying it won't get restored in case of dc9 and
> >>>>>>>>>>suspend. So that seems to mismatch what you're saying here (and what the
> >>>>>>>>>>commit message says) and what the code does. And this function here is
> >>>>>>>>>>called for resume after suspend/hibernate only.
> >>>>>>>>>pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
> >>>>>>>>>I think you are confusing between dc6 and dc9. Pc10 can be achieved by
> >>>>>>>>>entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
> >>>>>>>>>for broxton which is not present for skylake.
> >>>>>>>>I have no idea at all about different pc levels on skl. What I'm talking
> >>>>>>>>about is system suspend/resume and driver load, which are the places this
> >>>>>>>>function gets called. At least afaics.
> >>>>>>>>
> >>>>>>>>>Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
> >>>>>>>>>execution flow will be different in case of suspend/resume which I think is confusing
> >>>>>>>>>you.
> >>>>>>>>That seems like really important information. What's different on bxt?
> >>>>>>>>These are the kind of details you should explain in the commit message ...
> >>>>>>>>
> >>>>>>>>>I am ready explain you in detail. It will be good if we discuss specific use-case scenario
> >>>>>>>>>and itz software design for specific platform. Another point - as dmc related code for
> >>>>>>>>>broxton is not merged better first we close design for skylake. Now, I have added dc9
> >>>>>>>>>description in comment thinking of future. If you want I can remove for now and later
> >>>>>>>>>can add in bxt patch series for enabling dmc. Will wait for your reply.
> >>>>>>>>This question here isn't about the overall design and how to handle power
> >>>>>>>>wells in skl/bxt. That's a separate discussion and tracked somewhere else.
> >>>>>>>>I'm really just confused about when exactly we need to reload to firmware,
> >>>>>>>>and why we need a runtime check for that. Normally we should know when to
> >>>>>>>>reload the firmware and just either reload or not, without checking hw
> >>>>>>>>state. And I don't like checking for hw state since at least in the past
> >>>>>>>>that kind of code ended up being fragile - it's an illusion that it does
> >>>>>>>>the right thing no matter what, since often there's other tricky ordering
> >>>>>>>>constraints. And if you have automatic duct-tape like then no one will
> >>>>>>>>ever spot those other, harder to spot issues, until an expensive customer
> >>>>>>>>escalation happens.
> >>>>>>>>
> >>>>>>>>So what I want to know here is:
> >>>>>>>>- When exactly do we need to reload dmc firmware.
> >>>>>>>In skl, during driver load first time we load the firmware, during normal
> >>>>>>>suspend-resume (dc6 entry/exit)
> >>>>>>>no need to reload the firmware again as dmc will take care of it. But during
> >>>>>>>suspend/hibernation
> >>>>>>>dmc will not restore the firmware. In that case driver need to reload it
> >>>>>>>again. I do not know
> >>>>>>>how to differentiate pm-suspend and suspend-hibernation and thought both the
> >>>>>>>cases
> >>>>>>>intel_runtime_resume() will be called where we can check the h/w state and
> >>>>>>>reload the
> >>>>>>>firmware if dmc is not restored.
> >>>>>>>
> >>>>>>>In bxt, during driver load first time we load the firmware, during normal
> >>>>>>>suspend-resume
> >>>>>>>display engine will enter into dc9 and dmc will not restore the firmware. So
> >>>>>>>every
> >>>>>>>suspend-resume we need to reload the firmware.
> >>>>>>>>- What exactly is the reason why we can't make that decision statically in
> >>>>>>>>   the code (by calling csr_load at the right spots).
> >>>>>>>As I mentioned before in case of skylake can we differentiate between
> >>>>>>>"resume from pm-suspend" with "resume from suspend-hibernation" inside
> >>>>>>>driver?
> >>>>>>>
> >>>>>>>In case of broxton, every time we need to reload, so we can decide
> >>>>>>>statically.
> >>>>>>Of course we can differentiate between all the different resume paths, and
> >>>>>>we also have a per-platform split to take care of bxt vs. skl. And there
> >>>>>>are actually 3 different resume paths:
> >>>>>>
> >>>>>>- runtime PM resume. This calls the runtime_resume hook. It sounds like on
> >>>>>>   skl we should _not_ load the csr firmware, but on bxt we should load it.
> >>>>>>   This can be fixed by removing the intel_csr_load_program call from
> >>>>>>   skl_resume_prepare.
> >>>>>>- resume from hibernate-to-disk (i.e. system completely off, state stored
> >>>>>>   on the swap partition) is done by calling the thaw callbacks.
> >>>>>>- resume from suspend-to-mem (i.e. system in low-power with only memory
> >>>>>>   in self-refresh, all state stored in memory) is done by calling the
> >>>>>>   resume callbacks.
> >>>>>>
> >>>>>>For i915 we use unified handlers in our dev_pm_ops for both thaw and
> >>>>>>resume, but it sounds like that won't be a problem for skl/bxt since we
> >>>>>>need to reload the csr firmware in all cases. Although I'm not perfectly
> >>>>>>sure since you don't explain what kind of resume you mean exactly (since
> >>>>>>you don't use the linux names for them).
> >>>>>>
> >>>>>>Anyway it sounds like we can replace this patch by one where we remove
> >>>>>>that errornous csr load call from skl runtime pm resume and that's all.
> >>>>>>But I suggest to make sure we get this right we keep the check you're
> >>>>>>adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
> >>>>>>this is going wrong again. Like this:
> >>>>>>
> >>>>>>         if (WARN_ON(csr_loaded_already()))
> >>>>>>                 return;
> >>>>>>
> >>>>>>Also when redoing the commits please explain in detail what exactly are
> >>>>>>the requirements like you've done above, but please use the standard linux
> >>>>>>names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
> >>>>>Ok hooray there's more suspend-to-something things I've totally missed:
> >>>>>- suspend-to-idle (done by cat freeze > /sys/power/state) and
> >>>>>- suspend (done by cat suspend > /sys/power/state)
> >>>>>
> >>>>>And apparently there's really no way to drivers to tell them apart.
> >>>>>Rafael, is there really no way for drivers to take different paths for
> >>>>>these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
> >>>>>and didn't spot anything.
> >>>>>
> >>>>>Also we're completely missing test coverage for that in igt. That is
> >>>>>something that needs to be fixed asap (yet another case of
> >>>>>combinatorial explosion in igt tests, yay). And at least one of those
> >>>>>suspend-to-idle testcase better be in the BAT.
> >>>>>-Daniel
> >>>>>--
> >>>>>Daniel Vetter
> >>>>>Software Engineer, Intel Corporation
> >>>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>>>
> >>>>--
> >>>>Daniel Vetter
> >>>>Software Engineer, Intel Corporation
> >>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>>
>
Wysocki, Rafael J Sept. 30, 2015, 12:50 a.m. UTC | #22
On 9/29/2015 10:51 AM, Daniel Vetter wrote:
> On Tue, Sep 29, 2015 at 01:54:53AM +0200, Rafael J. Wysocki wrote:
>> On 9/28/2015 8:52 AM, Daniel Vetter wrote:
>>> On Wed, Sep 23, 2015 at 10:49:36PM +0200, Rafael J. Wysocki wrote:
>>>> On 9/23/2015 7:17 PM, Daniel Vetter wrote:
>>>>> acpi_target_system_state() seems to be almost the thing we're looking
>>>>> for, except that it's only valid in the suspend callbacks since it
>>>>> gets reset to ACPI_STATE_S0 when resuming. So probably we want
>>>>> something else ...
>>>> Right.
>>>>
>>>> The idea is to add a way for drivers to check if
>>>> (a) suspend is going to enter the BIOS
>>>> (b) resume has been triggered by the BIOS
>>>> and that's really what drivers need to know.
>>>>
>>>> For suspend-to-idle those two will return false and for S3 they'll return
>>>> true.
>>>>
>>>> Would that help?
>>> Not sure that matches exaxtly what we'd need here ... Essentially we need
>>> to know whether we've been in S3/S4 (firmware has been eaten) or in one of
>>> the higher suspend-to-idle/standby states (firmware still alive, don't
>>> disturb it). Additional fun that just crossed my mind is that if the
>>> suspend-to-mem is aborted (some other driver failed) then that function
>>> should _not_ indicate that we've been in S3. So maybe something like
>> So it really looks like the interface I was talking about would be suitable:
>>
>> pm_suspend_via_firmware() == true -> firmware is going to be eaten (use that
>> during suspend if needed)
>> pm_resume_via_firmware() == true -> firmware was eaten
>>
>> The latter will only return 'true' if we really have entered the BIOS
>> (platform firmware).
> Yeah that seems to fit the bill. We already have a check in our suspend
> paths to figure out whether we'll suspend to idle or go into full S3, so
> i915 could use them both. And making them generic would make sense I
> guess.

OK, sent patches (CCed you), please have a look.

Thanks,
Rafael


>>> acpi_source_system_state() which usually is S0 and only when acpi
>>> successfully went into the suspend state in platform_suspend_ops->enter it
>>> gets set to the value of acpi_target_system_state. And then reset once the
>>> resume has completed. I think that would be what we'd want here.
>> We need to new functions like the above, because some things already depend
>> on acpi_target_system_state working the way it does currently.
>>
>> I see no reason to make that ACPI-specific, though, in principle.
>>
>>> Anyway I'll pull in Animesh series meanwhile, amended with a FIXME comment.
>> Fine by me.
>>
>> Thanks,
>> Rafael
>>
>>
>>>>> On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>> Actually add Rafael this time around ...
>>>>>> -Daniel
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>> On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>>> On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
>>>>>>>>> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
>>>>>>>>>> On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>>>>>>>>>>> On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>>>>>>>>>>>> On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>>>>>>>>>>>> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>>>>>>>>>>>>> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>>>>>>>>>>>>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>>>>>>>>>>>>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>>>>>>>>>>>>>>> Dmc will restore the csr program except DC9, cold boot,
>>>>>>>>>>>>>>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> intel_csr_load_program() function is used to load the firmware
>>>>>>>>>>>>>>>>> data from kernel memory to csr address space.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> All values of csr address space will be zero if it got reset and
>>>>>>>>>>>>>>>>> the first byte of csr program is always a non-zero if firmware
>>>>>>>>>>>>>>>>> is loaded successfuly. Based on hardware status will load the
>>>>>>>>>>>>>>>>> firmware.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Without this condition check if we overwrite the firmware data the
>>>>>>>>>>>>>>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>>>>>>>>>>>>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>>>>>>>>>>>>> So only WARN_ON will not help.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> v1: Initial version.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> v2: Based on review comments from Daniel,
>>>>>>>>>>>>>>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>>>>>>>>>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>>>>>>>>>>>>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>>>>>>>>>>>>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>>>>>>>>>>>>>>   1 file changed, 9 insertions(+)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>>>> index ba1ae03..682cc26 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>>>>>>>>>>>>>>               return;
>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>> +     /*
>>>>>>>>>>>>>>>>> +      * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>>>>>>> +      * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>>>> +      * This condition will help to check if csr address space is reset/
>>>>>>>>>>>>>>>>> +      * not loaded.
>>>>>>>>>>>>>>>>> +      */
>>>>>>>>>>>>>>>> Atm we call this from driver load and resume, which doesn seem to cover
>>>>>>>>>>>>>>>> all the cases you mention in the comment. Should this be a WARN_ON
>>>>>>>>>>>>>>>> instead? Or do we have troubles in our init sequence where we load too
>>>>>>>>>>>>>>>> many times?
>>>>>>>>>>>>>>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>>>>>>>>>>>>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>>>>>>>>>>>>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>>>>>>>>>>>>> and cold boot(not loaded).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>>>>>>>>>>>>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can the below comment more clear to you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         /*
>>>>>>>>>>>>>>>          * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>>>>>          * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>>          * If firmware is restored by dmc then no need to load again which
>>>>>>>>>>>>>>>          * will keep the dc5/dc6 counter exposed by firmware.
>>>>>>>>>>>>>>>          */
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> No issue in init sequence.
>>>>>>>>>>>>>> That seems to still cover all the callers of the function afaics - we do
>>>>>>>>>>>>>> pci resets over suspend resume unconditionally. So I still don't
>>>>>>>>>>>>>> understand where exactly we try to load the dmc firmware in i915.ko when
>>>>>>>>>>>>>> it's already loaded.
>>>>>>>>>>>>> During resume intel_csr_load_program() will be called from
>>>>>>>>>>>>> intel_runtime_resume().
>>>>>>>>>>>>>
>>>>>>>>>>>>> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>>>>>>>>>>>>
>>>>>>>>>>>>> During Pc10 entry testing I can see dmc is restoring back the firmware always,
>>>>>>>>>>>>> but as you mentioned pci-reset can happen unconditionally, but still then
>>>>>>>>>>>>> also during resume intel_runtime_resume() will be called and based on
>>>>>>>>>>>>> register read of csr-base-address firmware loading will happen.
>>>>>>>>>>>> But in your comment you're saying it won't get restored in case of dc9 and
>>>>>>>>>>>> suspend. So that seems to mismatch what you're saying here (and what the
>>>>>>>>>>>> commit message says) and what the code does. And this function here is
>>>>>>>>>>>> called for resume after suspend/hibernate only.
>>>>>>>>>>> pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>>>>>>>>>>> I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>>>>>>>>>>> entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>>>>>>>>>>> for broxton which is not present for skylake.
>>>>>>>>>> I have no idea at all about different pc levels on skl. What I'm talking
>>>>>>>>>> about is system suspend/resume and driver load, which are the places this
>>>>>>>>>> function gets called. At least afaics.
>>>>>>>>>>
>>>>>>>>>>> Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>>>>>>>>>>> execution flow will be different in case of suspend/resume which I think is confusing
>>>>>>>>>>> you.
>>>>>>>>>> That seems like really important information. What's different on bxt?
>>>>>>>>>> These are the kind of details you should explain in the commit message ...
>>>>>>>>>>
>>>>>>>>>>> I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>>>>>>>>>>> and itz software design for specific platform. Another point - as dmc related code for
>>>>>>>>>>> broxton is not merged better first we close design for skylake. Now, I have added dc9
>>>>>>>>>>> description in comment thinking of future. If you want I can remove for now and later
>>>>>>>>>>> can add in bxt patch series for enabling dmc. Will wait for your reply.
>>>>>>>>>> This question here isn't about the overall design and how to handle power
>>>>>>>>>> wells in skl/bxt. That's a separate discussion and tracked somewhere else.
>>>>>>>>>> I'm really just confused about when exactly we need to reload to firmware,
>>>>>>>>>> and why we need a runtime check for that. Normally we should know when to
>>>>>>>>>> reload the firmware and just either reload or not, without checking hw
>>>>>>>>>> state. And I don't like checking for hw state since at least in the past
>>>>>>>>>> that kind of code ended up being fragile - it's an illusion that it does
>>>>>>>>>> the right thing no matter what, since often there's other tricky ordering
>>>>>>>>>> constraints. And if you have automatic duct-tape like then no one will
>>>>>>>>>> ever spot those other, harder to spot issues, until an expensive customer
>>>>>>>>>> escalation happens.
>>>>>>>>>>
>>>>>>>>>> So what I want to know here is:
>>>>>>>>>> - When exactly do we need to reload dmc firmware.
>>>>>>>>> In skl, during driver load first time we load the firmware, during normal
>>>>>>>>> suspend-resume (dc6 entry/exit)
>>>>>>>>> no need to reload the firmware again as dmc will take care of it. But during
>>>>>>>>> suspend/hibernation
>>>>>>>>> dmc will not restore the firmware. In that case driver need to reload it
>>>>>>>>> again. I do not know
>>>>>>>>> how to differentiate pm-suspend and suspend-hibernation and thought both the
>>>>>>>>> cases
>>>>>>>>> intel_runtime_resume() will be called where we can check the h/w state and
>>>>>>>>> reload the
>>>>>>>>> firmware if dmc is not restored.
>>>>>>>>>
>>>>>>>>> In bxt, during driver load first time we load the firmware, during normal
>>>>>>>>> suspend-resume
>>>>>>>>> display engine will enter into dc9 and dmc will not restore the firmware. So
>>>>>>>>> every
>>>>>>>>> suspend-resume we need to reload the firmware.
>>>>>>>>>> - What exactly is the reason why we can't make that decision statically in
>>>>>>>>>>    the code (by calling csr_load at the right spots).
>>>>>>>>> As I mentioned before in case of skylake can we differentiate between
>>>>>>>>> "resume from pm-suspend" with "resume from suspend-hibernation" inside
>>>>>>>>> driver?
>>>>>>>>>
>>>>>>>>> In case of broxton, every time we need to reload, so we can decide
>>>>>>>>> statically.
>>>>>>>> Of course we can differentiate between all the different resume paths, and
>>>>>>>> we also have a per-platform split to take care of bxt vs. skl. And there
>>>>>>>> are actually 3 different resume paths:
>>>>>>>>
>>>>>>>> - runtime PM resume. This calls the runtime_resume hook. It sounds like on
>>>>>>>>    skl we should _not_ load the csr firmware, but on bxt we should load it.
>>>>>>>>    This can be fixed by removing the intel_csr_load_program call from
>>>>>>>>    skl_resume_prepare.
>>>>>>>> - resume from hibernate-to-disk (i.e. system completely off, state stored
>>>>>>>>    on the swap partition) is done by calling the thaw callbacks.
>>>>>>>> - resume from suspend-to-mem (i.e. system in low-power with only memory
>>>>>>>>    in self-refresh, all state stored in memory) is done by calling the
>>>>>>>>    resume callbacks.
>>>>>>>>
>>>>>>>> For i915 we use unified handlers in our dev_pm_ops for both thaw and
>>>>>>>> resume, but it sounds like that won't be a problem for skl/bxt since we
>>>>>>>> need to reload the csr firmware in all cases. Although I'm not perfectly
>>>>>>>> sure since you don't explain what kind of resume you mean exactly (since
>>>>>>>> you don't use the linux names for them).
>>>>>>>>
>>>>>>>> Anyway it sounds like we can replace this patch by one where we remove
>>>>>>>> that errornous csr load call from skl runtime pm resume and that's all.
>>>>>>>> But I suggest to make sure we get this right we keep the check you're
>>>>>>>> adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
>>>>>>>> this is going wrong again. Like this:
>>>>>>>>
>>>>>>>>          if (WARN_ON(csr_loaded_already()))
>>>>>>>>                  return;
>>>>>>>>
>>>>>>>> Also when redoing the commits please explain in detail what exactly are
>>>>>>>> the requirements like you've done above, but please use the standard linux
>>>>>>>> names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
>>>>>>> Ok hooray there's more suspend-to-something things I've totally missed:
>>>>>>> - suspend-to-idle (done by cat freeze > /sys/power/state) and
>>>>>>> - suspend (done by cat suspend > /sys/power/state)
>>>>>>>
>>>>>>> And apparently there's really no way to drivers to tell them apart.
>>>>>>> Rafael, is there really no way for drivers to take different paths for
>>>>>>> these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
>>>>>>> and didn't spot anything.
>>>>>>>
>>>>>>> Also we're completely missing test coverage for that in igt. That is
>>>>>>> something that needs to be fixed asap (yet another case of
>>>>>>> combinatorial explosion in igt tests, yay). And at least one of those
>>>>>>> suspend-to-idle testcase better be in the BAT.
>>>>>>> -Daniel
>>>>>>> --
>>>>>>> Daniel Vetter
>>>>>>> Software Engineer, Intel Corporation
>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>> --
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Daniel Vetter Sept. 30, 2015, 12:14 p.m. UTC | #23
On Wed, Sep 30, 2015 at 02:50:40AM +0200, Rafael J. Wysocki wrote:
> On 9/29/2015 10:51 AM, Daniel Vetter wrote:
> >On Tue, Sep 29, 2015 at 01:54:53AM +0200, Rafael J. Wysocki wrote:
> >>On 9/28/2015 8:52 AM, Daniel Vetter wrote:
> >>>On Wed, Sep 23, 2015 at 10:49:36PM +0200, Rafael J. Wysocki wrote:
> >>>>On 9/23/2015 7:17 PM, Daniel Vetter wrote:
> >>>>>acpi_target_system_state() seems to be almost the thing we're looking
> >>>>>for, except that it's only valid in the suspend callbacks since it
> >>>>>gets reset to ACPI_STATE_S0 when resuming. So probably we want
> >>>>>something else ...
> >>>>Right.
> >>>>
> >>>>The idea is to add a way for drivers to check if
> >>>>(a) suspend is going to enter the BIOS
> >>>>(b) resume has been triggered by the BIOS
> >>>>and that's really what drivers need to know.
> >>>>
> >>>>For suspend-to-idle those two will return false and for S3 they'll return
> >>>>true.
> >>>>
> >>>>Would that help?
> >>>Not sure that matches exaxtly what we'd need here ... Essentially we need
> >>>to know whether we've been in S3/S4 (firmware has been eaten) or in one of
> >>>the higher suspend-to-idle/standby states (firmware still alive, don't
> >>>disturb it). Additional fun that just crossed my mind is that if the
> >>>suspend-to-mem is aborted (some other driver failed) then that function
> >>>should _not_ indicate that we've been in S3. So maybe something like
> >>So it really looks like the interface I was talking about would be suitable:
> >>
> >>pm_suspend_via_firmware() == true -> firmware is going to be eaten (use that
> >>during suspend if needed)
> >>pm_resume_via_firmware() == true -> firmware was eaten
> >>
> >>The latter will only return 'true' if we really have entered the BIOS
> >>(platform firmware).
> >Yeah that seems to fit the bill. We already have a check in our suspend
> >paths to figure out whether we'll suspend to idle or go into full S3, so
> >i915 could use them both. And making them generic would make sense I
> >guess.
> 
> OK, sent patches (CCed you), please have a look.

EXPORT_SYMBOL seems missing, but lgtm otherwise. I haven't tried to use
them for the two cases in i915 though.
-Daniel

> 
> Thanks,
> Rafael
> 
> 
> >>>acpi_source_system_state() which usually is S0 and only when acpi
> >>>successfully went into the suspend state in platform_suspend_ops->enter it
> >>>gets set to the value of acpi_target_system_state. And then reset once the
> >>>resume has completed. I think that would be what we'd want here.
> >>We need to new functions like the above, because some things already depend
> >>on acpi_target_system_state working the way it does currently.
> >>
> >>I see no reason to make that ACPI-specific, though, in principle.
> >>
> >>>Anyway I'll pull in Animesh series meanwhile, amended with a FIXME comment.
> >>Fine by me.
> >>
> >>Thanks,
> >>Rafael
> >>
> >>
> >>>>>On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>>>Actually add Rafael this time around ...
> >>>>>>-Daniel
> >>>>>>
> >>>>>>On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>>>>On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>>>>>On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
> >>>>>>>>>On 9/14/2015 1:16 PM, Daniel Vetter wrote:
> >>>>>>>>>>On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
> >>>>>>>>>>>On 9/10/2015 8:15 PM, Daniel Vetter wrote:
> >>>>>>>>>>>>On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
> >>>>>>>>>>>>>On 9/2/2015 2:24 PM, Daniel Vetter wrote:
> >>>>>>>>>>>>>>On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
> >>>>>>>>>>>>>>>On 8/26/2015 6:40 PM, Daniel Vetter wrote:
> >>>>>>>>>>>>>>>>On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
> >>>>>>>>>>>>>>>>>Dmc will restore the csr program except DC9, cold boot,
> >>>>>>>>>>>>>>>>>warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>intel_csr_load_program() function is used to load the firmware
> >>>>>>>>>>>>>>>>>data from kernel memory to csr address space.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>All values of csr address space will be zero if it got reset and
> >>>>>>>>>>>>>>>>>the first byte of csr program is always a non-zero if firmware
> >>>>>>>>>>>>>>>>>is loaded successfuly. Based on hardware status will load the
> >>>>>>>>>>>>>>>>>firmware.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>Without this condition check if we overwrite the firmware data the
> >>>>>>>>>>>>>>>>>counters exposed for dc5/dc6 (help for debugging) will be nullified.
> >>>>>>>>>>>>>>>Bacause of the above reason mentioned just above we need to block firmware loading again.
> >>>>>>>>>>>>>>>So only WARN_ON will not help.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>v1: Initial version.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>v2: Based on review comments from Daniel,
> >>>>>>>>>>>>>>>>>- Added a check to know hardware status and load the firmware if not loaded.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>Cc: Daniel Vetter <daniel.vetter@intel.com>
> >>>>>>>>>>>>>>>>>Cc: Damien Lespiau <damien.lespiau@intel.com>
> >>>>>>>>>>>>>>>>>Cc: Imre Deak <imre.deak@intel.com>
> >>>>>>>>>>>>>>>>>Cc: Sunil Kamath <sunil.kamath@intel.com>
> >>>>>>>>>>>>>>>>>Signed-off-by: Animesh Manna <animesh.manna@intel.com>
> >>>>>>>>>>>>>>>>>Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
> >>>>>>>>>>>>>>>>>---
> >>>>>>>>>>>>>>>>>  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
> >>>>>>>>>>>>>>>>>  1 file changed, 9 insertions(+)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>>>>>index ba1ae03..682cc26 100644
> >>>>>>>>>>>>>>>>>--- a/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>>>>>+++ b/drivers/gpu/drm/i915/intel_csr.c
> >>>>>>>>>>>>>>>>>@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
> >>>>>>>>>>>>>>>>>              return;
> >>>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>>+     /*
> >>>>>>>>>>>>>>>>>+      * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>>>>>>>>>>>>+      * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>>>>>+      * This condition will help to check if csr address space is reset/
> >>>>>>>>>>>>>>>>>+      * not loaded.
> >>>>>>>>>>>>>>>>>+      */
> >>>>>>>>>>>>>>>>Atm we call this from driver load and resume, which doesn seem to cover
> >>>>>>>>>>>>>>>>all the cases you mention in the comment. Should this be a WARN_ON
> >>>>>>>>>>>>>>>>instead? Or do we have troubles in our init sequence where we load too
> >>>>>>>>>>>>>>>>many times?
> >>>>>>>>>>>>>>>Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
> >>>>>>>>>>>>>>>Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
> >>>>>>>>>>>>>>>second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
> >>>>>>>>>>>>>>>and cold boot(not loaded).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
> >>>>>>>>>>>>>>>If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Can the below comment more clear to you.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>        /*
> >>>>>>>>>>>>>>>         * Dmc will restore the csr the program except DC9, cold boot,
> >>>>>>>>>>>>>>>         * warm reset, PCI function level reset, and hibernate/suspend.
> >>>>>>>>>>>>>>>         * If firmware is restored by dmc then no need to load again which
> >>>>>>>>>>>>>>>         * will keep the dc5/dc6 counter exposed by firmware.
> >>>>>>>>>>>>>>>         */
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>No issue in init sequence.
> >>>>>>>>>>>>>>That seems to still cover all the callers of the function afaics - we do
> >>>>>>>>>>>>>>pci resets over suspend resume unconditionally. So I still don't
> >>>>>>>>>>>>>>understand where exactly we try to load the dmc firmware in i915.ko when
> >>>>>>>>>>>>>>it's already loaded.
> >>>>>>>>>>>>>During resume intel_csr_load_program() will be called from
> >>>>>>>>>>>>>intel_runtime_resume().
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>During Pc10 entry testing I can see dmc is restoring back the firmware always,
> >>>>>>>>>>>>>but as you mentioned pci-reset can happen unconditionally, but still then
> >>>>>>>>>>>>>also during resume intel_runtime_resume() will be called and based on
> >>>>>>>>>>>>>register read of csr-base-address firmware loading will happen.
> >>>>>>>>>>>>But in your comment you're saying it won't get restored in case of dc9 and
> >>>>>>>>>>>>suspend. So that seems to mismatch what you're saying here (and what the
> >>>>>>>>>>>>commit message says) and what the code does. And this function here is
> >>>>>>>>>>>>called for resume after suspend/hibernate only.
> >>>>>>>>>>>pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
> >>>>>>>>>>>I think you are confusing between dc6 and dc9. Pc10 can be achieved by
> >>>>>>>>>>>entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
> >>>>>>>>>>>for broxton which is not present for skylake.
> >>>>>>>>>>I have no idea at all about different pc levels on skl. What I'm talking
> >>>>>>>>>>about is system suspend/resume and driver load, which are the places this
> >>>>>>>>>>function gets called. At least afaics.
> >>>>>>>>>>
> >>>>>>>>>>>Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
> >>>>>>>>>>>execution flow will be different in case of suspend/resume which I think is confusing
> >>>>>>>>>>>you.
> >>>>>>>>>>That seems like really important information. What's different on bxt?
> >>>>>>>>>>These are the kind of details you should explain in the commit message ...
> >>>>>>>>>>
> >>>>>>>>>>>I am ready explain you in detail. It will be good if we discuss specific use-case scenario
> >>>>>>>>>>>and itz software design for specific platform. Another point - as dmc related code for
> >>>>>>>>>>>broxton is not merged better first we close design for skylake. Now, I have added dc9
> >>>>>>>>>>>description in comment thinking of future. If you want I can remove for now and later
> >>>>>>>>>>>can add in bxt patch series for enabling dmc. Will wait for your reply.
> >>>>>>>>>>This question here isn't about the overall design and how to handle power
> >>>>>>>>>>wells in skl/bxt. That's a separate discussion and tracked somewhere else.
> >>>>>>>>>>I'm really just confused about when exactly we need to reload to firmware,
> >>>>>>>>>>and why we need a runtime check for that. Normally we should know when to
> >>>>>>>>>>reload the firmware and just either reload or not, without checking hw
> >>>>>>>>>>state. And I don't like checking for hw state since at least in the past
> >>>>>>>>>>that kind of code ended up being fragile - it's an illusion that it does
> >>>>>>>>>>the right thing no matter what, since often there's other tricky ordering
> >>>>>>>>>>constraints. And if you have automatic duct-tape like then no one will
> >>>>>>>>>>ever spot those other, harder to spot issues, until an expensive customer
> >>>>>>>>>>escalation happens.
> >>>>>>>>>>
> >>>>>>>>>>So what I want to know here is:
> >>>>>>>>>>- When exactly do we need to reload dmc firmware.
> >>>>>>>>>In skl, during driver load first time we load the firmware, during normal
> >>>>>>>>>suspend-resume (dc6 entry/exit)
> >>>>>>>>>no need to reload the firmware again as dmc will take care of it. But during
> >>>>>>>>>suspend/hibernation
> >>>>>>>>>dmc will not restore the firmware. In that case driver need to reload it
> >>>>>>>>>again. I do not know
> >>>>>>>>>how to differentiate pm-suspend and suspend-hibernation and thought both the
> >>>>>>>>>cases
> >>>>>>>>>intel_runtime_resume() will be called where we can check the h/w state and
> >>>>>>>>>reload the
> >>>>>>>>>firmware if dmc is not restored.
> >>>>>>>>>
> >>>>>>>>>In bxt, during driver load first time we load the firmware, during normal
> >>>>>>>>>suspend-resume
> >>>>>>>>>display engine will enter into dc9 and dmc will not restore the firmware. So
> >>>>>>>>>every
> >>>>>>>>>suspend-resume we need to reload the firmware.
> >>>>>>>>>>- What exactly is the reason why we can't make that decision statically in
> >>>>>>>>>>   the code (by calling csr_load at the right spots).
> >>>>>>>>>As I mentioned before in case of skylake can we differentiate between
> >>>>>>>>>"resume from pm-suspend" with "resume from suspend-hibernation" inside
> >>>>>>>>>driver?
> >>>>>>>>>
> >>>>>>>>>In case of broxton, every time we need to reload, so we can decide
> >>>>>>>>>statically.
> >>>>>>>>Of course we can differentiate between all the different resume paths, and
> >>>>>>>>we also have a per-platform split to take care of bxt vs. skl. And there
> >>>>>>>>are actually 3 different resume paths:
> >>>>>>>>
> >>>>>>>>- runtime PM resume. This calls the runtime_resume hook. It sounds like on
> >>>>>>>>   skl we should _not_ load the csr firmware, but on bxt we should load it.
> >>>>>>>>   This can be fixed by removing the intel_csr_load_program call from
> >>>>>>>>   skl_resume_prepare.
> >>>>>>>>- resume from hibernate-to-disk (i.e. system completely off, state stored
> >>>>>>>>   on the swap partition) is done by calling the thaw callbacks.
> >>>>>>>>- resume from suspend-to-mem (i.e. system in low-power with only memory
> >>>>>>>>   in self-refresh, all state stored in memory) is done by calling the
> >>>>>>>>   resume callbacks.
> >>>>>>>>
> >>>>>>>>For i915 we use unified handlers in our dev_pm_ops for both thaw and
> >>>>>>>>resume, but it sounds like that won't be a problem for skl/bxt since we
> >>>>>>>>need to reload the csr firmware in all cases. Although I'm not perfectly
> >>>>>>>>sure since you don't explain what kind of resume you mean exactly (since
> >>>>>>>>you don't use the linux names for them).
> >>>>>>>>
> >>>>>>>>Anyway it sounds like we can replace this patch by one where we remove
> >>>>>>>>that errornous csr load call from skl runtime pm resume and that's all.
> >>>>>>>>But I suggest to make sure we get this right we keep the check you're
> >>>>>>>>adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
> >>>>>>>>this is going wrong again. Like this:
> >>>>>>>>
> >>>>>>>>         if (WARN_ON(csr_loaded_already()))
> >>>>>>>>                 return;
> >>>>>>>>
> >>>>>>>>Also when redoing the commits please explain in detail what exactly are
> >>>>>>>>the requirements like you've done above, but please use the standard linux
> >>>>>>>>names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
> >>>>>>>Ok hooray there's more suspend-to-something things I've totally missed:
> >>>>>>>- suspend-to-idle (done by cat freeze > /sys/power/state) and
> >>>>>>>- suspend (done by cat suspend > /sys/power/state)
> >>>>>>>
> >>>>>>>And apparently there's really no way to drivers to tell them apart.
> >>>>>>>Rafael, is there really no way for drivers to take different paths for
> >>>>>>>these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
> >>>>>>>and didn't spot anything.
> >>>>>>>
> >>>>>>>Also we're completely missing test coverage for that in igt. That is
> >>>>>>>something that needs to be fixed asap (yet another case of
> >>>>>>>combinatorial explosion in igt tests, yay). And at least one of those
> >>>>>>>suspend-to-idle testcase better be in the BAT.
> >>>>>>>-Daniel
> >>>>>>>--
> >>>>>>>Daniel Vetter
> >>>>>>>Software Engineer, Intel Corporation
> >>>>>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>>>>>--
> >>>>>>Daniel Vetter
> >>>>>>Software Engineer, Intel Corporation
> >>>>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
Wysocki, Rafael J Sept. 30, 2015, 11:34 p.m. UTC | #24
On 9/30/2015 2:14 PM, Daniel Vetter wrote:
> On Wed, Sep 30, 2015 at 02:50:40AM +0200, Rafael J. Wysocki wrote:
>> On 9/29/2015 10:51 AM, Daniel Vetter wrote:
>>> On Tue, Sep 29, 2015 at 01:54:53AM +0200, Rafael J. Wysocki wrote:
>>>> On 9/28/2015 8:52 AM, Daniel Vetter wrote:
>>>>> On Wed, Sep 23, 2015 at 10:49:36PM +0200, Rafael J. Wysocki wrote:
>>>>>> On 9/23/2015 7:17 PM, Daniel Vetter wrote:
>>>>>>> acpi_target_system_state() seems to be almost the thing we're looking
>>>>>>> for, except that it's only valid in the suspend callbacks since it
>>>>>>> gets reset to ACPI_STATE_S0 when resuming. So probably we want
>>>>>>> something else ...
>>>>>> Right.
>>>>>>
>>>>>> The idea is to add a way for drivers to check if
>>>>>> (a) suspend is going to enter the BIOS
>>>>>> (b) resume has been triggered by the BIOS
>>>>>> and that's really what drivers need to know.
>>>>>>
>>>>>> For suspend-to-idle those two will return false and for S3 they'll return
>>>>>> true.
>>>>>>
>>>>>> Would that help?
>>>>> Not sure that matches exaxtly what we'd need here ... Essentially we need
>>>>> to know whether we've been in S3/S4 (firmware has been eaten) or in one of
>>>>> the higher suspend-to-idle/standby states (firmware still alive, don't
>>>>> disturb it). Additional fun that just crossed my mind is that if the
>>>>> suspend-to-mem is aborted (some other driver failed) then that function
>>>>> should _not_ indicate that we've been in S3. So maybe something like
>>>> So it really looks like the interface I was talking about would be suitable:
>>>>
>>>> pm_suspend_via_firmware() == true -> firmware is going to be eaten (use that
>>>> during suspend if needed)
>>>> pm_resume_via_firmware() == true -> firmware was eaten
>>>>
>>>> The latter will only return 'true' if we really have entered the BIOS
>>>> (platform firmware).
>>> Yeah that seems to fit the bill. We already have a check in our suspend
>>> paths to figure out whether we'll suspend to idle or go into full S3, so
>>> i915 could use them both. And making them generic would make sense I
>>> guess.
>> OK, sent patches (CCed you), please have a look.
> EXPORT_SYMBOL seems missing,

Good point, will fix.

>   but lgtm otherwise. I haven't tried to use
> them for the two cases in i915 though.

OK, thanks!

Rafael


>
>>>>> acpi_source_system_state() which usually is S0 and only when acpi
>>>>> successfully went into the suspend state in platform_suspend_ops->enter it
>>>>> gets set to the value of acpi_target_system_state. And then reset once the
>>>>> resume has completed. I think that would be what we'd want here.
>>>> We need to new functions like the above, because some things already depend
>>>> on acpi_target_system_state working the way it does currently.
>>>>
>>>> I see no reason to make that ACPI-specific, though, in principle.
>>>>
>>>>> Anyway I'll pull in Animesh series meanwhile, amended with a FIXME comment.
>>>> Fine by me.
>>>>
>>>> Thanks,
>>>> Rafael
>>>>
>>>>
>>>>>>> On Wed, Sep 23, 2015 at 6:28 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>>> Actually add Rafael this time around ...
>>>>>>>> -Daniel
>>>>>>>>
>>>>>>>> On Wed, Sep 23, 2015 at 6:27 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>>>> On Wed, Sep 23, 2015 at 9:57 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>>>>> On Thu, Sep 17, 2015 at 12:53:21AM +0530, Animesh Manna wrote:
>>>>>>>>>>> On 9/14/2015 1:16 PM, Daniel Vetter wrote:
>>>>>>>>>>>> On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:
>>>>>>>>>>>>> On 9/10/2015 8:15 PM, Daniel Vetter wrote:
>>>>>>>>>>>>>> On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
>>>>>>>>>>>>>>> On 9/2/2015 2:24 PM, Daniel Vetter wrote:
>>>>>>>>>>>>>>>> On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
>>>>>>>>>>>>>>>>> On 8/26/2015 6:40 PM, Daniel Vetter wrote:
>>>>>>>>>>>>>>>>>> On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
>>>>>>>>>>>>>>>>>>> Dmc will restore the csr program except DC9, cold boot,
>>>>>>>>>>>>>>>>>>> warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> intel_csr_load_program() function is used to load the firmware
>>>>>>>>>>>>>>>>>>> data from kernel memory to csr address space.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> All values of csr address space will be zero if it got reset and
>>>>>>>>>>>>>>>>>>> the first byte of csr program is always a non-zero if firmware
>>>>>>>>>>>>>>>>>>> is loaded successfuly. Based on hardware status will load the
>>>>>>>>>>>>>>>>>>> firmware.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Without this condition check if we overwrite the firmware data the
>>>>>>>>>>>>>>>>>>> counters exposed for dc5/dc6 (help for debugging) will be nullified.
>>>>>>>>>>>>>>>>> Bacause of the above reason mentioned just above we need to block firmware loading again.
>>>>>>>>>>>>>>>>> So only WARN_ON will not help.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> v1: Initial version.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> v2: Based on review comments from Daniel,
>>>>>>>>>>>>>>>>>>> - Added a check to know hardware status and load the firmware if not loaded.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cc: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>>>>>>>>>>>>>> Cc: Damien Lespiau <damien.lespiau@intel.com>
>>>>>>>>>>>>>>>>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>>>>>>>>>>>>>>>>> Cc: Sunil Kamath <sunil.kamath@intel.com>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Animesh Manna <animesh.manna@intel.com>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@intel.com>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>   drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
>>>>>>>>>>>>>>>>>>>   1 file changed, 9 insertions(+)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>>>>>> index ba1ae03..682cc26 100644
>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_csr.c
>>>>>>>>>>>>>>>>>>> @@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
>>>>>>>>>>>>>>>>>>>               return;
>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>> +     /*
>>>>>>>>>>>>>>>>>>> +      * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>>>>>>>>> +      * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>>>>>> +      * This condition will help to check if csr address space is reset/
>>>>>>>>>>>>>>>>>>> +      * not loaded.
>>>>>>>>>>>>>>>>>>> +      */
>>>>>>>>>>>>>>>>>> Atm we call this from driver load and resume, which doesn seem to cover
>>>>>>>>>>>>>>>>>> all the cases you mention in the comment. Should this be a WARN_ON
>>>>>>>>>>>>>>>>>> instead? Or do we have troubles in our init sequence where we load too
>>>>>>>>>>>>>>>>>> many times?
>>>>>>>>>>>>>>>>> Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
>>>>>>>>>>>>>>>>> Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
>>>>>>>>>>>>>>>>> second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
>>>>>>>>>>>>>>>>> and cold boot(not loaded).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
>>>>>>>>>>>>>>>>> If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can the below comment more clear to you.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>         /*
>>>>>>>>>>>>>>>>>          * Dmc will restore the csr the program except DC9, cold boot,
>>>>>>>>>>>>>>>>>          * warm reset, PCI function level reset, and hibernate/suspend.
>>>>>>>>>>>>>>>>>          * If firmware is restored by dmc then no need to load again which
>>>>>>>>>>>>>>>>>          * will keep the dc5/dc6 counter exposed by firmware.
>>>>>>>>>>>>>>>>>          */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No issue in init sequence.
>>>>>>>>>>>>>>>> That seems to still cover all the callers of the function afaics - we do
>>>>>>>>>>>>>>>> pci resets over suspend resume unconditionally. So I still don't
>>>>>>>>>>>>>>>> understand where exactly we try to load the dmc firmware in i915.ko when
>>>>>>>>>>>>>>>> it's already loaded.
>>>>>>>>>>>>>>> During resume intel_csr_load_program() will be called from
>>>>>>>>>>>>>>> intel_runtime_resume().
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> During Pc10 entry testing I can see dmc is restoring back the firmware always,
>>>>>>>>>>>>>>> but as you mentioned pci-reset can happen unconditionally, but still then
>>>>>>>>>>>>>>> also during resume intel_runtime_resume() will be called and based on
>>>>>>>>>>>>>>> register read of csr-base-address firmware loading will happen.
>>>>>>>>>>>>>> But in your comment you're saying it won't get restored in case of dc9 and
>>>>>>>>>>>>>> suspend. So that seems to mismatch what you're saying here (and what the
>>>>>>>>>>>>>> commit message says) and what the code does. And this function here is
>>>>>>>>>>>>>> called for resume after suspend/hibernate only.
>>>>>>>>>>>>> pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
>>>>>>>>>>>>> I think you are confusing between dc6 and dc9. Pc10 can be achieved by
>>>>>>>>>>>>> entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
>>>>>>>>>>>>> for broxton which is not present for skylake.
>>>>>>>>>>>> I have no idea at all about different pc levels on skl. What I'm talking
>>>>>>>>>>>> about is system suspend/resume and driver load, which are the places this
>>>>>>>>>>>> function gets called. At least afaics.
>>>>>>>>>>>>
>>>>>>>>>>>>> Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
>>>>>>>>>>>>> execution flow will be different in case of suspend/resume which I think is confusing
>>>>>>>>>>>>> you.
>>>>>>>>>>>> That seems like really important information. What's different on bxt?
>>>>>>>>>>>> These are the kind of details you should explain in the commit message ...
>>>>>>>>>>>>
>>>>>>>>>>>>> I am ready explain you in detail. It will be good if we discuss specific use-case scenario
>>>>>>>>>>>>> and itz software design for specific platform. Another point - as dmc related code for
>>>>>>>>>>>>> broxton is not merged better first we close design for skylake. Now, I have added dc9
>>>>>>>>>>>>> description in comment thinking of future. If you want I can remove for now and later
>>>>>>>>>>>>> can add in bxt patch series for enabling dmc. Will wait for your reply.
>>>>>>>>>>>> This question here isn't about the overall design and how to handle power
>>>>>>>>>>>> wells in skl/bxt. That's a separate discussion and tracked somewhere else.
>>>>>>>>>>>> I'm really just confused about when exactly we need to reload to firmware,
>>>>>>>>>>>> and why we need a runtime check for that. Normally we should know when to
>>>>>>>>>>>> reload the firmware and just either reload or not, without checking hw
>>>>>>>>>>>> state. And I don't like checking for hw state since at least in the past
>>>>>>>>>>>> that kind of code ended up being fragile - it's an illusion that it does
>>>>>>>>>>>> the right thing no matter what, since often there's other tricky ordering
>>>>>>>>>>>> constraints. And if you have automatic duct-tape like then no one will
>>>>>>>>>>>> ever spot those other, harder to spot issues, until an expensive customer
>>>>>>>>>>>> escalation happens.
>>>>>>>>>>>>
>>>>>>>>>>>> So what I want to know here is:
>>>>>>>>>>>> - When exactly do we need to reload dmc firmware.
>>>>>>>>>>> In skl, during driver load first time we load the firmware, during normal
>>>>>>>>>>> suspend-resume (dc6 entry/exit)
>>>>>>>>>>> no need to reload the firmware again as dmc will take care of it. But during
>>>>>>>>>>> suspend/hibernation
>>>>>>>>>>> dmc will not restore the firmware. In that case driver need to reload it
>>>>>>>>>>> again. I do not know
>>>>>>>>>>> how to differentiate pm-suspend and suspend-hibernation and thought both the
>>>>>>>>>>> cases
>>>>>>>>>>> intel_runtime_resume() will be called where we can check the h/w state and
>>>>>>>>>>> reload the
>>>>>>>>>>> firmware if dmc is not restored.
>>>>>>>>>>>
>>>>>>>>>>> In bxt, during driver load first time we load the firmware, during normal
>>>>>>>>>>> suspend-resume
>>>>>>>>>>> display engine will enter into dc9 and dmc will not restore the firmware. So
>>>>>>>>>>> every
>>>>>>>>>>> suspend-resume we need to reload the firmware.
>>>>>>>>>>>> - What exactly is the reason why we can't make that decision statically in
>>>>>>>>>>>>    the code (by calling csr_load at the right spots).
>>>>>>>>>>> As I mentioned before in case of skylake can we differentiate between
>>>>>>>>>>> "resume from pm-suspend" with "resume from suspend-hibernation" inside
>>>>>>>>>>> driver?
>>>>>>>>>>>
>>>>>>>>>>> In case of broxton, every time we need to reload, so we can decide
>>>>>>>>>>> statically.
>>>>>>>>>> Of course we can differentiate between all the different resume paths, and
>>>>>>>>>> we also have a per-platform split to take care of bxt vs. skl. And there
>>>>>>>>>> are actually 3 different resume paths:
>>>>>>>>>>
>>>>>>>>>> - runtime PM resume. This calls the runtime_resume hook. It sounds like on
>>>>>>>>>>    skl we should _not_ load the csr firmware, but on bxt we should load it.
>>>>>>>>>>    This can be fixed by removing the intel_csr_load_program call from
>>>>>>>>>>    skl_resume_prepare.
>>>>>>>>>> - resume from hibernate-to-disk (i.e. system completely off, state stored
>>>>>>>>>>    on the swap partition) is done by calling the thaw callbacks.
>>>>>>>>>> - resume from suspend-to-mem (i.e. system in low-power with only memory
>>>>>>>>>>    in self-refresh, all state stored in memory) is done by calling the
>>>>>>>>>>    resume callbacks.
>>>>>>>>>>
>>>>>>>>>> For i915 we use unified handlers in our dev_pm_ops for both thaw and
>>>>>>>>>> resume, but it sounds like that won't be a problem for skl/bxt since we
>>>>>>>>>> need to reload the csr firmware in all cases. Although I'm not perfectly
>>>>>>>>>> sure since you don't explain what kind of resume you mean exactly (since
>>>>>>>>>> you don't use the linux names for them).
>>>>>>>>>>
>>>>>>>>>> Anyway it sounds like we can replace this patch by one where we remove
>>>>>>>>>> that errornous csr load call from skl runtime pm resume and that's all.
>>>>>>>>>> But I suggest to make sure we get this right we keep the check you're
>>>>>>>>>> adding here, but wrap it in a WARN_ON. Then we'll get a backtrace when
>>>>>>>>>> this is going wrong again. Like this:
>>>>>>>>>>
>>>>>>>>>>          if (WARN_ON(csr_loaded_already()))
>>>>>>>>>>                  return;
>>>>>>>>>>
>>>>>>>>>> Also when redoing the commits please explain in detail what exactly are
>>>>>>>>>> the requirements like you've done above, but please use the standard linux
>>>>>>>>>> names, i.e. "runtime PM" and "hibernate-to-disk" and "suspend-to-mem".
>>>>>>>>> Ok hooray there's more suspend-to-something things I've totally missed:
>>>>>>>>> - suspend-to-idle (done by cat freeze > /sys/power/state) and
>>>>>>>>> - suspend (done by cat suspend > /sys/power/state)
>>>>>>>>>
>>>>>>>>> And apparently there's really no way to drivers to tell them apart.
>>>>>>>>> Rafael, is there really no way for drivers to take different paths for
>>>>>>>>> these 3 suspend cases? I tried grepping for PM_SUSPEND_ON/STANDY/MEM
>>>>>>>>> and didn't spot anything.
>>>>>>>>>
>>>>>>>>> Also we're completely missing test coverage for that in igt. That is
>>>>>>>>> something that needs to be fixed asap (yet another case of
>>>>>>>>> combinatorial explosion in igt tests, yay). And at least one of those
>>>>>>>>> suspend-to-idle testcase better be in the BAT.
>>>>>>>>> -Daniel
>>>>>>>>> --
>>>>>>>>> Daniel Vetter
>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>>>> --
>>>>>>>> Daniel Vetter
>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
index ba1ae03..682cc26 100644
--- a/drivers/gpu/drm/i915/intel_csr.c
+++ b/drivers/gpu/drm/i915/intel_csr.c
@@ -252,6 +252,15 @@  void intel_csr_load_program(struct drm_device *dev)
 		return;
 	}
 
+	/*
+	 * Dmc will restore the csr the program except DC9, cold boot,
+	 * warm reset, PCI function level reset, and hibernate/suspend.
+	 * This condition will help to check if csr address space is reset/
+	 * not loaded.
+	 */
+	if (I915_READ(CSR_PROGRAM_BASE))
+		return;
+
 	mutex_lock(&dev_priv->csr_lock);
 	fw_size = dev_priv->csr.dmc_fw_size;
 	for (i = 0; i < fw_size; i++)