diff mbox series

drm/i915/huc: bump timeout for delayed load and reduce print verbosity

Message ID 20221006172056.1381588-1-daniele.ceraolospurio@intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915/huc: bump timeout for delayed load and reduce print verbosity | expand

Commit Message

Daniele Ceraolo Spurio Oct. 6, 2022, 5:20 p.m. UTC
We're observing sporadic HuC delayed load timeouts in CI, due to mei_pxp
binding completing later than we expected. HuC is still still loaded
when the bind occurs, but in the meantime i915 has started allowing
submission to the VCS engines even if HuC is not there.
In most of the cases I've observed, the timeout was due to the
init/resume of another driver between i915 and mei hitting errors and
thus adding an extra delay, but HuC was still loaded before userspace
could submit, because the whole resume process time was increased by the
delays.

Given that there is no upper bound to the delay that can be introduced
by other drivers, I've reached the following compromise with the media
team:

1) i915 is going to bump the timeout to 5s, to reduce the probability
of reaching it. We still expect HuC to be loaded before userspace
starts submitting, so increasing the timeout should have no impact on
normal operations, but in case something weird happens we don't want to
stall video submissions for too long.

2) The media driver will cope with the failing submissions that manage
to go through between i915 init/resume complete and HuC loading, if any
ever happen. This could cause a small corruption of video playback
immediately after a resume (we should be safe on boot because the media
driver polls the HUC_STATUS ioctl before starting submissions).

Since we're accepting the timeout as a valid outcome, I'm also reducing
the print verbosity from error to notice.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033
Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a fence")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Tony Ye <tony.ye@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_huc.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

John Harrison Oct. 6, 2022, 8:09 p.m. UTC | #1
On 10/6/2022 10:20, Daniele Ceraolo Spurio wrote:
> We're observing sporadic HuC delayed load timeouts in CI, due to mei_pxp
> binding completing later than we expected. HuC is still still loaded
still still

> when the bind occurs, but in the meantime i915 has started allowing
> submission to the VCS engines even if HuC is not there.
> In most of the cases I've observed, the timeout was due to the
> init/resume of another driver between i915 and mei hitting errors and
> thus adding an extra delay, but HuC was still loaded before userspace
> could submit, because the whole resume process time was increased by the
> delays.
>
> Given that there is no upper bound to the delay that can be introduced
> by other drivers, I've reached the following compromise with the media
> team:
>
> 1) i915 is going to bump the timeout to 5s, to reduce the probability
> of reaching it. We still expect HuC to be loaded before userspace
> starts submitting, so increasing the timeout should have no impact on
> normal operations, but in case something weird happens we don't want to
> stall video submissions for too long.
>
> 2) The media driver will cope with the failing submissions that manage
> to go through between i915 init/resume complete and HuC loading, if any
> ever happen. This could cause a small corruption of video playback
> immediately after a resume (we should be safe on boot because the media
> driver polls the HUC_STATUS ioctl before starting submissions).
>
> Since we're accepting the timeout as a valid outcome, I'm also reducing
> the print verbosity from error to notice.
>
> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033
> Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a fence")
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Tony Ye <tony.ye@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_huc.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> index 4d1cc383b681..73a6a2fae637 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> @@ -52,10 +52,12 @@
>    * guaranteed for this to happen during boot, so the big timeout is a safety net
>    * that we never expect to need.
>    * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs to be resumed
> - * and/or reset, this can take longer.
> + * and/or reset, this can take longer. Note that the kernel might schedule
> + * other work between the i915 init/resume and the MEI one, which can add to
> + * the delay.
>    */
>   #define GSC_INIT_TIMEOUT_MS 10000
> -#define PXP_INIT_TIMEOUT_MS 2000
> +#define PXP_INIT_TIMEOUT_MS 5000
If we already have the GSC timeout at 10s, why not just use 10s for PXP 
as well?

>   
>   static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>   				 enum i915_sw_fence_notify state)
> @@ -104,8 +106,8 @@ static enum hrtimer_restart huc_delayed_load_timer_callback(struct hrtimer *hrti
>   	struct intel_huc *huc = container_of(hrtimer, struct intel_huc, delayed_load.timer);
>   
>   	if (!intel_huc_is_authenticated(huc)) {
> -		drm_err(&huc_to_gt(huc)->i915->drm,
> -			"timed out waiting for GSC init to load HuC\n");
> +		drm_notice(&huc_to_gt(huc)->i915->drm,
> +			   "timed out waiting for GSC init to load HuC\n");
If the failure is that the MEI PXP module hasn't loaded yet, why is the 
error message 'waiting for GSC init'? Or can we not distinguish between 
GSC load timeout and PXP load timeout? In which case should the message 
refer to 'GSC/PXP'?

John.

>   
>   		__gsc_init_error(huc);
>   	}
Daniele Ceraolo Spurio Oct. 6, 2022, 8:16 p.m. UTC | #2
On 10/6/2022 1:09 PM, John Harrison wrote:
> On 10/6/2022 10:20, Daniele Ceraolo Spurio wrote:
>> We're observing sporadic HuC delayed load timeouts in CI, due to mei_pxp
>> binding completing later than we expected. HuC is still still loaded
> still still
>
>> when the bind occurs, but in the meantime i915 has started allowing
>> submission to the VCS engines even if HuC is not there.
>> In most of the cases I've observed, the timeout was due to the
>> init/resume of another driver between i915 and mei hitting errors and
>> thus adding an extra delay, but HuC was still loaded before userspace
>> could submit, because the whole resume process time was increased by the
>> delays.
>>
>> Given that there is no upper bound to the delay that can be introduced
>> by other drivers, I've reached the following compromise with the media
>> team:
>>
>> 1) i915 is going to bump the timeout to 5s, to reduce the probability
>> of reaching it. We still expect HuC to be loaded before userspace
>> starts submitting, so increasing the timeout should have no impact on
>> normal operations, but in case something weird happens we don't want to
>> stall video submissions for too long.
>>
>> 2) The media driver will cope with the failing submissions that manage
>> to go through between i915 init/resume complete and HuC loading, if any
>> ever happen. This could cause a small corruption of video playback
>> immediately after a resume (we should be safe on boot because the media
>> driver polls the HUC_STATUS ioctl before starting submissions).
>>
>> Since we're accepting the timeout as a valid outcome, I'm also reducing
>> the print verbosity from error to notice.
>>
>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033
>> Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a 
>> fence")
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Tony Ye <tony.ye@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_huc.c | 10 ++++++----
>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>> index 4d1cc383b681..73a6a2fae637 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>> @@ -52,10 +52,12 @@
>>    * guaranteed for this to happen during boot, so the big timeout is 
>> a safety net
>>    * that we never expect to need.
>>    * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs to 
>> be resumed
>> - * and/or reset, this can take longer.
>> + * and/or reset, this can take longer. Note that the kernel might 
>> schedule
>> + * other work between the i915 init/resume and the MEI one, which 
>> can add to
>> + * the delay.
>>    */
>>   #define GSC_INIT_TIMEOUT_MS 10000
>> -#define PXP_INIT_TIMEOUT_MS 2000
>> +#define PXP_INIT_TIMEOUT_MS 5000
> If we already have the GSC timeout at 10s, why not just use 10s for 
> PXP as well?

They're different type of operations: mei_gsc is a full on aux driver, 
so it is loaded only once during boot; mei_pxp is a component and it is 
bound on init and then unbound/re-bound on suspend/resume. On resume we 
don't want timeouts excessively big.

>
>>     static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>>                    enum i915_sw_fence_notify state)
>> @@ -104,8 +106,8 @@ static enum hrtimer_restart 
>> huc_delayed_load_timer_callback(struct hrtimer *hrti
>>       struct intel_huc *huc = container_of(hrtimer, struct intel_huc, 
>> delayed_load.timer);
>>         if (!intel_huc_is_authenticated(huc)) {
>> -        drm_err(&huc_to_gt(huc)->i915->drm,
>> -            "timed out waiting for GSC init to load HuC\n");
>> +        drm_notice(&huc_to_gt(huc)->i915->drm,
>> +               "timed out waiting for GSC init to load HuC\n");
> If the failure is that the MEI PXP module hasn't loaded yet, why is 
> the error message 'waiting for GSC init'? Or can we not distinguish 
> between GSC load timeout and PXP load timeout? In which case should 
> the message refer to 'GSC/PXP'?

I wanted to keep things simple and have a unified message for both 
scenarios as what we care about from an i915 POV is that something went 
wrong on the mei side. I can split it up.

Daniele

>
> John.
>
>>             __gsc_init_error(huc);
>>       }
>
John Harrison Oct. 6, 2022, 8:24 p.m. UTC | #3
On 10/6/2022 13:16, Ceraolo Spurio, Daniele wrote:
> On 10/6/2022 1:09 PM, John Harrison wrote:
>> On 10/6/2022 10:20, Daniele Ceraolo Spurio wrote:
>>> We're observing sporadic HuC delayed load timeouts in CI, due to 
>>> mei_pxp
>>> binding completing later than we expected. HuC is still still loaded
>> still still
>>
>>> when the bind occurs, but in the meantime i915 has started allowing
>>> submission to the VCS engines even if HuC is not there.
>>> In most of the cases I've observed, the timeout was due to the
>>> init/resume of another driver between i915 and mei hitting errors and
>>> thus adding an extra delay, but HuC was still loaded before userspace
>>> could submit, because the whole resume process time was increased by 
>>> the
>>> delays.
>>>
>>> Given that there is no upper bound to the delay that can be introduced
>>> by other drivers, I've reached the following compromise with the media
>>> team:
>>>
>>> 1) i915 is going to bump the timeout to 5s, to reduce the probability
>>> of reaching it. We still expect HuC to be loaded before userspace
>>> starts submitting, so increasing the timeout should have no impact on
>>> normal operations, but in case something weird happens we don't want to
>>> stall video submissions for too long.
>>>
>>> 2) The media driver will cope with the failing submissions that manage
>>> to go through between i915 init/resume complete and HuC loading, if any
>>> ever happen. This could cause a small corruption of video playback
>>> immediately after a resume (we should be safe on boot because the media
>>> driver polls the HUC_STATUS ioctl before starting submissions).
>>>
>>> Since we're accepting the timeout as a valid outcome, I'm also reducing
>>> the print verbosity from error to notice.
>>>
>>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033
>>> Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a 
>>> fence")
>>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>> Cc: Tony Ye <tony.ye@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_huc.c | 10 ++++++----
>>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>>> index 4d1cc383b681..73a6a2fae637 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>>> @@ -52,10 +52,12 @@
>>>    * guaranteed for this to happen during boot, so the big timeout 
>>> is a safety net
>>>    * that we never expect to need.
>>>    * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs 
>>> to be resumed
>>> - * and/or reset, this can take longer.
>>> + * and/or reset, this can take longer. Note that the kernel might 
>>> schedule
>>> + * other work between the i915 init/resume and the MEI one, which 
>>> can add to
>>> + * the delay.
>>>    */
>>>   #define GSC_INIT_TIMEOUT_MS 10000
>>> -#define PXP_INIT_TIMEOUT_MS 2000
>>> +#define PXP_INIT_TIMEOUT_MS 5000
>> If we already have the GSC timeout at 10s, why not just use 10s for 
>> PXP as well?
>
> They're different type of operations: mei_gsc is a full on aux driver, 
> so it is loaded only once during boot; mei_pxp is a component and it 
> is bound on init and then unbound/re-bound on suspend/resume. On 
> resume we don't want timeouts excessively big.
Okay.

>
>>
>>>     static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>>>                    enum i915_sw_fence_notify state)
>>> @@ -104,8 +106,8 @@ static enum hrtimer_restart 
>>> huc_delayed_load_timer_callback(struct hrtimer *hrti
>>>       struct intel_huc *huc = container_of(hrtimer, struct 
>>> intel_huc, delayed_load.timer);
>>>         if (!intel_huc_is_authenticated(huc)) {
>>> -        drm_err(&huc_to_gt(huc)->i915->drm,
>>> -            "timed out waiting for GSC init to load HuC\n");
>>> +        drm_notice(&huc_to_gt(huc)->i915->drm,
>>> +               "timed out waiting for GSC init to load HuC\n");
>> If the failure is that the MEI PXP module hasn't loaded yet, why is 
>> the error message 'waiting for GSC init'? Or can we not distinguish 
>> between GSC load timeout and PXP load timeout? In which case should 
>> the message refer to 'GSC/PXP'?
>
> I wanted to keep things simple and have a unified message for both 
> scenarios as what we care about from an i915 POV is that something 
> went wrong on the mei side. I can split it up.
Or just change it to 'waiting for GSC/PXP init'? If we don't care about 
the differentiation then no need to make the code more complex. But it 
is worth having a message that reflects all the major causes of the issue.

John.

>
> Daniele
>
>>
>> John.
>>
>>>             __gsc_init_error(huc);
>>>       }
>>
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
index 4d1cc383b681..73a6a2fae637 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
@@ -52,10 +52,12 @@ 
  * guaranteed for this to happen during boot, so the big timeout is a safety net
  * that we never expect to need.
  * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs to be resumed
- * and/or reset, this can take longer.
+ * and/or reset, this can take longer. Note that the kernel might schedule
+ * other work between the i915 init/resume and the MEI one, which can add to
+ * the delay.
  */
 #define GSC_INIT_TIMEOUT_MS 10000
-#define PXP_INIT_TIMEOUT_MS 2000
+#define PXP_INIT_TIMEOUT_MS 5000
 
 static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
 				 enum i915_sw_fence_notify state)
@@ -104,8 +106,8 @@  static enum hrtimer_restart huc_delayed_load_timer_callback(struct hrtimer *hrti
 	struct intel_huc *huc = container_of(hrtimer, struct intel_huc, delayed_load.timer);
 
 	if (!intel_huc_is_authenticated(huc)) {
-		drm_err(&huc_to_gt(huc)->i915->drm,
-			"timed out waiting for GSC init to load HuC\n");
+		drm_notice(&huc_to_gt(huc)->i915->drm,
+			   "timed out waiting for GSC init to load HuC\n");
 
 		__gsc_init_error(huc);
 	}