drm/i915/selftest: Bump up sample period for busy stats selftest

Message ID 20221103001149.1203285-1-umesh.nerlige.ramappa@intel.com (mailing list archive)
State New, archived
Series drm/i915/selftest: Bump up sample period for busy stats selftest

Commit Message

Umesh Nerlige Ramappa Nov. 3, 2022, 12:11 a.m. UTC
Engine busyness sampling over a 10ms period is failing, with busyness
ranging from approximately 87% to 115% of the sample period. The
expected range is +/- 5% of the sample period.

When determining the busyness of an active engine, the GuC-based engine
busyness implementation relies on a 64-bit timestamp register read. The
latency incurred by this register read causes the failure.

On DG1, when the test fails, the observed latencies range from 900us to
1.5ms.
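
With a 10ms sample period, a 1.5ms unaccounted latency alone amounts to
1.5 / 10 = 15% error, well outside the +/- 5% tolerance; spread over a
100ms period the same latency contributes only 1.5 / 100 = 1.5%.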

One solution tried was to reduce the latency between the reg read and
the CPU timestamp capture, but such an optimization adds no value for
the user, since the CPU timestamp obtained here is only used for (1)
the selftest and (2) the i915 rps implementation specific to the
execlist scheduler. Also, this solution only reduces the frequency of
the failure and does not eliminate it.

In order to make the selftest more robust and account for such
latencies, increase the sample period to 100 ms.

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Tvrtko Ursulin Nov. 3, 2022, 12:28 p.m. UTC | #1
On 03/11/2022 00:11, Umesh Nerlige Ramappa wrote:
> Engine busyness samples around a 10ms period is failing with busyness
> ranging approx. from 87% to 115%. The expected range is +/- 5% of the
> sample period.
> 
> When determining busyness of active engine, the GuC based engine
> busyness implementation relies on a 64 bit timestamp register read. The
> latency incurred by this register read causes the failure.
> 
> On DG1, when the test fails, the observed latencies range from 900us -
> 1.5ms.

Do I read this right - that the latency of a 64 bit timestamp register 
read is 0.9 - 1.5ms? That would be the read in guc_update_pm_timestamp?

Regards,

Tvrtko

> One solution tried was to reduce the latency between reg read and
> CPU timestamp capture, but such optimization does not add value to user
> since the CPU timestamp obtained here is only used for (1) selftest and
> (2) i915 rps implementation specific to execlist scheduler. Also, this
> solution only reduces the frequency of failure and does not eliminate
> it.
> 
> In order to make the selftest more robust and account for such
> latencies, increase the sample period to 100 ms.
> 
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
> index 0dcb3ed44a73..87c94314cf67 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
> @@ -317,7 +317,7 @@ static int live_engine_busy_stats(void *arg)
>   		ENGINE_TRACE(engine, "measuring busy time\n");
>   		preempt_disable();
>   		de = intel_engine_get_busy_time(engine, &t[0]);
> -		mdelay(10);
> +		mdelay(100);
>   		de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
>   		preempt_enable();
>   		dt = ktime_sub(t[1], t[0]);
Dixit, Ashutosh Nov. 3, 2022, 5:01 p.m. UTC | #2
On Wed, 02 Nov 2022 17:11:49 -0700, Umesh Nerlige Ramappa wrote:
>
> Engine busyness samples around a 10ms period is failing with busyness
> ranging approx. from 87% to 115%. The expected range is +/- 5% of the
> sample period.
>
> When determining busyness of active engine, the GuC based engine
> busyness implementation relies on a 64 bit timestamp register read. The
> latency incurred by this register read causes the failure.
>
> On DG1, when the test fails, the observed latencies range from 900us -
> 1.5ms.
>
> One solution tried was to reduce the latency between reg read and
> CPU timestamp capture, but such optimization does not add value to user
> since the CPU timestamp obtained here is only used for (1) selftest and
> (2) i915 rps implementation specific to execlist scheduler. Also, this
> solution only reduces the frequency of failure and does not eliminate
> it.
>
> In order to make the selftest more robust and account for such
> latencies, increase the sample period to 100 ms.

Does it make sense, and also by way of documenting, to use 10 ms for
execlists and 100 ms for GuC? Maybe a comment in the code would be nice
too. Thanks.
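
Purely as an illustration of that split (a sketch only, assuming
intel_engine_uses_guc() is an acceptable way to distinguish the two
backends in this selftest), the delay could be made backend-dependent:

	/* GuC busyness needs a longer sample to absorb GT timestamp read latency */
	unsigned long busy_sample_ms = intel_engine_uses_guc(engine) ? 100 : 10;

	de = intel_engine_get_busy_time(engine, &t[0]);
	mdelay(busy_sample_ms);
	de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);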

>
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
> index 0dcb3ed44a73..87c94314cf67 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
> @@ -317,7 +317,7 @@ static int live_engine_busy_stats(void *arg)
>		ENGINE_TRACE(engine, "measuring busy time\n");
>		preempt_disable();
>		de = intel_engine_get_busy_time(engine, &t[0]);
> -		mdelay(10);
> +		mdelay(100);
>		de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
>		preempt_enable();
>		dt = ktime_sub(t[1], t[0]);
> --
> 2.36.1
>
Umesh Nerlige Ramappa Nov. 3, 2022, 6:08 p.m. UTC | #3
On Thu, Nov 03, 2022 at 12:28:46PM +0000, Tvrtko Ursulin wrote:
>
>On 03/11/2022 00:11, Umesh Nerlige Ramappa wrote:
>>Engine busyness samples around a 10ms period is failing with busyness
>>ranging approx. from 87% to 115%. The expected range is +/- 5% of the
>>sample period.
>>
>>When determining busyness of active engine, the GuC based engine
>>busyness implementation relies on a 64 bit timestamp register read. The
>>latency incurred by this register read causes the failure.
>>
>>On DG1, when the test fails, the observed latencies range from 900us -
>>1.5ms.
>
>Do I read this right - that the latency of a 64 bit timestamp register 
>read is 0.9 - 1.5ms? That would be the read in 
>guc_update_pm_timestamp?

Correct. That is the total time taken by intel_uncore_read64_2x32(),
measured with local_clock().
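
Roughly, the measurement had this shape (a hand-written sketch;
lower_reg/upper_reg stand in for the actual timestamp register pair
read in guc_update_pm_timestamp()):

	u64 t0, t1, val;

	t0 = local_clock();
	val = intel_uncore_read64_2x32(uncore, lower_reg, upper_reg);
	t1 = local_clock();
	/* t1 - t0 is in ns; ~900000-1500000 seen on DG1 when the test fails */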

One other thing I missed in the comments is that enable_dc=0 also
resolves the issue, but the display team confirmed there is no relation
to display in this case other than that it somehow introduces latency
in the reg read.

>
>Regards,
>
>Tvrtko
>
>>One solution tried was to reduce the latency between reg read and
>>CPU timestamp capture, but such optimization does not add value to user
>>since the CPU timestamp obtained here is only used for (1) selftest and
>>(2) i915 rps implementation specific to execlist scheduler. Also, this
>>solution only reduces the frequency of failure and does not eliminate
>>it.

Note that this solution is here - 
https://patchwork.freedesktop.org/patch/509991/?series=110497&rev=1

but I am not intending to use it since it only reduces the frequency of
failures while the inherent issue still exists.

Regards,
Umesh

>>
>>In order to make the selftest more robust and account for such
>>latencies, increase the sample period to 100 ms.
>>
>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>---
>>  drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>>diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>index 0dcb3ed44a73..87c94314cf67 100644
>>--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>@@ -317,7 +317,7 @@ static int live_engine_busy_stats(void *arg)
>>  		ENGINE_TRACE(engine, "measuring busy time\n");
>>  		preempt_disable();
>>  		de = intel_engine_get_busy_time(engine, &t[0]);
>>-		mdelay(10);
>>+		mdelay(100);
>>  		de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
>>  		preempt_enable();
>>  		dt = ktime_sub(t[1], t[0]);
Umesh Nerlige Ramappa Nov. 3, 2022, 6:19 p.m. UTC | #4
On Thu, Nov 03, 2022 at 10:01:11AM -0700, Dixit, Ashutosh wrote:
>On Wed, 02 Nov 2022 17:11:49 -0700, Umesh Nerlige Ramappa wrote:
>>
>> Engine busyness samples around a 10ms period is failing with busyness
>> ranging approx. from 87% to 115%. The expected range is +/- 5% of the
>> sample period.
>>
>> When determining busyness of active engine, the GuC based engine
>> busyness implementation relies on a 64 bit timestamp register read. The
>> latency incurred by this register read causes the failure.
>>
>> On DG1, when the test fails, the observed latencies range from 900us -
>> 1.5ms.
>>
>> One solution tried was to reduce the latency between reg read and
>> CPU timestamp capture, but such optimization does not add value to user
>> since the CPU timestamp obtained here is only used for (1) selftest and
>> (2) i915 rps implementation specific to execlist scheduler. Also, this
>> solution only reduces the frequency of failure and does not eliminate
>> it.
>>
>> In order to make the selftest more robust and account for such
>> latencies, increase the sample period to 100 ms.
>
>Does it make sense, and also by way of documenting, to use 10 ms for
>execlists and 100 ms for GuC? Maybe a comment in the code would be nice
>too. Thanks.

I was hoping to keep the same logic for execlist/guc backends. I can add 
it to the comments though.

Sadly, this is the second time we are bumping this up. It was originally
100us for execlists. With the GuC backend there is a latency by design,
since active busyness is calculated using the GT timestamp register.
Execlists relied solely on ktime_get() to check for active busyness,
and that seemed to have negligible latency. I see no robust option here.
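
If we go with a comment, something along these lines in the selftest
(wording is only a suggestion):

	/*
	 * Use a sample period long enough that the latency of the 64-bit
	 * GT timestamp read used by the GuC busyness implementation (up
	 * to ~1.5ms observed on DG1) stays well within the +/- 5%
	 * tolerance. Execlists rely only on ktime_get() and were fine
	 * with shorter periods.
	 */
	mdelay(100);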

Thanks,
Umesh

>
>>
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>> index 0dcb3ed44a73..87c94314cf67 100644
>> --- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>> @@ -317,7 +317,7 @@ static int live_engine_busy_stats(void *arg)
>>		ENGINE_TRACE(engine, "measuring busy time\n");
>>		preempt_disable();
>>		de = intel_engine_get_busy_time(engine, &t[0]);
>> -		mdelay(10);
>> +		mdelay(100);
>>		de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
>>		preempt_enable();
>>		dt = ktime_sub(t[1], t[0]);
>> --
>> 2.36.1
>>
Tvrtko Ursulin Nov. 4, 2022, 8:29 a.m. UTC | #5
On 03/11/2022 18:08, Umesh Nerlige Ramappa wrote:
> On Thu, Nov 03, 2022 at 12:28:46PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/11/2022 00:11, Umesh Nerlige Ramappa wrote:
>>> Engine busyness samples around a 10ms period is failing with busyness
>>> ranging approx. from 87% to 115%. The expected range is +/- 5% of the
>>> sample period.
>>>
>>> When determining busyness of active engine, the GuC based engine
>>> busyness implementation relies on a 64 bit timestamp register read. The
>>> latency incurred by this register read causes the failure.
>>>
>>> On DG1, when the test fails, the observed latencies range from 900us -
>>> 1.5ms.
>>
>> Do I read this right - that the latency of a 64 bit timestamp register 
>> read is 0.9 - 1.5ms? That would be the read in guc_update_pm_timestamp?
> 
> Correct. That is total time taken by intel_uncore_read64_2x32() measured 
> with local_clock().
> 
> One other thing I missed out in the comments is that enable_dc=0 also 
> resolves the issue, but display team confirmed there is no relation to 
> display in this case other than that it somehow introduces a latency in 
> the reg read.

Could it be the DMC wreaking havoc, something similar to b68763741aa2 
("drm/i915: Restore GT performance in headless mode with DMC loaded")?

>>> One solution tried was to reduce the latency between reg read and
>>> CPU timestamp capture, but such optimization does not add value to user
>>> since the CPU timestamp obtained here is only used for (1) selftest and
>>> (2) i915 rps implementation specific to execlist scheduler. Also, this
>>> solution only reduces the frequency of failure and does not eliminate
>>> it.
> 
> Note that this solution is here - 
> https://patchwork.freedesktop.org/patch/509991/?series=110497&rev=1
> 
> but I am not intending to use it since it just reduces the frequency of 
> failues, but the inherent issue still exists.

Right, I'd just go with that as well if it makes a significant 
improvement. Or even just refactor intel_uncore_read64_2x32 to be under 
one spinlock/fw. I don't see that it has any excuse to be less 
efficient, since there's a loop in there.
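
For illustration, that refactor could look roughly like this (a sketch
only, keeping the existing retry loop but taking the uncore lock and
forcewake once around the whole 64-bit read):

u64 intel_uncore_read64_2x32(struct intel_uncore *uncore,
			     i915_reg_t lower_reg, i915_reg_t upper_reg)
{
	u32 upper, lower, old_upper, loop = 0;
	enum forcewake_domains fw_domains;
	unsigned long flags;

	fw_domains = intel_uncore_forcewake_for_reg(uncore, lower_reg,
						    FW_REG_READ);
	fw_domains |= intel_uncore_forcewake_for_reg(uncore, upper_reg,
						     FW_REG_READ);

	spin_lock_irqsave(&uncore->lock, flags);
	intel_uncore_forcewake_get__locked(uncore, fw_domains);

	/* re-read until the upper half is stable, all under one lock/fw */
	upper = intel_uncore_read_fw(uncore, upper_reg);
	do {
		old_upper = upper;
		lower = intel_uncore_read_fw(uncore, lower_reg);
		upper = intel_uncore_read_fw(uncore, upper_reg);
	} while (upper != old_upper && loop++ < 2);

	intel_uncore_forcewake_put__locked(uncore, fw_domains);
	spin_unlock_irqrestore(&uncore->lock, flags);

	return (u64)upper << 32 | lower;
}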

Regards,

Tvrtko

> Regards,
> Umesh
> 
>>>
>>> In order to make the selftest more robust and account for such
>>> latencies, increase the sample period to 100 ms.
>>>
>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>> ---
>>>  drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c 
>>> b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>> index 0dcb3ed44a73..87c94314cf67 100644
>>> --- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>> @@ -317,7 +317,7 @@ static int live_engine_busy_stats(void *arg)
>>>          ENGINE_TRACE(engine, "measuring busy time\n");
>>>          preempt_disable();
>>>          de = intel_engine_get_busy_time(engine, &t[0]);
>>> -        mdelay(10);
>>> +        mdelay(100);
>>>          de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
>>>          preempt_enable();
>>>          dt = ktime_sub(t[1], t[0]);
Umesh Nerlige Ramappa Nov. 4, 2022, 2:58 p.m. UTC | #6
On Fri, Nov 04, 2022 at 08:29:38AM +0000, Tvrtko Ursulin wrote:
>
>On 03/11/2022 18:08, Umesh Nerlige Ramappa wrote:
>>On Thu, Nov 03, 2022 at 12:28:46PM +0000, Tvrtko Ursulin wrote:
>>>
>>>On 03/11/2022 00:11, Umesh Nerlige Ramappa wrote:
>>>>Engine busyness samples around a 10ms period is failing with busyness
>>>>ranging approx. from 87% to 115%. The expected range is +/- 5% of the
>>>>sample period.
>>>>
>>>>When determining busyness of active engine, the GuC based engine
>>>>busyness implementation relies on a 64 bit timestamp register read. The
>>>>latency incurred by this register read causes the failure.
>>>>
>>>>On DG1, when the test fails, the observed latencies range from 900us -
>>>>1.5ms.
>>>
>>>Do I read this right - that the latency of a 64 bit timestamp 
>>>register read is 0.9 - 1.5ms? That would be the read in 
>>>guc_update_pm_timestamp?
>>
>>Correct. That is total time taken by intel_uncore_read64_2x32() 
>>measured with local_clock().
>>
>>One other thing I missed out in the comments is that enable_dc=0 
>>also resolves the issue, but display team confirmed there is no 
>>relation to display in this case other than that it somehow 
>>introduces a latency in the reg read.
>
>Could it be the DMC wreaking havoc something similar to b68763741aa2 
>("drm/i915: Restore GT performance in headless mode with DMC loaded")?
>

__gt_unpark is already doing a 

gt->awake = intel_display_power_get(i915, POWER_DOMAIN_GT_IRQ);

I would assume that __gt_unpark was called prior to running the 
selftest; I need to confirm that, though.

>>>>One solution tried was to reduce the latency between reg read and
>>>>CPU timestamp capture, but such optimization does not add value to user
>>>>since the CPU timestamp obtained here is only used for (1) selftest and
>>>>(2) i915 rps implementation specific to execlist scheduler. Also, this
>>>>solution only reduces the frequency of failure and does not eliminate
>>>>it.
>>
>>Note that this solution is here - 
>>https://patchwork.freedesktop.org/patch/509991/?series=110497&rev=1
>>
>>but I am not intending to use it since it just reduces the frequency 
>>of failues, but the inherent issue still exists.
>
>Right, I'd just go with that as well if it makes a significant 
>improvement. Or even just refactor intel_uncore_read64_2x32 to be 
>under one spinlock/fw. I don't see that it can have an excuse to be 
>less efficient since there's a loop in there.

The patch did reduce the failure rate to once in 200 runs vs once in 10 
runs.

I will refactor the helper in that case.

Thanks,
Umesh

>
>Regards,
>
>Tvrtko
>
>>Regards,
>>Umesh
>>
>>>>
>>>>In order to make the selftest more robust and account for such
>>>>latencies, increase the sample period to 100 ms.
>>>>
>>>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>>>---
>>>> drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>>diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c 
>>>>b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>>>index 0dcb3ed44a73..87c94314cf67 100644
>>>>--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>>>+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
>>>>@@ -317,7 +317,7 @@ static int live_engine_busy_stats(void *arg)
>>>>         ENGINE_TRACE(engine, "measuring busy time\n");
>>>>         preempt_disable();
>>>>         de = intel_engine_get_busy_time(engine, &t[0]);
>>>>-        mdelay(10);
>>>>+        mdelay(100);
>>>>         de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
>>>>         preempt_enable();
>>>>         dt = ktime_sub(t[1], t[0]);
Tvrtko Ursulin Nov. 4, 2022, 3:45 p.m. UTC | #7
On 04/11/2022 14:58, Umesh Nerlige Ramappa wrote:
> On Fri, Nov 04, 2022 at 08:29:38AM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/11/2022 18:08, Umesh Nerlige Ramappa wrote:
>>> On Thu, Nov 03, 2022 at 12:28:46PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/11/2022 00:11, Umesh Nerlige Ramappa wrote:
>>>>> Engine busyness samples around a 10ms period is failing with busyness
>>>>> ranging approx. from 87% to 115%. The expected range is +/- 5% of the
>>>>> sample period.
>>>>>
>>>>> When determining busyness of active engine, the GuC based engine
>>>>> busyness implementation relies on a 64 bit timestamp register read. 
>>>>> The
>>>>> latency incurred by this register read causes the failure.
>>>>>
>>>>> On DG1, when the test fails, the observed latencies range from 900us -
>>>>> 1.5ms.
>>>>
>>>> Do I read this right - that the latency of a 64 bit timestamp 
>>>> register read is 0.9 - 1.5ms? That would be the read in 
>>>> guc_update_pm_timestamp?
>>>
>>> Correct. That is total time taken by intel_uncore_read64_2x32() 
>>> measured with local_clock().
>>>
>>> One other thing I missed out in the comments is that enable_dc=0 also 
>>> resolves the issue, but display team confirmed there is no relation 
>>> to display in this case other than that it somehow introduces a 
>>> latency in the reg read.
>>
>> Could it be the DMC wreaking havoc something similar to b68763741aa2 
>> ("drm/i915: Restore GT performance in headless mode with DMC loaded")?
>>
> 
> __gt_unpark is already doing a
> gt->awake = intel_display_power_get(i915, POWER_DOMAIN_GT_IRQ);
> 
> I would assume that __gt_unpark was called prior to running the 
> selftest, need to confirm that though.

Right, I meant maybe something similar but not necessarily the same. 
Similar in the sense that it may be the DMC doing many MMIO accesses 
invisible to i915 and so introducing latency.

>>>>> One solution tried was to reduce the latency between reg read and
>>>>> CPU timestamp capture, but such optimization does not add value to 
>>>>> user
>>>>> since the CPU timestamp obtained here is only used for (1) selftest 
>>>>> and
>>>>> (2) i915 rps implementation specific to execlist scheduler. Also, this
>>>>> solution only reduces the frequency of failure and does not eliminate
>>>>> it.
>>>
>>> Note that this solution is here - 
>>> https://patchwork.freedesktop.org/patch/509991/?series=110497&rev=1
>>>
>>> but I am not intending to use it since it just reduces the frequency 
>>> of failues, but the inherent issue still exists.
>>
>> Right, I'd just go with that as well if it makes a significant 
>> improvement. Or even just refactor intel_uncore_read64_2x32 to be 
>> under one spinlock/fw. I don't see that it can have an excuse to be 
>> less efficient since there's a loop in there.
> 
> The patch did reduce the failure to once in 200 runs vs once in 10 runs.
> I will refactor the helper in that case.

Yeah, it makes sense to make it efficient. But feel free to go with the 
mdelay increase as well to work around the issue fully.

Regards,

Tvrtko

Patch

diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
index 0dcb3ed44a73..87c94314cf67 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
@@ -317,7 +317,7 @@  static int live_engine_busy_stats(void *arg)
 		ENGINE_TRACE(engine, "measuring busy time\n");
 		preempt_disable();
 		de = intel_engine_get_busy_time(engine, &t[0]);
-		mdelay(10);
+		mdelay(100);
 		de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
 		preempt_enable();
 		dt = ktime_sub(t[1], t[0]);