
[RFC] drm/i915: Temporarily go realtime when polling PCODE

Message ID 20170221170158.6384-1-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Tvrtko Ursulin Feb. 21, 2017, 5:01 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Elevate task scheduling policy to realtime when polling on PCODE
to guarantee a good poll rate before falling back to busy wait.

We only do this for tasks with normal policy and priority, both to
simplify restoring the policy afterwards and on the assumption that
tasks which have already made themselves low or high priority have
less need for it.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
This was my idea as mentioned in the other thread.

Deadline scheduling policy seems trickier to restore from so
I thought SCHED_FIFO should be good enough.

Briefly tested but couldn't reproduce the timeout condition.
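
For reference, this is roughly what the macro change boils down to,
written out as a pair of plain helpers (hypothetical names, untested
sketch) instead of inside _wait_for():

	static int elevate_to_rt(void)
	{
		struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

		/*
		 * Only touch tasks running with the default policy and nice
		 * level, so that restoring afterwards is trivial.
		 */
		if (current->policy != SCHED_NORMAL || task_nice(current) != 0)
			return -EINVAL;

		return sched_setscheduler_nocheck(current, SCHED_FIFO, &param);
	}

	static void restore_normal(void)
	{
		struct sched_param param = { .sched_priority = 0 };

		WARN_ON(sched_setscheduler_nocheck(current, SCHED_NORMAL,
						   &param));
	}

The macro version just inlines this around the sleeping poll loop.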
---
 drivers/gpu/drm/i915/intel_drv.h | 33 +++++++++++++++++++++++++++++----
 drivers/gpu/drm/i915/intel_pm.c  |  4 +++-
 2 files changed, 32 insertions(+), 5 deletions(-)

Comments

Imre Deak Feb. 21, 2017, 6:48 p.m. UTC | #1
On Tue, Feb 21, 2017 at 05:01:58PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Elevate task scheduling policy to realtime when polling on PCODE
> to guarantee a good poll rate before falling back to busy wait.
> 
> We only do this for tasks with normal policy and priority in
> order  to simplify policy restore and also assuming that for
> tasks which either made themselves low or high priority it makes
> less sense to do so.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Imre Deak <imre.deak@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> ---
> This was my idea as mentioned in the other thread.
> 
> Deadline scheduling policy seems trickier to restore from so
> I thought SCHED_FIFO should be good enough.
> 
> Briefly tested but couldn't reproduce the timeout condition.

Hm, I thought you wanted this instead of the preempt-disable poll. The
first preempt-enable poll is what's based on the spec, which only
requires two requests 3ms apart, so no requirement on the number of
requests there. That works most of the time and the preempt-disable part
is needed only rarely. So do we want to increase the priority for the
normal case?

> ---
>  drivers/gpu/drm/i915/intel_drv.h | 33 +++++++++++++++++++++++++++++----
>  drivers/gpu/drm/i915/intel_pm.c  |  4 +++-
>  2 files changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 821c57cab406..70033bdd183e 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -51,9 +51,24 @@
>   * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
>   * added.
>   */
> -#define _wait_for(COND, US, W) ({ \
> +#define _wait_for(COND, US, W, DEADLINE) ({ \
>  	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
> -	int ret__;							\
> +	int sched__, ret__;						\
> +									\
> +	if ((DEADLINE) && (W)) {					\
> +		struct task_struct *t = current;			\
> +									\
> +		if (t->policy != 0 || task_nice(t) != 0) {		\
> +			sched__ = -EINVAL;				\
> +		} else {						\
> +			struct sched_param param =			\
> +				{ .sched_priority = MAX_RT_PRIO - 1 };	\
> +			sched__ = sched_setscheduler_nocheck(t,		\
> +							     SCHED_FIFO,\
> +							     &param);	\
> +		}							\
> +	}								\
> +									\
>  	for (;;) {							\
>  		bool expired__ = time_after(jiffies, timeout__);	\
>  		if (COND) {						\
> @@ -70,10 +85,20 @@
>  			cpu_relax();					\
>  		}							\
>  	}								\
> +									\
> +	if ((DEADLINE) && (W) && sched__ == 0) {			\
> +		struct task_struct *t = current;			\
> +		struct sched_param param =				\
> +				{ .sched_priority = 0 };		\
> +									\
> +		WARN_ON(sched_setscheduler_nocheck(t, SCHED_NORMAL,	\
> +						   &param));		\
> +	}								\
> +									\
>  	ret__;								\
>  })
>  
> -#define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
> +#define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000, 0)
>  
>  /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
>  #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
> @@ -123,7 +148,7 @@
>  	int ret__; \
>  	BUILD_BUG_ON(!__builtin_constant_p(US)); \
>  	if ((US) > 10) \
> -		ret__ = _wait_for((COND), (US), 10); \
> +		ret__ = _wait_for((COND), (US), 10, 0); \
>  	else \
>  		ret__ = _wait_for_atomic((COND), (US), 0); \
>  	ret__; \
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index 169c4908ad5b..215b1a9fd214 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -7939,7 +7939,7 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
>  		ret = 0;
>  		goto out;
>  	}
> -	ret = _wait_for(COND, timeout_base_ms * 1000, 10);
> +	ret = _wait_for(COND, timeout_base_ms * 1000, 10, 1);
>  	if (!ret)
>  		goto out;
>  
> @@ -7957,6 +7957,8 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
>  	preempt_disable();
>  	ret = wait_for_atomic(COND, 10);
>  	preempt_enable();
> +	if (ret == 0)
> +		DRM_DEBUG_KMS("PCODE success after timeout\n");
>  
>  out:
>  	return ret ? ret : status;
> -- 
> 2.9.3
>
Tvrtko Ursulin Feb. 22, 2017, 7:52 a.m. UTC | #2
On 21/02/2017 18:48, Imre Deak wrote:
> On Tue, Feb 21, 2017 at 05:01:58PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Elevate task scheduling policy to realtime when polling on PCODE
>> to guarantee a good poll rate before falling back to busy wait.
>>
>> We only do this for tasks with normal policy and priority in
>> order  to simplify policy restore and also assuming that for
>> tasks which either made themselves low or high priority it makes
>> less sense to do so.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Imre Deak <imre.deak@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> ---
>> This was my idea as mentioned in the other thread.
>>
>> Deadline scheduling policy seems trickier to restore from so
>> I thought SCHED_FIFO should be good enough.
>>
>> Briefly tested but couldn't reproduce the timeout condition.
>
> Hm, I thought you wanted this instead of the preempt-disable poll. The
> first preempt-enable poll is what's based on the spec, which only
> requires two requests 3ms apart, so no requirement on the number of
> requests there. That works most of the time and the preempt-disable part
> is needed only rarely. So do we want to increase the priority for the
> normal case?

So we end up in the busy loop case less often, or never? (By polling
better in the sleeping loop.) It is possible I got this completely
wrong, mind you. I was just going by what is written in this thread -
that the problem is the sleeping loop sometimes not running the COND
often enough, or enough times.

Regards,

Tvrtko
Imre Deak Feb. 22, 2017, 9:13 a.m. UTC | #3
On Wed, Feb 22, 2017 at 07:52:01AM +0000, Tvrtko Ursulin wrote:
> 
> On 21/02/2017 18:48, Imre Deak wrote:
> >On Tue, Feb 21, 2017 at 05:01:58PM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>Elevate task scheduling policy to realtime when polling on PCODE
> >>to guarantee a good poll rate before falling back to busy wait.
> >>
> >>We only do this for tasks with normal policy and priority in
> >>order  to simplify policy restore and also assuming that for
> >>tasks which either made themselves low or high priority it makes
> >>less sense to do so.
> >>
> >>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>Cc: Imre Deak <imre.deak@intel.com>
> >>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>---
> >>This was my idea as mentioned in the other thread.
> >>
> >>Deadline scheduling policy seems trickier to restore from so
> >>I thought SCHED_FIFO should be good enough.
> >>
> >>Briefly tested but couldn't reproduce the timeout condition.
> >
> >Hm, I thought you wanted this instead of the preempt-disable poll. The
> >first preempt-enable poll is what's based on the spec, which only
> >requires two requests 3ms apart, so no requirement on the number of
> >requests there. That works most of the time and the preempt-disable part
> >is needed only rarely. So do we want to increase the priority for the
> >normal case?
> 
> So we end up in the busy loop case less often or never? (By polling better
> in the sleeping loop.) It is possible I got this completely wrong mind you.
> I was just going by what is written in this thread - that the problem is the
> sleeping loop sometimes does not run the COND often enough, or enough times.

Yes, but that means we also raise the priority for the usual case. That
would make the first loop similar to the busy loop we want to avoid,
and we would be running it always. What I hope is that this is a
problem in the PCODE firmware that will get solved eventually, so we
won't need the WA; hence I argued for keeping any WA separate.

--Imre
Tvrtko Ursulin Feb. 23, 2017, 9:37 a.m. UTC | #4
On 22/02/2017 09:13, Imre Deak wrote:
> On Wed, Feb 22, 2017 at 07:52:01AM +0000, Tvrtko Ursulin wrote:
>>
>> On 21/02/2017 18:48, Imre Deak wrote:
>>> On Tue, Feb 21, 2017 at 05:01:58PM +0000, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>
>>>> Elevate task scheduling policy to realtime when polling on PCODE
>>>> to guarantee a good poll rate before falling back to busy wait.
>>>>
>>>> We only do this for tasks with normal policy and priority in
>>>> order  to simplify policy restore and also assuming that for
>>>> tasks which either made themselves low or high priority it makes
>>>> less sense to do so.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Cc: Imre Deak <imre.deak@intel.com>
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> ---
>>>> This was my idea as mentioned in the other thread.
>>>>
>>>> Deadline scheduling policy seems trickier to restore from so
>>>> I thought SCHED_FIFO should be good enough.
>>>>
>>>> Briefly tested but couldn't reproduce the timeout condition.
>>>
>>> Hm, I thought you wanted this instead of the preempt-disable poll. The
>>> first preempt-enable poll is what's based on the spec, which only
>>> requires two requests 3ms apart, so no requirement on the number of
>>> requests there. That works most of the time and the preempt-disable part
>>> is needed only rarely. So do we want to increase the priority for the
>>> normal case?
>>
>> So we end up in the busy loop case less often or never? (By polling better
>> in the sleeping loop.) It is possible I got this completely wrong mind you.
>> I was just going by what is written in this thread - that the problem is the
>> sleeping loop sometimes does not run the COND often enough, or enough times.
>
> Yes, but that means we also raise the priority for the usual case. That
> would make the first loop a similar busy loop to what we want to avoid,
> running that always. What I hope is that this is a problem in the PCODE
> firmware that will get solved eventually, so we don't need the WA; hence
> argued about keeping any WA separate.

Having read the spec I think I see both sides now.

The spec actually suggests we should busy-retry the PCODE request for
3ms in this case.

It doesn't say how many retries we are supposed to do or how it
operates internally, which makes me unsure whether our first, more
relaxed polling is causing or contributing to the issue.

One place where we don't follow the spec is the timeout for the
GEN6_PCODE_READY poll, which the spec says should be 150us rather than
500ms. I don't know if this timeout was triggered in the bug reports?
If not, it is not the direct issue, but it could be a contributing one,
so the questions are why we decided to do it this way and whether we
should change it to a 150us busy wait instead (adding a
wait_for_register_fw_us helper).
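
For illustration, the shorter poll I have in mind would be something
like the below (untested sketch; it reuses the existing wait_for_us()
macro, while a dedicated busy-wait helper like the
wait_for_register_fw_us mentioned above would match the spec even more
closely):

	/* Poll GEN6_PCODE_READY for at most 150us as per the spec,
	 * instead of the current 500ms sleeping wait. */
	if (wait_for_us(!(I915_READ_FW(GEN6_PCODE_MAILBOX) &
			  GEN6_PCODE_READY), 150)) {
		DRM_DEBUG_DRIVER("PCODE mailbox not ready within 150us\n");
		return -ETIMEDOUT;
	}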

Another thing is the 10-20us interval for the top-level PCODE retry -
the spec does not say we should wait before retrying, so is this our
own decision, to be nicer to the system?

In either case, if the poll for GEN6_PCODE_READY takes more than 2us
(the busy-spin limit before falling back to the sleeping poll), and the
higher-level PCODE retry ends up taking much longer than the 10-20us
written in the code - first because the hardware takes longer than 2us
to respond, and then because of overall CPU load and scheduling
latencies - we would be drifting away from what the spec prescribes.

But regardless, the fact that the fallback busy loop also needs up to
34ms makes the last point above somewhat uncertain. It would only
matter if the non-compliant polling we do somehow confuses the
hardware, and we then end up having to busy-poll longer than we
normally would. Probably unlikely.

Regards,

Tvrtko
Imre Deak Feb. 23, 2017, 12:01 p.m. UTC | #5
On Thu, Feb 23, 2017 at 09:37:29AM +0000, Tvrtko Ursulin wrote:
> [...]
> Having read the spec I think I see both sides now.
> 
> Spec is actually suggesting we should busy-retry the pcode request for 3ms
> in this case.

Well, retry for 3ms without setting any minimum for the number of
requests. That couldn't be guaranteed anyway due to scheduling etc, and
would be a strange ABI. Later Art Runyan clarified this in the way it's
described in the code comment: What is required is two requests at
least 3ms apart. The first request is queued by the firmware and the
second request signals completion.
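
Purely as an illustration, the queue + completion-check sequence could
be written out like this; both function names here are hypothetical,
with pcode_request_once() standing in for a single mailbox request that
returns 0 once reply_mask matches reply:

	static int pcode_queue_and_confirm(struct drm_i915_private *dev_priv,
					   u32 mbox, u32 request,
					   u32 reply_mask, u32 reply)
	{
		int ret;

		/* First request: gets queued by the PCODE firmware. */
		ret = pcode_request_once(dev_priv, mbox, request,
					 reply_mask, reply);
		if (ret == 0)
			return 0;

		/* Give the firmware at least 3ms before the confirming request. */
		usleep_range(3000, 3500);

		/* Second request: tells us whether the queued one completed. */
		return pcode_request_once(dev_priv, mbox, request,
					  reply_mask, reply);
	}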

> 
> It doesn't say how many retries we are supposed to do and how it internally
> operates, which makes me unsure if our first more relaxed polling is perhaps
> causing or contributing to the issue.
> 
> One thing where we don't follow the spec is the timeout for the
> GEN6_PCODE_READY poll which spec says should be 150us and not 500ms. I don't
> know if this timeout was trigger in the bug reports?

No, this PCODE_READY poll always succeeds; it's the reply/reply_mask
response which doesn't get set in time.

> If not then it is not
> the direct issue. But could be a contributing one, so the question is why we
> decided to do it and shouldn't we change this one to the 150us busy wait
> instead (add wait_for_register_fw_us)?

Haven't noticed this, but I doubt this is a problem, see below.

> 
> Another thing is the 10-20us retry for the top level PCODE retry - spec does
> not mention we should wait before retrying so is this our decision to be
> nicer to the system?
> 
> In either case, if the poll for GEN6_PCODE_READY is >2us (busy spin limit
> before going to sleeping poll), and the higher level PCODE retry ends up
> much longer than the 10-20us written in the code, first due hardware taking
> longer than 2us to respond, and both due overall CPU load and scheduling
> latencies, we would be drifting away from what is prescribed in the spec.

IIUC all of the above boils down to whether or not the ABI requires
exact timing from us for when we send the requests and for exactly how
long we retry. I think the kernel can't provide such guarantees (at
least not in an obvious way) and the only sane requirement is the
above two requests (queue + completion check) with a _minimum_ amount
of retry time.

So, I suspect the problem is in the firmware: either it is occupied
with something for more than the specified 3ms timeout, or it goes idle
and takes a long time to wake up and service our request unless we send
a burst of requests to keep it awake. That's what our WA is trying to
do. There could be some other trick that would prevent this issue, like
disabling CPU C-states for the duration of the poll; I haven't tried
this yet.
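
(For the record, the C-state idea could be sketched with a PM QoS
request along these lines - completely untested:)

	struct pm_qos_request qos;

	/*
	 * Keep the CPUs out of deep C-states while we poll, so wake-up
	 * latency doesn't distort the request timing (needs
	 * <linux/pm_qos.h>).
	 */
	pm_qos_add_request(&qos, PM_QOS_CPU_DMA_LATENCY, 0);

	ret = _wait_for(COND, timeout_base_ms * 1000, 10); /* poll as today */

	pm_qos_remove_request(&qos);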

> But regardless, the fact that the fallback busy loop needs up to 34ms as
> well makes the last bit from the above a bit uncertain. Only if the
> non-compliant polling we do somehow confuses the hardware and then we end up
> having to busy poll longer than we normally would. Probably unlikely.

I'm trying to get more info on all this (in particular the KBL problem)
from Art. Until then I'd suggest increasing the WA timeout to 50ms,
since that solved the problem for the bug reporter. We can fix things
or add more scaffolding if more evidence comes up, or if there is a new
bug report.

--Imre
Tvrtko Ursulin Feb. 23, 2017, 1 p.m. UTC | #6
On 23/02/2017 12:01, Imre Deak wrote:
> On Thu, Feb 23, 2017 at 09:37:29AM +0000, Tvrtko Ursulin wrote:
>> [...]
>> Having read the spec I think I see both sides now.
>>
>> Spec is actually suggesting we should busy-retry the pcode request for 3ms
>> in this case.
>
> Well, retry for 3ms without setting any minimum for the number of
> requests. That couldn't be guaranteed anyway due to scheduling etc, and
> would be a strange ABI. Later Art Runyan clarified this in the way it's
> described in the code comment: What is required is two requests at
> least 3ms apart. The first request is queued by the firmware and the
> second request signals completion.

Why is our loop then spamming the hardware with requests every 10us?
Perhaps that could be counter-productive? Wouldn't a single sleeping
loop with a long timeout and a 3ms period work? Like:

	ret = _wait_for(COND, 50 * 1000, timeout_base_ms * 1000)

?

>>
>> It doesn't say how many retries we are supposed to do and how it internally
>> operates, which makes me unsure if our first more relaxed polling is perhaps
>> causing or contributing to the issue.
>>
>> One thing where we don't follow the spec is the timeout for the
>> GEN6_PCODE_READY poll which spec says should be 150us and not 500ms. I don't
>> know if this timeout was trigger in the bug reports?
>
> No this PCODE_READY poll always succeeds, it's the reply/reply_mask
> response which doesn't get set in time.

Yes, I know; I was just thinking that if it takes more than 2us it then
falls back to scheduling and usleep_range. That was back when I thought
it was really important to poll rapidly. Since you explained above that
it is just the opposite, I agree this part is not a problem. It may
still make sense to wait for that bit for a shorter period, as per
bspec.

[snip]

>> But regardless, the fact that the fallback busy loop needs up to 34ms as
>> well makes the last bit from the above a bit uncertain. Only if the
>> non-compliant polling we do somehow confuses the hardware and then we end up
>> having to busy poll longer than we normally would. Probably unlikely.
>
> I'm trying to get more info based on all this (in particular the KBL
> problem) from Art. Until that I'd suggest increasing the WA timeout to
> 50ms, since that solved the problem for the bug reporter. We could fix
> things/add more scaffolding if more evidence comes up, or there is a new
> bug report.

Yes, sure - I think I replied before that I'm fine with pushing a 50ms
fix for stable.

Regards,

Tvrtko
Imre Deak Feb. 23, 2017, 5:02 p.m. UTC | #7
On Thu, Feb 23, 2017 at 01:00:32PM +0000, Tvrtko Ursulin wrote:
> 
> On 23/02/2017 12:01, Imre Deak wrote:
> >On Thu, Feb 23, 2017 at 09:37:29AM +0000, Tvrtko Ursulin wrote:
> >>[...]
> >>Having read the spec I think I see both sides now.
> >>
> >>Spec is actually suggesting we should busy-retry the pcode request for 3ms
> >>in this case.
> >
> >Well, retry for 3ms without setting any minimum for the number of
> >requests. That couldn't be guaranteed anyway due to scheduling etc, and
> >would be a strange ABI. Later Art Runyan clarified this in the way it's
> >described in the code comment: What is required is two requests at
> >least 3ms apart. The first request is queued by the firmware and the
> >second request signals completion.
> 
> Why is our loop then spamming the hardware every 10us with requests? Perhaps
> it could be counter-productive? A single sleeping loop with a long timeout
> and a 3ms period wouldn't work? Like:
> 
> 	ret = _wait_for(COND, 50 * 1000, timeout_base_ms * 1000)
> 
> ?

The tight loop is there because, according to the spec, the request
"typically" completes in < 200us, so we want to benefit from fast
completions.

By a single loop, do you mean without the initial 'if (COND) ...' or
without the WA loop? If you meant without 'if (COND)', then technically
that would still allow _wait_for() to check COND only once. I know,
very unlikely with your 50ms timeout above.

If you meant without the WA loop, then we still need that, since the
'two requests 3ms apart' above is just how the ABI should work based on
the feedback from the PCODE people so far. There are occasional
timeouts due to a glitch somewhere (I believe in the firmware) which
requires that we run the WA with the burst of requests. That part is an
empirical solution based on our own tests. I hope to get a more
official explanation for these, so that we can get rid of the WA,
replacing it with something more robust.

> >>It doesn't say how many retries we are supposed to do and how it internally
> >>operates, which makes me unsure if our first more relaxed polling is perhaps
> >>causing or contributing to the issue.
> >>
> >>One thing where we don't follow the spec is the timeout for the
> >>GEN6_PCODE_READY poll which spec says should be 150us and not 500ms. I don't
> >>know if this timeout was trigger in the bug reports?
> >
> >No this PCODE_READY poll always succeeds, it's the reply/reply_mask
> >response which doesn't get set in time.
> 
> Yes I know, I was just thinking if it takes more than 2us it then falls back
> to scheduling & usleep_range. That was at the time I was thinking it is
> really important to poll rapidly. Since you explained above it is just the
> opposite I agree this part is not a problem. It still may make sense to wait
> for that bit for a shorter period as per bspec.

It goes back to SNB, so it'd need to be checked against platforms since
then and other PCODE requests. We'd bail out from the poll in case of
the first timeout, so in that sense it's not a problem, but I agree it'd
make sense to document it better.

> [snip]
> 
> >>But regardless, the fact that the fallback busy loop needs up to 34ms as
> >>well makes the last bit from the above a bit uncertain. Only if the
> >>non-compliant polling we do somehow confuses the hardware and then we end up
> >>having to busy poll longer than we normally would. Probably unlikely.
> >
> >I'm trying to get more info based on all this (in particular the KBL
> >problem) from Art. Until that I'd suggest increasing the WA timeout to
> >50ms, since that solved the problem for the bug reporter. We could fix
> >things/add more scaffolding if more evidence comes up, or there is a new
> >bug report.
> 
> Yes sure I think I replied before that it is fine by me to push a 50ms fix
> for stable.

Ok, will send it.

--Imre

Patch

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 821c57cab406..70033bdd183e 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -51,9 +51,24 @@ 
  * drm_can_sleep() can be removed and in_atomic()/!in_atomic() asserts
  * added.
  */
-#define _wait_for(COND, US, W) ({ \
+#define _wait_for(COND, US, W, DEADLINE) ({ \
 	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
-	int ret__;							\
+	int sched__, ret__;						\
+									\
+	if ((DEADLINE) && (W)) {					\
+		struct task_struct *t = current;			\
+									\
+		if (t->policy != 0 || task_nice(t) != 0) {		\
+			sched__ = -EINVAL;				\
+		} else {						\
+			struct sched_param param =			\
+				{ .sched_priority = MAX_RT_PRIO - 1 };	\
+			sched__ = sched_setscheduler_nocheck(t,		\
+							     SCHED_FIFO,\
+							     &param);	\
+		}							\
+	}								\
+									\
 	for (;;) {							\
 		bool expired__ = time_after(jiffies, timeout__);	\
 		if (COND) {						\
@@ -70,10 +85,20 @@ 
 			cpu_relax();					\
 		}							\
 	}								\
+									\
+	if ((DEADLINE) && (W) && sched__ == 0) {			\
+		struct task_struct *t = current;			\
+		struct sched_param param =				\
+				{ .sched_priority = 0 };		\
+									\
+		WARN_ON(sched_setscheduler_nocheck(t, SCHED_NORMAL,	\
+						   &param));		\
+	}								\
+									\
 	ret__;								\
 })
 
-#define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
+#define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000, 0)
 
 /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
 #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
@@ -123,7 +148,7 @@ 
 	int ret__; \
 	BUILD_BUG_ON(!__builtin_constant_p(US)); \
 	if ((US) > 10) \
-		ret__ = _wait_for((COND), (US), 10); \
+		ret__ = _wait_for((COND), (US), 10, 0); \
 	else \
 		ret__ = _wait_for_atomic((COND), (US), 0); \
 	ret__; \
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 169c4908ad5b..215b1a9fd214 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -7939,7 +7939,7 @@  int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
 		ret = 0;
 		goto out;
 	}
-	ret = _wait_for(COND, timeout_base_ms * 1000, 10);
+	ret = _wait_for(COND, timeout_base_ms * 1000, 10, 1);
 	if (!ret)
 		goto out;
 
@@ -7957,6 +7957,8 @@  int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
 	preempt_disable();
 	ret = wait_for_atomic(COND, 10);
 	preempt_enable();
+	if (ret == 0)
+		DRM_DEBUG_KMS("PCODE success after timeout\n");
 
 out:
 	return ret ? ret : status;