
drm/i915/guc/slpc: Use non-blocking H2G for waitboost

Message ID 20220505054010.21879-1-vinay.belgaumkar@intel.com (mailing list archive)
State New, archived
Series drm/i915/guc/slpc: Use non-blocking H2G for waitboost

Commit Message

Vinay Belgaumkar May 5, 2022, 5:40 a.m. UTC
SLPC min/max frequency updates require H2G calls. We are seeing
timeouts when the GuC channel is backed up and it is unable to respond
in a timely fashion, causing warnings and affecting CI.

This is seen when waitboosting happens during a stress test.
This patch updates the waitboost path to use a non-blocking
H2G call instead, which returns as soon as the message is
successfully transmitted.

Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 ++++++++++++++++-----
 1 file changed, 30 insertions(+), 8 deletions(-)

Comments

Tvrtko Ursulin May 5, 2022, 12:13 p.m. UTC | #1
On 05/05/2022 06:40, Vinay Belgaumkar wrote:
> SLPC min/max frequency updates require H2G calls. We are seeing
> timeouts when the GuC channel is backed up and it is unable to respond
> in a timely fashion, causing warnings and affecting CI.

Is it the "Unable to force min freq" error? Do you have a link to the 
GitLab issue to add to commit message?

> This is seen when waitboosting happens during a stress test.
> This patch updates the waitboost path to use a non-blocking
> H2G call instead, which returns as soon as the message is
> successfully transmitted.

AFAIU with this approach, when CT channel is congested, you instead 
achieve silent dropping of the waitboost request, right?

It sounds like a potentially important feedback from the field to lose 
so easily. How about you added drm_notice to the worker when it fails?

Or simply a "one line patch" to replace i915_probe_error (!?) with 
drm_notice and keep the blocking behavior. (I have no idea what is the 
typical time to drain the CT buffer, and so to decide whether waiting or 
dropping makes more sense for effectiveness of waitboosting.)

Or since the congestion /should not/ happen in production, then the 
argument is why complicate with more code, in which case going with one 
line patch is an easy way forward?

Regards,

Tvrtko

> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 ++++++++++++++++-----
>   1 file changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
> index 1db833da42df..c852f73cf521 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc *slpc)
>   	return data->header.global_state;
>   }
>   
> +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 id, u32 value)
> +{
> +	u32 request[] = {
> +		GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
> +		SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
> +		id,
> +		value,
> +	};
> +	int ret;
> +
> +	ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
> +
> +	return ret > 0 ? -EPROTO : ret;
> +}
> +
> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, u32 value)
> +{
> +	struct intel_guc *guc = slpc_to_guc(slpc);
> +
> +	GEM_BUG_ON(id >= SLPC_MAX_PARAM);
> +
> +	return guc_action_slpc_set_param_nb(guc, id, value);
> +}
> +
>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, u32 value)
>   {
>   	u32 request[] = {
> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct intel_guc_slpc *slpc, u32 freq)
>   	 */
>   
>   	with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
> -		ret = slpc_set_param(slpc,
> -				     SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
> -				     freq);
> -		if (ret)
> -			i915_probe_error(i915, "Unable to force min freq to %u: %d",
> -					 freq, ret);
> +		/* Non-blocking request will avoid stalls */
> +		ret = slpc_set_param_nb(slpc,
> +					SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
> +					freq);
>   	}
>   
>   	return ret;
> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct *work)
>   	 */
>   	mutex_lock(&slpc->lock);
>   	if (atomic_read(&slpc->num_waiters)) {
> -		slpc_force_min_freq(slpc, slpc->boost_freq);
> -		slpc->num_boosts++;
> +		if (!slpc_force_min_freq(slpc, slpc->boost_freq))
> +			slpc->num_boosts++;
>   	}
>   	mutex_unlock(&slpc->lock);
>   }
Vinay Belgaumkar May 5, 2022, 5:21 p.m. UTC | #2
On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>
> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>> SLPC min/max frequency updates require H2G calls. We are seeing
>> timeouts when the GuC channel is backed up and it is unable to respond
>> in a timely fashion, causing warnings and affecting CI.
>
> Is it the "Unable to force min freq" error? Do you have a link to the 
> GitLab issue to add to commit message?
We don't have a specific error for this one, but have seen similar 
issues with other H2G which are blocking.
>
>> This is seen when waitboosting happens during a stress test.
>> This patch updates the waitboost path to use a non-blocking
>> H2G call instead, which returns as soon as the message is
>> successfully transmitted.
>
> AFAIU with this approach, when CT channel is congested, you instead 
> achieve silent dropping of the waitboost request, right?
We are hoping it makes it, but just not waiting for it to complete.
>
> It sounds like a potentially important feedback from the field to lose 
> so easily. How about you added drm_notice to the worker when it fails?
>
> Or simply a "one line patch" to replace i915_probe_error (!?) with 
> drm_notice and keep the blocking behavior. (I have no idea what is the 
> typical time to drain the CT buffer, and so to decide whether waiting 
> or dropping makes more sense for effectiveness of waitboosting.)
>
> Or since the congestion /should not/ happen in production, then the 
> argument is why complicate with more code, in which case going with 
> one line patch is an easy way forward?

Even if we soften the blow here, the actual timeout error occurs in the 
intel_guc_ct.c code, so we cannot hide that error anyways. Making this 
call non-blocking will achieve both things.

Thanks,

Vinay.

>
> Regards,
>
> Tvrtko
>
>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 ++++++++++++++++-----
>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>> index 1db833da42df..c852f73cf521 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc 
>> *slpc)
>>       return data->header.global_state;
>>   }
>>   +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 
>> id, u32 value)
>> +{
>> +    u32 request[] = {
>> +        GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>> +        SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>> +        id,
>> +        value,
>> +    };
>> +    int ret;
>> +
>> +    ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>> +
>> +    return ret > 0 ? -EPROTO : ret;
>> +}
>> +
>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, u32 
>> value)
>> +{
>> +    struct intel_guc *guc = slpc_to_guc(slpc);
>> +
>> +    GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>> +
>> +    return guc_action_slpc_set_param_nb(guc, id, value);
>> +}
>> +
>>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, 
>> u32 value)
>>   {
>>       u32 request[] = {
>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct 
>> intel_guc_slpc *slpc, u32 freq)
>>        */
>>         with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>> -        ret = slpc_set_param(slpc,
>> -                     SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>> -                     freq);
>> -        if (ret)
>> -            i915_probe_error(i915, "Unable to force min freq to %u: 
>> %d",
>> -                     freq, ret);
>> +        /* Non-blocking request will avoid stalls */
>> +        ret = slpc_set_param_nb(slpc,
>> +                    SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>> +                    freq);
>>       }
>>         return ret;
>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct 
>> *work)
>>        */
>>       mutex_lock(&slpc->lock);
>>       if (atomic_read(&slpc->num_waiters)) {
>> -        slpc_force_min_freq(slpc, slpc->boost_freq);
>> -        slpc->num_boosts++;
>> +        if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>> +            slpc->num_boosts++;
>>       }
>>       mutex_unlock(&slpc->lock);
>>   }
John Harrison May 5, 2022, 6:36 p.m. UTC | #3
On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>> timeouts when the GuC channel is backed up and it is unable to respond
>>> in a timely fashion, causing warnings and affecting CI.
>>
>> Is it the "Unable to force min freq" error? Do you have a link to the 
>> GitLab issue to add to commit message?
> We don't have a specific error for this one, but have seen similar 
> issues with other H2G which are blocking.
>>
>>> This is seen when waitboosting happens during a stress test.
>>> This patch updates the waitboost path to use a non-blocking
>>> H2G call instead, which returns as soon as the message is
>>> successfully transmitted.
>>
>> AFAIU with this approach, when CT channel is congested, you instead 
>> achieve silent dropping of the waitboost request, right?
> We are hoping it makes it, but just not waiting for it to complete.
We are not 'hoping it makes it'. We know for a fact that it will make 
it. We just don't know when. The issue is not about whether the 
waitboost request itself gets dropped/lost it is about the ack that 
comes back. The GuC will process the message and it will send an ack. 
It's just a question of whether the i915 driver has given up waiting for 
it yet. And if it has, then you get the initial 'timed out waiting for 
ack' followed by a later 'got unexpected ack' message.

Whereas, if we make the call asynchronous, there is no ack. i915 doesn't 
bother waiting and it won't get surprised later.

Also, note that this is only an issue when GuC itself is backed up. 
Normally that requires the creation/destruction of large numbers of 
contexts in rapid succession (context management is about the slowest 
thing we do with GuC). Some of the IGTs and selftests do that with 
thousands of contexts all at once. Those are generally where we see this 
kind of problem. It would be highly unlikely (but not impossible) to hit 
it in real world usage.

The general design philosophy of H2G messages is that asynchronous mode 
should be used for everything if at all possible. It is fire and forget 
and will all get processed in the order sent (same as batch buffer 
execution, really). Synchronous messages should only be used when an 
ack/status is absolutely required. E.g. start of day initialisation or 
things like TLB invalidation where we need to know that a cache has been 
cleared/flushed before updating memory from the CPU.
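
For illustration, the difference for a caller is just which send helper is 
used. A minimal sketch reusing this patch's request layout (the blocking 
helper's signature here is an assumption based on intel_guc.h, not 
something shown in the diff):

	u32 request[] = {
		GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
		SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
		id,
		value,
	};

	/* Synchronous: queues the message, then waits for the G2H ack. */
	ret = intel_guc_send(guc, request, ARRAY_SIZE(request));

	/* Asynchronous: returns once the message is queued in the CT buffer. */
	ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);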

John.


>>
>> It sounds like a potentially important feedback from the field to 
>> lose so easily. How about you added drm_notice to the worker when it 
>> fails?
>>
>> Or simply a "one line patch" to replace i915_probe_error (!?) with 
>> drm_notice and keep the blocking behavior. (I have no idea what is 
>> the typical time to drain the CT buffer, and so to decide whether 
>> waiting or dropping makes more sense for effectiveness of waitboosting.)
>>
>> Or since the congestion /should not/ happen in production, then the 
>> argument is why complicate with more code, in which case going with 
>> one line patch is an easy way forward?
>
> Even if we soften the blow here, the actual timeout error occurs in 
> the intel_guc_ct.c code, so we cannot hide that error anyways. Making 
> this call non-blocking will achieve both things.
>
> Thanks,
>
> Vinay.
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 
>>> ++++++++++++++++-----
>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>> index 1db833da42df..c852f73cf521 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc 
>>> *slpc)
>>>       return data->header.global_state;
>>>   }
>>>   +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 
>>> id, u32 value)
>>> +{
>>> +    u32 request[] = {
>>> +        GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>> +        SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>> +        id,
>>> +        value,
>>> +    };
>>> +    int ret;
>>> +
>>> +    ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>> +
>>> +    return ret > 0 ? -EPROTO : ret;
>>> +}
>>> +
>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, 
>>> u32 value)
>>> +{
>>> +    struct intel_guc *guc = slpc_to_guc(slpc);
>>> +
>>> +    GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>> +
>>> +    return guc_action_slpc_set_param_nb(guc, id, value);
>>> +}
>>> +
>>>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, 
>>> u32 value)
>>>   {
>>>       u32 request[] = {
>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct 
>>> intel_guc_slpc *slpc, u32 freq)
>>>        */
>>>         with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>> -        ret = slpc_set_param(slpc,
>>> - SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>> -                     freq);
>>> -        if (ret)
>>> -            i915_probe_error(i915, "Unable to force min freq to %u: 
>>> %d",
>>> -                     freq, ret);
>>> +        /* Non-blocking request will avoid stalls */
>>> +        ret = slpc_set_param_nb(slpc,
>>> + SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>> +                    freq);
>>>       }
>>>         return ret;
>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct 
>>> *work)
>>>        */
>>>       mutex_lock(&slpc->lock);
>>>       if (atomic_read(&slpc->num_waiters)) {
>>> -        slpc_force_min_freq(slpc, slpc->boost_freq);
>>> -        slpc->num_boosts++;
>>> +        if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>> +            slpc->num_boosts++;
>>>       }
>>>       mutex_unlock(&slpc->lock);
>>>   }
Tvrtko Ursulin May 6, 2022, 7:18 a.m. UTC | #4
On 05/05/2022 19:36, John Harrison wrote:
> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>> timeouts when the GuC channel is backed up and it is unable to respond
>>>> in a timely fashion, causing warnings and affecting CI.
>>>
>>> Is it the "Unable to force min freq" error? Do you have a link to the 
>>> GitLab issue to add to commit message?
>> We don't have a specific error for this one, but have seen similar 
>> issues with other H2G which are blocking.
>>>
>>>> This is seen when waitboosting happens during a stress test.
>>>> This patch updates the waitboost path to use a non-blocking
>>>> H2G call instead, which returns as soon as the message is
>>>> successfully transmitted.
>>>
>>> AFAIU with this approach, when CT channel is congested, you instead 
>>> achieve silent dropping of the waitboost request, right?
>> We are hoping it makes it, but just not waiting for it to complete.
> We are not 'hoping it makes it'. We know for a fact that it will make 
> it. We just don't know when. The issue is not about whether the 
> waitboost request itself gets dropped/lost it is about the ack that 
> comes back. The GuC will process the message and it will send an ack. 
> It's just a question of whether the i915 driver has given up waiting for 
> it yet. And if it has, then you get the initial 'timed out waiting for 
> ack' followed by a later 'got unexpected ack' message.
> 
> Whereas, if we make the call asynchronous, there is no ack. i915 doesn't 
> bother waiting and it won't get surprised later.
> 
> Also, note that this is only an issue when GuC itself is backed up. 
> Normally that requires the creation/destruction of large numbers of 
> contexts in rapid succession (context management is about the slowest 
> thing we do with GuC). Some of the IGTs and selftests do that with 
> thousands of contexts all at once. Those are generally where we see this 
> kind of problem. It would be highly unlikely (but not impossible) to hit 
> it in real world usage.

Goto ->

> The general design philosophy of H2G messages is that asynchronous mode 
> should be used for everything if at all possible. It is fire and forget 
> and will all get processed in the order sent (same as batch buffer 
> execution, really). Synchronous messages should only be used when an 
> ack/status is absolutely required. E.g. start of day initialisation or 
> things like TLB invalidation where we need to know that a cache has been 
> cleared/flushed before updating memory from the CPU.
> 
> John.
> 
> 
>>>
>>> It sounds like a potentially important feedback from the field to 
>>> lose so easily. How about you added drm_notice to the worker when it 
>>> fails?
>>>
>>> Or simply a "one line patch" to replace i915_probe_error (!?) with 
>>> drm_notice and keep the blocking behavior. (I have no idea what is 
>>> the typical time to drain the CT buffer, and so to decide whether 
>>> waiting or dropping makes more sense for effectiveness of waitboosting.)
>>>
>>> Or since the congestion /should not/ happen in production, then the 
>>> argument is why complicate with more code, in which case going with 
>>> one line patch is an easy way forward?

Here. Where I did hint I understood the "should not happen in production 
angle".

So statement is GuC is congested in processing requests, but the h2g 
buffer is not congested so no chance intel_guc_send_nb() will fail with 
no space in that buffer? Sounds a bit un-intuitive.

Anyway, it sounds okay to me to use the non-blocking, but I would like 
to see some logging if the unexpected does happen. Hence I was 
suggesting the option of adding drm_notice logging if the send fails 
from the worker. (Because I think other callers would already propagate 
the error, like sysfs.)

   err = slpc_force_min_freq(slpc, slpc->boost_freq);
   if (!err)
        slpc->num_boosts++;
   else
        drm_notice(... "Failed to send waitboost request (%d)", err);

Something like that.

Regards,

Tvrtko


>> Even if we soften the blow here, the actual timeout error occurs in 
>> the intel_guc_ct.c code, so we cannot hide that error anyways. Making 
>> this call non-blocking will achieve both things.
>>
>> Thanks,
>>
>> Vinay.
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 
>>>> ++++++++++++++++-----
>>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>> index 1db833da42df..c852f73cf521 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc 
>>>> *slpc)
>>>>       return data->header.global_state;
>>>>   }
>>>>   +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 
>>>> id, u32 value)
>>>> +{
>>>> +    u32 request[] = {
>>>> +        GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>> +        SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>> +        id,
>>>> +        value,
>>>> +    };
>>>> +    int ret;
>>>> +
>>>> +    ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>> +
>>>> +    return ret > 0 ? -EPROTO : ret;
>>>> +}
>>>> +
>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, 
>>>> u32 value)
>>>> +{
>>>> +    struct intel_guc *guc = slpc_to_guc(slpc);
>>>> +
>>>> +    GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>> +
>>>> +    return guc_action_slpc_set_param_nb(guc, id, value);
>>>> +}
>>>> +
>>>>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, 
>>>> u32 value)
>>>>   {
>>>>       u32 request[] = {
>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct 
>>>> intel_guc_slpc *slpc, u32 freq)
>>>>        */
>>>>         with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>> -        ret = slpc_set_param(slpc,
>>>> - SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>> -                     freq);
>>>> -        if (ret)
>>>> -            i915_probe_error(i915, "Unable to force min freq to %u: 
>>>> %d",
>>>> -                     freq, ret);
>>>> +        /* Non-blocking request will avoid stalls */
>>>> +        ret = slpc_set_param_nb(slpc,
>>>> + SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>> +                    freq);
>>>>       }
>>>>         return ret;
>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct 
>>>> *work)
>>>>        */
>>>>       mutex_lock(&slpc->lock);
>>>>       if (atomic_read(&slpc->num_waiters)) {
>>>> -        slpc_force_min_freq(slpc, slpc->boost_freq);
>>>> -        slpc->num_boosts++;
>>>> +        if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>> +            slpc->num_boosts++;
>>>>       }
>>>>       mutex_unlock(&slpc->lock);
>>>>   }
>
Vinay Belgaumkar May 6, 2022, 4:21 p.m. UTC | #5
On 5/6/2022 12:18 AM, Tvrtko Ursulin wrote:
>
> On 05/05/2022 19:36, John Harrison wrote:
>> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>>> timeouts when the GuC channel is backed up and it is unable to respond
>>>>> in a timely fashion, causing warnings and affecting CI.
>>>>
>>>> Is it the "Unable to force min freq" error? Do you have a link to 
>>>> the GitLab issue to add to commit message?
>>> We don't have a specific error for this one, but have seen similar 
>>> issues with other H2G which are blocking.
>>>>
>>>>> This is seen when waitboosting happens during a stress test.
>>>>> This patch updates the waitboost path to use a non-blocking
>>>>> H2G call instead, which returns as soon as the message is
>>>>> successfully transmitted.
>>>>
>>>> AFAIU with this approach, when CT channel is congested, you instead 
>>>> achieve silent dropping of the waitboost request, right?
>>> We are hoping it makes it, but just not waiting for it to complete.
>> We are not 'hoping it makes it'. We know for a fact that it will make 
>> it. We just don't know when. The issue is not about whether the 
>> waitboost request itself gets dropped/lost it is about the ack that 
>> comes back. The GuC will process the message and it will send an ack. 
>> It's just a question of whether the i915 driver has given up waiting 
>> for it yet. And if it has, then you get the initial 'timed out 
>> waiting for ack' followed by a later 'got unexpected ack' message.
>>
>> Whereas, if we make the call asynchronous, there is no ack. i915 
>> doesn't bother waiting and it won't get surprised later.
>>
>> Also, note that this is only an issue when GuC itself is backed up. 
>> Normally that requires the creation/destruction of large numbers of 
>> contexts in rapid succession (context management is about the slowest 
>> thing we do with GuC). Some of the IGTs and selftests do that with 
>> thousands of contexts all at once. Those are generally where we see 
>> this kind of problem. It would be highly unlikely (but not 
>> impossible) to hit it in real world usage.
>
> Goto ->
>
>> The general design philosophy of H2G messages is that asynchronous 
>> mode should be used for everything if at all possible. It is fire and 
>> forget and will all get processed in the order sent (same as batch 
>> buffer execution, really). Synchronous messages should only be used 
>> when an ack/status is absolutely required. E.g. start of day 
>> initialisation or things like TLB invalidation where we need to know 
>> that a cache has been cleared/flushed before updating memory from the 
>> CPU.
>>
>> John.
>>
>>
>>>>
>>>> It sounds like a potentially important feedback from the field to 
>>>> lose so easily. How about you added drm_notice to the worker when 
>>>> it fails?
>>>>
>>>> Or simply a "one line patch" to replace i915_probe_error (!?) with 
>>>> drm_notice and keep the blocking behavior. (I have no idea what is 
>>>> the typical time to drain the CT buffer, and so to decide whether 
>>>> waiting or dropping makes more sense for effectiveness of 
>>>> waitboosting.)
>>>>
>>>> Or since the congestion /should not/ happen in production, then the 
>>>> argument is why complicate with more code, in which case going with 
>>>> one line patch is an easy way forward?
>
> Here. Where I did hint I understood the "should not happen in 
> production angle".
>
> So statement is GuC is congested in processing requests, but the h2g 
> buffer is not congested so no chance intel_guc_send_nb() will fail 
> with no space in that buffer? Sounds a bit un-intuitive.
>
> Anyway, it sounds okay to me to use the non-blocking, but I would like 
> to see some logging if the unexpected does happen. Hence I was 
> suggesting the option of adding drm_notice logging if the send fails 
> from the worker. (Because I think other callers would already 
> propagate the error, like sysfs.)
>
>   err = slpc_force_min_freq(slpc, slpc->boost_freq);
>   if (!err)
>        slpc->num_boosts++;
>   else
>        drm_notice(... "Failed to send waitboost request (%d)", err);

Ok, makes sense. Will send out another rev with this change.

Thanks,

Vinay.


>
> Something like that.
>
> Regards,
>
> Tvrtko
>
>
>>> Even if we soften the blow here, the actual timeout error occurs in 
>>> the intel_guc_ct.c code, so we cannot hide that error anyways. 
>>> Making this call non-blocking will achieve both things.
>>>
>>> Thanks,
>>>
>>> Vinay.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 
>>>>> ++++++++++++++++-----
>>>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> index 1db833da42df..c852f73cf521 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc 
>>>>> *slpc)
>>>>>       return data->header.global_state;
>>>>>   }
>>>>>   +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, 
>>>>> u8 id, u32 value)
>>>>> +{
>>>>> +    u32 request[] = {
>>>>> +        GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>>> +        SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>>> +        id,
>>>>> +        value,
>>>>> +    };
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>>> +
>>>>> +    return ret > 0 ? -EPROTO : ret;
>>>>> +}
>>>>> +
>>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, 
>>>>> u32 value)
>>>>> +{
>>>>> +    struct intel_guc *guc = slpc_to_guc(slpc);
>>>>> +
>>>>> +    GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>>> +
>>>>> +    return guc_action_slpc_set_param_nb(guc, id, value);
>>>>> +}
>>>>> +
>>>>>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 
>>>>> id, u32 value)
>>>>>   {
>>>>>       u32 request[] = {
>>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct 
>>>>> intel_guc_slpc *slpc, u32 freq)
>>>>>        */
>>>>>         with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>> -        ret = slpc_set_param(slpc,
>>>>> - SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> -                     freq);
>>>>> -        if (ret)
>>>>> -            i915_probe_error(i915, "Unable to force min freq to 
>>>>> %u: %d",
>>>>> -                     freq, ret);
>>>>> +        /* Non-blocking request will avoid stalls */
>>>>> +        ret = slpc_set_param_nb(slpc,
>>>>> + SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> +                    freq);
>>>>>       }
>>>>>         return ret;
>>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct 
>>>>> *work)
>>>>>        */
>>>>>       mutex_lock(&slpc->lock);
>>>>>       if (atomic_read(&slpc->num_waiters)) {
>>>>> -        slpc_force_min_freq(slpc, slpc->boost_freq);
>>>>> -        slpc->num_boosts++;
>>>>> +        if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>>> +            slpc->num_boosts++;
>>>>>       }
>>>>>       mutex_unlock(&slpc->lock);
>>>>>   }
>>
John Harrison May 6, 2022, 4:43 p.m. UTC | #6
On 5/6/2022 00:18, Tvrtko Ursulin wrote:
> On 05/05/2022 19:36, John Harrison wrote:
>> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>>> timeouts when the GuC channel is backed up and it is unable to respond
>>>>> in a timely fashion, causing warnings and affecting CI.
>>>>
>>>> Is it the "Unable to force min freq" error? Do you have a link to 
>>>> the GitLab issue to add to commit message?
>>> We don't have a specific error for this one, but have seen similar 
>>> issues with other H2G which are blocking.
>>>>
>>>>> This is seen when waitboosting happens during a stress test.
>>>>> This patch updates the waitboost path to use a non-blocking
>>>>> H2G call instead, which returns as soon as the message is
>>>>> successfully transmitted.
>>>>
>>>> AFAIU with this approach, when CT channel is congested, you instead 
>>>> achieve silent dropping of the waitboost request, right?
>>> We are hoping it makes it, but just not waiting for it to complete.
>> We are not 'hoping it makes it'. We know for a fact that it will make 
>> it. We just don't know when. The issue is not about whether the 
>> waitboost request itself gets dropped/lost it is about the ack that 
>> comes back. The GuC will process the message and it will send an ack. 
>> It's just a question of whether the i915 driver has given up waiting 
>> for it yet. And if it has, then you get the initial 'timed out 
>> waiting for ack' followed by a later 'got unexpected ack' message.
>>
>> Whereas, if we make the call asynchronous, there is no ack. i915 
>> doesn't bother waiting and it won't get surprised later.
>>
>> Also, note that this is only an issue when GuC itself is backed up. 
>> Normally that requires the creation/destruction of large numbers of 
>> contexts in rapid succession (context management is about the slowest 
>> thing we do with GuC). Some of the IGTs and selftests do that with 
>> thousands of contexts all at once. Those are generally where we see 
>> this kind of problem. It would be highly unlikely (but not 
>> impossible) to hit it in real world usage.
>
> Goto ->
>
>> The general design philosophy of H2G messages is that asynchronous 
>> mode should be used for everything if at all possible. It is fire and 
>> forget and will all get processed in the order sent (same as batch 
>> buffer execution, really). Synchronous messages should only be used 
>> when an ack/status is absolutely required. E.g. start of day 
>> initialisation or things like TLB invalidation where we need to know 
>> that a cache has been cleared/flushed before updating memory from the 
>> CPU.
>>
>> John.
>>
>>
>>>>
>>>> It sounds like a potentially important feedback from the field to 
>>>> lose so easily. How about you added drm_notice to the worker when 
>>>> it fails?
>>>>
>>>> Or simply a "one line patch" to replace i915_probe_error (!?) with 
>>>> drm_notice and keep the blocking behavior. (I have no idea what is 
>>>> the typical time to drain the CT buffer, and so to decide whether 
>>>> waiting or dropping makes more sense for effectiveness of 
>>>> waitboosting.)
>>>>
>>>> Or since the congestion /should not/ happen in production, then the 
>>>> argument is why complicate with more code, in which case going with 
>>>> one line patch is an easy way forward?
>
> Here. Where I did hint I understood the "should not happen in 
> production angle".
>
> So statement is GuC is congested in processing requests, but the h2g 
> buffer is not congested so no chance intel_guc_send_nb() will fail 
> with no space in that buffer? Sounds a bit un-intuitive.
That's two different things. The problem of no space in the H2G buffer 
is the same whether the call is sent blocking or non-blocking. The 
wait-for-space version is intel_guc_send_busy_loop() rather than 
intel_guc_send_nb(). NB: _busy_loop is a wrapper around _nb, so the 
wait-for-space version is also non-blocking ;). If a non-looping version 
is used (blocking or otherwise) it will return -EBUSY if there is no 
space. So both the original SLPC call and this non-blocking version will 
still get an immediate EBUSY return code if the H2G channel is backed up 
completely.

Whether the code should be handling EBUSY or not is another matter. 
Vinay, does anything higher up do a loop on EBUSY? If not, maybe it 
should be using the _busy_loop() call instead?
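
For reference, _busy_loop() is roughly the below (a simplified sketch of 
the intel_guc.h helper; the real version also handles atomic context and 
grows the sleep period between retries):

	static inline int intel_guc_send_busy_loop(struct intel_guc *guc,
						   const u32 *action, u32 len,
						   u32 g2h_len_dw, bool loop)
	{
		int err;

		/* Retry the non-blocking send while the CT buffer has no space. */
		do {
			err = intel_guc_send_nb(guc, action, len, g2h_len_dw);
			if (err != -EBUSY || !loop)
				break;
			msleep(1); /* let GuC drain some of the backlog */
		} while (1);

		return err;
	}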

The blocking vs non-blocking is about waiting for a response if the 
command is successfully sent. The blocking case will sit and spin for a 
reply, the non-blocking assumes success and expects an asynchronous 
error report on failure. The assumption being that the call can't fail 
unless something is already broken - i915 sending invalid data to GuC 
for example. And thus any failure is in the BUG_ON category rather than 
the try again with a different approach and/or try again later category.

This is the point of the change. We are currently getting timeout errors 
when the H2G channel has space so the command can be sent, but the 
channel already contains a lot of slow operations. The command has been 
sent and will be processed successfully, it just takes longer than the 
i915 timeout. Given that we don't actually care about the completion 
response for this command, there is no point in either a) sitting in a 
loop waiting for it or b) complaining that it doesn't happen in a timely 
fashion. Hence the plan to make it non-blocking.

>
> Anyway, it sounds okay to me to use the non-blocking, but I would like 
> to see some logging if the unexpected does happen. Hence I was 
> suggesting the option of adding drm_notice logging if the send fails 
> from the worker. (Because I think other callers would already 
> propagate the error, like sysfs.)
>
>   err = slpc_force_min_freq(slpc, slpc->boost_freq);
>   if (!err)
>        slpc->num_boosts++;
>   else
>        drm_notice(... "Failed to send waitboost request (%d)", err);
The only error this should ever report would be EBUSY when the H2G 
channel is full. Anything else (ENODEV, EPIPE, etc.) means the system is 
already toast and bigger errors will likely already have been reported.

As above, maybe this should be looping on the EBUSY case. Presumably it 
is safe to do so if it was already looping waiting for the response. And 
then printing a notice level warning on more catastrophic errors seems 
reasonable.
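
Combining that with the drm_notice suggestion, the worker could end up 
looking something like the below (a sketch only, assuming the 
slpc_to_i915() helper used elsewhere in intel_guc_slpc.c; the actual 
shape is for the next revision to decide):

	mutex_lock(&slpc->lock);
	if (atomic_read(&slpc->num_waiters)) {
		int err = slpc_force_min_freq(slpc, slpc->boost_freq);

		if (!err)
			slpc->num_boosts++;
		else if (err != -EBUSY) /* -EBUSY: simply retry on the next boost */
			drm_notice(&slpc_to_i915(slpc)->drm,
				   "Failed to send waitboost request (%d)\n", err);
	}
	mutex_unlock(&slpc->lock);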

John.

>
> Something like that.
>
> Regards,
>
> Tvrtko
>
>
>>> Even if we soften the blow here, the actual timeout error occurs in 
>>> the intel_guc_ct.c code, so we cannot hide that error anyways. 
>>> Making this call non-blocking will achieve both things.
>>>
>>> Thanks,
>>>
>>> Vinay.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 
>>>>> ++++++++++++++++-----
>>>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> index 1db833da42df..c852f73cf521 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc 
>>>>> *slpc)
>>>>>       return data->header.global_state;
>>>>>   }
>>>>>   +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, 
>>>>> u8 id, u32 value)
>>>>> +{
>>>>> +    u32 request[] = {
>>>>> +        GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>>> +        SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>>> +        id,
>>>>> +        value,
>>>>> +    };
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>>> +
>>>>> +    return ret > 0 ? -EPROTO : ret;
>>>>> +}
>>>>> +
>>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, 
>>>>> u32 value)
>>>>> +{
>>>>> +    struct intel_guc *guc = slpc_to_guc(slpc);
>>>>> +
>>>>> +    GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>>> +
>>>>> +    return guc_action_slpc_set_param_nb(guc, id, value);
>>>>> +}
>>>>> +
>>>>>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 
>>>>> id, u32 value)
>>>>>   {
>>>>>       u32 request[] = {
>>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct 
>>>>> intel_guc_slpc *slpc, u32 freq)
>>>>>        */
>>>>>         with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>> -        ret = slpc_set_param(slpc,
>>>>> - SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> -                     freq);
>>>>> -        if (ret)
>>>>> -            i915_probe_error(i915, "Unable to force min freq to 
>>>>> %u: %d",
>>>>> -                     freq, ret);
>>>>> +        /* Non-blocking request will avoid stalls */
>>>>> +        ret = slpc_set_param_nb(slpc,
>>>>> + SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> +                    freq);
>>>>>       }
>>>>>         return ret;
>>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct 
>>>>> *work)
>>>>>        */
>>>>>       mutex_lock(&slpc->lock);
>>>>>       if (atomic_read(&slpc->num_waiters)) {
>>>>> -        slpc_force_min_freq(slpc, slpc->boost_freq);
>>>>> -        slpc->num_boosts++;
>>>>> +        if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>>> +            slpc->num_boosts++;
>>>>>       }
>>>>>       mutex_unlock(&slpc->lock);
>>>>>   }
>>
Vinay Belgaumkar May 15, 2022, 5:46 a.m. UTC | #7
On 5/6/2022 9:43 AM, John Harrison wrote:
> On 5/6/2022 00:18, Tvrtko Ursulin wrote:
>> On 05/05/2022 19:36, John Harrison wrote:
>>> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>>>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>>>> timeouts when the GuC channel is backed up and it is unable to respond
>>>>>> in a timely fashion, causing warnings and affecting CI.
>>>>>
>>>>> Is it the "Unable to force min freq" error? Do you have a link to 
>>>>> the GitLab issue to add to commit message?
>>>> We don't have a specific error for this one, but have seen similar 
>>>> issues with other H2G which are blocking.
>>>>>
>>>>>> This is seen when waitboosting happens during a stress test.
>>>>>> This patch updates the waitboost path to use a non-blocking
>>>>>> H2G call instead, which returns as soon as the message is
>>>>>> successfully transmitted.
>>>>>
>>>>> AFAIU with this approach, when CT channel is congested, you 
>>>>> instead achieve silent dropping of the waitboost request, right?
>>>> We are hoping it makes it, but just not waiting for it to complete.
>>> We are not 'hoping it makes it'. We know for a fact that it will 
>>> make it. We just don't know when. The issue is not about whether the 
>>> waitboost request itself gets dropped/lost it is about the ack that 
>>> comes back. The GuC will process the message and it will send an 
>>> ack. It's just a question of whether the i915 driver has given up 
>>> waiting for it yet. And if it has, then you get the initial 'timed 
>>> out waiting for ack' followed by a later 'got unexpected ack' message.
>>>
>>> Whereas, if we make the call asynchronous, there is no ack. i915 
>>> doesn't bother waiting and it won't get surprised later.
>>>
>>> Also, note that this is only an issue when GuC itself is backed up. 
>>> Normally that requires the creation/destruction of large numbers of 
>>> contexts in rapid succession (context management is about the 
>>> slowest thing we do with GuC). Some of the IGTs and selftests do 
>>> that with thousands of contexts all at once. Those are generally 
>>> where we see this kind of problem. It would be highly unlikely (but 
>>> not impossible) to hit it in real world usage.
>>
>> Goto ->
>>
>>> The general design philosophy of H2G messages is that asynchronous 
>>> mode should be used for everything if at all possible. It is fire 
>>> and forget and will all get processed in the order sent (same as 
>>> batch buffer execution, really). Synchronous messages should only be 
>>> used when an ack/status is absolutely required. E.g. start of day 
>>> initialisation or things like TLB invalidation where we need to know 
>>> that a cache has been cleared/flushed before updating memory from 
>>> the CPU.
>>>
>>> John.
>>>
>>>
>>>>>
>>>>> It sounds like a potentially important feedback from the field to 
>>>>> lose so easily. How about you added drm_notice to the worker when 
>>>>> it fails?
>>>>>
>>>>> Or simply a "one line patch" to replace i915_probe_error (!?) with 
>>>>> drm_notice and keep the blocking behavior. (I have no idea what is 
>>>>> the typical time to drain the CT buffer, and so to decide whether 
>>>>> waiting or dropping makes more sense for effectiveness of 
>>>>> waitboosting.)
>>>>>
>>>>> Or since the congestion /should not/ happen in production, then 
>>>>> the argument is why complicate with more code, in which case going 
>>>>> with one line patch is an easy way forward?
>>
>> Here. Where I did hint I understood the "should not happen in 
>> production angle".
>>
>> So statement is GuC is congested in processing requests, but the h2g 
>> buffer is not congested so no chance intel_guc_send_nb() will fail 
>> with no space in that buffer? Sounds a bit un-intuitive.
> That's two different things. The problem of no space in the H2G buffer 
> is the same whether the call is sent blocking or non-blocking. The 
> wait-for-space version is intel_guc_send_busy_loop() rather than 
> intel_guc_send_nb(). NB: _busy_loop is a wrapper around _nb, so the 
> wait-for-space version is also non-blocking ;). If a non-looping 
> version is used (blocking or otherwise) it will return -EBUSY if there 
> is no space. So both the original SLPC call and this non-blocking 
> version will still get an immediate EBUSY return code if the H2G 
> channel is backed up completely.
>
> Whether the code should be handling EBUSY or not is another matter. 
> Vinay, does anything higher up do a loop on EBUSY? If not, maybe it 
> should be using the _busy_loop() call instead?
>
> The blocking vs non-blocking is about waiting for a response if the 
> command is successfully sent. The blocking case will sit and spin for 
> a reply, the non-blocking assumes success and expects an asynchronous 
> error report on failure. The assumption being that the call can't fail 
> unless something is already broken - i915 sending invalid data to GuC 
> for example. And thus any failure is in the BUG_ON category rather 
> than the try again with a different approach and/or try again later 
> category.
>
> This is the point of the change. We are currently getting timeout 
> errors when the H2G channel has space so the command can be sent, but 
> the channel already contains a lot of slow operations. The command has 
> been sent and will be processed successfully, it just takes longer 
> than the i915 timeout. Given that we don't actually care about the 
> completion response for this command, there is no point in either a) 
> sitting in a loop waiting for it or b) complaining that it doesn't 
> happen in a timely fashion. Hence the plan to make it non-blocking.
>
>>
>> Anyway, it sounds okay to me to use the non-blocking, but I would 
>> like to see some logging if the unexpected does happen. Hence I was 
>> suggesting the option of adding drm_notice logging if the send fails 
>> from the worker. (Because I think other callers would already 
>> propagate the error, like sysfs.)
>>
>>   err = slpc_force_min_freq(slpc, slpc->boost_freq);
>>   if (!err)
>>        slpc->num_boosts++;
>>   else
>>        drm_notice(... "Failed to send waitboost request (%d)", err);
> The only error this should ever report would be EBUSY when the H2G 
> channel is full. Anything else (ENODEV, EPIPE, etc.) means the system 
> is already toast and bigger errors will likely already have been 
> reported.
>
> As above, maybe this should be looping on the EBUSY case. Presumably 
> it is safe to do so if it was already looping waiting for the 
> response. And then printing a notice level warning on more 
> catastrophic errors seems reasonable.

Not sure if we need an extensive busy loop here. All we are trying to do 
(through a tasklet) is bump the min freq. If that fails due to EBUSY, we 
will just end up retrying next time around.

Thanks,

Vinay.

>
> John.
>
>>
>> Something like that.
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>>> Even if we soften the blow here, the actual timeout error occurs in 
>>>> the intel_guc_ct.c code, so we cannot hide that error anyways. 
>>>> Making this call non-blocking will achieve both things.
>>>>
>>>> Thanks,
>>>>
>>>> Vinay.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 
>>>>>> ++++++++++++++++-----
>>>>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> index 1db833da42df..c852f73cf521 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct 
>>>>>> intel_guc_slpc *slpc)
>>>>>>       return data->header.global_state;
>>>>>>   }
>>>>>>   +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, 
>>>>>> u8 id, u32 value)
>>>>>> +{
>>>>>> +    u32 request[] = {
>>>>>> +        GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>>>> +        SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>>>> +        id,
>>>>>> +        value,
>>>>>> +    };
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>>>> +
>>>>>> +    return ret > 0 ? -EPROTO : ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, 
>>>>>> u32 value)
>>>>>> +{
>>>>>> +    struct intel_guc *guc = slpc_to_guc(slpc);
>>>>>> +
>>>>>> +    GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>>>> +
>>>>>> +    return guc_action_slpc_set_param_nb(guc, id, value);
>>>>>> +}
>>>>>> +
>>>>>>   static int guc_action_slpc_set_param(struct intel_guc *guc, u8 
>>>>>> id, u32 value)
>>>>>>   {
>>>>>>       u32 request[] = {
>>>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct 
>>>>>> intel_guc_slpc *slpc, u32 freq)
>>>>>>        */
>>>>>>         with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>>> -        ret = slpc_set_param(slpc,
>>>>>> - SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>>> -                     freq);
>>>>>> -        if (ret)
>>>>>> -            i915_probe_error(i915, "Unable to force min freq to 
>>>>>> %u: %d",
>>>>>> -                     freq, ret);
>>>>>> +        /* Non-blocking request will avoid stalls */
>>>>>> +        ret = slpc_set_param_nb(slpc,
>>>>>> + SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>>> +                    freq);
>>>>>>       }
>>>>>>         return ret;
>>>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct 
>>>>>> work_struct *work)
>>>>>>        */
>>>>>>       mutex_lock(&slpc->lock);
>>>>>>       if (atomic_read(&slpc->num_waiters)) {
>>>>>> -        slpc_force_min_freq(slpc, slpc->boost_freq);
>>>>>> -        slpc->num_boosts++;
>>>>>> +        if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>>>> +            slpc->num_boosts++;
>>>>>>       }
>>>>>>       mutex_unlock(&slpc->lock);
>>>>>>   }
>>>
>

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
index 1db833da42df..c852f73cf521 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
@@ -98,6 +98,30 @@  static u32 slpc_get_state(struct intel_guc_slpc *slpc)
 	return data->header.global_state;
 }
 
+static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 id, u32 value)
+{
+	u32 request[] = {
+		GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
+		SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
+		id,
+		value,
+	};
+	int ret;
+
+	ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
+
+	return ret > 0 ? -EPROTO : ret;
+}
+
+static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, u32 value)
+{
+	struct intel_guc *guc = slpc_to_guc(slpc);
+
+	GEM_BUG_ON(id >= SLPC_MAX_PARAM);
+
+	return guc_action_slpc_set_param_nb(guc, id, value);
+}
+
 static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, u32 value)
 {
 	u32 request[] = {
@@ -208,12 +232,10 @@  static int slpc_force_min_freq(struct intel_guc_slpc *slpc, u32 freq)
 	 */
 
 	with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
-		ret = slpc_set_param(slpc,
-				     SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
-				     freq);
-		if (ret)
-			i915_probe_error(i915, "Unable to force min freq to %u: %d",
-					 freq, ret);
+		/* Non-blocking request will avoid stalls */
+		ret = slpc_set_param_nb(slpc,
+					SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
+					freq);
 	}
 
 	return ret;
@@ -231,8 +253,8 @@  static void slpc_boost_work(struct work_struct *work)
 	 */
 	mutex_lock(&slpc->lock);
 	if (atomic_read(&slpc->num_waiters)) {
-		slpc_force_min_freq(slpc, slpc->boost_freq);
-		slpc->num_boosts++;
+		if (!slpc_force_min_freq(slpc, slpc->boost_freq))
+			slpc->num_boosts++;
 	}
 	mutex_unlock(&slpc->lock);
 }