
[2/2] drm/i915/guc: Allow for very slow GuC loading

Message ID 20230217234715.3609670-3-John.C.Harrison@Intel.com (mailing list archive)
State: New, archived
Series: Improvements to GuC load failure handling

Commit Message

John Harrison Feb. 17, 2023, 11:47 p.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

A failure to load the GuC is occasionally observed where the GuC log
actually showed that the GuC had loaded just fine. The implication
being that the load just took ever so slightly longer than the 200ms
timeout. Given that the actual time should be tens of milliseconds at
the slowest, this should never happen. So far the issue has generally
been caused by a bad IFWI resulting in low frequencies during boot
(despite the KMD requesting max frequency). However, the issue seems
to happen more often than one would like.

So a) increase the timeout so that the user still gets a working
system even in the case of slow load. And b) report the frequency
during the load to see if that is the cause of the slow down.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 37 +++++++++++++++++++++--
 1 file changed, 34 insertions(+), 3 deletions(-)

Comments

Daniele Ceraolo Spurio March 3, 2023, 7:20 p.m. UTC | #1
On 2/17/2023 3:47 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> A failure to load the GuC is occasionally observed where the GuC log
> actually showed that the GuC had loaded just fine. The implication
> being that the load just took ever so slightly longer than the 200ms
> timeout. Given that the actual time should be tens of milliseconds at
> the slowest, this should never happen. So far the issue has generally
> been caused by a bad IFWI resulting in low frequencies during boot
> (despite the KMD requesting max frequency). However, the issue seems
> to happen more often than one would like.
>
> So a) increase the timeout so that the user still gets a working
> system even in the case of slow load. And b) report the frequency
> during the load to see if that is the cause of the slow down.

Some refs would be good here. From a quick search, these seem to match:

https://gitlab.freedesktop.org/drm/intel/-/issues/7931
https://gitlab.freedesktop.org/drm/intel/-/issues/8060
https://gitlab.freedesktop.org/drm/intel/-/issues/8083
https://gitlab.freedesktop.org/drm/intel/-/issues/8136
https://gitlab.freedesktop.org/drm/intel/-/issues/8137

>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 37 +++++++++++++++++++++--
>   1 file changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
> index 2f5942606913d..72e003f50617d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
> @@ -12,6 +12,7 @@
>   #include "gt/intel_gt.h"
>   #include "gt/intel_gt_mcr.h"
>   #include "gt/intel_gt_regs.h"
> +#include "gt/intel_rps.h"
>   #include "intel_guc_fw.h"
>   #include "intel_guc_print.h"
>   #include "i915_drv.h"
> @@ -139,9 +140,12 @@ static int guc_wait_ucode(struct intel_guc *guc)
>   {
>   	struct intel_gt *gt = guc_to_gt(guc);
>   	struct intel_uncore *uncore = gt->uncore;
> +	ktime_t before, after, delta;
>   	bool success;
>   	u32 status;
> -	int ret;
> +	int ret, count;
> +	u64 delta_ms;
> +	u32 before_freq;
>   
>   	/*
>   	 * Wait for the GuC to start up.
> @@ -159,13 +163,32 @@ static int guc_wait_ucode(struct intel_guc *guc)
>   	 * issues to be resolved. In the meantime bump the timeout to
>   	 * 200ms. Even at slowest clock, this should be sufficient. And
>   	 * in the working case, a larger timeout makes no difference.
> +	 *
> +	 * IFWI updates have also been seen to cause sporadic failures due to
> +	 * the requested frequency not being granted and thus the firmware
> +	 * load is attempted at minimum frequency. That can lead to load times
> +	 * in the seconds range. However, there is a limit on how long an
> +	 * individual wait_for() can wait. So wrap it in a loop.
>   	 */
> -	ret = wait_for(guc_load_done(uncore, &status, &success), 200);
> +	before_freq = intel_rps_read_actual_frequency(&uncore->gt->rps);
> +	before = ktime_get();
> +	for (count = 0; count < 20; count++) {
> +		ret = wait_for(guc_load_done(uncore, &status, &success), 1000);

Isn't 20 secs a bit too long for an in-place wait? I get that if the GuC 
doesn't load (or fails to) within a few secs the HW is likely toast, but 
still that seems a bit too long to me. What's the worst case load time 
ever observed? I suggest reducing the wait to 3 secs as a compromise, if 
that's bigger than the worst case.

The patch LGTM apart from this point.

Daniele

> +		if (!ret || !success)
> +			break;
> +
> +		guc_dbg(guc, "load still in progress, count = %d, freq = %dMHz\n",
> +			count, intel_rps_read_actual_frequency(&uncore->gt->rps));
> +	}
> +	after = ktime_get();
> +	delta = ktime_sub(after, before);
> +	delta_ms = ktime_to_ms(delta);
>   	if (ret || !success) {
>   		u32 ukernel = REG_FIELD_GET(GS_UKERNEL_MASK, status);
>   		u32 bootrom = REG_FIELD_GET(GS_BOOTROM_MASK, status);
>   
> -		guc_info(guc, "load failed: status = 0x%08X, ret = %d\n", status, ret);
> +		guc_info(guc, "load failed: status = 0x%08X, time = %lldms, freq = %dMHz, ret = %d\n",
> +			 status, delta_ms, intel_rps_read_actual_frequency(&uncore->gt->rps), ret);
>   		guc_info(guc, "load failed: status: Reset = %d, BootROM = 0x%02X, UKernel = 0x%02X, MIA = 0x%02X, Auth = 0x%02X\n",
>   			 REG_FIELD_GET(GS_MIA_IN_RESET, status),
>   			 bootrom, ukernel,
> @@ -206,6 +229,14 @@ static int guc_wait_ucode(struct intel_guc *guc)
>   		/* Uncommon/unexpected error, see earlier status code print for details */
>   		if (ret == 0)
>   			ret = -ENXIO;
> +	} else if (delta_ms > 200) {
> +		guc_warn(guc, "excessive init time: %lldms! [freq = %dMHz, before = %dMHz, status = 0x%08X, count = %d, ret = %d]\n",
> +			 delta_ms, intel_rps_read_actual_frequency(&uncore->gt->rps),
> +			 before_freq, status, count, ret);
> +	} else {
> +		guc_dbg(guc, "init took %lldms, freq = %dMHz, before = %dMHz, status = 0x%08X, count = %d, ret = %d\n",
> +			delta_ms, intel_rps_read_actual_frequency(&uncore->gt->rps),
> +			before_freq, status, count, ret);
>   	}
>   
>   	return ret;
John Harrison March 11, 2023, 1:01 a.m. UTC | #2
On 3/3/2023 11:20, Ceraolo Spurio, Daniele wrote:
> On 2/17/2023 3:47 PM, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> A failure to load the GuC is occasionally observed where the GuC log
>> actually showed that the GuC had loaded just fine. The implication
>> being that the load just took ever so slightly longer than the 200ms
>> timeout. Given that the actual time should be tens of milliseconds at
>> the slowest, this should never happen. So far the issue has generally
>> been caused by a bad IFWI resulting in low frequencies during boot
>> (despite the KMD requesting max frequency). However, the issue seems
>> to happen more often than one would like.
>>
>> So a) increase the timeout so that the user still gets a working
>> system even in the case of slow load. And b) report the frequency
>> during the load to see if that is the cause of the slow down.
>
> Some refs would be good here. From a quick search, these seem to match:
>
> https://gitlab.freedesktop.org/drm/intel/-/issues/7931
> https://gitlab.freedesktop.org/drm/intel/-/issues/8060
> https://gitlab.freedesktop.org/drm/intel/-/issues/8083
> https://gitlab.freedesktop.org/drm/intel/-/issues/8136
> https://gitlab.freedesktop.org/drm/intel/-/issues/8137
Should these have a prefix tag? If so, what? 'Closes:' is not entirely 
accurate. This patch is just to help with debugging the underlying issue. 
And if the timeout is reduced then it won't necessarily allow a slow 
system to work. See below.


>
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 37 +++++++++++++++++++++--
>>   1 file changed, 34 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
>> index 2f5942606913d..72e003f50617d 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
>> @@ -12,6 +12,7 @@
>>   #include "gt/intel_gt.h"
>>   #include "gt/intel_gt_mcr.h"
>>   #include "gt/intel_gt_regs.h"
>> +#include "gt/intel_rps.h"
>>   #include "intel_guc_fw.h"
>>   #include "intel_guc_print.h"
>>   #include "i915_drv.h"
>> @@ -139,9 +140,12 @@ static int guc_wait_ucode(struct intel_guc *guc)
>>   {
>>       struct intel_gt *gt = guc_to_gt(guc);
>>       struct intel_uncore *uncore = gt->uncore;
>> +    ktime_t before, after, delta;
>>       bool success;
>>       u32 status;
>> -    int ret;
>> +    int ret, count;
>> +    u64 delta_ms;
>> +    u32 before_freq;
>>         /*
>>        * Wait for the GuC to start up.
>> @@ -159,13 +163,32 @@ static int guc_wait_ucode(struct intel_guc *guc)
>>        * issues to be resolved. In the meantime bump the timeout to
>>        * 200ms. Even at slowest clock, this should be sufficient. And
>>        * in the working case, a larger timeout makes no difference.
>> +     *
>> +     * IFWI updates have also been seen to cause sporadic failures 
>> due to
>> +     * the requested frequency not being granted and thus the firmware
>> +     * load is attempted at minimum frequency. That can lead to load 
>> times
>> +     * in the seconds range. However, there is a limit on how long an
>> +     * individual wait_for() can wait. So wrap it in a loop.
>>        */
>> -    ret = wait_for(guc_load_done(uncore, &status, &success), 200);
>> +    before_freq = intel_rps_read_actual_frequency(&uncore->gt->rps);
>> +    before = ktime_get();
>> +    for (count = 0; count < 20; count++) {
>> +        ret = wait_for(guc_load_done(uncore, &status, &success), 1000);
>
> Isn't 20 secs a bit too long for an in-place wait? I get that if the 
> GuC doesn't load (or fail to) within a few secs the HW is likely 
> toast, but still that seems a bit too long to me. What's the worst 
> case load time ever observed? I suggest reducing the wait to 3 secs as 
> a compromise, if that's bigger than the worst case.
I can drop it to 3 for normal builds and keep 20 for 
CONFIG_DRM_I915_DEBUG_GEM builds. However, that won't actually be long 
enough for all slow situations. We have seen times of at least 11s when 
the GPU is running at minimum frequency. So, for CI runs we definitely 
want to keep the 20s limit. For end users? Is it better to wait for up 
to 20s or to boot in display only fallback mode? And note that this is a 
timeout only. A functional system will still complete in tens of 
milliseconds.
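
As a sketch of that compromise (the GUC_LOAD_RETRY_LIMIT name below is
made up purely for illustration and is not part of this patch), the
retry count could be made build-dependent along these lines:

#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
/* CI/debug builds: tolerate very slow loads, up to ~20s total */
#define GUC_LOAD_RETRY_LIMIT	20
#else
/* production builds: give up after ~3s total */
#define GUC_LOAD_RETRY_LIMIT	3
#endif

	for (count = 0; count < GUC_LOAD_RETRY_LIMIT; count++) {
		/* each wait_for() pass blocks for at most 1000ms */
		ret = wait_for(guc_load_done(uncore, &status, &success), 1000);
		if (!ret || !success)
			break;
	}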

John.

>
> The patch LGTM apart from this point.
>
> Daniele
>
>> +        if (!ret || !success)
>> +            break;
>> +
>> +        guc_dbg(guc, "load still in progress, count = %d, freq = 
>> %dMHz\n",
>> +            count, intel_rps_read_actual_frequency(&uncore->gt->rps));
>> +    }
>> +    after = ktime_get();
>> +    delta = ktime_sub(after, before);
>> +    delta_ms = ktime_to_ms(delta);
>>       if (ret || !success) {
>>           u32 ukernel = REG_FIELD_GET(GS_UKERNEL_MASK, status);
>>           u32 bootrom = REG_FIELD_GET(GS_BOOTROM_MASK, status);
>>   -        guc_info(guc, "load failed: status = 0x%08X, ret = %d\n", 
>> status, ret);
>> +        guc_info(guc, "load failed: status = 0x%08X, time = %lldms, 
>> freq = %dMHz, ret = %d\n",
>> +             status, delta_ms, 
>> intel_rps_read_actual_frequency(&uncore->gt->rps), ret);
>>           guc_info(guc, "load failed: status: Reset = %d, BootROM = 
>> 0x%02X, UKernel = 0x%02X, MIA = 0x%02X, Auth = 0x%02X\n",
>>                REG_FIELD_GET(GS_MIA_IN_RESET, status),
>>                bootrom, ukernel,
>> @@ -206,6 +229,14 @@ static int guc_wait_ucode(struct intel_guc *guc)
>>           /* Uncommon/unexpected error, see earlier status code print 
>> for details */
>>           if (ret == 0)
>>               ret = -ENXIO;
>> +    } else if (delta_ms > 200) {
>> +        guc_warn(guc, "excessive init time: %lldms! [freq = %dMHz, 
>> before = %dMHz, status = 0x%08X, count = %d, ret = %d]\n",
>> +             delta_ms, 
>> intel_rps_read_actual_frequency(&uncore->gt->rps),
>> +             before_freq, status, count, ret);
>> +    } else {
>> +        guc_dbg(guc, "init took %lldms, freq = %dMHz, before = 
>> %dMHz, status = 0x%08X, count = %d, ret = %d\n",
>> +            delta_ms, 
>> intel_rps_read_actual_frequency(&uncore->gt->rps),
>> +            before_freq, status, count, ret);
>>       }
>>         return ret;
>
Dixit, Ashutosh March 13, 2023, 11:55 p.m. UTC | #3
On Fri, 10 Mar 2023 17:01:42 -0800, John Harrison wrote:
>
> >> +	for (count = 0; count < 20; count++) {
> >> +		ret = wait_for(guc_load_done(uncore, &status, &success), 1000);
> >
> > Isn't 20 secs a bit too long for an in-place wait? I get that if the GuC
> > doesn't load (or fail to) within a few secs the HW is likely toast, but
> > still that seems a bit too long to me. What's the worst case load time
> > ever observed? I suggest reducing the wait to 3 secs as a compromise, if
> > that's bigger than the worst case.
>
> I can drop it to 3 for normal builds and keep 20 for
> CONFIG_DRM_I915_DEBUG_GEM builds. However, that won't actually be long
> enough for all slow situations. We have seen times of at least 11s when the
> GPU is running at minimum frequency. So, for CI runs we definitely want to
> keep the 20s limit. For end users? Is it better to wait for up to 20s or to
> boot in display only fallback mode? And note that this is a timeout only. A
> functional system will still complete in tens of milliseconds.

Just FYI, in this related patch:

https://patchwork.freedesktop.org/series/115003/#rev2

I am holding a mutex across the GuC FW load, so, while it is very unlikely,
in the worst case a thread can get blocked for the duration of the GuC
reset/FW load.
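
As a rough illustration of that pattern only (the lock name below is
hypothetical and the series linked above may structure its locking
differently), the concern is a caller along these lines:

	/* sketch only: names are illustrative, not taken from that series */
	mutex_lock(&guc->fw_load_lock);		/* hypothetical lock */
	ret = guc_wait_ucode(guc);		/* the retry loop above means this
						 * can now block for several seconds */
	mutex_unlock(&guc->fw_load_lock);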

Ashutosh
Daniele Ceraolo Spurio March 15, 2023, 9:30 p.m. UTC | #4
On 3/10/2023 5:01 PM, John Harrison wrote:
> On 3/3/2023 11:20, Ceraolo Spurio, Daniele wrote:
>> On 2/17/2023 3:47 PM, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> A failure to load the GuC is occasionally observed where the GuC log
>>> actually showed that the GuC had loaded just fine. The implication
>>> being that the load just took ever so slightly longer than the 200ms
>>> timeout. Given that the actual time should be tens of milliseconds at
>>> the slowest, this should never happen. So far the issue has generally
>>> been caused by a bad IFWI resulting in low frequencies during boot
>>> (despite the KMD requesting max frequency). However, the issue seems
>>> to happen more often than one would like.
>>>
>>> So a) increase the timeout so that the user still gets a working
>>> system even in the case of slow load. And b) report the frequency
>>> during the load to see if that is the cause of the slow down.
>>
>> Some refs would be good here. From a quick search, these seem to match:
>>
>> https://gitlab.freedesktop.org/drm/intel/-/issues/7931
>> https://gitlab.freedesktop.org/drm/intel/-/issues/8060
>> https://gitlab.freedesktop.org/drm/intel/-/issues/8083
>> https://gitlab.freedesktop.org/drm/intel/-/issues/8136
>> https://gitlab.freedesktop.org/drm/intel/-/issues/8137
> Should these have a prefix tag? If so, what? 'closes' is not entirely 
> accurate. This patch is just to help with debug of the underlying 
> issue. And if the timeout is reduced then it won't necessarily allow a 
> slow system to work. See below.
>

For bugs we usually use either "References:" or "Closes:". I think the 
former fits better for this case.
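
For example, the commit message could gain trailers along these lines
(re-using the issue links quoted above):

References: https://gitlab.freedesktop.org/drm/intel/-/issues/7931
References: https://gitlab.freedesktop.org/drm/intel/-/issues/8060
References: https://gitlab.freedesktop.org/drm/intel/-/issues/8083
References: https://gitlab.freedesktop.org/drm/intel/-/issues/8136
References: https://gitlab.freedesktop.org/drm/intel/-/issues/8137
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>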

>
>>
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 37 
>>> +++++++++++++++++++++--
>>>   1 file changed, 34 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
>>> index 2f5942606913d..72e003f50617d 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
>>> @@ -12,6 +12,7 @@
>>>   #include "gt/intel_gt.h"
>>>   #include "gt/intel_gt_mcr.h"
>>>   #include "gt/intel_gt_regs.h"
>>> +#include "gt/intel_rps.h"
>>>   #include "intel_guc_fw.h"
>>>   #include "intel_guc_print.h"
>>>   #include "i915_drv.h"
>>> @@ -139,9 +140,12 @@ static int guc_wait_ucode(struct intel_guc *guc)
>>>   {
>>>       struct intel_gt *gt = guc_to_gt(guc);
>>>       struct intel_uncore *uncore = gt->uncore;
>>> +    ktime_t before, after, delta;
>>>       bool success;
>>>       u32 status;
>>> -    int ret;
>>> +    int ret, count;
>>> +    u64 delta_ms;
>>> +    u32 before_freq;
>>>         /*
>>>        * Wait for the GuC to start up.
>>> @@ -159,13 +163,32 @@ static int guc_wait_ucode(struct intel_guc *guc)
>>>        * issues to be resolved. In the meantime bump the timeout to
>>>        * 200ms. Even at slowest clock, this should be sufficient. And
>>>        * in the working case, a larger timeout makes no difference.
>>> +     *
>>> +     * IFWI updates have also been seen to cause sporadic failures 
>>> due to
>>> +     * the requested frequency not being granted and thus the firmware
>>> +     * load is attempted at minimum frequency. That can lead to 
>>> load times
>>> +     * in the seconds range. However, there is a limit on how long an
>>> +     * individual wait_for() can wait. So wrap it in a loop.
>>>        */
>>> -    ret = wait_for(guc_load_done(uncore, &status, &success), 200);
>>> +    before_freq = intel_rps_read_actual_frequency(&uncore->gt->rps);
>>> +    before = ktime_get();
>>> +    for (count = 0; count < 20; count++) {
>>> +        ret = wait_for(guc_load_done(uncore, &status, &success), 
>>> 1000);
>>
>> Isn't 20 secs a bit too long for an in-place wait? I get that if the 
>> GuC doesn't load (or fail to) within a few secs the HW is likely 
>> toast, but still that seems a bit too long to me. What's the worst 
>> case load time ever observed? I suggest reducing the wait to 3 secs 
>> as a compromise, if that's bigger than the worst case.
> I can drop it to 3 for normal builds and keep 20 for 
> CONFIG_DRM_I915_DEBUG_GEM builds. However, that won't actually be long 
> enough for all slow situations. We have seen times of at least 11s 
> when the GPU is running at minimum frequency. So, for CI runs we 
> definitely want to keep the 20s limit. For end users? Is it better to 
> wait for up to 20s or to boot in display only fallback mode? And note 
> that this is a timeout only. A functional system will still complete 
> in tens of milliseconds.

I'd argue that the only way for us to go to multiple seconds is if the 
frequency goes below the efficient levels, which normally should only 
happen if the card cooling fails and the temperature goes out of 
control, in which case there are more pressing issues than a GuC load 
failure. I agree that having 3 secs normal and 20 for CI is a good 
compromise and if we ever see a very long timeout in CI we can 
investigate more.

Daniele

>
> John.
>
>>
>> The patch LGTM apart from this point.
>>
>> Daniele
>>
>>> +        if (!ret || !success)
>>> +            break;
>>> +
>>> +        guc_dbg(guc, "load still in progress, count = %d, freq = 
>>> %dMHz\n",
>>> +            count, intel_rps_read_actual_frequency(&uncore->gt->rps));
>>> +    }
>>> +    after = ktime_get();
>>> +    delta = ktime_sub(after, before);
>>> +    delta_ms = ktime_to_ms(delta);
>>>       if (ret || !success) {
>>>           u32 ukernel = REG_FIELD_GET(GS_UKERNEL_MASK, status);
>>>           u32 bootrom = REG_FIELD_GET(GS_BOOTROM_MASK, status);
>>>   -        guc_info(guc, "load failed: status = 0x%08X, ret = %d\n", 
>>> status, ret);
>>> +        guc_info(guc, "load failed: status = 0x%08X, time = %lldms, 
>>> freq = %dMHz, ret = %d\n",
>>> +             status, delta_ms, 
>>> intel_rps_read_actual_frequency(&uncore->gt->rps), ret);
>>>           guc_info(guc, "load failed: status: Reset = %d, BootROM = 
>>> 0x%02X, UKernel = 0x%02X, MIA = 0x%02X, Auth = 0x%02X\n",
>>>                REG_FIELD_GET(GS_MIA_IN_RESET, status),
>>>                bootrom, ukernel,
>>> @@ -206,6 +229,14 @@ static int guc_wait_ucode(struct intel_guc *guc)
>>>           /* Uncommon/unexpected error, see earlier status code 
>>> print for details */
>>>           if (ret == 0)
>>>               ret = -ENXIO;
>>> +    } else if (delta_ms > 200) {
>>> +        guc_warn(guc, "excessive init time: %lldms! [freq = %dMHz, 
>>> before = %dMHz, status = 0x%08X, count = %d, ret = %d]\n",
>>> +             delta_ms, 
>>> intel_rps_read_actual_frequency(&uncore->gt->rps),
>>> +             before_freq, status, count, ret);
>>> +    } else {
>>> +        guc_dbg(guc, "init took %lldms, freq = %dMHz, before = 
>>> %dMHz, status = 0x%08X, count = %d, ret = %d\n",
>>> +            delta_ms, 
>>> intel_rps_read_actual_frequency(&uncore->gt->rps),
>>> +            before_freq, status, count, ret);
>>>       }
>>>         return ret;
>>
>

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
index 2f5942606913d..72e003f50617d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
@@ -12,6 +12,7 @@ 
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_mcr.h"
 #include "gt/intel_gt_regs.h"
+#include "gt/intel_rps.h"
 #include "intel_guc_fw.h"
 #include "intel_guc_print.h"
 #include "i915_drv.h"
@@ -139,9 +140,12 @@ static int guc_wait_ucode(struct intel_guc *guc)
 {
 	struct intel_gt *gt = guc_to_gt(guc);
 	struct intel_uncore *uncore = gt->uncore;
+	ktime_t before, after, delta;
 	bool success;
 	u32 status;
-	int ret;
+	int ret, count;
+	u64 delta_ms;
+	u32 before_freq;
 
 	/*
 	 * Wait for the GuC to start up.
@@ -159,13 +163,32 @@ static int guc_wait_ucode(struct intel_guc *guc)
 	 * issues to be resolved. In the meantime bump the timeout to
 	 * 200ms. Even at slowest clock, this should be sufficient. And
 	 * in the working case, a larger timeout makes no difference.
+	 *
+	 * IFWI updates have also been seen to cause sporadic failures due to
+	 * the requested frequency not being granted and thus the firmware
+	 * load is attempted at minimum frequency. That can lead to load times
+	 * in the seconds range. However, there is a limit on how long an
+	 * individual wait_for() can wait. So wrap it in a loop.
 	 */
-	ret = wait_for(guc_load_done(uncore, &status, &success), 200);
+	before_freq = intel_rps_read_actual_frequency(&uncore->gt->rps);
+	before = ktime_get();
+	for (count = 0; count < 20; count++) {
+		ret = wait_for(guc_load_done(uncore, &status, &success), 1000);
+		if (!ret || !success)
+			break;
+
+		guc_dbg(guc, "load still in progress, count = %d, freq = %dMHz\n",
+			count, intel_rps_read_actual_frequency(&uncore->gt->rps));
+	}
+	after = ktime_get();
+	delta = ktime_sub(after, before);
+	delta_ms = ktime_to_ms(delta);
 	if (ret || !success) {
 		u32 ukernel = REG_FIELD_GET(GS_UKERNEL_MASK, status);
 		u32 bootrom = REG_FIELD_GET(GS_BOOTROM_MASK, status);
 
-		guc_info(guc, "load failed: status = 0x%08X, ret = %d\n", status, ret);
+		guc_info(guc, "load failed: status = 0x%08X, time = %lldms, freq = %dMHz, ret = %d\n",
+			 status, delta_ms, intel_rps_read_actual_frequency(&uncore->gt->rps), ret);
 		guc_info(guc, "load failed: status: Reset = %d, BootROM = 0x%02X, UKernel = 0x%02X, MIA = 0x%02X, Auth = 0x%02X\n",
 			 REG_FIELD_GET(GS_MIA_IN_RESET, status),
 			 bootrom, ukernel,
@@ -206,6 +229,14 @@ static int guc_wait_ucode(struct intel_guc *guc)
 		/* Uncommon/unexpected error, see earlier status code print for details */
 		if (ret == 0)
 			ret = -ENXIO;
+	} else if (delta_ms > 200) {
+		guc_warn(guc, "excessive init time: %lldms! [freq = %dMHz, before = %dMHz, status = 0x%08X, count = %d, ret = %d]\n",
+			 delta_ms, intel_rps_read_actual_frequency(&uncore->gt->rps),
+			 before_freq, status, count, ret);
+	} else {
+		guc_dbg(guc, "init took %lldms, freq = %dMHz, before = %dMHz, status = 0x%08X, count = %d, ret = %d\n",
+			delta_ms, intel_rps_read_actual_frequency(&uncore->gt->rps),
+			before_freq, status, count, ret);
 	}
 
 	return ret;