diff mbox

[07/15] drm/i915: Defer default hardware context initialisation until first open

Message ID 1434393394-21002-8-git-send-email-david.s.gordon@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dave Gordon June 15, 2015, 6:36 p.m. UTC
In order to fully initialise the default contexts, we have to execute
batchbuffer commands on the GPU engines. But in the case of GuC-based
batch submission, we can't do that until any required firmware has
been loaded, which may not be possible during driver load, because the
filesystem(s) containing the firmware may not be mounted until later.

Therefore, we now allow the first call to the firmware-loading code to
return -EAGAIN to indicate that it's not yet ready, and that it should
be retried when the device is first opened from user code, by which
time we expect that all required filesystems will have been mounted.
The late-retry code will then re-attempt to load the firmware if the
early attempt failed.

If the late retry fails, the current open-in-progress will fail, but
the recovery code will disable GuC submission and reset the GPU and
driver. The next open will therefore be in non-GuC mode, and will be
allowed to complete even if the GuC cannot be loaded or used.

Issue: VIZ-4884
Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Alex Dai <yu.dai@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 ++
 drivers/gpu/drm/i915/i915_gem.c         |    9 +++++-
 drivers/gpu/drm/i915/i915_gem_context.c |   52 ++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_irq.c         |   48 ++++++++++++++++++++++++++++
 4 files changed, 105 insertions(+), 6 deletions(-)

Comments

Chris Wilson June 16, 2015, 9:35 a.m. UTC | #1
On Mon, Jun 15, 2015 at 07:36:25PM +0100, Dave Gordon wrote:
> +static int i915_gem_context_first_open(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	int ret;
> +
> +	/*
> +	 * We can't enable contexts until all firmware is loaded. This
> +	 * call shouldn't return -EAGAIN because we pass wait=true, but
> +	 * it can still fail with code -EIO if the GuC doesn't respond,
> +	 * or -ENOEXEC if the GuC firmware image is invalid.
> +	 */
> +	ret = intel_guc_ucode_load(dev, true);
> +	WARN_ON(ret == -EAGAIN);
> +
> +	/*
> +	 * If an error occurred and GuC submission has been requested, we can
> +	 * attempt recovery by disabling GuC submission and reinitialising
> +	 * the GPU and driver. We then fail this open() anyway, but the next
> +	 * attempt will find that GuC submission is already disabled, and so
> +	 * proceed to complete context initialisation in non-GuC mode instead.
> +	 */
> +	if (ret && i915.enable_guc_submission) {
> +		i915_handle_guc_error(dev, ret);
> +		return ret;
> +	}

This is still backwards. What we wanted was for the submission process
to start up normally and then once the GuC loading succeeds, we then
start submitting the backlog to the GuC. If the loading fails, we can
then submit the backlog via execlists. It may be interesting to even
start userspace before GuC finishes loading.

So this makes more sense as to why you have the tight integration with
execlists then. I still don't think that justifies changing gen8 without
reason.
-Chris
Daniel Vetter June 17, 2015, 12:18 p.m. UTC | #2
On Mon, Jun 15, 2015 at 07:36:25PM +0100, Dave Gordon wrote:
> In order to fully initialise the default contexts, we have to execute
> batchbuffer commands on the GPU engines. But in the case of GuC-based
> batch submission, we can't do that until any required firmware has
> been loaded, which may not be possible during driver load, because the
> filesystem(s) containing the firmware may not be mounted until later.
> 
> Therefore, we now allow the first call to the firmware-loading code to
> return -EAGAIN to indicate that it's not yet ready, and that it should
> be retried when the device is first opened from user code, by which
> time we expect that all required filesystems will have been mounted.
> The late-retry code will then re-attempt to load the firmware if the
> early attempt failed.
> 
> If the late retry fails, the current open-in-progress will fail, but
> the recovery code will disable GuC submission and reset the GPU and
> driver. The next open will therefore be in non-GuC mode, and will be
> allowed to complete even if the GuC cannot be loaded or used.
> 
> Issue: VIZ-4884
> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Alex Dai <yu.dai@intel.com>

I'm not really sold on this super-flexible fallback scheme implemented
here. Because such fallback schemes means more code to test (which no on
will do likely) or just even bigger fireworks when we actually hit them in
reality when something goes wrong. Imo if anything goes wrong in the setup
we just throw in the towel and fail the driver loading.

There's only one exception: If something fails with GT init we declare the
gpu wedged but proceed with all the modeset setup. This makes sense
because we need all the code to handle a wedge gpu anyway, dead-on-boot
gpus happen occasionally and it's really not nice to greet the user with a
black screen. But more fallbacks are imo just headache.

Hence when the guc fails we imo really shouldn't bother with fallbacks,
but instead just declare the thing wedged and carry on.

That should also allow us to simplify the firmware loading: We can do that
in an async worker and if the blob isn't there in time then we just move
on.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_drv.h         |    2 ++
>  drivers/gpu/drm/i915/i915_gem.c         |    9 +++++-
>  drivers/gpu/drm/i915/i915_gem_context.c |   52 ++++++++++++++++++++++++++++---
>  drivers/gpu/drm/i915/i915_irq.c         |   48 ++++++++++++++++++++++++++++
>  4 files changed, 105 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index f47cde7..a1fc278 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1837,6 +1837,7 @@ struct drm_i915_private {
>  	/* hda/i915 audio component */
>  	bool audio_component_registered;
>  
> +	bool contexts_ready;
>  	uint32_t hw_context_size;
>  	struct list_head context_list;
>  
> @@ -2614,6 +2615,7 @@ void i915_queue_hangcheck(struct drm_device *dev);
>  __printf(3, 4)
>  void i915_handle_error(struct drm_device *dev, bool wedged,
>  		       const char *fmt, ...);
> +void i915_handle_guc_error(struct drm_device *dev, int err);
>  
>  extern void intel_irq_init(struct drm_i915_private *dev_priv);
>  extern void intel_hpd_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index cd4a865..d1a8862 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -5025,8 +5025,15 @@ i915_gem_init_hw(struct drm_device *dev)
>  
>  	/* We can't enable contexts until all firmware is loaded */
>  	ret = intel_guc_ucode_load(dev, false);
> +	if (ret == -EAGAIN) {
> +		ret = 0;
> +		goto out;		/* too early */
> +	}
> +
>  	ret = i915_gem_context_enable(dev_priv);
> -	if (ret && ret != -EIO) {
> +	if (ret == 0) {
> +		dev_priv->contexts_ready = true;
> +	} else if (ret && ret != -EIO) {
>  		DRM_ERROR("Context enable failed %d\n", ret);
>  		i915_gem_cleanup_ringbuffer(dev);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 133afcf..debbfc9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -447,23 +447,65 @@ static int context_idr_cleanup(int id, void *p, void *data)
>  	return 0;
>  }
>  
> +/* Complete any late initialisation here */
> +static int i915_gem_context_first_open(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	int ret;
> +
> +	/*
> +	 * We can't enable contexts until all firmware is loaded. This
> +	 * call shouldn't return -EAGAIN because we pass wait=true, but
> +	 * it can still fail with code -EIO if the GuC doesn't respond,
> +	 * or -ENOEXEC if the GuC firmware image is invalid.
> +	 */
> +	ret = intel_guc_ucode_load(dev, true);
> +	WARN_ON(ret == -EAGAIN);
> +
> +	/*
> +	 * If an error occurred and GuC submission has been requested, we can
> +	 * attempt recovery by disabling GuC submission and reinitialising
> +	 * the GPU and driver. We then fail this open() anyway, but the next
> +	 * attempt will find that GuC submission is already disabled, and so
> +	 * proceed to complete context initialisation in non-GuC mode instead.
> +	 */
> +	if (ret && i915.enable_guc_submission) {
> +		i915_handle_guc_error(dev, ret);
> +		return ret;
> +	}
> +
> +	ret = i915_gem_context_enable(dev_priv);
> +	if (ret == 0)
> +		dev_priv->contexts_ready = true;
> +	return ret;
> +}
> +
>  int i915_gem_context_open(struct drm_device *dev, struct drm_file *file)
>  {
> +	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	struct intel_context *ctx;
> +	int ret = 0;
>  
>  	idr_init(&file_priv->context_idr);
>  
>  	mutex_lock(&dev->struct_mutex);
> -	ctx = i915_gem_create_context(dev, file_priv);
> +
> +	if (!dev_priv->contexts_ready)
> +		ret = i915_gem_context_first_open(dev);
> +
> +	if (ret == 0) {
> +		ctx = i915_gem_create_context(dev, file_priv);
> +		if (IS_ERR(ctx))
> +			ret = PTR_ERR(ctx);
> +	}
> +
>  	mutex_unlock(&dev->struct_mutex);
>  
> -	if (IS_ERR(ctx)) {
> +	if (ret)
>  		idr_destroy(&file_priv->context_idr);
> -		return PTR_ERR(ctx);
> -	}
>  
> -	return 0;
> +	return ret;
>  }
>  
>  void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 56db9e74..f7dcf8d 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2665,6 +2665,54 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
>  	i915_reset_and_wakeup(dev);
>  }
>  
> +/**
> + * i915_handle_error - handle a GuC error
> + * @dev: drm device
> + *
> + * If the GuC can't be (re-)initialised, disable GuC submission and
> + * then reset and reinitialise the rest of the GPU, so that we can
> + * fall back to operating in ELSP mode. Don't bother capturing error
> + * state, because it probably isn't relevant here.
> + *
> + * Unlike i915_handle_error() above, this is called with the global
> + * struct_mutex held, so we need to release it after setting the
> + * reset-in-progress bit so that other threads can make progress,
> + * and reacquire it after the reset is complete.
> + */
> +void i915_handle_guc_error(struct drm_device *dev, int err)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +
> +	DRM_ERROR("GuC failure %d, disabling GuC submission\n", err);
> +	i915.enable_guc_submission = false;
> +
> +	i915_report_and_clear_eir(dev);	/* unlikely? */
> +
> +	atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
> +			&dev_priv->gpu_error.reset_counter);
> +
> +	mutex_unlock(&dev->struct_mutex);
> +
> +	/*
> +	 * Wakeup waiting processes so that the reset function
> +	 * i915_reset_and_wakeup doesn't deadlock trying to grab
> +	 * various locks. By bumping the reset counter first, the woken
> +	 * processes will see a reset in progress and back off,
> +	 * releasing their locks and then wait for the reset completion.
> +	 * We must do this for _all_ gpu waiters that might hold locks
> +	 * that the reset work needs to acquire.
> +	 *
> +	 * Note: The wake_up serves as the required memory barrier to
> +	 * ensure that the waiters see the updated value of the reset
> +	 * counter atomic_t.
> +	 */
> +	i915_error_wake_up(dev_priv, false);
> +
> +	i915_reset_and_wakeup(dev);
> +
> +	mutex_lock(&dev->struct_mutex);
> +}
> +
>  /* Called from drm generic code, passed 'crtc' which
>   * we use as a pipe index
>   */
> -- 
> 1.7.9.5
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Dave Gordon June 19, 2015, 9:19 a.m. UTC | #3
On 17/06/15 13:18, Daniel Vetter wrote:
> On Mon, Jun 15, 2015 at 07:36:25PM +0100, Dave Gordon wrote:
>> In order to fully initialise the default contexts, we have to execute
>> batchbuffer commands on the GPU engines. But in the case of GuC-based
>> batch submission, we can't do that until any required firmware has
>> been loaded, which may not be possible during driver load, because the
>> filesystem(s) containing the firmware may not be mounted until later.
>>
>> Therefore, we now allow the first call to the firmware-loading code to
>> return -EAGAIN to indicate that it's not yet ready, and that it should
>> be retried when the device is first opened from user code, by which
>> time we expect that all required filesystems will have been mounted.
>> The late-retry code will then re-attempt to load the firmware if the
>> early attempt failed.
>>
>> If the late retry fails, the current open-in-progress will fail, but
>> the recovery code will disable GuC submission and reset the GPU and
>> driver. The next open will therefore be in non-GuC mode, and will be
>> allowed to complete even if the GuC cannot be loaded or used.
>>
>> Issue: VIZ-4884
>> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
>> Signed-off-by: Alex Dai <yu.dai@intel.com>
> 
> I'm not really sold on this super-flexible fallback scheme implemented
> here. Because such fallback schemes means more code to test (which no on
> will do likely) or just even bigger fireworks when we actually hit them in
> reality when something goes wrong. Imo if anything goes wrong in the setup
> we just throw in the towel and fail the driver loading.

Firstly, GuC submission is an OPTION. That means we already have code to
work with or without a GuC. The fallback just allows us to keep going
after finding that although GuC submission has been requested, and we do
have a GuC, nonetheless the request cannot be satisfied. That's no
different from automatically disabling PPGTT or execlist mode if they're
requested on platforms where we don't support them.

> There's only one exception: If something fails with GT init we declare the
> gpu wedged but proceed with all the modeset setup. This makes sense
> because we need all the code to handle a wedge gpu anyway, dead-on-boot
> gpus happen occasionally and it's really not nice to greet the user with a
> black screen. But more fallbacks are imo just headache.
> 
> Hence when the guc fails we imo really shouldn't bother with fallbacks,
> but instead just declare the thing wedged and carry on.

So the strategy here is exactly the same as for GT init; declare the GPU
wedged, but after disabling GuC mode. The recovery will then get us into
the same state as if there were no GuC, or GuC mode had not been
selected in the first place. We can't switch between GuC and execlists
arbitrarily; the only switchover is from GuC to non-GuC, and it can only
happen ONCE.

To test this is easy; just rename your firmware blob so the driver can't
find it and reboot. It should automatically run in execlist mode, with a
log message telling you what went wrong (f/w file not found). Much nicer
than your screen staying blank because you upgraded the driver and not
the firmware, or vice versa.

> That should also allow us to simplify the firmware loading: We can do that
> in an async worker and if the blob isn't there in time then we just move
> on.
> -Daniel

Under no circumstances can you ever load the firmware from an async
worker thread, because Bad Things Will Happen if there is hardware
activity already in progress when the GuC f/w starts up.

.Dave.
Dave Gordon June 19, 2015, 9:42 a.m. UTC | #4
On 16/06/15 10:35, Chris Wilson wrote:
> On Mon, Jun 15, 2015 at 07:36:25PM +0100, Dave Gordon wrote:
>> +static int i915_gem_context_first_open(struct drm_device *dev)
>> +{
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	int ret;
>> +
>> +	/*
>> +	 * We can't enable contexts until all firmware is loaded. This
>> +	 * call shouldn't return -EAGAIN because we pass wait=true, but
>> +	 * it can still fail with code -EIO if the GuC doesn't respond,
>> +	 * or -ENOEXEC if the GuC firmware image is invalid.
>> +	 */
>> +	ret = intel_guc_ucode_load(dev, true);
>> +	WARN_ON(ret == -EAGAIN);
>> +
>> +	/*
>> +	 * If an error occurred and GuC submission has been requested, we can
>> +	 * attempt recovery by disabling GuC submission and reinitialising
>> +	 * the GPU and driver. We then fail this open() anyway, but the next
>> +	 * attempt will find that GuC submission is already disabled, and so
>> +	 * proceed to complete context initialisation in non-GuC mode instead.
>> +	 */
>> +	if (ret && i915.enable_guc_submission) {
>> +		i915_handle_guc_error(dev, ret);
>> +		return ret;
>> +	}
> 
> This is still backwards. What we wanted was for the submission process
> to start up normally and then once the GuC loading succeeds, we then
> start submitting the backlog to the GuC. If the loading fails, we can
> then submit the backlog via execlists. It may be interesting to even
> start userspace before GuC finishes loading.

Absolutely. The latter is what this allows :)

(And its a requirement for Android, as on those platforms the f/w won't
become available until userspace is running).

But we're not going to keep stuff queued up in the rings. It would add
more complexity to manage the backlog and remember that we can accept
calls to add_request but not actually submit them. Also there would be
issue with any code that sent commands to the engines to program
registers via LRIs and then EITHER assumed that they had taken effect OR
waited for completion (because that would block indefinitely).

Instead, we've split hw_init() into early (MMIO) and late (context and
batch) phases, and deferred all of the latter iff we need GuC f/w to be
loaded before batch submission. When the late phase runs, it can submit
batches, and wait for completion if required, without blocking the
entire system as it would if called from driver_load().

It might even make sense as an overall strategy to defer MORE of the
driver initialisation process, so that the critical single-threaded
driver load during system startup does as little as possible.

> So this makes more sense as to why you have the tight integration with
> execlists then. I still don't think that justifies changing gen8 without
> reason.
> -Chris

It's tightly integrated because GuC submission *IS* execlist submission,
only with the GuC doing the actual poking of the ELSP and fielding the
resulting context-switch interrupts.

Deferred initialisation doesn't apply to Gen8, or anything else that
doesn't have and use GuC submission. The delayed-context-initialisation
codepath is only reachable if there is GuC firmware to be loaded.

.Dave.
Daniel Vetter June 24, 2015, 10:15 a.m. UTC | #5
On Fri, Jun 19, 2015 at 10:19:04AM +0100, Dave Gordon wrote:
> On 17/06/15 13:18, Daniel Vetter wrote:
> > On Mon, Jun 15, 2015 at 07:36:25PM +0100, Dave Gordon wrote:
> >> In order to fully initialise the default contexts, we have to execute
> >> batchbuffer commands on the GPU engines. But in the case of GuC-based
> >> batch submission, we can't do that until any required firmware has
> >> been loaded, which may not be possible during driver load, because the
> >> filesystem(s) containing the firmware may not be mounted until later.
> >>
> >> Therefore, we now allow the first call to the firmware-loading code to
> >> return -EAGAIN to indicate that it's not yet ready, and that it should
> >> be retried when the device is first opened from user code, by which
> >> time we expect that all required filesystems will have been mounted.
> >> The late-retry code will then re-attempt to load the firmware if the
> >> early attempt failed.
> >>
> >> If the late retry fails, the current open-in-progress will fail, but
> >> the recovery code will disable GuC submission and reset the GPU and
> >> driver. The next open will therefore be in non-GuC mode, and will be
> >> allowed to complete even if the GuC cannot be loaded or used.
> >>
> >> Issue: VIZ-4884
> >> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
> >> Signed-off-by: Alex Dai <yu.dai@intel.com>
> > 
> > I'm not really sold on this super-flexible fallback scheme implemented
> > here. Because such fallback schemes means more code to test (which no on
> > will do likely) or just even bigger fireworks when we actually hit them in
> > reality when something goes wrong. Imo if anything goes wrong in the setup
> > we just throw in the towel and fail the driver loading.
> 
> Firstly, GuC submission is an OPTION. That means we already have code to
> work with or without a GuC. The fallback just allows us to keep going
> after finding that although GuC submission has been requested, and we do
> have a GuC, nonetheless the request cannot be satisfied. That's no
> different from automatically disabling PPGTT or execlist mode if they're
> requested on platforms where we don't support them.

It is since we do the automatic ppgtt/execlist/whatever disabling decision
once at driver load and then stick to it. Well you can change it sometimes
at runtime it might work, but it's not something we test or recommend - it
autotaints the kernel even when you just touch these options.

> > There's only one exception: If something fails with GT init we declare the
> > gpu wedged but proceed with all the modeset setup. This makes sense
> > because we need all the code to handle a wedge gpu anyway, dead-on-boot
> > gpus happen occasionally and it's really not nice to greet the user with a
> > black screen. But more fallbacks are imo just headache.
> > 
> > Hence when the guc fails we imo really shouldn't bother with fallbacks,
> > but instead just declare the thing wedged and carry on.
> 
> So the strategy here is exactly the same as for GT init; declare the GPU
> wedged, but after disabling GuC mode. The recovery will then get us into
> the same state as if there were no GuC, or GuC mode had not been
> selected in the first place. We can't switch between GuC and execlists
> arbitrarily; the only switchover is from GuC to non-GuC, and it can only
> happen ONCE.

The existing wedged logic is a terminal state (except when developers
reset it through debugfs). There's no automatic recover/fallback ever if
we can't get the gpu up&running in the mode we want it to run in.

> To test this is easy; just rename your firmware blob so the driver can't
> find it and reboot. It should automatically run in execlist mode, with a
> log message telling you what went wrong (f/w file not found). Much nicer
> than your screen staying blank because you upgraded the driver and not
> the firmware, or vice versa.

The screen will not stay blank since we'll still enable the modeset driver
of i915, and at least basic userspace drivers know how to fall back to sw
rendering. The entire point of declaring the gpu wedged if init fails is
to increase the chances that we can get a bug report.

> > That should also allow us to simplify the firmware loading: We can do that
> > in an async worker and if the blob isn't there in time then we just move
> > on.
> > -Daniel
> 
> Under no circumstances can you ever load the firmware from an async
> worker thread, because Bad Things Will Happen if there is hardware
> activity already in progress when the GuC f/w starts up.

Whether you load the firmware through an async work item in a kernel
thread or from a userspace process (in open) doesn't materially change
things at all - it's concurrent and you need to cope with it. And
dev->struct_mutex is a big lock (way too big and one of the most serious
if not the worst piece of technical debt we carry around), but it does not
protect against concurrent access to the hardware for everything.

The upside of doing the init in an explicit async worker is that it's
explicit, looks scary and you don't have any illusions about it ;-)
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f47cde7..a1fc278 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1837,6 +1837,7 @@  struct drm_i915_private {
 	/* hda/i915 audio component */
 	bool audio_component_registered;
 
+	bool contexts_ready;
 	uint32_t hw_context_size;
 	struct list_head context_list;
 
@@ -2614,6 +2615,7 @@  void i915_queue_hangcheck(struct drm_device *dev);
 __printf(3, 4)
 void i915_handle_error(struct drm_device *dev, bool wedged,
 		       const char *fmt, ...);
+void i915_handle_guc_error(struct drm_device *dev, int err);
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 extern void intel_hpd_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index cd4a865..d1a8862 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5025,8 +5025,15 @@  i915_gem_init_hw(struct drm_device *dev)
 
 	/* We can't enable contexts until all firmware is loaded */
 	ret = intel_guc_ucode_load(dev, false);
+	if (ret == -EAGAIN) {
+		ret = 0;
+		goto out;		/* too early */
+	}
+
 	ret = i915_gem_context_enable(dev_priv);
-	if (ret && ret != -EIO) {
+	if (ret == 0) {
+		dev_priv->contexts_ready = true;
+	} else if (ret && ret != -EIO) {
 		DRM_ERROR("Context enable failed %d\n", ret);
 		i915_gem_cleanup_ringbuffer(dev);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 133afcf..debbfc9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -447,23 +447,65 @@  static int context_idr_cleanup(int id, void *p, void *data)
 	return 0;
 }
 
+/* Complete any late initialisation here */
+static int i915_gem_context_first_open(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	int ret;
+
+	/*
+	 * We can't enable contexts until all firmware is loaded. This
+	 * call shouldn't return -EAGAIN because we pass wait=true, but
+	 * it can still fail with code -EIO if the GuC doesn't respond,
+	 * or -ENOEXEC if the GuC firmware image is invalid.
+	 */
+	ret = intel_guc_ucode_load(dev, true);
+	WARN_ON(ret == -EAGAIN);
+
+	/*
+	 * If an error occurred and GuC submission has been requested, we can
+	 * attempt recovery by disabling GuC submission and reinitialising
+	 * the GPU and driver. We then fail this open() anyway, but the next
+	 * attempt will find that GuC submission is already disabled, and so
+	 * proceed to complete context initialisation in non-GuC mode instead.
+	 */
+	if (ret && i915.enable_guc_submission) {
+		i915_handle_guc_error(dev, ret);
+		return ret;
+	}
+
+	ret = i915_gem_context_enable(dev_priv);
+	if (ret == 0)
+		dev_priv->contexts_ready = true;
+	return ret;
+}
+
 int i915_gem_context_open(struct drm_device *dev, struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct intel_context *ctx;
+	int ret = 0;
 
 	idr_init(&file_priv->context_idr);
 
 	mutex_lock(&dev->struct_mutex);
-	ctx = i915_gem_create_context(dev, file_priv);
+
+	if (!dev_priv->contexts_ready)
+		ret = i915_gem_context_first_open(dev);
+
+	if (ret == 0) {
+		ctx = i915_gem_create_context(dev, file_priv);
+		if (IS_ERR(ctx))
+			ret = PTR_ERR(ctx);
+	}
+
 	mutex_unlock(&dev->struct_mutex);
 
-	if (IS_ERR(ctx)) {
+	if (ret)
 		idr_destroy(&file_priv->context_idr);
-		return PTR_ERR(ctx);
-	}
 
-	return 0;
+	return ret;
 }
 
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 56db9e74..f7dcf8d 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2665,6 +2665,54 @@  void i915_handle_error(struct drm_device *dev, bool wedged,
 	i915_reset_and_wakeup(dev);
 }
 
+/**
+ * i915_handle_error - handle a GuC error
+ * @dev: drm device
+ *
+ * If the GuC can't be (re-)initialised, disable GuC submission and
+ * then reset and reinitialise the rest of the GPU, so that we can
+ * fall back to operating in ELSP mode. Don't bother capturing error
+ * state, because it probably isn't relevant here.
+ *
+ * Unlike i915_handle_error() above, this is called with the global
+ * struct_mutex held, so we need to release it after setting the
+ * reset-in-progress bit so that other threads can make progress,
+ * and reacquire it after the reset is complete.
+ */
+void i915_handle_guc_error(struct drm_device *dev, int err)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+
+	DRM_ERROR("GuC failure %d, disabling GuC submission\n", err);
+	i915.enable_guc_submission = false;
+
+	i915_report_and_clear_eir(dev);	/* unlikely? */
+
+	atomic_set_mask(I915_RESET_IN_PROGRESS_FLAG,
+			&dev_priv->gpu_error.reset_counter);
+
+	mutex_unlock(&dev->struct_mutex);
+
+	/*
+	 * Wakeup waiting processes so that the reset function
+	 * i915_reset_and_wakeup doesn't deadlock trying to grab
+	 * various locks. By bumping the reset counter first, the woken
+	 * processes will see a reset in progress and back off,
+	 * releasing their locks and then wait for the reset completion.
+	 * We must do this for _all_ gpu waiters that might hold locks
+	 * that the reset work needs to acquire.
+	 *
+	 * Note: The wake_up serves as the required memory barrier to
+	 * ensure that the waiters see the updated value of the reset
+	 * counter atomic_t.
+	 */
+	i915_error_wake_up(dev_priv, false);
+
+	i915_reset_and_wakeup(dev);
+
+	mutex_lock(&dev->struct_mutex);
+}
+
 /* Called from drm generic code, passed 'crtc' which
  * we use as a pipe index
  */