diff mbox series

[13/20] drm/i915/guc: Relax CTB response timeout

Message ID 20210603051630.2635-14-matthew.brost@intel.com (mailing list archive)
State New, archived
Series GuC CTBs changes + a few misc patches

Commit Message

Matthew Brost June 3, 2021, 5:16 a.m. UTC
From: Michal Wajdeczko <michal.wajdeczko@intel.com>

In an upcoming patch we will allow more CTB requests to be sent in
parallel to the GuC for processing, so we shouldn't assume any more
that the GuC will always reply within 10ms.

Use a bigger value from CONFIG_DRM_I915_GUC_CTB_TIMEOUT instead.

v2: Add CONFIG_DRM_I915_GUC_CTB_TIMEOUT config option

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile      | 10 ++++++++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c |  5 ++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

Comments

Daniel Vetter June 4, 2021, 8:33 a.m. UTC | #1
On Wed, Jun 02, 2021 at 10:16:23PM -0700, Matthew Brost wrote:
> From: Michal Wajdeczko <michal.wajdeczko@intel.com>
> 
> In an upcoming patch we will allow more CTB requests to be sent in
> parallel to the GuC for processing, so we shouldn't assume any more
> that the GuC will always reply within 10ms.
> 
> Use a bigger value from CONFIG_DRM_I915_GUC_CTB_TIMEOUT instead.
> 
> v2: Add CONFIG_DRM_I915_GUC_CTB_TIMEOUT config option
> 
> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>

So this is a rant, but for upstream we really need to do better than
internal:

- The driver must work by default in the optimal configuration.

- Any config change that we haven't validated _must_ taint the kernel
  (this is especially for module options, but also for config settings)

- Config options need a real reason beyond "was useful for bring-up".

Our internal tree is an absolute disaster right now, with multi-line
kernel configs (different on each platform) and bespoke kernel config or
the driver just fails. We're the expert on our own hw, we should know how
it works, not offload that to users essentially asking them "how shitty do
you think Intel hw is in responding timely".

Yes I know there's a lot of these there already, they don't make a lot of
sense either.

Except if there's a real reason for this (aside from us just offloading
testing to our users instead of doing it ourselves properly) I think we
should hardcode this, with a comment explaining why. Maybe with a switch
between the PF/VF case once that's landed.

> ---
>  drivers/gpu/drm/i915/Kconfig.profile      | 10 ++++++++++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c |  5 ++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 39328567c200..0d5475b5f28a 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -38,6 +38,16 @@ config DRM_I915_USERFAULT_AUTOSUSPEND
>  	  May be 0 to disable the extra delay and solely use the device level
>  	  runtime pm autosuspend delay tunable.
>  
> +config DRM_I915_GUC_CTB_TIMEOUT
> +	int "How long to wait for the GuC to make forward progress on CTBs (ms)"
> +	default 1500 # milliseconds
> +	range 10 60000

Also range is definitely off, drm/scheduler will probably nuke you
beforehand :-)

That's kinda another issue I have with all these kconfig knobs: Maybe we
need a knob for "relax with reset attempts, my workloads overload my gpus
routinely", which then scales _all_ timeouts proportionally. But letting
the user set them all, with silly combinations like resetting the
workload before heartbeat or stuff like that doesn't make much sense.

Anyway, tiny patch so hopefully I can leave this one out for now until
we've closed this.
-Daniel

> +	help
> +	  Configures the default timeout waiting for the GuC to make forward
> +	  progress on CTBs, e.g. waiting for a response to a request.
> +
> +	  A range of 10 ms to 60000 ms is allowed.
> +
>  config DRM_I915_HEARTBEAT_INTERVAL
>  	int "Interval between heartbeat pulses (ms)"
>  	default 2500 # milliseconds
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 916c2b80c841..cf1fb09ef766 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -436,6 +436,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   */
>  static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  {
> +	long timeout;
>  	int err;
>  
>  	/*
> @@ -443,10 +444,12 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  	 * up to that length of time, then switch to a slower sleep-wait loop.
>  	 * No GuC command should ever take longer than 10ms.
>  	 */
> +	timeout = CONFIG_DRM_I915_GUC_CTB_TIMEOUT;
> +
>  #define done INTEL_GUC_MSG_IS_RESPONSE(READ_ONCE(req->status))
>  	err = wait_for_us(done, 10);
>  	if (err)
> -		err = wait_for(done, 10);
> +		err = wait_for(done, timeout);
>  #undef done
>  
>  	if (unlikely(err))
> -- 
> 2.28.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Matthew Brost June 4, 2021, 6:35 p.m. UTC | #2
On Fri, Jun 04, 2021 at 10:33:07AM +0200, Daniel Vetter wrote:
> On Wed, Jun 02, 2021 at 10:16:23PM -0700, Matthew Brost wrote:
> > From: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > 
> > In an upcoming patch we will allow more CTB requests to be sent in
> > parallel to the GuC for processing, so we shouldn't assume any more
> > that the GuC will always reply within 10ms.
> > 
> > Use a bigger value from CONFIG_DRM_I915_GUC_CTB_TIMEOUT instead.
> > 
> > v2: Add CONFIG_DRM_I915_GUC_CTB_TIMEOUT config option
> > 
> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> 
> So this is a rant, but for upstream we really need to do better than
> internal:
> 
> - The driver must work by default in the optimal configuration.
> 
> - Any config change that we haven't validated _must_ taint the kernel
>   (this is especially for module options, but also for config settings)
> 
> - Config options need a real reason beyond "was useful for bring-up".
> 
> Our internal tree is an absolute disaster right now, with multi-line
> kernel configs (different on each platform) and bespoke kernel config or
> the driver just fails. We're the expert on our own hw, we should know how
> it works, not offload that to users essentially asking them "how shitty do
> you think Intel hw is in responding timely".
> 
> Yes I know there's a lot of these there already, they don't make a lot of
> sense either.
> 
> Except if there's a real reason for this (aside from us just offloading
> testing to our users instead of doing it ourselves properly) I think we
> should hardcode this, with a comment explaining why. Maybe with a switch
> between the PF/VF case once that's landed.
> 
> > ---
> >  drivers/gpu/drm/i915/Kconfig.profile      | 10 ++++++++++
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c |  5 ++++-
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index 39328567c200..0d5475b5f28a 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -38,6 +38,16 @@ config DRM_I915_USERFAULT_AUTOSUSPEND
> >  	  May be 0 to disable the extra delay and solely use the device level
> >  	  runtime pm autosuspend delay tunable.
> >  
> > +config DRM_I915_GUC_CTB_TIMEOUT
> > +	int "How long to wait for the GuC to make forward progress on CTBs (ms)"
> > +	default 1500 # milliseconds
> > +	range 10 60000
> 
> Also range is definitely off, drm/scheduler will probably nuke you
> beforehand :-)
> 
> That's kinda another issue I have with all these kconfig knobs: Maybe we
> need a knob for "relax with reset attempts, my workloads overload my gpus
> routinely", which then scales _all_ timeouts proportionally. But letting
> the user set them all, with silly combinations like resetting the
> workload before heartbeat or stuff like that doesn't make much sense.
>

Yes, with the code as is, the user could do some wacky stuff that doesn't
make sense at all.
 
> Anyway, tiny patch so hopefully I can leave this one out for now until
> we've closed this.

No issue leaving this out as blocking CTBs are never really used anyways
until SRIOV aside from setup / debugging. That being said, we might
still want a higher hardcoded value in the meantime, perhaps around a
second. I can follow up on that if needed.

Matt

> -Daniel
> 
> > +	help
> > +	  Configures the default timeout waiting for the GuC to make forward
> > +	  progress on CTBs, e.g. waiting for a response to a request.
> > +
> > +	  A range of 10 ms to 60000 ms is allowed.
> > +
> >  config DRM_I915_HEARTBEAT_INTERVAL
> >  	int "Interval between heartbeat pulses (ms)"
> >  	default 2500 # milliseconds
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > index 916c2b80c841..cf1fb09ef766 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > @@ -436,6 +436,7 @@ static int ct_write(struct intel_guc_ct *ct,
> >   */
> >  static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> >  {
> > +	long timeout;
> >  	int err;
> >  
> >  	/*
> > @@ -443,10 +444,12 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> >  	 * up to that length of time, then switch to a slower sleep-wait loop.
> >  	 * No GuC command should ever take longer than 10ms.
> >  	 */
> > +	timeout = CONFIG_DRM_I915_GUC_CTB_TIMEOUT;
> > +
> >  #define done INTEL_GUC_MSG_IS_RESPONSE(READ_ONCE(req->status))
> >  	err = wait_for_us(done, 10);
> >  	if (err)
> > -		err = wait_for(done, 10);
> > +		err = wait_for(done, timeout);
> >  #undef done
> >  
> >  	if (unlikely(err))
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter June 9, 2021, 1:24 p.m. UTC | #3
On Fri, Jun 04, 2021 at 11:35:39AM -0700, Matthew Brost wrote:
> On Fri, Jun 04, 2021 at 10:33:07AM +0200, Daniel Vetter wrote:
> > On Wed, Jun 02, 2021 at 10:16:23PM -0700, Matthew Brost wrote:
> > > From: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > 
> > > In an upcoming patch we will allow more CTB requests to be sent in
> > > parallel to the GuC for processing, so we shouldn't assume any more
> > > that the GuC will always reply within 10ms.
> > > 
> > > Use a bigger value from CONFIG_DRM_I915_GUC_CTB_TIMEOUT instead.
> > > 
> > > v2: Add CONFIG_DRM_I915_GUC_CTB_TIMEOUT config option
> > > 
> > > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > So this is a rant, but for upstream we really need to do better than
> > internal:
> > 
> > - The driver must work by default in the optimal configuration.
> > 
> > - Any config change that we haven't validated _must_ taint the kernel
> >   (this is especially for module options, but also for config settings)
> > 
> > - Config options need a real reason beyond "was useful for bring-up".
> > 
> > Our internal tree is an absolute disaster right now, with multi-line
> > kernel configs (different on each platform) and bespoke kernel config or
> > the driver just fails. We're the expert on our own hw, we should know how
> > it works, not offload that to users essentially asking them "how shitty do
> > you think Intel hw is in responding timely".
> > 
> > Yes I know there's a lot of these there already, they don't make a lot of
> > sense either.
> > 
> > Except if there's a real reason for this (aside from us just offloading
> > testing to our users instead of doing it ourselves properly) I think we
> > should hardcode this, with a comment explaining why. Maybe with a switch
> > between the PF/VF case once that's landed.
> > 
> > > ---
> > >  drivers/gpu/drm/i915/Kconfig.profile      | 10 ++++++++++
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c |  5 ++++-
> > >  2 files changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > > index 39328567c200..0d5475b5f28a 100644
> > > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > > @@ -38,6 +38,16 @@ config DRM_I915_USERFAULT_AUTOSUSPEND
> > >  	  May be 0 to disable the extra delay and solely use the device level
> > >  	  runtime pm autosuspend delay tunable.
> > >  
> > > +config DRM_I915_GUC_CTB_TIMEOUT
> > > +	int "How long to wait for the GuC to make forward progress on CTBs (ms)"
> > > +	default 1500 # milliseconds
> > > +	range 10 60000
> > 
> > Also range is definitely off, drm/scheduler will probably nuke you
> > beforehand :-)
> > 
> > That's kinda another issue I have with all these kconfig knobs: Maybe we
> > need a knob for "relax with reset attempts, my workloads overload my gpus
> > routinely", which then scales _all_ timeouts proportionally. But letting
> > the user set them all, with silly combinations like resetting the
> > workload before heartbeat or stuff like that doesn't make much sense.
> >
> 
> Yes, with the code as is, the user could do some wacky stuff that doesn't
> make sense at all.
>  
> > Anyway, tiny patch so hopefully I can leave this one out for now until
> > we've closed this.
> 
> No issue leaving this out as blocking CTBs are never really used anyways
> until SRIOV aside from setup / debugging. That being said, we might
> still want a higher hardcoded value in the meantime, perhaps around a
> second. I can follow up on that if needed.

Yeah, just a patch with an updated hardcoded value sounds good to me.
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > > +	help
> > > +	  Configures the default timeout waiting for the GuC to make forward
> > > +	  progress on CTBs, e.g. waiting for a response to a request.
> > > +
> > > +	  A range of 10 ms to 60000 ms is allowed.
> > > +
> > >  config DRM_I915_HEARTBEAT_INTERVAL
> > >  	int "Interval between heartbeat pulses (ms)"
> > >  	default 2500 # milliseconds
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > index 916c2b80c841..cf1fb09ef766 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > @@ -436,6 +436,7 @@ static int ct_write(struct intel_guc_ct *ct,
> > >   */
> > >  static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> > >  {
> > > +	long timeout;
> > >  	int err;
> > >  
> > >  	/*
> > > @@ -443,10 +444,12 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> > >  	 * up to that length of time, then switch to a slower sleep-wait loop.
> > >  	 * No GuC command should ever take longer than 10ms.
> > >  	 */
> > > +	timeout = CONFIG_DRM_I915_GUC_CTB_TIMEOUT;
> > > +
> > >  #define done INTEL_GUC_MSG_IS_RESPONSE(READ_ONCE(req->status))
> > >  	err = wait_for_us(done, 10);
> > >  	if (err)
> > > -		err = wait_for(done, 10);
> > > +		err = wait_for(done, timeout);
> > >  #undef done
> > >  
> > >  	if (unlikely(err))
> > > -- 
> > > 2.28.0
> > > 
Patch

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 39328567c200..0d5475b5f28a 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -38,6 +38,16 @@  config DRM_I915_USERFAULT_AUTOSUSPEND
 	  May be 0 to disable the extra delay and solely use the device level
 	  runtime pm autosuspend delay tunable.
 
+config DRM_I915_GUC_CTB_TIMEOUT
+	int "How long to wait for the GuC to make forward progress on CTBs (ms)"
+	default 1500 # milliseconds
+	range 10 60000
+	help
+	  Configures the default timeout waiting for the GuC to make forward
+	  progress on CTBs, e.g. waiting for a response to a request.
+
+	  A range of 10 ms to 60000 ms is allowed.
+
 config DRM_I915_HEARTBEAT_INTERVAL
 	int "Interval between heartbeat pulses (ms)"
 	default 2500 # milliseconds
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 916c2b80c841..cf1fb09ef766 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -436,6 +436,7 @@  static int ct_write(struct intel_guc_ct *ct,
  */
 static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
 {
+	long timeout;
 	int err;
 
 	/*
@@ -443,10 +444,12 @@  static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
 	 * up to that length of time, then switch to a slower sleep-wait loop.
 	 * No GuC command should ever take longer than 10ms.
 	 */
+	timeout = CONFIG_DRM_I915_GUC_CTB_TIMEOUT;
+
 #define done INTEL_GUC_MSG_IS_RESPONSE(READ_ONCE(req->status))
 	err = wait_for_us(done, 10);
 	if (err)
-		err = wait_for(done, 10);
+		err = wait_for(done, timeout);
 #undef done
 
 	if (unlikely(err))
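
For readers unfamiliar with the pattern in wait_for_ct_request_update(), the following is a minimal userspace sketch of a two-phase wait: a short busy-wait for the fast path, then a sleeping poll loop bounded by a hardcoded timeout, which is the direction the review settled on. It is not the i915 implementation. The names wait_for_response(), done_after_polls(), CTB_FAST_SPIN_US and CTB_RESPONSE_TIMEOUT_MS are hypothetical, the 1000 ms value only reflects the "around a second" suggestion in the thread, and the backoff policy is an assumption loosely modelled on a sleep-wait loop.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical hardcoded values; the thread only suggests "around a second". */
#define CTB_RESPONSE_TIMEOUT_MS 1000
#define CTB_FAST_SPIN_US        10

/* Predicate polled by the wait loop; returns true once the reply has arrived. */
typedef bool (*done_fn)(void *ctx);

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void sleep_us(long us)
{
	struct timespec ts = {
		.tv_sec = us / 1000000,
		.tv_nsec = (us % 1000000) * 1000,
	};

	nanosleep(&ts, NULL);
}

/* Returns 0 once done() reports true, or -ETIMEDOUT after timeout_ms. */
static int wait_for_response(done_fn done, void *ctx, long timeout_ms)
{
	int64_t deadline;
	long backoff_us = 10;

	assert(done != NULL);

	/* Phase 1: busy-wait briefly, since most replies are expected quickly. */
	deadline = now_ns() + (int64_t)CTB_FAST_SPIN_US * 1000;
	do {
		if (done(ctx))
			return 0;
	} while (now_ns() < deadline);

	/* Phase 2: sleep between polls, doubling the sleep up to roughly 1 ms. */
	deadline = now_ns() + (int64_t)timeout_ms * 1000000;
	for (;;) {
		/* Sample the clock first so the condition gets one final check. */
		bool expired = now_ns() >= deadline;

		if (done(ctx))
			return 0;
		if (expired)
			return -ETIMEDOUT;

		sleep_us(backoff_us);
		if (backoff_us < 1000)
			backoff_us *= 2;
	}
}

/* Example predicate: reports completion after *ctx further polls. */
static bool done_after_polls(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

A caller would pass the fixed constant, for example wait_for_response(done_after_polls, &polls, CTB_RESPONSE_TIMEOUT_MS). In the driver the corresponding change would presumably be a fixed value handed to wait_for(done, ...) in place of CONFIG_DRM_I915_GUC_CTB_TIMEOUT, with a comment explaining why that value was chosen.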