
[1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis

Message ID 20220915021218.1412111-2-alan.previn.teres.alexis@intel.com (mailing list archive)
State New, archived
Series Delay disabling GuC scheduling of an idle context

Commit Message

Teres Alexis, Alan Previn Sept. 15, 2022, 2:12 a.m. UTC
From: Matthew Brost <matthew.brost@intel.com>

Add a delay, configurable via debugfs (default 34ms), to disable
scheduling of a context after the pin count goes to zero. Disable
scheduling is a costly operation as it requires synchronizing with
the GuC. So the idea is that a delay allows the user to resubmit
something before doing this operation. This delay is only done if
the context isn't closed and less than a given threshold
(default is 3/4) of the guc_ids are in use.
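
As a rough sketch of the core idea only (illustrative names, simplified;
the actual changes are in intel_guc_submission.c below): the disable is
deferred through a delayed work item armed when the context goes idle,
and the work is simply cancelled if the context is reused before the
delay expires.

/* Illustrative sketch only, not the driver code in the diff below. */
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct demo_ctx {
        struct delayed_work sched_disable_delay;
        unsigned int delay_ms;  /* e.g. the 34ms default */
};

static void demo_sched_disable_worker(struct work_struct *wrk)
{
        struct demo_ctx *ctx =
                container_of(wrk, struct demo_ctx, sched_disable_delay.work);

        /* The costly H2G schedule-disable and G2H wait would go here. */
        (void)ctx;
}

static void demo_ctx_init(struct demo_ctx *ctx, unsigned int delay_ms)
{
        INIT_DELAYED_WORK(&ctx->sched_disable_delay, demo_sched_disable_worker);
        ctx->delay_ms = delay_ms;
}

/* Pin count just dropped to zero: arm (or re-arm) the delayed disable. */
static void demo_ctx_idle(struct demo_ctx *ctx)
{
        mod_delayed_work(system_unbound_wq, &ctx->sched_disable_delay,
                         msecs_to_jiffies(ctx->delay_ms));
}

/* Context reused before the delay expired: the disable never happens. */
static void demo_ctx_reused(struct demo_ctx *ctx)
{
        cancel_delayed_work(&ctx->sched_disable_delay);
}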

Alan Previn: Matt Brost first introduced this patch back in Oct 2021.
However, no real-world workload with a measured performance impact was
available at the time to prove the intended results. Today, this series
is being republished in response to a real-world workload that benefited
greatly from it, along with a measured performance improvement.

Workload description: 36 containers were created on a DG2 device where
each container was performing a combination of 720p 3d game rendering
and 30fps video encoding. The workload density was configured in a way
that guaranteed each container to ALWAYS be able to render and
encode no less than 30fps with a predefined maximum render + encode
latency time. That means the totality of all 36 containers and their
workloads were not saturating the engines to their max (in order to
maintain just enough headroom to meet the min fps and max latencies
of incoming container submissions).

Problem statement: It was observed that the CPU core processing the i915
soft IRQ work was experiencing severe load. Using tracelogs and an
instrumentation patch to count specific i915 IRQ events, it was confirmed
that the majority of the CPU cycles were caused by the
gen11_other_irq_handler() -> guc_irq_handler() code path. The vast
majority of the cycles was determined to be processing a specific G2H
IRQ: i.e. INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_DONE. These IRQs are sent
by GuC in response to i915 KMD sending H2G requests:
INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET. Those H2G requests are sent
whenever a context goes idle so that we can unpin the context from GuC.
The high CPU utilization % symptom was limiting density scaling.

Root Cause Analysis: Because the incoming execution buffers were spread
across 36 different containers (each with multiple contexts) but the
system in totality was NOT saturated to the max, it was assumed that each
context was constantly idling between submissions. This was causing
a thrashing of unpinning contexts from GuC at one moment, followed quickly
by repinning them due to incoming workload the very next moment. These
event-pairs were being triggered across multiple contexts per container,
across all containers at the rate of > 30 times per sec per context.

Metrics: When running this workload without this patch, we measured an
average of ~69K INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_DONE events every 10
seconds or ~10 million times over ~25+ mins. With this patch, the count
reduced to ~480 every 10 seconds or about ~28K over ~10 mins. The
improvement observed is ~99% for the average counts per 10 seconds.

Design awareness: Selftest impact.
As a temporary workaround, disable this feature for the selftests.
Selftests are very timing sensitive and any change in timing can cause
failure. A follow-up patch will fix up the selftests to account for
this delay.

Design awareness: Race between guc_request_alloc and guc_context_close.
If a context close is issued while there is a request submission in
flight and a delayed schedule disable is pending, guc_context_close
and guc_request_alloc will race to cancel the delayed disable.
To close the race, make sure that guc_request_alloc waits for
guc_context_close to finish running before checking any state.

Design awareness: GT Reset event.
If a GT reset is triggered, add an additional reset-preparation step to
ensure that all contexts with a pending delay-disable-schedule worker
are flushed of it. Move them directly into the closed state after
cancelling the worker. This is okay because the existing flow flushes
all yet-to-arrive G2Hs, dropping them anyway.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
 drivers/gpu/drm/i915/gt/intel_context.h       |   8 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  16 ++
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  60 +++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 220 +++++++++++++++---
 drivers/gpu/drm/i915/i915_selftest.h          |   2 +
 7 files changed, 288 insertions(+), 27 deletions(-)

Comments

Tvrtko Ursulin Sept. 15, 2022, 8:48 a.m. UTC | #1
On 15/09/2022 03:12, Alan Previn wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> 
> Add a delay, configurable via debugfs (default 34ms), to disable
> scheduling of a context after the pin count goes to zero. Disable
> scheduling is a costly operation as it requires synchronizing with
> the GuC. So the idea is that a delay allows the user to resubmit
> something before doing this operation. This delay is only done if
> the context isn't closed and less than a given threshold
> (default is 3/4) of the guc_ids are in use.
> 
> Alan Previn: Matt Brost first introduced this patch back in Oct 2021.
> However, no real-world workload with a measured performance impact was
> available at the time to prove the intended results. Today, this series
> is being republished in response to a real-world workload that benefited
> greatly from it, along with a measured performance improvement.
> 
> Workload description: 36 containers were created on a DG2 device where
> each container was performing a combination of 720p 3d game rendering
> and 30fps video encoding. The workload density was configured in a way
> that guaranteed each container to ALWAYS be able to render and
> encode no less than 30fps with a predefined maximum render + encode
> latency time. That means the totality of all 36 containers and their
> workloads were not saturating the engines to their max (in order to
> maintain just enough headroom to meet the min fps and max latencies
> of incoming container submissions).
> 
> Problem statement: It was observed that the CPU core processing the i915
> soft IRQ work was experiencing severe load. Using tracelogs and an
> instrumentation patch to count specific i915 IRQ events, it was confirmed
> that the majority of the CPU cycles were caused by the
> gen11_other_irq_handler() -> guc_irq_handler() code path. The vast
> majority of the cycles was determined to be processing a specific G2H
> IRQ: i.e. INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_DONE. These IRQs are sent
> by GuC in response to i915 KMD sending H2G requests:
> INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET. Those H2G requests are sent
> whenever a context goes idle so that we can unpin the context from GuC.
> The high CPU utilization % symptom was limiting density scaling.
> 
> Root Cause Analysis: Because the incoming execution buffers were spread
> across 36 different containers (each with multiple contexts) but the
> system in totality was NOT saturated to the max, it was assumed that each
> context was constantly idling between submissions. This was causing
> a thrashing of unpinning contexts from GuC at one moment, followed quickly
> by repinning them due to incoming workload the very next moment. These
> event-pairs were being triggered across multiple contexts per container,
> across all containers at the rate of > 30 times per sec per context.
> 
> Metrics: When running this workload without this patch, we measured an
> average of ~69K INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_DONE events every 10
> seconds or ~10 million times over ~25+ mins. With this patch, the count
> reduced to ~480 every 10 seconds or about ~28K over ~10 mins. The
> improvement observed is ~99% for the average counts per 10 seconds.
> 
> Design awareness: Selftest impact.
> As a temporary workaround, disable this feature for the selftests.
> Selftests are very timing sensitive and any change in timing can cause
> failure. A follow-up patch will fix up the selftests to account for
> this delay.
> 
> Design awareness: Race between guc_request_alloc and guc_context_close.
> If a context close is issued while there is a request submission in
> flight and a delayed schedule disable is pending, guc_context_close
> and guc_request_alloc will race to cancel the delayed disable.
> To close the race, make sure that guc_request_alloc waits for
> guc_context_close to finish running before checking any state.
> 
> Design awareness: GT Reset event.
> If a GT reset is triggered, add an additional reset-preparation step to
> ensure that all contexts with a pending delay-disable-schedule worker
> are flushed of it. Move them directly into the closed state after
> cancelling the worker. This is okay because the existing flow flushes
> all yet-to-arrive G2Hs, dropping them anyway.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
>   drivers/gpu/drm/i915/gt/intel_context.h       |   8 +
>   drivers/gpu/drm/i915/gt/intel_context_types.h |   7 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  16 ++
>   .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  60 +++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 220 +++++++++++++++---
>   drivers/gpu/drm/i915/i915_selftest.h          |   2 +
>   7 files changed, 288 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index dabdfe09f5e5..df7fd1b019ec 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -1454,7 +1454,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
>   		int err;
>   
>   		/* serialises with execbuf */
> -		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> +		intel_context_close(ce);
>   		if (!intel_context_pin_if_active(ce))
>   			continue;
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 8e2d70630c49..f96420f0b5bb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -276,6 +276,14 @@ static inline bool intel_context_is_barrier(const struct intel_context *ce)
>   	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
>   }
>   
> +static inline void intel_context_close(struct intel_context *ce)
> +{
> +	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> +
> +	if (ce->ops->close)
> +		ce->ops->close(ce);
> +}
> +
>   static inline bool intel_context_is_closed(const struct intel_context *ce)
>   {
>   	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 04eacae1aca5..86ac84e2edb9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -43,6 +43,8 @@ struct intel_context_ops {
>   	void (*revoke)(struct intel_context *ce, struct i915_request *rq,
>   		       unsigned int preempt_timeout_ms);
>   
> +	void (*close)(struct intel_context *ce);
> +
>   	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
>   	int (*pin)(struct intel_context *ce, void *vaddr);
>   	void (*unpin)(struct intel_context *ce);
> @@ -208,6 +210,11 @@ struct intel_context {
>   		 * each priority bucket
>   		 */
>   		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
> +		/**
> +		 * @sched_disable_delay: worker to disable scheduling on this
> +		 * context
> +		 */
> +		struct delayed_work sched_disable_delay;
>   	} guc_state;
>   
>   	struct {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 804133df1ac9..12be811181b3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -112,6 +112,10 @@ struct intel_guc {
>   		 * refs
>   		 */
>   		struct list_head guc_id_list;
> +		/**
> +		 * @guc_ids_in_use: Number single-lrc guc_ids in use
> +		 */
> +		u16 guc_ids_in_use;

Any specific reason to use u16? It can usually just result in larger 
code generated and I don't see any space saving needed or achieved when 
it is sandwiched between two struct list_heads.

>   		/**
>   		 * @destroyed_contexts: list of contexts waiting to be destroyed
>   		 * (deregistered with the GuC)
> @@ -132,6 +136,16 @@ struct intel_guc {
>   		 * @reset_fail_mask: mask of engines that failed to reset
>   		 */
>   		intel_engine_mask_t reset_fail_mask;
> +		/**
> +		 * @sched_disable_delay_ms: schedule disable delay, in ms, for
> +		 * contexts
> +		 */
> +		u64 sched_disable_delay_ms;

64-bits for the delay then sounds like overkill. Both should IMO just be 
unsigned ints.

> +		/**
> +		 * @sched_disable_gucid_threshold: threshold of min remaining available
> +		 * guc_ids before we start bypassing the schedule disable delay
> +		 */
> +		int sched_disable_gucid_threshold;

unsigned int as well, so reader does not have to think about:

  return guc->submission_state.guc_ids_in_use >
	guc->submission_state.sched_disable_gucid_threshold;

further down.

>   	} submission_state;
>   
>   	/**
> @@ -466,4 +480,6 @@ void intel_guc_write_barrier(struct intel_guc *guc);
>   
>   void intel_guc_dump_time_info(struct intel_guc *guc, struct drm_printer *p);
>   
> +int intel_guc_sched_disable_gucid_threshold_max(struct intel_guc *guc);
> +
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> index 25f09a420561..c91b150bb7ac 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
> @@ -71,12 +71,72 @@ static bool intel_eval_slpc_support(void *data)
>   	return intel_guc_slpc_is_used(guc);
>   }
>   
> +static int guc_sched_disable_delay_ms_get(void *data, u64 *val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	*val = guc->submission_state.sched_disable_delay_ms;
> +
> +	return 0;
> +}
> +
> +static int guc_sched_disable_delay_ms_set(void *data, u64 val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	guc->submission_state.sched_disable_delay_ms = val;
> +
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ms_fops,
> +			guc_sched_disable_delay_ms_get,
> +			guc_sched_disable_delay_ms_set, "%lld\n");
> +
> +static int guc_sched_disable_gucid_threshold_get(void *data, u64 *val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	*val = guc->submission_state.sched_disable_gucid_threshold;
> +	return 0;
> +}
> +
> +static int guc_sched_disable_gucid_threshold_set(void *data, u64 val)
> +{
> +	struct intel_guc *guc = data;
> +
> +	if (!intel_guc_submission_is_used(guc))
> +		return -ENODEV;
> +
> +	if (val > intel_guc_sched_disable_gucid_threshold_max(guc))
> +		guc->submission_state.sched_disable_gucid_threshold =
> +			intel_guc_sched_disable_gucid_threshold_max(guc);
> +	else
> +		guc->submission_state.sched_disable_gucid_threshold = val;
> +
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_gucid_threshold_fops,
> +			guc_sched_disable_gucid_threshold_get,
> +			guc_sched_disable_gucid_threshold_set, "%lld\n");
> +
>   void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
>   {
>   	static const struct intel_gt_debugfs_file files[] = {
>   		{ "guc_info", &guc_info_fops, NULL },
>   		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
>   		{ "guc_slpc_info", &guc_slpc_info_fops, &intel_eval_slpc_support},
> +		{ "guc_sched_disable_delay_ms", &guc_sched_disable_delay_ms_fops, NULL },
> +		{ "guc_sched_disable_gucid_threshold", &guc_sched_disable_gucid_threshold_fops,
> +		   NULL },
>   	};
>   
>   	if (!intel_guc_is_supported(guc))
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 22ba66e48a9b..29793972c39e 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -65,7 +65,13 @@
>    * corresponding G2H returns indicating the scheduling disable operation has
>    * completed it is safe to unpin the context. While a disable is in flight it
>    * isn't safe to resubmit the context so a fence is used to stall all future
> - * requests of that context until the G2H is returned.
> + * requests of that context until the G2H is returned. Because this interaction
> + * with the GuC takes a non-zero amount of time we delay the disabling of
> + * scheduling after the pin count goes to zero by a configurable period of time
> + * (see SCHED_DISABLE_DELAY_MS). The thought is this gives the user a window of
> + * time to resubmit something on the context before doing this costly operation.
> + * This delay is only done if the context isn't closed and the guc_id usage is
> + * less than a threshold (see NUM_SCHED_DISABLE_GUC_IDS_THRESHOLD).
>    *
>    * Context deregistration:
>    * Before a context can be destroyed or if we steal its guc_id we must
> @@ -163,7 +169,8 @@ guc_create_parallel(struct intel_engine_cs **engines,
>   #define SCHED_STATE_PENDING_ENABLE			BIT(5)
>   #define SCHED_STATE_REGISTERED				BIT(6)
>   #define SCHED_STATE_POLICY_REQUIRED			BIT(7)
> -#define SCHED_STATE_BLOCKED_SHIFT			8
> +#define SCHED_STATE_CLOSED				BIT(8)
> +#define SCHED_STATE_BLOCKED_SHIFT			9
>   #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
>   #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
>   
> @@ -173,12 +180,20 @@ static inline void init_sched_state(struct intel_context *ce)
>   	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
>   }
>   
> +/*
> + * Kernel contexts can have SCHED_STATE_REGISTERED after suspend.
> + * A context close can race with the submission path, so SCHED_STATE_CLOSED
> + * can be set immediately before we try to register.
> + */
> +#define SCHED_STATE_VALID_INIT \
> +	(SCHED_STATE_BLOCKED_MASK | \
> +	 SCHED_STATE_CLOSED | \
> +	 SCHED_STATE_REGISTERED)
> +
>   __maybe_unused
>   static bool sched_state_is_init(struct intel_context *ce)
>   {
> -	/* Kernel contexts can have SCHED_STATE_REGISTERED after suspend. */
> -	return !(ce->guc_state.sched_state &
> -		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
> +	return !(ce->guc_state.sched_state & ~SCHED_STATE_VALID_INIT);
>   }
>   
>   static inline bool
> @@ -319,6 +334,17 @@ static inline void clr_context_policy_required(struct intel_context *ce)
>   	ce->guc_state.sched_state &= ~SCHED_STATE_POLICY_REQUIRED;
>   }
>   
> +static inline bool context_close_done(struct intel_context *ce)
> +{
> +	return ce->guc_state.sched_state & SCHED_STATE_CLOSED;
> +}
> +
> +static inline void set_context_close_done(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state |= SCHED_STATE_CLOSED;
> +}
> +
>   static inline u32 context_blocked(struct intel_context *ce)
>   {
>   	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
> @@ -1523,6 +1549,7 @@ static void guc_flush_submissions(struct intel_guc *guc)
>   }
>   
>   static void guc_flush_destroyed_contexts(struct intel_guc *guc);
> +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc);
>   
>   void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   {
> @@ -1540,6 +1567,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
>   	spin_lock_irq(guc_to_gt(guc)->irq_lock);
>   	spin_unlock_irq(guc_to_gt(guc)->irq_lock);
>   
> +	guc_flush_all_delayed_disable_sched_contexts(guc);
>   	guc_flush_submissions(guc);
>   	guc_flush_destroyed_contexts(guc);
>   	flush_work(&guc->ct.requests.worker);
> @@ -1994,6 +2022,9 @@ static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	if (unlikely(ret < 0))
>   		return ret;
>   
> +	if (!intel_context_is_parent(ce))
> +		++guc->submission_state.guc_ids_in_use;
> +
>   	ce->guc_id.id = ret;
>   	return 0;
>   }
> @@ -2003,14 +2034,16 @@ static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
>   	GEM_BUG_ON(intel_context_is_child(ce));
>   
>   	if (!context_guc_id_invalid(ce)) {
> -		if (intel_context_is_parent(ce))
> +		if (intel_context_is_parent(ce)) {
>   			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
>   					      ce->guc_id.id,
>   					      order_base_2(ce->parallel.number_children
>   							   + 1));
> -		else
> +		} else {
> +			--guc->submission_state.guc_ids_in_use;
>   			ida_simple_remove(&guc->submission_state.guc_ids,
>   					  ce->guc_id.id);
> +		}
>   		clr_ctx_id_mapping(guc, ce->guc_id.id);
>   		set_context_guc_id_invalid(ce);
>   	}
> @@ -2998,41 +3031,130 @@ guc_context_revoke(struct intel_context *ce, struct i915_request *rq,
>   	}
>   }
>   
> -static void guc_context_sched_disable(struct intel_context *ce)
> +static void guc_context_sched_disable(struct intel_context *ce);
> +
> +static void do_sched_disable(struct intel_guc *guc, struct intel_context *ce,
> +			     unsigned long flags)
> +	__releases(ce->guc_state.lock)
>   {
> -	struct intel_guc *guc = ce_to_guc(ce);
> -	unsigned long flags;
>   	struct intel_runtime_pm *runtime_pm = &ce->engine->gt->i915->runtime_pm;
>   	intel_wakeref_t wakeref;
>   	u16 guc_id;
>   
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	guc_id = prep_context_pending_disable(ce);
> +
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +
> +	with_intel_runtime_pm(runtime_pm, wakeref)
> +		__guc_context_sched_disable(guc, ce, guc_id);
> +}
> +
> +static bool bypass_sched_disable(struct intel_guc *guc,
> +				 struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
>   	GEM_BUG_ON(intel_context_is_child(ce));
>   
> +	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
> +	    !ctx_id_mapped(guc, ce->guc_id.id)) {
> +		clr_context_enabled(ce);
> +		return true;
> +	}
> +
> +	return !context_enabled(ce);
> +}
> +
> +static void __delay_sched_disable(struct work_struct *wrk)
> +{
> +	struct intel_context *ce =
> +		container_of(wrk, typeof(*ce), guc_state.sched_disable_delay.work);
> +	struct intel_guc *guc = ce_to_guc(ce);
> +	unsigned long flags;
> +
>   	spin_lock_irqsave(&ce->guc_state.lock, flags);
>   
> +	if (bypass_sched_disable(guc, ce)) {
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		intel_context_sched_disable_unpin(ce);
> +	} else {
> +		do_sched_disable(guc, ce, flags);
> +	}


lock
if
   unlock
   do stuff
else
   do_sched_disable - which unlocks inside

Now move to next block..

> +}
> +
> +static bool guc_id_pressure(struct intel_guc *guc, struct intel_context *ce)
> +{
>   	/*
> -	 * We have to check if the context has been disabled by another thread,
> -	 * check if submssion has been disabled to seal a race with reset and
> -	 * finally check if any more requests have been committed to the
> -	 * context ensursing that a request doesn't slip through the
> -	 * 'context_pending_disable' fence.
> +	 * parent contexts are perma-pinned, if we are unpinning do schedule
> +	 * disable immediately.
>   	 */
> -	if (unlikely(!context_enabled(ce) || submission_disabled(guc) ||
> -		     context_has_committed_requests(ce))) {
> -		clr_context_enabled(ce);
> +	if (intel_context_is_parent(ce))
> +		return true;
> +
> +	/*
> +	 * If we are beyond the threshold for avail guc_ids, do schedule disable immediately.
> +	 */
> +	return guc->submission_state.guc_ids_in_use >
> +		guc->submission_state.sched_disable_gucid_threshold;
> +}
> +
> +static void guc_context_sched_disable(struct intel_context *ce)
> +{
> +	struct intel_guc *guc = ce_to_guc(ce);
> +	u64 delay = guc->submission_state.sched_disable_delay_ms;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +
> +	if (bypass_sched_disable(guc, ce)) {
> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +		intel_context_sched_disable_unpin(ce);
> +	} else if (!intel_context_is_closed(ce) && !guc_id_pressure(guc, ce) &&
> +		   delay) {
>   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> -		goto unpin;
> +		mod_delayed_work(system_unbound_wq,
> +				 &ce->guc_state.sched_disable_delay,
> +				 msecs_to_jiffies(delay));
> +	} else {
> +		do_sched_disable(guc, ce, flags);
>   	}

lock
if
   unlock
   do stuff
else if
   unlock
   do stuff
else
   do_sched_disable - which unlocks inside

IMO it creates less readable code for the benefit of not repeating 
with_intel_runtime_pm -> __guc_context_sched_disable two times. Dunno.. 
it's ugly but I have no suggestions. Hm does it have to send using the 
busy loop? It couldn't just queue the request and then wait for reply if 
disable message was emitted?

> -	guc_id = prep_context_pending_disable(ce);
> +}
>   
> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
> +{
> +	struct intel_context *ce;
> +	unsigned long index;
> +	unsigned long flags;
> +	unsigned long ceflags;
>   
> -	with_intel_runtime_pm(runtime_pm, wakeref)
> -		__guc_context_sched_disable(guc, ce, guc_id);
> +	xa_lock_irqsave(&guc->context_lookup, flags);
> +	xa_for_each(&guc->context_lookup, index, ce) {
> +		if (!kref_get_unless_zero(&ce->ref))
> +			continue;
> +		xa_unlock(&guc->context_lookup);

So this whole loop _needs_ to run with interrupts disabled? Explaining 
why in a comment would be good.

> +		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> +		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> +			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> +			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);

This deserves a comment about what lock toggling wants to ensure.

Also, if the loop runs with interrupts disabled what is the point of 
irqsave variant in here??

Also2, what is the reason for dropping the lock? intel_context_put?

> +			intel_context_sched_disable_unpin(ce);
> +		}
> +		intel_context_put(ce);
> +		xa_lock(&guc->context_lookup);
> +	}
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> +}
> +
> +static void guc_context_close(struct intel_context *ce)
> +{
> +	unsigned long flags;
> +
> +	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> +	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
> +		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
>   
> -	return;
> -unpin:
> -	intel_context_sched_disable_unpin(ce);
> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> +	set_context_close_done(ce);
> +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>   }
>   
>   static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> @@ -3351,6 +3473,8 @@ static void remove_from_context(struct i915_request *rq)
>   static const struct intel_context_ops guc_context_ops = {
>   	.alloc = guc_context_alloc,
>   
> +	.close = guc_context_close,
> +
>   	.pre_pin = guc_context_pre_pin,
>   	.pin = guc_context_pin,
>   	.unpin = guc_context_unpin,
> @@ -3433,6 +3557,10 @@ static void guc_context_init(struct intel_context *ce)
>   	rcu_read_unlock();
>   
>   	ce->guc_state.prio = map_i915_prio_to_guc_prio(prio);
> +
> +	INIT_DELAYED_WORK(&ce->guc_state.sched_disable_delay,
> +			  __delay_sched_disable);
> +
>   	set_bit(CONTEXT_GUC_INIT, &ce->flags);
>   }
>   
> @@ -3470,6 +3598,19 @@ static int guc_request_alloc(struct i915_request *rq)
>   	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
>   		guc_context_init(ce);
>   
> +	/*
> +	 * If the context gets closed while the execbuf is ongoing, the context
> +	 * close code will race with the below code to cancel the delayed work.
> +	 * If the context close wins the race and cancels the work, it will
> +	 * immediately call the sched disable (see guc_context_close), so there
> +	 * is a chance we can get past this check while the sched_disable code
> +	 * is being executed. To make sure that code completes before we check
> +	 * the status further down, we wait for the close process to complete.
> +	 */
> +	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> +		intel_context_sched_disable_unpin(ce);
> +	else if (intel_context_is_closed(ce))
> +		wait_for(context_close_done(ce), 1);

Comment makes it sound important to handle the race, although it 
doesn't really explain the consequences. But most importantly, what if 
close doesn't complete in 1ms?

Regards,

Tvrtko

>   	/*
>   	 * Call pin_guc_id here rather than in the pinning step as with
>   	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
> @@ -3600,6 +3741,8 @@ static int guc_virtual_context_alloc(struct intel_context *ce)
>   static const struct intel_context_ops virtual_guc_context_ops = {
>   	.alloc = guc_virtual_context_alloc,
>   
> +	.close = guc_context_close,
> +
>   	.pre_pin = guc_virtual_context_pre_pin,
>   	.pin = guc_virtual_context_pin,
>   	.unpin = guc_virtual_context_unpin,
> @@ -3689,6 +3832,8 @@ static void guc_child_context_destroy(struct kref *kref)
>   static const struct intel_context_ops virtual_parent_context_ops = {
>   	.alloc = guc_virtual_context_alloc,
>   
> +	.close = guc_context_close,
> +
>   	.pre_pin = guc_context_pre_pin,
>   	.pin = guc_parent_context_pin,
>   	.unpin = guc_parent_context_unpin,
> @@ -4219,6 +4364,26 @@ static bool __guc_submission_selected(struct intel_guc *guc)
>   	return i915->params.enable_guc & ENABLE_GUC_SUBMISSION;
>   }
>   
> +int intel_guc_sched_disable_gucid_threshold_max(struct intel_guc *guc)
> +{
> +	return guc->submission_state.num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
> +}
> +
> +/*
> + * This default value of 33 milisecs (+1 milisec round up) ensures 30fps or higher
> + * workloads are able to enjoy the latency reduction when delaying the schedule-disable
> + * operation. This matches the 30fps game-render + encode (real world) workload this
> + * knob was tested against.
> + */
> +#define SCHED_DISABLE_DELAY_MS	34
> +
> +/*
> + * A threshold of 75% is a reasonable starting point considering that real world apps
> + * generally don't get anywhere near this.
> + */
> +#define NUM_SCHED_DISABLE_GUCIDS_DEFAULT_THRESHOLD(__guc) \
> +	(((intel_guc_sched_disable_gucid_threshold_max(guc)) * 3) / 4)
> +
>   void intel_guc_submission_init_early(struct intel_guc *guc)
>   {
>   	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
> @@ -4235,7 +4400,10 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
>   	spin_lock_init(&guc->timestamp.lock);
>   	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>   
> +	guc->submission_state.sched_disable_delay_ms = SCHED_DISABLE_DELAY_MS;
>   	guc->submission_state.num_guc_ids = GUC_MAX_CONTEXT_ID;
> +	guc->submission_state.sched_disable_gucid_threshold =
> +		NUM_SCHED_DISABLE_GUCIDS_DEFAULT_THRESHOLD(guc);
>   	guc->submission_supported = __guc_submission_supported(guc);
>   	guc->submission_selected = __guc_submission_selected(guc);
>   }
> diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
> index f54de0499be7..bdf3e22c0a34 100644
> --- a/drivers/gpu/drm/i915/i915_selftest.h
> +++ b/drivers/gpu/drm/i915/i915_selftest.h
> @@ -92,12 +92,14 @@ int __i915_subtests(const char *caller,
>   			T, ARRAY_SIZE(T), data)
>   #define i915_live_subtests(T, data) ({ \
>   	typecheck(struct drm_i915_private *, data); \
> +	(data)->gt[0]->uc.guc.submission_state.sched_disable_delay_ms = 0; \
>   	__i915_subtests(__func__, \
>   			__i915_live_setup, __i915_live_teardown, \
>   			T, ARRAY_SIZE(T), data); \
>   })
>   #define intel_gt_live_subtests(T, data) ({ \
>   	typecheck(struct intel_gt *, data); \
> +	(data)->uc.guc.submission_state.sched_disable_delay_ms = 0; \
>   	__i915_subtests(__func__, \
>   			__intel_gt_live_setup, __intel_gt_live_teardown, \
>   			T, ARRAY_SIZE(T), data); \
Teres Alexis, Alan Previn Sept. 16, 2022, 7:53 a.m. UTC | #2
On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> On 15/09/2022 03:12, Alan Previn wrote:
> > From: Matthew Brost <matthew.brost@intel.com>
> > 
> > Add a delay, configurable via debugfs (default 34ms), to disable
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > 

> > +		u16 guc_ids_in_use;
> 
> Any specific reason to use u16? It can usually just result in larger 
> code generated and I don't see any space saving needed or achieved when 
> it is sandwiched between two struct list_heads.
> 
no specific reason - will switch to uint32.

> > +		u64 sched_disable_delay_ms;
> 
> 64-bits for the delay then sounds like overkill. Both should IMO just be 
> unsigned ints.
> 
avoiding some typecasting on related functions that reference this
but that's not a good excuse so will fix it.


> > +		int sched_disable_gucid_threshold;
> 
> unsigned int as well, so reader does not have to think about:
>   return guc->submission_state.guc_ids_in_use >
> 	guc->submission_state.sched_disable_gucid_threshold;
> 
> further down.
> 
yes agreed - will fix.


> > +static void __delay_sched_disable(struct work_struct *wrk)
> > +{
> > +	struct intel_context *ce =
> > +		container_of(wrk, typeof(*ce), guc_state.sched_disable_delay.work);
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +	unsigned long flags;
> > +
> >   	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >   
> > +	if (bypass_sched_disable(guc, ce)) {
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +		intel_context_sched_disable_unpin(ce);
> > +	} else {
> > +		do_sched_disable(guc, ce, flags);
> > +	}
> 
> lock
> if
>    unlock
>    do stuff
> else
>    do_sched_disable - which unlocks inside
> 
> Now move to next block..
> 
> > +}
> > +
> > +static bool guc_id_pressure(struct intel_guc *guc, struct intel_context *ce)
> > +{
> >   	/*
> > -	 * We have to check if the context has been disabled by another thread,
> > -	 * check if submssion has been disabled to seal a race with reset and
> > -	 * finally check if any more requests have been committed to the
> > -	 * context ensursing that a request doesn't slip through the
> > -	 * 'context_pending_disable' fence.
> > +	 * parent contexts are perma-pinned, if we are unpinning do schedule
> > +	 * disable immediately.
> >   	 */
> > -	if (unlikely(!context_enabled(ce) || submission_disabled(guc) ||
> > -		     context_has_committed_requests(ce))) {
> > -		clr_context_enabled(ce);
> > +	if (intel_context_is_parent(ce))
> > +		return true;
> > +
> > +	/*
> > +	 * If we are beyond the threshold for avail guc_ids, do schedule disable immediately.
> > +	 */
> > +	return guc->submission_state.guc_ids_in_use >
> > +		guc->submission_state.sched_disable_gucid_threshold;
> > +}
> > +
> > +static void guc_context_sched_disable(struct intel_context *ce)
> > +{
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +	u64 delay = guc->submission_state.sched_disable_delay_ms;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +
> > +	if (bypass_sched_disable(guc, ce)) {
> > +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +		intel_context_sched_disable_unpin(ce);
> > +	} else if (!intel_context_is_closed(ce) && !guc_id_pressure(guc, ce) &&
> > +		   delay) {
> >   		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > -		goto unpin;
> > +		mod_delayed_work(system_unbound_wq,
> > +				 &ce->guc_state.sched_disable_delay,
> > +				 msecs_to_jiffies(delay));
> > +	} else {
> > +		do_sched_disable(guc, ce, flags);
> >   	}
> 
> lock
> if
>    unlock
>    do stuff
> else if
>    unlock
>    do stuff
> else
>    do_sched_disable - which unlocks inside
> 
> IMO it creates less readable code for the benefit of not repeating 
> with_intel_runtime_pm -> __guc_context_sched_disable two times. Dunno.. 
> it's ugly but I have no suggestions. Hm does it have to send using the 
> busy loop? It couldn't just queue the request and then wait for reply if 
> disable message was emitted?
> 
I agree that the above code lacks readability - will see if i can break it down to smaller
functions with cleaner in-function lock/unlock pairs. I agree that a little code duplication
is better than less readable code. It was inherited code i didn't want to modify but I'll
go ahead and refactor this.
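
One possible shape for that (a fragment sketch only, reusing the helpers
this patch introduces, so that each branch owns a matched lock/unlock
pair and do_sched_disable no longer unlocks internally; not
compile-tested):

static void guc_context_sched_disable(struct intel_context *ce)
{
        struct intel_guc *guc = ce_to_guc(ce);
        struct intel_runtime_pm *runtime_pm = &ce->engine->gt->i915->runtime_pm;
        u64 delay = guc->submission_state.sched_disable_delay_ms;
        intel_wakeref_t wakeref;
        unsigned long flags;
        u16 guc_id;

        spin_lock_irqsave(&ce->guc_state.lock, flags);

        /* Nothing to disable: just drop the lock and unpin. */
        if (bypass_sched_disable(guc, ce)) {
                spin_unlock_irqrestore(&ce->guc_state.lock, flags);
                intel_context_sched_disable_unpin(ce);
                return;
        }

        /* Context still open and plenty of guc_ids left: defer the disable. */
        if (!intel_context_is_closed(ce) && !guc_id_pressure(guc, ce) && delay) {
                spin_unlock_irqrestore(&ce->guc_state.lock, flags);
                mod_delayed_work(system_unbound_wq,
                                 &ce->guc_state.sched_disable_delay,
                                 msecs_to_jiffies(delay));
                return;
        }

        /* Otherwise disable immediately, duplicating the unlock locally. */
        guc_id = prep_context_pending_disable(ce);
        spin_unlock_irqrestore(&ce->guc_state.lock, flags);

        with_intel_runtime_pm(runtime_pm, wakeref)
                __guc_context_sched_disable(guc, ce, guc_id);
}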

On the busy loop - im assuming you are referring to the actual ct sending. I'll consult my
team if i am missing anything more but based on comments, I believe callers must use that
function to guarantee reservation of space in the G2H CTB to always have space to capture
responses for actions that MUST be acknowledged from GuC (acknowledged by either replying
with a success or failure). This is necessary for coherent guc-id state machine (because the
GuC firmware will drop requests for guc-id's that are not registered or not in a
'sched-enabled' state).


> > -	guc_id = prep_context_pending_disable(ce);
> > +}
> >   
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
> > +{
> > +	struct intel_context *ce;
> > +	unsigned long index;
> > +	unsigned long flags;
> > +	unsigned long ceflags;
> >   
> > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > -		__guc_context_sched_disable(guc, ce, guc_id);
> > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > +	xa_for_each(&guc->context_lookup, index, ce) {
> > +		if (!kref_get_unless_zero(&ce->ref))
> > +			continue;
> > +		xa_unlock(&guc->context_lookup);
> 
> So this whole loop _needs_ to run with interrupts disabled? Explaining 
> why in a comment would be good.
> 
Being mid-reset, the locking mode is consistent with other functions also used
as part of the reset preparation that parses and potentially modifies contexts.
I believe the goal is to handle all of this parsing without getting conflicting
latent G2H replies that break the preparation flow (that herds active contexts
into a fewer set of states as part of reset) - but i will double check
with my colleagues.

> > +		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> > +		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> > +			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> > +			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
> 
> This deserves a comment about what lock toggling wants to ensure.
> 
I realize this might have been my local rebasing mistake; the intention was to
handle cases where sched_disable_delay wasn't pending but was potentially still
executing do_sched_disable. I believe I could try cancel_delayed_work_sync (but
not sure if i can call that might-sleep function mid reset while not
interruptible). Else, i would move that lock-unlock to the case where
cancel_delayed_work did not return true (as per original intent before my
rebase error).

> Also, if the loop runs with interrupts disabled what is the point of 
> irqsave variant in here??
Yes - it's redundant, let me fix that, apologies for that.

> 
> Also2, what is the reason for dropping the lock? intel_context_put?
Being consistent with other reset preparation code that closes contexts,
the lock is dropped before the intel_context_put.
(I hope i am not misunderstanding your question).
> 
> > +	/*
> > +	 * If the context gets closed while the execbuf is ongoing, the context
> > +	 * close code will race with the below code to cancel the delayed work.
> > +	 * If the context close wins the race and cancels the work, it will
> > +	 * immediately call the sched disable (see guc_context_close), so there
> > +	 * is a chance we can get past this check while the sched_disable code
> > +	 * is being executed. To make sure that code completes before we check
> > +	 * the status further down, we wait for the close process to complete.
> > +	 */
> > +	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > +		intel_context_sched_disable_unpin(ce);
> > +	else if (intel_context_is_closed(ce))
> > +		wait_for(context_close_done(ce), 1);
> 
> Comment makes it sound important to handle the race, although it 
> doesn't really explain the consequences. But most importantly, what if 
> close doesn't complete in 1ms?

will add the consequence (i believe the consequence is that we could prematurely
read context flags bit indicating its gucid is still registered and after skipping
re-registration, find that context gets closed and guc-id freed).

Yes the 1 second is arbitrary and unnervingly short. Just spent some time trying to
figure out portions of the SCHED_foo state machine bits and believe that it's possible
for guc_request_alloc to just force context_close to be done from here as it would
force it into a state requiring re-registration and would close that a few lines
below. I will however verify with my team mates as i am new to these SCHED_foo state
machine bits.

> 
> Regards,
> 
> Tvrtko
>
Tvrtko Ursulin Sept. 16, 2022, 8:58 a.m. UTC | #3
On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> 
> On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
>> On 15/09/2022 03:12, Alan Previn wrote:
>>> From: Matthew Brost <matthew.brost@intel.com>
>>>
>>> Add a delay, configurable via debugfs (default 34ms), to disable
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>>
> 
>>> +		u16 guc_ids_in_use;
>>
>> Any specific reason to use u16? It can usually just result in larger
>> code generated and I don't see any space saving needed or achieved when
>> it is sandwiched between two struct list_heads.
>>
> no specific reason - will switch to uint32.

Unsigned int would be best. Every time there is explicit width specified 
it looks like there is special reason for the width - like interacting 
with hardware or firmware protocol. At least I always see it like that.

>>> +		u64 sched_disable_delay_ms;
>>
>> 64-bits for the delay then sounds like overkill. Both should IMO just be
>> unsigned ints.
>>
> avoiding some typecasting on related functions that reference this
> but that's not a good excuse so will fix it.

Right, yes, clamp/cap/validate at debugfs entry points should do it.

>>> +		int sched_disable_gucid_threshold;
>>
>> unsigned int as well, so reader does not have to think about:
>>    return guc->submission_state.guc_ids_in_use >
>> 	guc->submission_state.sched_disable_gucid_threshold;
>>
>> further down.
>>
> yes agreed - will fix.
> 
> 
>>> +static void __delay_sched_disable(struct work_struct *wrk)
>>> +{
>>> +	struct intel_context *ce =
>>> +		container_of(wrk, typeof(*ce), guc_state.sched_disable_delay.work);
>>> +	struct intel_guc *guc = ce_to_guc(ce);
>>> +	unsigned long flags;
>>> +
>>>    	spin_lock_irqsave(&ce->guc_state.lock, flags);
>>>    
>>> +	if (bypass_sched_disable(guc, ce)) {
>>> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +		intel_context_sched_disable_unpin(ce);
>>> +	} else {
>>> +		do_sched_disable(guc, ce, flags);
>>> +	}
>>
>> lock
>> if
>>     unlock
>>     do stuff
>> else
>>     do_sched_disable - which unlocks inside
>>
>> Now move to next block..
>>
>>> +}
>>> +
>>> +static bool guc_id_pressure(struct intel_guc *guc, struct intel_context *ce)
>>> +{
>>>    	/*
>>> -	 * We have to check if the context has been disabled by another thread,
>>> -	 * check if submssion has been disabled to seal a race with reset and
>>> -	 * finally check if any more requests have been committed to the
>>> -	 * context ensursing that a request doesn't slip through the
>>> -	 * 'context_pending_disable' fence.
>>> +	 * parent contexts are perma-pinned, if we are unpinning do schedule
>>> +	 * disable immediately.
>>>    	 */
>>> -	if (unlikely(!context_enabled(ce) || submission_disabled(guc) ||
>>> -		     context_has_committed_requests(ce))) {
>>> -		clr_context_enabled(ce);
>>> +	if (intel_context_is_parent(ce))
>>> +		return true;
>>> +
>>> +	/*
>>> +	 * If we are beyond the threshold for avail guc_ids, do schedule disable immediately.
>>> +	 */
>>> +	return guc->submission_state.guc_ids_in_use >
>>> +		guc->submission_state.sched_disable_gucid_threshold;
>>> +}
>>> +
>>> +static void guc_context_sched_disable(struct intel_context *ce)
>>> +{
>>> +	struct intel_guc *guc = ce_to_guc(ce);
>>> +	u64 delay = guc->submission_state.sched_disable_delay_ms;
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&ce->guc_state.lock, flags);
>>> +
>>> +	if (bypass_sched_disable(guc, ce)) {
>>> +		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +		intel_context_sched_disable_unpin(ce);
>>> +	} else if (!intel_context_is_closed(ce) && !guc_id_pressure(guc, ce) &&
>>> +		   delay) {
>>>    		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> -		goto unpin;
>>> +		mod_delayed_work(system_unbound_wq,
>>> +				 &ce->guc_state.sched_disable_delay,
>>> +				 msecs_to_jiffies(delay));
>>> +	} else {
>>> +		do_sched_disable(guc, ce, flags);
>>>    	}
>>
>> lock
>> if
>>     unlock
>>     do stuff
>> else if
>>     unlock
>>     do stuff
>> else
>>     do_sched_disable - which unlocks inside
>>
>> IMO it creates less readable code for the benefit of not repeating
>> with_intel_runtime_pm -> __guc_context_sched_disable two times. Dunno..
>> it's ugly but I have no suggestions. Hm does it have to send using the
>> busy loop? It couldn't just queue the request and then wait for reply if
>> disable message was emitted?
>>
> I agree that the above code lacks readability - will see if i can break it down to smaller
> functions with cleaner in-function lock/unlock pairs. I agree that a little code duplication
> is better than less readable code. It was inherited code i didn't want to modify but I'll
> go ahead and refactor this.
> 
> On the busy loop - im assuming you are referring to the actual ct sending. I'll consult my
> team if i am missing anything more but based on comments, I believe callers must use that
> function to guarantee reservation of space in the G2H CTB to always have space to capture
> responses for actions that MUST be acknowledged from GuC (acknowledged by either replying
> with a success or failure). This is necessary for coherent guc-id state machine (because the
> GuC firmware will drop requests for guc-id's that are not registered or not in a
> 'sched-enabled' state).

Maybe to explain what I meant a bit better. I thought that the reason 
for strange unlock pattern is the with_runtime_pm which needs to happen 
for the CT send/receive loop. What I meant was would it be possible to 
do it like this:

state lock
...
sent = queue_msg_2_guc (this would be i915 only, no runtime pm needed)
...
state unlock

if (sent)
   with_runtime_pm:
      send/receive queued guc messages (only this would talk to guc)

But I have truly no idea if the CT messaging infrastructure supports 
something like this.
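
In C that rough shape might look like the fragment below, reusing the
locals from guc_context_sched_disable, where queue_sched_disable_msg()
and guc_ct_flush_queued() are purely made-up helper names:

bool sent;

spin_lock_irqsave(&ce->guc_state.lock, flags);
guc_id = prep_context_pending_disable(ce);
/* i915-side bookkeeping only, nothing is sent to the GuC yet */
sent = queue_sched_disable_msg(guc, ce, guc_id);
spin_unlock_irqrestore(&ce->guc_state.lock, flags);

if (sent)
        with_intel_runtime_pm(runtime_pm, wakeref)
                guc_ct_flush_queued(guc); /* only this talks to the GuC */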

Anyway, see what is possible and if it is not or too hard for now 
leave it be.

>>> -	guc_id = prep_context_pending_disable(ce);
>>> +}
>>>    
>>> -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>> +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
>>> +{
>>> +	struct intel_context *ce;
>>> +	unsigned long index;
>>> +	unsigned long flags;
>>> +	unsigned long ceflags;
>>>    
>>> -	with_intel_runtime_pm(runtime_pm, wakeref)
>>> -		__guc_context_sched_disable(guc, ce, guc_id);
>>> +	xa_lock_irqsave(&guc->context_lookup, flags);
>>> +	xa_for_each(&guc->context_lookup, index, ce) {
>>> +		if (!kref_get_unless_zero(&ce->ref))
>>> +			continue;
>>> +		xa_unlock(&guc->context_lookup);
>>
>> So this whole loop _needs_ to run with interrupts disabled? Explaining
>> why in a comment would be good.
>>
> Being mid-reset, the locking mode is consistent with other functions also used
> as part of the reset preparation that parses and potentially modifies contexts.
> I believe the goal is to handle all of this parsing without getting conflicting
> latent G2H replies that break the preparation flow (that herds active contexts
> into a fewer set of states as part of reset) - but i will double check
> with my colleagues.
> 
>>> +		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
>>> +		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
>>> +			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
>>> +			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
>>
>> This deserves a comment about what lock toggling wants to ensure.
>>
> I realize this might have been my local rebasing mistake; the intention was to
> handle cases where sched_disable_delay wasn't pending but was potentially still
> executing do_sched_disable. I believe I could try cancel_delayed_work_sync (but
> not sure if i can call that might-sleep function mid reset while not
> interruptible). Else, i would move that lock-unlock to the case where
> cancel_delayed_work did not return true (as per original intent before my
> rebase error).

Okay a comment like "flush any currently executing do_sched_disable" 
above the lock toggling would do then. Even better if you add what 
happens if that flushing isn't done.

> 
>> Also, if the loop runs with interrupts disabled what is the point of
>> irqsave variant in here??
> Yes - it's redundant, let me fix that, apologies for that.
> 
>>
>> Also2, what is the reason for dropping the lock? intel_context_put?
> Being consistent with other reset preparation code that closes contexts,
> the lock is dropped before the intel_context_put.
> (I hope i am not misunderstanding your question).

Right, okay.. so for locking inversion issues - intel_context_put must 
not be called with guc_state.lock held, yes?

Not a mandatory request, but if you want you could look into the option 
of avoiding lock dropping and instead doing xa_erase and recording the 
list of contexts for sched disable or put in a local list, and doing 
them all in one block outside the locked/irq disabled section. Although 
tbh I am not sure if that would improve anything. Probably not in this 
case of a reset path, but if there are other places in GuC code which do 
this it may be beneficial for less hammering on the CPU and core 
synchronisation for atomics.
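
Roughly something like the fragment below (purely illustrative, inside
the same flush function; the flush_link member is made up and would have
to exist somewhere safe on the context):

struct intel_context *ce, *cn;
unsigned long index, flags;
LIST_HEAD(flush_list);

xa_lock_irqsave(&guc->context_lookup, flags);
xa_for_each(&guc->context_lookup, index, ce) {
        if (!kref_get_unless_zero(&ce->ref))
                continue;
        list_add_tail(&ce->flush_link, &flush_list); /* hypothetical member */
}
xa_unlock_irqrestore(&guc->context_lookup, flags);

/* the heavier work now happens outside the xarray lock / irq-off section */
list_for_each_entry_safe(ce, cn, &flush_list, flush_link) {
        list_del(&ce->flush_link);
        if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
            cancel_delayed_work(&ce->guc_state.sched_disable_delay))
                intel_context_sched_disable_unpin(ce);
        intel_context_put(ce);
}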

>>
>>> +	/*
>>> +	 * If the context gets closed while the execbuf is ongoing, the context
>>> +	 * close code will race with the below code to cancel the delayed work.
>>> +	 * If the context close wins the race and cancels the work, it will
>>> +	 * immediately call the sched disable (see guc_context_close), so there
>>> +	 * is a chance we can get past this check while the sched_disable code
>>> +	 * is being executed. To make sure that code completes before we check
>>> +	 * the status further down, we wait for the close process to complete.
>>> +	 */
>>> +	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
>>> +		intel_context_sched_disable_unpin(ce);
>>> +	else if (intel_context_is_closed(ce))
>>> +		wait_for(context_close_done(ce), 1);
>>
>> Comment makes it sound important to handle the race, although it
>> doesn't really explain the consequences. But most importantly, what if
>> close doesn't complete in 1ms?
> 
> will add the consequence (i believe the consequence is that we could prematurely
> read context flags bit indicating its gucid is still registered and after skipping
> re-registration, find that context gets closed and guc-id freed).
> 
> Yes the 1 second is arbitrary and unnervingly short. Just spent some time trying to

One millisecond even. :)

What would be the consequence of prematurely reading the still 
registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
Something else?

And why can't this race with context close happen at any other point 
than this particular spot in guc_request_alloc? Immediately after the 
added checks? After atomic_add_unless?

> figure out portions of the SCHED_foo state machine bits and believe that it's possible
> for guc_request_alloc to just force context_close to be done from here as it would
> force it into a state requiring re-registration and would close that a few lines
> below. I will however verify with my team mates as i am new to these SCHED_foo state
> machine bits.

Yes it always looked complicated and perhaps it has even grown more so - 
I'm afraid I cannot be of any help there.

Regards,

Tvrtko
Daniele Ceraolo Spurio Sept. 16, 2022, 3:36 p.m. UTC | #4
On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
>
> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
>>
>> On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
>>> On 15/09/2022 03:12, Alan Previn wrote:
>>>> From: Matthew Brost <matthew.brost@intel.com>
>>>>
>>>> Add a delay, configurable via debugfs (default 34ms), to disable
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>>>
>>
>>>> +        u16 guc_ids_in_use;
>>>
>>> Any specific reason to use u16? It can usually just result in larger
>>> code generated and I don't see any space saving needed or achieved when
>>> it is sandwiched between two struct list_heads.
>>>
>> no specific reason - will switch to uint32.
>
> Unsigned int would be best. Every time there is explicit width 
> specified it looks like there is special reason for the width - like 
> interacting with hardware or firmware protocol. At least I always see 
> it like that.
>
>>>> +        u64 sched_disable_delay_ms;
>>>
>>> 64-bits for the delay then sounds like overkill. Both should IMO 
>>> just be
>>> unsigned ints.
>>>
>> avoiding some typecasting on related functions that reference this
>> but that's not a good excuse so will fix it.
>
> Right, yes, clamp/cap/validate at debugfs entry points should do it.
>
>>>> +        int sched_disable_gucid_threshold;
>>>
>>> unsigned int as well, so reader does not have to think about:
>>>    return guc->submission_state.guc_ids_in_use >
>>>     guc->submission_state.sched_disable_gucid_threshold;
>>>
>>> further down.
>>>
>> yes agreed - will fix.
>>
>>
>>>> +static void __delay_sched_disable(struct work_struct *wrk)
>>>> +{
>>>> +    struct intel_context *ce =
>>>> +        container_of(wrk, typeof(*ce), 
>>>> guc_state.sched_disable_delay.work);
>>>> +    struct intel_guc *guc = ce_to_guc(ce);
>>>> +    unsigned long flags;
>>>> +
>>>>        spin_lock_irqsave(&ce->guc_state.lock, flags);
>>>>    +    if (bypass_sched_disable(guc, ce)) {
>>>> +        spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>>> +        intel_context_sched_disable_unpin(ce);
>>>> +    } else {
>>>> +        do_sched_disable(guc, ce, flags);
>>>> +    }
>>>
>>> lock
>>> if
>>>     unlock
>>>     do stuff
>>> else
>>>     do_sched_disable - which unlocks inside
>>>
>>> Now move to next block..
>>>
>>>> +}
>>>> +
>>>> +static bool guc_id_pressure(struct intel_guc *guc, struct 
>>>> intel_context *ce)
>>>> +{
>>>>        /*
>>>> -     * We have to check if the context has been disabled by 
>>>> another thread,
>>>> -     * check if submssion has been disabled to seal a race with 
>>>> reset and
>>>> -     * finally check if any more requests have been committed to the
>>>> -     * context ensursing that a request doesn't slip through the
>>>> -     * 'context_pending_disable' fence.
>>>> +     * parent contexts are perma-pinned, if we are unpinning do 
>>>> schedule
>>>> +     * disable immediately.
>>>>         */
>>>> -    if (unlikely(!context_enabled(ce) || submission_disabled(guc) ||
>>>> -             context_has_committed_requests(ce))) {
>>>> -        clr_context_enabled(ce);
>>>> +    if (intel_context_is_parent(ce))
>>>> +        return true;
>>>> +
>>>> +    /*
>>>> +     * If we are beyond the threshold for avail guc_ids, do 
>>>> schedule disable immediately.
>>>> +     */
>>>> +    return guc->submission_state.guc_ids_in_use >
>>>> + guc->submission_state.sched_disable_gucid_threshold;
>>>> +}
>>>> +
>>>> +static void guc_context_sched_disable(struct intel_context *ce)
>>>> +{
>>>> +    struct intel_guc *guc = ce_to_guc(ce);
>>>> +    u64 delay = guc->submission_state.sched_disable_delay_ms;
>>>> +    unsigned long flags;
>>>> +
>>>> +    spin_lock_irqsave(&ce->guc_state.lock, flags);
>>>> +
>>>> +    if (bypass_sched_disable(guc, ce)) {
>>>> +        spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>>> +        intel_context_sched_disable_unpin(ce);
>>>> +    } else if (!intel_context_is_closed(ce) && 
>>>> !guc_id_pressure(guc, ce) &&
>>>> +           delay) {
>>>> spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>>> -        goto unpin;
>>>> +        mod_delayed_work(system_unbound_wq,
>>>> +                 &ce->guc_state.sched_disable_delay,
>>>> +                 msecs_to_jiffies(delay));
>>>> +    } else {
>>>> +        do_sched_disable(guc, ce, flags);
>>>>        }
>>>
>>> lock
>>> if
>>>     unlock
>>>     do stuff
>>> else if
>>>     unlock
>>>     do stuff
>>> else
>>>     do_sched_disable - which unlocks inside
>>>
>>> IMO it creates less readable code for the benefit of not repeating
>>> with_intel_runtime_pm -> __guc_context_sched_disable two times. Dunno..
>>> it's ugly but I have no suggestions. Hm does it have to send using the
>>> busy loop? It couldn't just queue the request and then wait for 
>>> reply if
>>> disable message was emitted?
>>>
>> I agree that the above code lacks readability - will see if i can 
>> break it down to smaller
>> functions with cleaner in-function lock/unlock pairs. I agree that a 
>> little code duplication
>> is better than less readable code. It was inherited code i didn't 
>> want to modify but I'll
>> go ahead and refactor this.
>>
>> On the busy loop - im assuming you are refering to the actual ct 
>> sending. I'll consult my
>> team if i am missing anything more but based on comments, I believe 
>> callers must use that
>> function to guarantee reservation of space in the G2H CTB to always 
>> have space to capture
>> responses for actions that MUST be acknowledged from GuC 
>> (acknowledged by either replying
>> with a success or failure). This is necessary for coherent guc-id 
>> state machine (because the
>> GuC firmware will drop requests for guc-id's that are not registered 
>> or not in a
>> 'sched-enabled' state).
>
> Maybe to explain what I meant a bit better. I thought that the reason 
> for strange unlock pattern is the with_runtime_pm which needs to 
> happen for the CT send/receive loop. What I meant was would it be 
> possible to do it like this:
>
> state lock
> ...
> sent = queue_msg_2_guc (this would be i915 only, no runtime pm needed)
> ...
> state unlock
>
> if (sent)
>   with_runtime_pm:
>      send/receive queued guc messages (only this would talk to guc)
>
> But I have truly no idea if the CT messaging infrastructure supports 
> something like this.
>
> Anyway, see what it is possible and if it is not or too hard for now 
> leave it be.
>
>>>> -    guc_id = prep_context_pending_disable(ce);
>>>> +}
>>>>    -    spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>>> +static void guc_flush_all_delayed_disable_sched_contexts(struct 
>>>> intel_guc *guc)
>>>> +{
>>>> +    struct intel_context *ce;
>>>> +    unsigned long index;
>>>> +    unsigned long flags;
>>>> +    unsigned long ceflags;
>>>>    -    with_intel_runtime_pm(runtime_pm, wakeref)
>>>> -        __guc_context_sched_disable(guc, ce, guc_id);
>>>> +    xa_lock_irqsave(&guc->context_lookup, flags);
>>>> +    xa_for_each(&guc->context_lookup, index, ce) {
>>>> +        if (!kref_get_unless_zero(&ce->ref))
>>>> +            continue;
>>>> +        xa_unlock(&guc->context_lookup);
>>>
>>> So this whole loop _needs_ to run with interrupts disabled? Explaining
>>> why in a comment would be good.
>>>
>> Being mid-reset, the locking mode is consistent with other functions 
>> also used
>> as part of the reset preparation that parses and potentially modifies 
>> contexts.
>> I believe the goal is to handle all of this parsing without getting 
>> conflicting
>> latent G2H replies that breaks the preparation flow (that herds 
>> active contexts
>> into a fewer set of states as part of reset) - but i will double check
>> with my colleagues.
>>
>>>> +        if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
>>>> + cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
>>>> +            spin_lock_irqsave(&ce->guc_state.lock, ceflags);
>>>> + spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
>>>
>>> This deserves a comment about what lock toggling wants to ensure.
>>>
>> I realize this might have been my local rebasing mistake, the 
>> intention was to
>> handle cases where sched_disable_delay wasn't pending but potentially 
>> still
>> executing do_sched_disable. I believe I could try 
>> cancel_delayed_work_sync (but
>> not sure if i can call that might-sleep funtion mid reset while not-
>> interruptible). Else, i would move that lock-unlock to if the
>> cancel_delayed_work did not return true (as per original intent 
>> before my
>> rebase error).
>
> Okay a comment like "flush any currently executing do_sched_disable" 
> above the lock toggling would do then. Even better if you add what 
> happens if that flushing isn't done.
>
>>
>>> Also, if the loops runs with interrupts disabled what is the point of
>>> irqsave variant in here??
>> Yes - its redundant, let me fix that, apologies for that.
>>
>>>
>>> Also2, what is the reason for dropping the lock? intel_context_put?
>> Being consistent with other reset preparation code that closes contexts,
>> the lock is dropped before the intel_context_put.
>> (I hope i am not misunderstanding your question).
>
> Right, okay.. so for locking inversion issues - intel_context_put must 
> not be called with guc_state.lock held, yes?
>
> Not a mandatory request, but if you want you could look into the 
> option of avoiding lock dropping and instead doing xa_erase and 
> recording the list of contexts for sched disable or put in a local 
> list, and doing them all in one block outside the locked/irq disabled 
> section. Although tbh I am not sure if that would improve anything. 
> Probably not in this case of a reset path, but if there are other 
> places in GuC code which do this it may be beneficial for less 
> hammering on the CPU and core synchronisation for atomics.
>
>>>
>>>> +    /*
>>>> +     * If the context gets closed while the execbuf is ongoing, 
>>>> the context
>>>> +     * close code will race with the below code to cancel the 
>>>> delayed work.
>>>> +     * If the context close wins the race and cancels the work, it 
>>>> will
>>>> +     * immediately call the sched disable (see guc_context_close), 
>>>> so there
>>>> +     * is a chance we can get past this check while the 
>>>> sched_disable code
>>>> +     * is being executed. To make sure that code completes before 
>>>> we check
>>>> +     * the status further down, we wait for the close process to 
>>>> complete.
>>>> +     */
>>>> +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
>>>> +        intel_context_sched_disable_unpin(ce);
>>>> +    else if (intel_context_is_closed(ce))
>>>> +        wait_for(context_close_done(ce), 1);
>>>
>>> Comment makes it sounds important to handle the race, althought it
>>> doesn't really explain the consequences. But most importantly, what if
>>> close doesn't complete in 1ms?
>>
>> will add the consequence (i believe the consequence is that we could 
>> prematurely
>> read context flags bit indicating its gucid is still registered and 
>> after skipping
>> re-registration, find that context gets closed and guc-id freed).
>>
>> Yes the 1 second is arbitrary and unnervingly short. Just spent 
>> sometime trying to
>
> One millisecond even. :)

Normally 1ms is not a slow time for this. We can only hit the wait if 
the context_close call has already called the cancel_delayed_work, so 
the only thing left to do in that function is to send the H2G, which is 
relatively fast. However, I've just realized that if the H2G buffer is 
full the H2G will stall, so in that case it can take longer (maximum 
stall time before a hang is declared is 1.5s).

>
> What would be the consequence of prematurely reading the still 
> registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
> Something else?

i915 thinks that a request is pending with the GuC, while GuC thinks we 
disabled it (because the completion of the delayed_disable happens after 
the new request has been submitted). The heartbeat will still go 
through, so no reset will be triggered, the request is just lost. A new 
request submission on the same context should be able to recover it, but 
we didn't test that.


>
> And why cant' this race with context close happen at any other point 
> than this particular spot in guc_request_alloc? Immediately after the 
> added checks? After atomic_add_unless?

The race is tied to the canceling of the delayed work. context_close 
only does something if it cancels the delayed work, so if the 
cancel_delayed_work_sync here does the cancelling then context_close is 
a no-op.

>
>> figure out portions of the SCHED_foo state machine bits and believe 
>> that its possible
>> for guc_request_alloc to just force context_close to be done from 
>> here as it would
>> force it into a state requiring re-registration and would close that 
>> a few lines
>> below. I will however verify with my team mates as i am new to these 
>> SCHED_foo state
>> machine bits.

I'm not sure we want to force context_close unconditionally here, that'd 
be a big overhead. Doing it only if there is a close pending is also 
potentially an issue; the whole point is that the close can race in. The 
close also has to work on its own, because in the normal use-cases we 
don't get a context_close while a request submission is ongoing.
Unless you meant something different and I completely misunderstood.

Daniele

>
> Yes it always looked complicated and perhaps it has even grown more so 
> - I'm afraid I cannot be of any help there.
>
> Regards,
>
> Tvrtko
Teres Alexis, Alan Previn Sept. 20, 2022, 6:22 a.m. UTC | #5
On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
> 
> On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
> > On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> > > On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> > > > On 15/09/2022 03:12, Alan Previn wrote:
> > > > > From: Matthew Brost <matthew.brost@intel.com>
> > > > > 
> > > > > +static void guc_context_sched_disable(struct intel_context *ce)
> > > > > +{
> > > > > +    struct intel_guc *guc = ce_to_guc(ce);
> > > > > +    u64 delay = guc->submission_state.sched_disable_delay_ms;
> > > > > +    unsigned long flags;
> > > > > +
> > > > > +    spin_lock_irqsave(&ce->guc_state.lock, flags);
> > > > > +
> > > > > +    if (bypass_sched_disable(guc, ce)) {
> > > > > +        spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > > +        intel_context_sched_disable_unpin(ce);
> > > > > +    } else if (!intel_context_is_closed(ce) && 
> > > > > !guc_id_pressure(guc, ce) &&
> > > > > +           delay) {
> > > > > spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > > -        goto unpin;
> > > > > +        mod_delayed_work(system_unbound_wq,
> > > > > +                 &ce->guc_state.sched_disable_delay,
> > > > > +                 msecs_to_jiffies(delay));
> > > > > +    } else {
> > > > > +        do_sched_disable(guc, ce, flags);
> > > > >        }
> > > > 
> > > > lock
> > > > if
> > > >     unlock
> > > >     do stuff
> > > > else if
> > > >     unlock
> > > >     do stuff
> > > > else
> > > >     do_sched_disable - which unlocks inside
> > > > 
> > > > IMO it creates less readable code for the benefit of not repeating
> > > > with_intel_runtime_pm -> __guc_context_sched_disable two times. Dunno..
> > > > it's ugly but I have no suggestions. Hm does it have to send using the
> > > > busy loop? It couldn't just queue the request and then wait for 
> > > > reply if
> > > > disable message was emitted?
> > > > 
> > > I agree that the above code lacks readability - will see if i can 
> > > break it down to smaller
> > > functions with cleaner in-function lock/unlock pairs. I agree that a 
> > > little code duplication
> > > is better than less readable code. It was inherited code i didn't 
> > > want to modify but I'll
> > > go ahead and refactor this.
> > > 
> > > On the busy loop - im assuming you are refering to the actual ct 
> > > sending. I'll consult my
> > > team if i am missing anything more but based on comments, I believe 
> > > callers must use that
> > > function to guarantee reservation of space in the G2H CTB to always 
> > > have space to capture
> > > responses for actions that MUST be acknowledged from GuC 
> > > (acknowledged by either replying
> > > with a success or failure). This is necessary for coherent guc-id 
> > > state machine (because the
> > > GuC firmware will drop requests for guc-id's that are not registered 
> > > or not in a
> > > 'sched-enabled' state).
> > 
> > Maybe to explain what I meant a bit better. I thought that the reason 
> > for strange unlock pattern is the with_runtime_pm which needs to 
> > happen for the CT send/receive loop. What I meant was would it be 
> > possible to do it like this:
> > 
Alan: I assumed the strange lock-unlock was there for no other reason than to ensure that,
once the delayed task begins executing, we don't let go of the lock until after
we complete the call to prep_context_pending_disable (which modifies the state machine
for the context's guc-id). And, as we expect, the runtime_pm is about trying to send
the H2G signal to the GuC, which touches registers (i.e. the lock was for the ce guc-id
state and the power was for GuC interaction via CT). If my assumptions are correct, then I
can refactor the functions a bit to remove that weird unlock flow while keeping the
lock held from the start until prep_context_pending_disable; else I could
still go ahead and do that (with some minimal duplication).
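
If that assumption holds, one possible shape for a shared helper is sketched below: the lock is held from the bypass check through prep_context_pending_disable() and only dropped for the H2G send. The helper name is made up for illustration and this is not the final refactor:

static void guc_context_sched_disable_locked_send(struct intel_guc *guc,
                                                  struct intel_context *ce)
{
        struct intel_runtime_pm *runtime_pm = &ce->engine->gt->i915->runtime_pm;
        intel_wakeref_t wakeref;
        unsigned long flags;
        u16 guc_id;

        spin_lock_irqsave(&ce->guc_state.lock, flags);

        if (bypass_sched_disable(guc, ce)) {
                spin_unlock_irqrestore(&ce->guc_state.lock, flags);
                intel_context_sched_disable_unpin(ce);
                return;
        }

        /* lock stays held across the guc-id state machine update */
        guc_id = prep_context_pending_disable(ce);
        spin_unlock_irqrestore(&ce->guc_state.lock, flags);

        /* only the actual H2G send needs the runtime pm wakeref */
        with_intel_runtime_pm(runtime_pm, wakeref)
                __guc_context_sched_disable(guc, ce, guc_id);
}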


> > state lock
> > ...
> > sent = queue_msg_2_guc (this would be i915 only, no runtime pm needed)
> > ...
> > state unlock
> > 
> > if (sent)
> >   with_runtime_pm:
> >      send/receive queued guc messages (only this would talk to guc)
> > 
> > But I have truly no idea if the CT messaging infrastructure supports 
> > something like this.
> > 
Alan: I'll look around and see if we have something that provides that flow,
but I don't think I should pursue such a redesign as part of this series
if none exists.
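
For reference, the split Tvrtko describes would look roughly like the sketch below; note that queue_sched_disable_h2g() and flush_queued_h2g() are purely hypothetical names, no such helpers exist in the current CT layer:

static void guc_context_sched_disable_deferred_send(struct intel_guc *guc,
                                                    struct intel_context *ce)
{
        struct intel_runtime_pm *runtime_pm = &ce->engine->gt->i915->runtime_pm;
        intel_wakeref_t wakeref;
        unsigned long flags;
        bool sent;

        spin_lock_irqsave(&ce->guc_state.lock, flags);
        /* i915-only bookkeeping and message build, no register access here */
        sent = queue_sched_disable_h2g(guc, ce);
        spin_unlock_irqrestore(&ce->guc_state.lock, flags);

        if (sent) {
                /* only this step talks to the GuC and needs the wakeref */
                with_intel_runtime_pm(runtime_pm, wakeref)
                        flush_queued_h2g(guc);
        }
}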


> > Anyway, see what it is possible and if it is not or too hard for now 
> > leave it be.
> > 
> > > > > -    guc_id = prep_context_pending_disable(ce);
> > > > > +}
> > > > >    -    spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > > > +static void guc_flush_all_delayed_disable_sched_contexts(struct 
> > > > > intel_guc *guc)
> > > > > +{
> > > > > +    struct intel_context *ce;
> > > > > +    unsigned long index;
> > > > > +    unsigned long flags;
> > > > > +    unsigned long ceflags;
> > > > >    -    with_intel_runtime_pm(runtime_pm, wakeref)
> > > > > -        __guc_context_sched_disable(guc, ce, guc_id);
> > > > > +    xa_lock_irqsave(&guc->context_lookup, flags);
> > > > > +    xa_for_each(&guc->context_lookup, index, ce) {
> > > > > +        if (!kref_get_unless_zero(&ce->ref))
> > > > > +            continue;
> > > > > +        xa_unlock(&guc->context_lookup);
> > > > 
> > > > So this whole loop _needs_ to run with interrupts disabled? Explaining
> > > > why in a comment would be good.
> > > > 
> > > Being mid-reset, the locking mode is consistent with other functions 
> > > also used
> > > as part of the reset preparation that parses and potentially modifies 
> > > contexts.
> > > I believe the goal is to handle all of this parsing without getting 
> > > conflicting
> > > latent G2H replies that breaks the preparation flow (that herds 
> > > active contexts
> > > into a fewer set of states as part of reset) - but i will double check
> > > with my colleagues.
> > > 
> > > > > +        if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> > > > > + cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> > > > > +            spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> > > > > + spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
> > > > 
> > > > This deserves a comment about what lock toggling wants to ensure.
> > > > 
> > > I realize this might have been my local rebasing mistake, the 
> > > intention was to
> > > handle cases where sched_disable_delay wasn't pending but potentially 
> > > still
> > > executing do_sched_disable. I believe I could try 
> > > cancel_delayed_work_sync (but
> > > not sure if i can call that might-sleep funtion mid reset while not-
> > > interruptible). Else, i would move that lock-unlock to if the
> > > cancel_delayed_work did not return true (as per original intent 
> > > before my
> > > rebase error).
> > 
> > Okay a comment like "flush any currently executing do_sched_disable" 
> > above the lock toggling would do then. Even better if you add what 
> > happens if that flushing isn't done.
> > 
got it. will do.


> > > > Also, if the loops runs with interrupts disabled what is the point of
> > > > irqsave variant in here??
> > > Yes - its redundant, let me fix that, apologies for that.
> > > 
> > > > Also2, what is the reason for dropping the lock? intel_context_put?
> > > Being consistent with other reset preparation code that closes contexts,
> > > the lock is dropped before the intel_context_put.
> > > (I hope i am not misunderstanding your question).
> > 
> > Right, okay.. so for locking inversion issues - intel_context_put must 
> > not be called with guc_state.lock held, yes?
> > 
> > Not a mandatory request, but if you want you could look into the 
> > option of avoiding lock dropping and instead doing xa_erase and 
> > recording the list of contexts for sched disable or put in a local 
> > list, and doing them all in one block outside the locked/irq disabled 
> > section. Although tbh I am not sure if that would improve anything. 
> > Probably not in this case of a reset path, but if there are other 
> > places in GuC code which do this it may be beneficial for less 
> > hammering on the CPU and core synchronisation for atomics.
> > 
> > > > > +    /*
> > > > > +     * If the context gets closed while the execbuf is ongoing, 
> > > > > the context
> > > > > +     * close code will race with the below code to cancel the 
> > > > > delayed work.
> > > > > +     * If the context close wins the race and cancels the work, it 
> > > > > will
> > > > > +     * immediately call the sched disable (see guc_context_close), 
> > > > > so there
> > > > > +     * is a chance we can get past this check while the 
> > > > > sched_disable code
> > > > > +     * is being executed. To make sure that code completes before 
> > > > > we check
> > > > > +     * the status further down, we wait for the close process to 
> > > > > complete.
> > > > > +     */
> > > > > +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > > > > +        intel_context_sched_disable_unpin(ce);
> > > > > +    else if (intel_context_is_closed(ce))
> > > > > +        wait_for(context_close_done(ce), 1);
> > > > 
> > > > Comment makes it sounds important to handle the race, althought it
> > > > doesn't really explain the consequences. But most importantly, what if
> > > > close doesn't complete in 1ms?
> > > 
> > > will add the consequence (i believe the consequence is that we could 
> > > prematurely
> > > read context flags bit indicating its gucid is still registered and 
> > > after skipping
> > > re-registration, find that context gets closed and guc-id freed).
> > > 
> > > Yes the 1 second is arbitrary and unnervingly short. Just spent 
> > > sometime trying to
> > 
> > One millisecond even. :)
> 
hehe - typo :)

> Normally 1ms is not a slow time for this. We can only hit the wait if 
> the context_close call has already called the cancel_delayed_work, so 
> the only thing left to do in that function is to send the H2G, which is 
> relatively fast. However, I've just realized that if the H2G buffer is 
> full the H2G will stall, so in that case it can take longer (maximum 
> stall time before a hang is declared is 1.5s).
> 
Daniele, so increase to 1500 milliseconds?

> > What would be the consequence of prematurely reading the still 
> > registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
> > Something else?
> 
> i915 thinks that a request pending with the GuC, while GuC thinks we 
> disabled it (because the completion of the delayed_disable happens after 
> the new request has been submitted). The heartbeat will still go 
> through, so no reset will be triggered, the request is just lost. A new 
> request submission on the same context should be able to recover it, but 
> we didn't test that.
> 
> 
> > And why cant' this race with context close happen at any other point 
> > than this particular spot in guc_request_alloc? Immediately after the 
> > added checks? After atomic_add_unless?
> 
> The race is tied to the canceling of the delayed work. context_close 
> only does something if it cancels the delayed work, so if the 
> cancel_delayed_work_sync here does the cancelling then context_close is 
> a no-op.
> 
> > > figure out portions of the SCHED_foo state machine bits and believe 
> > > that its possible
> > > for guc_request_alloc to just force context_close to be done from 
> > > here as it would
> > > force it into a state requiring re-registration and would close that 
> > > a few lines
> > > below. I will however verify with my team mates as i am new to these 
> > > SCHED_foo state
> > > machine bits.
> 
> I'm not sure we want to force context_close unconditionally here, that'd 
> be a big overhead. Doing it only if there is a close pending is also 
> potentially an issue, the whole point is that the close can race in. The 
> close also has to work on its own, because in the normal use-cases we 
> don't get a context_close while a request submission is ongoing.
> Unless you meant something different and I completely misunderstood.
> 
> Daniele
> 
> > Yes it always looked complicated and perhaps it has even grown more so 
> > - I'm afraid I cannot be of any help there.
> > 
> > Regards,
> > 
> > Tvrtko
Tvrtko Ursulin Sept. 20, 2022, 2:33 p.m. UTC | #6
On 16/09/2022 16:36, Ceraolo Spurio, Daniele wrote:
> 
> 
> On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
>>
>> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:

[snip]

>>>>
>>>>> +    /*
>>>>> +     * If the context gets closed while the execbuf is ongoing, 
>>>>> the context
>>>>> +     * close code will race with the below code to cancel the 
>>>>> delayed work.
>>>>> +     * If the context close wins the race and cancels the work, it 
>>>>> will
>>>>> +     * immediately call the sched disable (see guc_context_close), 
>>>>> so there
>>>>> +     * is a chance we can get past this check while the 
>>>>> sched_disable code
>>>>> +     * is being executed. To make sure that code completes before 
>>>>> we check
>>>>> +     * the status further down, we wait for the close process to 
>>>>> complete.
>>>>> +     */
>>>>> +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
>>>>> +        intel_context_sched_disable_unpin(ce);
>>>>> +    else if (intel_context_is_closed(ce))
>>>>> +        wait_for(context_close_done(ce), 1);
>>>>
>>>> Comment makes it sounds important to handle the race, althought it
>>>> doesn't really explain the consequences. But most importantly, what if
>>>> close doesn't complete in 1ms?
>>>
>>> will add the consequence (i believe the consequence is that we could 
>>> prematurely
>>> read context flags bit indicating its gucid is still registered and 
>>> after skipping
>>> re-registration, find that context gets closed and guc-id freed).
>>>
>>> Yes the 1 second is arbitrary and unnervingly short. Just spent 
>>> sometime trying to
>>
>> One millisecond even. :)
> 
> Normally 1ms is not a slow time for this. We can only hit the wait if 
> the context_close call has already called the cancel_delayed_work, so 
> the only thing left to do in that function is to send the H2G, which is 
> relatively fast. However, I've just realized that if the H2G buffer is 
> full the H2G will stall, so in that case it can take longer (maximum 
> stall time before a hang is declared is 1.5s).
> 
>>
>> What would be the consequence of prematurely reading the still 
>> registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
>> Something else?
> 
> i915 thinks that a request pending with the GuC, while GuC thinks we 
> disabled it (because the completion of the delayed_disable happens after 
> the new request has been submitted). The heartbeat will still go 
> through, so no reset will be triggered, the request is just lost. A new 
> request submission on the same context should be able to recover it, but 
> we didn't test that.
> 
> 
>>
>> And why cant' this race with context close happen at any other point 
>> than this particular spot in guc_request_alloc? Immediately after the 
>> added checks? After atomic_add_unless?
> 
> The race is tied to the canceling of the delayed work. context_close 
> only does something if it cancels the delayed work, so if the 
> cancel_delayed_work_sync here does the cancelling then context_close is 
> a no-op.

It's a bit involved* to follow so I'll give up and declare a "whatever" if that's okay with you.

*)
intel_context_close
   set_bit(CONTEXT_CLOSED_BIT, &ce->flags)
   ->guc_context_close
	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
		  ((which is:
			spin_lock_irqsave(&ce->guc_state.lock, flags);
			..stuff..
			unlock
		  ))
  
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	set_context_close_done(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

Takes and releases the same lock two times so I have no idea why twice, and less so whether it is safe and race free.

Regards,

Tvrtko
Teres Alexis, Alan Previn Sept. 21, 2022, 7:59 a.m. UTC | #7
On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
> 
> 
alan: [snip]
> > > > > +    /*
> > > > > +     * If the context gets closed while the execbuf is ongoing, 
> > > > > the context
> > > > > +     * close code will race with the below code to cancel the 
> > > > > delayed work.
> > > > > +     * If the context close wins the race and cancels the work, it 
> > > > > will
> > > > > +     * immediately call the sched disable (see guc_context_close), 
> > > > > so there
> > > > > +     * is a chance we can get past this check while the 
> > > > > sched_disable code
> > > > > +     * is being executed. To make sure that code completes before 
> > > > > we check
> > > > > +     * the status further down, we wait for the close process to 
> > > > > complete.
> > > > > +     */
> > > > > +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > > > > +        intel_context_sched_disable_unpin(ce);
> > > > > +    else if (intel_context_is_closed(ce))
> > > > > +        wait_for(context_close_done(ce), 1);
> > > > 
> > > > Comment makes it sounds important to handle the race, althought it
> > > > doesn't really explain the consequences. But most importantly, what if
> > > > close doesn't complete in 1ms?
> > > 
> > > will add the consequence (i believe the consequence is that we could 
> > > prematurely
> > > read context flags bit indicating its gucid is still registered and 
> > > after skipping
> > > re-registration, find that context gets closed and guc-id freed).
> > > 
> > > Yes the 1 second is arbitrary and unnervingly short. Just spent 
> > > sometime trying to
> > 
> > One millisecond even. :)
> 
> Normally 1ms is not a slow time for this. We can only hit the wait if 
> the context_close call has already called the cancel_delayed_work, so 
> the only thing left to do in that function is to send the H2G, which is 
> relatively fast. However, I've just realized that if the H2G buffer is 
> full the H2G will stall, so in that case it can take longer (maximum 
> stall time before a hang is declared is 1.5s).
> 
alan: I'll increase to 1.5 secs

> > What would be the consequence of prematurely reading the still 
> > registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
> > Something else?
> 
> i915 thinks that a request pending with the GuC, while GuC thinks we 
> disabled it (because the completion of the delayed_disable happens after 
> the new request has been submitted). The heartbeat will still go 
> through, so no reset will be triggered, the request is just lost. A new 
> request submission on the same context should be able to recover it, but 
> we didn't test that.
> 
> 
> > And why cant' this race with context close happen at any other point 
> > than this particular spot in guc_request_alloc? Immediately after the 
> > added checks? After atomic_add_unless?
> 
> The race is tied to the canceling of the delayed work. context_close 
> only does something if it cancels the delayed work, so if the 
> cancel_delayed_work_sync here does the cancelling then context_close is 
> a no-op.
> 
alan: Then I'll add a warning - especially if, after a 1.5 sec delay, we are still waiting
for that close.
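
In terms of code, the change discussed above could land in guc_request_alloc() roughly as follows; the 1500 ms bound mirrors the CT stall limit Daniele mentions, and the drm_warn() message is just a placeholder:

        if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay)) {
                intel_context_sched_disable_unpin(ce);
        } else if (intel_context_is_closed(ce)) {
                /* bound the wait by the worst-case H2G stall time */
                if (wait_for(context_close_done(ce), 1500))
                        drm_warn(&ce->engine->i915->drm,
                                 "timed out waiting on context sched close before realloc\n");
        }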
Teres Alexis, Alan Previn Sept. 21, 2022, 10:30 p.m. UTC | #8
On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
> 
> On 9/16/2022 1:58 AM, Tvrtko Ursulin wrote:
> > On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> > > On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> > > > On 15/09/2022 03:12, Alan Previn wrote:
> > > > > 
> > > > 
alan: [snip]

> > > > IMO it creates less readable code for the benefit of not repeating
> > > > with_intel_runtime_pm -> __guc_context_sched_disable two times. Dunno..
> > > > it's ugly but I have no suggestions. Hm does it have to send using the
> > > > busy loop? It couldn't just queue the request and then wait for 
> > > > reply if
> > > > disable message was emitted?
> > > > 
> > > I agree that the above code lacks readability - will see if i can 
> > > break it down to smaller
> > > functions with cleaner in-function lock/unlock pairs. I agree that a 
> > > little code duplication
> > > is better than less readable code. It was inherited code i didn't 
> > > want to modify but I'll
> > > go ahead and refactor this.
> > > 
> > > On the busy loop - im assuming you are refering to the actual ct 
> > > sending. I'll consult my
> > > team if i am missing anything more but based on comments, I believe 
> > > callers must use that
> > > function to guarantee reservation of space in the G2H CTB to always 
> > > have space to capture
> > > responses for actions that MUST be acknowledged from GuC 
> > > (acknowledged by either replying
> > > with a success or failure). This is necessary for coherent guc-id 
> > > state machine (because the
> > > GuC firmware will drop requests for guc-id's that are not registered 
> > > or not in a
> > > 'sched-enabled' state).
> 
> > Maybe to explain what I meant a bit better. I thought that the reason 
> > for strange unlock pattern is the with_runtime_pm which needs to 
> > happen for the CT send/receive loop. What I meant was would it be 
> > possible to do it like this:
> > 
> > state lock
> > ...
> > sent = queue_msg_2_guc (this would be i915 only, no runtime pm needed)
> > ...
> > state unlock
> > 
> > if (sent)
> >   with_runtime_pm:
> >      send/receive queued guc messages (only this would talk to guc)
> > 
> > But I have truly no idea if the CT messaging infrastructure supports 
> > something like this.
> > 
> > Anyway, see what it is possible and if it is not or too hard for now 
> > leave it be.
> > 
alan: I consulted with my teammates on the above and they said that this discussion has
happened in the past: the CT infrastructure currently does not support that, and the
benefit is questionable because such an undertaking would be an expensive redesign with
wider impact (changes across all callers). I guess for now I will leave the above code
as it is, since this would be a whole separate feature change on its own.
Teres Alexis, Alan Previn Sept. 21, 2022, 10:30 p.m. UTC | #9
On Fri, 2022-09-16 at 09:58 +0100, Tvrtko Ursulin wrote:
> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> > On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> > > On 15/09/2022 03:12, Alan Previn wrote:
> > > > From: Matthew Brost <matthew.brost@intel.com>
> > > > 
> > > > +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
> > > > +{
> > > > +	struct intel_context *ce;
> > > > +	unsigned long index;
> > > > +	unsigned long flags;
> > > > +	unsigned long ceflags;
> > > >    
> > > > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > > > -		__guc_context_sched_disable(guc, ce, guc_id);
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		if (!kref_get_unless_zero(&ce->ref))
> > > > +			continue;
> > > > +		xa_unlock(&guc->context_lookup);
> > > 
> > > So this whole loop _needs_ to run with interrupts disabled? Explaining
> > > why in a comment would be good.
> > > 
> > Being mid-reset, the locking mode is consistent with other functions also used
> > as part of the reset preparation that parses and potentially modifies contexts.
> > I believe the goal is to handle all of this parsing without getting conflicting
> > latent G2H replies that breaks the preparation flow (that herds active contexts
> > into a fewer set of states as part of reset) - but i will double check
> > with my colleagues.
> > 
Alan: Update: I realize the GuC reset-preparation code disables IRQs for the GuC being reset
prior to this function, so in fact we really ought not to see any change to that xa_array
because of a context-id change coming via an interrupt that is orthogonal to this
thread. I will change to xa_lock.

> > > > +		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> > > > +		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> > > > +			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> > > > +			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
> > > 
> > > This deserves a comment about what lock toggling wants to ensure.
> > > 
> > I realize this might have been my local rebasing mistake, the intention was to
> > handle cases where sched_disable_delay wasn't pending but potentially still
> > executing do_sched_disable. I believe I could try cancel_delayed_work_sync (but
> > not sure if i can call that might-sleep funtion mid reset while not-
> > interruptible). Else, i would move that lock-unlock to if the
> > cancel_delayed_work did not return true (as per original intent before my
> > rebase error).
> 
> Okay a comment like "flush any currently executing do_sched_disable" 
> above the lock toggling would do then. Even better if you add what 
> happens if that flushing isn't done.
> 
> > > Also, if the loops runs with interrupts disabled what is the point of
> > > irqsave variant in here??
> > Yes - its redundant, let me fix that, apologies for that.
> > 
Same thing here: a context's GuC state should not change in response to an IRQ
from that GuC since we disabled it while we are in reset preparation
- so actually "not needed" as opposed to "redundant".
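
Putting the points above together, the flush helper might end up looking roughly like this sketch (assuming, per the above, that GuC interrupts for this GT are already disabled during reset preparation, and with the lock toggle moved to the branch where cancel_delayed_work() did not win, as per the original intent):

static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
{
        struct intel_context *ce;
        unsigned long index;

        xa_lock(&guc->context_lookup);
        xa_for_each(&guc->context_lookup, index, ce) {
                if (!kref_get_unless_zero(&ce->ref))
                        continue;
                xa_unlock(&guc->context_lookup);

                if (test_bit(CONTEXT_GUC_INIT, &ce->flags)) {
                        if (cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
                                /* the delayed work never ran, drop the pin it held */
                                intel_context_sched_disable_unpin(ce);
                        } else {
                                /*
                                 * Flush any do_sched_disable() that may still be
                                 * running for this context so the rest of the reset
                                 * preparation sees a settled guc-id state (harmless
                                 * uncontended lock/unlock if no work was ever queued).
                                 */
                                spin_lock(&ce->guc_state.lock);
                                spin_unlock(&ce->guc_state.lock);
                        }
                }

                intel_context_put(ce);
                xa_lock(&guc->context_lookup);
        }
        xa_unlock(&guc->context_lookup);
}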

> > > Also2, what is the reason for dropping the lock? intel_context_put?
> > Being consistent with other reset preparation code that closes contexts,
> > the lock is dropped before the intel_context_put.
> > (I hope i am not misunderstanding your question).
> 
> Right, okay.. so for locking inversion issues - intel_context_put must 
> not be called with guc_state.lock held, yes?
> 
> Not a mandatory request, but if you want you could look into the option 
> of avoiding lock dropping and instead doing xa_erase and recording the 
> list of contexts for sched disable or put in a local list, and doing 
> them all in one block outside the locked/irq disabled section. Although 
> tbh I am not sure if that would improve anything. Probably not in this 
> case of a reset path, but if there are other places in GuC code which do 
> this it may be beneficial for less hammering on the CPU and core 
> synchronisation for atomics.
> 
Apologies for my ignorance - perhaps I misunderstood how these functions work, but
I assumed that a non-zero return from kref_get_unless_zero that led us
here meant there is still a ref on the context that didn't come from the
reset path, so I am just following the correct rules to release that ref
and not destroy the context yet - leaving it in the pending-disable state
that will be handled in scrub_guc_desc_for_outstanding_g2h, which gets called
later in the reset preparation.

Actually, I realize the better option might be to move the above code
into the loop already present inside scrub_guc_desc_for_outstanding_g2h,
just prior to its check of whether a context is pending-disable.
That would ensure we take all these context locks once, in the same function,
when herding all the possible states while scrubbing those outstanding G2H.
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dabdfe09f5e5..df7fd1b019ec 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1454,7 +1454,7 @@  static void engines_idle_release(struct i915_gem_context *ctx,
 		int err;
 
 		/* serialises with execbuf */
-		set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
+		intel_context_close(ce);
 		if (!intel_context_pin_if_active(ce))
 			continue;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 8e2d70630c49..f96420f0b5bb 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -276,6 +276,14 @@  static inline bool intel_context_is_barrier(const struct intel_context *ce)
 	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
 }
 
+static inline void intel_context_close(struct intel_context *ce)
+{
+	set_bit(CONTEXT_CLOSED_BIT, &ce->flags);
+
+	if (ce->ops->close)
+		ce->ops->close(ce);
+}
+
 static inline bool intel_context_is_closed(const struct intel_context *ce)
 {
 	return test_bit(CONTEXT_CLOSED_BIT, &ce->flags);
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 04eacae1aca5..86ac84e2edb9 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -43,6 +43,8 @@  struct intel_context_ops {
 	void (*revoke)(struct intel_context *ce, struct i915_request *rq,
 		       unsigned int preempt_timeout_ms);
 
+	void (*close)(struct intel_context *ce);
+
 	int (*pre_pin)(struct intel_context *ce, struct i915_gem_ww_ctx *ww, void **vaddr);
 	int (*pin)(struct intel_context *ce, void *vaddr);
 	void (*unpin)(struct intel_context *ce);
@@ -208,6 +210,11 @@  struct intel_context {
 		 * each priority bucket
 		 */
 		u32 prio_count[GUC_CLIENT_PRIORITY_NUM];
+		/**
+		 * @sched_disable_delay: worker to disable scheduling on this
+		 * context
+		 */
+		struct delayed_work sched_disable_delay;
 	} guc_state;
 
 	struct {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 804133df1ac9..12be811181b3 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -112,6 +112,10 @@  struct intel_guc {
 		 * refs
 		 */
 		struct list_head guc_id_list;
+		/**
+		 * @guc_ids_in_use: Number of single-lrc guc_ids in use
+		 */
+		u16 guc_ids_in_use;
 		/**
 		 * @destroyed_contexts: list of contexts waiting to be destroyed
 		 * (deregistered with the GuC)
@@ -132,6 +136,16 @@  struct intel_guc {
 		 * @reset_fail_mask: mask of engines that failed to reset
 		 */
 		intel_engine_mask_t reset_fail_mask;
+		/**
+		 * @sched_disable_delay_ms: schedule disable delay, in ms, for
+		 * contexts
+		 */
+		u64 sched_disable_delay_ms;
+		/**
+		 * @sched_disable_gucid_threshold: threshold of min remaining available
+		 * guc_ids before we start bypassing the schedule disable delay
+		 */
+		int sched_disable_gucid_threshold;
 	} submission_state;
 
 	/**
@@ -466,4 +480,6 @@  void intel_guc_write_barrier(struct intel_guc *guc);
 
 void intel_guc_dump_time_info(struct intel_guc *guc, struct drm_printer *p);
 
+int intel_guc_sched_disable_gucid_threshold_max(struct intel_guc *guc);
+
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
index 25f09a420561..c91b150bb7ac 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_debugfs.c
@@ -71,12 +71,72 @@  static bool intel_eval_slpc_support(void *data)
 	return intel_guc_slpc_is_used(guc);
 }
 
+static int guc_sched_disable_delay_ms_get(void *data, u64 *val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	*val = guc->submission_state.sched_disable_delay_ms;
+
+	return 0;
+}
+
+static int guc_sched_disable_delay_ms_set(void *data, u64 val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	guc->submission_state.sched_disable_delay_ms = val;
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_delay_ms_fops,
+			guc_sched_disable_delay_ms_get,
+			guc_sched_disable_delay_ms_set, "%lld\n");
+
+static int guc_sched_disable_gucid_threshold_get(void *data, u64 *val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	*val = guc->submission_state.sched_disable_gucid_threshold;
+	return 0;
+}
+
+static int guc_sched_disable_gucid_threshold_set(void *data, u64 val)
+{
+	struct intel_guc *guc = data;
+
+	if (!intel_guc_submission_is_used(guc))
+		return -ENODEV;
+
+	if (val > intel_guc_sched_disable_gucid_threshold_max(guc))
+		guc->submission_state.sched_disable_gucid_threshold =
+			intel_guc_sched_disable_gucid_threshold_max(guc);
+	else
+		guc->submission_state.sched_disable_gucid_threshold = val;
+
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(guc_sched_disable_gucid_threshold_fops,
+			guc_sched_disable_gucid_threshold_get,
+			guc_sched_disable_gucid_threshold_set, "%lld\n");
+
 void intel_guc_debugfs_register(struct intel_guc *guc, struct dentry *root)
 {
 	static const struct intel_gt_debugfs_file files[] = {
 		{ "guc_info", &guc_info_fops, NULL },
 		{ "guc_registered_contexts", &guc_registered_contexts_fops, NULL },
 		{ "guc_slpc_info", &guc_slpc_info_fops, &intel_eval_slpc_support},
+		{ "guc_sched_disable_delay_ms", &guc_sched_disable_delay_ms_fops, NULL },
+		{ "guc_sched_disable_gucid_threshold", &guc_sched_disable_gucid_threshold_fops,
+		   NULL },
 	};
 
 	if (!intel_guc_is_supported(guc))
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 22ba66e48a9b..29793972c39e 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -65,7 +65,13 @@ 
  * corresponding G2H returns indicating the scheduling disable operation has
  * completed it is safe to unpin the context. While a disable is in flight it
  * isn't safe to resubmit the context so a fence is used to stall all future
- * requests of that context until the G2H is returned.
+ * requests of that context until the G2H is returned. Because this interaction
+ * with the GuC takes a non-zero amount of time we delay the disabling of
+ * scheduling after the pin count goes to zero by a configurable period of time
+ * (see SCHED_DISABLE_DELAY_MS). The thought is this gives the user a window of
+ * time to resubmit something on the context before doing this costly operation.
+ * This delay is only done if the context isn't closed and the guc_id usage is
+ * less than a threshold (see NUM_SCHED_DISABLE_GUC_IDS_THRESHOLD).
  *
  * Context deregistration:
  * Before a context can be destroyed or if we steal its guc_id we must
@@ -163,7 +169,8 @@  guc_create_parallel(struct intel_engine_cs **engines,
 #define SCHED_STATE_PENDING_ENABLE			BIT(5)
 #define SCHED_STATE_REGISTERED				BIT(6)
 #define SCHED_STATE_POLICY_REQUIRED			BIT(7)
-#define SCHED_STATE_BLOCKED_SHIFT			8
+#define SCHED_STATE_CLOSED				BIT(8)
+#define SCHED_STATE_BLOCKED_SHIFT			9
 #define SCHED_STATE_BLOCKED		BIT(SCHED_STATE_BLOCKED_SHIFT)
 #define SCHED_STATE_BLOCKED_MASK	(0xfff << SCHED_STATE_BLOCKED_SHIFT)
 
@@ -173,12 +180,20 @@  static inline void init_sched_state(struct intel_context *ce)
 	ce->guc_state.sched_state &= SCHED_STATE_BLOCKED_MASK;
 }
 
+/*
+ * Kernel contexts can have SCHED_STATE_REGISTERED after suspend.
+ * A context close can race with the submission path, so SCHED_STATE_CLOSED
+ * can be set immediately before we try to register.
+ */
+#define SCHED_STATE_VALID_INIT \
+	(SCHED_STATE_BLOCKED_MASK | \
+	 SCHED_STATE_CLOSED | \
+	 SCHED_STATE_REGISTERED)
+
 __maybe_unused
 static bool sched_state_is_init(struct intel_context *ce)
 {
-	/* Kernel contexts can have SCHED_STATE_REGISTERED after suspend. */
-	return !(ce->guc_state.sched_state &
-		 ~(SCHED_STATE_BLOCKED_MASK | SCHED_STATE_REGISTERED));
+	return !(ce->guc_state.sched_state & ~SCHED_STATE_VALID_INIT);
 }
 
 static inline bool
@@ -319,6 +334,17 @@  static inline void clr_context_policy_required(struct intel_context *ce)
 	ce->guc_state.sched_state &= ~SCHED_STATE_POLICY_REQUIRED;
 }
 
+static inline bool context_close_done(struct intel_context *ce)
+{
+	return ce->guc_state.sched_state & SCHED_STATE_CLOSED;
+}
+
+static inline void set_context_close_done(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state |= SCHED_STATE_CLOSED;
+}
+
 static inline u32 context_blocked(struct intel_context *ce)
 {
 	return (ce->guc_state.sched_state & SCHED_STATE_BLOCKED_MASK) >>
@@ -1523,6 +1549,7 @@  static void guc_flush_submissions(struct intel_guc *guc)
 }
 
 static void guc_flush_destroyed_contexts(struct intel_guc *guc);
+static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc);
 
 void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 {
@@ -1540,6 +1567,7 @@  void intel_guc_submission_reset_prepare(struct intel_guc *guc)
 	spin_lock_irq(guc_to_gt(guc)->irq_lock);
 	spin_unlock_irq(guc_to_gt(guc)->irq_lock);
 
+	guc_flush_all_delayed_disable_sched_contexts(guc);
 	guc_flush_submissions(guc);
 	guc_flush_destroyed_contexts(guc);
 	flush_work(&guc->ct.requests.worker);
@@ -1994,6 +2022,9 @@  static int new_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	if (unlikely(ret < 0))
 		return ret;
 
+	if (!intel_context_is_parent(ce))
+		++guc->submission_state.guc_ids_in_use;
+
 	ce->guc_id.id = ret;
 	return 0;
 }
@@ -2003,14 +2034,16 @@  static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce)
 	GEM_BUG_ON(intel_context_is_child(ce));
 
 	if (!context_guc_id_invalid(ce)) {
-		if (intel_context_is_parent(ce))
+		if (intel_context_is_parent(ce)) {
 			bitmap_release_region(guc->submission_state.guc_ids_bitmap,
 					      ce->guc_id.id,
 					      order_base_2(ce->parallel.number_children
 							   + 1));
-		else
+		} else {
+			--guc->submission_state.guc_ids_in_use;
 			ida_simple_remove(&guc->submission_state.guc_ids,
 					  ce->guc_id.id);
+		}
 		clr_ctx_id_mapping(guc, ce->guc_id.id);
 		set_context_guc_id_invalid(ce);
 	}
@@ -2998,41 +3031,130 @@  guc_context_revoke(struct intel_context *ce, struct i915_request *rq,
 	}
 }
 
-static void guc_context_sched_disable(struct intel_context *ce)
+static void guc_context_sched_disable(struct intel_context *ce);
+
+static void do_sched_disable(struct intel_guc *guc, struct intel_context *ce,
+			     unsigned long flags)
+	__releases(ce->guc_state.lock)
 {
-	struct intel_guc *guc = ce_to_guc(ce);
-	unsigned long flags;
 	struct intel_runtime_pm *runtime_pm = &ce->engine->gt->i915->runtime_pm;
 	intel_wakeref_t wakeref;
 	u16 guc_id;
 
+	lockdep_assert_held(&ce->guc_state.lock);
+	guc_id = prep_context_pending_disable(ce);
+
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+
+	with_intel_runtime_pm(runtime_pm, wakeref)
+		__guc_context_sched_disable(guc, ce, guc_id);
+}
+
+static bool bypass_sched_disable(struct intel_guc *guc,
+				 struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
 	GEM_BUG_ON(intel_context_is_child(ce));
 
+	if (submission_disabled(guc) || context_guc_id_invalid(ce) ||
+	    !ctx_id_mapped(guc, ce->guc_id.id)) {
+		clr_context_enabled(ce);
+		return true;
+	}
+
+	return !context_enabled(ce);
+}
+
+static void __delay_sched_disable(struct work_struct *wrk)
+{
+	struct intel_context *ce =
+		container_of(wrk, typeof(*ce), guc_state.sched_disable_delay.work);
+	struct intel_guc *guc = ce_to_guc(ce);
+	unsigned long flags;
+
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
+	if (bypass_sched_disable(guc, ce)) {
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		intel_context_sched_disable_unpin(ce);
+	} else {
+		do_sched_disable(guc, ce, flags);
+	}
+}
+
+static bool guc_id_pressure(struct intel_guc *guc, struct intel_context *ce)
+{
 	/*
-	 * We have to check if the context has been disabled by another thread,
-	 * check if submssion has been disabled to seal a race with reset and
-	 * finally check if any more requests have been committed to the
-	 * context ensursing that a request doesn't slip through the
-	 * 'context_pending_disable' fence.
+	 * parent contexts are perma-pinned, if we are unpinning do schedule
+	 * disable immediately.
 	 */
-	if (unlikely(!context_enabled(ce) || submission_disabled(guc) ||
-		     context_has_committed_requests(ce))) {
-		clr_context_enabled(ce);
+	if (intel_context_is_parent(ce))
+		return true;
+
+	/*
+	 * If we are beyond the threshold for avail guc_ids, do schedule disable immediately.
+	 */
+	return guc->submission_state.guc_ids_in_use >
+		guc->submission_state.sched_disable_gucid_threshold;
+}
+
+static void guc_context_sched_disable(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+	u64 delay = guc->submission_state.sched_disable_delay_ms;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+
+	if (bypass_sched_disable(guc, ce)) {
+		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+		intel_context_sched_disable_unpin(ce);
+	} else if (!intel_context_is_closed(ce) && !guc_id_pressure(guc, ce) &&
+		   delay) {
 		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
-		goto unpin;
+		mod_delayed_work(system_unbound_wq,
+				 &ce->guc_state.sched_disable_delay,
+				 msecs_to_jiffies(delay));
+	} else {
+		do_sched_disable(guc, ce, flags);
 	}
-	guc_id = prep_context_pending_disable(ce);
+}
 
-	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
+{
+	struct intel_context *ce;
+	unsigned long index;
+	unsigned long flags;
+	unsigned long ceflags;
 
-	with_intel_runtime_pm(runtime_pm, wakeref)
-		__guc_context_sched_disable(guc, ce, guc_id);
+	xa_lock_irqsave(&guc->context_lookup, flags);
+	xa_for_each(&guc->context_lookup, index, ce) {
+		if (!kref_get_unless_zero(&ce->ref))
+			continue;
+		xa_unlock(&guc->context_lookup);
+		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
+		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
+			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
+			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
+			intel_context_sched_disable_unpin(ce);
+		}
+		intel_context_put(ce);
+		xa_lock(&guc->context_lookup);
+	}
+	xa_unlock_irqrestore(&guc->context_lookup, flags);
+}
+
+static void guc_context_close(struct intel_context *ce)
+{
+	unsigned long flags;
+
+	if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
+	    cancel_delayed_work(&ce->guc_state.sched_disable_delay))
+		__delay_sched_disable(&ce->guc_state.sched_disable_delay.work);
 
-	return;
-unpin:
-	intel_context_sched_disable_unpin(ce);
+	spin_lock_irqsave(&ce->guc_state.lock, flags);
+	set_context_close_done(ce);
+	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 }
 
 static inline void guc_lrc_desc_unpin(struct intel_context *ce)
@@ -3351,6 +3473,8 @@  static void remove_from_context(struct i915_request *rq)
 static const struct intel_context_ops guc_context_ops = {
 	.alloc = guc_context_alloc,
 
+	.close = guc_context_close,
+
 	.pre_pin = guc_context_pre_pin,
 	.pin = guc_context_pin,
 	.unpin = guc_context_unpin,
@@ -3433,6 +3557,10 @@  static void guc_context_init(struct intel_context *ce)
 	rcu_read_unlock();
 
 	ce->guc_state.prio = map_i915_prio_to_guc_prio(prio);
+
+	INIT_DELAYED_WORK(&ce->guc_state.sched_disable_delay,
+			  __delay_sched_disable);
+
 	set_bit(CONTEXT_GUC_INIT, &ce->flags);
 }
 
@@ -3470,6 +3598,19 @@  static int guc_request_alloc(struct i915_request *rq)
 	if (unlikely(!test_bit(CONTEXT_GUC_INIT, &ce->flags)))
 		guc_context_init(ce);
 
+	/*
+	 * If the context gets closed while the execbuf is ongoing, the context
+	 * close code will race with the below code to cancel the delayed work.
+	 * If the context close wins the race and cancels the work, it will
+	 * immediately call the sched disable (see guc_context_close), so there
+	 * is a chance we can get past this check while the sched_disable code
+	 * is being executed. To make sure that code completes before we check
+	 * the status further down, we wait for the close process to complete.
+	 */
+	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
+		intel_context_sched_disable_unpin(ce);
+	else if (intel_context_is_closed(ce))
+		wait_for(context_close_done(ce), 1);
 	/*
 	 * Call pin_guc_id here rather than in the pinning step as with
 	 * dma_resv, contexts can be repeatedly pinned / unpinned trashing the
@@ -3600,6 +3741,8 @@  static int guc_virtual_context_alloc(struct intel_context *ce)
 static const struct intel_context_ops virtual_guc_context_ops = {
 	.alloc = guc_virtual_context_alloc,
 
+	.close = guc_context_close,
+
 	.pre_pin = guc_virtual_context_pre_pin,
 	.pin = guc_virtual_context_pin,
 	.unpin = guc_virtual_context_unpin,
@@ -3689,6 +3832,8 @@  static void guc_child_context_destroy(struct kref *kref)
 static const struct intel_context_ops virtual_parent_context_ops = {
 	.alloc = guc_virtual_context_alloc,
 
+	.close = guc_context_close,
+
 	.pre_pin = guc_context_pre_pin,
 	.pin = guc_parent_context_pin,
 	.unpin = guc_parent_context_unpin,
@@ -4219,6 +4364,26 @@  static bool __guc_submission_selected(struct intel_guc *guc)
 	return i915->params.enable_guc & ENABLE_GUC_SUBMISSION;
 }
 
+int intel_guc_sched_disable_gucid_threshold_max(struct intel_guc *guc)
+{
+	return guc->submission_state.num_guc_ids - NUMBER_MULTI_LRC_GUC_ID(guc);
+}
+
+/*
+ * This default value of 33 milliseconds (+1 millisecond round up) ensures 30fps or higher
+ * workloads are able to enjoy the latency reduction when delaying the schedule-disable
+ * operation. This matches the 30fps game-render + encode (real world) workload this
+ * knob was tested against.
+ */
+#define SCHED_DISABLE_DELAY_MS	34
+
+/*
+ * A threshold of 75% is a reasonable starting point considering that real world apps
+ * generally don't get anywhere near this.
+ */
+#define NUM_SCHED_DISABLE_GUCIDS_DEFAULT_THRESHOLD(__guc) \
+	(((intel_guc_sched_disable_gucid_threshold_max(__guc)) * 3) / 4)
+
 void intel_guc_submission_init_early(struct intel_guc *guc)
 {
 	xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ);
@@ -4235,7 +4400,10 @@  void intel_guc_submission_init_early(struct intel_guc *guc)
 	spin_lock_init(&guc->timestamp.lock);
 	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
 
+	guc->submission_state.sched_disable_delay_ms = SCHED_DISABLE_DELAY_MS;
 	guc->submission_state.num_guc_ids = GUC_MAX_CONTEXT_ID;
+	guc->submission_state.sched_disable_gucid_threshold =
+		NUM_SCHED_DISABLE_GUCIDS_DEFAULT_THRESHOLD(guc);
 	guc->submission_supported = __guc_submission_supported(guc);
 	guc->submission_selected = __guc_submission_selected(guc);
 }
diff --git a/drivers/gpu/drm/i915/i915_selftest.h b/drivers/gpu/drm/i915/i915_selftest.h
index f54de0499be7..bdf3e22c0a34 100644
--- a/drivers/gpu/drm/i915/i915_selftest.h
+++ b/drivers/gpu/drm/i915/i915_selftest.h
@@ -92,12 +92,14 @@  int __i915_subtests(const char *caller,
 			T, ARRAY_SIZE(T), data)
 #define i915_live_subtests(T, data) ({ \
 	typecheck(struct drm_i915_private *, data); \
+	(data)->gt[0]->uc.guc.submission_state.sched_disable_delay_ms = 0; \
 	__i915_subtests(__func__, \
 			__i915_live_setup, __i915_live_teardown, \
 			T, ARRAY_SIZE(T), data); \
 })
 #define intel_gt_live_subtests(T, data) ({ \
 	typecheck(struct intel_gt *, data); \
+	(data)->uc.guc.submission_state.sched_disable_delay_ms = 0; \
 	__i915_subtests(__func__, \
 			__intel_gt_live_setup, __intel_gt_live_teardown, \
 			T, ARRAY_SIZE(T), data); \