Message ID | 20250109120705.3021126-1-tejas.upadhyay@intel.com (mailing list archive)
---|---
State | New
Series | [RFC,V2] drm/xe/guc: Use exec queue hints for GT frequency
>>Allow user to provide a low latency hint per exec queue. When set, KMD sends a hint to GuC which results in special handling for this exec queue. SLPC will ramp the GT frequency aggressively every time it switches to this exec queue. >> >>We need to enable the use of SLPC Compute strategy during init, but it will apply only to exec queues that set this bit during exec queue creation. >> >>Improvement with this approach as below: >> >>Before, >> >>:~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency >>Platform: Intel(R) OpenCL Graphics >> Device: Intel(R) Graphics [0xe20b] >> Driver version : 24.52.0 (Linux x64) >> Compute units : 160 >> Clock frequency : 2850 MHz >> Kernel launch latency : 283.16 us >> >>After, >> >>:~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency >>Platform: Intel(R) OpenCL Graphics >> Device: Intel(R) Graphics [0xe20b] >> Driver version : 24.52.0 (Linux x64) >> Compute units : 160 >> Clock frequency : 2850 MHz >> >> Kernel launch latency : 63.38 us >> >>UMD will indicate low latency hint with flag as mentioned below, >> >>* struct drm_xe_exec_queue_create exec_queue_create = { >>* .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 >>* .extensions = 0, >>* .vm_id = vm, >>* .num_bb_per_exec = 1, >>* .num_eng_per_bb = 1, >>* .instances = to_user_pointer(&instance), >>* }; >>* ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &exec_queue_create); >> >>Link to UMD PR : https://github.com/intel/compute-runtime/pull/794 >> >>Note: There is outstanding issue on guc side to be not able to switch to max frequency as per strategy indicated by KMD, so for experminet/test result hardcoding apporch was taken and passed to guc as policy. Effort on debugging from guc side is going on in parallel. >> >>V2: >> - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned for other hint(Szymon) >> - Add motivation to description (Lucas) >> >>Cc:dri-devel@lists.freedesktop.org >>Cc:vinay.belgaumkar@intel.com >>Cc:Michal Mrozek <michal.mrozek@intel.com> Cc:Szymon Morek <szymon.morek@intel.com> Cc:José Roberto de Souza <jose.souza@intel.com> >>Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> >>--- >> drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h | 3 +++ >> drivers/gpu/drm/xe/xe_exec_queue.c | 7 ++++--- >> drivers/gpu/drm/xe/xe_guc_pc.c | 16 ++++++++++++++++ >> drivers/gpu/drm/xe/xe_guc_submit.c | 7 +++++++ >> include/uapi/drm/xe_drm.h | 3 ++- >> 5 files changed, 32 insertions(+), 4 deletions(-) >> >>diff --git a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >>index 85abe4f09ae2..c50075b8270f 100644 >>--- a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >>+++ b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >>@@ -174,6 +174,9 @@ struct slpc_task_state_data { >> }; >> } __packed; >> >>+#define SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE REG_BIT(28) >>+#define SLPC_OPTIMIZED_STRATEGY_COMPUTE REG_BIT(0) >>+ >> struct slpc_shared_data_header { >> /* Total size in bytes of this shared buffer. 
*/ >> u32 size; >>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c >>index 8948f50ee58f..7747ba6c4bb8 100644 >>--- a/drivers/gpu/drm/xe/xe_exec_queue.c >>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c >>@@ -553,7 +553,8 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, >> u32 len; >> int err; >> >>- if (XE_IOCTL_DBG(xe, args->flags) || >>+ if (XE_IOCTL_DBG(xe, args->flags && >>+ !(args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)) || >> XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1])) >> return -EINVAL; >> >>@@ -578,7 +579,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, >> >> for_each_tile(tile, xe, id) { >> struct xe_exec_queue *new; >>- u32 flags = EXEC_QUEUE_FLAG_VM; >>+ u32 flags = args->flags | EXEC_QUEUE_FLAG_VM; >> >> if (id) >> flags |= EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD; >>@@ -626,7 +627,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, >> } >> >> q = xe_exec_queue_create(xe, vm, logical_mask, >>- args->width, hwe, 0, >>+ args->width, hwe, args->flags, >> args->extensions); >> up_read(&vm->lock); >> xe_vm_put(vm); >>diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c index df7f130fb663..ff0b98ccf1a7 100644 >>--- a/drivers/gpu/drm/xe/xe_guc_pc.c >>+++ b/drivers/gpu/drm/xe/xe_guc_pc.c >>@@ -992,6 +992,19 @@ static int pc_init_freqs(struct xe_guc_pc *pc) >> return ret; >> } >> >>+static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val) { >>+ int ret = 0; >>+ >>+ xe_pm_runtime_get(pc_to_xe(pc)); >>+ ret = pc_action_set_param(pc, >>+ SLPC_PARAM_STRATEGIES, >>+ val); >>+ xe_pm_runtime_put(pc_to_xe(pc)); >>+ >>+ return ret; >>+} >>+ >> /** >> * xe_guc_pc_start - Start GuC's Power Conservation component >> * @pc: Xe_GuC_PC instance >>@@ -1052,6 +1065,9 @@ int xe_guc_pc_start(struct xe_guc_pc *pc) >> >> ret = pc_action_setup_gucrc(pc, GUCRC_FIRMWARE_CONTROL); >> >>+ /* Enable SLPC Optimized Strategy for compute */ >>+ xe_guc_pc_set_strategy(pc, SLPC_OPTIMIZED_STRATEGY_COMPUTE); >>+ >> out: >> xe_force_wake_put(gt_to_fw(gt), fw_ref); >> return ret; >>diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c >>index 9c36329fe857..88a1987ac360 100644 >>--- a/drivers/gpu/drm/xe/xe_guc_submit.c >>+++ b/drivers/gpu/drm/xe/xe_guc_submit.c >>@@ -15,6 +15,7 @@ >> #include <drm/drm_managed.h> >> >> #include "abi/guc_actions_abi.h" >>+#include "abi/guc_actions_slpc_abi.h" >> #include "abi/guc_klvs_abi.h" >> #include "regs/xe_lrc_layout.h" >> #include "xe_assert.h" >>@@ -400,6 +401,7 @@ static void __guc_exec_queue_policy_add_##func(struct exec_queue_policy *policy, MAKE_EXEC_QUEUE_POLICY_ADD(execution_quantum, EXECUTION_QUANTUM) MAKE_EXEC_QUEUE_POLICY_ADD(preemption_timeout, PREEMPTION_TIMEOUT) MAKE_EXEC_QUEUE_POLICY_ADD(priority, SCHEDULING_PRIORITY) >>+MAKE_EXEC_QUEUE_POLICY_ADD(slpc_ctx_freq_req, SLPM_GT_FREQUENCY) >> #undef MAKE_EXEC_QUEUE_POLICY_ADD >> >> static const int xe_exec_queue_prio_to_guc[] = { @@ -414,14 +416,19 @@ static void init_policies(struct xe_guc *guc, struct xe_exec_queue *q) >> struct exec_queue_policy policy; >> enum xe_exec_queue_priority prio = q->sched_props.priority; >> u32 timeslice_us = q->sched_props.timeslice_us; >>+ u32 slpc_ctx_freq_req = 0; >> u32 preempt_timeout_us = q->sched_props.preempt_timeout_us; >> >> xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q)); >> >>+ if (q->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) >>+ slpc_ctx_freq_req |= SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE; >>+ >> 
__guc_exec_queue_policy_start_klv(&policy, q->guc->id); >> __guc_exec_queue_policy_add_priority(&policy, xe_exec_queue_prio_to_guc[prio]); >> __guc_exec_queue_policy_add_execution_quantum(&policy, timeslice_us); >> __guc_exec_queue_policy_add_preemption_timeout(&policy, preempt_timeout_us); >>+ __guc_exec_queue_policy_add_slpc_ctx_freq_req(&policy, >>+slpc_ctx_freq_req); >> >> xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g, >> __guc_exec_queue_policy_action_size(&policy), 0, 0); diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h index f62689ca861a..bd0150d2200c 100644 >>--- a/include/uapi/drm/xe_drm.h >>+++ b/include/uapi/drm/xe_drm.h >>@@ -1097,6 +1097,7 @@ struct drm_xe_vm_bind { >> * .engine_class = DRM_XE_ENGINE_CLASS_RENDER, >> * }; >> * struct drm_xe_exec_queue_create exec_queue_create = { >>+ * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 >> * .extensions = 0, >> * .vm_id = vm, >> * .num_bb_per_exec = 1, >>@@ -1110,7 +1111,6 @@ struct drm_xe_exec_queue_create { >> #define DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY 0 >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_PRIORITY 0 >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_TIMESLICE 1 >>- >> /** @extensions: Pointer to the first extension struct, if any */ >> __u64 extensions; >> >>@@ -1123,6 +1123,7 @@ struct drm_xe_exec_queue_create { >> /** @vm_id: VM to use for this exec queue */ >> __u32 vm_id; >> >>+#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT (0x1 << 1) >> /** @flags: MBZ */ >> __u32 flags; >> >>-- >>2.34.1 >>" Acked-by: Michal Mrozek <michal.mrozek@intel.com>
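For reference, a minimal userspace sketch of the uAPI the patch describes (not part of the series): it creates one exec queue with the new DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT flag set. The commit-message example uses the older num_bb_per_exec/num_eng_per_bb names; this sketch assumes the width/num_placements fields of current xe_drm.h, so adjust it to whatever header the series actually applies to.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

/* Sketch only: create an exec queue with the low-latency hint set. */
static int create_low_latency_queue(int fd, uint32_t vm_id,
				    struct drm_xe_engine_class_instance *instance)
{
	struct drm_xe_exec_queue_create create = {
		.flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT,	/* or 0 to opt out */
		.extensions = 0,
		.vm_id = vm_id,
		.width = 1,			/* one BB per exec */
		.num_placements = 1,		/* one engine instance */
		.instances = (uintptr_t)instance,
	};

	if (ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &create))
		return -1;

	return create.exec_queue_id;	/* queue handle on success */
}
```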
On Thu, 2025-01-09 at 17:37 +0530, Tejas Upadhyay wrote: > Allow user to provide a low latency hint per exec queue. When set, > KMD sends a hint to GuC which results in special handling for this > exec queue. SLPC will ramp the GT frequency aggressively every time > it switches to this exec queue. > > We need to enable the use of SLPC Compute strategy during init, but > it will apply only to exec queues that set this bit during exec queue > creation. > > Improvement with this approach as below: > > Before, > > :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency > Platform: Intel(R) OpenCL Graphics > Device: Intel(R) Graphics [0xe20b] > Driver version : 24.52.0 (Linux x64) > Compute units : 160 > Clock frequency : 2850 MHz > Kernel launch latency : 283.16 us > > After, > > :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency > Platform: Intel(R) OpenCL Graphics > Device: Intel(R) Graphics [0xe20b] > Driver version : 24.52.0 (Linux x64) > Compute units : 160 > Clock frequency : 2850 MHz > > Kernel launch latency : 63.38 us > > UMD will indicate low latency hint with flag as mentioned below, > > * struct drm_xe_exec_queue_create exec_queue_create = { > * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 > * .extensions = 0, > * .vm_id = vm, > * .num_bb_per_exec = 1, > * .num_eng_per_bb = 1, > * .instances = to_user_pointer(&instance), > * }; > * ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &exec_queue_create); > > Link to UMD PR : https://github.com/intel/compute-runtime/pull/794 > > Note: There is outstanding issue on guc side to be not able to switch to max > frequency as per strategy indicated by KMD, so for experminet/test result > hardcoding apporch was taken and passed to guc as policy. Effort on debugging > from guc side is going on in parallel. > > V2: > - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned for other hint(Szymon) > - Add motivation to description (Lucas) > > Cc:dri-devel@lists.freedesktop.org > Cc:vinay.belgaumkar@intel.com > Cc:Michal Mrozek <michal.mrozek@intel.com> > Cc:Szymon Morek <szymon.morek@intel.com> > Cc:José Roberto de Souza <jose.souza@intel.com> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> > --- > drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h | 3 +++ > drivers/gpu/drm/xe/xe_exec_queue.c | 7 ++++--- > drivers/gpu/drm/xe/xe_guc_pc.c | 16 ++++++++++++++++ > drivers/gpu/drm/xe/xe_guc_submit.c | 7 +++++++ > include/uapi/drm/xe_drm.h | 3 ++- > 5 files changed, 32 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > index 85abe4f09ae2..c50075b8270f 100644 > --- a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > +++ b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > @@ -174,6 +174,9 @@ struct slpc_task_state_data { > }; > } __packed; > > +#define SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE REG_BIT(28) > +#define SLPC_OPTIMIZED_STRATEGY_COMPUTE REG_BIT(0) > + > struct slpc_shared_data_header { > /* Total size in bytes of this shared buffer. 
*/ > u32 size; > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c > index 8948f50ee58f..7747ba6c4bb8 100644 > --- a/drivers/gpu/drm/xe/xe_exec_queue.c > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c > @@ -553,7 +553,8 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, > u32 len; > int err; > > - if (XE_IOCTL_DBG(xe, args->flags) || > + if (XE_IOCTL_DBG(xe, args->flags && > + !(args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)) || > XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1])) > return -EINVAL; > > @@ -578,7 +579,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, > > for_each_tile(tile, xe, id) { > struct xe_exec_queue *new; > - u32 flags = EXEC_QUEUE_FLAG_VM; > + u32 flags = args->flags | EXEC_QUEUE_FLAG_VM; > > if (id) > flags |= EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD; > @@ -626,7 +627,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, > } > > q = xe_exec_queue_create(xe, vm, logical_mask, > - args->width, hwe, 0, > + args->width, hwe, args->flags, > args->extensions); > up_read(&vm->lock); > xe_vm_put(vm); > diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c > index df7f130fb663..ff0b98ccf1a7 100644 > --- a/drivers/gpu/drm/xe/xe_guc_pc.c > +++ b/drivers/gpu/drm/xe/xe_guc_pc.c > @@ -992,6 +992,19 @@ static int pc_init_freqs(struct xe_guc_pc *pc) > return ret; > } > > +static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val) > +{ > + int ret = 0; > + > + xe_pm_runtime_get(pc_to_xe(pc)); > + ret = pc_action_set_param(pc, > + SLPC_PARAM_STRATEGIES, > + val); > + xe_pm_runtime_put(pc_to_xe(pc)); > + > + return ret; > +} > + > /** > * xe_guc_pc_start - Start GuC's Power Conservation component > * @pc: Xe_GuC_PC instance > @@ -1052,6 +1065,9 @@ int xe_guc_pc_start(struct xe_guc_pc *pc) > > ret = pc_action_setup_gucrc(pc, GUCRC_FIRMWARE_CONTROL); > > + /* Enable SLPC Optimized Strategy for compute */ > + xe_guc_pc_set_strategy(pc, SLPC_OPTIMIZED_STRATEGY_COMPUTE); > + > out: > xe_force_wake_put(gt_to_fw(gt), fw_ref); > return ret; > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c > index 9c36329fe857..88a1987ac360 100644 > --- a/drivers/gpu/drm/xe/xe_guc_submit.c > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c > @@ -15,6 +15,7 @@ > #include <drm/drm_managed.h> > > #include "abi/guc_actions_abi.h" > +#include "abi/guc_actions_slpc_abi.h" > #include "abi/guc_klvs_abi.h" > #include "regs/xe_lrc_layout.h" > #include "xe_assert.h" > @@ -400,6 +401,7 @@ static void __guc_exec_queue_policy_add_##func(struct exec_queue_policy *policy, > MAKE_EXEC_QUEUE_POLICY_ADD(execution_quantum, EXECUTION_QUANTUM) > MAKE_EXEC_QUEUE_POLICY_ADD(preemption_timeout, PREEMPTION_TIMEOUT) > MAKE_EXEC_QUEUE_POLICY_ADD(priority, SCHEDULING_PRIORITY) > +MAKE_EXEC_QUEUE_POLICY_ADD(slpc_ctx_freq_req, SLPM_GT_FREQUENCY) > #undef MAKE_EXEC_QUEUE_POLICY_ADD > > static const int xe_exec_queue_prio_to_guc[] = { > @@ -414,14 +416,19 @@ static void init_policies(struct xe_guc *guc, struct xe_exec_queue *q) > struct exec_queue_policy policy; > enum xe_exec_queue_priority prio = q->sched_props.priority; > u32 timeslice_us = q->sched_props.timeslice_us; > + u32 slpc_ctx_freq_req = 0; > u32 preempt_timeout_us = q->sched_props.preempt_timeout_us; > > xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q)); > > + if (q->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) > + slpc_ctx_freq_req |= SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE; > + > __guc_exec_queue_policy_start_klv(&policy, 
q->guc->id); > __guc_exec_queue_policy_add_priority(&policy, xe_exec_queue_prio_to_guc[prio]); > __guc_exec_queue_policy_add_execution_quantum(&policy, timeslice_us); > __guc_exec_queue_policy_add_preemption_timeout(&policy, preempt_timeout_us); > + __guc_exec_queue_policy_add_slpc_ctx_freq_req(&policy, slpc_ctx_freq_req); > > xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g, > __guc_exec_queue_policy_action_size(&policy), 0, 0); > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h > index f62689ca861a..bd0150d2200c 100644 > --- a/include/uapi/drm/xe_drm.h > +++ b/include/uapi/drm/xe_drm.h > @@ -1097,6 +1097,7 @@ struct drm_xe_vm_bind { > * .engine_class = DRM_XE_ENGINE_CLASS_RENDER, > * }; > * struct drm_xe_exec_queue_create exec_queue_create = { > + * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 > * .extensions = 0, > * .vm_id = vm, > * .num_bb_per_exec = 1, > @@ -1110,7 +1111,6 @@ struct drm_xe_exec_queue_create { > #define DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY 0 > #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_PRIORITY 0 > #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_TIMESLICE 1 > - > /** @extensions: Pointer to the first extension struct, if any */ > __u64 extensions; > > @@ -1123,6 +1123,7 @@ struct drm_xe_exec_queue_create { > /** @vm_id: VM to use for this exec queue */ > __u32 vm_id; > > +#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT (0x1 << 1) we need a way to know if the KMD version running supports it or not. I think a bit on drm_xe_query_config would do the job. > /** @flags: MBZ */ > __u32 flags; >
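A sketch of the discoverability Jose asks for, assuming a new flag bit is added to drm_xe_query_config: the DRM_XE_QUERY_CONFIG_FLAG_HAS_LOW_LATENCY_HINT name and value below are hypothetical, chosen only for illustration, while the query plumbing itself is the existing DRM_XE_DEVICE_QUERY_CONFIG two-call pattern.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

/* Hypothetical bit, not defined by this series. */
#define DRM_XE_QUERY_CONFIG_FLAG_HAS_LOW_LATENCY_HINT	(1 << 1)

static bool kmd_has_low_latency_hint(int fd)
{
	struct drm_xe_device_query query = { .query = DRM_XE_DEVICE_QUERY_CONFIG };
	struct drm_xe_query_config *config;
	bool supported = false;

	/* First call only reports the buffer size to allocate. */
	if (ioctl(fd, DRM_IOCTL_XE_DEVICE_QUERY, &query) || !query.size)
		return false;

	config = calloc(1, query.size);
	if (!config)
		return false;

	query.data = (uintptr_t)config;
	if (!ioctl(fd, DRM_IOCTL_XE_DEVICE_QUERY, &query))
		supported = config->info[DRM_XE_QUERY_CONFIG_FLAGS] &
			    DRM_XE_QUERY_CONFIG_FLAG_HAS_LOW_LATENCY_HINT;

	free(config);
	return supported;
}
```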
> -----Original Message----- > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On > Behalf Of Tejas Upadhyay > Sent: January 9, 2025 7:07 AM > To: intel-xe@lists.freedesktop.org > Cc: dri-devel@lists.freedesktop.org; Nilawar, Badal > <badal.nilawar@intel.com>; Belgaumkar, Vinay > <vinay.belgaumkar@intel.com>; Mrozek, Michal > <michal.mrozek@intel.com>; Morek, Szymon > <szymon.morek@intel.com>; Souza, Jose <jose.souza@intel.com>; > De Marchi, Lucas <lucas.demarchi@intel.com>; Upadhyay, Tejas > <tejas.upadhyay@intel.com> > Subject: [RFC PATCH V2] drm/xe/guc: Use exec queue hints for GT > frequency > > Allow user to provide a low latency hint per exec queue. When set, > KMD sends a hint to GuC which results in special handling for this > exec queue. SLPC will ramp the GT frequency aggressively every time > it switches to this exec queue. > > We need to enable the use of SLPC Compute strategy during init, but > it will apply only to exec queues that set this bit during exec queue > creation. > > Improvement with this approach as below: > > Before, > > :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak -- > kernel-latency > Platform: Intel(R) OpenCL Graphics > Device: Intel(R) Graphics [0xe20b] > Driver version : 24.52.0 (Linux x64) > Compute units : 160 > Clock frequency : 2850 MHz > Kernel launch latency : 283.16 us > > After, > > :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak -- > kernel-latency > Platform: Intel(R) OpenCL Graphics > Device: Intel(R) Graphics [0xe20b] > Driver version : 24.52.0 (Linux x64) > Compute units : 160 > Clock frequency : 2850 MHz > > Kernel launch latency : 63.38 us > > UMD will indicate low latency hint with flag as mentioned below, > > * struct drm_xe_exec_queue_create exec_queue_create = { > * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 > * .extensions = 0, > * .vm_id = vm, > * .num_bb_per_exec = 1, > * .num_eng_per_bb = 1, > * .instances = to_user_pointer(&instance), > * }; > * ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, > &exec_queue_create); > > Link to UMD PR : https://github.com/intel/compute-runtime/pull/794 > > Note: There is outstanding issue on guc side to be not able to switch > to max > frequency as per strategy indicated by KMD, so for experminet/test > result > hardcoding apporch was taken and passed to guc as policy. Effort on > debugging > from guc side is going on in parallel. 
> > V2: > - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned > for other hint(Szymon) > - Add motivation to description (Lucas) > > Cc:dri-devel@lists.freedesktop.org > Cc:vinay.belgaumkar@intel.com > Cc:Michal Mrozek <michal.mrozek@intel.com> > Cc:Szymon Morek <szymon.morek@intel.com> > Cc:José Roberto de Souza <jose.souza@intel.com> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> > --- > drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h | 3 +++ > drivers/gpu/drm/xe/xe_exec_queue.c | 7 ++++--- > drivers/gpu/drm/xe/xe_guc_pc.c | 16 ++++++++++++++++ > drivers/gpu/drm/xe/xe_guc_submit.c | 7 +++++++ > include/uapi/drm/xe_drm.h | 3 ++- > 5 files changed, 32 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > index 85abe4f09ae2..c50075b8270f 100644 > --- a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > +++ b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > @@ -174,6 +174,9 @@ struct slpc_task_state_data { > }; > } __packed; > > +#define SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE > REG_BIT(28) > +#define SLPC_OPTIMIZED_STRATEGY_COMPUTE > REG_BIT(0) > + > struct slpc_shared_data_header { > /* Total size in bytes of this shared buffer. */ > u32 size; > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c > b/drivers/gpu/drm/xe/xe_exec_queue.c > index 8948f50ee58f..7747ba6c4bb8 100644 > --- a/drivers/gpu/drm/xe/xe_exec_queue.c > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c > @@ -553,7 +553,8 @@ int xe_exec_queue_create_ioctl(struct > drm_device *dev, void *data, > u32 len; > int err; > > - if (XE_IOCTL_DBG(xe, args->flags) || > + if (XE_IOCTL_DBG(xe, args->flags && > + !(args->flags & > DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)) || > XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1])) > return -EINVAL; > > @@ -578,7 +579,7 @@ int xe_exec_queue_create_ioctl(struct > drm_device *dev, void *data, > > for_each_tile(tile, xe, id) { > struct xe_exec_queue *new; > - u32 flags = EXEC_QUEUE_FLAG_VM; > + u32 flags = args->flags | > EXEC_QUEUE_FLAG_VM; You are mixing internal and external flags here. Args->flags is an external definition. 
Note the current value of DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT conflict with the internal definition of: #define EXEC_QUEUE_FLAG_PERMANENT BIT(1) I think the better way to do it is, define an internal value for this purpose, such as: #define EXEC_QUEUE_FLAG_LOW_LATENCY BIT(5) Then write: if (args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) flags |= EXEC_QUEUE_FLAG_LOW_LATENCY; > > if (id) > flags |= > EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD; > @@ -626,7 +627,7 @@ int xe_exec_queue_create_ioctl(struct > drm_device *dev, void *data, > } > > q = xe_exec_queue_create(xe, vm, logical_mask, > - args->width, hwe, 0, > + args->width, hwe, args->flags, Use internal flag also here if you do what I said above > args->extensions); > up_read(&vm->lock); > xe_vm_put(vm); > diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c > b/drivers/gpu/drm/xe/xe_guc_pc.c > index df7f130fb663..ff0b98ccf1a7 100644 > --- a/drivers/gpu/drm/xe/xe_guc_pc.c > +++ b/drivers/gpu/drm/xe/xe_guc_pc.c > @@ -992,6 +992,19 @@ static int pc_init_freqs(struct xe_guc_pc *pc) > return ret; > } > > +static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val) > +{ > + int ret = 0; > + > + xe_pm_runtime_get(pc_to_xe(pc)); > + ret = pc_action_set_param(pc, > + SLPC_PARAM_STRATEGIES, > + val); > + xe_pm_runtime_put(pc_to_xe(pc)); > + > + return ret; > +} > + > /** > * xe_guc_pc_start - Start GuC's Power Conservation component > * @pc: Xe_GuC_PC instance > @@ -1052,6 +1065,9 @@ int xe_guc_pc_start(struct xe_guc_pc *pc) > > ret = pc_action_setup_gucrc(pc, > GUCRC_FIRMWARE_CONTROL); > > + /* Enable SLPC Optimized Strategy for compute */ > + xe_guc_pc_set_strategy(pc, > SLPC_OPTIMIZED_STRATEGY_COMPUTE); > + > out: > xe_force_wake_put(gt_to_fw(gt), fw_ref); > return ret; > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c > b/drivers/gpu/drm/xe/xe_guc_submit.c > index 9c36329fe857..88a1987ac360 100644 > --- a/drivers/gpu/drm/xe/xe_guc_submit.c > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c > @@ -15,6 +15,7 @@ > #include <drm/drm_managed.h> > > #include "abi/guc_actions_abi.h" > +#include "abi/guc_actions_slpc_abi.h" > #include "abi/guc_klvs_abi.h" > #include "regs/xe_lrc_layout.h" > #include "xe_assert.h" > @@ -400,6 +401,7 @@ static void > __guc_exec_queue_policy_add_##func(struct exec_queue_policy > *policy, > MAKE_EXEC_QUEUE_POLICY_ADD(execution_quantum, > EXECUTION_QUANTUM) > MAKE_EXEC_QUEUE_POLICY_ADD(preemption_timeout, > PREEMPTION_TIMEOUT) > MAKE_EXEC_QUEUE_POLICY_ADD(priority, SCHEDULING_PRIORITY) > +MAKE_EXEC_QUEUE_POLICY_ADD(slpc_ctx_freq_req, > SLPM_GT_FREQUENCY) > #undef MAKE_EXEC_QUEUE_POLICY_ADD > > static const int xe_exec_queue_prio_to_guc[] = { > @@ -414,14 +416,19 @@ static void init_policies(struct xe_guc *guc, > struct xe_exec_queue *q) > struct exec_queue_policy policy; > enum xe_exec_queue_priority prio = q- > >sched_props.priority; > u32 timeslice_us = q->sched_props.timeslice_us; > + u32 slpc_ctx_freq_req = 0; > u32 preempt_timeout_us = q- > >sched_props.preempt_timeout_us; > > xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q)); > > + if (q->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) Use internal definition > + slpc_ctx_freq_req |= > SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE; From the codes above, I feel the user hint flag better be named as Something like DRM_XE_EXEC_QUEUE_COMPUTE_HINT I feel this is slightly better than LOW_LATENCY as low latency is a result Of applying some frequency policy designed for compute task. A related question is, does this new policy affect power consumption? Usually Higher frequency implies higher power. 
Do we only want to keep such frequency policy during submission time, or Same policy is intended for whole execution time? The answer of above question would lead to some interface design such as Whether we need to introduce interface to disable the compute frequency policy. > + > __guc_exec_queue_policy_start_klv(&policy, q->guc->id); > __guc_exec_queue_policy_add_priority(&policy, > xe_exec_queue_prio_to_guc[prio]); > > __guc_exec_queue_policy_add_execution_quantum(&polic > y, timeslice_us); > > __guc_exec_queue_policy_add_preemption_timeout(&polic > y, preempt_timeout_us); > + __guc_exec_queue_policy_add_slpc_ctx_freq_req(&policy, > slpc_ctx_freq_req); > > xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g, > __guc_exec_queue_policy_action_size(&policy), > 0, 0); > diff --git a/include/uapi/drm/xe_drm.h > b/include/uapi/drm/xe_drm.h > index f62689ca861a..bd0150d2200c 100644 > --- a/include/uapi/drm/xe_drm.h > +++ b/include/uapi/drm/xe_drm.h > @@ -1097,6 +1097,7 @@ struct drm_xe_vm_bind { > * .engine_class = DRM_XE_ENGINE_CLASS_RENDER, > * }; > * struct drm_xe_exec_queue_create exec_queue_create = { > + * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 > * .extensions = 0, > * .vm_id = vm, > * .num_bb_per_exec = 1, > @@ -1110,7 +1111,6 @@ struct drm_xe_exec_queue_create { > #define DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY > 0 > #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_PRIORITY > 0 > #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_TIMESLICE > 1 > - > /** @extensions: Pointer to the first extension struct, if any > */ > __u64 extensions; > > @@ -1123,6 +1123,7 @@ struct drm_xe_exec_queue_create { > /** @vm_id: VM to use for this exec queue */ > __u32 vm_id; > > +#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT (0x1 > << 1) I wonder why flag can't start from (0x1 << 0) here. Yes I did see the v2 comment of below: - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned for other hint(Szymon) but this is regarding the definition of a flag used in exec_queue_create, and this is the first time we use this flag in *this* uAPI. It shouldn't conflict with UMD's usage of hint flags in *other* uAPI > /** @flags: MBZ */ Remove this MBZ comment Oak > __u32 flags; > > -- > 2.34.1
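A minimal sketch of the flag translation Oak suggests above, keeping uAPI bits out of the internal EXEC_QUEUE_FLAG_* space. The exec_queue_user_flags() helper name and the BIT(5) position are only illustrative; whichever internal bit is actually free in xe_exec_queue.h should be used.

```c
/* In xe_exec_queue.h (illustrative bit position): */
#define EXEC_QUEUE_FLAG_LOW_LATENCY		BIT(5)

/* Illustrative helper: translate uAPI create flags to internal flags. */
static u32 exec_queue_user_flags(u32 uapi_flags)
{
	u32 flags = 0;

	if (uapi_flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)
		flags |= EXEC_QUEUE_FLAG_LOW_LATENCY;

	return flags;
}
```

The ioctl would then pass exec_queue_user_flags(args->flags) | EXEC_QUEUE_FLAG_VM (and the translated value in the non-VM path as well), and init_policies() would test q->flags & EXEC_QUEUE_FLAG_LOW_LATENCY instead of the uAPI define.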
On 1/9/2025 6:36 AM, Souza, Jose wrote: > On Thu, 2025-01-09 at 17:37 +0530, Tejas Upadhyay wrote: >> Allow user to provide a low latency hint per exec queue. When set, >> KMD sends a hint to GuC which results in special handling for this >> exec queue. SLPC will ramp the GT frequency aggressively every time >> it switches to this exec queue. Clearer to say context instead of exec queue. >> >> We need to enable the use of SLPC Compute strategy during init, but >> it will apply only to exec queues that set this bit during exec queue >> creation. >> >> Improvement with this approach as below: >> >> Before, >> >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency >> Platform: Intel(R) OpenCL Graphics >> Device: Intel(R) Graphics [0xe20b] >> Driver version : 24.52.0 (Linux x64) >> Compute units : 160 >> Clock frequency : 2850 MHz >> Kernel launch latency : 283.16 us >> >> After, >> >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak --kernel-latency >> Platform: Intel(R) OpenCL Graphics >> Device: Intel(R) Graphics [0xe20b] >> Driver version : 24.52.0 (Linux x64) >> Compute units : 160 >> Clock frequency : 2850 MHz >> >> Kernel launch latency : 63.38 us >> >> UMD will indicate low latency hint with flag as mentioned below, >> >> * struct drm_xe_exec_queue_create exec_queue_create = { >> * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 >> * .extensions = 0, >> * .vm_id = vm, >> * .num_bb_per_exec = 1, >> * .num_eng_per_bb = 1, >> * .instances = to_user_pointer(&instance), >> * }; >> * ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &exec_queue_create); >> >> Link to UMD PR : https://github.com/intel/compute-runtime/pull/794 >> >> Note: There is outstanding issue on guc side to be not able to switch to max >> frequency as per strategy indicated by KMD, so for experminet/test result >> hardcoding apporch was taken and passed to guc as policy. Effort on debugging >> from guc side is going on in parallel. I verified this works fine. Seems like there was some issue with an older GuC version. >> >> V2: >> - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned for other hint(Szymon) >> - Add motivation to description (Lucas) >> >> Cc:dri-devel@lists.freedesktop.org >> Cc:vinay.belgaumkar@intel.com >> Cc:Michal Mrozek <michal.mrozek@intel.com> >> Cc:Szymon Morek <szymon.morek@intel.com> >> Cc:José Roberto de Souza <jose.souza@intel.com> >> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> >> --- >> drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h | 3 +++ >> drivers/gpu/drm/xe/xe_exec_queue.c | 7 ++++--- >> drivers/gpu/drm/xe/xe_guc_pc.c | 16 ++++++++++++++++ >> drivers/gpu/drm/xe/xe_guc_submit.c | 7 +++++++ >> include/uapi/drm/xe_drm.h | 3 ++- >> 5 files changed, 32 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >> index 85abe4f09ae2..c50075b8270f 100644 >> --- a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >> +++ b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >> @@ -174,6 +174,9 @@ struct slpc_task_state_data { >> }; >> } __packed; >> >> +#define SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE REG_BIT(28) >> +#define SLPC_OPTIMIZED_STRATEGY_COMPUTE REG_BIT(0) >> + >> struct slpc_shared_data_header { >> /* Total size in bytes of this shared buffer. 
*/ >> u32 size; >> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c >> index 8948f50ee58f..7747ba6c4bb8 100644 >> --- a/drivers/gpu/drm/xe/xe_exec_queue.c >> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c >> @@ -553,7 +553,8 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, >> u32 len; >> int err; >> >> - if (XE_IOCTL_DBG(xe, args->flags) || >> + if (XE_IOCTL_DBG(xe, args->flags && >> + !(args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)) || As mentioned in the comments, we need flags separate for external API and internally. i915 implementation has this - https://patchwork.freedesktop.org/series/130698/. >> XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1])) >> return -EINVAL; >> >> @@ -578,7 +579,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, >> >> for_each_tile(tile, xe, id) { >> struct xe_exec_queue *new; >> - u32 flags = EXEC_QUEUE_FLAG_VM; >> + u32 flags = args->flags | EXEC_QUEUE_FLAG_VM; >> >> if (id) >> flags |= EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD; >> @@ -626,7 +627,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data, >> } >> >> q = xe_exec_queue_create(xe, vm, logical_mask, >> - args->width, hwe, 0, >> + args->width, hwe, args->flags, >> args->extensions); >> up_read(&vm->lock); >> xe_vm_put(vm); >> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c b/drivers/gpu/drm/xe/xe_guc_pc.c >> index df7f130fb663..ff0b98ccf1a7 100644 >> --- a/drivers/gpu/drm/xe/xe_guc_pc.c >> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c >> @@ -992,6 +992,19 @@ static int pc_init_freqs(struct xe_guc_pc *pc) >> return ret; >> } >> >> +static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val) >> +{ >> + int ret = 0; >> + >> + xe_pm_runtime_get(pc_to_xe(pc)); We are already holding fwake from caller, no need to get runtime ref. Maybe just assert if we don't have it? >> + ret = pc_action_set_param(pc, >> + SLPC_PARAM_STRATEGIES, >> + val); >> + xe_pm_runtime_put(pc_to_xe(pc)); >> + >> + return ret; >> +} >> + >> /** >> * xe_guc_pc_start - Start GuC's Power Conservation component >> * @pc: Xe_GuC_PC instance >> @@ -1052,6 +1065,9 @@ int xe_guc_pc_start(struct xe_guc_pc *pc) >> >> ret = pc_action_setup_gucrc(pc, GUCRC_FIRMWARE_CONTROL); >> >> + /* Enable SLPC Optimized Strategy for compute */ >> + xe_guc_pc_set_strategy(pc, SLPC_OPTIMIZED_STRATEGY_COMPUTE); better to check the ret here. 
>> + >> out: >> xe_force_wake_put(gt_to_fw(gt), fw_ref); >> return ret; >> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c >> index 9c36329fe857..88a1987ac360 100644 >> --- a/drivers/gpu/drm/xe/xe_guc_submit.c >> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c >> @@ -15,6 +15,7 @@ >> #include <drm/drm_managed.h> >> >> #include "abi/guc_actions_abi.h" >> +#include "abi/guc_actions_slpc_abi.h" >> #include "abi/guc_klvs_abi.h" >> #include "regs/xe_lrc_layout.h" >> #include "xe_assert.h" >> @@ -400,6 +401,7 @@ static void __guc_exec_queue_policy_add_##func(struct exec_queue_policy *policy, >> MAKE_EXEC_QUEUE_POLICY_ADD(execution_quantum, EXECUTION_QUANTUM) >> MAKE_EXEC_QUEUE_POLICY_ADD(preemption_timeout, PREEMPTION_TIMEOUT) >> MAKE_EXEC_QUEUE_POLICY_ADD(priority, SCHEDULING_PRIORITY) >> +MAKE_EXEC_QUEUE_POLICY_ADD(slpc_ctx_freq_req, SLPM_GT_FREQUENCY) >> #undef MAKE_EXEC_QUEUE_POLICY_ADD >> >> static const int xe_exec_queue_prio_to_guc[] = { >> @@ -414,14 +416,19 @@ static void init_policies(struct xe_guc *guc, struct xe_exec_queue *q) >> struct exec_queue_policy policy; >> enum xe_exec_queue_priority prio = q->sched_props.priority; >> u32 timeslice_us = q->sched_props.timeslice_us; >> + u32 slpc_ctx_freq_req = 0; >> u32 preempt_timeout_us = q->sched_props.preempt_timeout_us; >> >> xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q)); >> >> + if (q->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) >> + slpc_ctx_freq_req |= SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE; >> + >> __guc_exec_queue_policy_start_klv(&policy, q->guc->id); >> __guc_exec_queue_policy_add_priority(&policy, xe_exec_queue_prio_to_guc[prio]); >> __guc_exec_queue_policy_add_execution_quantum(&policy, timeslice_us); >> __guc_exec_queue_policy_add_preemption_timeout(&policy, preempt_timeout_us); >> + __guc_exec_queue_policy_add_slpc_ctx_freq_req(&policy, slpc_ctx_freq_req); >> >> xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g, >> __guc_exec_queue_policy_action_size(&policy), 0, 0); >> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h >> index f62689ca861a..bd0150d2200c 100644 >> --- a/include/uapi/drm/xe_drm.h >> +++ b/include/uapi/drm/xe_drm.h >> @@ -1097,6 +1097,7 @@ struct drm_xe_vm_bind { >> * .engine_class = DRM_XE_ENGINE_CLASS_RENDER, >> * }; >> * struct drm_xe_exec_queue_create exec_queue_create = { >> + * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 >> * .extensions = 0, >> * .vm_id = vm, >> * .num_bb_per_exec = 1, >> @@ -1110,7 +1111,6 @@ struct drm_xe_exec_queue_create { >> #define DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY 0 >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_PRIORITY 0 >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_TIMESLICE 1 >> - >> /** @extensions: Pointer to the first extension struct, if any */ >> __u64 extensions; >> >> @@ -1123,6 +1123,7 @@ struct drm_xe_exec_queue_create { >> /** @vm_id: VM to use for this exec queue */ >> __u32 vm_id; >> >> +#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT (0x1 << 1) > we need a way to know if the KMD version running supports it or not. > I think a bit on drm_xe_query_config would do the job. Yes, the i915 implementation has this - https://patchwork.freedesktop.org/series/130698/. We need something similar here. Thanks, Vinay. > >> /** @flags: MBZ */ >> __u32 flags; >>
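Folding Vinay's two comments into the guc_pc hunk would look roughly like this (a sketch against the posted patch, not a tested change): drop the runtime-PM get/put since xe_guc_pc_start() already runs with the device awake, and let a strategy failure fail the start.

```c
static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val)
{
	/* Caller (xe_guc_pc_start) already holds forcewake/runtime PM. */
	return pc_action_set_param(pc, SLPC_PARAM_STRATEGIES, val);
}

/* ...and in xe_guc_pc_start(): */
	ret = pc_action_setup_gucrc(pc, GUCRC_FIRMWARE_CONTROL);
	if (ret)
		goto out;

	/* Enable SLPC optimized strategy for compute; propagate failures. */
	ret = xe_guc_pc_set_strategy(pc, SLPC_OPTIMIZED_STRATEGY_COMPUTE);

out:
	xe_force_wake_put(gt_to_fw(gt), fw_ref);
	return ret;
```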
On 1/9/2025 9:37 AM, Zeng, Oak wrote: > >> -----Original Message----- >> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On >> Behalf Of Tejas Upadhyay >> Sent: January 9, 2025 7:07 AM >> To: intel-xe@lists.freedesktop.org >> Cc: dri-devel@lists.freedesktop.org; Nilawar, Badal >> <badal.nilawar@intel.com>; Belgaumkar, Vinay >> <vinay.belgaumkar@intel.com>; Mrozek, Michal >> <michal.mrozek@intel.com>; Morek, Szymon >> <szymon.morek@intel.com>; Souza, Jose <jose.souza@intel.com>; >> De Marchi, Lucas <lucas.demarchi@intel.com>; Upadhyay, Tejas >> <tejas.upadhyay@intel.com> >> Subject: [RFC PATCH V2] drm/xe/guc: Use exec queue hints for GT >> frequency >> >> Allow user to provide a low latency hint per exec queue. When set, >> KMD sends a hint to GuC which results in special handling for this >> exec queue. SLPC will ramp the GT frequency aggressively every time >> it switches to this exec queue. >> >> We need to enable the use of SLPC Compute strategy during init, but >> it will apply only to exec queues that set this bit during exec queue >> creation. >> >> Improvement with this approach as below: >> >> Before, >> >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak -- >> kernel-latency >> Platform: Intel(R) OpenCL Graphics >> Device: Intel(R) Graphics [0xe20b] >> Driver version : 24.52.0 (Linux x64) >> Compute units : 160 >> Clock frequency : 2850 MHz >> Kernel launch latency : 283.16 us >> >> After, >> >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak -- >> kernel-latency >> Platform: Intel(R) OpenCL Graphics >> Device: Intel(R) Graphics [0xe20b] >> Driver version : 24.52.0 (Linux x64) >> Compute units : 160 >> Clock frequency : 2850 MHz >> >> Kernel launch latency : 63.38 us >> >> UMD will indicate low latency hint with flag as mentioned below, >> >> * struct drm_xe_exec_queue_create exec_queue_create = { >> * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 >> * .extensions = 0, >> * .vm_id = vm, >> * .num_bb_per_exec = 1, >> * .num_eng_per_bb = 1, >> * .instances = to_user_pointer(&instance), >> * }; >> * ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, >> &exec_queue_create); >> >> Link to UMD PR : https://github.com/intel/compute-runtime/pull/794 >> >> Note: There is outstanding issue on guc side to be not able to switch >> to max >> frequency as per strategy indicated by KMD, so for experminet/test >> result >> hardcoding apporch was taken and passed to guc as policy. Effort on >> debugging >> from guc side is going on in parallel. 
>> >> V2: >> - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned >> for other hint(Szymon) >> - Add motivation to description (Lucas) >> >> Cc:dri-devel@lists.freedesktop.org >> Cc:vinay.belgaumkar@intel.com >> Cc:Michal Mrozek <michal.mrozek@intel.com> >> Cc:Szymon Morek <szymon.morek@intel.com> >> Cc:José Roberto de Souza <jose.souza@intel.com> >> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> >> --- >> drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h | 3 +++ >> drivers/gpu/drm/xe/xe_exec_queue.c | 7 ++++--- >> drivers/gpu/drm/xe/xe_guc_pc.c | 16 ++++++++++++++++ >> drivers/gpu/drm/xe/xe_guc_submit.c | 7 +++++++ >> include/uapi/drm/xe_drm.h | 3 ++- >> 5 files changed, 32 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >> b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >> index 85abe4f09ae2..c50075b8270f 100644 >> --- a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >> +++ b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h >> @@ -174,6 +174,9 @@ struct slpc_task_state_data { >> }; >> } __packed; >> >> +#define SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE >> REG_BIT(28) >> +#define SLPC_OPTIMIZED_STRATEGY_COMPUTE >> REG_BIT(0) >> + >> struct slpc_shared_data_header { >> /* Total size in bytes of this shared buffer. */ >> u32 size; >> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c >> b/drivers/gpu/drm/xe/xe_exec_queue.c >> index 8948f50ee58f..7747ba6c4bb8 100644 >> --- a/drivers/gpu/drm/xe/xe_exec_queue.c >> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c >> @@ -553,7 +553,8 @@ int xe_exec_queue_create_ioctl(struct >> drm_device *dev, void *data, >> u32 len; >> int err; >> >> - if (XE_IOCTL_DBG(xe, args->flags) || >> + if (XE_IOCTL_DBG(xe, args->flags && >> + !(args->flags & >> DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)) || >> XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1])) >> return -EINVAL; >> >> @@ -578,7 +579,7 @@ int xe_exec_queue_create_ioctl(struct >> drm_device *dev, void *data, >> >> for_each_tile(tile, xe, id) { >> struct xe_exec_queue *new; >> - u32 flags = EXEC_QUEUE_FLAG_VM; >> + u32 flags = args->flags | >> EXEC_QUEUE_FLAG_VM; > > You are mixing internal and external flags here. Args->flags is an external definition. 
Note the current value of > DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT conflict with the internal definition of: > > #define EXEC_QUEUE_FLAG_PERMANENT BIT(1) > > I think the better way to do it is, define an internal value for this purpose, such as: > > #define EXEC_QUEUE_FLAG_LOW_LATENCY BIT(5) > > Then write: if (args->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) > flags |= EXEC_QUEUE_FLAG_LOW_LATENCY; > > >> if (id) >> flags |= >> EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD; >> @@ -626,7 +627,7 @@ int xe_exec_queue_create_ioctl(struct >> drm_device *dev, void *data, >> } >> >> q = xe_exec_queue_create(xe, vm, logical_mask, >> - args->width, hwe, 0, >> + args->width, hwe, args->flags, > Use internal flag also here if you do what I said above > > >> args->extensions); >> up_read(&vm->lock); >> xe_vm_put(vm); >> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c >> b/drivers/gpu/drm/xe/xe_guc_pc.c >> index df7f130fb663..ff0b98ccf1a7 100644 >> --- a/drivers/gpu/drm/xe/xe_guc_pc.c >> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c >> @@ -992,6 +992,19 @@ static int pc_init_freqs(struct xe_guc_pc *pc) >> return ret; >> } >> >> +static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val) >> +{ >> + int ret = 0; >> + >> + xe_pm_runtime_get(pc_to_xe(pc)); >> + ret = pc_action_set_param(pc, >> + SLPC_PARAM_STRATEGIES, >> + val); >> + xe_pm_runtime_put(pc_to_xe(pc)); >> + >> + return ret; >> +} >> + >> /** >> * xe_guc_pc_start - Start GuC's Power Conservation component >> * @pc: Xe_GuC_PC instance >> @@ -1052,6 +1065,9 @@ int xe_guc_pc_start(struct xe_guc_pc *pc) >> >> ret = pc_action_setup_gucrc(pc, >> GUCRC_FIRMWARE_CONTROL); >> >> + /* Enable SLPC Optimized Strategy for compute */ >> + xe_guc_pc_set_strategy(pc, >> SLPC_OPTIMIZED_STRATEGY_COMPUTE); >> + >> out: >> xe_force_wake_put(gt_to_fw(gt), fw_ref); >> return ret; >> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c >> b/drivers/gpu/drm/xe/xe_guc_submit.c >> index 9c36329fe857..88a1987ac360 100644 >> --- a/drivers/gpu/drm/xe/xe_guc_submit.c >> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c >> @@ -15,6 +15,7 @@ >> #include <drm/drm_managed.h> >> >> #include "abi/guc_actions_abi.h" >> +#include "abi/guc_actions_slpc_abi.h" >> #include "abi/guc_klvs_abi.h" >> #include "regs/xe_lrc_layout.h" >> #include "xe_assert.h" >> @@ -400,6 +401,7 @@ static void >> __guc_exec_queue_policy_add_##func(struct exec_queue_policy >> *policy, >> MAKE_EXEC_QUEUE_POLICY_ADD(execution_quantum, >> EXECUTION_QUANTUM) >> MAKE_EXEC_QUEUE_POLICY_ADD(preemption_timeout, >> PREEMPTION_TIMEOUT) >> MAKE_EXEC_QUEUE_POLICY_ADD(priority, SCHEDULING_PRIORITY) >> +MAKE_EXEC_QUEUE_POLICY_ADD(slpc_ctx_freq_req, >> SLPM_GT_FREQUENCY) >> #undef MAKE_EXEC_QUEUE_POLICY_ADD >> >> static const int xe_exec_queue_prio_to_guc[] = { >> @@ -414,14 +416,19 @@ static void init_policies(struct xe_guc *guc, >> struct xe_exec_queue *q) >> struct exec_queue_policy policy; >> enum xe_exec_queue_priority prio = q- >>> sched_props.priority; >> u32 timeslice_us = q->sched_props.timeslice_us; >> + u32 slpc_ctx_freq_req = 0; >> u32 preempt_timeout_us = q- >>> sched_props.preempt_timeout_us; >> xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q)); >> >> + if (q->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) > Use internal definition > >> + slpc_ctx_freq_req |= >> SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE; > From the codes above, I feel the user hint flag better be named as > Something like DRM_XE_EXEC_QUEUE_COMPUTE_HINT > > I feel this is slightly better than LOW_LATENCY as low latency is a result > Of applying some frequency policy designed for 
compute task. We had this discussion while implementing it on the i915 side. It's better to keep it engine non-specific, that way non-compute(3D) contexts can use it as well. > > A related question is, does this new policy affect power consumption? Usually > Higher frequency implies higher power. > > Do we only want to keep such frequency policy during submission time, or > Same policy is intended for whole execution time? SLPC will switch to aggressive frequency strategies when that particular context is switched in. Other contexts are not affected. There is definitely a power impact since we are aggressively ramping the frequency for this context. > > The answer of above question would lead to some interface design such as > Whether we need to introduce interface to disable the compute frequency policy. It can be disabled on the fly for any context, since this is just another SLPC param. However, I believe the upstream policy is to not mutate the context after it has been created. Thanks, Vinay. > > >> + >> __guc_exec_queue_policy_start_klv(&policy, q->guc->id); >> __guc_exec_queue_policy_add_priority(&policy, >> xe_exec_queue_prio_to_guc[prio]); >> >> __guc_exec_queue_policy_add_execution_quantum(&polic >> y, timeslice_us); >> >> __guc_exec_queue_policy_add_preemption_timeout(&polic >> y, preempt_timeout_us); >> + __guc_exec_queue_policy_add_slpc_ctx_freq_req(&policy, >> slpc_ctx_freq_req); >> >> xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g, >> __guc_exec_queue_policy_action_size(&policy), >> 0, 0); >> diff --git a/include/uapi/drm/xe_drm.h >> b/include/uapi/drm/xe_drm.h >> index f62689ca861a..bd0150d2200c 100644 >> --- a/include/uapi/drm/xe_drm.h >> +++ b/include/uapi/drm/xe_drm.h >> @@ -1097,6 +1097,7 @@ struct drm_xe_vm_bind { >> * .engine_class = DRM_XE_ENGINE_CLASS_RENDER, >> * }; >> * struct drm_xe_exec_queue_create exec_queue_create = { >> + * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 >> * .extensions = 0, >> * .vm_id = vm, >> * .num_bb_per_exec = 1, >> @@ -1110,7 +1111,6 @@ struct drm_xe_exec_queue_create { >> #define DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY >> 0 >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_PRIORITY >> 0 >> #define DRM_XE_EXEC_QUEUE_SET_PROPERTY_TIMESLICE >> 1 >> - >> /** @extensions: Pointer to the first extension struct, if any >> */ >> __u64 extensions; >> >> @@ -1123,6 +1123,7 @@ struct drm_xe_exec_queue_create { >> /** @vm_id: VM to use for this exec queue */ >> __u32 vm_id; >> >> +#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT (0x1 >> << 1) > I wonder why flag can't start from (0x1 << 0) here. > > Yes I did see the v2 comment of below: > > - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned > for other hint(Szymon) > > but this is regarding the definition of a flag used in exec_queue_create, and this > is the first time we use this flag in *this* uAPI. It shouldn't conflict with UMD's usage of hint flags > in *other* uAPI > >> /** @flags: MBZ */ > Remove this MBZ comment > > Oak > > >> __u32 flags; >> >> -- >> 2.34.1
> -----Original Message----- > From: Belgaumkar, Vinay <vinay.belgaumkar@intel.com> > Sent: January 9, 2025 5:03 PM > To: Zeng, Oak <oak.zeng@intel.com>; Upadhyay, Tejas > <tejas.upadhyay@intel.com>; intel-xe@lists.freedesktop.org > Cc: dri-devel@lists.freedesktop.org; Nilawar, Badal > <badal.nilawar@intel.com>; Mrozek, Michal > <michal.mrozek@intel.com>; Morek, Szymon > <szymon.morek@intel.com>; Souza, Jose <jose.souza@intel.com>; > De Marchi, Lucas <lucas.demarchi@intel.com> > Subject: Re: [RFC PATCH V2] drm/xe/guc: Use exec queue hints for > GT frequency > > > On 1/9/2025 9:37 AM, Zeng, Oak wrote: > > > >> -----Original Message----- > >> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On > >> Behalf Of Tejas Upadhyay > >> Sent: January 9, 2025 7:07 AM > >> To: intel-xe@lists.freedesktop.org > >> Cc: dri-devel@lists.freedesktop.org; Nilawar, Badal > >> <badal.nilawar@intel.com>; Belgaumkar, Vinay > >> <vinay.belgaumkar@intel.com>; Mrozek, Michal > >> <michal.mrozek@intel.com>; Morek, Szymon > >> <szymon.morek@intel.com>; Souza, Jose > <jose.souza@intel.com>; > >> De Marchi, Lucas <lucas.demarchi@intel.com>; Upadhyay, Tejas > >> <tejas.upadhyay@intel.com> > >> Subject: [RFC PATCH V2] drm/xe/guc: Use exec queue hints for GT > >> frequency > >> > >> Allow user to provide a low latency hint per exec queue. When set, > >> KMD sends a hint to GuC which results in special handling for this > >> exec queue. SLPC will ramp the GT frequency aggressively every > time > >> it switches to this exec queue. > >> > >> We need to enable the use of SLPC Compute strategy during init, > but > >> it will apply only to exec queues that set this bit during exec queue > >> creation. > >> > >> Improvement with this approach as below: > >> > >> Before, > >> > >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak -- > >> kernel-latency > >> Platform: Intel(R) OpenCL Graphics > >> Device: Intel(R) Graphics [0xe20b] > >> Driver version : 24.52.0 (Linux x64) > >> Compute units : 160 > >> Clock frequency : 2850 MHz > >> Kernel launch latency : 283.16 us > >> > >> After, > >> > >> :~$ NEOReadDebugKeys=1 EnableDirectSubmission=0 clpeak -- > >> kernel-latency > >> Platform: Intel(R) OpenCL Graphics > >> Device: Intel(R) Graphics [0xe20b] > >> Driver version : 24.52.0 (Linux x64) > >> Compute units : 160 > >> Clock frequency : 2850 MHz > >> > >> Kernel launch latency : 63.38 us > >> > >> UMD will indicate low latency hint with flag as mentioned below, > >> > >> * struct drm_xe_exec_queue_create exec_queue_create = { > >> * .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0 > >> * .extensions = 0, > >> * .vm_id = vm, > >> * .num_bb_per_exec = 1, > >> * .num_eng_per_bb = 1, > >> * .instances = to_user_pointer(&instance), > >> * }; > >> * ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, > >> &exec_queue_create); > >> > >> Link to UMD PR : https://github.com/intel/compute- > runtime/pull/794 > >> > >> Note: There is outstanding issue on guc side to be not able to > switch > >> to max > >> frequency as per strategy indicated by KMD, so for > experminet/test > >> result > >> hardcoding apporch was taken and passed to guc as policy. Effort > on > >> debugging > >> from guc side is going on in parallel. 
> >> > >> V2: > >> - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already > planned > >> for other hint(Szymon) > >> - Add motivation to description (Lucas) > >> > >> Cc:dri-devel@lists.freedesktop.org > >> Cc:vinay.belgaumkar@intel.com > >> Cc:Michal Mrozek <michal.mrozek@intel.com> > >> Cc:Szymon Morek <szymon.morek@intel.com> > >> Cc:José Roberto de Souza <jose.souza@intel.com> > >> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> > >> --- > >> drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h | 3 +++ > >> drivers/gpu/drm/xe/xe_exec_queue.c | 7 ++++--- > >> drivers/gpu/drm/xe/xe_guc_pc.c | 16 > ++++++++++++++++ > >> drivers/gpu/drm/xe/xe_guc_submit.c | 7 +++++++ > >> include/uapi/drm/xe_drm.h | 3 ++- > >> 5 files changed, 32 insertions(+), 4 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > >> b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > >> index 85abe4f09ae2..c50075b8270f 100644 > >> --- a/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > >> +++ b/drivers/gpu/drm/xe/abi/guc_actions_slpc_abi.h > >> @@ -174,6 +174,9 @@ struct slpc_task_state_data { > >> }; > >> } __packed; > >> > >> +#define SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE > >> REG_BIT(28) > >> +#define SLPC_OPTIMIZED_STRATEGY_COMPUTE > >> REG_BIT(0) > >> + > >> struct slpc_shared_data_header { > >> /* Total size in bytes of this shared buffer. */ > >> u32 size; > >> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c > >> b/drivers/gpu/drm/xe/xe_exec_queue.c > >> index 8948f50ee58f..7747ba6c4bb8 100644 > >> --- a/drivers/gpu/drm/xe/xe_exec_queue.c > >> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c > >> @@ -553,7 +553,8 @@ int xe_exec_queue_create_ioctl(struct > >> drm_device *dev, void *data, > >> u32 len; > >> int err; > >> > >> - if (XE_IOCTL_DBG(xe, args->flags) || > >> + if (XE_IOCTL_DBG(xe, args->flags && > >> + !(args->flags & > >> DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT)) || > >> XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1])) > >> return -EINVAL; > >> > >> @@ -578,7 +579,7 @@ int xe_exec_queue_create_ioctl(struct > >> drm_device *dev, void *data, > >> > >> for_each_tile(tile, xe, id) { > >> struct xe_exec_queue *new; > >> - u32 flags = EXEC_QUEUE_FLAG_VM; > >> + u32 flags = args->flags | > >> EXEC_QUEUE_FLAG_VM; > > > > You are mixing internal and external flags here. Args->flags is an > external definition. 
Note the current value of > > DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT conflict with the > internal definition of: > > > > #define EXEC_QUEUE_FLAG_PERMANENT BIT(1) > > > > I think the better way to do it is, define an internal value for this > purpose, such as: > > > > #define EXEC_QUEUE_FLAG_LOW_LATENCY BIT(5) > > > > Then write: if (args->flags & > DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) > > flags |= EXEC_QUEUE_FLAG_LOW_LATENCY; > > > > > >> if (id) > >> flags |= > >> EXEC_QUEUE_FLAG_BIND_ENGINE_CHILD; > >> @@ -626,7 +627,7 @@ int xe_exec_queue_create_ioctl(struct > >> drm_device *dev, void *data, > >> } > >> > >> q = xe_exec_queue_create(xe, vm, logical_mask, > >> - args->width, hwe, 0, > >> + args->width, hwe, args->flags, > > Use internal flag also here if you do what I said above > > > > > >> args->extensions); > >> up_read(&vm->lock); > >> xe_vm_put(vm); > >> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c > >> b/drivers/gpu/drm/xe/xe_guc_pc.c > >> index df7f130fb663..ff0b98ccf1a7 100644 > >> --- a/drivers/gpu/drm/xe/xe_guc_pc.c > >> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c > >> @@ -992,6 +992,19 @@ static int pc_init_freqs(struct xe_guc_pc > *pc) > >> return ret; > >> } > >> > >> +static int xe_guc_pc_set_strategy(struct xe_guc_pc *pc, u32 val) > >> +{ > >> + int ret = 0; > >> + > >> + xe_pm_runtime_get(pc_to_xe(pc)); > >> + ret = pc_action_set_param(pc, > >> + SLPC_PARAM_STRATEGIES, > >> + val); > >> + xe_pm_runtime_put(pc_to_xe(pc)); > >> + > >> + return ret; > >> +} > >> + > >> /** > >> * xe_guc_pc_start - Start GuC's Power Conservation component > >> * @pc: Xe_GuC_PC instance > >> @@ -1052,6 +1065,9 @@ int xe_guc_pc_start(struct xe_guc_pc > *pc) > >> > >> ret = pc_action_setup_gucrc(pc, > >> GUCRC_FIRMWARE_CONTROL); > >> > >> + /* Enable SLPC Optimized Strategy for compute */ > >> + xe_guc_pc_set_strategy(pc, > >> SLPC_OPTIMIZED_STRATEGY_COMPUTE); > >> + > >> out: > >> xe_force_wake_put(gt_to_fw(gt), fw_ref); > >> return ret; > >> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c > >> b/drivers/gpu/drm/xe/xe_guc_submit.c > >> index 9c36329fe857..88a1987ac360 100644 > >> --- a/drivers/gpu/drm/xe/xe_guc_submit.c > >> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c > >> @@ -15,6 +15,7 @@ > >> #include <drm/drm_managed.h> > >> > >> #include "abi/guc_actions_abi.h" > >> +#include "abi/guc_actions_slpc_abi.h" > >> #include "abi/guc_klvs_abi.h" > >> #include "regs/xe_lrc_layout.h" > >> #include "xe_assert.h" > >> @@ -400,6 +401,7 @@ static void > >> __guc_exec_queue_policy_add_##func(struct > exec_queue_policy > >> *policy, > >> MAKE_EXEC_QUEUE_POLICY_ADD(execution_quantum, > >> EXECUTION_QUANTUM) > >> MAKE_EXEC_QUEUE_POLICY_ADD(preemption_timeout, > >> PREEMPTION_TIMEOUT) > >> MAKE_EXEC_QUEUE_POLICY_ADD(priority, > SCHEDULING_PRIORITY) > >> +MAKE_EXEC_QUEUE_POLICY_ADD(slpc_ctx_freq_req, > >> SLPM_GT_FREQUENCY) > >> #undef MAKE_EXEC_QUEUE_POLICY_ADD > >> > >> static const int xe_exec_queue_prio_to_guc[] = { > >> @@ -414,14 +416,19 @@ static void init_policies(struct xe_guc > *guc, > >> struct xe_exec_queue *q) > >> struct exec_queue_policy policy; > >> enum xe_exec_queue_priority prio = q- > >>> sched_props.priority; > >> u32 timeslice_us = q->sched_props.timeslice_us; > >> + u32 slpc_ctx_freq_req = 0; > >> u32 preempt_timeout_us = q- > >>> sched_props.preempt_timeout_us; > >> xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q)); > >> > >> + if (q->flags & DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT) > > Use internal definition > > > >> + slpc_ctx_freq_req |= > >> SLPC_EXEC_QUEUE_FREQ_REQ_IS_COMPUTE; > > From the 
codes above, I feel the user hint flag better be named as > > Something like DRM_XE_EXEC_QUEUE_COMPUTE_HINT > > > > I feel this is slightly better than LOW_LATENCY as low latency is a > result > > Of applying some frequency policy designed for compute task. > We had this discussion while implementing it on the i915 side. It's > better to keep it engine non-specific, that way non-compute(3D) > contexts > can use it as well. Ok, then COMPUTE is not the correct word. I still think LOW_LATENCY is not the right wording here, ie., during submission, the result of Applying such policy is low latency, but during computing, the result can be faster. A few other wording come to my mind are: HIGH_FREQUENCY, AGGRESSIVE_FREQUENCY, INTENSIVE_FREQUENCY or VIGOROUS_FREQUENCY If LOW_LATENCY was accepted at i915 time, maybe keep it. > > > > A related question is, does this new policy affect power > consumption? Usually > > Higher frequency implies higher power. > > > > Do we only want to keep such frequency policy during submission > time, or > > Same policy is intended for whole execution time? > SLPC will switch to aggressive frequency strategies when that > particular > context is switched in. Other contexts are not affected. There is > definitely a power impact since we are aggressively ramping the > frequency for this context. So policy applied to both submission and execution for the context. > > > > The answer of above question would lead to some interface design > such as > > Whether we need to introduce interface to disable the compute > frequency policy. > > It can be disabled on the fly for any context, since this is just > another SLPC param. However, I believe the upstream policy is to not > mutate the context after it has been created. Ok, as long as we are aware of, and people not complaining of the power consumption change After we apply this policy, making is immutable is fine. Oak > > Thanks, > > Vinay. 
> >
> >
> >> +
> >>         __guc_exec_queue_policy_start_klv(&policy, q->guc->id);
> >>         __guc_exec_queue_policy_add_priority(&policy, xe_exec_queue_prio_to_guc[prio]);
> >>         __guc_exec_queue_policy_add_execution_quantum(&policy, timeslice_us);
> >>         __guc_exec_queue_policy_add_preemption_timeout(&policy, preempt_timeout_us);
> >> +       __guc_exec_queue_policy_add_slpc_ctx_freq_req(&policy, slpc_ctx_freq_req);
> >>
> >>         xe_guc_ct_send(&guc->ct, (u32 *)&policy.h2g,
> >>                        __guc_exec_queue_policy_action_size(&policy), 0, 0);
> >> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> >> index f62689ca861a..bd0150d2200c 100644
> >> --- a/include/uapi/drm/xe_drm.h
> >> +++ b/include/uapi/drm/xe_drm.h
> >> @@ -1097,6 +1097,7 @@ struct drm_xe_vm_bind {
> >>   *     .engine_class = DRM_XE_ENGINE_CLASS_RENDER,
> >>   * };
> >>   * struct drm_xe_exec_queue_create exec_queue_create = {
> >> + *     .flags = DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT or 0
> >>   *     .extensions = 0,
> >>   *     .vm_id = vm,
> >>   *     .num_bb_per_exec = 1,
> >> @@ -1110,7 +1111,6 @@ struct drm_xe_exec_queue_create {
> >>  #define DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY       0
> >>  #define   DRM_XE_EXEC_QUEUE_SET_PROPERTY_PRIORITY      0
> >>  #define   DRM_XE_EXEC_QUEUE_SET_PROPERTY_TIMESLICE     1
> >> -
> >>         /** @extensions: Pointer to the first extension struct, if any */
> >>         __u64 extensions;
> >>
> >> @@ -1123,6 +1123,7 @@ struct drm_xe_exec_queue_create {
> >>         /** @vm_id: VM to use for this exec queue */
> >>         __u32 vm_id;
> >>
> >> +#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT     (0x1 << 1)
> > I wonder why the flag can't start from (0x1 << 0) here.
> >
> > Yes, I did see the v2 comment below:
> >
> >   - DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT 1 is already planned
> >     for other hint (Szymon)
> >
> > but this is regarding the definition of a flag used in exec_queue_create, and this
> > is the first time we use this flag in *this* uAPI. It shouldn't conflict with UMD's
> > usage of hint flags in *other* uAPI.
> >
> >>         /** @flags: MBZ */
> > Remove this MBZ comment
> >
> > Oak
> >
> >
> >>         __u32 flags;
> >>
> >> --
> >> 2.34.1
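For completeness, this is roughly what the two uAPI suggestions above would look like together: start the flag at bit 0 and replace the stale MBZ comment now that @flags accepts a value. The replacement doc-comment wording is an assumption, not text from the patch:

        +#define DRM_XE_EXEC_QUEUE_LOW_LATENCY_HINT     (0x1 << 0)
        -       /** @flags: MBZ */
        +       /** @flags: exec queue creation flags (DRM_XE_EXEC_QUEUE_*) */
                __u32 flags;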