diff mbox

[v6] drm/i915: Add IOCTL Param to control data port coherency.

Message ID 1531746436-9696-1-git-send-email-tomasz.lis@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Lis, Tomasz July 16, 2018, 1:07 p.m. UTC
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.

Rationale:

The OpenCL driver develpers requested a functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic question explaining background
of the functionality and reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).

When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and alternative method is needed.

Such alternative solution is to have a global coherency switch that allows
disabling coherency for single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...

v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
    Removed redundant GEM_WARN_ON()s. Improved to coding standard.
    Introduced a macro for checking whether hardware supports the feature.
v5: Renamed some locals. Made the flag write to be lazy.
    Updated comments to remove misconceptions. Added gen11 support.
v6: Moved the flag write to gen8_enit_flush_render(). Renamed some functions.
    Moved all flags checking to one place. Added mutex check.

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++--
 drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
 drivers/gpu/drm/i915/intel_lrc.c        | 66 ++++++++++++++++++++++++++++++++-
 include/uapi/drm/i915_drm.h             |  7 ++++
 5 files changed, 115 insertions(+), 5 deletions(-)

Comments

Tvrtko Ursulin July 16, 2018, 1:35 p.m. UTC | #1
On 16/07/2018 14:07, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver develpers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
> 
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
>      Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_enit_flush_render(). Renamed some functions.
>      Moved all flags checking to one place. Added mutex check.
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++--
>   drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
>   drivers/gpu/drm/i915/intel_lrc.c        | 66 ++++++++++++++++++++++++++++++++-
>   include/uapi/drm/i915_drm.h             |  7 ++++
>   5 files changed, 115 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 4fb9373..bae3999 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..44ebc31 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_GTT_SIZE:
>   		if (ctx->ppgtt)
>   			args->value = ctx->ppgtt->vm.total;
> -		else if (to_i915(dev)->mm.aliasing_ppgtt)
> -			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> +		else if (i915->mm.aliasing_ppgtt)
> +			args->value = i915->mm.aliasing_ppgtt->vm.total;
>   		else
> -			args->value = to_i915(dev)->ggtt.vm.total;
> +			args->value = i915->ggtt.vm.total;
>   		break;
>   	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>   		args->value = i915_gem_context_no_error_capture(ctx);
> @@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *i915 = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   
>   			if (args->size)
>   				ret = -EINVAL;
> -			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> +			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
>   				ret = -ENODEV;
>   			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>   				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(i915))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..9312343 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,8 @@ struct i915_gem_context {
>   #define CONTEXT_BANNABLE		3
>   #define CONTEXT_BANNED			4
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	6
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	7
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -257,6 +259,21 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
>   	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 6fef9d1..6a08e10 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,63 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +	cs = intel_ring_begin(rq, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(rq->i915) >= 11)
> +		reg = ICL_HDC_MODE;
> +	else if (INTEL_GEN(rq->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	/* Enabling coherency means disabling the bit which forces it off */
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(rq, cs);
> +
> +	return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> +	struct i915_gem_context *ctx = rq->gem_context;
> +	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +	int ret;
> +
> +	lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> +	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> +		return 0;
> +
> +	ret = emit_set_data_port_coherency(rq, enable);
> +
> +	if (!ret) {
> +		if (enable)
> +			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +		else
> +			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +	}
> +
> +	return ret;
> +}
> +
>   static struct i915_priolist *
>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>   {
> @@ -2133,7 +2190,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
>   	bool vf_flush_wa = false, dc_flush_wa = false;
>   	u32 *cs, flags = 0;
> -	int len;
> +	int err, len;
>   
>   	flags |= PIPE_CONTROL_CS_STALL;
>   
> @@ -2164,6 +2221,13 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   		/* WaForGAMHang:kbl */
>   		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>   			dc_flush_wa = true;
> +
> +		/* Emit the switch of data port coherency state if needed */
> +		err = intel_lr_context_update_data_port_coherency(request);
> +		if (GEM_WARN_ON(err)) {
> +			DRM_DEBUG("Data Port Coherency toggle failed.\n");
> +			return err;
> +		}
>   	}
>   
>   	len = 6;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..6ece759 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to higher level caches faster. Enabling data port coherency has
> + * a performance cost.
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>   
> 

Looks good to me!

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
Joonas Lahtinen July 18, 2018, 1:24 p.m. UTC | #2
Quoting Tomasz Lis (2018-07-16 16:07:16)
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> +       u32 *cs;
> +       i915_reg_t reg;
> +
> +       GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> +       GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> +       cs = intel_ring_begin(rq, 4);
> +       if (IS_ERR(cs))
> +               return PTR_ERR(cs);
> +
> +       if (INTEL_GEN(rq->i915) >= 11)
> +               reg = ICL_HDC_MODE;
> +       else if (INTEL_GEN(rq->i915) >= 10)
> +               reg = CNL_HDC_CHICKEN0;
> +       else
> +               reg = HDC_CHICKEN0;
> +
> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(reg);
> +       /* Enabling coherency means disabling the bit which forces it off */

This comment is still spurious, please get rid of the habit of writing
comments about "what" the code is doing, useful comments should be
limited to "why", which is quite self explanatory here, that's the way
the register is.

> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> +       struct i915_gem_context *ctx = rq->gem_context;
> +       bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +       int ret;
> +
> +       lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> +       if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> +               return 0;
> +
> +       ret = emit_set_data_port_coherency(rq, enable);
> +
> +       if (!ret) {
> +               if (enable)
> +                       __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +               else
> +                       __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> +       }

Do we have indication that the hardware feature will be unreliable in
responding to the requests? I don't think you need the differentiation
of requested vs. active. If there is an error, we can just report back to
the user as a failed IOCTL. Now it adds unnecessary complication for no benefit.

> @@ -2164,6 +2221,13 @@ static int gen8_emit_flush_render(struct i915_request *request,
>                 /* WaForGAMHang:kbl */
>                 if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>                         dc_flush_wa = true;
> +
> +               /* Emit the switch of data port coherency state if needed */

Ditto for spurious comment, just about what the code does.

> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>  #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to higher level caches faster. Enabling data port coherency has
> + * a performance cost.
> + */

I was under impression this is enabled by default and it can be disabled
for a performance optimization?

Regards, Joonas
Tvrtko Ursulin July 18, 2018, 2:42 p.m. UTC | #3
On 18/07/2018 14:24, Joonas Lahtinen wrote:
> Quoting Tomasz Lis (2018-07-16 16:07:16)
>> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
>> +{
>> +       u32 *cs;
>> +       i915_reg_t reg;
>> +
>> +       GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
>> +       GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
>> +
>> +       cs = intel_ring_begin(rq, 4);
>> +       if (IS_ERR(cs))
>> +               return PTR_ERR(cs);
>> +
>> +       if (INTEL_GEN(rq->i915) >= 11)
>> +               reg = ICL_HDC_MODE;
>> +       else if (INTEL_GEN(rq->i915) >= 10)
>> +               reg = CNL_HDC_CHICKEN0;
>> +       else
>> +               reg = HDC_CHICKEN0;
>> +
>> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +       *cs++ = i915_mmio_reg_offset(reg);
>> +       /* Enabling coherency means disabling the bit which forces it off */
> 
> This comment is still spurious, please get rid of the habit of writing
> comments about "what" the code is doing, useful comments should be
> limited to "why", which is quite self explanatory here, that's the way
> the register is.
> 
>> +static int
>> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
>> +{
>> +       struct i915_gem_context *ctx = rq->gem_context;
>> +       bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +       int ret;
>> +
>> +       lockdep_assert_held(&rq->i915->drm.struct_mutex);
>> +
>> +       if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
>> +               return 0;
>> +
>> +       ret = emit_set_data_port_coherency(rq, enable);
>> +
>> +       if (!ret) {
>> +               if (enable)
>> +                       __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>> +               else
>> +                       __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>> +       }
> 
> Do we have indication that the hardware feature will be unreliable in
> responding to the requests? I don't think you need the differentiation
> of requested vs. active. If there is an error, we can just report back to
> the user as a failed IOCTL. Now it adds unnecessary complication for no benefit.

Requested vs active is for implementing the lazy emit.

AFAIR it does propagate the error out of execbuf (although we never ever 
expect it to happen), and this is just to keep the internal 
house-keeping in sync.

Regards,

Tvrtko

>> @@ -2164,6 +2221,13 @@ static int gen8_emit_flush_render(struct i915_request *request,
>>                  /* WaForGAMHang:kbl */
>>                  if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>>                          dc_flush_wa = true;
>> +
>> +               /* Emit the switch of data port coherency state if needed */
> 
> Ditto for spurious comment, just about what the code does.
> 
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update memory
>> + * buffers shared with CPU, by forcing internal cache units to send memory
>> + * writes to higher level caches faster. Enabling data port coherency has
>> + * a performance cost.
>> + */
> 
> I was under impression this is enabled by default and it can be disabled
> for a performance optimization?
> 
> Regards, Joonas
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
Lis, Tomasz July 18, 2018, 3:28 p.m. UTC | #4
On 2018-07-18 16:42, Tvrtko Ursulin wrote:
>
> On 18/07/2018 14:24, Joonas Lahtinen wrote:
>> Quoting Tomasz Lis (2018-07-16 16:07:16)
>>> +static int emit_set_data_port_coherency(struct i915_request *rq, 
>>> bool enable)
>>> +{
>>> +       u32 *cs;
>>> +       i915_reg_t reg;
>>> +
>>> +       GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
>>> +       GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
>>> +
>>> +       cs = intel_ring_begin(rq, 4);
>>> +       if (IS_ERR(cs))
>>> +               return PTR_ERR(cs);
>>> +
>>> +       if (INTEL_GEN(rq->i915) >= 11)
>>> +               reg = ICL_HDC_MODE;
>>> +       else if (INTEL_GEN(rq->i915) >= 10)
>>> +               reg = CNL_HDC_CHICKEN0;
>>> +       else
>>> +               reg = HDC_CHICKEN0;
>>> +
>>> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
>>> +       *cs++ = i915_mmio_reg_offset(reg);
>>> +       /* Enabling coherency means disabling the bit which forces 
>>> it off */
>>
>> This comment is still spurious, please get rid of the habit of writing
>> comments about "what" the code is doing, useful comments should be
>> limited to "why", which is quite self explanatory here, that's the way
>> the register is.
Ok, I will read the related doc:
https://www.kernel.org/doc/html/v4.10/process/coding-style.html#commenting
>>
>>> +static int
>>> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
>>> +{
>>> +       struct i915_gem_context *ctx = rq->gem_context;
>>> +       bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, 
>>> &ctx->flags);
>>> +       int ret;
>>> +
>>> + lockdep_assert_held(&rq->i915->drm.struct_mutex);
>>> +
>>> +       if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) 
>>> == enable)
>>> +               return 0;
>>> +
>>> +       ret = emit_set_data_port_coherency(rq, enable);
>>> +
>>> +       if (!ret) {
>>> +               if (enable)
>>> + __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>>> +               else
>>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>>> +       }
>>
>> Do we have indication that the hardware feature will be unreliable in
>> responding to the requests? I don't think you need the differentiation
>> of requested vs. active. If there is an error, we can just report 
>> back to
>> the user as a failed IOCTL. Now it adds unnecessary complication for 
>> no benefit.
>
> Requested vs active is for implementing the lazy emit.
>
> AFAIR it does propagate the error out of execbuf (although we never 
> ever expect it to happen), and this is just to keep the internal 
> house-keeping in sync.
>
> Regards,
>
> Tvrtko
>
>>> @@ -2164,6 +2221,13 @@ static int gen8_emit_flush_render(struct 
>>> i915_request *request,
>>>                  /* WaForGAMHang:kbl */
>>>                  if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
>>>                          dc_flush_wa = true;
>>> +
>>> +               /* Emit the switch of data port coherency state if 
>>> needed */
>>
>> Ditto for spurious comment, just about what the code does.
>>
>>> +++ b/include/uapi/drm/i915_drm.h
>>> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>>>   #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>>>   #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>>>   #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
>>> +/*
>>> + * When data port level coherency is enabled, the GPU will update 
>>> memory
>>> + * buffers shared with CPU, by forcing internal cache units to send 
>>> memory
>>> + * writes to higher level caches faster. Enabling data port 
>>> coherency has
>>> + * a performance cost.
>>> + */
>>
>> I was under impression this is enabled by default and it can be disabled
>> for a performance optimization?
This is true, coherency is kept by default. We disable it as a 
workaround: performance-related for gen11, and due to minor hardware 
issue on previous platforms. See WaForceEnableNonCoherent.
-Tomasz
>>
>> Regards, Joonas
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>
Joonas Lahtinen July 19, 2018, 7:12 a.m. UTC | #5
Quoting Lis, Tomasz (2018-07-18 18:28:32)
> 
> On 2018-07-18 16:42, Tvrtko Ursulin wrote:
> >
> > On 18/07/2018 14:24, Joonas Lahtinen wrote:
> >> Quoting Tomasz Lis (2018-07-16 16:07:16)

<SNIP>

> >>> +++ b/include/uapi/drm/i915_drm.h
> >>> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
> >>>   #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
> >>>   #define   I915_CONTEXT_DEFAULT_PRIORITY                0
> >>>   #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
> >>> +/*
> >>> + * When data port level coherency is enabled, the GPU will update 
> >>> memory
> >>> + * buffers shared with CPU, by forcing internal cache units to send 
> >>> memory
> >>> + * writes to higher level caches faster. Enabling data port 
> >>> coherency has
> >>> + * a performance cost.
> >>> + */
> >>
> >> I was under impression this is enabled by default and it can be disabled
> >> for a performance optimization?
> This is true, coherency is kept by default. We disable it as a 
> workaround: performance-related for gen11, and due to minor hardware 
> issue on previous platforms. See WaForceEnableNonCoherent.

Ok, then you definitely want to rephrase the comment to bake that
information in it. Now it sounds like it needs to be turned on to have
coherency.

Regards, Joonas
Lis, Tomasz July 19, 2018, 3:10 p.m. UTC | #6
On 2018-07-19 09:12, Joonas Lahtinen wrote:
> Quoting Lis, Tomasz (2018-07-18 18:28:32)
>> On 2018-07-18 16:42, Tvrtko Ursulin wrote:
>>> On 18/07/2018 14:24, Joonas Lahtinen wrote:
>>>> Quoting Tomasz Lis (2018-07-16 16:07:16)
> <SNIP>
>
>>>>> +++ b/include/uapi/drm/i915_drm.h
>>>>> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>>>>>    #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>>>>>    #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>>>>>    #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
>>>>> +/*
>>>>> + * When data port level coherency is enabled, the GPU will update
>>>>> memory
>>>>> + * buffers shared with CPU, by forcing internal cache units to send
>>>>> memory
>>>>> + * writes to higher level caches faster. Enabling data port
>>>>> coherency has
>>>>> + * a performance cost.
>>>>> + */
>>>> I was under impression this is enabled by default and it can be disabled
>>>> for a performance optimization?
>> This is true, coherency is kept by default. We disable it as a
>> workaround: performance-related for gen11, and due to minor hardware
>> issue on previous platforms. See WaForceEnableNonCoherent.
> Ok, then you definitely want to rephrase the comment to bake that
> information in it. Now it sounds like it needs to be turned on to have
> coherency.
I'm not sure if I understand what you're asking for.
Should I emphasize that the feature is disabled unless the flag is set? 
This seem obvious...
Or should I provide the reason why it is disabled on specific platforms? 
This should probably be done within workaround setup, not in user api 
definition. Or maybe it's enough to have it in Bspec? Bspec links are 
provided in the patch.
Or should I just mention the workaround name?
-Tomasz
> Regards, Joonas
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4fb9373..bae3999 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2524,6 +2524,7 @@  intel_info(const struct drm_i915_private *dev_priv)
 #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
 #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
 				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
 
 #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b10770c..44ebc31 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -784,6 +784,7 @@  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -804,10 +805,10 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_GTT_SIZE:
 		if (ctx->ppgtt)
 			args->value = ctx->ppgtt->vm.total;
-		else if (to_i915(dev)->mm.aliasing_ppgtt)
-			args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+		else if (i915->mm.aliasing_ppgtt)
+			args->value = i915->mm.aliasing_ppgtt->vm.total;
 		else
-			args->value = to_i915(dev)->ggtt.vm.total;
+			args->value = i915->ggtt.vm.total;
 		break;
 	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
 		args->value = i915_gem_context_no_error_capture(ctx);
@@ -818,6 +819,12 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority;
 		break;
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -830,6 +837,7 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *i915 = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -880,7 +888,7 @@  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 
 			if (args->size)
 				ret = -EINVAL;
-			else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
+			else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
 				ret = -ENODEV;
 			else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
 				 priority < I915_CONTEXT_MIN_USER_PRIORITY)
@@ -893,6 +901,19 @@  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (!HAS_DATA_PORT_COHERENCY(i915))
+			ret = -ENODEV;
+		else if (args->value == 1)
+			i915_gem_context_set_data_port_coherent(ctx);
+		else if (args->value == 0)
+			i915_gem_context_clear_data_port_coherent(ctx);
+		else
+			ret = -EINVAL;
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..9312343 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,8 @@  struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT_REQUESTED	6
+#define CONTEXT_DATA_PORT_COHERENT_ACTIVE	7
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -257,6 +259,21 @@  static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
 	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+	__set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+	__clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 6fef9d1..6a08e10 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,63 @@  intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
+
+	cs = intel_ring_begin(rq, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(rq->i915) >= 11)
+		reg = ICL_HDC_MODE;
+	else if (INTEL_GEN(rq->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static int
+intel_lr_context_update_data_port_coherency(struct i915_request *rq)
+{
+	struct i915_gem_context *ctx = rq->gem_context;
+	bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+	int ret;
+
+	lockdep_assert_held(&rq->i915->drm.struct_mutex);
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(rq, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+	}
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine, int prio)
 {
@@ -2133,7 +2190,7 @@  static int gen8_emit_flush_render(struct i915_request *request,
 		i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
 	bool vf_flush_wa = false, dc_flush_wa = false;
 	u32 *cs, flags = 0;
-	int len;
+	int err, len;
 
 	flags |= PIPE_CONTROL_CS_STALL;
 
@@ -2164,6 +2221,13 @@  static int gen8_emit_flush_render(struct i915_request *request,
 		/* WaForGAMHang:kbl */
 		if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
 			dc_flush_wa = true;
+
+		/* Emit the switch of data port coherency state if needed */
+		err = intel_lr_context_update_data_port_coherency(request);
+		if (GEM_WARN_ON(err)) {
+			DRM_DEBUG("Data Port Coherency toggle failed.\n");
+			return err;
+		}
 	}
 
 	len = 6;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..6ece759 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1456,6 +1456,13 @@  struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU will update memory
+ * buffers shared with CPU, by forcing internal cache units to send memory
+ * writes to higher level caches faster. Enabling data port coherency has
+ * a performance cost.
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
 	__u64 value;
 };