
[v1] drm/i915: Add Exec param to control data port coherency.

Message ID 1522430994-15366-1-git-send-email-tomasz.lis@intel.com (mailing list archive)
State New, archived

Commit Message

Lis, Tomasz March 30, 2018, 5:29 p.m. UTC
The patch adds a parameter to control the data port coherency functionality
on a per-exec call basis. When the data port coherency flag value differs
from the one used in the previous call for the context, a command to switch
the data port coherency state is emitted before the buffer to be executed.
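
To make the intended use concrete, below is a minimal userspace sketch (not
part of the patch) that queries the new getparam and requests coherency for a
single render-engine submission. It assumes the uapi additions from this patch
(I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY, I915_EXEC_DATA_PORT_COHERENT) are
visible in i915_drm.h; buffer setup, relocations and error handling are
omitted.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <i915_drm.h>

static int submit_batch(int fd, struct drm_i915_gem_exec_object2 *objects,
			unsigned int count, __u32 ctx_id, bool coherent)
{
	int has_switch = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY,
		.value = &has_switch,
	};
	struct drm_i915_gem_execbuffer2 execbuf;

	/* Older kernels/platforms cannot switch coherency per exec. */
	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) || !has_switch)
		coherent = false;

	memset(&execbuf, 0, sizeof(execbuf));
	execbuf.buffers_ptr = (uintptr_t)objects;
	execbuf.buffer_count = count;
	execbuf.flags = I915_EXEC_RENDER;	/* RCS only, per this patch */
	if (coherent)
		execbuf.flags |= I915_EXEC_DATA_PORT_COHERENT;
	i915_execbuffer2_set_context_id(execbuf, ctx_id);

	return ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}

On kernels without this patch the new bit would be rejected as an unknown exec
flag, hence the getparam check before setting it.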

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Keeping coherency at that level is disabled
by default due to its performance costs. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when it is not needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Architecture) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides the most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with the CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model, which even allows scattered reading from
different allocations at the same time (i.e. by utilizing the SIMD nature of
send instructions).
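
As a purely illustrative sketch (not part of the patch), such a node could
look like this in C, with the pointers referring to fine-grain SVM memory
that both CPU and GPU dereference through the same virtual addresses:

#include <stdint.h>

/* Each pointer may land in a completely different SVM allocation
 * (e.g. a separate clSVMAlloc region); the GPU follows it with
 * stateless send instructions, the CPU with plain loads/stores. */
struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};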

When it comes to coherency programming, send instructions in the stateless model
can be encoded (at the ISA level) to either use or disable coherency. However, for
generic OCL applications (such as the example with the tree-like data structure),
the OCL compiler is not able to determine the origin of memory pointed to by an
arbitrary pointer - i.e. it is not able to track a given pointer back to a specific
allocation. As such, it is not able to decide whether coherency is needed for a
specific pointer (or for a specific I/O instruction). As a result, the compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to work around this (e.g. based on an allocations map and pointer bounds
checking prior to each I/O instruction), but the performance cost of such a
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at the GEN ISA level
is not feasible and an alternative method is needed.

The alternative solution is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need the ability to toggle the
coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker ->
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
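
A minimal sketch of that loop from the UMD side, assuming two pre-filled
execbuffer2 structs (one per kernel) and a hypothetical work-available
predicate, could look like:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <i915_drm.h>

extern bool more_work_pending(void);	/* hypothetical application logic */

static void run_pipeline(int fd,
			 struct drm_i915_gem_execbuffer2 *master_exec,
			 struct drm_i915_gem_execbuffer2 *worker_exec)
{
	while (more_work_pending()) {
		/* kern_master touches fine-grain SVM: pay for snooping. */
		master_exec->flags |= I915_EXEC_DATA_PORT_COHERENT;
		ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, master_exec);

		/* kern_worker only reads the plain OCL buffer: skip it. */
		worker_exec->flags &= ~I915_EXEC_DATA_PORT_COHERENT;
		ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, worker_exec);
	}
}

Since the kernel emits the switching command only when the requested state
differs from the context's current one, each toggle costs a single LRI in the
ring rather than an extra ioctl.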

4. Why was the execbuffer flag approach chosen?

There are two other ways of providing the functionality to UMDs, besides the
execbuffer flag:

a) Chicken bit register whitelisting.

This approach would allow adding the functionality without any change to
the KMD interface. Also, it has been determined that whitelisting is safe for
gen10 and gen11. The issue is with gen9, where hardware whitelisting cannot
be used, and the OCL driver needs support for the feature there. A workaround
would be to use the command parser, which verifies buffers before execution,
but such parsing comes at a considerable performance cost.

b) Providing the flag as context IOCTL setting.

The data port coherency switch could be implemented as a context parameter,
which would schedule submission of a buffer to switch the coherency flag.
That is an elegant solution which binds the flag to the context, matching
the hardware placement of the feature. This solution was not accepted
because of OCL driver performance concerns. The OCL driver is constructed
with emphasis on creating small but very frequent submissions. With such an
architecture, adding a setparam IOCTL call before each submission that needs
a different setting has a considerable impact on performance.
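
For comparison, here is a sketch of what that rejected approach would look
like from userspace. DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM and struct
drm_i915_gem_context_param already exist, but the
I915_CONTEXT_PARAM_DATA_PORT_COHERENCY parameter is hypothetical and is not
added by this patch:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <i915_drm.h>

static int set_ctx_coherency(int fd, __u32 ctx_id, bool enable)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY, /* hypothetical */
		.value = enable,
	};

	/* One extra ioctl round-trip before every submission that needs
	 * a different setting - the overhead the OCL driver objects to. */
	return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}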

Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |  3 ++
 drivers/gpu/drm/i915/i915_gem_context.h    |  1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 17 ++++++++++
 drivers/gpu/drm/i915/intel_lrc.c           | 53 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h           |  3 ++
 include/uapi/drm/i915_drm.h                | 12 ++++++-
 6 files changed, 88 insertions(+), 1 deletion(-)

Comments

kernel test robot March 31, 2018, 7:07 p.m. UTC | #1
Hi Tomasz,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tomasz-Lis/drm-i915-Add-Exec-param-to-control-data-port-coherency/20180401-021313
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-x010-201813 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from drivers/gpu//drm/i915/i915_request.h:30:0,
                    from drivers/gpu//drm/i915/i915_gem_timeline.h:30,
                    from drivers/gpu//drm/i915/intel_ringbuffer.h:8,
                    from drivers/gpu//drm/i915/intel_lrc.h:27,
                    from drivers/gpu//drm/i915/i915_drv.h:63,
                    from drivers/gpu//drm/i915/i915_gem_execbuffer.c:38:
   drivers/gpu//drm/i915/i915_gem_execbuffer.c: In function 'i915_gem_do_execbuffer':
   drivers/gpu//drm/i915/i915_gem.h:47:54: warning: statement with no effect [-Wunused-value]
    #define GEM_WARN_ON(expr) (BUILD_BUG_ON_INVALID(expr), 0)
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
>> drivers/gpu//drm/i915/i915_gem_execbuffer.c:2389:2: note: in expansion of macro 'GEM_WARN_ON'
     GEM_WARN_ON(err);
     ^~~~~~~~~~~

vim +/GEM_WARN_ON +2389 drivers/gpu//drm/i915/i915_gem_execbuffer.c

  2182	
  2183	static int
  2184	i915_gem_do_execbuffer(struct drm_device *dev,
  2185			       struct drm_file *file,
  2186			       struct drm_i915_gem_execbuffer2 *args,
  2187			       struct drm_i915_gem_exec_object2 *exec,
  2188			       struct drm_syncobj **fences)
  2189	{
  2190		struct i915_execbuffer eb;
  2191		struct dma_fence *in_fence = NULL;
  2192		struct sync_file *out_fence = NULL;
  2193		int out_fence_fd = -1;
  2194		int err;
  2195	
  2196		BUILD_BUG_ON(__EXEC_INTERNAL_FLAGS & ~__I915_EXEC_ILLEGAL_FLAGS);
  2197		BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS &
  2198			     ~__EXEC_OBJECT_UNKNOWN_FLAGS);
  2199	
  2200		eb.i915 = to_i915(dev);
  2201		eb.file = file;
  2202		eb.args = args;
  2203		if (DBG_FORCE_RELOC || !(args->flags & I915_EXEC_NO_RELOC))
  2204			args->flags |= __EXEC_HAS_RELOC;
  2205	
  2206		eb.exec = exec;
  2207		eb.vma = (struct i915_vma **)(exec + args->buffer_count + 1);
  2208		eb.vma[0] = NULL;
  2209		eb.flags = (unsigned int *)(eb.vma + args->buffer_count + 1);
  2210	
  2211		eb.invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
  2212		if (USES_FULL_PPGTT(eb.i915))
  2213			eb.invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
  2214		reloc_cache_init(&eb.reloc_cache, eb.i915);
  2215	
  2216		eb.buffer_count = args->buffer_count;
  2217		eb.batch_start_offset = args->batch_start_offset;
  2218		eb.batch_len = args->batch_len;
  2219	
  2220		eb.batch_flags = 0;
  2221		if (args->flags & I915_EXEC_SECURE) {
  2222			if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
  2223			    return -EPERM;
  2224	
  2225			eb.batch_flags |= I915_DISPATCH_SECURE;
  2226		}
  2227		if (args->flags & I915_EXEC_IS_PINNED)
  2228			eb.batch_flags |= I915_DISPATCH_PINNED;
  2229	
  2230		eb.engine = eb_select_engine(eb.i915, file, args);
  2231		if (!eb.engine)
  2232			return -EINVAL;
  2233	
  2234		if (args->flags & I915_EXEC_RESOURCE_STREAMER) {
  2235			if (!HAS_RESOURCE_STREAMER(eb.i915)) {
  2236				DRM_DEBUG("RS is only allowed for Haswell, Gen8 and above\n");
  2237				return -EINVAL;
  2238			}
  2239			if (eb.engine->id != RCS) {
  2240				DRM_DEBUG("RS is not available on %s\n",
  2241					 eb.engine->name);
  2242				return -EINVAL;
  2243			}
  2244	
  2245			eb.batch_flags |= I915_DISPATCH_RS;
  2246		}
  2247	
  2248		if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
  2249			if (INTEL_GEN(eb.i915) < 9) {
  2250				DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
  2251				return -EINVAL;
  2252			}
  2253			if (eb.engine->class != RENDER_CLASS) {
  2254				DRM_DEBUG("Data Port Coherency is not available on %s\n",
  2255					 eb.engine->name);
  2256				return -EINVAL;
  2257			}
  2258		}
  2259	
  2260		if (args->flags & I915_EXEC_FENCE_IN) {
  2261			in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
  2262			if (!in_fence)
  2263				return -EINVAL;
  2264		}
  2265	
  2266		if (args->flags & I915_EXEC_FENCE_OUT) {
  2267			out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
  2268			if (out_fence_fd < 0) {
  2269				err = out_fence_fd;
  2270				goto err_in_fence;
  2271			}
  2272		}
  2273	
  2274		err = eb_create(&eb);
  2275		if (err)
  2276			goto err_out_fence;
  2277	
  2278		GEM_BUG_ON(!eb.lut_size);
  2279	
  2280		err = eb_select_context(&eb);
  2281		if (unlikely(err))
  2282			goto err_destroy;
  2283	
  2284		/*
  2285		 * Take a local wakeref for preparing to dispatch the execbuf as
  2286		 * we expect to access the hardware fairly frequently in the
  2287		 * process. Upon first dispatch, we acquire another prolonged
  2288		 * wakeref that we hold until the GPU has been idle for at least
  2289		 * 100ms.
  2290		 */
  2291		intel_runtime_pm_get(eb.i915);
  2292	
  2293		err = i915_mutex_lock_interruptible(dev);
  2294		if (err)
  2295			goto err_rpm;
  2296	
  2297		err = eb_relocate(&eb);
  2298		if (err) {
  2299			/*
  2300			 * If the user expects the execobject.offset and
  2301			 * reloc.presumed_offset to be an exact match,
  2302			 * as for using NO_RELOC, then we cannot update
  2303			 * the execobject.offset until we have completed
  2304			 * relocation.
  2305			 */
  2306			args->flags &= ~__EXEC_HAS_RELOC;
  2307			goto err_vma;
  2308		}
  2309	
  2310		if (unlikely(*eb.batch->exec_flags & EXEC_OBJECT_WRITE)) {
  2311			DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
  2312			err = -EINVAL;
  2313			goto err_vma;
  2314		}
  2315		if (eb.batch_start_offset > eb.batch->size ||
  2316		    eb.batch_len > eb.batch->size - eb.batch_start_offset) {
  2317			DRM_DEBUG("Attempting to use out-of-bounds batch\n");
  2318			err = -EINVAL;
  2319			goto err_vma;
  2320		}
  2321	
  2322		if (eb_use_cmdparser(&eb)) {
  2323			struct i915_vma *vma;
  2324	
  2325			vma = eb_parse(&eb, drm_is_current_master(file));
  2326			if (IS_ERR(vma)) {
  2327				err = PTR_ERR(vma);
  2328				goto err_vma;
  2329			}
  2330	
  2331			if (vma) {
  2332				/*
  2333				 * Batch parsed and accepted:
  2334				 *
  2335				 * Set the DISPATCH_SECURE bit to remove the NON_SECURE
  2336				 * bit from MI_BATCH_BUFFER_START commands issued in
  2337				 * the dispatch_execbuffer implementations. We
  2338				 * specifically don't want that set on batches the
  2339				 * command parser has accepted.
  2340				 */
  2341				eb.batch_flags |= I915_DISPATCH_SECURE;
  2342				eb.batch_start_offset = 0;
  2343				eb.batch = vma;
  2344			}
  2345		}
  2346	
  2347		if (eb.batch_len == 0)
  2348			eb.batch_len = eb.batch->size - eb.batch_start_offset;
  2349	
  2350		/*
  2351		 * snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
  2352		 * batch" bit. Hence we need to pin secure batches into the global gtt.
  2353		 * hsw should have this fixed, but bdw mucks it up again. */
  2354		if (eb.batch_flags & I915_DISPATCH_SECURE) {
  2355			struct i915_vma *vma;
  2356	
  2357			/*
  2358			 * So on first glance it looks freaky that we pin the batch here
  2359			 * outside of the reservation loop. But:
  2360			 * - The batch is already pinned into the relevant ppgtt, so we
  2361			 *   already have the backing storage fully allocated.
  2362			 * - No other BO uses the global gtt (well contexts, but meh),
  2363			 *   so we don't really have issues with multiple objects not
  2364			 *   fitting due to fragmentation.
  2365			 * So this is actually safe.
  2366			 */
  2367			vma = i915_gem_object_ggtt_pin(eb.batch->obj, NULL, 0, 0, 0);
  2368			if (IS_ERR(vma)) {
  2369				err = PTR_ERR(vma);
  2370				goto err_vma;
  2371			}
  2372	
  2373			eb.batch = vma;
  2374		}
  2375	
  2376		/* All GPU relocation batches must be submitted prior to the user rq */
  2377		GEM_BUG_ON(eb.reloc_cache.rq);
  2378	
  2379		/* Allocate a request for this batch buffer nice and early. */
  2380		eb.request = i915_request_alloc(eb.engine, eb.ctx);
  2381		if (IS_ERR(eb.request)) {
  2382			err = PTR_ERR(eb.request);
  2383			goto err_batch_unpin;
  2384		}
  2385	
  2386		/* Emit the switch of data port coherency state if needed */
  2387		err = intel_lr_context_modify_data_port_coherency(eb.request,
  2388				(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
> 2389		GEM_WARN_ON(err);
  2390	
  2391		if (in_fence) {
  2392			err = i915_request_await_dma_fence(eb.request, in_fence);
  2393			if (err < 0)
  2394				goto err_request;
  2395		}
  2396	
  2397		if (fences) {
  2398			err = await_fence_array(&eb, fences);
  2399			if (err)
  2400				goto err_request;
  2401		}
  2402	
  2403		if (out_fence_fd != -1) {
  2404			out_fence = sync_file_create(&eb.request->fence);
  2405			if (!out_fence) {
  2406				err = -ENOMEM;
  2407				goto err_request;
  2408			}
  2409		}
  2410	
  2411		/*
  2412		 * Whilst this request exists, batch_obj will be on the
  2413		 * active_list, and so will hold the active reference. Only when this
  2414		 * request is retired will the the batch_obj be moved onto the
  2415		 * inactive_list and lose its active reference. Hence we do not need
  2416		 * to explicitly hold another reference here.
  2417		 */
  2418		eb.request->batch = eb.batch;
  2419	
  2420		trace_i915_request_queue(eb.request, eb.batch_flags);
  2421		err = eb_submit(&eb);
  2422	err_request:
  2423		__i915_request_add(eb.request, err == 0);
  2424		add_to_client(eb.request, file);
  2425	
  2426		if (fences)
  2427			signal_fence_array(&eb, fences);
  2428	
  2429		if (out_fence) {
  2430			if (err == 0) {
  2431				fd_install(out_fence_fd, out_fence->file);
  2432				args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
  2433				args->rsvd2 |= (u64)out_fence_fd << 32;
  2434				out_fence_fd = -1;
  2435			} else {
  2436				fput(out_fence->file);
  2437			}
  2438		}
  2439	
  2440	err_batch_unpin:
  2441		if (eb.batch_flags & I915_DISPATCH_SECURE)
  2442			i915_vma_unpin(eb.batch);
  2443	err_vma:
  2444		if (eb.exec)
  2445			eb_release_vmas(&eb);
  2446		mutex_unlock(&dev->struct_mutex);
  2447	err_rpm:
  2448		intel_runtime_pm_put(eb.i915);
  2449		i915_gem_context_put(eb.ctx);
  2450	err_destroy:
  2451		eb_destroy(&eb);
  2452	err_out_fence:
  2453		if (out_fence_fd != -1)
  2454			put_unused_fd(out_fence_fd);
  2455	err_in_fence:
  2456		dma_fence_put(in_fence);
  2457		return err;
  2458	}
  2459	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d354627..030854e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -436,6 +436,9 @@  static int i915_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_PARAM_CS_TIMESTAMP_FREQUENCY:
 		value = 1000 * INTEL_INFO(dev_priv)->cs_timestamp_frequency_khz;
 		break;
+	case I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY:
+		value = (INTEL_GEN(dev_priv) >= 9);
+		break;
 	default:
 		DRM_DEBUG("Unknown parameter %d\n", param->param);
 		return -EINVAL;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 7854262..00aa309 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -118,6 +118,7 @@  struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 8c170db..e3a2f9e 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2245,6 +2245,18 @@  i915_gem_do_execbuffer(struct drm_device *dev,
 		eb.batch_flags |= I915_DISPATCH_RS;
 	}
 
+	if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
+		if (INTEL_GEN(eb.i915) < 9) {
+			DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
+			return -EINVAL;
+		}
+		if (eb.engine->class != RENDER_CLASS) {
+			DRM_DEBUG("Data Port Coherency is not available on %s\n",
+				 eb.engine->name);
+			return -EINVAL;
+		}
+	}
+
 	if (args->flags & I915_EXEC_FENCE_IN) {
 		in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
 		if (!in_fence)
@@ -2371,6 +2383,11 @@  i915_gem_do_execbuffer(struct drm_device *dev,
 		goto err_batch_unpin;
 	}
 
+	/* Emit the switch of data port coherency state if needed */
+	err = intel_lr_context_modify_data_port_coherency(eb.request,
+			(args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
+	GEM_WARN_ON(err);
+
 	if (in_fence) {
 		err = i915_request_await_dma_fence(eb.request, in_fence);
 		if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index f60b61b..2094494 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -254,6 +254,59 @@  intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+					bool enable)
+{
+	struct i915_gem_context *ctx = req->ctx;
+	int ret;
+
+	if (test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags) == enable)
+		return 0;
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	if (!ret) {
+		if (enable)
+			__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+		else
+			__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	}
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine,
 		struct i915_priotree *pt,
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 59d7b86..c46b239 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -111,4 +111,7 @@  intel_lr_context_descriptor(struct i915_gem_context *ctx,
 	return ctx->engine[engine->id].lrc_desc;
 }
 
+int intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+						bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..0f52793 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -529,6 +529,11 @@  typedef struct drm_i915_irq_wait {
  */
 #define I915_PARAM_CS_TIMESTAMP_FREQUENCY 51
 
+/* Query whether DRM_I915_GEM_EXECBUFFER2 supports the ability to switch
+ * Data Cache access into Data Port Coherency mode.
+ */
+#define I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY 52
+
 typedef struct drm_i915_getparam {
 	__s32 param;
 	/*
@@ -1048,7 +1053,12 @@  struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_FENCE_ARRAY   (1<<19)
 
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_ARRAY<<1))
+/* Data Port Coherency capability will be switched before an exec call
+ * which has this flag different than previous call for the context.
+ */
+#define I915_EXEC_DATA_PORT_COHERENT   (1<<20)
+
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \