
[v4] drm/i915: Add IOCTL Param to control data port coherency.

Message ID 1531142457-4232-1-git-send-email-tomasz.lis@intel.com (mailing list archive)
State New, archived

Commit Message

Lis, Tomasz July 9, 2018, 1:20 p.m. UTC
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed with the old coherency setting, and all exec requests after the
IOCTL will use the new setting.
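From userspace, the new parameter would be driven through the existing context
setparam IOCTL. The sketch below is illustrative only: it mirrors
struct drm_i915_gem_context_param locally (layout assumed from the uapi
header) so it compiles standalone, and the drmIoctl() call a real client
would make is shown only in a comment.

```c
#include <stdint.h>

/* Local mirror of struct drm_i915_gem_context_param; an assumption for
 * this sketch - a real client includes <drm/i915_drm.h> instead. */
struct ctx_param_sketch {
	uint32_t ctx_id;
	uint32_t size;
	uint64_t param;
	uint64_t value;
};

#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7	/* from this patch */

/* Build the setparam arguments. The kernel checks are strict: size must
 * be 0 and value must be exactly 0 or 1, otherwise it returns -EINVAL. */
static struct ctx_param_sketch
coherency_param(uint32_t ctx_id, int enable)
{
	struct ctx_param_sketch p = {
		.ctx_id = ctx_id,
		.size = 0,
		.param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY,
		.value = enable ? 1 : 0,
	};
	/* A real client would now issue:
	 *   drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
	 */
	return p;
}
```

Reading the current state back works the same way through the getparam IOCTL
with the same param value.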

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Keeping coherency at that level is disabled
by default due to its performance costs. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
the CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when it is not needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), the Intel
GEN ISA (GEN Instruction Set Architecture) contains dedicated "send"
instructions. These send instructions provide several addressing models. One
of these addressing models (named "stateless") provides the most flexible I/O,
using plain virtual addresses (as opposed to buffer_handle+offset models).
This "stateless" model is similar to the regular memory load/store operations
available on typical CPUs. Since this model provides I/O using arbitrary
virtual addresses, it enables algorithmic designs that are based on
pointer-to-pointer concepts (e.g. a buffer of pointers). For instance, it
allows creating tree-like data structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with the CPU (i.e. using SVM -
the Shared Virtual Memory feature). Using pointers from different allocations
doesn't affect the stateless addressing model, which even allows scattered
reading from different allocations at the same time (i.e. by utilizing the
SIMD nature of send instructions).
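A node of the tree above can be sketched as a plain C struct; since the
stateless model uses ordinary virtual addresses, the child pointers can
legally cross allocation boundaries. The struct and field names here are
illustrative, not from the patch:

```c
#include <stdint.h>
#include <stddef.h>

/* Tree node matching the diagram: one payload plus two plain
 * virtual-address pointers, each of which may point into a different
 * OCL allocation (or into SVM memory shared with the CPU). */
struct node {
	uint64_t data;
	struct node *left;	/* e.g. same allocation as this node */
	struct node *right;	/* e.g. a completely separate allocation */
};
```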

When it comes to coherency programming, send instructions in the stateless
model can be encoded (at the ISA level) to either use or disable coherency.
However, for generic OCL applications (such as the example with the tree-like
data structure), the OCL compiler is not able to determine the origin of
memory pointed to by an arbitrary pointer - i.e. it is not able to track a
given pointer back to a specific allocation. As such, it is not able to decide
whether coherency is needed for a specific pointer (or for a specific I/O
instruction). As a result, the compiler encodes all stateless sends as
coherent (doing otherwise would lead to functional issues resulting from data
corruption). Please note that it would be possible to work around this (e.g.
based on an allocations map and pointer bounds checking prior to each I/O
instruction), but the performance cost of such a workaround would be many
times greater than the cost of keeping coherency always enabled. As such,
enabling/disabling memory coherency at the GEN ISA level is not feasible and
an alternative method is needed.

The alternative solution is a global coherency switch that allows disabling
coherency for a single (though entire) GPU submission. This is beneficial
because this way we:
* can enable (and pay for) coherency only in submissions that actually need
it (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with the CPU, some fine-grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with the
payload for kern_worker. Once kern_master is done, kern_worker kicks in and
processes the payload that kern_master produced. These two kernels work in a
loop, one after another. Since only kern_master requires coherency,
kern_worker should not be forced to pay for it. This means that we need the
ability to toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker ->
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
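The alternation above can be sketched as a simple loop. set_coherency() and
exec_kernel() below are stand-ins (not real API calls) that just record the
sequence a client would issue via the setparam and execbuffer ioctls:

```c
#include <string.h>

/* Stand-in stubs recording the issued sequence; in a real client these
 * would wrap DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM and the exec ioctl. */
static const char *trace[32];
static int ntrace;

static void set_coherency(int enable)
{
	trace[ntrace++] = enable ? "coh_on" : "coh_off";
}

static void exec_kernel(const char *name)
{
	trace[ntrace++] = name;
}

/* Each iteration pays for coherency only while kern_master runs. */
static void run_pipeline(int iterations)
{
	for (int i = 0; i < iterations; i++) {
		set_coherency(1);	/* kern_master touches fine-grain SVM */
		exec_kernel("kern_master");
		set_coherency(0);	/* kern_worker skips the snooping cost */
		exec_kernel("kern_worker");
	}
}
```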

v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
    Removed redundant GEM_WARN_ON()s. Improved conformance to the coding standard.
    Introduced a macro for checking whether hardware supports the feature.

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c        | 49 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
 include/uapi/drm/i915_drm.h             |  6 ++++
 6 files changed, 107 insertions(+)

Comments

Lionel Landwerlin July 9, 2018, 1:48 p.m. UTC | #1
On 09/07/18 14:20, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver develpers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>   drivers/gpu/drm/i915/intel_lrc.c        | 49 +++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>   include/uapi/drm/i915_drm.h             |  6 ++++
>   6 files changed, 107 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 09ab124..7d4bbd5 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)

Reading the documentation it seems that the bit you want to set is gone 
in ICL/Gen11.
Maybe limit this to >= 9 && < 11?

Cheers,

-
Lionel

>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..6db352e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>   	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>   }
>   
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> +	if (!ret)
> +		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> +	if (!ret)
> +		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;
> +}
> +
>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>   				  struct drm_file *file)
>   {
> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			ret = i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			ret = i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..e8ccb70 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,7 @@ struct i915_gem_context {
>   #define CONTEXT_BANNABLE		3
>   #define CONTEXT_BANNED			4
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
> +#define CONTEXT_DATA_PORT_COHERENT	6
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
>   	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dab..1f037e3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +	cs = intel_ring_begin(req, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(req->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	/* Enabling coherency means disabling the bit which forces it off */
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(req, cs);
> +
> +	return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable)
> +{
> +	struct i915_request *req;
> +	int ret;
> +
> +	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> +	if (IS_ERR(req))
> +		return PTR_ERR(req);
> +
> +	ret = emit_set_data_port_coherency(req, enable);
> +
> +	i915_request_add(req);
> +
> +	return ret;
> +}
> +
>   static struct i915_priolist *
>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>   {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..f6965ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>   
>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>   
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable);
> +
>   #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..e677bea 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to real RAM faster. Keeping such coherency has performance cost.
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>
Lis, Tomasz July 9, 2018, 2:03 p.m. UTC | #2
On 2018-07-09 15:48, Lionel Landwerlin wrote:
> On 09/07/18 14:20, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency 
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch 
>> data
>> port coherency state is added to the ordered list. All prior requests 
>> are
>> executed on old coherency settings, and all exec requests after the 
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver develpers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is 
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic question explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that 
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature 
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache 
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>> require
>> memory coherency between CPU and GPU). The goal of coherency 
>> enable/disable
>> switch is to remove overhead of memory coherency when memory 
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), 
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O 
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This 
>> "stateless"
>> model is similar to regular memory load/store operations available on 
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual 
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer 
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory 
>> locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>> Shared
>> Virtual Memory feature). Using pointers from different allocations 
>> doesn't
>> affect the stateless addressing model which even allows scattered 
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature 
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in 
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency. 
>> However, for
>> generic OCL applications (such as example with tree-like data 
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an 
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is 
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result, 
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that 
>> it would be
>> possible to workaround this (e.g. based on allocations map and 
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping 
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN 
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that 
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that 
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance 
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and 
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes 
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and 
>> processes
>> the payload that kern_master produced. These two kernels work in a 
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker 
>> should
>> not be forced to pay for it. This means that we need to have the 
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> 
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>      Introduced a macro for checking whether hardware supports the 
>> feature.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>> +++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>> +++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>   6 files changed, 107 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 09ab124..7d4bbd5 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>> *dev_priv)
>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>> EDRAM_ENABLED))
>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> Reading the documentation it seems that the bit you want to set is 
> gone in ICL/Gen11.
> Maybe limit this to >= 9 && < 11?
Icelake actually has the bit as well, just the address is different.
I will add its support as a separate patch as soon as the change which 
defines ICL_HDC_CHICKEN0 is accepted.
But in the current form - you are right, ICL is not supported.
I will update the condition.
-Tomasz
>
> Cheers,
>
> -
> Lionel
>
>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>> ((dev_priv)->info.hws_needs_physical)
>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..6db352e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct 
>> drm_i915_file_private *file_priv)
>>       return atomic_read(&file_priv->ban_score) >= 
>> I915_CLIENT_SCORE_BANNED;
>>   }
>>   +static int i915_gem_context_set_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> +    if (!ret)
>> +        __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> +    if (!ret)
>> +        __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>> +}
>> +
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>                     struct drm_file *file)
>>   {
>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_PRIORITY:
>>           args->value = ctx->sched.priority;
>>           break;
>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else
>> +            args->value = i915_gem_context_is_data_port_coherent(ctx);
>> +        break;
>>       default:
>>           ret = -EINVAL;
>>           break;
>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>           }
>>           break;
>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (args->size)
>> +            ret = -EINVAL;
>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else if (args->value == 1)
>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>> +        else if (args->value == 0)
>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>> +        else
>> +            ret = -EINVAL;
>> +        break;
>> +
>>       default:
>>           ret = -EINVAL;
>>           break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..e8ccb70 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>   #define CONTEXT_BANNABLE        3
>>   #define CONTEXT_BANNED            4
>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>         /**
>>        * @hw_id: - unique identifier for the context
>> @@ -257,6 +258,11 @@ static inline void 
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>   }
>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +}
>> +
>>   static inline bool i915_gem_context_is_default(const struct 
>> i915_gem_context *c)
>>   {
>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index ab89dab..1f037e3 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>> i915_gem_context *ctx,
>>       ce->lrc_desc = desc;
>>   }
>>   +static int emit_set_data_port_coherency(struct i915_request *req, 
>> bool enable)
>> +{
>> +    u32 *cs;
>> +    i915_reg_t reg;
>> +
>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +    cs = intel_ring_begin(req, 4);
>> +    if (IS_ERR(cs))
>> +        return PTR_ERR(cs);
>> +
>> +    if (INTEL_GEN(req->i915) >= 10)
>> +        reg = CNL_HDC_CHICKEN0;
>> +    else
>> +        reg = HDC_CHICKEN0;
>> +
>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +    *cs++ = i915_mmio_reg_offset(reg);
>> +    /* Enabling coherency means disabling the bit which forces it 
>> off */
>> +    if (enable)
>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +    else
>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +    *cs++ = MI_NOOP;
>> +
>> +    intel_ring_advance(req, cs);
>> +
>> +    return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable)
>> +{
>> +    struct i915_request *req;
>> +    int ret;
>> +
>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> +    if (IS_ERR(req))
>> +        return PTR_ERR(req);
>> +
>> +    ret = emit_set_data_port_coherency(req, enable);
>> +
>> +    i915_request_add(req);
>> +
>> +    return ret;
>> +}
>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..f6965ae 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..e677bea 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update memory
>> + * buffers shared with the CPU, by forcing internal cache units to send
>> + * memory writes to real RAM faster. Keeping such coherency has a
>> + * performance cost.
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>       __u64 value;
>>   };
>
>
Lionel Landwerlin July 9, 2018, 2:24 p.m. UTC | #3
On 09/07/18 15:03, Lis, Tomasz wrote:
>
>
> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>> On 09/07/18 14:20, Tomasz Lis wrote:
>>> The patch adds a parameter to control the data port coherency 
>>> functionality
>>> on a per-context level. When the IOCTL is called, a command to 
>>> switch data
>>> port coherency state is added to the ordered list. All prior 
>>> requests are
>>> executed with the old coherency setting, and all exec requests after 
>>> the IOCTL will use the new setting.
>>>
>>> Rationale:
>>>
>>> The OpenCL driver developers requested functionality to control cache
>>> coherency at data port level. Keeping the coherency at that level is 
>>> disabled
>>> by default due to its performance costs. OpenCL driver is planning to
>>> enable it for a small subset of submissions, when such functionality is
>>> required. Below are answers to basic questions explaining the background
>>> of the functionality and reasoning for the proposed implementation:
>>>
>>> 1. Why do we need a coherency enable/disable switch for memory that 
>>> is shared
>>> between CPU and GEN (GPU)?
>>>
>>> Memory coherency between CPU and GEN, while being a great feature 
>>> that enables
>>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>>> architecture, adds
>>> overhead related to tracking (snooping) memory inside different 
>>> cache units
>>> (L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
>>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>>> require
>>> memory coherency between CPU and GPU). The goal of coherency 
>>> enable/disable
>>> switch is to remove overhead of memory coherency when memory 
>>> coherency is not
>>> needed.
>>>
>>> 2. Why do we need a global coherency switch?
>>>
>>> In order to support I/O commands from within EUs (Execution Units), 
>>> Intel GEN
>>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>>> instructions.
>>> These send instructions provide several addressing models. One of these
>>> addressing models (named "stateless") provides the most flexible I/O 
>>> using plain
>>> virtual addresses (as opposed to buffer_handle+offset models). This 
>>> "stateless"
>>> model is similar to regular memory load/store operations available 
>>> on typical
>>> CPUs. Since this model provides I/O using arbitrary virtual 
>>> addresses, it
>>> enables algorithmic designs that are based on pointer-to-pointer 
>>> (e.g. buffer
>>> of pointers) concepts. For instance, it allows creating tree-like data
>>> structures such as:
>>>                     ________________
>>>                    |      NODE1     |
>>>                    | uint64_t data  |
>>>                    +----------------|
>>>                    | NODE*  |  NODE*|
>>>                    +--------+-------+
>>>                      /              \
>>>     ________________/                \________________
>>>    |      NODE2     |                |      NODE3     |
>>>    | uint64_t data  |                | uint64_t data  |
>>>    +----------------|                +----------------|
>>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>>    +--------+-------+                +--------+-------+
>>>
>>> Please note that pointers inside such structures can point to memory 
>>> locations
>>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in 
>>> one OCL
>>> allocation while NODE3 resides in a completely separate OCL allocation.
>>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>>> Shared
>>> Virtual Memory feature). Using pointers from different allocations 
>>> doesn't
>>> affect the stateless addressing model which even allows scattered 
>>> reading from
>>> different allocations at the same time (i.e. by utilizing 
>>> SIMD-nature of send
>>> instructions).
>>>
>>> When it comes to coherency programming, send instructions in 
>>> stateless model
>>> can be encoded (at ISA level) to either use or disable coherency. 
>>> However, for
>>> generic OCL applications (such as example with tree-like data 
>>> structure), OCL
>>> compiler is not able to determine origin of memory pointed to by an 
>>> arbitrary
>>> pointer - i.e. is not able to track given pointer back to a specific
>>> allocation. As such, it's not able to decide whether coherency is 
>>> needed or not
>>> for specific pointer (or for specific I/O instruction). As a result, 
>>> compiler
>>> encodes all stateless sends as coherent (doing otherwise would lead to
>>> functional issues resulting from data corruption). Please note that 
>>> it would be
>>> possible to workaround this (e.g. based on allocations map and 
>>> pointer bounds
>>> checking prior to each I/O instruction) but the performance cost of 
>>> such a
>>> workaround would be many times greater than the cost of keeping 
>>> coherency
>>> always enabled. As such, enabling/disabling memory coherency at GEN 
>>> ISA level
>>> is not feasible and an alternative method is needed.
>>>
>>> Such an alternative solution is to have a global coherency switch that 
>>> allows
>>> disabling coherency for a single (though entire) GPU submission. This is
>>> beneficial because this way we:
>>> * can enable (and pay for) coherency only in submissions that 
>>> actually need
>>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>>> * don't care about coherency at GEN ISA granularity (no performance 
>>> impact)
>>>
>>> 3. Will coherency switch be used frequently?
>>>
>>> There are scenarios that will require frequent toggling of the 
>>> coherency
>>> switch.
>>> E.g. an application has two OCL compute kernels: kern_master and 
>>> kern_worker.
>>> kern_master uses, concurrently with the CPU, some fine-grain SVM resources
>>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>>> computational work that needs to be executed. kern_master analyzes 
>>> incoming
>>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>>> with payload
>>> for kern_worker. Once kern_master is done, kern_worker kicks-in and 
>>> processes
>>> the payload that kern_master produced. These two kernels work in a 
>>> loop, one
>>> after another. Since only kern_master requires coherency, 
>>> kern_worker should
>>> not be forced to pay for it. This means that we need to have the 
>>> ability to
>>> toggle coherency switch on or off per each GPU submission:
>>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> 
>>> (ENABLE
>>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>>
>>> v2: Fixed compilation warning.
>>> v3: Refactored the patch to add IOCTL instead of exec flag.
>>> v4: Renamed and documented the API flag. Used strict values.
>>>      Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>>      Introduced a macro for checking whether hardware supports the 
>>> feature.
>>>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>>
>>> Bspec: 11419
>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>>> +++++++++++++++++++++++++++
>>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>>> +++++++++++++++++++++++++++++++++
>>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>>   6 files changed, 107 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>>> b/drivers/gpu/drm/i915/i915_drv.h
>>> index 09ab124..7d4bbd5 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>>> *dev_priv)
>>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>>> EDRAM_ENABLED))
>>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>
>> Reading the documentation it seems that the bit you want to set is 
>> gone in ICL/Gen11.
>> Maybe limit this to >= 9 && < 11?
> Icelake actually has the bit as well, just the address is different.
> I will add its support as a separate patch as soon as the change which 
> defines ICL_HDC_CHICKEN0 is accepted.
> But in the current form - you are right, ICL is not supported.
> I will update the condition.
> -Tomasz

Just out of curiosity, what address is ICL_HDC_CHICKEN0 at?

Thanks,

-
Lionel

Lis, Tomasz July 9, 2018, 3:21 p.m. UTC | #4
On 2018-07-09 16:24, Lionel Landwerlin wrote:
> On 09/07/18 15:03, Lis, Tomasz wrote:
>>
>>
>> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>>> On 09/07/18 14:20, Tomasz Lis wrote:
>>>> [snip: commit message and diffstat, quoted in full earlier in the thread]
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>>>> b/drivers/gpu/drm/i915/i915_drv.h
>>>> index 09ab124..7d4bbd5 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>>>> *dev_priv)
>>>>   #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & 
>>>> EDRAM_ENABLED))
>>>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>>
>>> Reading the documentation it seems that the bit you want to set is 
>>> gone in ICL/Gen11.
>>> Maybe limit this to >= 9 && < 11?
>> Icelake actually has the bit as well, just the address is different.
>> I will add its support as a separate patch as soon as the change 
>> which defines ICL_HDC_CHICKEN0 is accepted.
>> But in the current form - you are right, ICL is not supported.
>> I will update the condition.
>> -Tomasz
>
> Just out of curiosity, what address is ICL_HDC_CHICKEN0 at?
It was defined as _MMIO(0xE5F4). But now I see it is renamed to 
ICL_HDC_MODE, and already on the tip.
Bspec: 19175
Wow, looks like I can include the gen11 support already. Will add in 
next version.
Thank you!
>>>>           break;
>>>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>>> +        if (args->size)
>>>> +            ret = -EINVAL;
>>>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>>> +            ret = -ENODEV;
>>>> +        else if (args->value == 1)
>>>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>>>> +        else if (args->value == 0)
>>>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>>>> +        else
>>>> +            ret = -EINVAL;
>>>> +        break;
>>>> +
>>>>       default:
>>>>           ret = -EINVAL;
>>>>           break;
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>>>> b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> index b116e49..e8ccb70 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>>>   #define CONTEXT_BANNABLE        3
>>>>   #define CONTEXT_BANNED            4
>>>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>>>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>>>         /**
>>>>        * @hw_id: - unique identifier for the context
>>>> @@ -257,6 +258,11 @@ static inline void 
>>>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>>>   }
>>>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>>>> i915_gem_context *ctx)
>>>> +{
>>>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +}
>>>> +
>>>>   static inline bool i915_gem_context_is_default(const struct 
>>>> i915_gem_context *c)
>>>>   {
>>>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>>>> b/drivers/gpu/drm/i915/intel_lrc.c
>>>> index ab89dab..1f037e3 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>>>> i915_gem_context *ctx,
>>>>       ce->lrc_desc = desc;
>>>>   }
>>>>   +static int emit_set_data_port_coherency(struct i915_request 
>>>> *req, bool enable)
>>>> +{
>>>> +    u32 *cs;
>>>> +    i915_reg_t reg;
>>>> +
>>>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>>>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>>>> +
>>>> +    cs = intel_ring_begin(req, 4);
>>>> +    if (IS_ERR(cs))
>>>> +        return PTR_ERR(cs);
>>>> +
>>>> +    if (INTEL_GEN(req->i915) >= 10)
>>>> +        reg = CNL_HDC_CHICKEN0;
>>>> +    else
>>>> +        reg = HDC_CHICKEN0;
>>>> +
>>>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>>>> +    *cs++ = i915_mmio_reg_offset(reg);
>>>> +    /* Enabling coherency means disabling the bit which forces it 
>>>> off */
>>>> +    if (enable)
>>>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>>>> +    else
>>>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>>>> +    *cs++ = MI_NOOP;
>>>> +
>>>> +    intel_ring_advance(req, cs);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int
>>>> +intel_lr_context_modify_data_port_coherency(struct 
>>>> i915_gem_context *ctx,
>>>> +                        bool enable)
>>>> +{
>>>> +    struct i915_request *req;
>>>> +    int ret;
>>>> +
>>>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>>>> +    if (IS_ERR(req))
>>>> +        return PTR_ERR(req);
>>>> +
>>>> +    ret = emit_set_data_port_coherency(req, enable);
>>>> +
>>>> +    i915_request_add(req);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>   static struct i915_priolist *
>>>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>>>   {
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>>> index 1593194..f6965ae 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>>>   +int
>>>> +intel_lr_context_modify_data_port_coherency(struct 
>>>> i915_gem_context *ctx,
>>>> +                        bool enable);
>>>> +
>>>>   #endif /* _INTEL_LRC_H_ */
>>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>>> index 7f5634c..e677bea 100644
>>>> --- a/include/uapi/drm/i915_drm.h
>>>> +++ b/include/uapi/drm/i915_drm.h
>>>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>>>> +/*
>>>> + * When data port level coherency is enabled, the GPU will update 
>>>> memory
>>>> + * buffers shared with CPU, by forcing internal cache units to 
>>>> send memory
>>>> + * writes to real RAM faster. Keeping such coherency has 
>>>> performance cost.
>>>> + */
>>>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>>>       __u64 value;
>>>>   };
>>>
>>>
>>
>>
>
Tvrtko Ursulin July 9, 2018, 4:28 p.m. UTC | #5
On 09/07/2018 14:20, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver developers requested functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    | uint64_t data  |
>                    +----------------|
>                    | NODE*  |  NODE*|
>                    +--------+-------+
>                      /              \
>     ________________/                \________________
>    |      NODE2     |                |      NODE3     |
>    | uint64_t data  |                | uint64_t data  |
>    +----------------|                +----------------|
>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>    +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
> 
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
>      Removed redundant GEM_WARN_ON()s. Improved adherence to coding standard.
>      Introduced a macro for checking whether hardware supports the feature.
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>   drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>   drivers/gpu/drm/i915/intel_lrc.c        | 49 +++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>   include/uapi/drm/i915_drm.h             |  6 ++++
>   6 files changed, 107 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 09ab124..7d4bbd5 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
>   #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
>   #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
>   				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
>   
>   #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
>   
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..6db352e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>   	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>   }
>   
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> +	if (!ret)
> +		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	int ret;
> +
> +	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> +	if (!ret)
> +		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +	return ret;

Is there a good reason you allow userspace to keep emitting an 
unlimited number of commands which do not actually change the status? 
If not, please consider gating the command emission with 
test_and_set_bit/test_and_clear_bit. Hm.. although even with that they 
could keep toggling ad infinitum with no real work in between. Has it 
been considered to only save the desired state in set param and then 
emit it, if needed, before the next execbuf? Minor thing in any case, 
just curious since I wasn't following the threads.

> +}
> +
>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>   				  struct drm_file *file)
>   {
> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);

Feel free to use the local for the other existing to_i915(dev) call 
sites in here.

Also use i915 for the local name. Unless I915_READ/WRITE is used i915 is 
preferred nowadays.

>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority;
>   		break;
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else
> +			args->value = i915_gem_context_is_data_port_coherent(ctx);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> +	struct drm_i915_private *dev_priv = to_i915(dev);

As with get_param.

>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>   	struct drm_i915_gem_context_param *args = data;
>   	struct i915_gem_context *ctx;
> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> +		if (args->size)
> +			ret = -EINVAL;
> +		else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> +			ret = -ENODEV;
> +		else if (args->value == 1)
> +			ret = i915_gem_context_set_data_port_coherent(ctx);
> +		else if (args->value == 0)
> +			ret = i915_gem_context_clear_data_port_coherent(ctx);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..e8ccb70 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,7 @@ struct i915_gem_context {
>   #define CONTEXT_BANNABLE		3
>   #define CONTEXT_BANNED			4
>   #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
> +#define CONTEXT_DATA_PORT_COHERENT	6
>   
>   	/**
>   	 * @hw_id: - unique identifier for the context
> @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
>   	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>   }
>   
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +}
> +
>   static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>   {
>   	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dab..1f037e3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>   	ce->lrc_desc = desc;
>   }
>   
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)

After much disagreement we ended up with rq as the consistent naming for 
requests.

> +{
> +	u32 *cs;
> +	i915_reg_t reg;
> +
> +	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +	cs = intel_ring_begin(req, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	if (INTEL_GEN(req->i915) >= 10)
> +		reg = CNL_HDC_CHICKEN0;
> +	else
> +		reg = HDC_CHICKEN0;
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(reg);
> +	/* Enabling coherency means disabling the bit which forces it off */
> +	if (enable)
> +		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +	else
> +		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +	*cs++ = MI_NOOP;
> +
> +	intel_ring_advance(req, cs);
> +
> +	return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable)
> +{
> +	struct i915_request *req;

rq as above.

> +	int ret;
> +
> +	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> +	if (IS_ERR(req))
> +		return PTR_ERR(req);
> +
> +	ret = emit_set_data_port_coherency(req, enable);
> +
> +	i915_request_add(req);
> +
> +	return ret;
> +}
> +
>   static struct i915_priolist *
>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>   {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..f6965ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>   
>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>   
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +					    bool enable);
> +
>   #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..e677bea 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to real RAM faster. Keeping such coherency has performance cost.

Is this comment correct? Is it actually sending memory writes to _RAM_, 
or just the coherency mode enabled, even if only targeting CPU or 
shared cache, which adds a cost?

s/Keeping such coherency has performance cost./Enabling data port 
coherency has a performance cost./ ? Or "can have a performance cost"?

> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
>   	__u64 value;
>   };
>   
> 

Since I understand this design has been approved already on the high 
level, and as you can see I only had some minor comments to add, I can 
say that the patch in principle looks okay to me.

Regards,

Tvrtko
Chris Wilson July 9, 2018, 4:37 p.m. UTC | #6
Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
> 
> On 09/07/2018 14:20, Tomasz Lis wrote:
> > +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> > +{
> > +     int ret;
> > +
> > +     ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> > +     if (!ret)
> > +             __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> > +     return ret;
> 
> Is there a good reason you allow userspace to keep emitting unlimited 
> number of commands which actually do not change the status? If not 
> please consider gating the command emission with 
> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they 
> could keep toggling ad infinitum with no real work in between. Has it 
> been considered to only save the desired state in set param and then 
> emit it, if needed, before next execbuf? Minor thing in any case, just 
> curious since I wasn't following the threads.

The first patch tried to add a bit to execbuf, and having been
mistakenly down that road before, we asked if there was any alternative.
(Now if you've also been following execbuf3 conversations, having a
packet for privileged LRI is definitely something we want.)

Setting the value in the context register is precisely what we want to
do, and trivially serialised with execbuf since we have to serialise
reservation of ring space, i.e. the normal rules of request generation.
(execbuf is just a client and nothing special). From that point of view,
we only care about frequency, if it is very frequent it should be
controlled by userspace inside the batch (but it can't due to there
being dangerous bits inside the reg aiui). At the other end of the
scale, is context_setparam for set-once. And there should be no
inbetween as that requires costly batch flushes.
-Chris
Lis, Tomasz July 10, 2018, 5:32 p.m. UTC | #7
On 2018-07-09 18:37, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>> +     if (!ret)
>>> +             __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +     return ret;
>> Is there a good reason you allow userspace to keep emitting unlimited
>> number of commands which actually do not change the status? If not
>> please consider gating the command emission with
>> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they
>> could keep toggling ad infinitum with no real work in between. Has it
>> been considered to only save the desired state in set param and then
>> emit it, if needed, before next execbuf? Minor thing in any case, just
>> curious since I wasn't following the threads.
> The first patch tried to add a bit to execbuf, and having been
> mistakenly down that road before, we asked if there was any alternative.
> (Now if you've also been following execbuf3 conversations, having a
> packet for privileged LRI is definitely something we want.)
>
> Setting the value in the context register is precisely what we want to
> do, and trivially serialised with execbuf since we have to serialise
> reservation of ring space, i.e. the normal rules of request generation.
> (execbuf is just a client and nothing special). From that point of view,
> we only care about frequency, if it is very frequent it should be
> controlled by userspace inside the batch (but it can't due to there
> being dangerous bits inside the reg aiui). At the other end of the
> scale, is context_setparam for set-once. And there should be no
> inbetween as that requires costly batch flushes.
> -Chris
Joonas did bring up that concern in his review; here it is, with my response:

On 2018-06-21 15:47, Lis, Tomasz wrote:
> On 2018-06-21 08:39, Joonas Lahtinen wrote:
>> I'm thinking we should set this value when it has changed, when we 
>> insert the
>> requests into the command stream. So if you change back and forth, while
>> not emitting any requests, nothing really happens. If you change the 
>> value and
>> emit a request, we should emit a LRI before the jump to the commands.
>> Similary if you keep setting the value to the value it already was in,
>> nothing will happen, again.
> When I considered that, my way of reasoning was:
> If we execute the flag changing buffer right away, it may be sent to 
> hardware faster if there is no job in progress.
> If we use the lazy way, and trigger the change just before submission 
> -  there will be additional conditions in submission code, plus the 
> change will be made when there is another job pending (though it's not 
> a considerable payload to just switch a flag).
> If user space switches the flag back and forth without much sense, 
> then there is something wrong with the user space driver, and it 
> shouldn't be up to kernel to fix that.
>
> This is why I chosen the current approach. But I can change it if you 
> wish.

So while I think the current solution is better from a performance 
standpoint, I will change it if you request that.
-Tomasz
Lis, Tomasz July 10, 2018, 6:03 p.m. UTC | #8
On 2018-07-09 18:28, Tvrtko Ursulin wrote:
>
> On 09/07/2018 14:20, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency 
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch 
>> data
>> port coherency state is added to the ordered list. All prior requests 
>> are
>> executed on old coherency settings, and all exec requests after the 
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is 
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic questions explaining the background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that 
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature 
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN 
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache 
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence 
>> require
>> memory coherency between CPU and GPU). The goal of coherency 
>> enable/disable
>> switch is to remove overhead of memory coherency when memory 
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), 
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" 
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O 
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This 
>> "stateless"
>> model is similar to regular memory load/store operations available on 
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual 
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer 
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory 
>> locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in 
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - 
>> Shared
>> Virtual Memory feature). Using pointers from different allocations 
>> doesn't
>> affect the stateless addressing model which even allows scattered 
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature 
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in 
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency. 
>> However, for
>> generic OCL applications (such as example with tree-like data 
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an 
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is 
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result, 
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that 
>> it would be
>> possible to workaround this (e.g. based on allocations map and 
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping 
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN 
>> ISA level
>> is not feasible and an alternative method is needed.
>>
>> Such an alternative solution is to have a global coherency switch that 
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that 
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance 
>> impact)
>>
>> 3. Will the coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and 
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes 
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) 
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks in and 
>> processes
>> the payload that kern_master produced. These two kernels work in a 
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker 
>> should
>> not be forced to pay for it. This means that we need to have the 
>> ability to
>> toggle the coherency switch on or off for each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker ->
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>>      Removed redundant GEM_WARN_ON()s. Improved adherence to coding standard.
>>      Introduced a macro for checking whether hardware supports the 
>> feature.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>   drivers/gpu/drm/i915/i915_gem_context.c | 41 
>> +++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
>>   drivers/gpu/drm/i915/intel_lrc.c        | 49 
>> +++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
>>   include/uapi/drm/i915_drm.h             |  6 ++++
>>   6 files changed, 107 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 09ab124..7d4bbd5 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private 
>> *dev_priv)
>>   #define HAS_EDRAM(dev_priv)    (!!((dev_priv)->edram_cap & 
>> EDRAM_ENABLED))
>>   #define HAS_WT(dev_priv)    ((IS_HASWELL(dev_priv) || \
>>                    IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>     #define HWS_NEEDS_PHYSICAL(dev_priv) 
>> ((dev_priv)->info.hws_needs_physical)
>>   diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..6db352e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct 
>> drm_i915_file_private *file_priv)
>>       return atomic_read(&file_priv->ban_score) >= 
>> I915_CLIENT_SCORE_BANNED;
>>   }
>>   +static int i915_gem_context_set_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> +    if (!ret)
>> +        __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    int ret;
>> +
>> +    ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> +    if (!ret)
>> +        __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +    return ret;
>
> Is there a good reason you allow userspace to keep emitting an unlimited 
> number of commands which actually do not change the status? If not 
> please consider gating the command emission with 
> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they 
> could keep toggling ad infinitum with no real work in between. Has it 
> been considered to only save the desired state in set param and then 
> emit it, if needed, before next execbuf? Minor thing in any case, just 
> curious since I wasn't following the threads.
>
(discussed further in separate thread)
>> +}
>> +
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>                     struct drm_file *file)
>>   {
>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>
> Feel free to use the local for the other existing to_i915(dev) call 
> sites in here.
>
> Also use i915 for the local name. Unless I915_READ/WRITE is used i915 
> is preferred nowadays.
Will do.
>
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>       case I915_CONTEXT_PARAM_PRIORITY:
>>           args->value = ctx->sched.priority;
>>           break;
>> +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else
>> +            args->value = i915_gem_context_is_data_port_coherent(ctx);
>> +        break;
>>       default:
>>           ret = -EINVAL;
>>           break;
>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct 
>> drm_device *dev, void *data,
>>   int i915_gem_context_setparam_ioctl(struct drm_device *dev, void 
>> *data,
>>                       struct drm_file *file)
>>   {
>> +    struct drm_i915_private *dev_priv = to_i915(dev);
>
> As with get_param.
Ack.
>
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>       struct drm_i915_gem_context_param *args = data;
>>       struct i915_gem_context *ctx;
>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct 
>> drm_device *dev, void *data,
>>           }
>>           break;
>>   +    case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> +        if (args->size)
>> +            ret = -EINVAL;
>> +        else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> +            ret = -ENODEV;
>> +        else if (args->value == 1)
>> +            ret = i915_gem_context_set_data_port_coherent(ctx);
>> +        else if (args->value == 0)
>> +            ret = i915_gem_context_clear_data_port_coherent(ctx);
>> +        else
>> +            ret = -EINVAL;
>> +        break;
>> +
>>       default:
>>           ret = -EINVAL;
>>           break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h 
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..e8ccb70 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>   #define CONTEXT_BANNABLE        3
>>   #define CONTEXT_BANNED            4
>>   #define CONTEXT_FORCE_SINGLE_SUBMISSION    5
>> +#define CONTEXT_DATA_PORT_COHERENT    6
>>         /**
>>        * @hw_id: - unique identifier for the context
>> @@ -257,6 +258,11 @@ static inline void 
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>       __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>   }
>>   +static inline bool i915_gem_context_is_data_port_coherent(struct 
>> i915_gem_context *ctx)
>> +{
>> +    return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +}
>> +
>>   static inline bool i915_gem_context_is_default(const struct 
>> i915_gem_context *c)
>>   {
>>       return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index ab89dab..1f037e3 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct 
>> i915_gem_context *ctx,
>>       ce->lrc_desc = desc;
>>   }
>>   +static int emit_set_data_port_coherency(struct i915_request *req, 
>> bool enable)
>
> After much disagreement we ended up with rq as the consistent naming 
> for requests.
:)
ok.
>
>> +{
>> +    u32 *cs;
>> +    i915_reg_t reg;
>> +
>> +    GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +    GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +    cs = intel_ring_begin(req, 4);
>> +    if (IS_ERR(cs))
>> +        return PTR_ERR(cs);
>> +
>> +    if (INTEL_GEN(req->i915) >= 10)
>> +        reg = CNL_HDC_CHICKEN0;
>> +    else
>> +        reg = HDC_CHICKEN0;
>> +
>> +    *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +    *cs++ = i915_mmio_reg_offset(reg);
>> +    /* Enabling coherency means disabling the bit which forces it 
>> off */
>> +    if (enable)
>> +        *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +    else
>> +        *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +    *cs++ = MI_NOOP;
>> +
>> +    intel_ring_advance(req, cs);
>> +
>> +    return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable)
>> +{
>> +    struct i915_request *req;
>
> rq as above.
ack
>
>> +    int ret;
>> +
>> +    req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> +    if (IS_ERR(req))
>> +        return PTR_ERR(req);
>> +
>> +    ret = emit_set_data_port_coherency(req, enable);
>> +
>> +    i915_request_add(req);
>> +
>> +    return ret;
>> +}
>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..f6965ae 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>     void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context 
>> *ctx,
>> +                        bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..e677bea 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY    1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY        0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY    -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update 
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send 
>> memory
>> + * writes to real RAM faster. Keeping such coherency has performance 
>> cost.
>
> Is this comment correct? Is it actually sending memory writes to 
> _RAM_, or just the coherency mode enabled, even if only targetting CPU 
> or shared cache, which adds a cost?
I'm not sure whether there are further coherency modes to choose how 
"deep" coherency goes. The use case of OCL Team is to see gradual 
changes in the buffers on CPU side while the execution progresses. Write 
to RAM is needed to achieve that. And that limits performance by using 
RAM bandwidth.
>
> s/Keeping such coherency has performance cost./Enabling data port 
> coherency has a performance cost./ ? Or "can have a performance cost"?
I would prefer "Enabling data port coherency has a performance cost.". 
There likely are workloads with unmeasurable performance impact, but in 
real world using more memory writes will always slow down something.
>
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY    0x7
>>       __u64 value;
>>   };
>>
>
> Since I understand this design has been approved already on the high 
> level, and as you can see I only had some minor comments to add, I can 
> say that the patch in principle looks okay to me.
Great; will produce a v5 soon.
-Tomasz
>
> Regards,
>
> Tvrtko
Tvrtko Ursulin July 11, 2018, 9:28 a.m. UTC | #9
On 10/07/2018 18:32, Lis, Tomasz wrote:
> On 2018-07-09 18:37, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
>>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>>>> +{
>>>> +     int ret;
>>>> +
>>>> +     ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>>> +     if (!ret)
>>>> +             __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +     return ret;
>>> Is there a good reason you allow userspace to keep emitting an unlimited
>>> number of commands which actually do not change the status? If not
>>> please consider gating the command emission with
>>> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they
>>> could keep toggling ad infinitum with no real work in between. Has it
>>> been considered to only save the desired state in set param and then
>>> emit it, if needed, before next execbuf? Minor thing in any case, just
>>> curious since I wasn't following the threads.
>> The first patch tried to add a bit to execbuf, and having been
>> mistakenly down that road before, we asked if there was any alternative.
>> (Now if you've also been following execbuf3 conversations, having a
>> packet for privileged LRI is definitely something we want.)
>>
>> Setting the value in the context register is precisely what we want to
>> do, and trivially serialised with execbuf since we have to serialise
>> reservation of ring space, i.e. the normal rules of request generation.
>> (execbuf is just a client and nothing special). From that point of view,
>> we only care about frequency, if it is very frequent it should be
>> controlled by userspace inside the batch (but it can't due to there
>> being dangerous bits inside the reg aiui). At the other end of the
>> scale, is context_setparam for set-once. And there should be no
>> inbetween as that requires costly batch flushes.
>> -Chris
> Joonas did bring up that concern in his review; here it is, with my response:
> 
> On 2018-06-21 15:47, Lis, Tomasz wrote:
>> On 2018-06-21 08:39, Joonas Lahtinen wrote:
>>> I'm thinking we should set this value when it has changed, when we 
>>> insert the
>>> requests into the command stream. So if you change back and forth, while
>>> not emitting any requests, nothing really happens. If you change the 
>>> value and
>>> emit a request, we should emit a LRI before the jump to the commands.
>>> Similarly, if you keep setting the value to the value it already was,
>>> nothing will happen, again.
>> When I considered that, my way of reasoning was:
>> If we execute the flag changing buffer right away, it may be sent to 
>> hardware faster if there is no job in progress.
>> If we use the lazy way, and trigger the change just before submission 
>> -  there will be additional conditions in submission code, plus the 
>> change will be made when there is another job pending (though it's not 
>> a considerable payload to just switch a flag).
>> If user space switches the flag back and forth without much sense, 
>> then there is something wrong with the user space driver, and it 
>> shouldn't be up to kernel to fix that.
>>
>> This is why I chose the current approach. But I can change it if you 
>> wish.
> 
> So while I think the current solution is better from a performance 
> standpoint, I will change it if you request that.

Sounds like an interesting dilemma and I can see both arguments.

But for me I still prefer the option where coherency programming is 
emitted lazily on state change only. We do emit a bunch of pipe controls 
to invalidate caches and such as preamble to every request so that fits 
nicely. Advantage I see is that the set param ioctl remains very light 
and doesn't do any command submission, keeping in spirit and expectation 
with all current parameters. It makes the ioctl much quicker and as a 
secondary benefit it protects userspace form their own sillyness.

Regards,

Tvrtko
Lis, Tomasz July 11, 2018, 11:20 a.m. UTC | #10
On 2018-07-10 20:03, Lis, Tomasz wrote:
>
>
> On 2018-07-09 18:28, Tvrtko Ursulin wrote:
>>
>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h 
>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>> index 1593194..f6965ae 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>> [...]
>>> +/*
>>> + * When data port level coherency is enabled, the GPU will update 
>>> memory
>>> + * buffers shared with CPU, by forcing internal cache units to send 
>>> memory
>>> + * writes to real RAM faster. Keeping such coherency has 
>>> performance cost.
>>
>> Is this comment correct? Is it actually sending memory writes to 
>> _RAM_, or just the coherency mode enabled, even if only targetting 
>> CPU or shared cache, which adds a cost?
> I'm not sure whether there are further coherency modes to choose how 
> "deep" coherency goes. The use case of OCL Team is to see gradual 
> changes in the buffers on CPU side while the execution progresses. 
> Write to RAM is needed to achieve that. And that limits performance by 
> using RAM bandwidth.

It was pointed out to me that last level cache is shared between CPU and 
GPU on non-atoms. Which means my argument was invalid, and most likely 
the coherency option does not enforce a RAM write. I will update the comment.
-Tomasz

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 09ab124..7d4bbd5 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2524,6 +2524,7 @@  intel_info(const struct drm_i915_private *dev_priv)
 #define HAS_EDRAM(dev_priv)	(!!((dev_priv)->edram_cap & EDRAM_ENABLED))
 #define HAS_WT(dev_priv)	((IS_HASWELL(dev_priv) || \
 				 IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv)	(INTEL_GEN(dev_priv) >= 9)
 
 #define HWS_NEEDS_PHYSICAL(dev_priv)	((dev_priv)->info.hws_needs_physical)
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b10770c..6db352e 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -711,6 +711,26 @@  static bool client_is_banned(struct drm_i915_file_private *file_priv)
 	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
 }
 
+static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+
+	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
+	if (!ret)
+		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
+static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+
+	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
+	if (!ret)
+		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file)
 {
@@ -784,6 +804,7 @@  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -818,6 +839,12 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority;
 		break;
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (!HAS_DATA_PORT_COHERENCY(dev_priv))
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -830,6 +857,7 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -893,6 +921,19 @@  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
+			ret = -ENODEV;
+		else if (args->value == 1)
+			ret = i915_gem_context_set_data_port_coherent(ctx);
+		else if (args->value == 0)
+			ret = i915_gem_context_clear_data_port_coherent(ctx);
+		else
+			ret = -EINVAL;
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..e8ccb70 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,7 @@  struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -257,6 +258,11 @@  static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
 	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ab89dab..1f037e3 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,55 @@  intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+					    bool enable)
+{
+	struct i915_request *req;
+	int ret;
+
+	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	i915_request_add(req);
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine, int prio)
 {
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..f6965ae 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@  struct i915_gem_context;
 
 void intel_lr_context_resume(struct drm_i915_private *dev_priv);
 
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+					    bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..e677bea 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1456,6 +1456,12 @@  struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU will update memory
+ * buffers shared with CPU, by forcing internal cache units to send memory
+ * writes to real RAM faster. Keeping such coherency has performance cost.
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY	0x7
 	__u64 value;
 };