diff mbox

[v1] drm/i915: Add IOCTL Param to control data port coherency.

Message ID 1529506987-16108-2-git-send-email-tomasz.lis@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Lis, Tomasz June 20, 2018, 3:03 p.m. UTC
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.

Rationale:

The OpenCL driver develpers requested a functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic question explaining background
of the functionality and reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                   ________________
                  |      NODE1     |
                  | uint64_t data  |
                  +----------------|
                  | NODE*  |  NODE*|
                  +--------+-------+
                    /              \
   ________________/                \________________
  |      NODE2     |                |      NODE3     |
  | uint64_t data  |                | uint64_t data  |
  +----------------|                +----------------|
  | NODE*  |  NODE*|                | NODE*  |  NODE*|
  +--------+-------+                +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).

When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and alternative method is needed.

Such alternative solution is to have a global coherency switch that allows
disabling coherency for single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...

Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>

Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 41 ++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.h |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c        | 51 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h        |  4 +++
 include/uapi/drm/i915_drm.h             |  1 +
 5 files changed, 103 insertions(+)

Comments

Joonas Lahtinen June 21, 2018, 6:39 a.m. UTC | #1
Changelog would be much appreciated. And this is not the first version
of the series. It helps to remind the reviewer that original
implementation was changed into IOCTl based on feedback. Please see the
git log in i915 for some examples.

Quoting Tomasz Lis (2018-06-20 18:03:07)
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
> 
> Rationale:
> 
> The OpenCL driver develpers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
> 
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
> 
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
> 
> 2. Why do we need a global coherency switch?
> 
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                    ________________
>                   |      NODE1     |
>                   | uint64_t data  |
>                   +----------------|
>                   | NODE*  |  NODE*|
>                   +--------+-------+
>                     /              \
>    ________________/                \________________
>   |      NODE2     |                |      NODE3     |
>   | uint64_t data  |                | uint64_t data  |
>   +----------------|                +----------------|
>   | NODE*  |  NODE*|                | NODE*  |  NODE*|
>   +--------+-------+                +--------+-------+
> 
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
> 
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
> 
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
> 
> 3. Will coherency switch be used frequently?
> 
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> 
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
> 
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>

<SNIP>

> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index ccf463a..ea65ae6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>         return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>  }
>  
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +       int ret;
> +       ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> +       if (!GEM_WARN_ON(ret))

I don't think there's need for the WARN as the error will be propagated
back to userspace?

> +               __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +       return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> +       int ret;
> +       ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> +       if (!GEM_WARN_ON(ret))

Ditto.

> +               __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +       return ret;
> +}
> +
>  int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>                                   struct drm_file *file)
>  {
> @@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>                                     struct drm_file *file)
>  {
> +       struct drm_i915_private *dev_priv = to_i915(dev);
>         struct drm_i915_file_private *file_priv = file->driver_priv;
>         struct drm_i915_gem_context_param *args = data;
>         struct i915_gem_context *ctx;
> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>         case I915_CONTEXT_PARAM_PRIORITY:
>                 args->value = ctx->sched.priority;
>                 break;
> +       case I915_CONTEXT_PARAM_COHERENCY:
> +               /*
> +                * ENODEV if the feature is not supported. This removes the need
> +                * of separate IS_SUPPORTED parameter.
> +                */

Code speaks for itself, the comment is not needed.

> +               if (INTEL_GEN(dev_priv) < 9)
> +                       ret = -ENODEV;
> +               else
> +                       args->value = i915_gem_context_is_data_port_coherent(ctx);
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> @@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>                 }
>                 break;
>  
> +       case I915_CONTEXT_PARAM_COHERENCY:
> +               if (args->size)
> +                       ret = -EINVAL;
> +               else if (INTEL_GEN(dev_priv) < 9)
> +                       ret = -ENODEV;
> +               else if (args->value)
> +                       ret = i915_gem_context_set_data_port_coherent(ctx);
> +               else
> +                       ret = i915_gem_context_clear_data_port_coherent(ctx);

Be more strict with the uAPI. Only accept values 0 or 1, then you leave
space for extension in the future.

> +               break;
> +
>         default:
>                 ret = -EINVAL;
>                 break;

<SNIP>

> +++ b/drivers/gpu/drm/i915/intel_lrc.c

I'm feeling this is not the right file. The bit is in hardware context,
and doesn't have so much to do with LRC.

> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>         ce->lrc_desc = desc;
>  }
>  
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> +       u32 *cs;
> +       i915_reg_t reg;
> +
> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +       cs = intel_ring_begin(req, 4);
> +       if (IS_ERR(cs))
> +               return PTR_ERR(cs);
> +
> +       if (INTEL_GEN(req->i915) >= 10)
> +               reg = CNL_HDC_CHICKEN0;
> +       else
> +               reg = HDC_CHICKEN0;
> +
> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
> +        *  true, we should enodev for CNL. */

This is exactly why we want the IGT tests to check for effects, not for
the register. Then we can get an answer by running the tests on all kind
of CNL systems at hand.

> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(reg);
> +       /* Enabling coherency means disabling the bit which forces it off */

Code is again very self explanatory without the comment.

> +       if (enable)
> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +       else
> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +       *cs++ = MI_NOOP;
> +
> +       intel_ring_advance(req, cs);
> +
> +       return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +               bool enable)
> +{
> +       struct i915_request *req;
> +       int ret;
> +
> +       req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> +       if (IS_ERR(req))
> +               return PTR_ERR(req);
> +
> +       ret = emit_set_data_port_coherency(req, enable);
> +
> +       i915_request_add(req);
> +
> +       return ret;
> +}

I'm thinking we should set this value when it has changed, when we insert the
requests into the command stream. So if you change back and forth, while
not emitting any requests, nothing really happens. If you change the value and
emit a request, we should emit a LRI before the jump to the commands.
Similary if you keep setting the value to the value it already was in,
nothing will happen, again.

> +
>  static struct i915_priolist *
>  lookup_priolist(struct intel_engine_cs *engine, int prio)
>  {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..214e291 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>  
>  void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>  
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +                                            bool enable);
> +
>  #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..fab072f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>  #define I915_CONTEXT_PARAM_BANNABLE    0x5
>  #define I915_CONTEXT_PARAM_PRIORITY    0x6
> +#define I915_CONTEXT_PARAM_COHERENCY   0x7

Please add this line after the indented context priorities.

>  #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */

Here.

Regards, Joonas
Chris Wilson June 21, 2018, 7:05 a.m. UTC | #2
Quoting Tomasz Lis (2018-06-20 16:03:07)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 33bc914..c69dc26 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>         ce->lrc_desc = desc;
>  }
>  
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> +       u32 *cs;
> +       i915_reg_t reg;
> +
> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> +       cs = intel_ring_begin(req, 4);
> +       if (IS_ERR(cs))
> +               return PTR_ERR(cs);
> +
> +       if (INTEL_GEN(req->i915) >= 10)
> +               reg = CNL_HDC_CHICKEN0;
> +       else
> +               reg = HDC_CHICKEN0;
> +
> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
> +        *  true, we should enodev for CNL. */
> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(reg);
> +       /* Enabling coherency means disabling the bit which forces it off */
> +       if (enable)
> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> +       else
> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> +       *cs++ = MI_NOOP;
> +
> +       intel_ring_advance(req, cs);
> +
> +       return 0;
> +}

There's nothing specific to the logical ringbuffer context here afaics.
It could have just been done inside the single
i915_gem_context_set_data_port_coherency(). Also makes it clearer that
i915_gem_context_set_data_port_coherency needs struct_mutex.

cmd = HDC_FORCE_NON_COHERENT << 16;
if (!coherent)
	cmd |= HDC_FORCE_NON_COHERENT;
*cs++ = cmd;

Does that read any clearer?

> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..214e291 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>  
>  void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>  
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> +                                            bool enable);
> +
>  #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..fab072f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>  #define I915_CONTEXT_PARAM_BANNABLE    0x5
>  #define I915_CONTEXT_PARAM_PRIORITY    0x6
> +#define I915_CONTEXT_PARAM_COHERENCY   0x7

DATAPORT_COHERENCY
There are many different caches.

There should be some commentary around here telling userspace what the
contract is.

>  #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>  #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>  #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */

COHERENCY has MAX/MIN_USER_PRIORITY, interesting. I thought it was just
a boolean.
-Chris
Dunajski, Bartosz June 21, 2018, 7:31 a.m. UTC | #3
I would like to add few things that were mentioned previously.

According to adoption plan.
Our plan is to drop dependency on LLVM 4.0.1 (with custom patches) and instead compile with unpatched (either system or vanilla) LLVM 6.0. Work to transition our compiler stack to LLVM 6 is expected to complete in late Q3. Additionally, we are refactoring our packaging, so instead of a single neo package with multiple components we will have multiple versioned packages with clear dependencies. 

We are coordinating with ClearLinux team to get included once that happens and plan to reach out to other OSVs to do the same.

Is this plan enough to consider NEO an actual opensource client for the coherency control patch ?

PR for patch usage:
https://github.com/intel/compute-runtime/pull/53 

Bartosz
Joonas Lahtinen June 21, 2018, 8:48 a.m. UTC | #4
+ Dave Airlie (The DRM subsystem maintainer) for FYI

Quoting Dunajski, Bartosz (2018-06-21 10:31:57)
> I would like to add few things that were mentioned previously.
> 
> According to adoption plan.
> Our plan is to drop dependency on LLVM 4.0.1 (with custom patches) and instead compile with unpatched (either system or vanilla) LLVM 6.0. Work to transition our compiler stack to LLVM 6 is expected to complete in late Q3. Additionally, we are refactoring our packaging, so instead of a single neo package with multiple components we will have multiple versioned packages with clear dependencies. 
> 
> We are coordinating with ClearLinux team to get included once that happens and plan to reach out to other OSVs to do the same.
> 
> Is this plan enough to consider NEO an actual opensource client for the coherency control patch ?

Yes, once you follow through with the plan, there should be no issues
about merging patches to support the driver.

You may want to squeeze your timeline to be complete before 4.19-rc5,
which is the feature cutoff date for 4.20, but that is rather an
ambitious goal. Your original schedule would land the patches before
4.20-rc5 resulting in inclusion to 4.21.

Regards, Joonas

PS. I'm going on a vacation for a couple of weeks.

> 
> PR for patch usage:
> https://github.com/intel/compute-runtime/pull/53 
> 
> Bartosz
Lis, Tomasz June 21, 2018, 1:47 p.m. UTC | #5
On 2018-06-21 09:05, Chris Wilson wrote:
> Quoting Tomasz Lis (2018-06-20 16:03:07)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 33bc914..c69dc26 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>>          ce->lrc_desc = desc;
>>   }
>>   
>> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
>> +{
>> +       u32 *cs;
>> +       i915_reg_t reg;
>> +
>> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +       cs = intel_ring_begin(req, 4);
>> +       if (IS_ERR(cs))
>> +               return PTR_ERR(cs);
>> +
>> +       if (INTEL_GEN(req->i915) >= 10)
>> +               reg = CNL_HDC_CHICKEN0;
>> +       else
>> +               reg = HDC_CHICKEN0;
>> +
>> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
>> +        *  true, we should enodev for CNL. */
>> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +       *cs++ = i915_mmio_reg_offset(reg);
>> +       /* Enabling coherency means disabling the bit which forces it off */
>> +       if (enable)
>> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +       else
>> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +       *cs++ = MI_NOOP;
>> +
>> +       intel_ring_advance(req, cs);
>> +
>> +       return 0;
>> +}
> There's nothing specific to the logical ringbuffer context here afaics.
> It could have just been done inside the single
> i915_gem_context_set_data_port_coherency(). Also makes it clearer that
> i915_gem_context_set_data_port_coherency needs struct_mutex.
>
> cmd = HDC_FORCE_NON_COHERENT << 16;
> if (!coherent)
> 	cmd |= HDC_FORCE_NON_COHERENT;
> *cs++ = cmd;
>
> Does that read any clearer?
Sorry, I don't think I follow.
Should I move the code out of logical ringbuffer context (intel_lrc.c)?
Should I merge the emit_set_data_port_coherency() with 
intel_lr_context_modify_data_port_coherency()?
Should I lock a mutex while adding the request?
-Tomasz
>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..214e291 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>   
>>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> +                                            bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..fab072f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>>   #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>>   #define I915_CONTEXT_PARAM_BANNABLE    0x5
>>   #define I915_CONTEXT_PARAM_PRIORITY    0x6
>> +#define I915_CONTEXT_PARAM_COHERENCY   0x7
> DATAPORT_COHERENCY
> There are many different caches.
>
> There should be some commentary around here telling userspace what the
> contract is.
Will do.
>
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
> COHERENCY has MAX/MIN_USER_PRIORITY, interesting. I thought it was just
> a boolean.
> -Chris
I did not noticed the structure of defines here; will move the new define.
-Tomasz
Lis, Tomasz June 21, 2018, 1:47 p.m. UTC | #6
On 2018-06-21 08:39, Joonas Lahtinen wrote:
> Changelog would be much appreciated. And this is not the first version
> of the series. It helps to remind the reviewer that original
> implementation was changed into IOCTl based on feedback. Please see the
> git log in i915 for some examples.
Will add. I considered this a separate series, as it is a different 
implementation.
>
> Quoting Tomasz Lis (2018-06-20 18:03:07)
>> The patch adds a parameter to control the data port coherency functionality
>> on a per-context level. When the IOCTL is called, a command to switch data
>> port coherency state is added to the ordered list. All prior requests are
>> executed on old coherency settings, and all exec requests after the IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver develpers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic question explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
>> overhead related to tracking (snooping) memory inside different cache units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
>> memory coherency between CPU and GPU). The goal of coherency enable/disable
>> switch is to remove overhead of memory coherency when memory coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
>> model is similar to regular memory load/store operations available on typical
>> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>>                     ________________
>>                    |      NODE1     |
>>                    | uint64_t data  |
>>                    +----------------|
>>                    | NODE*  |  NODE*|
>>                    +--------+-------+
>>                      /              \
>>     ________________/                \________________
>>    |      NODE2     |                |      NODE3     |
>>    | uint64_t data  |                | uint64_t data  |
>>    +----------------|                +----------------|
>>    | NODE*  |  NODE*|                | NODE*  |  NODE*|
>>    +--------+-------+                +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory locations
>> in different OCL allocations  - e.g. NODE1 and NODE2 can reside in one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
>> Virtual Memory feature). Using pointers from different allocations doesn't
>> affect the stateless addressing model which even allows scattered reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in stateless model
>> can be encoded (at ISA level) to either use or disable coherency. However, for
>> generic OCL applications (such as example with tree-like data structure), OCL
>> compiler is not able to determine origin of memory pointed to by an arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is needed or not
>> for specific pointer (or for specific I/O instruction). As a result, compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that it would be
>> possible to workaround this (e.g. based on allocations map and pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
>> the payload that kern_master produced. These two kernels work in a loop, one
>> after another. Since only kern_master requires coherency, kern_worker should
>> not be forced to pay for it. This means that we need to have the ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> <SNIP>
>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index ccf463a..ea65ae6 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>>          return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>>   }
>>   
>> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
>> +{
>> +       int ret;
>> +       ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> +       if (!GEM_WARN_ON(ret))
> I don't think there's need for the WARN as the error will be propagated
> back to userspace?
You're right.
>
>> +               __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +       return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>> +{
>> +       int ret;
>> +       ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> +       if (!GEM_WARN_ON(ret))
> Ditto.
ack
>
>> +               __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +       return ret;
>> +}
>> +
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>                                    struct drm_file *file)
>>   {
>> @@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>>                                      struct drm_file *file)
>>   {
>> +       struct drm_i915_private *dev_priv = to_i915(dev);
>>          struct drm_i915_file_private *file_priv = file->driver_priv;
>>          struct drm_i915_gem_context_param *args = data;
>>          struct i915_gem_context *ctx;
>> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>>          case I915_CONTEXT_PARAM_PRIORITY:
>>                  args->value = ctx->sched.priority;
>>                  break;
>> +       case I915_CONTEXT_PARAM_COHERENCY:
>> +               /*
>> +                * ENODEV if the feature is not supported. This removes the need
>> +                * of separate IS_SUPPORTED parameter.
>> +                */
> Code speaks for itself, the comment is not needed.
I don't think it is a good idea to limit comments. The current look of 
the code makes it hard for anyone new to work on it, as the only 
documentation is the history in mailing list.
I don't think it's the correct approach. I believe comments should be 
encouraged.

In this specific case, the code lets you know that ENODEV is returned 
below gen9. But there is no macro IS_DATA_PORT_COHERENCY_SUPPORTED() 
which would clearly indicate the cause of that, so comment is required.

>> +               if (INTEL_GEN(dev_priv) < 9)
>> +                       ret = -ENODEV;
>> +               else
>> +                       args->value = i915_gem_context_is_data_port_coherent(ctx);
>> +               break;
>>          default:
>>                  ret = -EINVAL;
>>                  break;
>> @@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>>                  }
>>                  break;
>>   
>> +       case I915_CONTEXT_PARAM_COHERENCY:
>> +               if (args->size)
>> +                       ret = -EINVAL;
>> +               else if (INTEL_GEN(dev_priv) < 9)
>> +                       ret = -ENODEV;
>> +               else if (args->value)
>> +                       ret = i915_gem_context_set_data_port_coherent(ctx);
>> +               else
>> +                       ret = i915_gem_context_clear_data_port_coherent(ctx);
> Be more strict with the uAPI. Only accept values 0 or 1, then you leave
> space for extension in the future.
Right. Will do.
>
>> +               break;
>> +
>>          default:
>>                  ret = -EINVAL;
>>                  break;
> <SNIP>
>
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> I'm feeling this is not the right file. The bit is in hardware context,
> and doesn't have so much to do with LRC.
Should I move it to i915_gem_context.c?
>
>> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>>          ce->lrc_desc = desc;
>>   }
>>   
>> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
>> +{
>> +       u32 *cs;
>> +       i915_reg_t reg;
>> +
>> +       GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> +       GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> +       cs = intel_ring_begin(req, 4);
>> +       if (IS_ERR(cs))
>> +               return PTR_ERR(cs);
>> +
>> +       if (INTEL_GEN(req->i915) >= 10)
>> +               reg = CNL_HDC_CHICKEN0;
>> +       else
>> +               reg = HDC_CHICKEN0;
>> +
>> +       /* FIXME: this feature may be unuseable on CNL; If this checks to be
>> +        *  true, we should enodev for CNL. */
> This is exactly why we want the IGT tests to check for effects, not for
> the register. Then we can get an answer by running the tests on all kind
> of CNL systems at hand.
This comment is actually outdated, I left it by mistake. Will remove.
>
>> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
>> +       *cs++ = i915_mmio_reg_offset(reg);
>> +       /* Enabling coherency means disabling the bit which forces it off */
> Code is again very self explanatory without the comment.
The logic is reversed, so that "enable" does a "disable". I believe the 
comment does a great job of assuring the reader that this is not just a 
coding mistake.

Do we have any official guidelines for limiting comments?
>
>> +       if (enable)
>> +               *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> +       else
>> +               *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> +       *cs++ = MI_NOOP;
>> +
>> +       intel_ring_advance(req, cs);
>> +
>> +       return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> +               bool enable)
>> +{
>> +       struct i915_request *req;
>> +       int ret;
>> +
>> +       req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> +       if (IS_ERR(req))
>> +               return PTR_ERR(req);
>> +
>> +       ret = emit_set_data_port_coherency(req, enable);
>> +
>> +       i915_request_add(req);
>> +
>> +       return ret;
>> +}
> I'm thinking we should set this value when it has changed, when we insert the
> requests into the command stream. So if you change back and forth, while
> not emitting any requests, nothing really happens. If you change the value and
> emit a request, we should emit a LRI before the jump to the commands.
> Similary if you keep setting the value to the value it already was in,
> nothing will happen, again.
When I considered that, my way of reasoning was:
If we execute the flag changing buffer right away, it may be sent to 
hardware faster if there is no job in progress.
If we use the lazy way, and trigger the change just before submission -  
there will be additional conditions in submission code, plus the change 
will be made when there is another job pending (though it's not a 
considerable payload to just switch a flag).
If user space switches the flag back and forth without much sense, then 
there is something wrong with the user space driver, and it shouldn't be 
up to kernel to fix that.

This is why I chosen the current approach. But I can change it if you wish.

>> +
>>   static struct i915_priolist *
>>   lookup_priolist(struct intel_engine_cs *engine, int prio)
>>   {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..214e291 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>   
>>   void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>   
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> +                                            bool enable);
>> +
>>   #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..fab072f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>>   #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>>   #define I915_CONTEXT_PARAM_BANNABLE    0x5
>>   #define I915_CONTEXT_PARAM_PRIORITY    0x6
>> +#define I915_CONTEXT_PARAM_COHERENCY   0x7
> Please add this line after the indented context priorities.
ack
>
>>   #define   I915_CONTEXT_MAX_USER_PRIORITY       1023 /* inclusive */
>>   #define   I915_CONTEXT_DEFAULT_PRIORITY                0
>>   #define   I915_CONTEXT_MIN_USER_PRIORITY       -1023 /* inclusive */
> Here.
>
> Regards, Joonas
Dunajski, Bartosz June 22, 2018, 4:40 p.m. UTC | #7
Additionally, we are already on Arch:
https://aur.archlinux.org/packages/compute-runtime 

Can I assume that adoption plan is not a blocker anymore?

Bartosz

> Yes, once you follow through with the plan, there should be no issues about merging patches to support the driver.

>

> You may want to squeeze your timeline to be complete before 4.19-rc5, which is the feature cutoff date for 4.20, but that is rather an ambitious goal. Your original schedule would land the patches before

> 4.20-rc5 resulting in inclusion to 4.21.

>

> Regards, Joonas

>

> PS. I'm going on a vacation for a couple of weeks.
Joonas Lahtinen July 18, 2018, 1:03 p.m. UTC | #8
Quoting Lis, Tomasz (2018-06-21 16:47:45)
> On 2018-06-21 08:39, Joonas Lahtinen wrote:
> > Quoting Tomasz Lis (2018-06-20 18:03:07)
> >>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >>                                      struct drm_file *file)
> >>   {
> >> +       struct drm_i915_private *dev_priv = to_i915(dev);
> >>          struct drm_i915_file_private *file_priv = file->driver_priv;
> >>          struct drm_i915_gem_context_param *args = data;
> >>          struct i915_gem_context *ctx;
> >> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >>          case I915_CONTEXT_PARAM_PRIORITY:
> >>                  args->value = ctx->sched.priority;
> >>                  break;
> >> +       case I915_CONTEXT_PARAM_COHERENCY:
> >> +               /*
> >> +                * ENODEV if the feature is not supported. This removes the need
> >> +                * of separate IS_SUPPORTED parameter.
> >> +                */
> > Code speaks for itself, the comment is not needed.
> I don't think it is a good idea to limit comments. The current look of 
> the code makes it hard for anyone new to work on it, as the only 
> documentation is the history in mailing list.
> I don't think it's the correct approach. I believe comments should be 
> encouraged.

It's not a matter of opinion. Code should be written clear enough that
comments are not needed.

<SNIP>

> >> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> >> +       *cs++ = i915_mmio_reg_offset(reg);
> >> +       /* Enabling coherency means disabling the bit which forces it off */
> > Code is again very self explanatory without the comment.
> The logic is reversed, so that "enable" does a "disable". I believe the 
> comment does a great job of assuring the reader that this is not just a 
> coding mistake.
> 
> Do we have any official guidelines for limiting comments?

Yes, avoid where humanly possible. And when you can't avoid, it should
explain "why" not what. I don't see such cases in this patch.

<SNIP>

> > I'm thinking we should set this value when it has changed, when we insert the
> > requests into the command stream. So if you change back and forth, while
> > not emitting any requests, nothing really happens. If you change the value and
> > emit a request, we should emit a LRI before the jump to the commands.
> > Similary if you keep setting the value to the value it already was in,
> > nothing will happen, again.
> When I considered that, my way of reasoning was:
> If we execute the flag changing buffer right away, it may be sent to 
> hardware faster if there is no job in progress.
> If we use the lazy way, and trigger the change just before submission -  
> there will be additional conditions in submission code, plus the change 
> will be made when there is another job pending (though it's not a 
> considerable payload to just switch a flag).
> If user space switches the flag back and forth without much sense, then 
> there is something wrong with the user space driver, and it shouldn't be 
> up to kernel to fix that.

A few register writes appended before jumping to the BB should not be a
performance concern. There will definitely be more overhead in sending a
whole separate request, so I'm not sure I follow whole picture.

So I still think it's right thing to do to only emit the commands as a
prelude when needed.

Regards, Joonas
Joonas Lahtinen July 18, 2018, 1:12 p.m. UTC | #9
Quoting Dunajski, Bartosz (2018-06-22 19:40:58)
> Additionally, we are already on Arch:
> https://aur.archlinux.org/packages/compute-runtime 

I'm not an Arch user myself, but my impression is that AUR [1] is equivalent
of Ubuntu's PPA where anybody can very much upload anything outside of
the support model of the distro.

> Can I assume that adoption plan is not a blocker anymore?

Due to above, I don't think that matter is changed to direction or
another.

Regards, Joonas

[1] https://wiki.archlinux.org/index.php/Arch_User_Repository#What_is_the_AUR.3F
> 
> Bartosz
> 
> > Yes, once you follow through with the plan, there should be no issues about merging patches to support the driver.
> >
> > You may want to squeeze your timeline to be complete before 4.19-rc5, which is the feature cutoff date for 4.20, but that is rather an ambitious goal. Your original schedule would land the patches before
> > 4.20-rc5 resulting in inclusion to 4.21.
> >
> > Regards, Joonas
> >
> > PS. I'm going on a vacation for a couple of weeks.
>
Dunajski, Bartosz July 18, 2018, 1:27 p.m. UTC | #10
You are right about the AUR. This is just a step into opensource community direction.
According to my previous answer about ClearLinux (and others), which is more important here. We are still coordinating this, but I think we are on the right path. And NEO can be considered as opensource client for the coherency patch.


> I'm not an Arch user myself, but my impression is that AUR [1] is equivalent of Ubuntu's PPA where anybody can very much upload anything outside of the support model of the distro.
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index ccf463a..ea65ae6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -711,6 +711,24 @@  static bool client_is_banned(struct drm_i915_file_private *file_priv)
 	return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
 }
 
+static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+	ret = intel_lr_context_modify_data_port_coherency(ctx, true);
+	if (!GEM_WARN_ON(ret))
+		__set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
+static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+	int ret;
+	ret = intel_lr_context_modify_data_port_coherency(ctx, false);
+	if (!GEM_WARN_ON(ret))
+		__clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+	return ret;
+}
+
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file)
 {
@@ -784,6 +802,7 @@  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -818,6 +837,16 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority;
 		break;
+	case I915_CONTEXT_PARAM_COHERENCY:
+		/*
+		 * ENODEV if the feature is not supported. This removes the need
+		 * of separate IS_SUPPORTED parameter.
+		 */
+		if (INTEL_GEN(dev_priv) < 9)
+			ret = -ENODEV;
+		else
+			args->value = i915_gem_context_is_data_port_coherent(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -830,6 +859,7 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
+	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
@@ -893,6 +923,17 @@  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_COHERENCY:
+		if (args->size)
+			ret = -EINVAL;
+		else if (INTEL_GEN(dev_priv) < 9)
+			ret = -ENODEV;
+		else if (args->value)
+			ret = i915_gem_context_set_data_port_coherent(ctx);
+		else
+			ret = i915_gem_context_clear_data_port_coherent(ctx);
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..e8ccb70 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,7 @@  struct i915_gem_context {
 #define CONTEXT_BANNABLE		3
 #define CONTEXT_BANNED			4
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	5
+#define CONTEXT_DATA_PORT_COHERENT	6
 
 	/**
 	 * @hw_id: - unique identifier for the context
@@ -257,6 +258,11 @@  static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
 	__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
 }
 
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+	return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+}
+
 static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
 {
 	return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 33bc914..c69dc26 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -258,6 +258,57 @@  intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
 	ce->lrc_desc = desc;
 }
 
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+	u32 *cs;
+	i915_reg_t reg;
+
+	GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+	GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (INTEL_GEN(req->i915) >= 10)
+		reg = CNL_HDC_CHICKEN0;
+	else
+		reg = HDC_CHICKEN0;
+
+	/* FIXME: this feature may be unuseable on CNL; If this checks to be
+	 *  true, we should enodev for CNL. */
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(reg);
+	/* Enabling coherency means disabling the bit which forces it off */
+	if (enable)
+		*cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+	else
+		*cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+	*cs++ = MI_NOOP;
+
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+		bool enable)
+{
+	struct i915_request *req;
+	int ret;
+
+	req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	ret = emit_set_data_port_coherency(req, enable);
+
+	i915_request_add(req);
+
+	return ret;
+}
+
 static struct i915_priolist *
 lookup_priolist(struct intel_engine_cs *engine, int prio)
 {
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..214e291 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@  struct i915_gem_context;
 
 void intel_lr_context_resume(struct drm_i915_private *dev_priv);
 
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+					     bool enable);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..fab072f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1453,6 +1453,7 @@  struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
 #define I915_CONTEXT_PARAM_PRIORITY	0x6
+#define I915_CONTEXT_PARAM_COHERENCY	0x7
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */