From patchwork Mon Jul 9 13:20:57 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Lis, Tomasz" X-Patchwork-Id: 10514595 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 574056032A for ; Mon, 9 Jul 2018 13:21:04 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 435BD28AB1 for ; Mon, 9 Jul 2018 13:21:04 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3782B28B01; Mon, 9 Jul 2018 13:21:04 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00, MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 62AA728AB1 for ; Mon, 9 Jul 2018 13:21:03 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id ABA776E174; Mon, 9 Jul 2018 13:21:02 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id 197556E174 for ; Mon, 9 Jul 2018 13:21:00 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga007.fm.intel.com ([10.253.24.52]) by orsmga101.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 09 Jul 2018 06:21:00 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.51,330,1526367600"; d="scan'208";a="53039596" Received: from szara.igk.intel.com ([172.28.178.192]) by fmsmga007.fm.intel.com with ESMTP; 09 Jul 2018 06:20:58 -0700 From: Tomasz Lis To: intel-gfx@lists.freedesktop.org Date: Mon, 9 Jul 2018 15:20:57 +0200 Message-Id: <1531142457-4232-1-git-send-email-tomasz.lis@intel.com> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1521463055-5325-2-git-send-email-tomasz.lis@intel.com> References: <1521463055-5325-2-git-send-email-tomasz.lis@intel.com> Subject: [Intel-gfx] [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency. X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: bartosz.dunajski@intel.com MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP The patch adds a parameter to control the data port coherency functionality on a per-context level. When the IOCTL is called, a command to switch data port coherency state is added to the ordered list. All prior requests are executed on old coherency settings, and all exec requests after the IOCTL will use new settings. Rationale: The OpenCL driver develpers requested a functionality to control cache coherency at data port level. Keeping the coherency at that level is disabled by default due to its performance costs. OpenCL driver is planning to enable it for a small subset of submissions, when such functionality is required. Below are answers to basic question explaining background of the functionality and reasoning for the proposed implementation: 1. Why do we need a coherency enable/disable switch for memory that is shared between CPU and GEN (GPU)? Memory coherency between CPU and GEN, while being a great feature that enables CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds overhead related to tracking (snooping) memory inside different cache units (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require memory coherency between CPU and GPU). The goal of coherency enable/disable switch is to remove overhead of memory coherency when memory coherency is not needed. 2. Why do we need a global coherency switch? In order to support I/O commands from within EUs (Execution Units), Intel GEN ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions. These send instructions provide several addressing models. One of these addressing models (named "stateless") provides most flexible I/O using plain virtual addresses (as opposed to buffer_handle+offset models). This "stateless" model is similar to regular memory load/store operations available on typical CPUs. Since this model provides I/O using arbitrary virtual addresses, it enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer of pointers) concepts. For instance, it allows creating tree-like data structures such as: ________________ | NODE1 | | uint64_t data | +----------------| | NODE* | NODE*| +--------+-------+ / \ ________________/ \________________ | NODE2 | | NODE3 | | uint64_t data | | uint64_t data | +----------------| +----------------| | NODE* | NODE*| | NODE* | NODE*| +--------+-------+ +--------+-------+ Please note that pointers inside such structures can point to memory locations in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL allocation while NODE3 resides in a completely separate OCL allocation. Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared Virtual Memory feature). Using pointers from different allocations doesn't affect the stateless addressing model which even allows scattered reading from different allocations at the same time (i.e. by utilizing SIMD-nature of send instructions). When it comes to coherency programming, send instructions in stateless model can be encoded (at ISA level) to either use or disable coherency. However, for generic OCL applications (such as example with tree-like data structure), OCL compiler is not able to determine origin of memory pointed to by an arbitrary pointer - i.e. is not able to track given pointer back to a specific allocation. As such, it's not able to decide whether coherency is needed or not for specific pointer (or for specific I/O instruction). As a result, compiler encodes all stateless sends as coherent (doing otherwise would lead to functional issues resulting from data corruption). Please note that it would be possible to workaround this (e.g. based on allocations map and pointer bounds checking prior to each I/O instruction) but the performance cost of such workaround would be many times greater than the cost of keeping coherency always enabled. As such, enabling/disabling memory coherency at GEN ISA level is not feasible and alternative method is needed. Such alternative solution is to have a global coherency switch that allows disabling coherency for single (though entire) GPU submission. This is beneficial because this way we: * can enable (and pay for) coherency only in submissions that actually need coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources) * don't care about coherency at GEN ISA granularity (no performance impact) 3. Will coherency switch be used frequently? There are scenarios that will require frequent toggling of the coherency switch. E.g. an application has two OCL compute kernels: kern_master and kern_worker. kern_master uses, concurrently with CPU, some fine grain SVM resources (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of computational work that needs to be executed. kern_master analyzes incoming work descriptors and populates a plain OCL buffer (non-fine-grain) with payload for kern_worker. Once kern_master is done, kern_worker kicks-in and processes the payload that kern_master produced. These two kernels work in a loop, one after another. Since only kern_master requires coherency, kern_worker should not be forced to pay for it. This means that we need to have the ability to toggle coherency switch on or off per each GPU submission: (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ... v2: Fixed compilation warning. v3: Refactored the patch to add IOCTL instead of exec flag. v4: Renamed and documented the API flag. Used strict values. Removed redundant GEM_WARN_ON()s. Improved to coding standard. Introduced a macro for checking whether hardware supports the feature. Cc: Joonas Lahtinen Cc: Tvrtko Ursulin Cc: Chris Wilson Cc: Michal Winiarski Bspec: 11419 Signed-off-by: Tomasz Lis --- drivers/gpu/drm/i915/i915_drv.h | 1 + drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++ drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++ drivers/gpu/drm/i915/intel_lrc.c | 49 +++++++++++++++++++++++++++++++++ drivers/gpu/drm/i915/intel_lrc.h | 4 +++ include/uapi/drm/i915_drm.h | 6 ++++ 6 files changed, 107 insertions(+) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 09ab124..7d4bbd5 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv) #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED)) #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \ IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv)) +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9) #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical) diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c index b10770c..6db352e 100644 --- a/drivers/gpu/drm/i915/i915_gem_context.c +++ b/drivers/gpu/drm/i915/i915_gem_context.c @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv) return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED; } +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx) +{ + int ret; + + ret = intel_lr_context_modify_data_port_coherency(ctx, true); + if (!ret) + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags); + return ret; +} + +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx) +{ + int ret; + + ret = intel_lr_context_modify_data_port_coherency(ctx, false); + if (!ret) + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags); + return ret; +} + int i915_gem_context_create_ioctl(struct drm_device *dev, void *data, struct drm_file *file) { @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data, int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data, struct drm_file *file) { + struct drm_i915_private *dev_priv = to_i915(dev); struct drm_i915_file_private *file_priv = file->driver_priv; struct drm_i915_gem_context_param *args = data; struct i915_gem_context *ctx; @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data, case I915_CONTEXT_PARAM_PRIORITY: args->value = ctx->sched.priority; break; + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY: + if (!HAS_DATA_PORT_COHERENCY(dev_priv)) + ret = -ENODEV; + else + args->value = i915_gem_context_is_data_port_coherent(ctx); + break; default: ret = -EINVAL; break; @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data, int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data, struct drm_file *file) { + struct drm_i915_private *dev_priv = to_i915(dev); struct drm_i915_file_private *file_priv = file->driver_priv; struct drm_i915_gem_context_param *args = data; struct i915_gem_context *ctx; @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data, } break; + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY: + if (args->size) + ret = -EINVAL; + else if (!HAS_DATA_PORT_COHERENCY(dev_priv)) + ret = -ENODEV; + else if (args->value == 1) + ret = i915_gem_context_set_data_port_coherent(ctx); + else if (args->value == 0) + ret = i915_gem_context_clear_data_port_coherent(ctx); + else + ret = -EINVAL; + break; + default: ret = -EINVAL; break; diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h index b116e49..e8ccb70 100644 --- a/drivers/gpu/drm/i915/i915_gem_context.h +++ b/drivers/gpu/drm/i915/i915_gem_context.h @@ -126,6 +126,7 @@ struct i915_gem_context { #define CONTEXT_BANNABLE 3 #define CONTEXT_BANNED 4 #define CONTEXT_FORCE_SINGLE_SUBMISSION 5 +#define CONTEXT_DATA_PORT_COHERENT 6 /** * @hw_id: - unique identifier for the context @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_ __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags); } +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx) +{ + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags); +} + static inline bool i915_gem_context_is_default(const struct i915_gem_context *c) { return c->user_handle == DEFAULT_CONTEXT_HANDLE; diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index ab89dab..1f037e3 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx, ce->lrc_desc = desc; } +static int emit_set_data_port_coherency(struct i915_request *req, bool enable) +{ + u32 *cs; + i915_reg_t reg; + + GEM_BUG_ON(req->engine->class != RENDER_CLASS); + GEM_BUG_ON(INTEL_GEN(req->i915) < 9); + + cs = intel_ring_begin(req, 4); + if (IS_ERR(cs)) + return PTR_ERR(cs); + + if (INTEL_GEN(req->i915) >= 10) + reg = CNL_HDC_CHICKEN0; + else + reg = HDC_CHICKEN0; + + *cs++ = MI_LOAD_REGISTER_IMM(1); + *cs++ = i915_mmio_reg_offset(reg); + /* Enabling coherency means disabling the bit which forces it off */ + if (enable) + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT); + else + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT); + *cs++ = MI_NOOP; + + intel_ring_advance(req, cs); + + return 0; +} + +int +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx, + bool enable) +{ + struct i915_request *req; + int ret; + + req = i915_request_alloc(ctx->i915->engine[RCS], ctx); + if (IS_ERR(req)) + return PTR_ERR(req); + + ret = emit_set_data_port_coherency(req, enable); + + i915_request_add(req); + + return ret; +} + static struct i915_priolist * lookup_priolist(struct intel_engine_cs *engine, int prio) { diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index 1593194..f6965ae 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -104,4 +104,8 @@ struct i915_gem_context; void intel_lr_context_resume(struct drm_i915_private *dev_priv); +int +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx, + bool enable); + #endif /* _INTEL_LRC_H_ */ diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index 7f5634c..e677bea 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param { #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */ #define I915_CONTEXT_DEFAULT_PRIORITY 0 #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */ +/* + * When data port level coherency is enabled, the GPU will update memory + * buffers shared with CPU, by forcing internal cache units to send memory + * writes to real RAM faster. Keeping such coherency has performance cost. + */ +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7 __u64 value; };