[RFC,007/162] drm/i915: split wa_bb code to its own file

Message ID	20201127120718.454037-8-matthew.auld@intel.com (mailing list archive)
State	New, archived
Headers
From: Matthew Auld <matthew.auld@intel.com>
To: intel-gfx@lists.freedesktop.org
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>, Chris P Wilson <chris.p.wilson@intel.com>, Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>, dri-devel@lists.freedesktop.org, John Harrison <John.C.Harrison@Intel.com>
Subject: [RFC PATCH 007/162] drm/i915: split wa_bb code to its own file
Date: Fri, 27 Nov 2020 12:04:43 +0000
Message-Id: <20201127120718.454037-8-matthew.auld@intel.com>
In-Reply-To: <20201127120718.454037-1-matthew.auld@intel.com>
References: <20201127120718.454037-1-matthew.auld@intel.com>
Series	DG1 + LMEM enabling

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index f9ef5199b124..2445cc990e15 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -92,6 +92,7 @@ gt-y += \
 	gt/intel_engine_heartbeat.o \
 	gt/intel_engine_pm.o \
 	gt/intel_engine_user.o \
+	gt/intel_engine_workaround_bb.o \
 	gt/intel_execlists_submission.o \
 	gt/intel_ggtt.o \
 	gt/intel_ggtt_fencing.o \
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_workaround_bb.c b/drivers/gpu/drm/i915/gt/intel_engine_workaround_bb.c
new file mode 100644
index 000000000000..b03bdfc92bb2
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_engine_workaround_bb.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2014 Intel Corporation
+ */
+
+#include "i915_drv.h"
+#include "intel_engine_types.h"
+#include "intel_engine_workaround_bb.h"
+#include "intel_execlists_submission.h" /* XXX */
+#include "intel_gpu_commands.h"
+#include "intel_gt.h"
+
+/*
+ * In this WA we need to set GEN8_L3SQCREG4[21:21] and reset it after
+ * PIPE_CONTROL instruction. This is required for the flush to happen correctly
+ * but there is a slight complication as this is applied in WA batch where the
+ * values are only initialized once so we cannot take register value at the
+ * beginning and reuse it further; hence we save its value to memory, upload a
+ * constant value with bit21 set and then we restore it back with the saved value.
+ * To simplify the WA, a constant value is formed by using the default value
+ * of this register. This shouldn't be a problem because we are only modifying
+ * it for a short period and this batch in non-premptible. We can ofcourse
+ * use additional instructions that read the actual value of the register
+ * at that time and set our bit of interest but it makes the WA complicated.
+ *
+ * This WA is also required for Gen9 so extracting as a function avoids
+ * code duplication.
+ */
+static u32 *
+gen8_emit_flush_coherentl3_wa(struct intel_engine_cs *engine, u32 *batch)
+{
+	/* NB no one else is allowed to scribble over scratch + 256! */
+	*batch++ = MI_STORE_REGISTER_MEM_GEN8 | MI_SRM_LRM_GLOBAL_GTT;
+	*batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4);
+	*batch++ = intel_gt_scratch_offset(engine->gt,
+					   INTEL_GT_SCRATCH_FIELD_COHERENTL3_WA);
+	*batch++ = 0;
+
+	*batch++ = MI_LOAD_REGISTER_IMM(1);
+	*batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4);
+	*batch++ = 0x40400000 | GEN8_LQSC_FLUSH_COHERENT_LINES;
+
+	batch = gen8_emit_pipe_control(batch,
+				       PIPE_CONTROL_CS_STALL |
+				       PIPE_CONTROL_DC_FLUSH_ENABLE,
+				       0);
+
+	*batch++ = MI_LOAD_REGISTER_MEM_GEN8 | MI_SRM_LRM_GLOBAL_GTT;
+	*batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4);
+	*batch++ = intel_gt_scratch_offset(engine->gt,
+					   INTEL_GT_SCRATCH_FIELD_COHERENTL3_WA);
+	*batch++ = 0;
+
+	return batch;
+}
+
+/*
+ * Typically we only have one indirect_ctx and per_ctx batch buffer which are
+ * initialized at the beginning and shared across all contexts but this field
+ * helps us to have multiple batches at different offsets and select them based
+ * on a criteria. At the moment this batch always start at the beginning of the page
+ * and at this point we don't have multiple wa_ctx batch buffers.
+ *
+ * The number of WA applied are not known at the beginning; we use this field
+ * to return the no of DWORDS written.
+ *
+ * It is to be noted that this batch does not contain MI_BATCH_BUFFER_END
+ * so it adds NOOPs as padding to make it cacheline aligned.
+ * MI_BATCH_BUFFER_END will be added to perctx batch and both of them together
+ * makes a complete batch buffer.
+ */
+static u32 *gen8_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch)
+{
+	/* WaDisableCtxRestoreArbitration:bdw,chv */
+	*batch++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+
+	/* WaFlushCoherentL3CacheLinesAtContextSwitch:bdw */
+	if (IS_BROADWELL(engine->i915))
+		batch = gen8_emit_flush_coherentl3_wa(engine, batch);
+
+	/* WaClearSlmSpaceAtContextSwitch:bdw,chv */
+	/* Actual scratch location is at 128 bytes offset */
+	batch = gen8_emit_pipe_control(batch,
+				       PIPE_CONTROL_FLUSH_L3 |
+				       PIPE_CONTROL_STORE_DATA_INDEX |
+				       PIPE_CONTROL_CS_STALL |
+				       PIPE_CONTROL_QW_WRITE,
+				       LRC_PPHWSP_SCRATCH_ADDR);
+
+	*batch++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+
+	/* Pad to end of cacheline */
+	while ((unsigned long)batch % CACHELINE_BYTES)
+		*batch++ = MI_NOOP;
+
+	/*
+	 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because
+	 * execution depends on the length specified in terms of cache lines
+	 * in the register CTX_RCS_INDIRECT_CTX
+	 */
+
+	return batch;
+}
+
+struct lri {
+	i915_reg_t reg;
+	u32 value;
+};
+
+static u32 *emit_lri(u32 *batch, const struct lri *lri, unsigned int count)
+{
+	GEM_BUG_ON(!count || count > 63);
+
+	*batch++ = MI_LOAD_REGISTER_IMM(count);
+	do {
+		*batch++ = i915_mmio_reg_offset(lri->reg);
+		*batch++ = lri->value;
+	} while (lri++, --count);
+	*batch++ = MI_NOOP;
+
+	return batch;
+}
+
+static u32 *gen9_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch)
+{
+	static const struct lri lri[] = {
+		/* WaDisableGatherAtSetShaderCommonSlice:skl,bxt,kbl,glk */
+		{
+			COMMON_SLICE_CHICKEN2,
+			__MASKED_FIELD(GEN9_DISABLE_GATHER_AT_SET_SHADER_COMMON_SLICE,
+				       0),
+		},
+
+		/* BSpec: 11391 */
+		{
+			FF_SLICE_CHICKEN,
+			__MASKED_FIELD(FF_SLICE_CHICKEN_CL_PROVOKING_VERTEX_FIX,
+				       FF_SLICE_CHICKEN_CL_PROVOKING_VERTEX_FIX),
+		},
+
+		/* BSpec: 11299 */
+		{
+			_3D_CHICKEN3,
+			__MASKED_FIELD(_3D_CHICKEN_SF_PROVOKING_VERTEX_FIX,
+				       _3D_CHICKEN_SF_PROVOKING_VERTEX_FIX),
+		}
+	};
+
+	*batch++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+
+	/* WaFlushCoherentL3CacheLinesAtContextSwitch:skl,bxt,glk */
+	batch = gen8_emit_flush_coherentl3_wa(engine, batch);
+
+	/* WaClearSlmSpaceAtContextSwitch:skl,bxt,kbl,glk,cfl */
+	batch = gen8_emit_pipe_control(batch,
+				       PIPE_CONTROL_FLUSH_L3 |
+				       PIPE_CONTROL_STORE_DATA_INDEX |
+				       PIPE_CONTROL_CS_STALL |
+				       PIPE_CONTROL_QW_WRITE,
+				       LRC_PPHWSP_SCRATCH_ADDR);
+
+	batch = emit_lri(batch, lri, ARRAY_SIZE(lri));
+
+	/* WaMediaPoolStateCmdInWABB:bxt,glk */
+	if (HAS_POOLED_EU(engine->i915)) {
+		/*
+		 * EU pool configuration is setup along with golden context
+		 * during context initialization. This value depends on
+		 * device type (2x6 or 3x6) and needs to be updated based
+		 * on which subslice is disabled especially for 2x6
+		 * devices, however it is safe to load default
+		 * configuration of 3x6 device instead of masking off
+		 * corresponding bits because HW ignores bits of a disabled
+		 * subslice and drops down to appropriate config. Please
+		 * see render_state_setup() in i915_gem_render_state.c for
+		 * possible configurations, to avoid duplication they are
+		 * not shown here again.
+		 */
+		*batch++ = GEN9_MEDIA_POOL_STATE;
+		*batch++ = GEN9_MEDIA_POOL_ENABLE;
+		*batch++ = 0x00777000;
+		*batch++ = 0;
+		*batch++ = 0;
+		*batch++ = 0;
+	}
+
+	*batch++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+
+	/* Pad to end of cacheline */
+	while ((unsigned long)batch % CACHELINE_BYTES)
+		*batch++ = MI_NOOP;
+
+	return batch;
+}
+
+static u32 *
+gen10_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch)
+{
+	int i;
+
+	/*
+	 * WaPipeControlBefore3DStateSamplePattern: cnl
+	 *
+	 * Ensure the engine is idle prior to programming a
+	 * 3DSTATE_SAMPLE_PATTERN during a context restore.
+	 */
+	batch = gen8_emit_pipe_control(batch,
+				       PIPE_CONTROL_CS_STALL,
+				       0);
+	/*
+	 * WaPipeControlBefore3DStateSamplePattern says we need 4 dwords for
+	 * the PIPE_CONTROL followed by 12 dwords of 0x0, so 16 dwords in
+	 * total. However, a PIPE_CONTROL is 6 dwords long, not 4, which is
+	 * confusing. Since gen8_emit_pipe_control() already advances the
+	 * batch by 6 dwords, we advance the other 10 here, completing a
+	 * cacheline. It's not clear if the workaround requires this padding
+	 * before other commands, or if it's just the regular padding we would
+	 * already have for the workaround bb, so leave it here for now.
+	 */
+	for (i = 0; i < 10; i++)
+		*batch++ = MI_NOOP;
+
+	/* Pad to end of cacheline */
+	while ((unsigned long)batch % CACHELINE_BYTES)
+		*batch++ = MI_NOOP;
+
+	return batch;
+}
+
+#define CTX_WA_BB_OBJ_SIZE (PAGE_SIZE)
+
+static int lrc_setup_wa_ctx(struct intel_engine_cs *engine)
+{
+	struct drm_i915_gem_object *obj;
+	struct i915_vma *vma;
+	int err;
+
+	obj = i915_gem_object_create_shmem(engine->i915, CTX_WA_BB_OBJ_SIZE);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	vma = i915_vma_instance(obj, &engine->gt->ggtt->vm, NULL);
+	if (IS_ERR(vma)) {
+		err = PTR_ERR(vma);
+		goto err;
+	}
+
+	err = i915_ggtt_pin(vma, NULL, 0, PIN_HIGH);
+	if (err)
+		goto err;
+
+	engine->wa_ctx.vma = vma;
+	return 0;
+
+err:
+	i915_gem_object_put(obj);
+	return err;
+}
+
+typedef u32 *(*wa_bb_func_t)(struct intel_engine_cs *engine, u32 *batch);
+
+int intel_init_workaround_bb(struct intel_engine_cs *engine)
+{
+	struct i915_ctx_workarounds *wa_ctx = &engine->wa_ctx;
+	struct i915_wa_ctx_bb *wa_bb[2] = { &wa_ctx->indirect_ctx,
+					    &wa_ctx->per_ctx };
+	wa_bb_func_t wa_bb_fn[2];
+	void *batch, *batch_ptr;
+	unsigned int i;
+	int ret;
+
+	if (engine->class != RENDER_CLASS)
+		return 0;
+
+	switch (INTEL_GEN(engine->i915)) {
+	case 12:
+	case 11:
+		return 0;
+	case 10:
+		wa_bb_fn[0] = gen10_init_indirectctx_bb;
+		wa_bb_fn[1] = NULL;
+		break;
+	case 9:
+		wa_bb_fn[0] = gen9_init_indirectctx_bb;
+		wa_bb_fn[1] = NULL;
+		break;
+	case 8:
+		wa_bb_fn[0] = gen8_init_indirectctx_bb;
+		wa_bb_fn[1] = NULL;
+		break;
+	default:
+		MISSING_CASE(INTEL_GEN(engine->i915));
+		return 0;
+	}
+
+	ret = lrc_setup_wa_ctx(engine);
+	if (ret) {
+		drm_dbg(&engine->i915->drm,
+			"Failed to setup context WA page: %d\n", ret);
+		return ret;
+	}
+
+	batch = i915_gem_object_pin_map(wa_ctx->vma->obj, I915_MAP_WB);
+
+	/*
+	 * Emit the two workaround batch buffers, recording the offset from the
+	 * start of the workaround batch buffer object for each and their
+	 * respective sizes.
+	 */
+	batch_ptr = batch;
+	for (i = 0; i < ARRAY_SIZE(wa_bb_fn); i++) {
+		wa_bb[i]->offset = batch_ptr - batch;
+		if (GEM_DEBUG_WARN_ON(!IS_ALIGNED(wa_bb[i]->offset,
+						  CACHELINE_BYTES))) {
+			ret = -EINVAL;
+			break;
+		}
+		if (wa_bb_fn[i])
+			batch_ptr = wa_bb_fn[i](engine, batch_ptr);
+		wa_bb[i]->size = batch_ptr - (batch + wa_bb[i]->offset);
+	}
+	GEM_BUG_ON(batch_ptr - batch > CTX_WA_BB_OBJ_SIZE);
+
+	__i915_gem_object_flush_map(wa_ctx->vma->obj, 0, batch_ptr - batch);
+	__i915_gem_object_release_map(wa_ctx->vma->obj);
+	if (ret)
+		intel_fini_workaround_bb(engine);
+
+	return ret;
+}
+
+void intel_fini_workaround_bb(struct intel_engine_cs *engine)
+{
+	i915_vma_unpin_and_release(&engine->wa_ctx.vma, 0);
+}
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_workaround_bb.h b/drivers/gpu/drm/i915/gt/intel_engine_workaround_bb.h
new file mode 100644
index 000000000000..88771d77fd42
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_engine_workaround_bb.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2014 Intel Corporation
+ */
+
+#ifndef __INTEL_ENGINE_WORKAROUND_BB_H__
+#define __INTEL_ENGINE_WORKAROUND_BB_H__
+
+struct intel_engine_cs;
+
+int intel_init_workaround_bb(struct intel_engine_cs *engine);
+void intel_fini_workaround_bb(struct intel_engine_cs *engine);
+
+#endif /* __INTEL_ENGINE_WORKAROUND_BB_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 9069a456d2f7..1cc93ea6b7f0 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -116,6 +116,7 @@
 #include "intel_breadcrumbs.h"
 #include "intel_context.h"
 #include "intel_engine_pm.h"
+#include "intel_engine_workaround_bb.h"
 #include "intel_execlists_submission.h"
 #include "intel_gt.h"
 #include "intel_gt_pm.h"
@@ -3695,330 +3696,6 @@ static int execlists_request_alloc(struct i915_request *request)
 	return 0;
 }
 
-/*
- * In this WA we need to set GEN8_L3SQCREG4[21:21] and reset it after
- * PIPE_CONTROL instruction. This is required for the flush to happen correctly
- * but there is a slight complication as this is applied in WA batch where the
- * values are only initialized once so we cannot take register value at the
- * beginning and reuse it further; hence we save its value to memory, upload a
- * constant value with bit21 set and then we restore it back with the saved value.
- * To simplify the WA, a constant value is formed by using the default value
- * of this register. This shouldn't be a problem because we are only modifying
- * it for a short period and this batch in non-premptible. We can ofcourse
- * use additional instructions that read the actual value of the register
- * at that time and set our bit of interest but it makes the WA complicated.
- *
- * This WA is also required for Gen9 so extracting as a function avoids
- * code duplication.
- */
-static u32 *
-gen8_emit_flush_coherentl3_wa(struct intel_engine_cs *engine, u32 *batch)
-{
-	/* NB no one else is allowed to scribble over scratch + 256! */
-	*batch++ = MI_STORE_REGISTER_MEM_GEN8 | MI_SRM_LRM_GLOBAL_GTT;
-	*batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4);
-	*batch++ = intel_gt_scratch_offset(engine->gt,
-					   INTEL_GT_SCRATCH_FIELD_COHERENTL3_WA);
-	*batch++ = 0;
-
-	*batch++ = MI_LOAD_REGISTER_IMM(1);
-	*batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4);
-	*batch++ = 0x40400000 | GEN8_LQSC_FLUSH_COHERENT_LINES;
-
-	batch = gen8_emit_pipe_control(batch,
-				       PIPE_CONTROL_CS_STALL |
-				       PIPE_CONTROL_DC_FLUSH_ENABLE,
-				       0);
-
-	*batch++ = MI_LOAD_REGISTER_MEM_GEN8 | MI_SRM_LRM_GLOBAL_GTT;
-	*batch++ = i915_mmio_reg_offset(GEN8_L3SQCREG4);
-	*batch++ = intel_gt_scratch_offset(engine->gt,
-					   INTEL_GT_SCRATCH_FIELD_COHERENTL3_WA);
-	*batch++ = 0;
-
-	return batch;
-}
-
-/*
- * Typically we only have one indirect_ctx and per_ctx batch buffer which are
- * initialized at the beginning and shared across all contexts but this field
- * helps us to have multiple batches at different offsets and select them based
- * on a criteria. At the moment this batch always start at the beginning of the page
- * and at this point we don't have multiple wa_ctx batch buffers.
- *
- * The number of WA applied are not known at the beginning; we use this field
- * to return the no of DWORDS written.
- *
- * It is to be noted that this batch does not contain MI_BATCH_BUFFER_END
- * so it adds NOOPs as padding to make it cacheline aligned.
- * MI_BATCH_BUFFER_END will be added to perctx batch and both of them together
- * makes a complete batch buffer.
- */
-static u32 *gen8_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch)
-{
-	/* WaDisableCtxRestoreArbitration:bdw,chv */
-	*batch++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
-
-	/* WaFlushCoherentL3CacheLinesAtContextSwitch:bdw */
-	if (IS_BROADWELL(engine->i915))
-		batch = gen8_emit_flush_coherentl3_wa(engine, batch);
-
-	/* WaClearSlmSpaceAtContextSwitch:bdw,chv */
-	/* Actual scratch location is at 128 bytes offset */
-	batch = gen8_emit_pipe_control(batch,
-				       PIPE_CONTROL_FLUSH_L3 |
-				       PIPE_CONTROL_STORE_DATA_INDEX |
-				       PIPE_CONTROL_CS_STALL |
-				       PIPE_CONTROL_QW_WRITE,
-				       LRC_PPHWSP_SCRATCH_ADDR);
-
-	*batch++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
-
-	/* Pad to end of cacheline */
-	while ((unsigned long)batch % CACHELINE_BYTES)
-		*batch++ = MI_NOOP;
-
-	/*
-	 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because
-	 * execution depends on the length specified in terms of cache lines
-	 * in the register CTX_RCS_INDIRECT_CTX
-	 */
-
-	return batch;
-}
-
-struct lri {
-	i915_reg_t reg;
-	u32 value;
-};
-
-static u32 *emit_lri(u32 *batch, const struct lri *lri, unsigned int count)
-{
-	GEM_BUG_ON(!count || count > 63);
-
-	*batch++ = MI_LOAD_REGISTER_IMM(count);
-	do {
-		*batch++ = i915_mmio_reg_offset(lri->reg);
-		*batch++ = lri->value;
-	} while (lri++, --count);
-	*batch++ = MI_NOOP;
-
-	return batch;
-}
-
-static u32 *gen9_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch)
-{
-	static const struct lri lri[] = {
-		/* WaDisableGatherAtSetShaderCommonSlice:skl,bxt,kbl,glk */
-		{
-			COMMON_SLICE_CHICKEN2,
-			__MASKED_FIELD(GEN9_DISABLE_GATHER_AT_SET_SHADER_COMMON_SLICE,
-				       0),
-		},
-
-		/* BSpec: 11391 */
-		{
-			FF_SLICE_CHICKEN,
-			__MASKED_FIELD(FF_SLICE_CHICKEN_CL_PROVOKING_VERTEX_FIX,
-				       FF_SLICE_CHICKEN_CL_PROVOKING_VERTEX_FIX),
-		},
-
-		/* BSpec: 11299 */
-		{
-			_3D_CHICKEN3,
-			__MASKED_FIELD(_3D_CHICKEN_SF_PROVOKING_VERTEX_FIX,
-				       _3D_CHICKEN_SF_PROVOKING_VERTEX_FIX),
-		}
-	};
-
-	*batch++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
-
-	/* WaFlushCoherentL3CacheLinesAtContextSwitch:skl,bxt,glk */
-	batch = gen8_emit_flush_coherentl3_wa(engine, batch);
-
-	/* WaClearSlmSpaceAtContextSwitch:skl,bxt,kbl,glk,cfl */
-	batch = gen8_emit_pipe_control(batch,
-				       PIPE_CONTROL_FLUSH_L3 |
-				       PIPE_CONTROL_STORE_DATA_INDEX |
-				       PIPE_CONTROL_CS_STALL |
-				       PIPE_CONTROL_QW_WRITE,
-				       LRC_PPHWSP_SCRATCH_ADDR);
-
-	batch = emit_lri(batch, lri, ARRAY_SIZE(lri));
-
-	/* WaMediaPoolStateCmdInWABB:bxt,glk */
-	if (HAS_POOLED_EU(engine->i915)) {
-		/*
-		 * EU pool configuration is setup along with golden context
-		 * during context initialization. This value depends on
-		 * device type (2x6 or 3x6) and needs to be updated based
-		 * on which subslice is disabled especially for 2x6
-		 * devices, however it is safe to load default
-		 * configuration of 3x6 device instead of masking off
-		 * corresponding bits because HW ignores bits of a disabled
-		 * subslice and drops down to appropriate config. Please
-		 * see render_state_setup() in i915_gem_render_state.c for
-		 * possible configurations, to avoid duplication they are
-		 * not shown here again.
-		 */
-		*batch++ = GEN9_MEDIA_POOL_STATE;
-		*batch++ = GEN9_MEDIA_POOL_ENABLE;
-		*batch++ = 0x00777000;
-		*batch++ = 0;
-		*batch++ = 0;
-		*batch++ = 0;
-	}
-
-	*batch++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
-
-	/* Pad to end of cacheline */
-	while ((unsigned long)batch % CACHELINE_BYTES)
-		*batch++ = MI_NOOP;
-
-	return batch;
-}
-
-static u32 *
-gen10_init_indirectctx_bb(struct intel_engine_cs *engine, u32 *batch)
-{
-	int i;
-
-	/*
-	 * WaPipeControlBefore3DStateSamplePattern: cnl
-	 *
-	 * Ensure the engine is idle prior to programming a
-	 * 3DSTATE_SAMPLE_PATTERN during a context restore.
-	 */
-	batch = gen8_emit_pipe_control(batch,
-				       PIPE_CONTROL_CS_STALL,
-				       0);
-	/*
-	 * WaPipeControlBefore3DStateSamplePattern says we need 4 dwords for
-	 * the PIPE_CONTROL followed by 12 dwords of 0x0, so 16 dwords in
-	 * total. However, a PIPE_CONTROL is 6 dwords long, not 4, which is
-	 * confusing. Since gen8_emit_pipe_control() already advances the
-	 * batch by 6 dwords, we advance the other 10 here, completing a
-	 * cacheline. It's not clear if the workaround requires this padding
-	 * before other commands, or if it's just the regular padding we would
-	 * already have for the workaround bb, so leave it here for now.
-	 */
-	for (i = 0; i < 10; i++)
-		*batch++ = MI_NOOP;
-
-	/* Pad to end of cacheline */
-	while ((unsigned long)batch % CACHELINE_BYTES)
-		*batch++ = MI_NOOP;
-
-	return batch;
-}
-
-#define CTX_WA_BB_OBJ_SIZE (PAGE_SIZE)
-
-static int lrc_setup_wa_ctx(struct intel_engine_cs *engine)
-{
-	struct drm_i915_gem_object *obj;
-	struct i915_vma *vma;
-	int err;
-
-	obj = i915_gem_object_create_shmem(engine->i915, CTX_WA_BB_OBJ_SIZE);
-	if (IS_ERR(obj))
-		return PTR_ERR(obj);
-
-	vma = i915_vma_instance(obj, &engine->gt->ggtt->vm, NULL);
-	if (IS_ERR(vma)) {
-		err = PTR_ERR(vma);
-		goto err;
-	}
-
-	err = i915_ggtt_pin(vma, NULL, 0, PIN_HIGH);
-	if (err)
-		goto err;
-
-	engine->wa_ctx.vma = vma;
-	return 0;
-
-err:
-	i915_gem_object_put(obj);
-	return err;
-}
-
-static void lrc_destroy_wa_ctx(struct intel_engine_cs *engine)
-{
-	i915_vma_unpin_and_release(&engine->wa_ctx.vma, 0);
-}
-
-typedef u32 *(*wa_bb_func_t)(struct intel_engine_cs *engine, u32 *batch);
-
-static int intel_init_workaround_bb(struct intel_engine_cs *engine)
-{
-	struct i915_ctx_workarounds *wa_ctx = &engine->wa_ctx;
-	struct i915_wa_ctx_bb *wa_bb[2] = { &wa_ctx->indirect_ctx,
-					    &wa_ctx->per_ctx };
-	wa_bb_func_t wa_bb_fn[2];
-	void *batch, *batch_ptr;
-	unsigned int i;
-	int ret;
-
-	if (engine->class != RENDER_CLASS)
-		return 0;
-
-	switch (INTEL_GEN(engine->i915)) {
-	case 12:
-	case 11:
-		return 0;
-	case 10:
-		wa_bb_fn[0] = gen10_init_indirectctx_bb;
-		wa_bb_fn[1] = NULL;
-		break;
-	case 9:
-		wa_bb_fn[0] = gen9_init_indirectctx_bb;
-		wa_bb_fn[1] = NULL;
-		break;
-	case 8:
-		wa_bb_fn[0] = gen8_init_indirectctx_bb;
-		wa_bb_fn[1] = NULL;
-		break;
-	default:
-		MISSING_CASE(INTEL_GEN(engine->i915));
-		return 0;
-	}
-
-	ret = lrc_setup_wa_ctx(engine);
-	if (ret) {
-		drm_dbg(&engine->i915->drm,
-			"Failed to setup context WA page: %d\n", ret);
-		return ret;
-	}
-
-	batch = i915_gem_object_pin_map(wa_ctx->vma->obj, I915_MAP_WB);
-
-	/*
-	 * Emit the two workaround batch buffers, recording the offset from the
-	 * start of the workaround batch buffer object for each and their
-	 * respective sizes.
-	 */
-	batch_ptr = batch;
-	for (i = 0; i < ARRAY_SIZE(wa_bb_fn); i++) {
-		wa_bb[i]->offset = batch_ptr - batch;
-		if (GEM_DEBUG_WARN_ON(!IS_ALIGNED(wa_bb[i]->offset,
-						  CACHELINE_BYTES))) {
-			ret = -EINVAL;
-			break;
-		}
-		if (wa_bb_fn[i])
-			batch_ptr = wa_bb_fn[i](engine, batch_ptr);
-		wa_bb[i]->size = batch_ptr - (batch + wa_bb[i]->offset);
-	}
-	GEM_BUG_ON(batch_ptr - batch > CTX_WA_BB_OBJ_SIZE);
-
-	__i915_gem_object_flush_map(wa_ctx->vma->obj, 0, batch_ptr - batch);
-	__i915_gem_object_release_map(wa_ctx->vma->obj);
-	if (ret)
-		lrc_destroy_wa_ctx(engine);
-
-	return ret;
-}
-
 static void reset_csb_pointers(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -4707,7 +4384,7 @@ static void execlists_release(struct intel_engine_cs *engine)
 	execlists_shutdown(engine);
 
 	intel_engine_cleanup_common(engine);
-	lrc_destroy_wa_ctx(engine);
+	intel_fini_workaround_bb(engine);
 }
 
 static void
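
For readers following the code being moved: the emit loop in intel_init_workaround_bb() records, for each workaround batch, its byte offset within the page and its size, and each emitter pads its output with MI_NOOP up to a cacheline boundary. Below is a minimal userspace sketch of that bookkeeping. It is not part of this patch and not driver code. The names pad_to_cacheline(), demo_emit(), record_batches(), struct wa_bb and demo_page are hypothetical, the engine argument is dropped for brevity, and the values 64 for CACHELINE_BYTES and 0 for MI_NOOP are stand-ins assumed for the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64u	/* assumed stand-in for the driver's constant */
#define MI_NOOP 0u		/* assumed stand-in for the no-op command dword */

/* Hypothetical mirror of the offset/size pair kept per workaround batch. */
struct wa_bb {
	uint32_t offset;	/* bytes from the start of the page */
	uint32_t size;		/* bytes emitted for this batch */
};

/* Emitter type: writes dwords at batch, returns the advanced pointer. */
typedef uint32_t *(*wa_bb_fn)(uint32_t *batch);

/* Pad with no-ops until the write pointer sits on a cacheline boundary. */
static uint32_t *pad_to_cacheline(uint32_t *batch)
{
	while ((uintptr_t)batch % CACHELINE_BYTES)
		*batch++ = MI_NOOP;
	return batch;
}

/* Hypothetical emitter: three payload dwords followed by padding. */
static uint32_t *demo_emit(uint32_t *batch)
{
	*batch++ = 0x11111111;
	*batch++ = 0x22222222;
	*batch++ = 0x33333333;
	return pad_to_cacheline(batch);
}

/*
 * Run each emitter in turn, recording where its output starts and how
 * many bytes it wrote. A NULL emitter yields a zero-sized batch. Returns
 * -1 if a batch would start off a cacheline boundary.
 */
static int record_batches(void *page, size_t page_size, const wa_bb_fn fn[],
			  struct wa_bb bb[], unsigned int count)
{
	char *start = page;
	char *ptr = start;
	unsigned int i;

	for (i = 0; i < count; i++) {
		bb[i].offset = (uint32_t)(ptr - start);
		if (bb[i].offset % CACHELINE_BYTES)
			return -1;
		if (fn[i])
			ptr = (char *)fn[i]((uint32_t *)ptr);
		bb[i].size = (uint32_t)(ptr - (start + bb[i].offset));
	}
	assert((size_t)(ptr - start) <= page_size);
	return 0;
}

/* Cacheline-aligned backing store standing in for the pinned WA page. */
static _Alignas(64) uint32_t demo_page[1024];
```

Calling record_batches(demo_page, sizeof(demo_page), fn, bb, 2) with fn = { demo_emit, NULL } leaves the first batch at offset 0 with a size of one cacheline and the second batch empty at the next cacheline. This mirrors how every generation in the switch statement sets only wa_bb_fn[0] and leaves wa_bb_fn[1] NULL.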
