Message ID | 20190824002022.27636-1-daniele.ceraolospurio@intel.com (mailing list archive)
---|---
State | New, archived
Series | drm/i915: use a separate context for gpu relocs
Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
> The CS pre-parser can pre-fetch commands across memory sync points and
> starting from gen12 it is able to pre-fetch across BB_START and BB_END
> boundaries as well, so when we emit gpu relocs the pre-parser might
> fetch the target location of the reloc before the memory write lands.
>
> The parser can't pre-fetch across the ctx switch, so we use a separate
> context to guarantee that the memory is synchronized before the parser
> can get to it.
>
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
[snip]
> @@ -973,6 +976,10 @@ static void reloc_cache_reset(struct reloc_cache *cache)
>  		}
>  	}
>
> +	ce = fetch_and_zero(&cache->ce);
> +	if (ce)
> +		intel_context_put(ce);

We don't need to create a new context between every buffer, it can be
released in eb_destroy.

[snip]
> @@ -1239,6 +1246,24 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
>  	if (!intel_engine_can_store_dword(eb->engine))
>  		return ERR_PTR(-ENODEV);
>
> +	/*
> +	 * The CS pre-parser can pre-fetch commands across memory sync
> +	 * points and starting gen12 it is able to pre-fetch across
> +	 * BB_START and BB_END boundaries (within the same context). We
> +	 * use a separate context to guarantee that the reloc writes
> +	 * land before the parser gets to the target memory location.
> +	 */
> +	if (!cache->ce) {
> +		struct intel_context *ce;
> +

I was thinking of

	if (cache->gen >= 12)
> +		ce = intel_context_create(eb->context->gem_context,
> +					  eb->engine);
	else
		ce = intel_context_get(eb->context);

Note that this requires us to fix the tagging so that we don't perform a
lite-restore from the reloc instance to the user instance.

But yes, that's more or less the same as the sketch I did :)
-Chris
Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
[snip]
> @@ -1239,6 +1246,24 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
>  	if (!intel_engine_can_store_dword(eb->engine))
>  		return ERR_PTR(-ENODEV);
>
> +	/*
> +	 * The CS pre-parser can pre-fetch commands across memory sync
> +	 * points and starting gen12 it is able to pre-fetch across
> +	 * BB_START and BB_END boundaries (within the same context). We
> +	 * use a separate context to guarantee that the reloc writes
> +	 * land before the parser gets to the target memory location.
> +	 */

I think the comment about the different semantics for execution flow
would also be valuable in emit_fini_breadcrumb -- it's the first place
we will look if we ever have any doubt about the barriers around the
breadcrumbs.
-Chris
Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
> @@ -943,6 +945,7 @@ static void reloc_gpu_flush(struct reloc_cache *cache)
>  static void reloc_cache_reset(struct reloc_cache *cache)
>  {
>  	void *vaddr;
> +	struct intel_context *ce;
>
>  	if (cache->rq)
>  		reloc_gpu_flush(cache);
> @@ -973,6 +976,10 @@ static void reloc_cache_reset(struct reloc_cache *cache)
>  		}
>  	}
>
> +	ce = fetch_and_zero(&cache->ce);
> +	if (ce)
> +		intel_context_put(ce);

For peace of mind, this is too late. For pure gpu relocs, cache->vaddr
is 0 and so we took the short-circuit at the beginning of the function.
-Chris
On 8/24/2019 1:54 AM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
[snip]
>> +	ce = fetch_and_zero(&cache->ce);
>> +	if (ce)
>> +		intel_context_put(ce);
> We don't need to create a new context between every buffer, it can be
> released in eb_destroy.
[snip]
>> +	if (!cache->ce) {
>> +		struct intel_context *ce;
>> +
> I was thinking of
>
> 	if (cache->gen >= 12)
>> +		ce = intel_context_create(eb->context->gem_context,
>> +					  eb->engine);
> 	else
> 		ce = intel_context_get(eb->context);
>
> Note that this requires us to fix the tagging so that we don't perform a
> lite-restore from the reloc instance to the user instance.

What's wrong with lite-restoring in this case? It's not something we
stop now AFAICS.

Daniele

> But yes, that's more or less the same as the sketch I did :)
> -Chris
Quoting Daniele Ceraolo Spurio (2019-08-26 18:56:53)
> On 8/24/2019 1:54 AM, Chris Wilson wrote:
> > Note that this requires us to fix the tagging so that we don't perform a
> > lite-restore from the reloc instance to the user instance.
>
> What's wrong with lite-restoring in this case? It's not something we
> stop now AFAICS.

I have patches to fix the regression. But the question is whether or not
the pre-parser is happy to cross the lite-restore boundary. If it is a
light lite-restore and there is no cache invalidation...
-Chris
On 8/26/19 11:10 AM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-08-26 18:56:53)
[snip]
> I have patches to fix the regression. But the question is whether or not
> the pre-parser is happy to cross the lite-restore boundary. If it is a
> light lite-restore and there is no cache invalidation...
> -Chris

On pre-gen12 the parser doesn't cross the BB_START, so we're safe from
that POV.

Daniele
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b5f6937369ea..d27201c654e9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -252,6 +252,7 @@ struct i915_execbuffer {
 		bool has_fence : 1;
 		bool needs_unfenced : 1;
 
+		struct intel_context *ce;
 		struct i915_request *rq;
 		u32 *rq_cmd;
 		unsigned int rq_size;
@@ -903,6 +904,7 @@ static void reloc_cache_init(struct reloc_cache *cache,
 	cache->has_fence = cache->gen < 4;
 	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->node.allocated = false;
+	cache->ce = NULL;
 	cache->rq = NULL;
 	cache->rq_size = 0;
 }
@@ -943,6 +945,7 @@ static void reloc_gpu_flush(struct reloc_cache *cache)
 static void reloc_cache_reset(struct reloc_cache *cache)
 {
 	void *vaddr;
+	struct intel_context *ce;
 
 	if (cache->rq)
 		reloc_gpu_flush(cache);
@@ -973,6 +976,10 @@ static void reloc_cache_reset(struct reloc_cache *cache)
 		}
 	}
 
+	ce = fetch_and_zero(&cache->ce);
+	if (ce)
+		intel_context_put(ce);
+
 	cache->vaddr = 0;
 	cache->page = -1;
 }
@@ -1168,7 +1175,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	if (err)
 		goto err_unmap;
 
-	rq = i915_request_create(eb->context);
+	rq = intel_context_create_request(cache->ce);
 	if (IS_ERR(rq)) {
 		err = PTR_ERR(rq);
 		goto err_unpin;
@@ -1239,6 +1246,24 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
 	if (!intel_engine_can_store_dword(eb->engine))
 		return ERR_PTR(-ENODEV);
 
+	/*
+	 * The CS pre-parser can pre-fetch commands across memory sync
+	 * points and starting gen12 it is able to pre-fetch across
+	 * BB_START and BB_END boundaries (within the same context). We
+	 * use a separate context to guarantee that the reloc writes
+	 * land before the parser gets to the target memory location.
+	 */
+	if (!cache->ce) {
+		struct intel_context *ce;
+
+		ce = intel_context_create(eb->context->gem_context,
+					  eb->engine);
+		if (IS_ERR(ce))
+			return ERR_CAST(ce);
+
+		cache->ce = ce; /* released in reloc_cache_reset */
+	}
+
 	err = __reloc_gpu_alloc(eb, vma, len);
 	if (unlikely(err))
 		return ERR_PTR(err);
The CS pre-parser can pre-fetch commands across memory sync points and
starting from gen12 it is able to pre-fetch across BB_START and BB_END
boundaries as well, so when we emit gpu relocs the pre-parser might
fetch the target location of the reloc before the memory write lands.

The parser can't pre-fetch across the ctx switch, so we use a separate
context to guarantee that the memory is synchronized before the parser
can get to it.

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 27 ++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)