drm/i915: use a separate context for gpu relocs

Message ID: 20190824002022.27636-1-daniele.ceraolospurio@intel.com (mailing list archive)
State: New, archived
Series: drm/i915: use a separate context for gpu relocs

Commit Message

Daniele Ceraolo Spurio Aug. 24, 2019, 12:20 a.m. UTC
The CS pre-parser can pre-fetch commands across memory sync points and
starting from gen12 it is able to pre-fetch across BB_START and BB_END
boundaries as well, so when we emit gpu relocs the pre-parser might
fetch the target location of the reloc before the memory write lands.

The parser can't pre-fetch across the ctx switch, so we use a separate
context to guarantee that the memory is synchronized before the parser
can get to it.

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 27 ++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

Comments

Chris Wilson Aug. 24, 2019, 8:54 a.m. UTC | #1
Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
> The CS pre-parser can pre-fetch commands across memory sync points and
> starting from gen12 it is able to pre-fetch across BB_START and BB_END
> boundaries as well, so when we emit gpu relocs the pre-parser might
> fetch the target location of the reloc before the memory write lands.
> 
> The parser can't pre-fetch across the ctx switch, so we use a separate
> context to guarantee that the memory is synchronized before the parser
> can get to it.
> 
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 27 ++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index b5f6937369ea..d27201c654e9 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -252,6 +252,7 @@ struct i915_execbuffer {
>                 bool has_fence : 1;
>                 bool needs_unfenced : 1;
>  
> +               struct intel_context *ce;
>                 struct i915_request *rq;
>                 u32 *rq_cmd;
>                 unsigned int rq_size;
> @@ -903,6 +904,7 @@ static void reloc_cache_init(struct reloc_cache *cache,
>         cache->has_fence = cache->gen < 4;
>         cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
>         cache->node.allocated = false;
> +       cache->ce = NULL;
>         cache->rq = NULL;
>         cache->rq_size = 0;
>  }
> @@ -943,6 +945,7 @@ static void reloc_gpu_flush(struct reloc_cache *cache)
>  static void reloc_cache_reset(struct reloc_cache *cache)
>  {
>         void *vaddr;
> +       struct intel_context *ce;
>  
>         if (cache->rq)
>                 reloc_gpu_flush(cache);
> @@ -973,6 +976,10 @@ static void reloc_cache_reset(struct reloc_cache *cache)
>                 }
>         }
>  
> +       ce = fetch_and_zero(&cache->ce);
> +       if (ce)
> +               intel_context_put(ce);

We don't need to create a new context between every buffer, it can be
released in eb_destroy.

> +
>         cache->vaddr = 0;
>         cache->page = -1;
>  }
> @@ -1168,7 +1175,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
>         if (err)
>                 goto err_unmap;
>  
> -       rq = i915_request_create(eb->context);
> +       rq = intel_context_create_request(cache->ce);
>         if (IS_ERR(rq)) {
>                 err = PTR_ERR(rq);
>                 goto err_unpin;
> @@ -1239,6 +1246,24 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
>                 if (!intel_engine_can_store_dword(eb->engine))
>                         return ERR_PTR(-ENODEV);
>  
> +               /*
> +                * The CS pre-parser can pre-fetch commands across memory sync
> +                * points and starting gen12 it is able to pre-fetch across
> +                * BB_START and BB_END boundaries (within the same context). We
> +                * use a separate context to guarantee that the reloc writes
> +                * land before the parser gets to the target memory location.
> +                */
> +               if (!cache->ce) {
> +                       struct intel_context *ce;
> +

I was thinking of

	if (cache->gen >= 12)
> +                       ce = intel_context_create(eb->context->gem_context,
> +                                                 eb->engine);
else
	ce = intel_context_get(eb->context);

Note that this requires us to fix the tagging so that we don't perform a
lite-restore from the reloc instance to the user instance.

But yes, that's more or less the same as the sketch I did :)
-Chris
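
Pulled together, the inline suggestion above amounts to the following gen-gated selection in reloc_gpu() (a sketch assembled from the posted patch plus Chris's edits, using the intel_context API of this period; not a tested revision):

	if (!cache->ce) {
		struct intel_context *ce;

		/*
		 * Only gen12+ needs the separate context: that is where
		 * the pre-parser crosses BB_START/BB_END. Earlier gens
		 * can reuse the execbuf context, taking just a reference.
		 */
		if (cache->gen >= 12)
			ce = intel_context_create(eb->context->gem_context,
						  eb->engine);
		else
			ce = intel_context_get(eb->context);
		if (IS_ERR(ce))
			return ERR_CAST(ce);

		cache->ce = ce; /* released in eb_destroy(), per comment #1 */
	}
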
Chris Wilson Aug. 24, 2019, 9:01 a.m. UTC | #2
Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
> The CS pre-parser can pre-fetch commands across memory sync points and
> starting from gen12 it is able to pre-fetch across BB_START and BB_END
> boundaries as well, so when we emit gpu relocs the pre-parser might
> fetch the target location of the reloc before the memory write lands.
> 
> The parser can't pre-fetch across the ctx switch, so we use a separate
> context to guarantee that the memory is synchronized before the parser
> can get to it.
> 
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 27 ++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index b5f6937369ea..d27201c654e9 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -252,6 +252,7 @@ struct i915_execbuffer {
>                 bool has_fence : 1;
>                 bool needs_unfenced : 1;
>  
> +               struct intel_context *ce;
>                 struct i915_request *rq;
>                 u32 *rq_cmd;
>                 unsigned int rq_size;
> @@ -903,6 +904,7 @@ static void reloc_cache_init(struct reloc_cache *cache,
>         cache->has_fence = cache->gen < 4;
>         cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
>         cache->node.allocated = false;
> +       cache->ce = NULL;
>         cache->rq = NULL;
>         cache->rq_size = 0;
>  }
> @@ -943,6 +945,7 @@ static void reloc_gpu_flush(struct reloc_cache *cache)
>  static void reloc_cache_reset(struct reloc_cache *cache)
>  {
>         void *vaddr;
> +       struct intel_context *ce;
>  
>         if (cache->rq)
>                 reloc_gpu_flush(cache);
> @@ -973,6 +976,10 @@ static void reloc_cache_reset(struct reloc_cache *cache)
>                 }
>         }
>  
> +       ce = fetch_and_zero(&cache->ce);
> +       if (ce)
> +               intel_context_put(ce);
> +
>         cache->vaddr = 0;
>         cache->page = -1;
>  }
> @@ -1168,7 +1175,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
>         if (err)
>                 goto err_unmap;
>  
> -       rq = i915_request_create(eb->context);
> +       rq = intel_context_create_request(cache->ce);
>         if (IS_ERR(rq)) {
>                 err = PTR_ERR(rq);
>                 goto err_unpin;
> @@ -1239,6 +1246,24 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
>                 if (!intel_engine_can_store_dword(eb->engine))
>                         return ERR_PTR(-ENODEV);
>  
> +               /*
> +                * The CS pre-parser can pre-fetch commands across memory sync
> +                * points and starting gen12 it is able to pre-fetch across
> +                * BB_START and BB_END boundaries (within the same context). We
> +                * use a separate context to guarantee that the reloc writes
> +                * land before the parser gets to the target memory location.
> +                */

I think the comment about the different semantics for execution flow
would also be valuable in emit_fini_breadcrumb -- it's the first place
we will look if we ever have any doubt about the barriers around the
breadcrumbs.
-Chris
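
For the breadcrumb side, the note could read roughly as below. This is a sketch only: the function it would sit in (e.g. a gen12 flavour of gen8_emit_fini_breadcrumb() in intel_lrc.c) is an assumption, not part of this series:

	/*
	 * The CS pre-parser can pre-fetch commands across memory sync
	 * points and, from gen12, across BB_START/BB_END within the same
	 * context; only a context switch is guaranteed to stall it. Any
	 * barrier we rely on around the breadcrumb must therefore not
	 * assume the parser waits for prior memory writes to land.
	 */
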
Chris Wilson Aug. 24, 2019, 9:03 a.m. UTC | #3
Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
> @@ -943,6 +945,7 @@ static void reloc_gpu_flush(struct reloc_cache *cache)
>  static void reloc_cache_reset(struct reloc_cache *cache)
>  {
>         void *vaddr;
> +       struct intel_context *ce;
>  
>         if (cache->rq)
>                 reloc_gpu_flush(cache);
> @@ -973,6 +976,10 @@ static void reloc_cache_reset(struct reloc_cache *cache)
>                 }
>         }
>  
> +       ce = fetch_and_zero(&cache->ce);
> +       if (ce)
> +               intel_context_put(ce);

For peace of mind, this is too late. For pure gpu relocs, cache->vaddr is 0
and so we took the short-circuit at the beginning of the function.
-Chris
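
The short-circuit in question sits at the top of reloc_cache_reset(): for a pure gpu reloc pass cache->vaddr is still 0, so the function returns before ever reaching the intel_context_put() added by this patch, leaking the context. A sketch of the relevant flow (surrounding code recalled from the kernel of this period, abbreviated):

	static void reloc_cache_reset(struct reloc_cache *cache)
	{
		if (cache->rq)
			reloc_gpu_flush(cache);

		if (!cache->vaddr)
			return; /* pure gpu relocs bail out here ... */

		/* ... so a put placed below this point never runs */
	}
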
Daniele Ceraolo Spurio Aug. 26, 2019, 5:56 p.m. UTC | #4
On 8/24/2019 1:54 AM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-08-24 01:20:22)
>> The CS pre-parser can pre-fetch commands across memory sync points and
>> starting from gen12 it is able to pre-fetch across BB_START and BB_END
>> boundaries as well, so when we emit gpu relocs the pre-parser might
>> fetch the target location of the reloc before the memory write lands.
>>
>> The parser can't pre-fetch across the ctx switch, so we use a separate
>> context to guarantee that the memory is synchronized before the parser
>> can get to it.
>>
>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> ---
>>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 27 ++++++++++++++++++-
>>   1 file changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> index b5f6937369ea..d27201c654e9 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> @@ -252,6 +252,7 @@ struct i915_execbuffer {
>>                  bool has_fence : 1;
>>                  bool needs_unfenced : 1;
>>   
>> +               struct intel_context *ce;
>>                  struct i915_request *rq;
>>                  u32 *rq_cmd;
>>                  unsigned int rq_size;
>> @@ -903,6 +904,7 @@ static void reloc_cache_init(struct reloc_cache *cache,
>>          cache->has_fence = cache->gen < 4;
>>          cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
>>          cache->node.allocated = false;
>> +       cache->ce = NULL;
>>          cache->rq = NULL;
>>          cache->rq_size = 0;
>>   }
>> @@ -943,6 +945,7 @@ static void reloc_gpu_flush(struct reloc_cache *cache)
>>   static void reloc_cache_reset(struct reloc_cache *cache)
>>   {
>>          void *vaddr;
>> +       struct intel_context *ce;
>>   
>>          if (cache->rq)
>>                  reloc_gpu_flush(cache);
>> @@ -973,6 +976,10 @@ static void reloc_cache_reset(struct reloc_cache *cache)
>>                  }
>>          }
>>   
>> +       ce = fetch_and_zero(&cache->ce);
>> +       if (ce)
>> +               intel_context_put(ce);
> We don't need to create a new context between every buffer, it can be
> released in eb_destroy.
>
>> +
>>          cache->vaddr = 0;
>>          cache->page = -1;
>>   }
>> @@ -1168,7 +1175,7 @@ static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
>>          if (err)
>>                  goto err_unmap;
>>   
>> -       rq = i915_request_create(eb->context);
>> +       rq = intel_context_create_request(cache->ce);
>>          if (IS_ERR(rq)) {
>>                  err = PTR_ERR(rq);
>>                  goto err_unpin;
>> @@ -1239,6 +1246,24 @@ static u32 *reloc_gpu(struct i915_execbuffer *eb,
>>                  if (!intel_engine_can_store_dword(eb->engine))
>>                          return ERR_PTR(-ENODEV);
>>   
>> +               /*
>> +                * The CS pre-parser can pre-fetch commands across memory sync
>> +                * points and starting gen12 it is able to pre-fetch across
>> +                * BB_START and BB_END boundaries (within the same context). We
>> +                * use a separate context to guarantee that the reloc writes
>> +                * land before the parser gets to the target memory location.
>> +                */
>> +               if (!cache->ce) {
>> +                       struct intel_context *ce;
>> +
> I was thinking of
>
> 	if (cache->gen >= 12)
>> +                       ce = intel_context_create(eb->context->gem_context,
>> +                                                 eb->engine);
> else
> 	ce = intel_context_get(eb->context);
>
> Note that this requires us to fix the tagging so that we don't perform a
> lite-restore from the reloc instance to the user instance.

What's wrong with lite-restoring in this case? It's not something we 
stop now AFAICS.

Daniele

>
> But yes, that's more or less the same as the sketch I did :)
> -Chris
Chris Wilson Aug. 26, 2019, 6:10 p.m. UTC | #5
Quoting Daniele Ceraolo Spurio (2019-08-26 18:56:53)
> 
> 
> On 8/24/2019 1:54 AM, Chris Wilson wrote:
> > Note that this requires us to fix the tagging so that we don't perform a
> > lite-restore from the reloc instance to the user instance.
> 
> What's wrong with lite-restoring in this case? It's not something we 
> stop now AFAICS.

I have patches to fix the regression. But the question is whether or not
the pre-parser is happy to cross the lite-restore boundary. If it is a
light lite-restore and there is no cache invalidation...
-Chris
Daniele Ceraolo Spurio Aug. 26, 2019, 6:12 p.m. UTC | #6
On 8/26/19 11:10 AM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-08-26 18:56:53)
>>
>>
>> On 8/24/2019 1:54 AM, Chris Wilson wrote:
>>> Note that this requires us to fix the tagging so that we don't perform a
>>> lite-restore from the reloc instance to the user instance.
>>
>> What's wrong with lite-restoring in this case? It's not something we
>> stop now AFAICS.
> 
> I have patches to fix the regression. But the question is whether or not
> the pre-parser is happy to cross the lite-restore boundary. If it is a
> light lite-restore and there is no cache invalidation...
> -Chris
> 

On pre-gen12 the parser doesn't cross the BB_START, so we're safe from 
that POV.

Daniele

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b5f6937369ea..d27201c654e9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -252,6 +252,7 @@  struct i915_execbuffer {
 		bool has_fence : 1;
 		bool needs_unfenced : 1;
 
+		struct intel_context *ce;
 		struct i915_request *rq;
 		u32 *rq_cmd;
 		unsigned int rq_size;
@@ -903,6 +904,7 @@  static void reloc_cache_init(struct reloc_cache *cache,
 	cache->has_fence = cache->gen < 4;
 	cache->needs_unfenced = INTEL_INFO(i915)->unfenced_needs_alignment;
 	cache->node.allocated = false;
+	cache->ce = NULL;
 	cache->rq = NULL;
 	cache->rq_size = 0;
 }
@@ -943,6 +945,7 @@  static void reloc_gpu_flush(struct reloc_cache *cache)
 static void reloc_cache_reset(struct reloc_cache *cache)
 {
 	void *vaddr;
+	struct intel_context *ce;
 
 	if (cache->rq)
 		reloc_gpu_flush(cache);
@@ -973,6 +976,10 @@  static void reloc_cache_reset(struct reloc_cache *cache)
 		}
 	}
 
+	ce = fetch_and_zero(&cache->ce);
+	if (ce)
+		intel_context_put(ce);
+
 	cache->vaddr = 0;
 	cache->page = -1;
 }
@@ -1168,7 +1175,7 @@  static int __reloc_gpu_alloc(struct i915_execbuffer *eb,
 	if (err)
 		goto err_unmap;
 
-	rq = i915_request_create(eb->context);
+	rq = intel_context_create_request(cache->ce);
 	if (IS_ERR(rq)) {
 		err = PTR_ERR(rq);
 		goto err_unpin;
@@ -1239,6 +1246,24 @@  static u32 *reloc_gpu(struct i915_execbuffer *eb,
 		if (!intel_engine_can_store_dword(eb->engine))
 			return ERR_PTR(-ENODEV);
 
+		/*
+		 * The CS pre-parser can pre-fetch commands across memory sync
+		 * points and starting gen12 it is able to pre-fetch across
+		 * BB_START and BB_END boundaries (within the same context). We
+		 * use a separate context to guarantee that the reloc writes
+		 * land before the parser gets to the target memory location.
+		 */
+		if (!cache->ce) {
+			struct intel_context *ce;
+
+			ce = intel_context_create(eb->context->gem_context,
+						  eb->engine);
+			if (IS_ERR(ce))
+				return ERR_CAST(ce);
+
+			cache->ce = ce; /* released in reloc_cache_reset */
+		}
+
 		err = __reloc_gpu_alloc(eb, vma, len);
 		if (unlikely(err))
 			return ERR_PTR(err);