
[03/10] drm/syncobj: add new drm_syncobj_add_point interface v2

Message ID 20181207155422.15967-3-david1.zhou@amd.com (mailing list archive)
State New, archived
Series [01/10] dma-buf: add new dma_fence_chain container v4

Commit Message

Chunming Zhou Dec. 7, 2018, 3:54 p.m. UTC
From: Christian König <ckoenig.leichtzumerken@gmail.com>

Use the dma_fence_chain object to create a timeline of fence objects
instead of just replacing the existing fence.

v2: rebase and cleanup

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/drm_syncobj.c | 37 +++++++++++++++++++++++++++++++++++
 include/drm/drm_syncobj.h     |  5 +++++
 2 files changed, 42 insertions(+)
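
For context, a hedged sketch of how a driver-side signalling path might use
the new helper. drm_syncobj_find()/drm_syncobj_put() are existing syncobj
interfaces and drm_syncobj_add_point() is added by this patch; the function
name my_timeline_signal() and the exact error handling are illustrative
assumptions, not code from this series.

  #include <drm/drm_syncobj.h>
  #include <linux/slab.h>

  /* Hypothetical driver path: publish @fence as timeline point @point of
   * the syncobj behind @handle.
   */
  static int my_timeline_signal(struct drm_file *file_private, u32 handle,
                                struct dma_fence *fence, u64 point)
  {
          struct dma_fence_chain *chain;
          struct drm_syncobj *syncobj;

          syncobj = drm_syncobj_find(file_private, handle);
          if (!syncobj)
                  return -ENOENT;

          /* The caller allocates the chain node; drm_syncobj_add_point()
           * initializes it and links it in front of the current fence.
           */
          chain = kzalloc(sizeof(*chain), GFP_KERNEL);
          if (!chain) {
                  drm_syncobj_put(syncobj);
                  return -ENOMEM;
          }

          drm_syncobj_add_point(syncobj, chain, fence, point);

          drm_syncobj_put(syncobj);
          return 0;
  }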

Comments

Daniel Vetter Dec. 12, 2018, 10:49 a.m. UTC | #1
On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> 
> Use the dma_fence_chain object to create a timeline of fence objects
> instead of just replacing the existing fence.
> 
> v2: rebase and cleanup
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>

Somewhat jumping back into this. Not sure whether we discussed this already or
not. I'm a bit unclear on why we have to chain the fences in the timeline:

- The timeline stuff is modelled after the WDDM2 monitored fences, which
  really are just u64 counters in memory somewhere (I think it could be
  system ram or vram; a small illustrative sketch of this counter model
  follows right after this list). Because WDDM2 has the memory management
  entirely separated from rendering synchronization it totally allows
  userspace to create loops and deadlocks and everything else nasty using
  this - the memory manager won't deadlock because these monitored fences
  never leak into the buffer manager. And if a CS deadlocks, gpu reset
  takes care of the mess.

- This has a few consequences, as in they seem to indeed work like a
  memory location: Userspace incrementing out-of-order (because they run
  batches updating the same fence on different engines) is totally fine,
  as is doing anything else "stupid".

- Now on linux we can't allow just anything, because we need to make sure
  that deadlocks don't leak into the memory manager. But as long as we
  block until the underlying dma_fence has materialized, nothing userspace
  can do will lead to such a deadlock - even if userspace ends up
  submitting jobs without enough built-in synchronization, leading to
  out-of-order signalling of fences on that "timeline". And I don't think
  that would pose a problem for us.
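
A minimal illustrative sketch of the "just a u64 counter in memory" model
referenced in the first point above; every name here is made up, this is
not any real WDDM2 or kernel interface.

  #include <stdint.h>

  /* A monitored fence is nothing but a 64-bit value somewhere in memory
   * (system ram or vram). Signalling stores a value, waiting blocks until
   * the value is >= the requested point, so out-of-order signalling and
   * skipped points are naturally "allowed".
   */
  struct monitored_fence {
          volatile uint64_t value;
  };

  static void mf_signal(struct monitored_fence *mf, uint64_t point)
  {
          mf->value = point;      /* may be written out of order */
  }

  static void mf_wait(struct monitored_fence *mf, uint64_t point)
  {
          while (mf->value < point)
                  ;               /* spin or sleep; no memory manager involved */
  }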

Essentially I think we can look at timeline syncobj as a dma_fence
container indexed through an integer, and there's no need to enforce that
the timeline works like a real dma_fence timeline, with all its
guarantees. It's just a pile of (possibly, if userspace is stupid)
unrelated dma_fences. You could implement the entire thing in userspace
after all, except for the "we want to share these timeline objects between
processes" problem.

tldr; I think we can drop the dma_fence_chain complexity completely. Or at
least I'm not really understanding why it's needed.

Of course that means drivers cannot treat a drm_syncobj timeline as a
dma_fence timeline. But given the future fences stuff and all that, that's
already out of the window anyway.

What am I missing?
-Daniel

> ---
>  drivers/gpu/drm/drm_syncobj.c | 37 +++++++++++++++++++++++++++++++++++
>  include/drm/drm_syncobj.h     |  5 +++++
>  2 files changed, 42 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
> index e19525af0cce..51f798e2194f 100644
> --- a/drivers/gpu/drm/drm_syncobj.c
> +++ b/drivers/gpu/drm/drm_syncobj.c
> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct drm_syncobj *syncobj,
>  	spin_unlock(&syncobj->lock);
>  }
>  
> +/**
> + * drm_syncobj_add_point - add new timeline point to the syncobj
> + * @syncobj: sync object to add timeline point do
> + * @chain: chain node to use to add the point
> + * @fence: fence to encapsulate in the chain node
> + * @point: sequence number to use for the point
> + *
> + * Add the chain node as new timeline point to the syncobj.
> + */
> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> +			   struct dma_fence_chain *chain,
> +			   struct dma_fence *fence,
> +			   uint64_t point)
> +{
> +	struct syncobj_wait_entry *cur, *tmp;
> +	struct dma_fence *prev;
> +
> +	dma_fence_get(fence);
> +
> +	spin_lock(&syncobj->lock);
> +
> +	prev = rcu_dereference_protected(syncobj->fence,
> +					 lockdep_is_held(&syncobj->lock));
> +	dma_fence_chain_init(chain, prev, fence, point);
> +	rcu_assign_pointer(syncobj->fence, &chain->base);
> +
> +	list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
> +		list_del_init(&cur->node);
> +		syncobj_wait_syncobj_func(syncobj, cur);
> +	}
> +	spin_unlock(&syncobj->lock);
> +
> +	/* Walk the chain once to trigger garbage collection */
> +	dma_fence_chain_for_each(prev, fence);
> +}
> +EXPORT_SYMBOL(drm_syncobj_add_point);
> +
>  /**
>   * drm_syncobj_replace_fence - replace fence in a sync object.
>   * @syncobj: Sync object to replace fence in
> diff --git a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h
> index 7c6ed845c70d..8acb4ae4f311 100644
> --- a/include/drm/drm_syncobj.h
> +++ b/include/drm/drm_syncobj.h
> @@ -27,6 +27,7 @@
>  #define __DRM_SYNCOBJ_H__
>  
>  #include "linux/dma-fence.h"
> +#include "linux/dma-fence-chain.h"
>  
>  /**
>   * struct drm_syncobj - sync object.
> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj *syncobj)
>  
>  struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
>  				     u32 handle);
> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> +			   struct dma_fence_chain *chain,
> +			   struct dma_fence *fence,
> +			   uint64_t point);
>  void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
>  			       struct dma_fence *fence);
>  int drm_syncobj_find_fence(struct drm_file *file_private,
> -- 
> 2.17.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Christian König Dec. 12, 2018, 11:08 a.m. UTC | #2
On 12.12.18 at 11:49, Daniel Vetter wrote:
> On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>
>> Use the dma_fence_chain object to create a timeline of fence objects
>> instead of just replacing the existing fence.
>>
>> v2: rebase and cleanup
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
> Somewhat jumping back into this. Not sure we discussed this already or
> not. I'm a bit unclear on why we have to chain the fences in the timeline:
>
> - The timeline stuff is modelled after the WDDM2 monitored fences. Which
>    really are just u64 counters in memory somewhere (I think could be
>    system ram or vram). Because WDDM2 has the memory management entirely
>    separated from rendering synchronization it totally allows userspace to
>    create loops and deadlocks and everything else nasty using this - the
>    memory manager won't deadlock because these monitored fences never leak
>    into the buffer manager. And if CS deadlock, gpu reset takes care of the
>    mess.
>
> - This has a few consequences, as in they seem to indeed work like a
>    memory location: Userspace incrementing out-of-order (because they run
>    batches updating the same fence on different engines) is totally fine,
>    as is doing anything else "stupid".
>
> - Now on linux we can't allow anything, because we need to make sure that
>    deadlocks don't leak into the memory manager. But as long as we block
>    until the underlying dma_fence has materialized, nothing userspace can
>    do will lead to such a deadlock. Even if userspace ends up submitting
>    jobs without enough built-in synchronization, leading to out-of-order
>    signalling of fences on that "timeline". And I don't think that would
>    pose a problem for us.
>
> Essentially I think we can look at timeline syncobj as a dma_fence
> container indexed through an integer, and there's no need to enforce that
> the timline works like a real dma_fence timeline, with all it's
> guarantees. It's just a pile of (possibly, if userspace is stupid)
> unrelated dma_fences. You could implement the entire thing in userspace
> after all, except for the "we want to share these timeline objects between
> processes" problem.
>
> tldr; I think we can drop the dma_fence_chain complexity completely. Or at
> least I'm not really understanding why it's needed.
>
> Of course that means drivers cannot treat a drm_syncobj timeline as a
> dma_fence timeline. But given the future fences stuff and all that, that's
> already out of the window anyway.
>
> What am I missing?

Good question, since that was exactly my initial idea as well.

Key point is that our Vulkan guys came back and said that this wouldn't
be sufficient, but I honestly don't fully understand why.

Anyway that's why David came up with using the fence array to wait for 
all previously added fences, which I then later on extended into this 
chain container.

I have to admit that the implementation is way more defensive this way.
E.g. there are far fewer things userspace can do wrong.

The principal idea is that when they mess things up they are always 
going to wait more than necessary, but never less.
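
As an illustration of that "never less" property, a hedged sketch assuming
the dma_fence_chain semantics from patch 01/10 (and assuming the iterator
holds a reference that an early break has to drop); this is not code from
the series, just the idea:

  #include <linux/dma-fence.h>
  #include <linux/dma-fence-chain.h>

  /* Report whether timeline point @head (a chain node) and everything
   * before it have signaled. A waiter resolved to @head can at worst wait
   * longer than strictly needed, never less, because any unsignaled older
   * fence keeps it blocked.
   */
  static bool chain_point_signaled(struct dma_fence *head)
  {
          struct dma_fence *iter;

          dma_fence_chain_for_each(iter, head) {
                  if (!dma_fence_is_signaled(iter)) {
                          dma_fence_put(iter);    /* drop the iterator's ref */
                          return false;
                  }
          }

          return true;
  }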

Christian.

> -Daniel
>
>> ---
>>   drivers/gpu/drm/drm_syncobj.c | 37 +++++++++++++++++++++++++++++++++++
>>   include/drm/drm_syncobj.h     |  5 +++++
>>   2 files changed, 42 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
>> index e19525af0cce..51f798e2194f 100644
>> --- a/drivers/gpu/drm/drm_syncobj.c
>> +++ b/drivers/gpu/drm/drm_syncobj.c
>> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct drm_syncobj *syncobj,
>>   	spin_unlock(&syncobj->lock);
>>   }
>>   
>> +/**
>> + * drm_syncobj_add_point - add new timeline point to the syncobj
>> + * @syncobj: sync object to add timeline point do
>> + * @chain: chain node to use to add the point
>> + * @fence: fence to encapsulate in the chain node
>> + * @point: sequence number to use for the point
>> + *
>> + * Add the chain node as new timeline point to the syncobj.
>> + */
>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
>> +			   struct dma_fence_chain *chain,
>> +			   struct dma_fence *fence,
>> +			   uint64_t point)
>> +{
>> +	struct syncobj_wait_entry *cur, *tmp;
>> +	struct dma_fence *prev;
>> +
>> +	dma_fence_get(fence);
>> +
>> +	spin_lock(&syncobj->lock);
>> +
>> +	prev = rcu_dereference_protected(syncobj->fence,
>> +					 lockdep_is_held(&syncobj->lock));
>> +	dma_fence_chain_init(chain, prev, fence, point);
>> +	rcu_assign_pointer(syncobj->fence, &chain->base);
>> +
>> +	list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
>> +		list_del_init(&cur->node);
>> +		syncobj_wait_syncobj_func(syncobj, cur);
>> +	}
>> +	spin_unlock(&syncobj->lock);
>> +
>> +	/* Walk the chain once to trigger garbage collection */
>> +	dma_fence_chain_for_each(prev, fence);
>> +}
>> +EXPORT_SYMBOL(drm_syncobj_add_point);
>> +
>>   /**
>>    * drm_syncobj_replace_fence - replace fence in a sync object.
>>    * @syncobj: Sync object to replace fence in
>> diff --git a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h
>> index 7c6ed845c70d..8acb4ae4f311 100644
>> --- a/include/drm/drm_syncobj.h
>> +++ b/include/drm/drm_syncobj.h
>> @@ -27,6 +27,7 @@
>>   #define __DRM_SYNCOBJ_H__
>>   
>>   #include "linux/dma-fence.h"
>> +#include "linux/dma-fence-chain.h"
>>   
>>   /**
>>    * struct drm_syncobj - sync object.
>> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj *syncobj)
>>   
>>   struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
>>   				     u32 handle);
>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
>> +			   struct dma_fence_chain *chain,
>> +			   struct dma_fence *fence,
>> +			   uint64_t point);
>>   void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
>>   			       struct dma_fence *fence);
>>   int drm_syncobj_find_fence(struct drm_file *file_private,
>> -- 
>> 2.17.1
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Daniel Vetter Dec. 12, 2018, 11:15 a.m. UTC | #3
On Wed, Dec 12, 2018 at 12:08 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> On 12.12.18 at 11:49, Daniel Vetter wrote:
> > On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
> >> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>
> >> Use the dma_fence_chain object to create a timeline of fence objects
> >> instead of just replacing the existing fence.
> >>
> >> v2: rebase and cleanup
> >>
> >> Signed-off-by: Christian König <christian.koenig@amd.com>
> > Somewhat jumping back into this. Not sure we discussed this already or
> > not. I'm a bit unclear on why we have to chain the fences in the timeline:
> >
> > - The timeline stuff is modelled after the WDDM2 monitored fences. Which
> >    really are just u64 counters in memory somewhere (I think could be
> >    system ram or vram). Because WDDM2 has the memory management entirely
> >    separated from rendering synchronization it totally allows userspace to
> >    create loops and deadlocks and everything else nasty using this - the
> >    memory manager won't deadlock because these monitored fences never leak
> >    into the buffer manager. And if CS deadlock, gpu reset takes care of the
> >    mess.
> >
> > - This has a few consequences, as in they seem to indeed work like a
> >    memory location: Userspace incrementing out-of-order (because they run
> >    batches updating the same fence on different engines) is totally fine,
> >    as is doing anything else "stupid".
> >
> > - Now on linux we can't allow anything, because we need to make sure that
> >    deadlocks don't leak into the memory manager. But as long as we block
> >    until the underlying dma_fence has materialized, nothing userspace can
> >    do will lead to such a deadlock. Even if userspace ends up submitting
> >    jobs without enough built-in synchronization, leading to out-of-order
> >    signalling of fences on that "timeline". And I don't think that would
> >    pose a problem for us.
> >
> > Essentially I think we can look at timeline syncobj as a dma_fence
> > container indexed through an integer, and there's no need to enforce that
> > the timline works like a real dma_fence timeline, with all it's
> > guarantees. It's just a pile of (possibly, if userspace is stupid)
> > unrelated dma_fences. You could implement the entire thing in userspace
> > after all, except for the "we want to share these timeline objects between
> > processes" problem.
> >
> > tldr; I think we can drop the dma_fence_chain complexity completely. Or at
> > least I'm not really understanding why it's needed.
> >
> > Of course that means drivers cannot treat a drm_syncobj timeline as a
> > dma_fence timeline. But given the future fences stuff and all that, that's
> > already out of the window anyway.
> >
> > What am I missing?
>
> Good question, since that was exactly my initial idea as well.
>
> Key point is that our Vulcan guys came back and said that this wouldn't
> be sufficient, but I honestly don't fully understand why.

Hm, sounds like we really need those testcases (vk cts on top of
mesa, igt) so we can talk about the exact corner cases we care about
and why.

I guess one thing that might happen is that userspace leaves out a
number and never sets that fence, relying on the >= semantics of the
monitored fence to unblock that thread. E.g. when skipping a frame in
one of the auxiliary workloads. For that case we'd need to make sure
we don't just wait for the given fence to materialize, but also any
fences later in the timeline.
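
A hedged sketch of that, assuming the dma_fence_chain_find_seqno() helper
from patch 01/10 (name and corner-case behaviour assumed from that patch,
so details may differ): resolving a requested point yields the first chain
node covering it, so a later point materializing also unblocks the waiter.

  #include <drm/drm_syncobj.h>
  #include <linux/dma-fence-chain.h>

  /* Illustrative only: return a reference to a fence to wait on for
   * @point, or NULL. Whether NULL means "already signaled" or "nothing
   * submitted yet" depends on the exact helper semantics in patch 01/10.
   */
  static struct dma_fence *
  resolve_point(struct drm_syncobj *syncobj, u64 point)
  {
          struct dma_fence *fence = drm_syncobj_fence_get(syncobj);

          if (!fence)
                  return NULL;

          /* Rewind to the first chain node whose seqno covers @point. */
          if (dma_fence_chain_find_seqno(&fence, point) < 0) {
                  dma_fence_put(fence);
                  return NULL;
          }

          return fence;
  }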

But we can't decide that without understanding the actual use-case
that needs to be supported at the other end of the stack, and how all
the bits in between should look like.

I guess we're back to "uapi design without userspace doesn't make sense" ...

> Anyway that's why David came up with using the fence array to wait for
> all previously added fences, which I then later on extended into this
> chain container.
>
> I have to admit that it is way more defensive implemented this way. E.g.
> there is much fewer things userspace can do wrong.
>
> The principal idea is that when they mess things up they are always
> going to wait more than necessary, but never less.

That seems against the spirit of vulkan, which is very much about "you
get all the pieces". It also might dig us a hole in the future, if we
ever get around to moving towards a WDDM2 style memory management
model. For future proofing I think it would make sense if we implement
the minimal uapi we need for vk timelines, not the strictest
guarantees we can get away with (without performance impact) with
current drivers.
-Daniel


> Christian.
>
> > -Daniel
> >
> >> ---
> >>   drivers/gpu/drm/drm_syncobj.c | 37 +++++++++++++++++++++++++++++++++++
> >>   include/drm/drm_syncobj.h     |  5 +++++
> >>   2 files changed, 42 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
> >> index e19525af0cce..51f798e2194f 100644
> >> --- a/drivers/gpu/drm/drm_syncobj.c
> >> +++ b/drivers/gpu/drm/drm_syncobj.c
> >> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct drm_syncobj *syncobj,
> >>      spin_unlock(&syncobj->lock);
> >>   }
> >>
> >> +/**
> >> + * drm_syncobj_add_point - add new timeline point to the syncobj
> >> + * @syncobj: sync object to add timeline point do
> >> + * @chain: chain node to use to add the point
> >> + * @fence: fence to encapsulate in the chain node
> >> + * @point: sequence number to use for the point
> >> + *
> >> + * Add the chain node as new timeline point to the syncobj.
> >> + */
> >> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> >> +                       struct dma_fence_chain *chain,
> >> +                       struct dma_fence *fence,
> >> +                       uint64_t point)
> >> +{
> >> +    struct syncobj_wait_entry *cur, *tmp;
> >> +    struct dma_fence *prev;
> >> +
> >> +    dma_fence_get(fence);
> >> +
> >> +    spin_lock(&syncobj->lock);
> >> +
> >> +    prev = rcu_dereference_protected(syncobj->fence,
> >> +                                     lockdep_is_held(&syncobj->lock));
> >> +    dma_fence_chain_init(chain, prev, fence, point);
> >> +    rcu_assign_pointer(syncobj->fence, &chain->base);
> >> +
> >> +    list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
> >> +            list_del_init(&cur->node);
> >> +            syncobj_wait_syncobj_func(syncobj, cur);
> >> +    }
> >> +    spin_unlock(&syncobj->lock);
> >> +
> >> +    /* Walk the chain once to trigger garbage collection */
> >> +    dma_fence_chain_for_each(prev, fence);
> >> +}
> >> +EXPORT_SYMBOL(drm_syncobj_add_point);
> >> +
> >>   /**
> >>    * drm_syncobj_replace_fence - replace fence in a sync object.
> >>    * @syncobj: Sync object to replace fence in
> >> diff --git a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h
> >> index 7c6ed845c70d..8acb4ae4f311 100644
> >> --- a/include/drm/drm_syncobj.h
> >> +++ b/include/drm/drm_syncobj.h
> >> @@ -27,6 +27,7 @@
> >>   #define __DRM_SYNCOBJ_H__
> >>
> >>   #include "linux/dma-fence.h"
> >> +#include "linux/dma-fence-chain.h"
> >>
> >>   /**
> >>    * struct drm_syncobj - sync object.
> >> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj *syncobj)
> >>
> >>   struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
> >>                                   u32 handle);
> >> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> >> +                       struct dma_fence_chain *chain,
> >> +                       struct dma_fence *fence,
> >> +                       uint64_t point);
> >>   void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
> >>                             struct dma_fence *fence);
> >>   int drm_syncobj_find_fence(struct drm_file *file_private,
> >> --
> >> 2.17.1
> >>
> >> _______________________________________________
> >> Intel-gfx mailing list
> >> Intel-gfx@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
Chunming Zhou Dec. 12, 2018, 11:39 a.m. UTC | #4
+ Daniel Rakos and Jason Ekstrand.

Below is the background from Daniel R, which should be able to explain why:
" ISVs, especially those coming from D3D12, are unsatisfied with the behavior of the Vulkan semaphores as they are unhappy with the fact that for every single dependency they need to use separate semaphores due to their binary nature.
Compared to that a synchronization primitive like D3D12 monitored fences enable one of those to be used to track a sequence of operations by simply associating timeline values to the completion of individual operations. This allows them to track the lifetime and usage of resources and the ordered completion of sequences.
Besides that, they also want to use a single synchronization primitive to be able to handle GPU-to-GPU and GPU-to-CPU dependencies, compared to using semaphores for the former and fences for the latter.
In addition, compared to legacy semaphores, timeline semaphores are proposed to support wait-before-signal, i.e. allow enqueueing a semaphore wait operation with a wait value that is larger than any of the already enqueued signal values. This seems to be a hard requirement for ISVs. Without UMD-side queue batching, and even UMD-side queue batching doesn’t help the situation when such a semaphore is externally shared with another API. Thus in order to properly support wait-before-signal the KMD implementation has to also be able to support such dependencies.
"

Btw, we already added a test case to igt, and this has been tested by many existing tests, like the libdrm unit tests, related igt tests, the Vulkan CTS, and Steam games.

-David
> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Wednesday, December 12, 2018 7:15 PM
> To: Koenig, Christian <Christian.Koenig@amd.com>
> Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com>; dri-devel <dri-
> devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>;
> intel-gfx <intel-gfx@lists.freedesktop.org>; Christian König
> <ckoenig.leichtzumerken@gmail.com>
> Subject: Re: [Intel-gfx] [PATCH 03/10] drm/syncobj: add new
> drm_syncobj_add_point interface v2
> 
> On Wed, Dec 12, 2018 at 12:08 PM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
> >
> > On 12.12.18 at 11:49, Daniel Vetter wrote:
> > > On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
> > >> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> > >>
> > >> Use the dma_fence_chain object to create a timeline of fence
> > >> objects instead of just replacing the existing fence.
> > >>
> > >> v2: rebase and cleanup
> > >>
> > >> Signed-off-by: Christian König <christian.koenig@amd.com>
> > > Somewhat jumping back into this. Not sure we discussed this already
> > > or not. I'm a bit unclear on why we have to chain the fences in the
> timeline:
> > >
> > > - The timeline stuff is modelled after the WDDM2 monitored fences.
> Which
> > >    really are just u64 counters in memory somewhere (I think could be
> > >    system ram or vram). Because WDDM2 has the memory management
> entirely
> > >    separated from rendering synchronization it totally allows userspace to
> > >    create loops and deadlocks and everything else nasty using this - the
> > >    memory manager won't deadlock because these monitored fences
> never leak
> > >    into the buffer manager. And if CS deadlock, gpu reset takes care of the
> > >    mess.
> > >
> > > - This has a few consequences, as in they seem to indeed work like a
> > >    memory location: Userspace incrementing out-of-order (because they
> run
> > >    batches updating the same fence on different engines) is totally fine,
> > >    as is doing anything else "stupid".
> > >
> > > - Now on linux we can't allow anything, because we need to make sure
> that
> > >    deadlocks don't leak into the memory manager. But as long as we block
> > >    until the underlying dma_fence has materialized, nothing userspace can
> > >    do will lead to such a deadlock. Even if userspace ends up submitting
> > >    jobs without enough built-in synchronization, leading to out-of-order
> > >    signalling of fences on that "timeline". And I don't think that would
> > >    pose a problem for us.
> > >
> > > Essentially I think we can look at timeline syncobj as a dma_fence
> > > container indexed through an integer, and there's no need to enforce
> > > that the timline works like a real dma_fence timeline, with all it's
> > > guarantees. It's just a pile of (possibly, if userspace is stupid)
> > > unrelated dma_fences. You could implement the entire thing in
> > > userspace after all, except for the "we want to share these timeline
> > > objects between processes" problem.
> > >
> > > tldr; I think we can drop the dma_fence_chain complexity completely.
> > > Or at least I'm not really understanding why it's needed.
> > >
> > > Of course that means drivers cannot treat a drm_syncobj timeline as
> > > a dma_fence timeline. But given the future fences stuff and all
> > > that, that's already out of the window anyway.
> > >
> > > What am I missing?
> >
> > Good question, since that was exactly my initial idea as well.
> >
> > Key point is that our Vulcan guys came back and said that this
> > wouldn't be sufficient, but I honestly don't fully understand why.
> 
> Hm, sounds like we really need those testscases (vk cts on top of mesa, igt)
> so we can talk about the exact corner cases we care about and why.
> 
> I guess one thing that might happen is that userspace leaves out a number
> and never sets that fence, relying on the >= semantics of the monitored
> fence to unblock that thread. E.g. when skipping a frame in one of the
> auxiliary workloads. For that case we'd need to make sure we don't just wait
> for the given fence to materialize, but also any fences later in the timeline.
> 
> But we can't decide that without understanding the actual use-case that
> needs to be supported at the other end of the stack, and how all the bits in
> between should look like.
> 
> I guess we're back to "uapi design without userspace doesn't make sense" ...
> 
> > Anyway that's why David came up with using the fence array to wait for
> > all previously added fences, which I then later on extended into this
> > chain container.
> >
> > I have to admit that it is way more defensive implemented this way. E.g.
> > there is much fewer things userspace can do wrong.
> >
> > The principal idea is that when they mess things up they are always
> > going to wait more than necessary, but never less.
> 
> That seems against the spirit of vulkan, which is very much about "you get all
> the pieces". It also might dig us a hole in the future, if we ever get around to
> moving towards a WDDM2 style memory management model. For future
> proving I think it would make sense if we implement the minimal uapi we
> need for vk timelines, not the strictest guarantees we can get away with
> (without performance impact) with current drivers.
> -Daniel
> 
> 
> > Christian.
> >
> > > -Daniel
> > >
> > >> ---
> > >>   drivers/gpu/drm/drm_syncobj.c | 37
> +++++++++++++++++++++++++++++++++++
> > >>   include/drm/drm_syncobj.h     |  5 +++++
> > >>   2 files changed, 42 insertions(+)
> > >>
> > >> diff --git a/drivers/gpu/drm/drm_syncobj.c
> > >> b/drivers/gpu/drm/drm_syncobj.c index e19525af0cce..51f798e2194f
> > >> 100644
> > >> --- a/drivers/gpu/drm/drm_syncobj.c
> > >> +++ b/drivers/gpu/drm/drm_syncobj.c
> > >> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct
> drm_syncobj *syncobj,
> > >>      spin_unlock(&syncobj->lock);
> > >>   }
> > >>
> > >> +/**
> > >> + * drm_syncobj_add_point - add new timeline point to the syncobj
> > >> + * @syncobj: sync object to add timeline point do
> > >> + * @chain: chain node to use to add the point
> > >> + * @fence: fence to encapsulate in the chain node
> > >> + * @point: sequence number to use for the point
> > >> + *
> > >> + * Add the chain node as new timeline point to the syncobj.
> > >> + */
> > >> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> > >> +                       struct dma_fence_chain *chain,
> > >> +                       struct dma_fence *fence,
> > >> +                       uint64_t point) {
> > >> +    struct syncobj_wait_entry *cur, *tmp;
> > >> +    struct dma_fence *prev;
> > >> +
> > >> +    dma_fence_get(fence);
> > >> +
> > >> +    spin_lock(&syncobj->lock);
> > >> +
> > >> +    prev = rcu_dereference_protected(syncobj->fence,
> > >> +                                     lockdep_is_held(&syncobj->lock));
> > >> +    dma_fence_chain_init(chain, prev, fence, point);
> > >> +    rcu_assign_pointer(syncobj->fence, &chain->base);
> > >> +
> > >> +    list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
> > >> +            list_del_init(&cur->node);
> > >> +            syncobj_wait_syncobj_func(syncobj, cur);
> > >> +    }
> > >> +    spin_unlock(&syncobj->lock);
> > >> +
> > >> +    /* Walk the chain once to trigger garbage collection */
> > >> +    dma_fence_chain_for_each(prev, fence); }
> > >> +EXPORT_SYMBOL(drm_syncobj_add_point);
> > >> +
> > >>   /**
> > >>    * drm_syncobj_replace_fence - replace fence in a sync object.
> > >>    * @syncobj: Sync object to replace fence in diff --git
> > >> a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h index
> > >> 7c6ed845c70d..8acb4ae4f311 100644
> > >> --- a/include/drm/drm_syncobj.h
> > >> +++ b/include/drm/drm_syncobj.h
> > >> @@ -27,6 +27,7 @@
> > >>   #define __DRM_SYNCOBJ_H__
> > >>
> > >>   #include "linux/dma-fence.h"
> > >> +#include "linux/dma-fence-chain.h"
> > >>
> > >>   /**
> > >>    * struct drm_syncobj - sync object.
> > >> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj
> > >> *syncobj)
> > >>
> > >>   struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
> > >>                                   u32 handle);
> > >> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> > >> +                       struct dma_fence_chain *chain,
> > >> +                       struct dma_fence *fence,
> > >> +                       uint64_t point);
> > >>   void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
> > >>                             struct dma_fence *fence);
> > >>   int drm_syncobj_find_fence(struct drm_file *file_private,
> > >> --
> > >> 2.17.1
> > >>
> > >> _______________________________________________
> > >> Intel-gfx mailing list
> > >> Intel-gfx@lists.freedesktop.org
> > >> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> >
> 
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Christian König Dec. 12, 2018, noon UTC | #5
> Key point is that our Vulcan guys came back and said that this
> wouldn't be sufficient, but I honestly don't fully understand why.
> Hm, sounds like we really need those testscases (vk cts on top of mesa, igt)
> so we can talk about the exact corner cases we care about and why.
Yes, that's why I made it mandatory that David provides an igt test case 
alongside the ones in libdrm.

> I guess one thing that might happen is that userspace leaves out a number
> and never sets that fence, relying on the >= semantics of the monitored
> fence to unblock that thread. E.g. when skipping a frame in one of the
> auxiliary workloads. For that case we'd need to make sure we don't just wait
> for the given fence to materialize, but also any fences later in the timeline.
Correct and that's also how we have implemented it.

> But we can't decide that without understanding the actual use-case that
> needs to be supported at the other end of the stack, and how all the bits in
> between should look like.
>
> I guess we're back to "uapi design without userspace doesn't make sense" ...
Yeah, well, chicken and egg problem. Amdvlk probably won't make the code
that supports this public until the kernel has accepted it, and the kernel
won't accept it until the amdvlk patches are public.

David, can you take care of this and release the userspace patches as well?

In addition to that, except for a bit of polishing the UAPI has stayed the
same from the very beginning while being reviewed multiple times now. So
it seems to be rather sane.

> That seems against the spirit of vulkan, which is very much about "you get all
> the pieces". It also might dig us a hole in the future, if we ever get around to
> moving towards a WDDM2 style memory management model. For future
> proving I think it would make sense if we implement the minimal uapi we
> need for vk timelines, not the strictest guarantees we can get away with
> (without performance impact) with current drivers.
Well I'm repeating myself, but while this seems to be a good idea for a
userspace API it is not necessary for a kernel API.

In other words userspace can do all the mess it wants as long as it
stays inside the same process, but when it starts to mess with
inter-process communication (e.g. X or Wayland) the stuff should be
watertight and not allow the mess to leak between processes.

And what we can always do is make the restriction looser, but
tightening it when userspace already depends on a behavior is not
possible any more.

Regards,
Christian.

On 12.12.18 at 12:39, Zhou, David(ChunMing) wrote:
> + Daniel Rakos and Jason Ekstrand.
>
>   Below is the background, which is from Daniel R should  be able to explain that's why:
> " ISVs, especially those coming from D3D12, are unsatisfied with the behavior of the Vulkan semaphores as they are unhappy with the fact that for every single dependency they need to use separate semaphores due to their binary nature.
> Compared to that a synchronization primitive like D3D12 monitored fences enable one of those to be used to track a sequence of operations by simply associating timeline values to the completion of individual operations. This allows them to track the lifetime and usage of resources and the ordered completion of sequences.
> Besides that, they also want to use a single synchronization primitive to be able to handle GPU-to-GPU and GPU-to-CPU dependencies, compared to using semaphores for the former and fences for the latter.
> In addition, compared to legacy semaphores, timeline semaphores are proposed to support wait-before-signal, i.e. allow enqueueing a semaphore wait operation with a wait value that is larger than any of the already enqueued signal values. This seems to be a hard requirement for ISVs. Without UMD-side queue batching, and even UMD-side queue batching doesn’t help the situation when such a semaphore is externally shared with another API. Thus in order to properly support wait-before-signal the KMD implementation has to also be able to support such dependencies.
> "
>
> Btw, we already add test case to igt, and tested by many existing test, like libdrm unit test, igt related test, vulkan cts, and steam games.
>
> -David
>> -----Original Message-----
>> From: Daniel Vetter <daniel@ffwll.ch>
>> Sent: Wednesday, December 12, 2018 7:15 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>
>> Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com>; dri-devel <dri-
>> devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>;
>> intel-gfx <intel-gfx@lists.freedesktop.org>; Christian König
>> <ckoenig.leichtzumerken@gmail.com>
>> Subject: Re: [Intel-gfx] [PATCH 03/10] drm/syncobj: add new
>> drm_syncobj_add_point interface v2
>>
>> On Wed, Dec 12, 2018 at 12:08 PM Koenig, Christian
>> <Christian.Koenig@amd.com> wrote:
>>> On 12.12.18 at 11:49, Daniel Vetter wrote:
>>>> On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>
>>>>> Use the dma_fence_chain object to create a timeline of fence
>>>>> objects instead of just replacing the existing fence.
>>>>>
>>>>> v2: rebase and cleanup
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> Somewhat jumping back into this. Not sure we discussed this already
>>>> or not. I'm a bit unclear on why we have to chain the fences in the
>> timeline:
>>>> - The timeline stuff is modelled after the WDDM2 monitored fences.
>> Which
>>>>     really are just u64 counters in memory somewhere (I think could be
>>>>     system ram or vram). Because WDDM2 has the memory management
>> entirely
>>>>     separated from rendering synchronization it totally allows userspace to
>>>>     create loops and deadlocks and everything else nasty using this - the
>>>>     memory manager won't deadlock because these monitored fences
>> never leak
>>>>     into the buffer manager. And if CS deadlock, gpu reset takes care of the
>>>>     mess.
>>>>
>>>> - This has a few consequences, as in they seem to indeed work like a
>>>>     memory location: Userspace incrementing out-of-order (because they
>> run
>>>>     batches updating the same fence on different engines) is totally fine,
>>>>     as is doing anything else "stupid".
>>>>
>>>> - Now on linux we can't allow anything, because we need to make sure
>> that
>>>>     deadlocks don't leak into the memory manager. But as long as we block
>>>>     until the underlying dma_fence has materialized, nothing userspace can
>>>>     do will lead to such a deadlock. Even if userspace ends up submitting
>>>>     jobs without enough built-in synchronization, leading to out-of-order
>>>>     signalling of fences on that "timeline". And I don't think that would
>>>>     pose a problem for us.
>>>>
>>>> Essentially I think we can look at timeline syncobj as a dma_fence
>>>> container indexed through an integer, and there's no need to enforce
>>>> that the timline works like a real dma_fence timeline, with all it's
>>>> guarantees. It's just a pile of (possibly, if userspace is stupid)
>>>> unrelated dma_fences. You could implement the entire thing in
>>>> userspace after all, except for the "we want to share these timeline
>>>> objects between processes" problem.
>>>>
>>>> tldr; I think we can drop the dma_fence_chain complexity completely.
>>>> Or at least I'm not really understanding why it's needed.
>>>>
>>>> Of course that means drivers cannot treat a drm_syncobj timeline as
>>>> a dma_fence timeline. But given the future fences stuff and all
>>>> that, that's already out of the window anyway.
>>>>
>>>> What am I missing?
>>> Good question, since that was exactly my initial idea as well.
>>>
>>> Key point is that our Vulcan guys came back and said that this
>>> wouldn't be sufficient, but I honestly don't fully understand why.
>> Hm, sounds like we really need those testscases (vk cts on top of mesa, igt)
>> so we can talk about the exact corner cases we care about and why.
>>
>> I guess one thing that might happen is that userspace leaves out a number
>> and never sets that fence, relying on the >= semantics of the monitored
>> fence to unblock that thread. E.g. when skipping a frame in one of the
>> auxiliary workloads. For that case we'd need to make sure we don't just wait
>> for the given fence to materialize, but also any fences later in the timeline.
>>
>> But we can't decide that without understanding the actual use-case that
>> needs to be supported at the other end of the stack, and how all the bits in
>> between should look like.
>>
>> I guess we're back to "uapi design without userspace doesn't make sense" ...
>>
>>> Anyway that's why David came up with using the fence array to wait for
>>> all previously added fences, which I then later on extended into this
>>> chain container.
>>>
>>> I have to admit that it is way more defensive implemented this way. E.g.
>>> there is much fewer things userspace can do wrong.
>>>
>>> The principal idea is that when they mess things up they are always
>>> going to wait more than necessary, but never less.
>> That seems against the spirit of vulkan, which is very much about "you get all
>> the pieces". It also might dig us a hole in the future, if we ever get around to
>> moving towards a WDDM2 style memory management model. For future
>> proving I think it would make sense if we implement the minimal uapi we
>> need for vk timelines, not the strictest guarantees we can get away with
>> (without performance impact) with current drivers.
>> -Daniel
>>
>>
>>> Christian.
>>>
>>>> -Daniel
>>>>
>>>>> ---
>>>>>    drivers/gpu/drm/drm_syncobj.c | 37
>> +++++++++++++++++++++++++++++++++++
>>>>>    include/drm/drm_syncobj.h     |  5 +++++
>>>>>    2 files changed, 42 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/drm_syncobj.c
>>>>> b/drivers/gpu/drm/drm_syncobj.c index e19525af0cce..51f798e2194f
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/drm_syncobj.c
>>>>> +++ b/drivers/gpu/drm/drm_syncobj.c
>>>>> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct
>> drm_syncobj *syncobj,
>>>>>       spin_unlock(&syncobj->lock);
>>>>>    }
>>>>>
>>>>> +/**
>>>>> + * drm_syncobj_add_point - add new timeline point to the syncobj
>>>>> + * @syncobj: sync object to add timeline point do
>>>>> + * @chain: chain node to use to add the point
>>>>> + * @fence: fence to encapsulate in the chain node
>>>>> + * @point: sequence number to use for the point
>>>>> + *
>>>>> + * Add the chain node as new timeline point to the syncobj.
>>>>> + */
>>>>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
>>>>> +                       struct dma_fence_chain *chain,
>>>>> +                       struct dma_fence *fence,
>>>>> +                       uint64_t point) {
>>>>> +    struct syncobj_wait_entry *cur, *tmp;
>>>>> +    struct dma_fence *prev;
>>>>> +
>>>>> +    dma_fence_get(fence);
>>>>> +
>>>>> +    spin_lock(&syncobj->lock);
>>>>> +
>>>>> +    prev = rcu_dereference_protected(syncobj->fence,
>>>>> +                                     lockdep_is_held(&syncobj->lock));
>>>>> +    dma_fence_chain_init(chain, prev, fence, point);
>>>>> +    rcu_assign_pointer(syncobj->fence, &chain->base);
>>>>> +
>>>>> +    list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
>>>>> +            list_del_init(&cur->node);
>>>>> +            syncobj_wait_syncobj_func(syncobj, cur);
>>>>> +    }
>>>>> +    spin_unlock(&syncobj->lock);
>>>>> +
>>>>> +    /* Walk the chain once to trigger garbage collection */
>>>>> +    dma_fence_chain_for_each(prev, fence); }
>>>>> +EXPORT_SYMBOL(drm_syncobj_add_point);
>>>>> +
>>>>>    /**
>>>>>     * drm_syncobj_replace_fence - replace fence in a sync object.
>>>>>     * @syncobj: Sync object to replace fence in diff --git
>>>>> a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h index
>>>>> 7c6ed845c70d..8acb4ae4f311 100644
>>>>> --- a/include/drm/drm_syncobj.h
>>>>> +++ b/include/drm/drm_syncobj.h
>>>>> @@ -27,6 +27,7 @@
>>>>>    #define __DRM_SYNCOBJ_H__
>>>>>
>>>>>    #include "linux/dma-fence.h"
>>>>> +#include "linux/dma-fence-chain.h"
>>>>>
>>>>>    /**
>>>>>     * struct drm_syncobj - sync object.
>>>>> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj
>>>>> *syncobj)
>>>>>
>>>>>    struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
>>>>>                                    u32 handle);
>>>>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
>>>>> +                       struct dma_fence_chain *chain,
>>>>> +                       struct dma_fence *fence,
>>>>> +                       uint64_t point);
>>>>>    void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
>>>>>                              struct dma_fence *fence);
>>>>>    int drm_syncobj_find_fence(struct drm_file *file_private,
>>>>> --
>>>>> 2.17.1
>>>>>
>>>>> _______________________________________________
>>>>> Intel-gfx mailing list
>>>>> Intel-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Daniel Vetter Dec. 12, 2018, 12:20 p.m. UTC | #6
On Wed, Dec 12, 2018 at 1:00 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> > Key point is that our Vulcan guys came back and said that this
> > wouldn't be sufficient, but I honestly don't fully understand why.
> > Hm, sounds like we really need those testscases (vk cts on top of mesa, igt)
> > so we can talk about the exact corner cases we care about and why.
> Yes, that's why I made it mandatory that David provides an igt test case
> along the ones in libdrm.
>
> > I guess one thing that might happen is that userspace leaves out a number
> > and never sets that fence, relying on the >= semantics of the monitored
> > fence to unblock that thread. E.g. when skipping a frame in one of the
> > auxiliary workloads. For that case we'd need to make sure we don't just wait
> > for the given fence to materialize, but also any fences later in the timeline.
> Correct and that's also how we have implemented it.
>
> > But we can't decide that without understanding the actual use-case that
> > needs to be supported at the other end of the stack, and how all the bits in
> > between should look like.
> >
> > I guess we're back to "uapi design without userspace doesn't make sense" ...
> Yeah, well chicken and egg problem. Amdvlk probably won't make the code
> to support this public until the kernel has accepted it and the kernel
> doesn't accept it until the amdvlk patches are public.

That's not how we do uapi development.

> David can you take care of this and release the userspace patches as well?

Please also drag the people typing the code onto the mailing list, not
just the code. Code alone doesn't make for a useful discussion :-)

Also, someone needs to drag the radv/anv side into this, because I
don't want to also mix up the technical issues here with the entire
"is amdvlk good enough for uapi development" question ... (as is, it
seems to fall short on the technicality of simply not doing development
openly, but I hope that's fixable).

> Additional to that except for a bit polishing the UAPI stayed the same
> from the very beginning while being reviewed multiple times now. So that
> seems to be rather sane.
>
> > That seems against the spirit of vulkan, which is very much about "you get all
> > the pieces". It also might dig us a hole in the future, if we ever get around to
> > moving towards a WDDM2 style memory management model. For future
> > proving I think it would make sense if we implement the minimal uapi we
> > need for vk timelines, not the strictest guarantees we can get away with
> > (without performance impact) with current drivers.
> Well I'm repeating myself, but while this seems to be a good idea for an
> userspace API it is not necessary for a kernel API.
>
> In other words userspace can do all the mess it wants in as long as it
> stays inside the same process, but when it starts to mess with
> inter-process communication (e.g. X or Wayland) the stuff should be
> water prove and not allow for mess to leak between processes.
>
> And what we can always do is to make the restriction more lose, but
> tightening it when userspace already depends on a behavior is not
> possible any more.

The point of vk timelines seems to be that userspace wants the mess,
even across processes. E.g. for VR compositors, which run in some
other address space. If you don't want to leak the mess, don't use vk
timelines, use normal fences. Which is what all the X/wayland
protocols seem to be doing. So the mess is strictly opt-in for
userspace, but seems entirely desired.
-Daniel

>
> Regards,
> Christian.
>
> On 12.12.18 at 12:39, Zhou, David(ChunMing) wrote:
> > + Daniel Rakos and Jason Ekstrand.
> >
> >   Below is the background, which is from Daniel R should  be able to explain that's why:
> > " ISVs, especially those coming from D3D12, are unsatisfied with the behavior of the Vulkan semaphores as they are unhappy with the fact that for every single dependency they need to use separate semaphores due to their binary nature.
> > Compared to that a synchronization primitive like D3D12 monitored fences enable one of those to be used to track a sequence of operations by simply associating timeline values to the completion of individual operations. This allows them to track the lifetime and usage of resources and the ordered completion of sequences.
> > Besides that, they also want to use a single synchronization primitive to be able to handle GPU-to-GPU and GPU-to-CPU dependencies, compared to using semaphores for the former and fences for the latter.
> > In addition, compared to legacy semaphores, timeline semaphores are proposed to support wait-before-signal, i.e. allow enqueueing a semaphore wait operation with a wait value that is larger than any of the already enqueued signal values. This seems to be a hard requirement for ISVs. Without UMD-side queue batching, and even UMD-side queue batching doesn’t help the situation when such a semaphore is externally shared with another API. Thus in order to properly support wait-before-signal the KMD implementation has to also be able to support such dependencies.
> > "
> >
> > Btw, we already add test case to igt, and tested by many existing test, like libdrm unit test, igt related test, vulkan cts, and steam games.
> >
> > -David
> >> -----Original Message-----
> >> From: Daniel Vetter <daniel@ffwll.ch>
> >> Sent: Wednesday, December 12, 2018 7:15 PM
> >> To: Koenig, Christian <Christian.Koenig@amd.com>
> >> Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com>; dri-devel <dri-
> >> devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>;
> >> intel-gfx <intel-gfx@lists.freedesktop.org>; Christian König
> >> <ckoenig.leichtzumerken@gmail.com>
> >> Subject: Re: [Intel-gfx] [PATCH 03/10] drm/syncobj: add new
> >> drm_syncobj_add_point interface v2
> >>
> >> On Wed, Dec 12, 2018 at 12:08 PM Koenig, Christian
> >> <Christian.Koenig@amd.com> wrote:
> >>> On 12.12.18 at 11:49, Daniel Vetter wrote:
> >>>> On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
> >>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>>>>
> >>>>> Use the dma_fence_chain object to create a timeline of fence
> >>>>> objects instead of just replacing the existing fence.
> >>>>>
> >>>>> v2: rebase and cleanup
> >>>>>
> >>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>> Somewhat jumping back into this. Not sure we discussed this already
> >>>> or not. I'm a bit unclear on why we have to chain the fences in the
> >> timeline:
> >>>> - The timeline stuff is modelled after the WDDM2 monitored fences.
> >> Which
> >>>>     really are just u64 counters in memory somewhere (I think could be
> >>>>     system ram or vram). Because WDDM2 has the memory management
> >> entirely
> >>>>     separated from rendering synchronization it totally allows userspace to
> >>>>     create loops and deadlocks and everything else nasty using this - the
> >>>>     memory manager won't deadlock because these monitored fences
> >> never leak
> >>>>     into the buffer manager. And if CS deadlock, gpu reset takes care of the
> >>>>     mess.
> >>>>
> >>>> - This has a few consequences, as in they seem to indeed work like a
> >>>>     memory location: Userspace incrementing out-of-order (because they
> >> run
> >>>>     batches updating the same fence on different engines) is totally fine,
> >>>>     as is doing anything else "stupid".
> >>>>
> >>>> - Now on linux we can't allow anything, because we need to make sure
> >> that
> >>>>     deadlocks don't leak into the memory manager. But as long as we block
> >>>>     until the underlying dma_fence has materialized, nothing userspace can
> >>>>     do will lead to such a deadlock. Even if userspace ends up submitting
> >>>>     jobs without enough built-in synchronization, leading to out-of-order
> >>>>     signalling of fences on that "timeline". And I don't think that would
> >>>>     pose a problem for us.
> >>>>
> >>>> Essentially I think we can look at timeline syncobj as a dma_fence
> >>>> container indexed through an integer, and there's no need to enforce
> >>>> that the timline works like a real dma_fence timeline, with all it's
> >>>> guarantees. It's just a pile of (possibly, if userspace is stupid)
> >>>> unrelated dma_fences. You could implement the entire thing in
> >>>> userspace after all, except for the "we want to share these timeline
> >>>> objects between processes" problem.
> >>>>
> >>>> tldr; I think we can drop the dma_fence_chain complexity completely.
> >>>> Or at least I'm not really understanding why it's needed.
> >>>>
> >>>> Of course that means drivers cannot treat a drm_syncobj timeline as
> >>>> a dma_fence timeline. But given the future fences stuff and all
> >>>> that, that's already out of the window anyway.
> >>>>
> >>>> What am I missing?
> >>> Good question, since that was exactly my initial idea as well.
> >>>
> >>> Key point is that our Vulcan guys came back and said that this
> >>> wouldn't be sufficient, but I honestly don't fully understand why.
> >> Hm, sounds like we really need those testscases (vk cts on top of mesa, igt)
> >> so we can talk about the exact corner cases we care about and why.
> >>
> >> I guess one thing that might happen is that userspace leaves out a number
> >> and never sets that fence, relying on the >= semantics of the monitored
> >> fence to unblock that thread. E.g. when skipping a frame in one of the
> >> auxiliary workloads. For that case we'd need to make sure we don't just wait
> >> for the given fence to materialize, but also any fences later in the timeline.
> >>
> >> But we can't decide that without understanding the actual use-case that
> >> needs to be supported at the other end of the stack, and how all the bits in
> >> between should look like.
> >>
> >> I guess we're back to "uapi design without userspace doesn't make sense" ...
> >>
> >>> Anyway that's why David came up with using the fence array to wait for
> >>> all previously added fences, which I then later on extended into this
> >>> chain container.
> >>>
> >>> I have to admit that it is way more defensive implemented this way. E.g.
> >>> there is much fewer things userspace can do wrong.
> >>>
> >>> The principal idea is that when they mess things up they are always
> >>> going to wait more than necessary, but never less.
> >> That seems against the spirit of vulkan, which is very much about "you get all
> >> the pieces". It also might dig us a hole in the future, if we ever get around to
> >> moving towards a WDDM2 style memory management model. For future
> >> proving I think it would make sense if we implement the minimal uapi we
> >> need for vk timelines, not the strictest guarantees we can get away with
> >> (without performance impact) with current drivers.
> >> -Daniel
> >>
> >>
> >>> Christian.
> >>>
> >>>> -Daniel
> >>>>
> >>>>> ---
> >>>>>    drivers/gpu/drm/drm_syncobj.c | 37
> >> +++++++++++++++++++++++++++++++++++
> >>>>>    include/drm/drm_syncobj.h     |  5 +++++
> >>>>>    2 files changed, 42 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/drm_syncobj.c
> >>>>> b/drivers/gpu/drm/drm_syncobj.c index e19525af0cce..51f798e2194f
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/drm_syncobj.c
> >>>>> +++ b/drivers/gpu/drm/drm_syncobj.c
> >>>>> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct
> >> drm_syncobj *syncobj,
> >>>>>       spin_unlock(&syncobj->lock);
> >>>>>    }
> >>>>>
> >>>>> +/**
> >>>>> + * drm_syncobj_add_point - add new timeline point to the syncobj
> >>>>> + * @syncobj: sync object to add timeline point do
> >>>>> + * @chain: chain node to use to add the point
> >>>>> + * @fence: fence to encapsulate in the chain node
> >>>>> + * @point: sequence number to use for the point
> >>>>> + *
> >>>>> + * Add the chain node as new timeline point to the syncobj.
> >>>>> + */
> >>>>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> >>>>> +                       struct dma_fence_chain *chain,
> >>>>> +                       struct dma_fence *fence,
> >>>>> +                       uint64_t point) {
> >>>>> +    struct syncobj_wait_entry *cur, *tmp;
> >>>>> +    struct dma_fence *prev;
> >>>>> +
> >>>>> +    dma_fence_get(fence);
> >>>>> +
> >>>>> +    spin_lock(&syncobj->lock);
> >>>>> +
> >>>>> +    prev = rcu_dereference_protected(syncobj->fence,
> >>>>> +                                     lockdep_is_held(&syncobj->lock));
> >>>>> +    dma_fence_chain_init(chain, prev, fence, point);
> >>>>> +    rcu_assign_pointer(syncobj->fence, &chain->base);
> >>>>> +
> >>>>> +    list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
> >>>>> +            list_del_init(&cur->node);
> >>>>> +            syncobj_wait_syncobj_func(syncobj, cur);
> >>>>> +    }
> >>>>> +    spin_unlock(&syncobj->lock);
> >>>>> +
> >>>>> +    /* Walk the chain once to trigger garbage collection */
> >>>>> +    dma_fence_chain_for_each(prev, fence); }
> >>>>> +EXPORT_SYMBOL(drm_syncobj_add_point);
> >>>>> +
> >>>>>    /**
> >>>>>     * drm_syncobj_replace_fence - replace fence in a sync object.
> >>>>>     * @syncobj: Sync object to replace fence in diff --git
> >>>>> a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h index
> >>>>> 7c6ed845c70d..8acb4ae4f311 100644
> >>>>> --- a/include/drm/drm_syncobj.h
> >>>>> +++ b/include/drm/drm_syncobj.h
> >>>>> @@ -27,6 +27,7 @@
> >>>>>    #define __DRM_SYNCOBJ_H__
> >>>>>
> >>>>>    #include "linux/dma-fence.h"
> >>>>> +#include "linux/dma-fence-chain.h"
> >>>>>
> >>>>>    /**
> >>>>>     * struct drm_syncobj - sync object.
> >>>>> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj
> >>>>> *syncobj)
> >>>>>
> >>>>>    struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
> >>>>>                                    u32 handle);
> >>>>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> >>>>> +                       struct dma_fence_chain *chain,
> >>>>> +                       struct dma_fence *fence,
> >>>>> +                       uint64_t point);
> >>>>>    void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
> >>>>>                              struct dma_fence *fence);
> >>>>>    int drm_syncobj_find_fence(struct drm_file *file_private,
> >>>>> --
> >>>>> 2.17.1
> >>>>>
> >>>>> _______________________________________________
> >>>>> Intel-gfx mailing list
> >>>>> Intel-gfx@lists.freedesktop.org
> >>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> >>
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
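
For illustration, this is roughly how a waiter would resolve a timeline point against the chain discussed above. It is only a sketch: it assumes a lookup helper along the lines of dma_fence_chain_find_seqno() from the dma-fence-chain container (patch 01 of this series), and example_syncobj_point_to_fence() is a made-up name, not part of the uAPI. The property Christian describes ("wait more than necessary, but never less") falls out of the walk: if the exact point was never added, the lookup lands on a later chain node instead.

/* Sketch only: resolve timeline point "point" on a syncobj to a waitable fence. */
static struct dma_fence *
example_syncobj_point_to_fence(struct drm_syncobj *syncobj, u64 point)
{
	struct dma_fence *fence = drm_syncobj_fence_get(syncobj);
	int ret;

	if (!fence)
		return NULL;

	/*
	 * Walks the chain and keeps the earliest node with seqno >= point.
	 * A skipped point therefore resolves to a later node (over-wait),
	 * never to an earlier one (under-wait).  If the point has not
	 * materialized at all yet, the lookup fails and the caller has to
	 * wait for it to be added first.
	 */
	ret = dma_fence_chain_find_seqno(&fence, point);
	if (ret) {
		dma_fence_put(fence);
		return ERR_PTR(ret);
	}

	return fence;
}
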
Daniel Vetter Dec. 12, 2018, 12:24 p.m. UTC | #7
On Wed, Dec 12, 2018 at 12:40 PM Zhou, David(ChunMing)
<David1.Zhou@amd.com> wrote:
>
> + Daniel Rakos and Jason Ekstrand.
>
>  Below is the background, which is from Daniel R and should explain why:
> " ISVs, especially those coming from D3D12, are unsatisfied with the behavior of the Vulkan semaphores as they are unhappy with the fact that for every single dependency they need to use separate semaphores due to their binary nature.
> Compared to that a synchronization primitive like D3D12 monitored fences enable one of those to be used to track a sequence of operations by simply associating timeline values to the completion of individual operations. This allows them to track the lifetime and usage of resources and the ordered completion of sequences.
> Besides that, they also want to use a single synchronization primitive to be able to handle GPU-to-GPU and GPU-to-CPU dependencies, compared to using semaphores for the former and fences for the latter.
> In addition, compared to legacy semaphores, timeline semaphores are proposed to support wait-before-signal, i.e. allow enqueueing a semaphore wait operation with a wait value that is larger than any of the already enqueued signal values. This seems to be a hard requirement for ISVs and is not achievable without UMD-side queue batching; even UMD-side queue batching doesn’t help the situation when such a semaphore is externally shared with another API. Thus in order to properly support wait-before-signal the KMD implementation has to also be able to support such dependencies.
> "

I was tangentially involved in that wg too, so I understand the overall
use-case of vk timelines. I don't understand the exact corner case
here, because I wasn't deeply involved in the details.
-Daniel

> Btw, we already added test cases to igt, and this has been tested by many existing tests, like the libdrm unit tests, related igt tests, the Vulkan CTS, and Steam games.
>
> -David
> > -----Original Message-----
> > From: Daniel Vetter <daniel@ffwll.ch>
> > Sent: Wednesday, December 12, 2018 7:15 PM
> > To: Koenig, Christian <Christian.Koenig@amd.com>
> > Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com>; dri-devel <dri-
> > devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>;
> > intel-gfx <intel-gfx@lists.freedesktop.org>; Christian König
> > <ckoenig.leichtzumerken@gmail.com>
> > Subject: Re: [Intel-gfx] [PATCH 03/10] drm/syncobj: add new
> > drm_syncobj_add_point interface v2
> >
> > On Wed, Dec 12, 2018 at 12:08 PM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> > >
> > > Am 12.12.18 um 11:49 schrieb Daniel Vetter:
> > > > On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
> > > >> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> > > >>
> > > >> Use the dma_fence_chain object to create a timeline of fence
> > > >> objects instead of just replacing the existing fence.
> > > >>
> > > >> v2: rebase and cleanup
> > > >>
> > > >> Signed-off-by: Christian König <christian.koenig@amd.com>
> > > > Somewhat jumping back into this. Not sure we discussed this already
> > > > or not. I'm a bit unclear on why we have to chain the fences in the
> > timeline:
> > > >
> > > > - The timeline stuff is modelled after the WDDM2 monitored fences.
> > Which
> > > >    really are just u64 counters in memory somewhere (I think could be
> > > >    system ram or vram). Because WDDM2 has the memory management
> > entirely
> > > >    separated from rendering synchronization it totally allows userspace to
> > > >    create loops and deadlocks and everything else nasty using this - the
> > > >    memory manager won't deadlock because these monitored fences
> > never leak
> > > >    into the buffer manager. And if CS deadlock, gpu reset takes care of the
> > > >    mess.
> > > >
> > > > - This has a few consequences, as in they seem to indeed work like a
> > > >    memory location: Userspace incrementing out-of-order (because they
> > run
> > > >    batches updating the same fence on different engines) is totally fine,
> > > >    as is doing anything else "stupid".
> > > >
> > > > - Now on linux we can't allow anything, because we need to make sure
> > that
> > > >    deadlocks don't leak into the memory manager. But as long as we block
> > > >    until the underlying dma_fence has materialized, nothing userspace can
> > > >    do will lead to such a deadlock. Even if userspace ends up submitting
> > > >    jobs without enough built-in synchronization, leading to out-of-order
> > > >    signalling of fences on that "timeline". And I don't think that would
> > > >    pose a problem for us.
> > > >
> > > > Essentially I think we can look at timeline syncobj as a dma_fence
> > > > container indexed through an integer, and there's no need to enforce
> > > > that the timeline works like a real dma_fence timeline, with all its
> > > > guarantees. It's just a pile of (possibly, if userspace is stupid)
> > > > unrelated dma_fences. You could implement the entire thing in
> > > > userspace after all, except for the "we want to share these timeline
> > > > objects between processes" problem.
> > > >
> > > > tldr; I think we can drop the dma_fence_chain complexity completely.
> > > > Or at least I'm not really understanding why it's needed.
> > > >
> > > > Of course that means drivers cannot treat a drm_syncobj timeline as
> > > > a dma_fence timeline. But given the future fences stuff and all
> > > > that, that's already out of the window anyway.
> > > >
> > > > What am I missing?
> > >
> > > Good question, since that was exactly my initial idea as well.
> > >
> > > Key point is that our Vulkan guys came back and said that this
> > > wouldn't be sufficient, but I honestly don't fully understand why.
> >
> > Hm, sounds like we really need those testcases (vk cts on top of mesa, igt)
> > so we can talk about the exact corner cases we care about and why.
> >
> > I guess one thing that might happen is that userspace leaves out a number
> > and never sets that fence, relying on the >= semantics of the monitored
> > fence to unblock that thread. E.g. when skipping a frame in one of the
> > auxiliary workloads. For that case we'd need to make sure we don't just wait
> > for the given fence to materialize, but also any fences later in the timeline.
> >
> > But we can't decide that without understanding the actual use-case that
> > needs to be supported at the other end of the stack, and how all the bits in
> > between should look like.
> >
> > I guess we're back to "uapi design without userspace doesn't make sense" ...
> >
> > > Anyway that's why David came up with using the fence array to wait for
> > > all previously added fences, which I then later on extended into this
> > > chain container.
> > >
> > > I have to admit that it is way more defensive implemented this way. E.g.
> > > there is much fewer things userspace can do wrong.
> > >
> > > The principal idea is that when they mess things up they are always
> > > going to wait more than necessary, but never less.
> >
> > That seems against the spirit of vulkan, which is very much about "you get all
> > the pieces". It also might dig us a hole in the future, if we ever get around to
> > moving towards a WDDM2 style memory management model. For future
> > proofing I think it would make sense if we implement the minimal uapi we
> > need for vk timelines, not the strictest guarantees we can get away with
> > (without performance impact) with current drivers.
> > -Daniel
> >
> >
> > > Christian.
> > >
> > > > -Daniel
> > > >
> > > >> ---
> > > >>   drivers/gpu/drm/drm_syncobj.c | 37
> > +++++++++++++++++++++++++++++++++++
> > > >>   include/drm/drm_syncobj.h     |  5 +++++
> > > >>   2 files changed, 42 insertions(+)
> > > >>
> > > >> diff --git a/drivers/gpu/drm/drm_syncobj.c
> > > >> b/drivers/gpu/drm/drm_syncobj.c index e19525af0cce..51f798e2194f
> > > >> 100644
> > > >> --- a/drivers/gpu/drm/drm_syncobj.c
> > > >> +++ b/drivers/gpu/drm/drm_syncobj.c
> > > >> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct
> > drm_syncobj *syncobj,
> > > >>      spin_unlock(&syncobj->lock);
> > > >>   }
> > > >>
> > > >> +/**
> > > >> + * drm_syncobj_add_point - add new timeline point to the syncobj
> > > >> + * @syncobj: sync object to add timeline point do
> > > >> + * @chain: chain node to use to add the point
> > > >> + * @fence: fence to encapsulate in the chain node
> > > >> + * @point: sequence number to use for the point
> > > >> + *
> > > >> + * Add the chain node as new timeline point to the syncobj.
> > > >> + */
> > > >> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> > > >> +                       struct dma_fence_chain *chain,
> > > >> +                       struct dma_fence *fence,
> > > >> +                       uint64_t point) {
> > > >> +    struct syncobj_wait_entry *cur, *tmp;
> > > >> +    struct dma_fence *prev;
> > > >> +
> > > >> +    dma_fence_get(fence);
> > > >> +
> > > >> +    spin_lock(&syncobj->lock);
> > > >> +
> > > >> +    prev = rcu_dereference_protected(syncobj->fence,
> > > >> +                                     lockdep_is_held(&syncobj->lock));
> > > >> +    dma_fence_chain_init(chain, prev, fence, point);
> > > >> +    rcu_assign_pointer(syncobj->fence, &chain->base);
> > > >> +
> > > >> +    list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
> > > >> +            list_del_init(&cur->node);
> > > >> +            syncobj_wait_syncobj_func(syncobj, cur);
> > > >> +    }
> > > >> +    spin_unlock(&syncobj->lock);
> > > >> +
> > > >> +    /* Walk the chain once to trigger garbage collection */
> > > >> +    dma_fence_chain_for_each(prev, fence); }
> > > >> +EXPORT_SYMBOL(drm_syncobj_add_point);
> > > >> +
> > > >>   /**
> > > >>    * drm_syncobj_replace_fence - replace fence in a sync object.
> > > >>    * @syncobj: Sync object to replace fence in diff --git
> > > >> a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h index
> > > >> 7c6ed845c70d..8acb4ae4f311 100644
> > > >> --- a/include/drm/drm_syncobj.h
> > > >> +++ b/include/drm/drm_syncobj.h
> > > >> @@ -27,6 +27,7 @@
> > > >>   #define __DRM_SYNCOBJ_H__
> > > >>
> > > >>   #include "linux/dma-fence.h"
> > > >> +#include "linux/dma-fence-chain.h"
> > > >>
> > > >>   /**
> > > >>    * struct drm_syncobj - sync object.
> > > >> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj
> > > >> *syncobj)
> > > >>
> > > >>   struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
> > > >>                                   u32 handle);
> > > >> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
> > > >> +                       struct dma_fence_chain *chain,
> > > >> +                       struct dma_fence *fence,
> > > >> +                       uint64_t point);
> > > >>   void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
> > > >>                             struct dma_fence *fence);
> > > >>   int drm_syncobj_find_fence(struct drm_file *file_private,
> > > >> --
> > > >> 2.17.1
> > > >>
> > > >> _______________________________________________
> > > >> Intel-gfx mailing list
> > > >> Intel-gfx@lists.freedesktop.org
> > > >> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
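
For comparison, the alternative sketched in the mail above, treating the timeline purely as a dma_fence container indexed by a u64 with no chaining, could look roughly like this. This is illustrative only and not part of the series; the xarray, the struct and the function names are all made up, and locking, error handling and 64-bit index truncation on 32-bit are ignored. The visible difference to the chain approach is that a skipped point simply has no fence at all, rather than resolving to a later one.

/* Sketch only: a timeline as a plain point -> fence map, no chain nodes. */
struct example_timeline {
	struct xarray points;	/* u64 point -> struct dma_fence *, xa_init() elided */
};

static void example_timeline_add(struct example_timeline *tl,
				 u64 point, struct dma_fence *fence)
{
	/* Out-of-order or duplicate points are simply accepted as-is. */
	xa_store(&tl->points, point, dma_fence_get(fence), GFP_KERNEL);
}

static struct dma_fence *
example_timeline_find(struct example_timeline *tl, u64 point)
{
	/* NULL if userspace never submitted this exact point. */
	return dma_fence_get(xa_load(&tl->points, point));
}
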
Chunming Zhou Dec. 12, 2018, 1:06 p.m. UTC | #8
On 2018/12/12 20:24, Daniel Vetter wrote:
> On Wed, Dec 12, 2018 at 12:40 PM Zhou, David(ChunMing)
> <David1.Zhou@amd.com> wrote:
>> + Daniel Rakos and Jason Ekstrand.
>>
>>   Below is the background, which is from Daniel R and should explain why:
>> " ISVs, especially those coming from D3D12, are unsatisfied with the behavior of the Vulkan semaphores as they are unhappy with the fact that for every single dependency they need to use separate semaphores due to their binary nature.
>> Compared to that a synchronization primitive like D3D12 monitored fences enable one of those to be used to track a sequence of operations by simply associating timeline values to the completion of individual operations. This allows them to track the lifetime and usage of resources and the ordered completion of sequences.
>> Besides that, they also want to use a single synchronization primitive to be able to handle GPU-to-GPU and GPU-to-CPU dependencies, compared to using semaphores for the former and fences for the latter.
>> In addition, compared to legacy semaphores, timeline semaphores are proposed to support wait-before-signal, i.e. allow enqueueing a semaphore wait operation with a wait value that is larger than any of the already enqueued signal values. This seems to be a hard requirement for ISVs and is not achievable without UMD-side queue batching; even UMD-side queue batching doesn’t help the situation when such a semaphore is externally shared with another API. Thus in order to properly support wait-before-signal the KMD implementation has to also be able to support such dependencies.
>> "
> I was tangentially involved in that wg too, so I understand the overall
> use-case of vk timelines. I don't understand the exact corner case
> here, because I wasn't deeply involved in the details.


All details are here:
https://gitlab.khronos.org/vulkan/vulkan/merge_requests/2696

-David

> -Daniel
>
>> Btw, we already added test cases to igt, and this has been tested by many existing tests, like the libdrm unit tests, related igt tests, the Vulkan CTS, and Steam games.
>>
>> -David
>>> -----Original Message-----
>>> From: Daniel Vetter <daniel@ffwll.ch>
>>> Sent: Wednesday, December 12, 2018 7:15 PM
>>> To: Koenig, Christian <Christian.Koenig@amd.com>
>>> Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com>; dri-devel <dri-
>>> devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>;
>>> intel-gfx <intel-gfx@lists.freedesktop.org>; Christian König
>>> <ckoenig.leichtzumerken@gmail.com>
>>> Subject: Re: [Intel-gfx] [PATCH 03/10] drm/syncobj: add new
>>> drm_syncobj_add_point interface v2
>>>
>>> On Wed, Dec 12, 2018 at 12:08 PM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Am 12.12.18 um 11:49 schrieb Daniel Vetter:
>>>>> On Fri, Dec 07, 2018 at 11:54:15PM +0800, Chunming Zhou wrote:
>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>>
>>>>>> Use the dma_fence_chain object to create a timeline of fence
>>>>>> objects instead of just replacing the existing fence.
>>>>>>
>>>>>> v2: rebase and cleanup
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>> Somewhat jumping back into this. Not sure we discussed this already
>>>>> or not. I'm a bit unclear on why we have to chain the fences in the
>>> timeline:
>>>>> - The timeline stuff is modelled after the WDDM2 monitored fences.
>>> Which
>>>>>     really are just u64 counters in memory somewhere (I think could be
>>>>>     system ram or vram). Because WDDM2 has the memory management
>>> entirely
>>>>>     separated from rendering synchronization it totally allows userspace to
>>>>>     create loops and deadlocks and everything else nasty using this - the
>>>>>     memory manager won't deadlock because these monitored fences
>>> never leak
>>>>>     into the buffer manager. And if CS deadlock, gpu reset takes care of the
>>>>>     mess.
>>>>>
>>>>> - This has a few consequences, as in they seem to indeed work like a
>>>>>     memory location: Userspace incrementing out-of-order (because they
>>> run
>>>>>     batches updating the same fence on different engines) is totally fine,
>>>>>     as is doing anything else "stupid".
>>>>>
>>>>> - Now on linux we can't allow anything, because we need to make sure
>>> that
>>>>>     deadlocks don't leak into the memory manager. But as long as we block
>>>>>     until the underlying dma_fence has materialized, nothing userspace can
>>>>>     do will lead to such a deadlock. Even if userspace ends up submitting
>>>>>     jobs without enough built-in synchronization, leading to out-of-order
>>>>>     signalling of fences on that "timeline". And I don't think that would
>>>>>     pose a problem for us.
>>>>>
>>>>> Essentially I think we can look at timeline syncobj as a dma_fence
>>>>> container indexed through an integer, and there's no need to enforce
>>>>> that the timeline works like a real dma_fence timeline, with all its
>>>>> guarantees. It's just a pile of (possibly, if userspace is stupid)
>>>>> unrelated dma_fences. You could implement the entire thing in
>>>>> userspace after all, except for the "we want to share these timeline
>>>>> objects between processes" problem.
>>>>>
>>>>> tldr; I think we can drop the dma_fence_chain complexity completely.
>>>>> Or at least I'm not really understanding why it's needed.
>>>>>
>>>>> Of course that means drivers cannot treat a drm_syncobj timeline as
>>>>> a dma_fence timeline. But given the future fences stuff and all
>>>>> that, that's already out of the window anyway.
>>>>>
>>>>> What am I missing?
>>>> Good question, since that was exactly my initial idea as well.
>>>>
>>>> Key point is that our Vulkan guys came back and said that this
>>>> wouldn't be sufficient, but I honestly don't fully understand why.
>>> Hm, sounds like we really need those testcases (vk cts on top of mesa, igt)
>>> so we can talk about the exact corner cases we care about and why.
>>>
>>> I guess one thing that might happen is that userspace leaves out a number
>>> and never sets that fence, relying on the >= semantics of the monitored
>>> fence to unblock that thread. E.g. when skipping a frame in one of the
>>> auxiliary workloads. For that case we'd need to make sure we don't just wait
>>> for the given fence to materialize, but also any fences later in the timeline.
>>>
>>> But we can't decide that without understanding the actual use-case that
>>> needs to be supported at the other end of the stack, and how all the bits in
>>> between should look like.
>>>
>>> I guess we're back to "uapi design without userspace doesn't make sense" ...
>>>
>>>> Anyway that's why David came up with using the fence array to wait for
>>>> all previously added fences, which I then later on extended into this
>>>> chain container.
>>>>
>>>> I have to admit that it is way more defensive implemented this way. E.g.
>>>> there is much fewer things userspace can do wrong.
>>>>
>>>> The principal idea is that when they mess things up they are always
>>>> going to wait more than necessary, but never less.
>>> That seems against the spirit of vulkan, which is very much about "you get all
>>> the pieces". It also might dig us a hole in the future, if we ever get around to
>>> moving towards a WDDM2 style memory management model. For future
>>> proofing I think it would make sense if we implement the minimal uapi we
>>> need for vk timelines, not the strictest guarantees we can get away with
>>> (without performance impact) with current drivers.
>>> -Daniel
>>>
>>>
>>>> Christian.
>>>>
>>>>> -Daniel
>>>>>
>>>>>> ---
>>>>>>    drivers/gpu/drm/drm_syncobj.c | 37
>>> +++++++++++++++++++++++++++++++++++
>>>>>>    include/drm/drm_syncobj.h     |  5 +++++
>>>>>>    2 files changed, 42 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/drm_syncobj.c
>>>>>> b/drivers/gpu/drm/drm_syncobj.c index e19525af0cce..51f798e2194f
>>>>>> 100644
>>>>>> --- a/drivers/gpu/drm/drm_syncobj.c
>>>>>> +++ b/drivers/gpu/drm/drm_syncobj.c
>>>>>> @@ -122,6 +122,43 @@ static void drm_syncobj_remove_wait(struct
>>> drm_syncobj *syncobj,
>>>>>>       spin_unlock(&syncobj->lock);
>>>>>>    }
>>>>>>
>>>>>> +/**
>>>>>> + * drm_syncobj_add_point - add new timeline point to the syncobj
>>>>>> + * @syncobj: sync object to add timeline point do
>>>>>> + * @chain: chain node to use to add the point
>>>>>> + * @fence: fence to encapsulate in the chain node
>>>>>> + * @point: sequence number to use for the point
>>>>>> + *
>>>>>> + * Add the chain node as new timeline point to the syncobj.
>>>>>> + */
>>>>>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
>>>>>> +                       struct dma_fence_chain *chain,
>>>>>> +                       struct dma_fence *fence,
>>>>>> +                       uint64_t point) {
>>>>>> +    struct syncobj_wait_entry *cur, *tmp;
>>>>>> +    struct dma_fence *prev;
>>>>>> +
>>>>>> +    dma_fence_get(fence);
>>>>>> +
>>>>>> +    spin_lock(&syncobj->lock);
>>>>>> +
>>>>>> +    prev = rcu_dereference_protected(syncobj->fence,
>>>>>> +                                     lockdep_is_held(&syncobj->lock));
>>>>>> +    dma_fence_chain_init(chain, prev, fence, point);
>>>>>> +    rcu_assign_pointer(syncobj->fence, &chain->base);
>>>>>> +
>>>>>> +    list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
>>>>>> +            list_del_init(&cur->node);
>>>>>> +            syncobj_wait_syncobj_func(syncobj, cur);
>>>>>> +    }
>>>>>> +    spin_unlock(&syncobj->lock);
>>>>>> +
>>>>>> +    /* Walk the chain once to trigger garbage collection */
>>>>>> +    dma_fence_chain_for_each(prev, fence); }
>>>>>> +EXPORT_SYMBOL(drm_syncobj_add_point);
>>>>>> +
>>>>>>    /**
>>>>>>     * drm_syncobj_replace_fence - replace fence in a sync object.
>>>>>>     * @syncobj: Sync object to replace fence in diff --git
>>>>>> a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h index
>>>>>> 7c6ed845c70d..8acb4ae4f311 100644
>>>>>> --- a/include/drm/drm_syncobj.h
>>>>>> +++ b/include/drm/drm_syncobj.h
>>>>>> @@ -27,6 +27,7 @@
>>>>>>    #define __DRM_SYNCOBJ_H__
>>>>>>
>>>>>>    #include "linux/dma-fence.h"
>>>>>> +#include "linux/dma-fence-chain.h"
>>>>>>
>>>>>>    /**
>>>>>>     * struct drm_syncobj - sync object.
>>>>>> @@ -110,6 +111,10 @@ drm_syncobj_fence_get(struct drm_syncobj
>>>>>> *syncobj)
>>>>>>
>>>>>>    struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
>>>>>>                                    u32 handle);
>>>>>> +void drm_syncobj_add_point(struct drm_syncobj *syncobj,
>>>>>> +                       struct dma_fence_chain *chain,
>>>>>> +                       struct dma_fence *fence,
>>>>>> +                       uint64_t point);
>>>>>>    void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
>>>>>>                              struct dma_fence *fence);
>>>>>>    int drm_syncobj_find_fence(struct drm_file *file_private,
>>>>>> --
>>>>>> 2.17.1
>>>>>>
>>>>>> _______________________________________________
>>>>>> Intel-gfx mailing list
>>>>>> Intel-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
>
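
For reference, the wait-before-signal case from the Daniel Rakos quote above looks like this from the application side. The snippet uses the timeline semaphore API roughly as it later shipped in VK_KHR_timeline_semaphore and is only meant to show the ordering problem; device, queue and timeline_sem are assumed to already exist. Thread A blocks on value 2 before any work signalling value 2 has been submitted, so the kernel has to accept a wait whose fence does not exist yet.

/* Thread A: block on a timeline value nothing has been queued to signal. */
VkSemaphoreWaitInfo wait_info = {
	.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
	.semaphoreCount = 1,
	.pSemaphores = &timeline_sem,
	.pValues = (uint64_t[]){ 2 },
};
vkWaitSemaphores(device, &wait_info, UINT64_MAX);

/* Thread B: only later submits the work that signals value 2. */
VkTimelineSemaphoreSubmitInfo timeline_info = {
	.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
	.signalSemaphoreValueCount = 1,
	.pSignalSemaphoreValues = (uint64_t[]){ 2 },
};
VkSubmitInfo submit = {
	.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
	.pNext = &timeline_info,
	.signalSemaphoreCount = 1,
	.pSignalSemaphores = &timeline_sem,
};
vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
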
diff mbox series

Patch

diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index e19525af0cce..51f798e2194f 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -122,6 +122,43 @@  static void drm_syncobj_remove_wait(struct drm_syncobj *syncobj,
 	spin_unlock(&syncobj->lock);
 }
 
+/**
+ * drm_syncobj_add_point - add new timeline point to the syncobj
+ * @syncobj: sync object to add timeline point do
+ * @chain: chain node to use to add the point
+ * @fence: fence to encapsulate in the chain node
+ * @point: sequence number to use for the point
+ *
+ * Add the chain node as new timeline point to the syncobj.
+ */
+void drm_syncobj_add_point(struct drm_syncobj *syncobj,
+			   struct dma_fence_chain *chain,
+			   struct dma_fence *fence,
+			   uint64_t point)
+{
+	struct syncobj_wait_entry *cur, *tmp;
+	struct dma_fence *prev;
+
+	dma_fence_get(fence);
+
+	spin_lock(&syncobj->lock);
+
+	prev = rcu_dereference_protected(syncobj->fence,
+					 lockdep_is_held(&syncobj->lock));
+	dma_fence_chain_init(chain, prev, fence, point);
+	rcu_assign_pointer(syncobj->fence, &chain->base);
+
+	list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) {
+		list_del_init(&cur->node);
+		syncobj_wait_syncobj_func(syncobj, cur);
+	}
+	spin_unlock(&syncobj->lock);
+
+	/* Walk the chain once to trigger garbage collection */
+	dma_fence_chain_for_each(prev, fence);
+}
+EXPORT_SYMBOL(drm_syncobj_add_point);
+
 /**
  * drm_syncobj_replace_fence - replace fence in a sync object.
  * @syncobj: Sync object to replace fence in
diff --git a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h
index 7c6ed845c70d..8acb4ae4f311 100644
--- a/include/drm/drm_syncobj.h
+++ b/include/drm/drm_syncobj.h
@@ -27,6 +27,7 @@ 
 #define __DRM_SYNCOBJ_H__
 
 #include "linux/dma-fence.h"
+#include "linux/dma-fence-chain.h"
 
 /**
  * struct drm_syncobj - sync object.
@@ -110,6 +111,10 @@  drm_syncobj_fence_get(struct drm_syncobj *syncobj)
 
 struct drm_syncobj *drm_syncobj_find(struct drm_file *file_private,
 				     u32 handle);
+void drm_syncobj_add_point(struct drm_syncobj *syncobj,
+			   struct dma_fence_chain *chain,
+			   struct dma_fence *fence,
+			   uint64_t point);
 void drm_syncobj_replace_fence(struct drm_syncobj *syncobj,
 			       struct dma_fence *fence);
 int drm_syncobj_find_fence(struct drm_file *file_private,
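
For context, a rough sketch of how a driver's command submission path is expected to use the new export; the surrounding names (example_signal_timeline_point, job_fence) are made up, only the drm_syncobj and dma_fence calls are real. The calling convention is deliberate: the caller allocates the chain node up front, so attaching the point after the job has been committed cannot fail, and drm_syncobj_add_point() takes its own reference on the fence while consuming the chain node.

/* Sketch only: signal timeline point "point" on "syncobj" with the job's fence. */
static int example_signal_timeline_point(struct drm_syncobj *syncobj,
					 struct dma_fence *job_fence,
					 u64 point)
{
	struct dma_fence_chain *chain;

	/* Allocate before committing the job so this step cannot fail later. */
	chain = kzalloc(sizeof(*chain), GFP_KERNEL);
	if (!chain)
		return -ENOMEM;

	/* The chain node now belongs to the syncobj's timeline. */
	drm_syncobj_add_point(syncobj, chain, job_fence, point);

	return 0;
}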