diff mbox series

[2/2] dma-buf: cleanup shared fence removal

Message ID 20190627101813.61430-2-christian.koenig@amd.com (mailing list archive)
State New, archived
Headers show
Series [1/2] dma-buf: cleanup reservation_object_init/fini | expand

Commit Message

Christian König June 27, 2019, 10:18 a.m. UTC
While freeing up memory it is easier to remove a fence from a reservation
object instead of signaling it and installing a new one in all other objects.

Clean this up by adding the removal function to the core reservation object
code instead of messing with the internals of amdgpu.

No functional change.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
 include/linux/reservation.h                   |  3 +-
 3 files changed, 65 insertions(+), 45 deletions(-)

Comments

Daniel Vetter June 27, 2019, 10:43 a.m. UTC | #1
On Thu, Jun 27, 2019 at 12:18 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> While freeing up memory it is easier to remove a fence from a reservation
> object instead of signaling it and installing a new one in all other objects.
>
> Clean this up by adding the removal function to the core reservation object
> code instead of messing with the internal in amdgpu.
>
> No functional change.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
>  include/linux/reservation.h                   |  3 +-
>  3 files changed, 65 insertions(+), 45 deletions(-)
>
> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> index ef710effedfa..e43a316a005d 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
>  }
>  EXPORT_SYMBOL(reservation_object_add_shared_fence);
>
> +/**
> + * reservation_object_remove_shared_fence - remove shared fences
> + * @obj: the reservation object
> + * @context: the context of the fences to remove
> + *
> + * Remove shared fences based on their fence context.
> + */

This needs a serious explanation of "why?". Removing fences without
guaranteeing they're all signaled is a good way to create havoc. Your
commit message has a few words about what you're doing here, but it
still doesn't explain why this is safe and when exactly it should be
used.

Also I'm not sure (depending upon what you use this for) this is
actually correct outside of amdgpu and its ignorance of the exclusive
fence.
-Daniel

> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
> +                                          uint64_t context)
> +{
> +       struct reservation_object_list *old, *new;
> +       unsigned int i, j, k;
> +
> +       reservation_object_assert_held(obj);
> +
> +       old = reservation_object_get_list(obj);
> +       if (!old)
> +               return 0;
> +
> +       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
> +                     GFP_KERNEL);
> +       if (!new)
> +               return -ENOMEM;
> +
> +       /* Go through all the shared fences in the resevation object and sort
> +        * the interesting ones to the end of the new list.
> +        */
> +       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
> +               struct dma_fence *f;
> +
> +               f = rcu_dereference_protected(old->shared[i],
> +                                             reservation_object_held(obj));
> +
> +               if (f->context == context)
> +                       RCU_INIT_POINTER(new->shared[--j], f);
> +               else
> +                       RCU_INIT_POINTER(new->shared[k++], f);
> +       }
> +       new->shared_max = old->shared_max;
> +       new->shared_count = k;
> +
> +       /* Install the new fence list, seqcount provides the barriers */
> +       preempt_disable();
> +       write_seqcount_begin(&obj->seq);
> +       RCU_INIT_POINTER(obj->fence, new);
> +       write_seqcount_end(&obj->seq);
> +       preempt_enable();
> +
> +       /* Drop the references to the removed fences */
> +       for (i = j, k = 0; i < old->shared_count; ++i) {
> +               struct dma_fence *f;
> +
> +               f = rcu_dereference_protected(new->shared[i],
> +                                             reservation_object_held(obj));
> +               dma_fence_put(f);
> +       }
> +       kfree_rcu(old, rcu);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL(reservation_object_remove_shared_fence);
> +
>  /**
>   * reservation_object_add_excl_fence - Add an exclusive fence.
>   * @obj: the reservation object
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 10abae398e51..9b25ad3d003e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -224,50 +224,7 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
>         if (!ef)
>                 return -EINVAL;
>
> -       old = reservation_object_get_list(resv);
> -       if (!old)
> -               return 0;
> -
> -       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
> -                     GFP_KERNEL);
> -       if (!new)
> -               return -ENOMEM;
> -
> -       /* Go through all the shared fences in the resevation object and sort
> -        * the interesting ones to the end of the list.
> -        */
> -       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
> -               struct dma_fence *f;
> -
> -               f = rcu_dereference_protected(old->shared[i],
> -                                             reservation_object_held(resv));
> -
> -               if (f->context == ef->base.context)
> -                       RCU_INIT_POINTER(new->shared[--j], f);
> -               else
> -                       RCU_INIT_POINTER(new->shared[k++], f);
> -       }
> -       new->shared_max = old->shared_max;
> -       new->shared_count = k;
> -
> -       /* Install the new fence list, seqcount provides the barriers */
> -       preempt_disable();
> -       write_seqcount_begin(&resv->seq);
> -       RCU_INIT_POINTER(resv->fence, new);
> -       write_seqcount_end(&resv->seq);
> -       preempt_enable();
> -
> -       /* Drop the references to the removed fences or move them to ef_list */
> -       for (i = j, k = 0; i < old->shared_count; ++i) {
> -               struct dma_fence *f;
> -
> -               f = rcu_dereference_protected(new->shared[i],
> -                                             reservation_object_held(resv));
> -               dma_fence_put(f);
> -       }
> -       kfree_rcu(old, rcu);
> -
> -       return 0;
> +       return reservation_object_remove_shared_fence(resv, ef->base.context);
>  }
>
>  static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
> diff --git a/include/linux/reservation.h b/include/linux/reservation.h
> index f47e8196d039..1c833a56b678 100644
> --- a/include/linux/reservation.h
> +++ b/include/linux/reservation.h
> @@ -229,7 +229,8 @@ int reservation_object_reserve_shared(struct reservation_object *obj,
>                                       unsigned int num_fences);
>  void reservation_object_add_shared_fence(struct reservation_object *obj,
>                                          struct dma_fence *fence);
> -
> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
> +                                          uint64_t context);
>  void reservation_object_add_excl_fence(struct reservation_object *obj,
>                                        struct dma_fence *fence);
>
> --
> 2.17.1
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Christian König June 27, 2019, 1:19 p.m. UTC | #2
Am 27.06.19 um 12:43 schrieb Daniel Vetter:
> On Thu, Jun 27, 2019 at 12:18 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> While freeing up memory it is easier to remove a fence from a reservation
>> object instead of signaling it and installing a new one in all other objects.
>>
>> Clean this up by adding the removal function to the core reservation object
>> code instead of messing with the internal in amdgpu.
>>
>> No functional change.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
>>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
>>   include/linux/reservation.h                   |  3 +-
>>   3 files changed, 65 insertions(+), 45 deletions(-)
>>
>> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
>> index ef710effedfa..e43a316a005d 100644
>> --- a/drivers/dma-buf/reservation.c
>> +++ b/drivers/dma-buf/reservation.c
>> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
>>   }
>>   EXPORT_SYMBOL(reservation_object_add_shared_fence);
>>
>> +/**
>> + * reservation_object_remove_shared_fence - remove shared fences
>> + * @obj: the reservation object
>> + * @context: the context of the fences to remove
>> + *
>> + * Remove shared fences based on their fence context.
>> + */
> This needs a serious explanation of "why?". Removing fences without
> guaranteeing they're all signaled is a good way to create havoc. Your
> commit message has a few words about what you're doing here, but it
> still doesn't explain why this is safe and when exactly it should be
> used.

Yeah, I'm not very keen about this either.

The key point is the memory is not accessible by the hardware any more 
because it is freed and removed from the page tables.

So further access is prevented and in this special case it is actually 
valid to do this even if the operation represented by the fence is still 
ongoing.

How should we word this? Something like:

  * Remove shared fences based on their fence context. Removing a fence 
before it is signaled is only valid if hardware access is prevented by 
some other means like IOMMU or similar virtual memory protection.

> Also I'm not sure (depending upon what you use this for) this is
> actually correct outside of amdgpu and it's ignorance of the exclusive
> fence.

No, that is completely unrelated. But I thought that I clean this up 
before I start to tackle the exclusive fence issue.

Christian.

> -Daniel
>
>> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
>> +                                          uint64_t context)
>> +{
>> +       struct reservation_object_list *old, *new;
>> +       unsigned int i, j, k;
>> +
>> +       reservation_object_assert_held(obj);
>> +
>> +       old = reservation_object_get_list(obj);
>> +       if (!old)
>> +               return 0;
>> +
>> +       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
>> +                     GFP_KERNEL);
>> +       if (!new)
>> +               return -ENOMEM;
>> +
>> +       /* Go through all the shared fences in the resevation object and sort
>> +        * the interesting ones to the end of the new list.
>> +        */
>> +       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
>> +               struct dma_fence *f;
>> +
>> +               f = rcu_dereference_protected(old->shared[i],
>> +                                             reservation_object_held(obj));
>> +
>> +               if (f->context == context)
>> +                       RCU_INIT_POINTER(new->shared[--j], f);
>> +               else
>> +                       RCU_INIT_POINTER(new->shared[k++], f);
>> +       }
>> +       new->shared_max = old->shared_max;
>> +       new->shared_count = k;
>> +
>> +       /* Install the new fence list, seqcount provides the barriers */
>> +       preempt_disable();
>> +       write_seqcount_begin(&obj->seq);
>> +       RCU_INIT_POINTER(obj->fence, new);
>> +       write_seqcount_end(&obj->seq);
>> +       preempt_enable();
>> +
>> +       /* Drop the references to the removed fences */
>> +       for (i = j, k = 0; i < old->shared_count; ++i) {
>> +               struct dma_fence *f;
>> +
>> +               f = rcu_dereference_protected(new->shared[i],
>> +                                             reservation_object_held(obj));
>> +               dma_fence_put(f);
>> +       }
>> +       kfree_rcu(old, rcu);
>> +
>> +       return 0;
>> +}
>> +EXPORT_SYMBOL(reservation_object_remove_shared_fence);
>> +
>>   /**
>>    * reservation_object_add_excl_fence - Add an exclusive fence.
>>    * @obj: the reservation object
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 10abae398e51..9b25ad3d003e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -224,50 +224,7 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
>>          if (!ef)
>>                  return -EINVAL;
>>
>> -       old = reservation_object_get_list(resv);
>> -       if (!old)
>> -               return 0;
>> -
>> -       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
>> -                     GFP_KERNEL);
>> -       if (!new)
>> -               return -ENOMEM;
>> -
>> -       /* Go through all the shared fences in the resevation object and sort
>> -        * the interesting ones to the end of the list.
>> -        */
>> -       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
>> -               struct dma_fence *f;
>> -
>> -               f = rcu_dereference_protected(old->shared[i],
>> -                                             reservation_object_held(resv));
>> -
>> -               if (f->context == ef->base.context)
>> -                       RCU_INIT_POINTER(new->shared[--j], f);
>> -               else
>> -                       RCU_INIT_POINTER(new->shared[k++], f);
>> -       }
>> -       new->shared_max = old->shared_max;
>> -       new->shared_count = k;
>> -
>> -       /* Install the new fence list, seqcount provides the barriers */
>> -       preempt_disable();
>> -       write_seqcount_begin(&resv->seq);
>> -       RCU_INIT_POINTER(resv->fence, new);
>> -       write_seqcount_end(&resv->seq);
>> -       preempt_enable();
>> -
>> -       /* Drop the references to the removed fences or move them to ef_list */
>> -       for (i = j, k = 0; i < old->shared_count; ++i) {
>> -               struct dma_fence *f;
>> -
>> -               f = rcu_dereference_protected(new->shared[i],
>> -                                             reservation_object_held(resv));
>> -               dma_fence_put(f);
>> -       }
>> -       kfree_rcu(old, rcu);
>> -
>> -       return 0;
>> +       return reservation_object_remove_shared_fence(resv, ef->base.context);
>>   }
>>
>>   static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
>> diff --git a/include/linux/reservation.h b/include/linux/reservation.h
>> index f47e8196d039..1c833a56b678 100644
>> --- a/include/linux/reservation.h
>> +++ b/include/linux/reservation.h
>> @@ -229,7 +229,8 @@ int reservation_object_reserve_shared(struct reservation_object *obj,
>>                                        unsigned int num_fences);
>>   void reservation_object_add_shared_fence(struct reservation_object *obj,
>>                                           struct dma_fence *fence);
>> -
>> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
>> +                                          uint64_t context);
>>   void reservation_object_add_excl_fence(struct reservation_object *obj,
>>                                         struct dma_fence *fence);
>>
>> --
>> 2.17.1
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>
Daniel Vetter June 27, 2019, 3:34 p.m. UTC | #3
On Thu, Jun 27, 2019 at 3:19 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 27.06.19 um 12:43 schrieb Daniel Vetter:
> > On Thu, Jun 27, 2019 at 12:18 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> While freeing up memory it is easier to remove a fence from a reservation
> >> object instead of signaling it and installing a new one in all other objects.
> >>
> >> Clean this up by adding the removal function to the core reservation object
> >> code instead of messing with the internal in amdgpu.
> >>
> >> No functional change.
> >>
> >> Signed-off-by: Christian König <christian.koenig@amd.com>
> >> ---
> >>   drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
> >>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
> >>   include/linux/reservation.h                   |  3 +-
> >>   3 files changed, 65 insertions(+), 45 deletions(-)
> >>
> >> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> >> index ef710effedfa..e43a316a005d 100644
> >> --- a/drivers/dma-buf/reservation.c
> >> +++ b/drivers/dma-buf/reservation.c
> >> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
> >>   }
> >>   EXPORT_SYMBOL(reservation_object_add_shared_fence);
> >>
> >> +/**
> >> + * reservation_object_remove_shared_fence - remove shared fences
> >> + * @obj: the reservation object
> >> + * @context: the context of the fences to remove
> >> + *
> >> + * Remove shared fences based on their fence context.
> >> + */
> > This needs a serious explanation of "why?". Removing fences without
> > guaranteeing they're all signaled is a good way to create havoc. Your
> > commit message has a few words about what you're doing here, but it
> > still doesn't explain why this is safe and when exactly it should be
> > used.
>
> Yeah, I'm not very keen about this either.
>
> The key point is the memory is not accessible by the hardware any more
> because it is freed and removed from the page tables.
>
> So further access is prevented and in this special case it is actually
> valid to do this even if the operation represented by the fence is still
> ongoing.

Hm, why do you have to remove these fences then? Can't you just let
them signal and get collected as usual? As soon as you share buffers
these fences can get anywhere, so you need to correctly unwind them no
matter what.

You're kinda still just describing what you're doing, not why.

> How should we word this? Something like:
>
>   * Remove shared fences based on their fence context. Removing a fence
> before it is signaled is only valid if hardware access is prevented by
> some other means like IOMMU or similar virtual memory protection.

The part that freaks me out is whether we break the fence chain
anywhere by removing fences. But I guess if you're only removing the
shared fences that should be fine ...

> > Also I'm not sure (depending upon what you use this for) this is
> > actually correct outside of amdgpu and it's ignorance of the exclusive
> > fence.
>
> No, that is completely unrelated. But I thought that I clean this up
> before I start to tackle the exclusive fence issue.

... except amdgpu has internally a very strange idea of shared fences
with auto-exclusive promotion. And I think removing exclusive fences
will break the fence dependencies of other (earlier) dma drivers by
other operations. Or is that why you filter on the context,
essentially amd's way of saying "remove all shared fences for this
thing, keep only the exclusive ones".

I guess I need to read what this actually does in the code, since I
still have no idea why you need to do this.
-Daniel

> Christian.
>
> > -Daniel
> >
> >> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
> >> +                                          uint64_t context)
> >> +{
> >> +       struct reservation_object_list *old, *new;
> >> +       unsigned int i, j, k;
> >> +
> >> +       reservation_object_assert_held(obj);
> >> +
> >> +       old = reservation_object_get_list(obj);
> >> +       if (!old)
> >> +               return 0;
> >> +
> >> +       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
> >> +                     GFP_KERNEL);
> >> +       if (!new)
> >> +               return -ENOMEM;
> >> +
> >> +       /* Go through all the shared fences in the resevation object and sort
> >> +        * the interesting ones to the end of the new list.
> >> +        */
> >> +       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
> >> +               struct dma_fence *f;
> >> +
> >> +               f = rcu_dereference_protected(old->shared[i],
> >> +                                             reservation_object_held(obj));
> >> +
> >> +               if (f->context == context)
> >> +                       RCU_INIT_POINTER(new->shared[--j], f);
> >> +               else
> >> +                       RCU_INIT_POINTER(new->shared[k++], f);
> >> +       }
> >> +       new->shared_max = old->shared_max;
> >> +       new->shared_count = k;
> >> +
> >> +       /* Install the new fence list, seqcount provides the barriers */
> >> +       preempt_disable();
> >> +       write_seqcount_begin(&obj->seq);
> >> +       RCU_INIT_POINTER(obj->fence, new);
> >> +       write_seqcount_end(&obj->seq);
> >> +       preempt_enable();
> >> +
> >> +       /* Drop the references to the removed fences */
> >> +       for (i = j, k = 0; i < old->shared_count; ++i) {
> >> +               struct dma_fence *f;
> >> +
> >> +               f = rcu_dereference_protected(new->shared[i],
> >> +                                             reservation_object_held(obj));
> >> +               dma_fence_put(f);
> >> +       }
> >> +       kfree_rcu(old, rcu);
> >> +
> >> +       return 0;
> >> +}
> >> +EXPORT_SYMBOL(reservation_object_remove_shared_fence);
> >> +
> >>   /**
> >>    * reservation_object_add_excl_fence - Add an exclusive fence.
> >>    * @obj: the reservation object
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >> index 10abae398e51..9b25ad3d003e 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >> @@ -224,50 +224,7 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
> >>          if (!ef)
> >>                  return -EINVAL;
> >>
> >> -       old = reservation_object_get_list(resv);
> >> -       if (!old)
> >> -               return 0;
> >> -
> >> -       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
> >> -                     GFP_KERNEL);
> >> -       if (!new)
> >> -               return -ENOMEM;
> >> -
> >> -       /* Go through all the shared fences in the resevation object and sort
> >> -        * the interesting ones to the end of the list.
> >> -        */
> >> -       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
> >> -               struct dma_fence *f;
> >> -
> >> -               f = rcu_dereference_protected(old->shared[i],
> >> -                                             reservation_object_held(resv));
> >> -
> >> -               if (f->context == ef->base.context)
> >> -                       RCU_INIT_POINTER(new->shared[--j], f);
> >> -               else
> >> -                       RCU_INIT_POINTER(new->shared[k++], f);
> >> -       }
> >> -       new->shared_max = old->shared_max;
> >> -       new->shared_count = k;
> >> -
> >> -       /* Install the new fence list, seqcount provides the barriers */
> >> -       preempt_disable();
> >> -       write_seqcount_begin(&resv->seq);
> >> -       RCU_INIT_POINTER(resv->fence, new);
> >> -       write_seqcount_end(&resv->seq);
> >> -       preempt_enable();
> >> -
> >> -       /* Drop the references to the removed fences or move them to ef_list */
> >> -       for (i = j, k = 0; i < old->shared_count; ++i) {
> >> -               struct dma_fence *f;
> >> -
> >> -               f = rcu_dereference_protected(new->shared[i],
> >> -                                             reservation_object_held(resv));
> >> -               dma_fence_put(f);
> >> -       }
> >> -       kfree_rcu(old, rcu);
> >> -
> >> -       return 0;
> >> +       return reservation_object_remove_shared_fence(resv, ef->base.context);
> >>   }
> >>
> >>   static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
> >> diff --git a/include/linux/reservation.h b/include/linux/reservation.h
> >> index f47e8196d039..1c833a56b678 100644
> >> --- a/include/linux/reservation.h
> >> +++ b/include/linux/reservation.h
> >> @@ -229,7 +229,8 @@ int reservation_object_reserve_shared(struct reservation_object *obj,
> >>                                        unsigned int num_fences);
> >>   void reservation_object_add_shared_fence(struct reservation_object *obj,
> >>                                           struct dma_fence *fence);
> >> -
> >> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
> >> +                                          uint64_t context);
> >>   void reservation_object_add_excl_fence(struct reservation_object *obj,
> >>                                         struct dma_fence *fence);
> >>
> >> --
> >> 2.17.1
> >>
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
> >
>
Christian König June 27, 2019, 3:48 p.m. UTC | #4
Am 27.06.19 um 17:34 schrieb Daniel Vetter:
> On Thu, Jun 27, 2019 at 3:19 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 27.06.19 um 12:43 schrieb Daniel Vetter:
>>> On Thu, Jun 27, 2019 at 12:18 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> While freeing up memory it is easier to remove a fence from a reservation
>>>> object instead of signaling it and installing a new one in all other objects.
>>>>
>>>> Clean this up by adding the removal function to the core reservation object
>>>> code instead of messing with the internal in amdgpu.
>>>>
>>>> No functional change.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>    drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
>>>>    .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
>>>>    include/linux/reservation.h                   |  3 +-
>>>>    3 files changed, 65 insertions(+), 45 deletions(-)
>>>>
>>>> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
>>>> index ef710effedfa..e43a316a005d 100644
>>>> --- a/drivers/dma-buf/reservation.c
>>>> +++ b/drivers/dma-buf/reservation.c
>>>> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
>>>>    }
>>>>    EXPORT_SYMBOL(reservation_object_add_shared_fence);
>>>>
>>>> +/**
>>>> + * reservation_object_remove_shared_fence - remove shared fences
>>>> + * @obj: the reservation object
>>>> + * @context: the context of the fences to remove
>>>> + *
>>>> + * Remove shared fences based on their fence context.
>>>> + */
>>> This needs a serious explanation of "why?". Removing fences without
>>> guaranteeing they're all signaled is a good way to create havoc. Your
>>> commit message has a few words about what you're doing here, but it
>>> still doesn't explain why this is safe and when exactly it should be
>>> used.
>> Yeah, I'm not very keen about this either.
>>
>> The key point is the memory is not accessible by the hardware any more
>> because it is freed and removed from the page tables.
>>
>> So further access is prevented and in this special case it is actually
>> valid to do this even if the operation represented by the fence is still
>> ongoing.
> Hm, why do you have to remove these fences then? Can't you just let
> them signal and get collected as usual? As soon as you share buffers
> these fences can get anywhere, so you need to correctly unwind them no
> matter what.
>
> You're kinda still just describing what you're doing, not why.

It is simply more efficient to remove the fence from one reservation 
object than to add a new fence to all other reservation objects in the 
same process.

It's just an optimization,
Christian.

>
>> How should we word this? Something like:
>>
>>    * Remove shared fences based on their fence context. Removing a fence
>> before it is signaled is only valid if hardware access is prevented by
>> some other means like IOMMU or similar virtual memory protection.
> The part that freaks me out is whether we break the fence chaing
> anywhere by removing fences. But I guess if you're only removing the
> shared fences that shoudl be fine ...
>
>>> Also I'm not sure (depending upon what you use this for) this is
>>> actually correct outside of amdgpu and it's ignorance of the exclusive
>>> fence.
>> No, that is completely unrelated. But I thought that I clean this up
>> before I start to tackle the exclusive fence issue.
> ... except amdgpu has internally a very strange idea of shared fences
> with auto-exclusive promotion. And I think removing exclusive fences
> will break the fence dependencies of other (earlier) dma drivers by
> other operations. Or is that why you filter on the context,
> essentially amd's way of saying "remove all shared fences for this
> thing, keep only the exclusive ones".
>
> I guess I need to read what this actually does in the code, since I
> still have no idea why you need to do this.
> -Daniel
>
>> Christian.
>>
>>> -Daniel
>>>
>>>> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
>>>> +                                          uint64_t context)
>>>> +{
>>>> +       struct reservation_object_list *old, *new;
>>>> +       unsigned int i, j, k;
>>>> +
>>>> +       reservation_object_assert_held(obj);
>>>> +
>>>> +       old = reservation_object_get_list(obj);
>>>> +       if (!old)
>>>> +               return 0;
>>>> +
>>>> +       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
>>>> +                     GFP_KERNEL);
>>>> +       if (!new)
>>>> +               return -ENOMEM;
>>>> +
>>>> +       /* Go through all the shared fences in the resevation object and sort
>>>> +        * the interesting ones to the end of the new list.
>>>> +        */
>>>> +       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
>>>> +               struct dma_fence *f;
>>>> +
>>>> +               f = rcu_dereference_protected(old->shared[i],
>>>> +                                             reservation_object_held(obj));
>>>> +
>>>> +               if (f->context == context)
>>>> +                       RCU_INIT_POINTER(new->shared[--j], f);
>>>> +               else
>>>> +                       RCU_INIT_POINTER(new->shared[k++], f);
>>>> +       }
>>>> +       new->shared_max = old->shared_max;
>>>> +       new->shared_count = k;
>>>> +
>>>> +       /* Install the new fence list, seqcount provides the barriers */
>>>> +       preempt_disable();
>>>> +       write_seqcount_begin(&obj->seq);
>>>> +       RCU_INIT_POINTER(obj->fence, new);
>>>> +       write_seqcount_end(&obj->seq);
>>>> +       preempt_enable();
>>>> +
>>>> +       /* Drop the references to the removed fences */
>>>> +       for (i = j, k = 0; i < old->shared_count; ++i) {
>>>> +               struct dma_fence *f;
>>>> +
>>>> +               f = rcu_dereference_protected(new->shared[i],
>>>> +                                             reservation_object_held(obj));
>>>> +               dma_fence_put(f);
>>>> +       }
>>>> +       kfree_rcu(old, rcu);
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +EXPORT_SYMBOL(reservation_object_remove_shared_fence);
>>>> +
>>>>    /**
>>>>     * reservation_object_add_excl_fence - Add an exclusive fence.
>>>>     * @obj: the reservation object
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> index 10abae398e51..9b25ad3d003e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>> @@ -224,50 +224,7 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
>>>>           if (!ef)
>>>>                   return -EINVAL;
>>>>
>>>> -       old = reservation_object_get_list(resv);
>>>> -       if (!old)
>>>> -               return 0;
>>>> -
>>>> -       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
>>>> -                     GFP_KERNEL);
>>>> -       if (!new)
>>>> -               return -ENOMEM;
>>>> -
>>>> -       /* Go through all the shared fences in the resevation object and sort
>>>> -        * the interesting ones to the end of the list.
>>>> -        */
>>>> -       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
>>>> -               struct dma_fence *f;
>>>> -
>>>> -               f = rcu_dereference_protected(old->shared[i],
>>>> -                                             reservation_object_held(resv));
>>>> -
>>>> -               if (f->context == ef->base.context)
>>>> -                       RCU_INIT_POINTER(new->shared[--j], f);
>>>> -               else
>>>> -                       RCU_INIT_POINTER(new->shared[k++], f);
>>>> -       }
>>>> -       new->shared_max = old->shared_max;
>>>> -       new->shared_count = k;
>>>> -
>>>> -       /* Install the new fence list, seqcount provides the barriers */
>>>> -       preempt_disable();
>>>> -       write_seqcount_begin(&resv->seq);
>>>> -       RCU_INIT_POINTER(resv->fence, new);
>>>> -       write_seqcount_end(&resv->seq);
>>>> -       preempt_enable();
>>>> -
>>>> -       /* Drop the references to the removed fences or move them to ef_list */
>>>> -       for (i = j, k = 0; i < old->shared_count; ++i) {
>>>> -               struct dma_fence *f;
>>>> -
>>>> -               f = rcu_dereference_protected(new->shared[i],
>>>> -                                             reservation_object_held(resv));
>>>> -               dma_fence_put(f);
>>>> -       }
>>>> -       kfree_rcu(old, rcu);
>>>> -
>>>> -       return 0;
>>>> +       return reservation_object_remove_shared_fence(resv, ef->base.context);
>>>>    }
>>>>
>>>>    static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
>>>> diff --git a/include/linux/reservation.h b/include/linux/reservation.h
>>>> index f47e8196d039..1c833a56b678 100644
>>>> --- a/include/linux/reservation.h
>>>> +++ b/include/linux/reservation.h
>>>> @@ -229,7 +229,8 @@ int reservation_object_reserve_shared(struct reservation_object *obj,
>>>>                                         unsigned int num_fences);
>>>>    void reservation_object_add_shared_fence(struct reservation_object *obj,
>>>>                                            struct dma_fence *fence);
>>>> -
>>>> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
>>>> +                                          uint64_t context);
>>>>    void reservation_object_add_excl_fence(struct reservation_object *obj,
>>>>                                          struct dma_fence *fence);
>>>>
>>>> --
>>>> 2.17.1
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>
Daniel Vetter June 27, 2019, 5:10 p.m. UTC | #5
On Thu, Jun 27, 2019 at 03:48:06PM +0000, Koenig, Christian wrote:
> Am 27.06.19 um 17:34 schrieb Daniel Vetter:
> > On Thu, Jun 27, 2019 at 3:19 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 27.06.19 um 12:43 schrieb Daniel Vetter:
> >>> On Thu, Jun 27, 2019 at 12:18 PM Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> While freeing up memory it is easier to remove a fence from a reservation
> >>>> object instead of signaling it and installing a new one in all other objects.
> >>>>
> >>>> Clean this up by adding the removal function to the core reservation object
> >>>> code instead of messing with the internal in amdgpu.
> >>>>
> >>>> No functional change.
> >>>>
> >>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>> ---
> >>>>    drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
> >>>>    .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
> >>>>    include/linux/reservation.h                   |  3 +-
> >>>>    3 files changed, 65 insertions(+), 45 deletions(-)
> >>>>
> >>>> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> >>>> index ef710effedfa..e43a316a005d 100644
> >>>> --- a/drivers/dma-buf/reservation.c
> >>>> +++ b/drivers/dma-buf/reservation.c
> >>>> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
> >>>>    }
> >>>>    EXPORT_SYMBOL(reservation_object_add_shared_fence);
> >>>>
> >>>> +/**
> >>>> + * reservation_object_remove_shared_fence - remove shared fences
> >>>> + * @obj: the reservation object
> >>>> + * @context: the context of the fences to remove
> >>>> + *
> >>>> + * Remove shared fences based on their fence context.
> >>>> + */
> >>> This needs a serious explanation of "why?". Removing fences without
> >>> guaranteeing they're all signaled is a good way to create havoc. Your
> >>> commit message has a few words about what you're doing here, but it
> >>> still doesn't explain why this is safe and when exactly it should be
> >>> used.
> >> Yeah, I'm not very keen about this either.
> >>
> >> The key point is the memory is not accessible by the hardware any more
> >> because it is freed and removed from the page tables.
> >>
> >> So further access is prevented and in this special case it is actually
> >> valid to do this even if the operation represented by the fence is still
> >> ongoing.
> > Hm, why do you have to remove these fences then? Can't you just let
> > them signal and get collected as usual? As soon as you share buffers
> > these fences can get anywhere, so you need to correctly unwind them no
> > matter what.
> >
> > You're kinda still just describing what you're doing, not why.
> 
> It is simply more efficient to remove the fence from one reservation 
> object than to add a new fence to all other reservation objects in the 
> same process.

Ok, you still talk in riddles and don't explain what the goal of this
entire dance is, so I went and read the code. Assuming I didn't misread
too much:

1. you create a fake fence on a per-process timeline.
2. you attach this liberally to all the bo you're creating on that
process
3. the fence never signals on its own, but it has a very magic
->enable_signaling callback which is the only thing that makes this fence
switch to signalled in a finite time. Before that it's stuck forever. So
quite a bit of a Schrödinger fence: it's not a real fence (because it fails
to signal in bounded time) except when you start to look at it.
4. Looking at the fence triggers eviction, at that point we replace this
magic eviction fence with the next set, reacquire buffers and then unblock
the kfd process once everything is in shape again.

This is soooooooooooooooooo magic that I really don't think we should
encourage people without clue to maybe use this and totally break all
fences guarantees.

If you do want to make sure an optimized version within
reservation_object.c, then it should be code which replaces fences iff:
- they're the same context
- later in the ordering within that context
- of the same type (i.e. safe vs shared)

That would actually be safe thing to do.

Also, the above is what I expected when asking "why do you need this", not
"we replace fences, its more efficient" I kinda got that from the code :-)

Plus reading this now with (at least the believe of) understanding what
you're doing, replacing the fences and reattaching the next one of these
magic fences doesn't feel like it's actually making anything faster. Just
more obscure ...

Looking at reservation_object_add_shared_fence it seems to dtrt already.
-Daniel

> >> How should we word this? Something like:
> >>
> >>    * Remove shared fences based on their fence context. Removing a fence
> >> before it is signaled is only valid if hardware access is prevented by
> >> some other means like IOMMU or similar virtual memory protection.
> > The part that freaks me out is whether we break the fence chain
> > anywhere by removing fences. But I guess if you're only removing the
> > shared fences that should be fine ...
> >
> >>> Also I'm not sure (depending upon what you use this for) this is
> >>> actually correct outside of amdgpu and it's ignorance of the exclusive
> >>> fence.
> >> No, that is completely unrelated. But I thought that I clean this up
> >> before I start to tackle the exclusive fence issue.
> > ... except amdgpu has internally a very strange idea of shared fences
> > with auto-exclusive promotion. And I think removing exclusive fences
> > will break the fence dependencies of other (earlier) dma drivers by
> > other operations. Or is that why you filter on the context,
> > essentially amd's way of saying "remove all shared fences for this
> > thing, keep only the exclusive ones".
> >
> > I guess I need to read what this actually does in the code, since I
> > still have no idea why you need to do this.
> > -Daniel
> >
> >> Christian.
> >>
> >>> -Daniel
> >>>
> >>>> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
> >>>> +                                          uint64_t context)
> >>>> +{
> >>>> +       struct reservation_object_list *old, *new;
> >>>> +       unsigned int i, j, k;
> >>>> +
> >>>> +       reservation_object_assert_held(obj);
> >>>> +
> >>>> +       old = reservation_object_get_list(obj);
> >>>> +       if (!old)
> >>>> +               return 0;
> >>>> +
> >>>> +       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
> >>>> +                     GFP_KERNEL);
> >>>> +       if (!new)
> >>>> +               return -ENOMEM;
> >>>> +
> >>>> +       /* Go through all the shared fences in the resevation object and sort
> >>>> +        * the interesting ones to the end of the new list.
> >>>> +        */
> >>>> +       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
> >>>> +               struct dma_fence *f;
> >>>> +
> >>>> +               f = rcu_dereference_protected(old->shared[i],
> >>>> +                                             reservation_object_held(obj));
> >>>> +
> >>>> +               if (f->context == context)
> >>>> +                       RCU_INIT_POINTER(new->shared[--j], f);
> >>>> +               else
> >>>> +                       RCU_INIT_POINTER(new->shared[k++], f);
> >>>> +       }
> >>>> +       new->shared_max = old->shared_max;
> >>>> +       new->shared_count = k;
> >>>> +
> >>>> +       /* Install the new fence list, seqcount provides the barriers */
> >>>> +       preempt_disable();
> >>>> +       write_seqcount_begin(&obj->seq);
> >>>> +       RCU_INIT_POINTER(obj->fence, new);
> >>>> +       write_seqcount_end(&obj->seq);
> >>>> +       preempt_enable();
> >>>> +
> >>>> +       /* Drop the references to the removed fences */
> >>>> +       for (i = j, k = 0; i < old->shared_count; ++i) {
> >>>> +               struct dma_fence *f;
> >>>> +
> >>>> +               f = rcu_dereference_protected(new->shared[i],
> >>>> +                                             reservation_object_held(obj));
> >>>> +               dma_fence_put(f);
> >>>> +       }
> >>>> +       kfree_rcu(old, rcu);
> >>>> +
> >>>> +       return 0;
> >>>> +}
> >>>> +EXPORT_SYMBOL(reservation_object_remove_shared_fence);
> >>>> +
> >>>>    /**
> >>>>     * reservation_object_add_excl_fence - Add an exclusive fence.
> >>>>     * @obj: the reservation object
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>>> index 10abae398e51..9b25ad3d003e 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>>> @@ -224,50 +224,7 @@ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
> >>>>           if (!ef)
> >>>>                   return -EINVAL;
> >>>>
> >>>> -       old = reservation_object_get_list(resv);
> >>>> -       if (!old)
> >>>> -               return 0;
> >>>> -
> >>>> -       new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
> >>>> -                     GFP_KERNEL);
> >>>> -       if (!new)
> >>>> -               return -ENOMEM;
> >>>> -
> >>>> -       /* Go through all the shared fences in the resevation object and sort
> >>>> -        * the interesting ones to the end of the list.
> >>>> -        */
> >>>> -       for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
> >>>> -               struct dma_fence *f;
> >>>> -
> >>>> -               f = rcu_dereference_protected(old->shared[i],
> >>>> -                                             reservation_object_held(resv));
> >>>> -
> >>>> -               if (f->context == ef->base.context)
> >>>> -                       RCU_INIT_POINTER(new->shared[--j], f);
> >>>> -               else
> >>>> -                       RCU_INIT_POINTER(new->shared[k++], f);
> >>>> -       }
> >>>> -       new->shared_max = old->shared_max;
> >>>> -       new->shared_count = k;
> >>>> -
> >>>> -       /* Install the new fence list, seqcount provides the barriers */
> >>>> -       preempt_disable();
> >>>> -       write_seqcount_begin(&resv->seq);
> >>>> -       RCU_INIT_POINTER(resv->fence, new);
> >>>> -       write_seqcount_end(&resv->seq);
> >>>> -       preempt_enable();
> >>>> -
> >>>> -       /* Drop the references to the removed fences or move them to ef_list */
> >>>> -       for (i = j, k = 0; i < old->shared_count; ++i) {
> >>>> -               struct dma_fence *f;
> >>>> -
> >>>> -               f = rcu_dereference_protected(new->shared[i],
> >>>> -                                             reservation_object_held(resv));
> >>>> -               dma_fence_put(f);
> >>>> -       }
> >>>> -       kfree_rcu(old, rcu);
> >>>> -
> >>>> -       return 0;
> >>>> +       return reservation_object_remove_shared_fence(resv, ef->base.context);
> >>>>    }
> >>>>
> >>>>    static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
> >>>> diff --git a/include/linux/reservation.h b/include/linux/reservation.h
> >>>> index f47e8196d039..1c833a56b678 100644
> >>>> --- a/include/linux/reservation.h
> >>>> +++ b/include/linux/reservation.h
> >>>> @@ -229,7 +229,8 @@ int reservation_object_reserve_shared(struct reservation_object *obj,
> >>>>                                         unsigned int num_fences);
> >>>>    void reservation_object_add_shared_fence(struct reservation_object *obj,
> >>>>                                            struct dma_fence *fence);
> >>>> -
> >>>> +int reservation_object_remove_shared_fence(struct reservation_object *obj,
> >>>> +                                          uint64_t context);
> >>>>    void reservation_object_add_excl_fence(struct reservation_object *obj,
> >>>>                                          struct dma_fence *fence);
> >>>>
> >>>> --
> >>>> 2.17.1
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>>
> >
>
Christian König June 27, 2019, 5:20 p.m. UTC | #6
Am 27.06.19 um 19:10 schrieb Daniel Vetter:
> On Thu, Jun 27, 2019 at 03:48:06PM +0000, Koenig, Christian wrote:
>> Am 27.06.19 um 17:34 schrieb Daniel Vetter:
>>> On Thu, Jun 27, 2019 at 3:19 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> Am 27.06.19 um 12:43 schrieb Daniel Vetter:
>>>>> On Thu, Jun 27, 2019 at 12:18 PM Christian König
>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>> While freeing up memory it is easier to remove a fence from a reservation
>>>>>> object instead of signaling it and installing a new one in all other objects.
>>>>>>
>>>>>> Clean this up by adding the removal function to the core reservation object
>>>>>> code instead of messing with the internal in amdgpu.
>>>>>>
>>>>>> No functional change.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> ---
>>>>>>     drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
>>>>>>     .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
>>>>>>     include/linux/reservation.h                   |  3 +-
>>>>>>     3 files changed, 65 insertions(+), 45 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
>>>>>> index ef710effedfa..e43a316a005d 100644
>>>>>> --- a/drivers/dma-buf/reservation.c
>>>>>> +++ b/drivers/dma-buf/reservation.c
>>>>>> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
>>>>>>     }
>>>>>>     EXPORT_SYMBOL(reservation_object_add_shared_fence);
>>>>>>
>>>>>> +/**
>>>>>> + * reservation_object_remove_shared_fence - remove shared fences
>>>>>> + * @obj: the reservation object
>>>>>> + * @context: the context of the fences to remove
>>>>>> + *
>>>>>> + * Remove shared fences based on their fence context.
>>>>>> + */
>>>>> This needs a serious explanation of "why?". Removing fences without
>>>>> guaranteeing they're all signaled is a good way to create havoc. Your
>>>>> commit message has a few words about what you're doing here, but it
>>>>> still doesn't explain why this is safe and when exactly it should be
>>>>> used.
>>>> Yeah, I'm not very keen about this either.
>>>>
>>>> The key point is the memory is not accessible by the hardware any more
>>>> because it is freed and removed from the page tables.
>>>>
>>>> So further access is prevented and in this special case it is actually
>>>> valid to do this even if the operation represented by the fence is still
>>>> ongoing.
>>> Hm, why do you have to remove these fences then? Can't you just let
>>> them signal and get collected as usual? As soon as you share buffers
>>> these fences can get anywhere, so you need to correctly unwind them no
>>> matter what.
>>>
>>> You're kinda still just describing what you're doing, not why.
>> It is simply more efficient to remove the fence from one reservation
>> object than to add a new fence to all other reservation objects in the
>> same process.
> Ok, you still talk in riddles and don't explain what the goal of this
> entire dance is, so I went and read the code. Assuming I didn't misread
> too much:
>
> 1. you create a fake fence on a per-process timeline.
> 2. you attach this liberally to all the bo you're creating on that
> process
> 3. the fence never signals on its own, but it has a very magic
> ->enable_signaling callback which is the only thing that makes this fence
> switch to signalled in a finite time. Before that it's stuck forever. So
> quite a bit a schrödinger fence: It's not a real fence (because it fails
> to signal in bounded time) except when you start to look at it.
> 4. Looking at the fence triggers eviction, at that point we replace this
> magic eviction fence with the next set, reacquire buffers and then unblock
> the kfd process once everything is in shape again.
>
> This is soooooooooooooooooo magic that I really don't think we should
> encourage people without clue to maybe use this and totally break all
> fences guarantees.

Yeah, that is correct. But this is completely unrelated to why we want 
to remove the fence.

> If you do want to make sure an optimized version within
> reservation_object.c, then it should be code which replaces fences iff:
> - they're the same context
> - later in the ordering within that context
> - of the same type (i.e. safe vs shared)
>
> That would actually be safe thing to do.

No, that won't work because there is no replacement for the fence in 
question.

See we want to remove the fence because the memory is freed up.

> Also, the above is what I expected when asking "why do you need this", not
> "we replace fences, its more efficient" I kinda got that from the code :-)

Well I explained the short version why we do this. What you dug up here 
is correct as well, but completely unrelated to removing the fence.

Again, the reason to remove the fence from one reservation object is 
simply that it is faster to remove it from one object than to attach a 
new fence to all other objects.

It's just an optimization,
Christian.

>
> Plus reading this now with (at least the believe of) understanding what
> you're doing, replacing the fences and reattaching the next one of these
> magic fences doesn't feel like it's actually making anything faster. Just
> more obscure ...
>
> Looking at reservation_object_add_shared_fence it seems to dtrt already.
> -Daniel
>
Daniel Vetter June 27, 2019, 7:57 p.m. UTC | #7
On Thu, Jun 27, 2019 at 7:20 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 27.06.19 um 19:10 schrieb Daniel Vetter:
> > On Thu, Jun 27, 2019 at 03:48:06PM +0000, Koenig, Christian wrote:
> >> Am 27.06.19 um 17:34 schrieb Daniel Vetter:
> >>> On Thu, Jun 27, 2019 at 3:19 PM Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> Am 27.06.19 um 12:43 schrieb Daniel Vetter:
> >>>>> On Thu, Jun 27, 2019 at 12:18 PM Christian König
> >>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>>> While freeing up memory it is easier to remove a fence from a reservation
> >>>>>> object instead of signaling it and installing a new one in all other objects.
> >>>>>>
> >>>>>> Clean this up by adding the removal function to the core reservation object
> >>>>>> code instead of messing with the internal in amdgpu.
> >>>>>>
> >>>>>> No functional change.
> >>>>>>
> >>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>>>> ---
> >>>>>>     drivers/dma-buf/reservation.c                 | 62 +++++++++++++++++++
> >>>>>>     .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 45 +-------------
> >>>>>>     include/linux/reservation.h                   |  3 +-
> >>>>>>     3 files changed, 65 insertions(+), 45 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> >>>>>> index ef710effedfa..e43a316a005d 100644
> >>>>>> --- a/drivers/dma-buf/reservation.c
> >>>>>> +++ b/drivers/dma-buf/reservation.c
> >>>>>> @@ -231,6 +231,68 @@ void reservation_object_add_shared_fence(struct reservation_object *obj,
> >>>>>>     }
> >>>>>>     EXPORT_SYMBOL(reservation_object_add_shared_fence);
> >>>>>>
> >>>>>> +/**
> >>>>>> + * reservation_object_remove_shared_fence - remove shared fences
> >>>>>> + * @obj: the reservation object
> >>>>>> + * @context: the context of the fences to remove
> >>>>>> + *
> >>>>>> + * Remove shared fences based on their fence context.
> >>>>>> + */
> >>>>> This needs a serious explanation of "why?". Removing fences without
> >>>>> guaranteeing they're all signaled is a good way to create havoc. Your
> >>>>> commit message has a few words about what you're doing here, but it
> >>>>> still doesn't explain why this is safe and when exactly it should be
> >>>>> used.
> >>>> Yeah, I'm not very keen about this either.
> >>>>
> >>>> The key point is the memory is not accessible by the hardware any more
> >>>> because it is freed and removed from the page tables.
> >>>>
> >>>> So further access is prevented and in this special case it is actually
> >>>> valid to do this even if the operation represented by the fence is still
> >>>> ongoing.
> >>> Hm, why do you have to remove these fences then? Can't you just let
> >>> them signal and get collected as usual? As soon as you share buffers
> >>> these fences can get anywhere, so you need to correctly unwind them no
> >>> matter what.
> >>>
> >>> You're kinda still just describing what you're doing, not why.
> >> It is simply more efficient to remove the fence from one reservation
> >> object than to add a new fence to all other reservation objects in the
> >> same process.
> > Ok, you still talk in riddles and don't explain what the goal of this
> > entire dance is, so I went and read the code. Assuming I didn't misread
> > too much:
> >
> > 1. you create a fake fence on a per-process timeline.
> > 2. you attach this liberally to all the bo you're creating on that
> > process
> > 3. the fence never signals on its own, but it has a very magic
> > ->enable_signaling callback which is the only thing that makes this fence
> > switch to signalled in a finite time. Before that it's stuck forever. So
> > quite a bit a schrödinger fence: It's not a real fence (because it fails
> > to signal in bounded time) except when you start to look at it.
> > 4. Looking at the fence triggers eviction, at that point we replace this
> > magic eviction fence with the next set, reacquire buffers and then unblock
> > the kfd process once everything is in shape again.
> >
> > This is soooooooooooooooooo magic that I really don't think we should
> > encourage people without clue to maybe use this and totally break all
> > fences guarantees.
>
> Yeah, that is correct. But this is completely unrelated to why we want
> to remove the fence.
>
> > If you do want to make sure an optimized version within
> > reservation_object.c, then it should be code which replaces fences iff:
> > - they're the same context
> > - later in the ordering within that context
> > - of the same type (i.e. safe vs shared)
> >
> > That would actually be safe thing to do.
>
> No, that won't work because there is no replacement for the fence in
> question.
>
> See we want to remove the fence because the memory is freed up.
>
> > Also, the above is what I expected when asking "why do you need this", not
> > "we replace fences, its more efficient" I kinda got that from the code :-)
>
> Well I explained the short version why we do this. What you dug up here
> is correct as well, but completely unrelated to removing the fence.
>
> Again, the reason to remove the fence from one reservation object is
> simply that it is faster to remove it from one object than to attach a
> new fence to all other objects.

Hm I guess I was led astray by the eviction_fence invalidation thing
in enable_signaling, and a few other places that freed the bo right
after (like amdgpu_amdkfd_gpuvm_free_memory_of_gpu), where removing
the fences first and then freeing the bo is kinda pointless.

Now with your insistence that I'm getting something wrong I guess
you're talking about the unbind case, where the bo survives, but its
mapping disappears, and hence that specific eviction_fence needs to be
removed.

And yeah there I guess just removing the magic eviction fence is
cheaper than replacing all the others.

Now I guess I understand the mechanics of this somewhat, and what
you're doing, and it looks even somewhat safe. But I have no idea what
this is supposed to achieve. It feels a bit like ->notify_move, but
implemented in the most horrible way possible. Or maybe something
else.

Really no idea.

And given that we've wasted a few pages full of paragraphs already on
trying to explain what your new little helper is for, when it's safe
to use, when it's maybe not a good idea, and we still haven't even
bottomed out on what this is for ... well I really don't think it's a
good idea to inflict this into core code. Because just blindly
removing normal fences is not safe.

Especially with like half a sentence of kerneldoc that explains
nothing of all this complexity.
-Daniel
Christian König June 28, 2019, 6:32 a.m. UTC | #8
Am 27.06.19 um 21:57 schrieb Daniel Vetter:
> [SNIP]
>> Again, the reason to remove the fence from one reservation object is
>> simply that it is faster to remove it from one object than to attach a
>> new fence to all other objects.
> Hm I guess I was lead astray by the eviction_fence invalidation thing
> in enable_signaling, and a few other places that freed the bo right
> afters (like amdgpu_amdkfd_gpuvm_free_memory_of_gpu), where removing
> the fences first and then freeing the bo is kinda pointless.

AH! Now I know where your missing puzzle piece is.

See when we free a buffer object TTM puts it on the delayed delete list 
to make sure that it gets freed up only after all fences are signaled.

So this is essentially to make sure the BO gets freed up immediately.

> Now with your insistence that I'm getting something wrong I guess the
> you're talking about the unbind case, where the bo survives, but it's
> mapping disappears, and hence that specific eviction_fence needs to be
> removed.
> And yeah there I guess just removing the magic eviction fence is
> cheaper than replacing all the others.

If possible I actually want to apply this to the general case of freeing 
up per process resources.

In other words when we don't track resource usage on a per submission 
basis freeing up resources is costly because we always have to wait for 
the last submission.

But if we can prevent further access to the resource using page tables 
it is perfectly valid to free it as soon as the page tables are up to date.

Regards,
Christian.

>
> Now I guess I understand the mechanics of this somewhat, and what
> you're doing, and lit ooks even somewhat safe. But I have no idea what
> this is supposed to achieve. It feels a bit like ->notify_move, but
> implemented in the most horrible way possible. Or maybe something
> else.
>
> Really no idea.
>
> And given that we've wasted I few pages full of paragraphs already on
> trying to explain what your new little helper is for, when it's safe
> to use, when it's maybe not a good idea, and we still haven't even
> bottomed out on what this is for ... well I really don't think it's a
> good idea to inflict this into core code. Because just blindly
> removing normal fences is not safe.
>
> Especially with like half a sentence of kerneldoc that explains
> nothing of all this complexity.
> -Daniel
Daniel Vetter June 28, 2019, 7:30 a.m. UTC | #9
On Fri, Jun 28, 2019 at 8:32 AM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
> Am 27.06.19 um 21:57 schrieb Daniel Vetter:
> > [SNIP]
> >> Again, the reason to remove the fence from one reservation object is
> >> simply that it is faster to remove it from one object than to attach a
> >> new fence to all other objects.
> > Hm I guess I was lead astray by the eviction_fence invalidation thing
> > in enable_signaling, and a few other places that freed the bo right
> > afters (like amdgpu_amdkfd_gpuvm_free_memory_of_gpu), where removing
> > the fences first and then freeing the bo is kinda pointless.
>
> AH! Now I know where your missing puzzle piece is.
>
> See when we free a buffer object TTM puts it on the delayed delete list
> to make sure that it gets freed up only after all fences are signaled.
>
> So this is essentially to make sure the BO gets freed up immediately.

Well yeah you have to wait for outstanding rendering. Everyone does
that. The problem is that ->eviction_fence does not represent
rendering, it's a very contrived way to implement the ->notify_move
callback.

For real fences you have to wait for rendering to complete, putting
aside corner cases like destroying an entire context with all its
pending rendering. Not really something we should optimize for I
think.

> > Now with your insistence that I'm getting something wrong I guess the
> > you're talking about the unbind case, where the bo survives, but it's
> > mapping disappears, and hence that specific eviction_fence needs to be
> > removed.
> > And yeah there I guess just removing the magic eviction fence is
> > cheaper than replacing all the others.
>
> If possible I actually want to apply this to the general case of freeing
> up per process resources.
>
> In other words when we don't track resource usage on a per submission
> basis freeing up resources is costly because we always have to wait for
> the last submission.
>
> But if we can prevent further access to the resource using page tables
> it is perfectly valid to free it as soon as the page tables are up to date.

Still not seeing how you can use this outside of the magic amdkfd
eviction_fence.

So with the magic amdkfd_eviction fence I agree this makes sense. The
problem I'm having here is that the magic eviction fence itself
doesn't make sense. What I expect will happen (in terms of the new
dynamic dma-buf stuff, I don't have the ttm-ism ready to explain it in
those concepts).

- when you submit command buffers, you _dont_ attach fences to all
involved buffers
- when you get a ->notify_move there's currently 2 options:
1. inefficient way: wait for the latest command buffer to complete, as
a defensive move. To do that you attach the fence from that command
buffer to the obj in your notifiy_move callback (the kerneldoc doesn't
explain this, but I think we really need this).
2. efficient way: You just unmap from pagetables (and eat/handle the
fault if there is any).

No where do you need to remove a fence, because you never attached a
bogus fence.

Now with the magic eviction trick amdkfd uses, you can't do that,
because you need to attach this magic fence to all buffers, all the
time. And since it still needs to work like a fence it's one-shot
only, i.e. instead of a reuseable ->notify_move callback you need to
create a new fence every time ->enable_signalling is called. So in a
way replacing fences is just an artifact of some very, very crazy
calling convention.

If you have a real callback, there's no need for cycling through
fences, and therefore there's also no need to optimize their removal.

Or did you work under the assumption that ->notify_move cannot attach
new fences, and therefore you'd have to roll this magic fence trick to
even more places?

> > Now I guess I understand the mechanics of this somewhat, and what
> > you're doing, and it looks even somewhat safe. But I have no idea what
> > this is supposed to achieve. It feels a bit like ->notify_move, but
> > implemented in the most horrible way possible. Or maybe something
> > else.
> >
> > Really no idea.
> >
> > And given that we've wasted a few pages full of paragraphs already on
> > trying to explain what your new little helper is for, when it's safe
> > to use, when it's maybe not a good idea, and we still haven't even
> > bottomed out on what this is for ... well I really don't think it's a
> > good idea to inflict this into core code. Because just blindly
> > removing normal fences is not safe.
> >
> > Especially with like half a sentence of kerneldoc that explains
> > nothing of all this complexity.

Still makes no sense to me to have in core code.
-Daniel
Christian König June 28, 2019, 8:40 a.m. UTC | #10
Am 28.06.19 um 09:30 schrieb Daniel Vetter:
> On Fri, Jun 28, 2019 at 8:32 AM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 27.06.19 um 21:57 schrieb Daniel Vetter:
>>> [SNIP]
>>>> Again, the reason to remove the fence from one reservation object is
>>>> simply that it is faster to remove it from one object than to attach a
>>>> new fence to all other objects.
>>> Hm I guess I was led astray by the eviction_fence invalidation thing
>>> in enable_signaling, and a few other places that freed the bo right
>>> afters (like amdgpu_amdkfd_gpuvm_free_memory_of_gpu), where removing
>>> the fences first and then freeing the bo is kinda pointless.
>> AH! Now I know where your missing puzzle piece is.
>>
>> See when we free a buffer object TTM puts it on the delayed delete list
>> to make sure that it gets freed up only after all fences are signaled.
>>
>> So this is essentially to make sure the BO gets freed up immediately.
> Well yeah you have to wait for outstanding rendering. Everyone does
> that. The problem is that ->eviction_fence does not represent
> rendering, it's a very contrived way to implement the ->notify_move
> callback.
>
> For real fences you have to wait for rendering to complete, putting
> aside corner cases like destroying an entire context with all its
> pending rendering. Not really something we should optimize for I
> think.

No, for real fences I don't need to wait for the rendering to complete either.

If userspace said that this per process resource is dead and can be 
removed, we don't need to wait for it to become idle.

>>> Now with your insistence that I'm getting something wrong I guess the
>>> you're talking about the unbind case, where the bo survives, but it's
>>> mapping disappears, and hence that specific eviction_fence needs to be
>>> removed.
>>> And yeah there I guess just removing the magic eviction fence is
>>> cheaper than replacing all the others.
>> If possible I actually want to apply this to the general case of freeing
>> up per process resources.
>>
>> In other words when we don't track resource usage on a per submission
>> basis freeing up resources is costly because we always have to wait for
>> the last submission.
>>
>> But if we can prevent further access to the resource using page tables
>> it is perfectly valid to free it as soon as the page tables are up to date.
> Still not seeing how you can use this outside of the magic amdkfd
> eviction_fence.

As I explained you have a per process resource and userspace says that 
this one can go away.

As far as I can see it is perfectly valid to remove all fences from this 
process as soon as the page tables are up to date.

> So with the magic amdkfd_eviction fence I agree this makes sense. The
> problem I'm having here is that the magic eviction fence itself
> doesn't make sense. What I expect will happen (in terms of the new
> dynamic dma-buf stuff, I don't have the ttm-ism ready to explain it in
> those concepts).
>
> - when you submit command buffers, you _dont_ attach fences to all
> involved buffers

That's not going to work because then the memory management thinks 
that the buffer is immediately movable, which it isn't,

> - when you get a ->notify_move there's currently 2 options:
> 1. inefficient way: wait for the latest command buffer to complete, as
> a defensive move. To do that you attach the fence from that command
> buffer to the obj in your notifiy_move callback (the kerneldoc doesn't
> explain this, but I think we really need this).
> 2. efficient way: You just unmap from pagetables (and eat/handle the
> fault if there is any).

Exactly yeah. As far as I can see for freeing things up that is a 
perfectly valid approach as long as you have a VM which prevents 
accessing this memory.

See ttm_bo_individualize_resv() as well, here we do something similar 
for GFX what the KFD does when it releases memory.

E.g. for per process resources we copy over the current fences into an 
individualized reservation object, to make sure that this can be freed 
up at some time in the future.

But I really want to go a step further and say ok, all fences from this 
process can be removed after we updated the page tables.

> No where do you need to remove a fence, because you never attached a
> bogus fence.
>
> Now with the magic eviction trick amdkfd uses, you can't do that,
> because you need to attach this magic fence to all buffers, all the
> time. And since it still needs to work like a fence it's one-shot
> only, i.e. instead of a reuseable ->notify_move callback you need to
> create a new fence every time ->enable_signalling is called. So in a
> way replacing fences is just an artifact of some very, very crazy
> calling convention.
>
> If you have a real callback, there's no need for cycling through
> fences, and therefore there's also no need to optimize their removal.
>
> Or did you work under the assumption that ->notify_move cannot attach
> new fences, and therefore you'd have to roll this magic fence trick to
> even more places?

Well that notify_move approach was what was initially suggested by the 
KFD team as well. The problem is simply that there is no general 
notify_move callback for buffers yet.

Adding that one is exactly what my patches for dynamic DMA-buf are 
currently doing :)

>>> Now I guess I understand the mechanics of this somewhat, and what
>>> you're doing, and it looks even somewhat safe. But I have no idea what
>>> this is supposed to achieve. It feels a bit like ->notify_move, but
>>> implemented in the most horrible way possible. Or maybe something
>>> else.
>>>
>>> Really no idea.
>>>
>>> And given that we've wasted a few pages full of paragraphs already on
>>> trying to explain what your new little helper is for, when it's safe
>>> to use, when it's maybe not a good idea, and we still haven't even
>>> bottomed out on what this is for ... well I really don't think it's a
>>> good idea to inflict this into core code. Because just blindly
>>> removing normal fences is not safe.
>>>
>>> Especially with like half a sentence of kerneldoc that explains
>>> nothing of all this complexity.
> Still makes no sense to me to have in core code.

Well it is still better to have this in the core code than messing with 
the reservation object internals in the driver code.

Regards,
Christian.

> -Daniel
Daniel Vetter June 28, 2019, 9:41 a.m. UTC | #11
On Fri, Jun 28, 2019 at 10:40 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 28.06.19 um 09:30 schrieb Daniel Vetter:
> > On Fri, Jun 28, 2019 at 8:32 AM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 27.06.19 um 21:57 schrieb Daniel Vetter:
> >>> [SNIP]
> >>>> Again, the reason to remove the fence from one reservation object is
> >>>> simply that it is faster to remove it from one object than to attach a
> >>>> new fence to all other objects.
> >>> Hm I guess I was led astray by the eviction_fence invalidation thing
> >>> in enable_signaling, and a few other places that freed the bo right
> >>> afters (like amdgpu_amdkfd_gpuvm_free_memory_of_gpu), where removing
> >>> the fences first and then freeing the bo is kinda pointless.
> >> AH! Now I know where your missing puzzle piece is.
> >>
> >> See when we free a buffer object TTM puts it on the delayed delete list
> >> to make sure that it gets freed up only after all fences are signaled.
> >>
> >> So this is essentially to make sure the BO gets freed up immediately.
> > Well yeah you have to wait for outstanding rendering. Everyone does
> > that. The problem is that ->eviction_fence does not represent
> > rendering, it's a very contrived way to implement the ->notify_move
> > callback.
> >
> > For real fences you have to wait for rendering to complete, putting
> > aside corner cases like destroying an entire context with all its
> > pending rendering. Not really something we should optimize for I
> > think.
>
> No, real fences I don't need to wait for the rendering to complete either.
>
> If userspace said that this per process resource is dead and can be
> removed, we don't need to wait for it to become idle.

But why did you attach a fence in the first place?

> >>> Now with your insistence that I'm getting something wrong I guess the
> >>> you're talking about the unbind case, where the bo survives, but it's
> >>> mapping disappears, and hence that specific eviction_fence needs to be
> >>> removed.
> >>> And yeah there I guess just removing the magic eviction fence is
> >>> cheaper than replacing all the others.
> >> If possible I actually want to apply this to the general case of freeing
> >> up per process resources.
> >>
> >> In other words when we don't track resource usage on a per submission
> >> basis freeing up resources is costly because we always have to wait for
> >> the last submission.
> >>
> >> But if we can prevent further access to the resource using page tables
> >> it is perfectly valid to free it as soon as the page tables are up to date.
> > Still not seeing how you can use this outside of the magic amdkfd
> > eviction_fence.
>
> As I explained you have a per process resource and userspace says that
> this one can go away.
>
> As far as I can see it is perfectly valid to remove all fences from this
> process as soon as the page tables are up to date.

I'm not objecting the design, but just the implementation.

> > So with the magic amdkfd_eviction fence I agree this makes sense. The
> > problem I'm having here is that the magic eviction fence itself
> > doesn't make sense. What I expect will happen (in terms of the new
> > dynamic dma-buf stuff, I don't have the ttm-ism ready to explain it in
> > those concepts).
> >
> > - when you submit command buffers, you _dont_ attach fences to all
> > involved buffers
>
> That's not going to work because then the memory management thinks
> that the buffer is immediately movable, which it isn't,

I guess we need to fix that then. I pretty much assumed that
->notify_move could add whatever fences you might want to add. Which
would very neatly allow us to solve this problem here, instead of
coming up with fake fences and fun stuff like that.

If ->notify_move can't add fences, then the you have to attach the
right fences to all of the bo, all the time. And a CS model where
userspace just updates the working set and keeps submitting stuff,
while submitting new batches. And the kernel preempts the entire
context if memory needs to be evicted, and re-runs it only once the
working set is available again.

No the eviction_fence is not a good design solution for this, and imo
you should have rejected that. Who cares if the amdkfd people didn't
want to put in the work. But that ship sailed, so lets at least not
spread this more.

> > - when you get a ->notify_move there's currently 2 options:
> > 1. inefficient way: wait for the latest command buffer to complete, as
> > a defensive move. To do that you attach the fence from that command
> > buffer to the obj in your notifiy_move callback (the kerneldoc doesn't
> > explain this, but I think we really need this).
> > 2. efficient way: You just unmap from pagetables (and eat/handle the
> > fault if there is any).
>
> Exactly yeah. As far as I can see for freeing things up that is a
> perfectly valid approach as long as you have a VM which prevents
> accessing this memory.
>
> See ttm_bo_individualize_resv() as well, here we do something similar
> for GFX what the KFD does when it releases memory.

Uh wut. I guess more funky tricks I need to first learn about, but
yeah doesn't make much sense to me right now.

> E.g. for per process resources we copy over the current fences into an
> individualized reservation object, to make sure that this can be freed
> up at some time in the future.

Why are you deleting an object where others outside of your driver
can still add fences? Just from this description this doesn't make
sense to me ...

One idea (without reading the code): Does ttm not have a dedicated
active reference for its own mapping, instead fully relying on the
fences in the reservation object? So if you have something imported
you always wait if the other side keeps rendering?

I mean you can "fix" this by replacing the resv object you're using,
but maybe the real deal is a design bug in ttm failing to keep track
of which mappings are still needed. I guess since it only tracks
backing storage and not mappings that's not a surprise. i915 doesn't
need this, and again I don't think that's reasonable design which
should be promoted in core functionality.

Or something else again?

> But I really want to go a step further and say ok, all fences from this
> process can be removed after we updated the page tables.
>
> > No where do you need to remove a fence, because you never attached a
> > bogus fence.
> >
> > Now with the magic eviction trick amdkfd uses, you can't do that,
> > because you need to attach this magic fence to all buffers, all the
> > time. And since it still needs to work like a fence it's one-shot
> > only, i.e. instead of a reuseable ->notify_move callback you need to
> > create a new fence every time ->enable_signalling is called. So in a
> > way replacing fences is just an artifact of some very, very crazy
> > calling convention.
> >
> > If you have a real callback, there's no need for cycling through
> > fences, and therefore there's also no need to optimize their removal.
> >
> > Or did you work under the assumption that ->notify_move cannot attach
> > new fences, and therefore you'd have to roll this magic fence trick to
> > even more places?
>
> Well that notify_move approach was what was initially suggested by the
> KFD team as well. The problem is simply that there is no general
> notify_move callback for buffers yet.
>
> Adding that one is exactly what my patches for dynamic DMA-buf are
> currently doing :)

Ok, so we agree at least for the amdkfd that the ->notify_move is the
right solution here. Imo lets get that done and remove eviction_fence,
instead of promoting that design pattern even more.

> >>> Now I guess I understand the mechanics of this somewhat, and what
> >>> you're doing, and it looks even somewhat safe. But I have no idea what
> >>> this is supposed to achieve. It feels a bit like ->notify_move, but
> >>> implemented in the most horrible way possible. Or maybe something
> >>> else.
> >>>
> >>> Really no idea.
> >>>
> >>> And given that we've wasted a few pages full of paragraphs already on
> >>> trying to explain what your new little helper is for, when it's safe
> >>> to use, when it's maybe not a good idea, and we still haven't even
> >>> bottomed out on what this is for ... well I really don't think it's a
> >>> good idea to inflict this into core code. Because just blindly
> >>> removing normal fences is not safe.
> >>>
> >>> Especially with like half a sentence of kerneldoc that explains
> >>> nothing of all this complexity.
> > Still makes no sense to me to have in core code.
>
> Well it is still better to have this in the core code than messing with
> the reservation object internals in the driver code.

Uh no. Imo if you do really funky stuff and abuse fences to get a
one-shot callback, then you get to keep the cost of maintaining that.
You _are_ already fundamentally tearing down the abstraction
dma_fence/resv_obj is supposed to provide. I mean you're essentially
relying on the exporter calling ->enable_signalling (of the magic
fence) at roughly the spot where it would call ->notify_move. All
without that being documented or clarified as how it's supposed to be
done. Accessing a few private fields is the least offense here :-)
-Daniel
Christian König June 28, 2019, 10:24 a.m. UTC | #12
Am 28.06.19 um 11:41 schrieb Daniel Vetter:
> On Fri, Jun 28, 2019 at 10:40 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 28.06.19 um 09:30 schrieb Daniel Vetter:
>>> On Fri, Jun 28, 2019 at 8:32 AM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Am 27.06.19 um 21:57 schrieb Daniel Vetter:
>>>>> [SNIP]
>>> Well yeah you have to wait for outstanding rendering. Everyone does
>>> that. The problem is that ->eviction_fence does not represent
>>> rendering, it's a very contrived way to implement the ->notify_move
>>> callback.
>>>
>>> For real fences you have to wait for rendering to complete, putting
>>> aside corner cases like destroying an entire context with all its
>>> pending rendering. Not really something we should optimize for I
>>> think.
>> No, real fences I don't need to wait for the rendering to complete either.
>>
>> If userspace said that this per process resource is dead and can be
>> removed, we don't need to wait for it to become idle.
> But why did you attach a fence in the first place?

Because so that others can still wait for it.

See it is perfectly valid to export this buffer object to let's say the 
Intel driver and in this case I don't get a move notification.

And I really don't want to add another workaround where I add the fences 
only when the BO is exported or stuff like that.

>> [SNIP]
>> As far as I can see it is perfectly valid to remove all fences from this
>> process as soon as the page tables are up to date.
> I'm not objecting the design, but just the implementation.

Ok, then we can at least agree on that.

>>> So with the magic amdkfd_eviction fence I agree this makes sense. The
>>> problem I'm having here is that the magic eviction fence itself
>>> doesn't make sense. What I expect will happen (in terms of the new
>>> dynamic dma-buf stuff, I don't have the ttm-ism ready to explain it in
>>> those concepts).
>>>
>>> - when you submit command buffers, you _dont_ attach fences to all
>>> involved buffers
> >> That's not going to work because then the memory management thinks
>> that the buffer is immediately movable, which it isn't,
> I guess we need to fix that then. I pretty much assumed that
> ->notify_move could add whatever fences you might want to add. Which
> would very neatly allow us to solve this problem here, instead of
> coming up with fake fences and fun stuff like that.

Adding the fence later on is not a solution because we need something 
which beforehand can check if a buffer is movable or not.

In the case of a move_notify the decision to move it is already done and 
you can't say oh sorry I have to evict my process and reprogram the 
hardware or whatever.

Especially when you do this in an OOM situation.

> If ->notify_move can't add fences, then the you have to attach the
> right fences to all of the bo, all the time. And a CS model where
> userspace just updates the working set and keeps submitting stuff,
> while submitting new batches. And the kernel preempts the entire
> context if memory needs to be evicted, and re-runs it only once the
> working set is available again.
>
> No the eviction_fence is not a good design solution for this, and imo
> you should have rejected that.

Actually I was the one who suggested that because the alternatives 
doesn't sounded like they would work.

[SNIP]
>> See ttm_bo_individualize_resv() as well, here we do something similar
>> for GFX what the KFD does when it releases memory.
> Uh wut. I guess more funky tricks I need to first learn about, but
> yeah doesn't make much sense ot me right now.
>
>> E.g. for per process resources we copy over the current fences into an
>> individualized reservation object, to make sure that this can be freed
>> up at some time in the future.
> > Why are you deleting an object where others outside of your driver
> can still add fences? Just from this description this doesn't make
> sense to me ...

Multiple BOs share a single reservation object. This is used for example 
for page tables.

To map 16GB of memory I can easily have more than 8k BOs for the page 
tables.

So what we did was to use reservation object of the root page table for 
all other BOs of the VM as well.

Otherwise you would need to add a fence to 8k reservation objects and 
that is not really feasible.

That works fine, except for the case when you want to free up a page 
table. In this case we individualize the BO, copy the fences over and 
say ok we free that up when the current set of fences is signaled.

>> But I really want to go a step further and say ok, all fences from this
>> process can be removed after we updated the page tables.
>>
>>> No where do you need to remove a fence, because you never attached a
>>> bogus fence.
>>>
>>> Now with the magic eviction trick amdkfd uses, you can't do that,
>>> because you need to attach this magic fence to all buffers, all the
>>> time. And since it still needs to work like a fence it's one-shot
>>> only, i.e. instead of a reuseable ->notify_move callback you need to
>>> create a new fence every time ->enable_signalling is called. So in a
>>> way replacing fences is just an artifact of some very, very crazy
>>> calling convention.
>>>
>>> If you have a real callback, there's no need for cycling through
>>> fences, and therefore there's also no need to optimize their removal.
>>>
>>> Or did you work under the assumption that ->notify_move cannot attach
>>> new fences, and therefore you'd have to roll this magic fence trick to
>>> even more places?
>> Well that notify_move approach was what was initially suggested by the
>> KFD team as well. The problem is simply that there is no general
>> notify_move callback for buffers yet.
>>
>> Adding that one is exactly what my patches for dynamic DMA-buf are
>> currently doing :)
> Ok, so we agree at least for the amdkfd that the ->notify_move is the
> right solution here.

Actually I don't.

If we don't add some fence to the reservation object the memory 
management has no chance of knowing that this object is used by somebody.

Regards,
Christian.

>   Imo lets get that done and remove eviction_fence,
> instead of promoting that design pattern even more.
>
>>>>> Now I guess I understand the mechanics of this somewhat, and what
>>>>> you're doing, and it looks even somewhat safe. But I have no idea what
>>>>> this is supposed to achieve. It feels a bit like ->notify_move, but
>>>>> implemented in the most horrible way possible. Or maybe something
>>>>> else.
>>>>>
>>>>> Really no idea.
>>>>>
>>>>> And given that we've wasted a few pages full of paragraphs already on
>>>>> trying to explain what your new little helper is for, when it's safe
>>>>> to use, when it's maybe not a good idea, and we still haven't even
>>>>> bottomed out on what this is for ... well I really don't think it's a
>>>>> good idea to inflict this into core code. Because just blindly
>>>>> removing normal fences is not safe.
>>>>>
>>>>> Especially with like half a sentence of kerneldoc that explains
>>>>> nothing of all this complexity.
>>> Still makes no sense to me to have in core code.
>> Well it is still better to have this in the core code than messing with
>> the reservation object internals in the driver code.
> Uh no. Imo if you do really funky stuff and abuse fences to get a
> one-shot callback, then you get to keep the cost of maintaining that.
> You _are_ already fundamentally tearing down the abstraction
> dma_fence/resv_obj is supposed to provide. I mean you're essentially
> relying on the exporter calling ->enable_signalling (of the magic
> fence) at roughly the spot where it would call ->notify_move. All
> without that being documented or clarified as how it's supposed to be
> done. Accessing a few private fields is the least offense here :-)
> -Daniel
Daniel Vetter June 28, 2019, 2:38 p.m. UTC | #13
On Fri, Jun 28, 2019 at 12:24 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 28.06.19 um 11:41 schrieb Daniel Vetter:
> > On Fri, Jun 28, 2019 at 10:40 AM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 28.06.19 um 09:30 schrieb Daniel Vetter:
> >>> On Fri, Jun 28, 2019 at 8:32 AM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Am 27.06.19 um 21:57 schrieb Daniel Vetter:
> >>>>> [SNIP]
> >>> Well yeah you have to wait for outstanding rendering. Everyone does
> >>> that. The problem is that ->eviction_fence does not represent
> >>> rendering, it's a very contrived way to implement the ->notify_move
> >>> callback.
> >>>
> >>> For real fences you have to wait for rendering to complete, putting
> >>> aside corner cases like destroying an entire context with all its
> >>> pending rendering. Not really something we should optimize for I
> >>> think.
> >> No, real fences I don't need to wait for the rendering to complete either.
> >>
> >> If userspace said that this per process resource is dead and can be
> >> removed, we don't need to wait for it to become idle.
> > But why did you attach a fence in the first place?
>
> Because so that others can still wait for it.
>
> See it is perfectly valid to export this buffer object to let's say the
> Intel driver and in this case I don't get a move notification.

If you share with intel then the buffer is pinned. You can't move the
thing anymore.

> And I really don't want to add another workaround where I add the fences
> only when the BO is exported or stuff like that.

You don't need to add a fence for that case because you can't move the
buffer anyway.

> >> [SNIP]
> >> As far as I can see it is perfectly valid to remove all fences from this
> >> process as soon as the page tables are up to date.
> > I'm not objecting the design, but just the implementation.
>
> Ok, then we can at least agree on that.
>
> >>> So with the magic amdkfd_eviction fence I agree this makes sense. The
> >>> problem I'm having here is that the magic eviction fence itself
> >>> doesn't make sense. What I expect will happen (in terms of the new
> >>> dynamic dma-buf stuff, I don't have the ttm-ism ready to explain it in
> >>> those concepts).
> >>>
> >>> - when you submit command buffers, you _dont_ attach fences to all
> >>> involved buffers
> >> That's not going to work because then the memory management thinks
> >> that the buffer is immediately movable, which it isn't,
> > I guess we need to fix that then. I pretty much assumed that
> > ->notify_move could add whatever fences you might want to add. Which
> > would very neatly allow us to solve this problem here, instead of
> > coming up with fake fences and fun stuff like that.
>
> Adding the fence later on is not a solution because we need something
> which beforehand can check if a buffer is movable or not.
>
> In the case of a move_notify the decision to move it is already done and
> you can't say oh sorry I have to evict my process and reprogram the
> hardware or whatever.
>
> Especially when you do this in an OOM situation.

Why? I mean when the fence for a CS is there already, it might also
still hang out in the scheduler, or be blocked on a fence from another
driver, or anything like that. I don't see a conceptual difference.
Plus with dynamic dma-buf the entire point is that an attached fences
does _not_ mean the buffer is permanently pinned, but can be moved if
you sync correctly. Might need a bit of tuning or a flag to indicate
that some buffers should alwasy considered to be busy, and that you
shouldn't start evicting those. But that's kinda a detail.

From a very high level there's really no difference between
->notify_move and the eviction_fence. Both give you a callback when
someone else needs to move the buffer, that's all. The only difference
is that the eviction_fence thing jumbles the callback and the fence
into one, by preattaching a fence just in case. But again from a
conceptual pov it doesn't matter whether the fence is always hanging
around, or whether you just attach it when ->notify_move is called.

> > If ->notify_move can't add fences, then the you have to attach the
> > right fences to all of the bo, all the time. And a CS model where
> > userspace just updates the working set and keeps submitting stuff,
> > while submitting new batches. And the kernel preempts the entire
> > context if memory needs to be evicted, and re-runs it only once the
> > working set is available again.
> >
> > No the eviction_fence is not a good design solution for this, and imo
> > you should have rejected that.
>
> Actually I was the one who suggested that because the alternatives
> doesn't sounded like they would work.
>
> [SNIP]
> >> See ttm_bo_individualize_resv() as well, here we do something similar
> >> for GFX what the KFD does when it releases memory.
> > Uh wut. I guess more funky tricks I need to first learn about, but
> > yeah doesn't make much sense to me right now.
> >
> >> E.g. for per process resources we copy over the current fences into an
> >> individualized reservation object, to make sure that this can be freed
> >> up at some time in the future.
> > Why are you deleting an object where others outside of your driver
> > can still add fences? Just from this description this doesn't make
> > sense to me ...
>
> Multiple BOs share a single reservation object. This is used for example
> for page tables.
>
> To map 16GB of memory I can easily have more than 8k BOs for the page
> tables.
>
> So what we did was to use reservation object of the root page table for
> all other BOs of the VM as well.
>
> Otherwise you would need to add a fence to 8k reservation objects and
> that is not really feasible.
>
> That works fine, except for the case when you want to free up a page
> table. In this case we individualize the BO, copy the fences over and
> say ok we free that up when the current set of fences is signaled.

As long as you don't share these buffers I think it doesn't matter
what you do. Plus there's tons of other options to implement this
instead of copying the resv_obj, since at that point all that resv_obj
serves as is a list of fences. Just creating a list of fences (and
filtering only the ones valid for your gpu, because the others don't
matter) seems much simpler, but probably more work to refactor ttm to
suit.

Making this official otoh is just calling for trouble; any time other
than when you wait for object destruction, replacing the resv_obj is
not a good idea.

Also: If ttm wouldn't be such a midlayer you could just extend the
eviction code to eat pagetables with a bit of dedicated code and
wouldn't need to track every pagetable with a full blown ttm_bo. That
approach frankly sounds terrible to me.

> >> But I really want to go a step further and say ok, all fences from this
> >> process can be removed after we updated the page tables.
> >>
> >>> No where do you need to remove a fence, because you never attached a
> >>> bogus fence.
> >>>
> >>> Now with the magic eviction trick amdkfd uses, you can't do that,
> >>> because you need to attach this magic fence to all buffers, all the
> >>> time. And since it still needs to work like a fence it's one-shot
> >>> only, i.e. instead of a reuseable ->notify_move callback you need to
> >>> create a new fence every time ->enable_signalling is called. So in a
> >>> way replacing fences is just an artifact of some very, very crazy
> >>> calling convention.
> >>>
> >>> If you have a real callback, there's no need for cycling through
> >>> fences, and therefore there's also no need to optimize their removal.
> >>>
> >>> Or did you work under the assumption that ->notify_move cannot attach
> >>> new fences, and therefore you'd have to roll this magic fence trick to
> >>> even more places?
> >> Well that notify_move approach was what was initially suggested by the
> >> KFD team as well. The problem is simply that there is no general
> >> notify_move callback for buffers yet.
> >>
> >> Adding that one is exactly what my patches for dynamic DMA-buf are
> >> currently doing :)
> > Ok, so we agree at least for the amdkfd that the ->notify_move is the
> > right solution here.
>
> Actually I don't.
>
> If we don't add some fence to the reservation object the memory
> management has no chance of knowing that this object is used by somebody.

So what else is that fancy eviction fence used for, aside from
->notify_move and making sure there's a fence already attached,
instead of allowing a fence to be attached as part of the eviction
process? Which really shouldn't be a problem, you already _have_ to
be able to cope with that with pipeline bo moves out of vram anyway.
So more reasons to add fences while evicting a bo shouldn't pose a
problem.

At least right now all I'm seeing is ->notify_move implemented in the
most backwards and clever (but not the good kind of clever) way
possible.
-Daniel

>
> Regards,
> Christian.
>
> >   Imo lets get that done and remove eviction_fence,
> > instead of promoting that design pattern even more.
> >
> >>>>> Now I guess I understand the mechanics of this somewhat, and what
> >>>>> you're doing, and it looks even somewhat safe. But I have no idea what
> >>>>> this is supposed to achieve. It feels a bit like ->notify_move, but
> >>>>> implemented in the most horrible way possible. Or maybe something
> >>>>> else.
> >>>>>
> >>>>> Really no idea.
> >>>>>
> >>>>> And given that we've wasted I few pages full of paragraphs already on
> >>>>> trying to explain what your new little helper is for, when it's safe
> >>>>> to use, when it's maybe not a good idea, and we still haven't even
> >>>>> bottomed out on what this is for ... well I really don't think it's a
> >>>>> good idea to inflict this into core code. Because just blindly
> >>>>> removing normal fences is not safe.
> >>>>>
> >>>>> Especially with like half a sentence of kerneldoc that explains
> >>>>> nothing of all this complexity.
> >>> Still makes no sense to me to have in core code.
> >> Well it is still better to have this in the core code than messing with
> >> the reservation object internals in the driver code.
> > Uh no. Imo if you do really funky stuff and abuse fences to get a
> > one-shot callback, then you get to keep the cost of maintaining that.
> > You _are_ already fundamentally tearing down the abstraction
> > dma_fence/resv_obj is supposed to provide. I mean you're essentially
> > relying on the exporter calling ->enable_signalling (of the magic
> > fence) at roughly the spot where it would call ->notify_move. All
> > without that being documented or clarified as how it's supposed to be
> > done. Accessing a few private fields is the least offense here :-)
> > -Daniel
>
Christian König June 28, 2019, 3:21 p.m. UTC | #14
Am 28.06.19 um 16:38 schrieb Daniel Vetter:
[SNIP]
>>>>> - when you submit command buffers, you _dont_ attach fences to all
>>>>> involved buffers
>>>> That's not going to work because then the memory management then thinks
>>>> that the buffer is immediately movable, which it isn't,
>>> I guess we need to fix that then. I pretty much assumed that
>>> ->notify_move could add whatever fences you might want to add. Which
>>> would very neatly allow us to solve this problem here, instead of
>>> coming up with fake fences and fun stuff like that.
>> Adding the fence later on is not a solution because we need something
>> which beforehand can check if a buffer is movable or not.
>>
>> In the case of a move_notify the decision to move it is already done and
>> you can't say oh sorry I have to evict my process and reprogram the
>> hardware or whatever.
>>
>> Especially when you do this in an OOM situation.
> Why? I mean when the fence for a CS is there already, it might also
> still hang out in the scheduler, or be blocked on a fence from another
> driver, or anything like that. I don't see a conceptual difference.
> Plus with dynamic dma-buf the entire point is that an attached fences
> does _not_ mean the buffer is permanently pinned, but can be moved if
> you sync correctly. Might need a bit of tuning or a flag to indicate
> that some buffers should alwasy considered to be busy, and that you
> shouldn't start evicting those. But that's kinda a detail.
>
>  From a very high level there's really no difference betwen
> ->notify_move and the eviction_fence. Both give you a callback when
> someone else needs to move the buffer, that's all. The only difference
> is that the eviction_fence thing jumbles the callback and the fence
> into one, by preattaching a fence just in case. But again from a
> conceptual pov it doesn't matter whether the fence is always hanging
> around, or whether you just attach it when ->notify_move is called.

Sure there is a difference. See when you attach the fence beforehand the 
memory management can know that the buffer is busy.

Just imagine the following: We are in an OOM situation and need to swap 
things out to disk!

When the fence is attached beforehand the handling can be as following:
1. MM picks a BO from the LRU and starts to evict it.
2. The eviction fence is enabled and we stop the process using this BO.
3. As soon as the process is stopped the fence is set into the signaled 
state.
4. MM needs to evict more BOs and since the fence for this process is 
now in the signaled state it can intentionally pick the ones up which 
are now idle.

When we attach the fence only on eviction that can't happen and the MM 
would just pick the next random BO and potentially stop another process.

So I think we can summarize that the memory management definitely needs 
to know beforehand how costly it is to evict a BO.

And of course implement this with flags or use counters or whatever, but 
we already have the fence infrastructure and I don't see a reason not to 
use it.

Regards,
Christian.
Daniel Vetter June 28, 2019, 4:40 p.m. UTC | #15
On Fri, Jun 28, 2019 at 5:21 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 28.06.19 um 16:38 schrieb Daniel Vetter:
> [SNIP]
> >>>>> - when you submit command buffers, you _dont_ attach fences to all
> >>>>> involved buffers
> >>>> That's not going to work because then the memory management then thinks
> >>>> that the buffer is immediately movable, which it isn't,
> >>> I guess we need to fix that then. I pretty much assumed that
> >>> ->notify_move could add whatever fences you might want to add. Which
> >>> would very neatly allow us to solve this problem here, instead of
> >>> coming up with fake fences and fun stuff like that.
> >> Adding the fence later on is not a solution because we need something
> >> which beforehand can check if a buffer is movable or not.
> >>
> >> In the case of a move_notify the decision to move it is already done and
> >> you can't say oh sorry I have to evict my process and reprogram the
> >> hardware or whatever.
> >>
> >> Especially when you do this in an OOM situation.
> > Why? I mean when the fence for a CS is there already, it might also
> > still hang out in the scheduler, or be blocked on a fence from another
> > driver, or anything like that. I don't see a conceptual difference.
> > Plus with dynamic dma-buf the entire point is that an attached fences
> > does _not_ mean the buffer is permanently pinned, but can be moved if
> > you sync correctly. Might need a bit of tuning or a flag to indicate
> > that some buffers should alwasy considered to be busy, and that you
> > shouldn't start evicting those. But that's kinda a detail.
> >
> >  From a very high level there's really no difference betwen
> > ->notify_move and the eviction_fence. Both give you a callback when
> > someone else needs to move the buffer, that's all. The only difference
> > is that the eviction_fence thing jumbles the callback and the fence
> > into one, by preattaching a fence just in case. But again from a
> > conceptual pov it doesn't matter whether the fence is always hanging
> > around, or whether you just attach it when ->notify_move is called.
>
> Sure there is a difference. See when you attach the fence beforehand the
> memory management can know that the buffer is busy.
>
> Just imagine the following: We are in an OOM situation and need to swap
> things out to disk!
>
> When the fence is attached beforehand the handling can be as following:
> 1. MM picks a BO from the LRU and starts to evict it.
> 2. The eviction fence is enabled and we stop the process using this BO.
> 3. As soon as the process is stopped the fence is set into the signaled
> state.
> 4. MM needs to evict more BOs and since the fence for this process is
> now in the signaled state it can intentionally pick the ones up which
> are now idle.
>
> When we attach the fence only on eviction that can't happen and the MM
> would just pick the next random BO and potentially stop another process.
>
> So I think we can summarize that the memory management definitely needs
> to know beforehand how costly it is to evict a BO.
>
> And of course implement this with flags or use counters or whatever, but
> we already have the fence infrastructure and I don't see a reason not to
> use it.

Ok, for the sake of the argument let's buy this.

Why do we need a ->notify_move callback then? We have it already, with
these special fences.

Other side: If all you want to know is whether you can unmap a buffer
immediately, for some short enough value of immediately (I guess a
bunch of pagetable writes should be ok), then why not add that? The "I
don't want to touch all buffers for every CS, but just have a pinned
working set" command submission model is quite different after all,
having dedicated infrastructure that fits well sounds like a good idea
to me.
-Daniel
Christian König June 28, 2019, 5:32 p.m. UTC | #16
Am 28.06.19 um 18:40 schrieb Daniel Vetter:
> On Fri, Jun 28, 2019 at 5:21 PM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 28.06.19 um 16:38 schrieb Daniel Vetter:
>> [SNIP]
>>>>>>> - when you submit command buffers, you _dont_ attach fences to all
>>>>>>> involved buffers
>>>>>> That's not going to work because then the memory management then thinks
>>>>>> that the buffer is immediately movable, which it isn't,
>>>>> I guess we need to fix that then. I pretty much assumed that
>>>>> ->notify_move could add whatever fences you might want to add. Which
>>>>> would very neatly allow us to solve this problem here, instead of
>>>>> coming up with fake fences and fun stuff like that.
>>>> Adding the fence later on is not a solution because we need something
>>>> which beforehand can check if a buffer is movable or not.
>>>>
>>>> In the case of a move_notify the decision to move it is already done and
>>>> you can't say oh sorry I have to evict my process and reprogram the
>>>> hardware or whatever.
>>>>
>>>> Especially when you do this in an OOM situation.
>>> Why? I mean when the fence for a CS is there already, it might also
>>> still hang out in the scheduler, or be blocked on a fence from another
>>> driver, or anything like that. I don't see a conceptual difference.
>>> Plus with dynamic dma-buf the entire point is that an attached fences
>>> does _not_ mean the buffer is permanently pinned, but can be moved if
>>> you sync correctly. Might need a bit of tuning or a flag to indicate
>>> that some buffers should alwasy considered to be busy, and that you
>>> shouldn't start evicting those. But that's kinda a detail.
>>>
>>>   From a very high level there's really no difference betwen
>>> ->notify_move and the eviction_fence. Both give you a callback when
>>> someone else needs to move the buffer, that's all. The only difference
>>> is that the eviction_fence thing jumbles the callback and the fence
>>> into one, by preattaching a fence just in case. But again from a
>>> conceptual pov it doesn't matter whether the fence is always hanging
>>> around, or whether you just attach it when ->notify_move is called.
>> Sure there is a difference. See when you attach the fence beforehand the
>> memory management can know that the buffer is busy.
>>
>> Just imagine the following: We are in an OOM situation and need to swap
>> things out to disk!
>>
>> When the fence is attached beforehand the handling can be as following:
>> 1. MM picks a BO from the LRU and starts to evict it.
>> 2. The eviction fence is enabled and we stop the process using this BO.
>> 3. As soon as the process is stopped the fence is set into the signaled
>> state.
>> 4. MM needs to evict more BOs and since the fence for this process is
>> now in the signaled state it can intentionally pick the ones up which
>> are now idle.
>>
>> When we attach the fence only on eviction that can't happen and the MM
>> would just pick the next random BO and potentially stop another process.
>>
>> So I think we can summarize that the memory management definitely needs
>> to know beforehand how costly it is to evict a BO.
>>
>> And of course implement this with flags or use counters or whatever, but
>> we already have the fence infrastructure and I don't see a reason not to
>> use it.
> Ok, for the sake of the argument let's buy this.
>
> Why do we need a ->notify_move callback then? We have it already, with
> these special fences.

Yeah, that same thought came to my mind as well.

One difference is that notify_move is only called when the BO is 
actually moved, while the fence can be called for other synchronization 
reasons as well.

> Other side: If all you want to know is whether you can unmap a buffer
> immediately, for some short enough value of immediately (I guess a
> bunch of pagetable writes should be ok), then why not add that? The "I
> don't want to touch all buffers for every CS, but just have a pinned
> working set" command submission model is quite different after all,
> having dedicated infrastructure that fits well sounds like a good idea
> to me.

Ok, well I think I can agree on that.

One of the main purposes of using the fence was essentially to avoid 
creating duplicated infrastructure. And I still think that this is a 
good idea. So I think we should essentially aim for something which 
works for both use cases.

Maybe we see this from the wrong side? Fences started off as an event system 
to note about completion of operations.

But what we essentially need for memory management is a) know if the BO 
is used, b) some form of event/callback to stop using it c) an 
event/callback back to let the MM know that it is not used any more.

So taking a step back what should the ideal solution look like?

Christian.

> -Daniel
Daniel Vetter July 2, 2019, 1:50 p.m. UTC | #17
On Fri, Jun 28, 2019 at 7:32 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 28.06.19 um 18:40 schrieb Daniel Vetter:
> > On Fri, Jun 28, 2019 at 5:21 PM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 28.06.19 um 16:38 schrieb Daniel Vetter:
> >> [SNIP]
> >>>>>>> - when you submit command buffers, you _dont_ attach fences to all
> >>>>>>> involved buffers
> >>>>>> That's not going to work because then the memory management then thinks
> >>>>>> that the buffer is immediately movable, which it isn't,
> >>>>> I guess we need to fix that then. I pretty much assumed that
> >>>>> ->notify_move could add whatever fences you might want to add. Which
> >>>>> would very neatly allow us to solve this problem here, instead of
> >>>>> coming up with fake fences and fun stuff like that.
> >>>> Adding the fence later on is not a solution because we need something
> >>>> which beforehand can check if a buffer is movable or not.
> >>>>
> >>>> In the case of a move_notify the decision to move it is already done and
> >>>> you can't say oh sorry I have to evict my process and reprogram the
> >>>> hardware or whatever.
> >>>>
> >>>> Especially when you do this in an OOM situation.
> >>> Why? I mean when the fence for a CS is there already, it might also
> >>> still hang out in the scheduler, or be blocked on a fence from another
> >>> driver, or anything like that. I don't see a conceptual difference.
> >>> Plus with dynamic dma-buf the entire point is that an attached fences
> >>> does _not_ mean the buffer is permanently pinned, but can be moved if
> >>> you sync correctly. Might need a bit of tuning or a flag to indicate
> >>> that some buffers should alwasy considered to be busy, and that you
> >>> shouldn't start evicting those. But that's kinda a detail.
> >>>
> >>>   From a very high level there's really no difference betwen
> >>> ->notify_move and the eviction_fence. Both give you a callback when
> >>> someone else needs to move the buffer, that's all. The only difference
> >>> is that the eviction_fence thing jumbles the callback and the fence
> >>> into one, by preattaching a fence just in case. But again from a
> >>> conceptual pov it doesn't matter whether the fence is always hanging
> >>> around, or whether you just attach it when ->notify_move is called.
> >> Sure there is a difference. See when you attach the fence beforehand the
> >> memory management can know that the buffer is busy.
> >>
> >> Just imagine the following: We are in an OOM situation and need to swap
> >> things out to disk!
> >>
> >> When the fence is attached beforehand the handling can be as following:
> >> 1. MM picks a BO from the LRU and starts to evict it.
> >> 2. The eviction fence is enabled and we stop the process using this BO.
> >> 3. As soon as the process is stopped the fence is set into the signaled
> >> state.
> >> 4. MM needs to evict more BOs and since the fence for this process is
> >> now in the signaled state it can intentionally pick the ones up which
> >> are now idle.
> >>
> >> When we attach the fence only on eviction that can't happen and the MM
> >> would just pick the next random BO and potentially stop another process.
> >>
> >> So I think we can summarize that the memory management definitely needs
> >> to know beforehand how costly it is to evict a BO.
> >>
> >> And of course implement this with flags or use counters or whatever, but
> >> we already have the fence infrastructure and I don't see a reason not to
> >> use it.
> > Ok, for the sake of the argument let's buy this.
> >
> > Why do we need a ->notify_move callback then? We have it already, with
> > these special fences.
>
> Yeah, that same thought came to my mind as well.
>
> One difference is that notify_move is only called when the BO is
> actually moved, while the fence can be called for other synchronization
> reasons as well.

Yeah I think that's the only crux. You can either use the resv_obj for
the magic eviction_fence, or for implicit sync. But if you want to do
both at the same time, it's going to be trouble. Or at least I'm not
seeing how that could work ...

atm not a problem with only amdkfd using that submission model, but I
expect this model of having a semi-static working set + userspace just
submitting batches to spread. And it'll probably spread faster than we
can retire implicit fencing ...

> > Other side: If all you want to know is whether you can unmap a buffer
> > immediately, for some short enough value of immediately (I guess a
> > bunch of pagetable writes should be ok), then why not add that? The "I
> > don't want to touch all buffers for every CS, but just have a pinned
> > working set" command submission model is quite different after all,
> > having dedicated infrastructure that fits well sounds like a good idea
> > to me.
>
> Ok, well I think I can agree on that.
>
> One of the main purposes using the fence was essentially to avoid
> creating duplicated infrastructure. And I still think that this is a
> good idea. So I think we should essentially aim for something which
> works for both use cases.
>
> Maybe we see this from the wrong side? Fences started of as event system
> to note about completion of operations.
>
> But what we essentially need for memory management is a) know if the BO
> is used, b) some form of event/callback to stop using it c) an
> event/callback back to let the MM know that it is not used any more.
>
> So taking a step back what should the ideal solution look like?

Did some pondering over the w/e. But brain was all mushed up and in
non-working condition because everything is too hot here :-/ It's
getting better at least.

Another thing I looked a bit into is the amdgpu userptr stuff, and
wondered whether we could/should make dynamic dma-buf also work for
that case. In a way, userptr is the most over-the-top dynamic kind of
memory ... But I don't think that's possible due to all the locks
involved in hmm_mirror and get_user_pages/hmm_fault.

One thing that's become clear at least is that everyone expects to be
able to wait on fences in their hmm/shrinker/whatever core -mm
callbacks. Which means the comment around the amdgpu userptr mn lock
is actually not paranoid enough:

    /* No memory allocation is allowed while holding the mn lock.
     * p->mn is hold until amdgpu_cs_submit is finished and fence is added
     * to BOs.
     */
    amdgpu_mn_lock(p->mn);

The rule is actually a lot stricter I think: Anytime you publish a
fence you're not allowed to allocate memory until you've signalled
that fence, in _any_ path leading up to that dma_fence_signal.
amdgpu_mn_lock() is just another way to publish fences. Ofc there's
also the locking inversion problem on the mn_lock itself, but the "you
can't allocate any memory, anywhere, in any driver to signal a fence"
is scary. I'll try to bake this into documentation patch somehow.

Aside: The cross-release lockdep stuff would help us validate these
more indirect lock dependencies against the fs_reclaim context. But it
looks like cross-release lockdep is stalled indefinitely :-(

Aside from that sidetrack, no real progress on pondering this topic here.
-Daniel
diff mbox series

Patch

diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index ef710effedfa..e43a316a005d 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -231,6 +231,68 @@  void reservation_object_add_shared_fence(struct reservation_object *obj,
 }
 EXPORT_SYMBOL(reservation_object_add_shared_fence);
 
+/**
+ * reservation_object_remove_shared_fence - remove shared fences
+ * @obj: the reservation object
+ * @context: the context of the fences to remove
+ *
+ * Remove shared fences based on their fence context.
+ */
+int reservation_object_remove_shared_fence(struct reservation_object *obj,
+					   uint64_t context)
+{
+	struct reservation_object_list *old, *new;
+	unsigned int i, j, k;
+
+	reservation_object_assert_held(obj);
+
+	old = reservation_object_get_list(obj);
+	if (!old)
+		return 0;
+
+	new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
+		      GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+
+	/* Go through all the shared fences in the reservation object and sort
+	 * the interesting ones to the end of the new list.
+	 */
+	for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
+		struct dma_fence *f;
+
+		f = rcu_dereference_protected(old->shared[i],
+					      reservation_object_held(obj));
+
+		if (f->context == context)
+			RCU_INIT_POINTER(new->shared[--j], f);
+		else
+			RCU_INIT_POINTER(new->shared[k++], f);
+	}
+	new->shared_max = old->shared_max;
+	new->shared_count = k;
+
+	/* Install the new fence list, seqcount provides the barriers */
+	preempt_disable();
+	write_seqcount_begin(&obj->seq);
+	RCU_INIT_POINTER(obj->fence, new);
+	write_seqcount_end(&obj->seq);
+	preempt_enable();
+
+	/* Drop the references to the removed fences */
+	for (i = j, k = 0; i < old->shared_count; ++i) {
+		struct dma_fence *f;
+
+		f = rcu_dereference_protected(new->shared[i],
+					      reservation_object_held(obj));
+		dma_fence_put(f);
+	}
+	kfree_rcu(old, rcu);
+
+	return 0;
+}
+EXPORT_SYMBOL(reservation_object_remove_shared_fence);
+
 /**
  * reservation_object_add_excl_fence - Add an exclusive fence.
  * @obj: the reservation object
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 10abae398e51..9b25ad3d003e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -224,50 +224,7 @@  static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo,
 	if (!ef)
 		return -EINVAL;
 
-	old = reservation_object_get_list(resv);
-	if (!old)
-		return 0;
-
-	new = kmalloc(offsetof(typeof(*new), shared[old->shared_max]),
-		      GFP_KERNEL);
-	if (!new)
-		return -ENOMEM;
-
-	/* Go through all the shared fences in the resevation object and sort
-	 * the interesting ones to the end of the list.
-	 */
-	for (i = 0, j = old->shared_count, k = 0; i < old->shared_count; ++i) {
-		struct dma_fence *f;
-
-		f = rcu_dereference_protected(old->shared[i],
-					      reservation_object_held(resv));
-
-		if (f->context == ef->base.context)
-			RCU_INIT_POINTER(new->shared[--j], f);
-		else
-			RCU_INIT_POINTER(new->shared[k++], f);
-	}
-	new->shared_max = old->shared_max;
-	new->shared_count = k;
-
-	/* Install the new fence list, seqcount provides the barriers */
-	preempt_disable();
-	write_seqcount_begin(&resv->seq);
-	RCU_INIT_POINTER(resv->fence, new);
-	write_seqcount_end(&resv->seq);
-	preempt_enable();
-
-	/* Drop the references to the removed fences or move them to ef_list */
-	for (i = j, k = 0; i < old->shared_count; ++i) {
-		struct dma_fence *f;
-
-		f = rcu_dereference_protected(new->shared[i],
-					      reservation_object_held(resv));
-		dma_fence_put(f);
-	}
-	kfree_rcu(old, rcu);
-
-	return 0;
+	return reservation_object_remove_shared_fence(resv, ef->base.context);
 }
 
 static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo *bo, uint32_t domain,
diff --git a/include/linux/reservation.h b/include/linux/reservation.h
index f47e8196d039..1c833a56b678 100644
--- a/include/linux/reservation.h
+++ b/include/linux/reservation.h
@@ -229,7 +229,8 @@  int reservation_object_reserve_shared(struct reservation_object *obj,
 				      unsigned int num_fences);
 void reservation_object_add_shared_fence(struct reservation_object *obj,
 					 struct dma_fence *fence);
-
+int reservation_object_remove_shared_fence(struct reservation_object *obj,
+					   uint64_t context);
 void reservation_object_add_excl_fence(struct reservation_object *obj,
 				       struct dma_fence *fence);