diff mbox series

[01/11] drm/amdgpu: Comply with implicit fencing rules

Message ID 20210521090959.1663703-1-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show
Series [01/11] drm/amdgpu: Comply with implicit fencing rules | expand

Commit Message

Daniel Vetter May 21, 2021, 9:09 a.m. UTC
Docs for struct dma_resv are fairly clear:

"A reservation object can have attached one exclusive fence (normally
associated with write operations) or N shared fences (read
operations)."

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-objects

Furthermore a review across all of upstream.

First of render drivers and how they set implicit fences:

- nouveau follows this contract, see in validate_fini_no_ticket()

			nouveau_bo_fence(nvbo, fence, !!b->write_domains);

  and that last boolean controls whether the exclusive or shared fence
  slot is used.

- radeon follows this contract by setting

		p->relocs[i].tv.num_shared = !r->write_domain;

  in radeon_cs_parser_relocs(), which ensures that the call to
  ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the
  right thing.

- vmwgfx seems to follow this contract with the shotgun approach of
  always setting ttm_val_buf->num_shared = 0, which means
  ttm_eu_fence_buffer_objects() will only use the exclusive slot.

- etnaviv follows this contract, as can be trivially seen by looking
  at submit_attach_object_fences()

- i915 is a bit a convoluted maze with multiple paths leading to
  i915_vma_move_to_active(). Which sets the exclusive flag if
  EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for
  softpin mode, or through the write_domain when using relocations. It
  follows this contract.

- lima follows this contract, see lima_gem_submit() which sets the
  exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that
  bo

- msm follows this contract, see msm_gpu_submit() which sets the
  exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer

- panfrost follows this contract with the shotgun approach of just
  always setting the exclusive fence, see
  panfrost_attach_object_fences(). Benefits of a single engine I guess

- v3d follows this contract with the same shotgun approach in
  v3d_attach_fences_and_unlock_reservation(), but it has at least an
  XXX comment that maybe this should be improved

- v4c uses the same shotgun approach of always setting an exclusive
  fence, see vc4_update_bo_seqnos()

- vgem also follows this contract, see vgem_fence_attach_ioctl() and
  the VGEM_FENCE_WRITE. This is used in some igts to validate prime
  sharing with i915.ko without the need of a 2nd gpu

- vritio follows this contract again with the shotgun approach of
  always setting an exclusive fence, see virtio_gpu_array_add_fence()

This covers the setting of the exclusive fences when writing.

Synchronizing against the exclusive fence is a lot more tricky, and I
only spot checked a few:

- i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all
  implicit dependencies (which is used by vulkan)

- etnaviv does this. Implicit dependencies are collected in
  submit_fence_sync(), again with an opt-out flag
  ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in
  etnaviv_sched_dependency which is the
  drm_sched_backend_ops->dependency callback.

- v4c seems to not do much here, maybe gets away with it by not having
  a scheduler and only a single engine. Since all newer broadcom chips than
  the OG vc4 use v3d for rendering, which follows this contract, the
  impact of this issue is fairly small.

- v3d does this using the drm_gem_fence_array_add_implicit() helper,
  which then it's drm_sched_backend_ops->dependency callback
  v3d_job_dependency() picks up.

- panfrost is nice here and tracks the implicit fences in
  panfrost_job->implicit_fences, which again the
  drm_sched_backend_ops->dependency callback panfrost_job_dependency()
  picks up. It is mildly questionable though since it only picks up
  exclusive fences in panfrost_acquire_object_fences(), but not buggy
  in practice because it also always sets the exclusive fence. It
  should pick up both sets of fences, just in case there's ever going
  to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a
  pcie port and a real gpu, which might actually happen eventually. A
  bug, but easy to fix. Should probably use the
  drm_gem_fence_array_add_implicit() helper.

- lima is nice an easy, uses drm_gem_fence_array_add_implicit() and
  the same schema as v3d.

- msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
  but because it doesn't use the drm/scheduler it handles fences from
  the wrong context with a synchronous dma_fence_wait. See
  submit_fence_sync() leading to msm_gem_sync_object(). Investing into
  a scheduler might be a good idea.

- all the remaining drivers are ttm based, where I hope they do
  appropriately obey implicit fences already. I didn't do the full
  audit there because a) not follow the contract would confuse ttm
  quite well and b) reading non-standard scheduler and submit code
  which isn't based on drm/scheduler is a pain.

Onwards to the display side.

- Any driver using the drm_gem_plane_helper_prepare_fb() helper will
  correctly. Overwhelmingly most drivers get this right, except a few
  totally dont. I'll follow up with a patch to make this the default
  and avoid a bunch of bugs.

- I didn't audit the ttm drivers, but given that dma_resv started
  there I hope they get this right.

In conclusion this IS the contract, both as documented and
overwhelmingly implemented, specically as implemented by all render
drivers except amdgpu.

Amdgpu tried to fix this already in

commit 049aca4363d8af87cab8d53de5401602db3b9999
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Sep 19 16:54:35 2018 +0200

    drm/amdgpu: fix using shared fence for exported BOs v2

but this fix falls short on a number of areas:

- It's racy, by the time the buffer is shared it might be too late. To
  make sure there's definitely never a problem we need to set the
  fences correctly for any buffer that's potentially exportable.

- It's breaking uapi, dma-buf fds support poll() and differentitiate
  between, which was introduced in

	commit 9b495a5887994a6d74d5c261d012083a92b94738
	Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
	Date:   Tue Jul 1 12:57:43 2014 +0200

	    dma-buf: add poll support, v3

- Christian König wants to nack new uapi building further on this
  dma_resv contract because it breaks amdgpu, quoting

  "Yeah, and that is exactly the reason why I will NAK this uAPI change.

  "This doesn't works for amdgpu at all for the reasons outlined above."

  https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b729de@gmail.com/

  Rejecting new development because your own driver is broken and
  violates established cross driver contracts and uapi is really not
  how upstream works.

Now this patch will have a severe performance impact on anything that
runs on multiple engines. So we can't just merge it outright, but need
a bit a plan:

- amdgpu needs a proper uapi for handling implicit fencing. The funny
  thing is that to do it correctly, implicit fencing must be treated
  as a very strange IPC mechanism for transporting fences, where both
  setting the fence and dependency intercepts must be handled
  explicitly. Current best practices is a per-bo flag to indicate
  writes, and a per-bo flag to to skip implicit fencing in the CS
  ioctl as a new chunk.

- Since amdgpu has been shipping with broken behaviour we need an
  opt-out flag from the butchered implicit fencing model to enable the
  proper explicit implicit fencing model.

- for kernel memory fences due to bo moves at least the i915 idea is
  to use ttm_bo->moving. amdgpu probably needs the same.

- since the current p2p dma-buf interface assumes the kernel memory
  fence is in the exclusive dma_resv fence slot we need to add a new
  fence slot for kernel fences, which must never be ignored. Since
  currently only amdgpu supports this there's no real problem here
  yet, until amdgpu gains a NO_IMPLICIT CS flag.

- New userspace needs to ship in enough desktop distros so that users
  wont notice the perf impact. I think we can ignore LTS distros who
  upgrade their kernels but not their mesa3d snapshot.

- Then when this is all in place we can merge this patch here.

What is not a solution to this problem here is trying to make the
dma_resv rules in the kernel more clever. The fundamental issue here
is that the amdgpu CS uapi is the least expressive one across all
drivers (only equalled by panfrost, which has an actual excuse) by not
allowing any userspace control over how implicit sync is conducted.

Until this is fixed it's completely pointless to make the kernel more
clever to improve amdgpu, because all we're doing is papering over
this uapi design issue. amdgpu needs to attain the status quo
established by other drivers first, once that's achieved we can tackle
the remaining issues in a consistent way across drivers.

Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Kristian H. Kristensen <hoegsberg@google.com>
Cc: Michel Dänzer <michel@daenzer.net>
Cc: Daniel Stone <daniels@collabora.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Deepak R Varma <mh12gx2825@gmail.com>
Cc: Chen Li <chenli@uniontech.com>
Cc: Kevin Wang <kevin1.wang@amd.com>
Cc: Dennis Li <Dennis.Li@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Bas Nieuwenhuizen May 21, 2021, 9:46 a.m. UTC | #1
On Fri, May 21, 2021 at 11:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Docs for struct dma_resv are fairly clear:
>
> "A reservation object can have attached one exclusive fence (normally
> associated with write operations) or N shared fences (read
> operations)."
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-objects
>
> Furthermore a review across all of upstream.
>
> First of render drivers and how they set implicit fences:
>
> - nouveau follows this contract, see in validate_fini_no_ticket()
>
>                         nouveau_bo_fence(nvbo, fence, !!b->write_domains);
>
>   and that last boolean controls whether the exclusive or shared fence
>   slot is used.
>
> - radeon follows this contract by setting
>
>                 p->relocs[i].tv.num_shared = !r->write_domain;
>
>   in radeon_cs_parser_relocs(), which ensures that the call to
>   ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the
>   right thing.
>
> - vmwgfx seems to follow this contract with the shotgun approach of
>   always setting ttm_val_buf->num_shared = 0, which means
>   ttm_eu_fence_buffer_objects() will only use the exclusive slot.
>
> - etnaviv follows this contract, as can be trivially seen by looking
>   at submit_attach_object_fences()
>
> - i915 is a bit a convoluted maze with multiple paths leading to
>   i915_vma_move_to_active(). Which sets the exclusive flag if
>   EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for
>   softpin mode, or through the write_domain when using relocations. It
>   follows this contract.
>
> - lima follows this contract, see lima_gem_submit() which sets the
>   exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that
>   bo
>
> - msm follows this contract, see msm_gpu_submit() which sets the
>   exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer
>
> - panfrost follows this contract with the shotgun approach of just
>   always setting the exclusive fence, see
>   panfrost_attach_object_fences(). Benefits of a single engine I guess
>
> - v3d follows this contract with the same shotgun approach in
>   v3d_attach_fences_and_unlock_reservation(), but it has at least an
>   XXX comment that maybe this should be improved
>
> - v4c uses the same shotgun approach of always setting an exclusive
>   fence, see vc4_update_bo_seqnos()
>
> - vgem also follows this contract, see vgem_fence_attach_ioctl() and
>   the VGEM_FENCE_WRITE. This is used in some igts to validate prime
>   sharing with i915.ko without the need of a 2nd gpu
>
> - vritio follows this contract again with the shotgun approach of
>   always setting an exclusive fence, see virtio_gpu_array_add_fence()
>
> This covers the setting of the exclusive fences when writing.
>
> Synchronizing against the exclusive fence is a lot more tricky, and I
> only spot checked a few:
>
> - i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all
>   implicit dependencies (which is used by vulkan)
>
> - etnaviv does this. Implicit dependencies are collected in
>   submit_fence_sync(), again with an opt-out flag
>   ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in
>   etnaviv_sched_dependency which is the
>   drm_sched_backend_ops->dependency callback.
>
> - v4c seems to not do much here, maybe gets away with it by not having
>   a scheduler and only a single engine. Since all newer broadcom chips than
>   the OG vc4 use v3d for rendering, which follows this contract, the
>   impact of this issue is fairly small.
>
> - v3d does this using the drm_gem_fence_array_add_implicit() helper,
>   which then it's drm_sched_backend_ops->dependency callback
>   v3d_job_dependency() picks up.
>
> - panfrost is nice here and tracks the implicit fences in
>   panfrost_job->implicit_fences, which again the
>   drm_sched_backend_ops->dependency callback panfrost_job_dependency()
>   picks up. It is mildly questionable though since it only picks up
>   exclusive fences in panfrost_acquire_object_fences(), but not buggy
>   in practice because it also always sets the exclusive fence. It
>   should pick up both sets of fences, just in case there's ever going
>   to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a
>   pcie port and a real gpu, which might actually happen eventually. A
>   bug, but easy to fix. Should probably use the
>   drm_gem_fence_array_add_implicit() helper.
>
> - lima is nice an easy, uses drm_gem_fence_array_add_implicit() and
>   the same schema as v3d.
>
> - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
>   but because it doesn't use the drm/scheduler it handles fences from
>   the wrong context with a synchronous dma_fence_wait. See
>   submit_fence_sync() leading to msm_gem_sync_object(). Investing into
>   a scheduler might be a good idea.
>
> - all the remaining drivers are ttm based, where I hope they do
>   appropriately obey implicit fences already. I didn't do the full
>   audit there because a) not follow the contract would confuse ttm
>   quite well and b) reading non-standard scheduler and submit code
>   which isn't based on drm/scheduler is a pain.
>
> Onwards to the display side.
>
> - Any driver using the drm_gem_plane_helper_prepare_fb() helper will
>   correctly. Overwhelmingly most drivers get this right, except a few
>   totally dont. I'll follow up with a patch to make this the default
>   and avoid a bunch of bugs.
>
> - I didn't audit the ttm drivers, but given that dma_resv started
>   there I hope they get this right.
>
> In conclusion this IS the contract, both as documented and
> overwhelmingly implemented, specically as implemented by all render
> drivers except amdgpu.
>
> Amdgpu tried to fix this already in
>
> commit 049aca4363d8af87cab8d53de5401602db3b9999
> Author: Christian König <christian.koenig@amd.com>
> Date:   Wed Sep 19 16:54:35 2018 +0200
>
>     drm/amdgpu: fix using shared fence for exported BOs v2
>
> but this fix falls short on a number of areas:
>
> - It's racy, by the time the buffer is shared it might be too late. To
>   make sure there's definitely never a problem we need to set the
>   fences correctly for any buffer that's potentially exportable.
>
> - It's breaking uapi, dma-buf fds support poll() and differentitiate
>   between, which was introduced in
>
>         commit 9b495a5887994a6d74d5c261d012083a92b94738
>         Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>         Date:   Tue Jul 1 12:57:43 2014 +0200
>
>             dma-buf: add poll support, v3
>
> - Christian König wants to nack new uapi building further on this
>   dma_resv contract because it breaks amdgpu, quoting
>
>   "Yeah, and that is exactly the reason why I will NAK this uAPI change.
>
>   "This doesn't works for amdgpu at all for the reasons outlined above."
>
>   https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b729de@gmail.com/
>
>   Rejecting new development because your own driver is broken and
>   violates established cross driver contracts and uapi is really not
>   how upstream works.
>
> Now this patch will have a severe performance impact on anything that
> runs on multiple engines. So we can't just merge it outright, but need
> a bit a plan:
>
> - amdgpu needs a proper uapi for handling implicit fencing. The funny
>   thing is that to do it correctly, implicit fencing must be treated
>   as a very strange IPC mechanism for transporting fences, where both
>   setting the fence and dependency intercepts must be handled
>   explicitly. Current best practices is a per-bo flag to indicate
>   writes, and a per-bo flag to to skip implicit fencing in the CS
>   ioctl as a new chunk.
>
> - Since amdgpu has been shipping with broken behaviour we need an
>   opt-out flag from the butchered implicit fencing model to enable the
>   proper explicit implicit fencing model.
>
> - for kernel memory fences due to bo moves at least the i915 idea is
>   to use ttm_bo->moving. amdgpu probably needs the same.
>
> - since the current p2p dma-buf interface assumes the kernel memory
>   fence is in the exclusive dma_resv fence slot we need to add a new
>   fence slot for kernel fences, which must never be ignored. Since
>   currently only amdgpu supports this there's no real problem here
>   yet, until amdgpu gains a NO_IMPLICIT CS flag.
>
> - New userspace needs to ship in enough desktop distros so that users
>   wont notice the perf impact. I think we can ignore LTS distros who
>   upgrade their kernels but not their mesa3d snapshot.
>
> - Then when this is all in place we can merge this patch here.
>
> What is not a solution to this problem here is trying to make the
> dma_resv rules in the kernel more clever. The fundamental issue here
> is that the amdgpu CS uapi is the least expressive one across all
> drivers (only equalled by panfrost, which has an actual excuse) by not
> allowing any userspace control over how implicit sync is conducted.
>
> Until this is fixed it's completely pointless to make the kernel more
> clever to improve amdgpu, because all we're doing is papering over
> this uapi design issue. amdgpu needs to attain the status quo
> established by other drivers first, once that's achieved we can tackle
> the remaining issues in a consistent way across drivers.
>
> Cc: mesa-dev@lists.freedesktop.org
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Kristian H. Kristensen <hoegsberg@google.com>
> Cc: Michel Dänzer <michel@daenzer.net>
> Cc: Daniel Stone <daniels@collabora.com>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Deepak R Varma <mh12gx2825@gmail.com>
> Cc: Chen Li <chenli@uniontech.com>
> Cc: Kevin Wang <kevin1.wang@amd.com>
> Cc: Dennis Li <Dennis.Li@amd.com>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: linaro-mm-sig@lists.linaro.org
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 88a24a0b5691..cc8426e1e8a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>         amdgpu_bo_list_for_each_entry(e, p->bo_list) {
>                 struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
>
> -               /* Make sure we use the exclusive slot for shared BOs */
> -               if (bo->prime_shared_count)
> +               /* Make sure we use the exclusive slot for all potentially shared BOs */
> +               if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
>                         e->tv.num_shared = 0;

I think it also makes sense to skip this with
AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
anyone expects implicit sync to happen with those.
>                 e->bo_va = amdgpu_vm_bo_find(vm, bo);
>         }
> --
> 2.31.0
>
Christian König May 21, 2021, 11:22 a.m. UTC | #2
Am 21.05.21 um 11:09 schrieb Daniel Vetter:
> Docs for struct dma_resv are fairly clear:
>
> "A reservation object can have attached one exclusive fence (normally
> associated with write operations) or N shared fences (read
> operations)."
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdri.freedesktop.org%2Fdocs%2Fdrm%2Fdriver-api%2Fdma-buf.html%23reservation-objects&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C2cdb7d8e82de40fd452e08d91c383a13%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637571850083203679%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Y5zO4aMKMuQwTCKVk6DrjTIbbRBLrcklrZgNCzNGXGs%3D&amp;reserved=0
>
> Furthermore a review across all of upstream.
>
> First of render drivers and how they set implicit fences:
>
> - nouveau follows this contract, see in validate_fini_no_ticket()
>
> 			nouveau_bo_fence(nvbo, fence, !!b->write_domains);
>
>    and that last boolean controls whether the exclusive or shared fence
>    slot is used.
>
> - radeon follows this contract by setting
>
> 		p->relocs[i].tv.num_shared = !r->write_domain;
>
>    in radeon_cs_parser_relocs(), which ensures that the call to
>    ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the
>    right thing.
>
> - vmwgfx seems to follow this contract with the shotgun approach of
>    always setting ttm_val_buf->num_shared = 0, which means
>    ttm_eu_fence_buffer_objects() will only use the exclusive slot.
>
> - etnaviv follows this contract, as can be trivially seen by looking
>    at submit_attach_object_fences()
>
> - i915 is a bit a convoluted maze with multiple paths leading to
>    i915_vma_move_to_active(). Which sets the exclusive flag if
>    EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for
>    softpin mode, or through the write_domain when using relocations. It
>    follows this contract.
>
> - lima follows this contract, see lima_gem_submit() which sets the
>    exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that
>    bo
>
> - msm follows this contract, see msm_gpu_submit() which sets the
>    exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer
>
> - panfrost follows this contract with the shotgun approach of just
>    always setting the exclusive fence, see
>    panfrost_attach_object_fences(). Benefits of a single engine I guess
>
> - v3d follows this contract with the same shotgun approach in
>    v3d_attach_fences_and_unlock_reservation(), but it has at least an
>    XXX comment that maybe this should be improved
>
> - v4c uses the same shotgun approach of always setting an exclusive
>    fence, see vc4_update_bo_seqnos()
>
> - vgem also follows this contract, see vgem_fence_attach_ioctl() and
>    the VGEM_FENCE_WRITE. This is used in some igts to validate prime
>    sharing with i915.ko without the need of a 2nd gpu
>
> - vritio follows this contract again with the shotgun approach of
>    always setting an exclusive fence, see virtio_gpu_array_add_fence()
>
> This covers the setting of the exclusive fences when writing.
>
> Synchronizing against the exclusive fence is a lot more tricky, and I
> only spot checked a few:
>
> - i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all
>    implicit dependencies (which is used by vulkan)
>
> - etnaviv does this. Implicit dependencies are collected in
>    submit_fence_sync(), again with an opt-out flag
>    ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in
>    etnaviv_sched_dependency which is the
>    drm_sched_backend_ops->dependency callback.
>
> - v4c seems to not do much here, maybe gets away with it by not having
>    a scheduler and only a single engine. Since all newer broadcom chips than
>    the OG vc4 use v3d for rendering, which follows this contract, the
>    impact of this issue is fairly small.
>
> - v3d does this using the drm_gem_fence_array_add_implicit() helper,
>    which then it's drm_sched_backend_ops->dependency callback
>    v3d_job_dependency() picks up.
>
> - panfrost is nice here and tracks the implicit fences in
>    panfrost_job->implicit_fences, which again the
>    drm_sched_backend_ops->dependency callback panfrost_job_dependency()
>    picks up. It is mildly questionable though since it only picks up
>    exclusive fences in panfrost_acquire_object_fences(), but not buggy
>    in practice because it also always sets the exclusive fence. It
>    should pick up both sets of fences, just in case there's ever going
>    to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a
>    pcie port and a real gpu, which might actually happen eventually. A
>    bug, but easy to fix. Should probably use the
>    drm_gem_fence_array_add_implicit() helper.
>
> - lima is nice an easy, uses drm_gem_fence_array_add_implicit() and
>    the same schema as v3d.
>
> - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
>    but because it doesn't use the drm/scheduler it handles fences from
>    the wrong context with a synchronous dma_fence_wait. See
>    submit_fence_sync() leading to msm_gem_sync_object(). Investing into
>    a scheduler might be a good idea.
>
> - all the remaining drivers are ttm based, where I hope they do
>    appropriately obey implicit fences already. I didn't do the full
>    audit there because a) not follow the contract would confuse ttm
>    quite well and b) reading non-standard scheduler and submit code
>    which isn't based on drm/scheduler is a pain.
>
> Onwards to the display side.
>
> - Any driver using the drm_gem_plane_helper_prepare_fb() helper will
>    correctly. Overwhelmingly most drivers get this right, except a few
>    totally dont. I'll follow up with a patch to make this the default
>    and avoid a bunch of bugs.
>
> - I didn't audit the ttm drivers, but given that dma_resv started
>    there I hope they get this right.
>
> In conclusion this IS the contract, both as documented and
> overwhelmingly implemented, specically as implemented by all render
> drivers except amdgpu.
>
> Amdgpu tried to fix this already in
>
> commit 049aca4363d8af87cab8d53de5401602db3b9999
> Author: Christian König <christian.koenig@amd.com>
> Date:   Wed Sep 19 16:54:35 2018 +0200
>
>      drm/amdgpu: fix using shared fence for exported BOs v2
>
> but this fix falls short on a number of areas:
>
> - It's racy, by the time the buffer is shared it might be too late. To
>    make sure there's definitely never a problem we need to set the
>    fences correctly for any buffer that's potentially exportable.
>
> - It's breaking uapi, dma-buf fds support poll() and differentitiate
>    between, which was introduced in
>
> 	commit 9b495a5887994a6d74d5c261d012083a92b94738
> 	Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
> 	Date:   Tue Jul 1 12:57:43 2014 +0200
>
> 	    dma-buf: add poll support, v3
>
> - Christian König wants to nack new uapi building further on this
>    dma_resv contract because it breaks amdgpu, quoting
>
>    "Yeah, and that is exactly the reason why I will NAK this uAPI change.
>
>    "This doesn't works for amdgpu at all for the reasons outlined above."
>
>    https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2Ff2eb6751-2f82-9b23-f57e-548de5b729de%40gmail.com%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C2cdb7d8e82de40fd452e08d91c383a13%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637571850083203679%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=WkQz%2Bdd61XuEw93JOcKx17SQFpNcyMDvvSBgRA9N0U4%3D&amp;reserved=0
>
>    Rejecting new development because your own driver is broken and
>    violates established cross driver contracts and uapi is really not
>    how upstream works.
>
> Now this patch will have a severe performance impact on anything that
> runs on multiple engines. So we can't just merge it outright, but need
> a bit a plan:
>
> - amdgpu needs a proper uapi for handling implicit fencing. The funny
>    thing is that to do it correctly, implicit fencing must be treated
>    as a very strange IPC mechanism for transporting fences, where both
>    setting the fence and dependency intercepts must be handled
>    explicitly. Current best practices is a per-bo flag to indicate
>    writes, and a per-bo flag to to skip implicit fencing in the CS
>    ioctl as a new chunk.
>
> - Since amdgpu has been shipping with broken behaviour we need an
>    opt-out flag from the butchered implicit fencing model to enable the
>    proper explicit implicit fencing model.
>
> - for kernel memory fences due to bo moves at least the i915 idea is
>    to use ttm_bo->moving. amdgpu probably needs the same.
>
> - since the current p2p dma-buf interface assumes the kernel memory
>    fence is in the exclusive dma_resv fence slot we need to add a new
>    fence slot for kernel fences, which must never be ignored. Since
>    currently only amdgpu supports this there's no real problem here
>    yet, until amdgpu gains a NO_IMPLICIT CS flag.
>
> - New userspace needs to ship in enough desktop distros so that users
>    wont notice the perf impact. I think we can ignore LTS distros who
>    upgrade their kernels but not their mesa3d snapshot.
>
> - Then when this is all in place we can merge this patch here.
>
> What is not a solution to this problem here is trying to make the
> dma_resv rules in the kernel more clever. The fundamental issue here
> is that the amdgpu CS uapi is the least expressive one across all
> drivers (only equalled by panfrost, which has an actual excuse) by not
> allowing any userspace control over how implicit sync is conducted.
>
> Until this is fixed it's completely pointless to make the kernel more
> clever to improve amdgpu, because all we're doing is papering over
> this uapi design issue. amdgpu needs to attain the status quo
> established by other drivers first, once that's achieved we can tackle
> the remaining issues in a consistent way across drivers.
>
> Cc: mesa-dev@lists.freedesktop.org
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Kristian H. Kristensen <hoegsberg@google.com>
> Cc: Michel Dänzer <michel@daenzer.net>
> Cc: Daniel Stone <daniels@collabora.com>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Deepak R Varma <mh12gx2825@gmail.com>
> Cc: Chen Li <chenli@uniontech.com>
> Cc: Kevin Wang <kevin1.wang@amd.com>
> Cc: Dennis Li <Dennis.Li@amd.com>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: linaro-mm-sig@lists.linaro.org
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

And as explained before this is a general NAK.

I'm not discussing this further until we have fixed the dma_resv rules 
for implicit synchronization since this will just result in every 
command submission serializing all accesses to BOs which is certainly 
not what we want.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 88a24a0b5691..cc8426e1e8a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>   	amdgpu_bo_list_for_each_entry(e, p->bo_list) {
>   		struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
>   
> -		/* Make sure we use the exclusive slot for shared BOs */
> -		if (bo->prime_shared_count)
> +		/* Make sure we use the exclusive slot for all potentially shared BOs */
> +		if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
>   			e->tv.num_shared = 0;
>   		e->bo_va = amdgpu_vm_bo_find(vm, bo);
>   	}
Daniel Vetter May 21, 2021, 2:37 p.m. UTC | #3
On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
> On Fri, May 21, 2021 at 11:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > Docs for struct dma_resv are fairly clear:
> >
> > "A reservation object can have attached one exclusive fence (normally
> > associated with write operations) or N shared fences (read
> > operations)."
> >
> > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-objects
> >
> > Furthermore a review across all of upstream.
> >
> > First of render drivers and how they set implicit fences:
> >
> > - nouveau follows this contract, see in validate_fini_no_ticket()
> >
> >                         nouveau_bo_fence(nvbo, fence, !!b->write_domains);
> >
> >   and that last boolean controls whether the exclusive or shared fence
> >   slot is used.
> >
> > - radeon follows this contract by setting
> >
> >                 p->relocs[i].tv.num_shared = !r->write_domain;
> >
> >   in radeon_cs_parser_relocs(), which ensures that the call to
> >   ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the
> >   right thing.
> >
> > - vmwgfx seems to follow this contract with the shotgun approach of
> >   always setting ttm_val_buf->num_shared = 0, which means
> >   ttm_eu_fence_buffer_objects() will only use the exclusive slot.
> >
> > - etnaviv follows this contract, as can be trivially seen by looking
> >   at submit_attach_object_fences()
> >
> > - i915 is a bit a convoluted maze with multiple paths leading to
> >   i915_vma_move_to_active(). Which sets the exclusive flag if
> >   EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for
> >   softpin mode, or through the write_domain when using relocations. It
> >   follows this contract.
> >
> > - lima follows this contract, see lima_gem_submit() which sets the
> >   exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that
> >   bo
> >
> > - msm follows this contract, see msm_gpu_submit() which sets the
> >   exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer
> >
> > - panfrost follows this contract with the shotgun approach of just
> >   always setting the exclusive fence, see
> >   panfrost_attach_object_fences(). Benefits of a single engine I guess
> >
> > - v3d follows this contract with the same shotgun approach in
> >   v3d_attach_fences_and_unlock_reservation(), but it has at least an
> >   XXX comment that maybe this should be improved
> >
> > - v4c uses the same shotgun approach of always setting an exclusive
> >   fence, see vc4_update_bo_seqnos()
> >
> > - vgem also follows this contract, see vgem_fence_attach_ioctl() and
> >   the VGEM_FENCE_WRITE. This is used in some igts to validate prime
> >   sharing with i915.ko without the need of a 2nd gpu
> >
> > - vritio follows this contract again with the shotgun approach of
> >   always setting an exclusive fence, see virtio_gpu_array_add_fence()
> >
> > This covers the setting of the exclusive fences when writing.
> >
> > Synchronizing against the exclusive fence is a lot more tricky, and I
> > only spot checked a few:
> >
> > - i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all
> >   implicit dependencies (which is used by vulkan)
> >
> > - etnaviv does this. Implicit dependencies are collected in
> >   submit_fence_sync(), again with an opt-out flag
> >   ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in
> >   etnaviv_sched_dependency which is the
> >   drm_sched_backend_ops->dependency callback.
> >
> > - v4c seems to not do much here, maybe gets away with it by not having
> >   a scheduler and only a single engine. Since all newer broadcom chips than
> >   the OG vc4 use v3d for rendering, which follows this contract, the
> >   impact of this issue is fairly small.
> >
> > - v3d does this using the drm_gem_fence_array_add_implicit() helper,
> >   which then it's drm_sched_backend_ops->dependency callback
> >   v3d_job_dependency() picks up.
> >
> > - panfrost is nice here and tracks the implicit fences in
> >   panfrost_job->implicit_fences, which again the
> >   drm_sched_backend_ops->dependency callback panfrost_job_dependency()
> >   picks up. It is mildly questionable though since it only picks up
> >   exclusive fences in panfrost_acquire_object_fences(), but not buggy
> >   in practice because it also always sets the exclusive fence. It
> >   should pick up both sets of fences, just in case there's ever going
> >   to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a
> >   pcie port and a real gpu, which might actually happen eventually. A
> >   bug, but easy to fix. Should probably use the
> >   drm_gem_fence_array_add_implicit() helper.
> >
> > - lima is nice an easy, uses drm_gem_fence_array_add_implicit() and
> >   the same schema as v3d.
> >
> > - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
> >   but because it doesn't use the drm/scheduler it handles fences from
> >   the wrong context with a synchronous dma_fence_wait. See
> >   submit_fence_sync() leading to msm_gem_sync_object(). Investing into
> >   a scheduler might be a good idea.
> >
> > - all the remaining drivers are ttm based, where I hope they do
> >   appropriately obey implicit fences already. I didn't do the full
> >   audit there because a) not follow the contract would confuse ttm
> >   quite well and b) reading non-standard scheduler and submit code
> >   which isn't based on drm/scheduler is a pain.
> >
> > Onwards to the display side.
> >
> > - Any driver using the drm_gem_plane_helper_prepare_fb() helper will
> >   correctly. Overwhelmingly most drivers get this right, except a few
> >   totally dont. I'll follow up with a patch to make this the default
> >   and avoid a bunch of bugs.
> >
> > - I didn't audit the ttm drivers, but given that dma_resv started
> >   there I hope they get this right.
> >
> > In conclusion this IS the contract, both as documented and
> > overwhelmingly implemented, specically as implemented by all render
> > drivers except amdgpu.
> >
> > Amdgpu tried to fix this already in
> >
> > commit 049aca4363d8af87cab8d53de5401602db3b9999
> > Author: Christian König <christian.koenig@amd.com>
> > Date:   Wed Sep 19 16:54:35 2018 +0200
> >
> >     drm/amdgpu: fix using shared fence for exported BOs v2
> >
> > but this fix falls short on a number of areas:
> >
> > - It's racy, by the time the buffer is shared it might be too late. To
> >   make sure there's definitely never a problem we need to set the
> >   fences correctly for any buffer that's potentially exportable.
> >
> > - It's breaking uapi, dma-buf fds support poll() and differentitiate
> >   between, which was introduced in
> >
> >         commit 9b495a5887994a6d74d5c261d012083a92b94738
> >         Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
> >         Date:   Tue Jul 1 12:57:43 2014 +0200
> >
> >             dma-buf: add poll support, v3
> >
> > - Christian König wants to nack new uapi building further on this
> >   dma_resv contract because it breaks amdgpu, quoting
> >
> >   "Yeah, and that is exactly the reason why I will NAK this uAPI change.
> >
> >   "This doesn't works for amdgpu at all for the reasons outlined above."
> >
> >   https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b729de@gmail.com/
> >
> >   Rejecting new development because your own driver is broken and
> >   violates established cross driver contracts and uapi is really not
> >   how upstream works.
> >
> > Now this patch will have a severe performance impact on anything that
> > runs on multiple engines. So we can't just merge it outright, but need
> > a bit a plan:
> >
> > - amdgpu needs a proper uapi for handling implicit fencing. The funny
> >   thing is that to do it correctly, implicit fencing must be treated
> >   as a very strange IPC mechanism for transporting fences, where both
> >   setting the fence and dependency intercepts must be handled
> >   explicitly. Current best practices is a per-bo flag to indicate
> >   writes, and a per-bo flag to to skip implicit fencing in the CS
> >   ioctl as a new chunk.
> >
> > - Since amdgpu has been shipping with broken behaviour we need an
> >   opt-out flag from the butchered implicit fencing model to enable the
> >   proper explicit implicit fencing model.
> >
> > - for kernel memory fences due to bo moves at least the i915 idea is
> >   to use ttm_bo->moving. amdgpu probably needs the same.
> >
> > - since the current p2p dma-buf interface assumes the kernel memory
> >   fence is in the exclusive dma_resv fence slot we need to add a new
> >   fence slot for kernel fences, which must never be ignored. Since
> >   currently only amdgpu supports this there's no real problem here
> >   yet, until amdgpu gains a NO_IMPLICIT CS flag.
> >
> > - New userspace needs to ship in enough desktop distros so that users
> >   wont notice the perf impact. I think we can ignore LTS distros who
> >   upgrade their kernels but not their mesa3d snapshot.
> >
> > - Then when this is all in place we can merge this patch here.
> >
> > What is not a solution to this problem here is trying to make the
> > dma_resv rules in the kernel more clever. The fundamental issue here
> > is that the amdgpu CS uapi is the least expressive one across all
> > drivers (only equalled by panfrost, which has an actual excuse) by not
> > allowing any userspace control over how implicit sync is conducted.
> >
> > Until this is fixed it's completely pointless to make the kernel more
> > clever to improve amdgpu, because all we're doing is papering over
> > this uapi design issue. amdgpu needs to attain the status quo
> > established by other drivers first, once that's achieved we can tackle
> > the remaining issues in a consistent way across drivers.
> >
> > Cc: mesa-dev@lists.freedesktop.org
> > Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> > Cc: Dave Airlie <airlied@gmail.com>
> > Cc: Rob Clark <robdclark@chromium.org>
> > Cc: Kristian H. Kristensen <hoegsberg@google.com>
> > Cc: Michel Dänzer <michel@daenzer.net>
> > Cc: Daniel Stone <daniels@collabora.com>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: "Christian König" <christian.koenig@amd.com>
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: Deepak R Varma <mh12gx2825@gmail.com>
> > Cc: Chen Li <chenli@uniontech.com>
> > Cc: Kevin Wang <kevin1.wang@amd.com>
> > Cc: Dennis Li <Dennis.Li@amd.com>
> > Cc: Luben Tuikov <luben.tuikov@amd.com>
> > Cc: linaro-mm-sig@lists.linaro.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 88a24a0b5691..cc8426e1e8a8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> >         amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> >                 struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> >
> > -               /* Make sure we use the exclusive slot for shared BOs */
> > -               if (bo->prime_shared_count)
> > +               /* Make sure we use the exclusive slot for all potentially shared BOs */
> > +               if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
> >                         e->tv.num_shared = 0;
> 
> I think it also makes sense to skip this with
> AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
> anyone expects implicit sync to happen with those.

Ah yes, I missed this entirely. So the "no implicit flag" is already
there, and the _only_ thing that's missing really is a way to fish out the
implicit fences, and set them.

https://lore.kernel.org/dri-devel/20210520190007.534046-1-jason@jlekstrand.net/

So I think all that's really needed in radv is not setting
RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
are present (means you need to do some import/export and keep the fd
around for winsys buffers, but shouldn't be too bad), and then control the
implicit fences entirely explicitly like vk expects.

Are you bored enough to type this up for radv? I'll give Jason's kernel
stuff another review meanwhile.
-Daniel

> >                 e->bo_va = amdgpu_vm_bo_find(vm, bo);
> >         }
> > --
> > 2.31.0
> >
Daniel Vetter May 21, 2021, 2:58 p.m. UTC | #4
On Fri, May 21, 2021 at 07:58:57AM -0700, Rob Clark wrote:
> On Fri, May 21, 2021 at 2:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
> >   but because it doesn't use the drm/scheduler it handles fences from
> >   the wrong context with a synchronous dma_fence_wait. See
> >   submit_fence_sync() leading to msm_gem_sync_object(). Investing into
> >   a scheduler might be a good idea.
> 
> Yeah, drm/scheduler is (along with a lot of other things) on the TODO
> list.  But this isn't quite as bad as it sounds because userspace uses
> a u_queue thread to call the submit ioctl rather than blocking the
> driver.  (It also offloads some other work from the driver thread,
> like submit merging to reduce # of ioctls.  Coincidentally that
> arrangement was a step towards preparing userspace for some
> hypothetical non-ioctl uapi ;-))

You're also holding a pile of locks, which I expect latest with
multi-engine buffer sharing will be pain. If you push this to the
scheduler then the locks aren't held. Or maybe I've misread the flow, it's
become all a bit a blurr after all these drivers :-)

> OTOH it would be good to move blocking until the system can free
> enough pages to repin bo's out of the ioctl path to better handle some
> memory pressure corner cases without having to be interruptable over a
> lot more of the submit path..  Running chrome+android on devices
> without a lot of memory is fun..

Uh that one needs the userspace thread. Or entirely different semantics of
your ioctl, because you're not allowed to allocate memory once any
dma_fence is visible. So offloading the entire pinning to a submit thread
is no-go.
-Daniel
Rob Clark May 21, 2021, 2:58 p.m. UTC | #5
On Fri, May 21, 2021 at 2:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
>   but because it doesn't use the drm/scheduler it handles fences from
>   the wrong context with a synchronous dma_fence_wait. See
>   submit_fence_sync() leading to msm_gem_sync_object(). Investing into
>   a scheduler might be a good idea.

Yeah, drm/scheduler is (along with a lot of other things) on the TODO
list.  But this isn't quite as bad as it sounds because userspace uses
a u_queue thread to call the submit ioctl rather than blocking the
driver.  (It also offloads some other work from the driver thread,
like submit merging to reduce # of ioctls.  Coincidentally that
arrangement was a step towards preparing userspace for some
hypothetical non-ioctl uapi ;-))

OTOH it would be good to move blocking until the system can free
enough pages to repin bo's out of the ioctl path to better handle some
memory pressure corner cases without having to be interruptable over a
lot more of the submit path..  Running chrome+android on devices
without a lot of memory is fun..

BR,
-R
Bas Nieuwenhuizen May 21, 2021, 3 p.m. UTC | #6
On Fri, May 21, 2021 at 4:37 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
> > On Fri, May 21, 2021 at 11:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > index 88a24a0b5691..cc8426e1e8a8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> > >         amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> > >                 struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> > >
> > > -               /* Make sure we use the exclusive slot for shared BOs */
> > > -               if (bo->prime_shared_count)
> > > +               /* Make sure we use the exclusive slot for all potentially shared BOs */
> > > +               if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
> > >                         e->tv.num_shared = 0;
> >
> > I think it also makes sense to skip this with
> > AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
> > anyone expects implicit sync to happen with those.
>
> Ah yes, I missed this entirely. So the "no implicit flag" is already
> there, and the _only_ thing that's missing really is a way to fish out the
> implicit fences, and set them.
>
> https://lore.kernel.org/dri-devel/20210520190007.534046-1-jason@jlekstrand.net/
>
> So I think all that's really needed in radv is not setting
> RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
> are present (means you need to do some import/export and keep the fd
> around for winsys buffers, but shouldn't be too bad), and then control the
> implicit fences entirely explicitly like vk expects.

That is the part I'm less sure about. This is a BO wide flag so we are
also disabling implicit sync in the compositor. If the compositor does
only do read stuff that is ok, as the inserted exclusive fence will
work for that. But as I learned recently the app provided buffer may
end up being written to by the X server which open a whole can of
potential problems if implicit sync gets disabled between Xserver
operations on the app provided buffer. Hence setting that on the WSI
buffer is a whole new can of potential problems and hence I've said a
submission based flag would be preferred.

I can certainly try it out though.

>
> Are you bored enough to type this up for radv? I'll give Jason's kernel
> stuff another review meanwhile.
> -Daniel
>
> > >                 e->bo_va = amdgpu_vm_bo_find(vm, bo);
> > >         }
> > > --
> > > 2.31.0
> > >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter May 21, 2021, 3:16 p.m. UTC | #7
On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
> On Fri, May 21, 2021 at 4:37 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
> > > On Fri, May 21, 2021 at 11:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > > index 88a24a0b5691..cc8426e1e8a8 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > > @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> > > >         amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> > > >                 struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> > > >
> > > > -               /* Make sure we use the exclusive slot for shared BOs */
> > > > -               if (bo->prime_shared_count)
> > > > +               /* Make sure we use the exclusive slot for all potentially shared BOs */
> > > > +               if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
> > > >                         e->tv.num_shared = 0;
> > >
> > > I think it also makes sense to skip this with
> > > AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
> > > anyone expects implicit sync to happen with those.
> >
> > Ah yes, I missed this entirely. So the "no implicit flag" is already
> > there, and the _only_ thing that's missing really is a way to fish out the
> > implicit fences, and set them.
> >
> > https://lore.kernel.org/dri-devel/20210520190007.534046-1-jason@jlekstrand.net/
> >
> > So I think all that's really needed in radv is not setting
> > RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
> > are present (means you need to do some import/export and keep the fd
> > around for winsys buffers, but shouldn't be too bad), and then control the
> > implicit fences entirely explicitly like vk expects.
> 
> That is the part I'm less sure about. This is a BO wide flag so we are
> also disabling implicit sync in the compositor. If the compositor does
> only do read stuff that is ok, as the inserted exclusive fence will
> work for that. But as I learned recently the app provided buffer may
> end up being written to by the X server which open a whole can of
> potential problems if implicit sync gets disabled between Xserver
> operations on the app provided buffer. Hence setting that on the WSI
> buffer is a whole new can of potential problems and hence I've said a
> submission based flag would be preferred.
> 
> I can certainly try it out though.

Hm yeah that's the wrong flag. We need a flag on the drm_file which the
explicit userspace sets. And which is valid only for itself.

There's a nice flags field when creating a ctx, but it's not validated and
there's already a comment that we have to filter out garbage priority, so
that's not use. I'll whip up something entirely untested just as a draft.
-Daniel



> 
> >
> > Are you bored enough to type this up for radv? I'll give Jason's kernel
> > stuff another review meanwhile.
> > -Daniel
> >
> > > >                 e->bo_va = amdgpu_vm_bo_find(vm, bo);
> > > >         }
> > > > --
> > > > 2.31.0
> > > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
Christian König May 21, 2021, 6:08 p.m. UTC | #8
Am 21.05.21 um 17:16 schrieb Daniel Vetter:
> On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
>> On Fri, May 21, 2021 at 4:37 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
>>>> On Fri, May 21, 2021 at 11:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
>>>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> index 88a24a0b5691..cc8426e1e8a8 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>>>>>          amdgpu_bo_list_for_each_entry(e, p->bo_list) {
>>>>>                  struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
>>>>>
>>>>> -               /* Make sure we use the exclusive slot for shared BOs */
>>>>> -               if (bo->prime_shared_count)
>>>>> +               /* Make sure we use the exclusive slot for all potentially shared BOs */
>>>>> +               if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
>>>>>                          e->tv.num_shared = 0;
>>>> I think it also makes sense to skip this with
>>>> AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
>>>> anyone expects implicit sync to happen with those.
>>> Ah yes, I missed this entirely. So the "no implicit flag" is already
>>> there, and the _only_ thing that's missing really is a way to fish out the
>>> implicit fences, and set them.
>>>
>>> https://lore.kernel.org/dri-devel/20210520190007.534046-1-jason@jlekstrand.net/
>>>
>>> So I think all that's really needed in radv is not setting
>>> RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
>>> are present (means you need to do some import/export and keep the fd
>>> around for winsys buffers, but shouldn't be too bad), and then control the
>>> implicit fences entirely explicitly like vk expects.
>> That is the part I'm less sure about. This is a BO wide flag so we are
>> also disabling implicit sync in the compositor. If the compositor does
>> only do read stuff that is ok, as the inserted exclusive fence will
>> work for that. But as I learned recently the app provided buffer may
>> end up being written to by the X server which open a whole can of
>> potential problems if implicit sync gets disabled between Xserver
>> operations on the app provided buffer. Hence setting that on the WSI
>> buffer is a whole new can of potential problems and hence I've said a
>> submission based flag would be preferred.
>>
>> I can certainly try it out though.
> Hm yeah that's the wrong flag. We need a flag on the drm_file which the
> explicit userspace sets. And which is valid only for itself.
>
> There's a nice flags field when creating a ctx, but it's not validated and
> there's already a comment that we have to filter out garbage priority, so
> that's not use. I'll whip up something entirely untested just as a draft.

We could provide an IOCTL for the BO to change the flag.

But could we first figure out the semantics we want to use here?

Cause I'm pretty sure we don't actually need those changes at all and as 
said before I'm certainly NAKing things which break existing use cases.

Regards,
Christian.

> -Daniel
>
>
>
>>> Are you bored enough to type this up for radv? I'll give Jason's kernel
>>> stuff another review meanwhile.
>>> -Daniel
>>>
>>>>>                  e->bo_va = amdgpu_vm_bo_find(vm, bo);
>>>>>          }
>>>>> --
>>>>> 2.31.0
>>>>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
Daniel Vetter May 21, 2021, 6:31 p.m. UTC | #9
On Fri, May 21, 2021 at 8:08 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 21.05.21 um 17:16 schrieb Daniel Vetter:
> > On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
> >> On Fri, May 21, 2021 at 4:37 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >>> On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
> >>>> On Fri, May 21, 2021 at 11:10 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >>>>> ---
> >>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
> >>>>>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>>>> index 88a24a0b5691..cc8426e1e8a8 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>>>> @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> >>>>>          amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> >>>>>                  struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> >>>>>
> >>>>> -               /* Make sure we use the exclusive slot for shared BOs */
> >>>>> -               if (bo->prime_shared_count)
> >>>>> +               /* Make sure we use the exclusive slot for all potentially shared BOs */
> >>>>> +               if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
> >>>>>                          e->tv.num_shared = 0;
> >>>> I think it also makes sense to skip this with
> >>>> AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
> >>>> anyone expects implicit sync to happen with those.
> >>> Ah yes, I missed this entirely. So the "no implicit flag" is already
> >>> there, and the _only_ thing that's missing really is a way to fish out the
> >>> implicit fences, and set them.
> >>>
> >>> https://lore.kernel.org/dri-devel/20210520190007.534046-1-jason@jlekstrand.net/
> >>>
> >>> So I think all that's really needed in radv is not setting
> >>> RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
> >>> are present (means you need to do some import/export and keep the fd
> >>> around for winsys buffers, but shouldn't be too bad), and then control the
> >>> implicit fences entirely explicitly like vk expects.
> >> That is the part I'm less sure about. This is a BO wide flag so we are
> >> also disabling implicit sync in the compositor. If the compositor does
> >> only do read stuff that is ok, as the inserted exclusive fence will
> >> work for that. But as I learned recently the app provided buffer may
> >> end up being written to by the X server which open a whole can of
> >> potential problems if implicit sync gets disabled between Xserver
> >> operations on the app provided buffer. Hence setting that on the WSI
> >> buffer is a whole new can of potential problems and hence I've said a
> >> submission based flag would be preferred.
> >>
> >> I can certainly try it out though.
> > Hm yeah that's the wrong flag. We need a flag on the drm_file which the
> > explicit userspace sets. And which is valid only for itself.
> >
> > There's a nice flags field when creating a ctx, but it's not validated and
> > there's already a comment that we have to filter out garbage priority, so
> > that's not use. I'll whip up something entirely untested just as a draft.
>
> We could provide an IOCTL for the BO to change the flag.

That's not the semantics we need.

> But could we first figure out the semantics we want to use here?
>
> Cause I'm pretty sure we don't actually need those changes at all and as
> said before I'm certainly NAKing things which break existing use cases.

Please read how other drivers do this and at least _try_ to understand
it. I'm really loosing my patience here with you NAKing patches you're
not even understanding (or did you actually read and fully understand
the entire story I typed up here, and your NAK is on the entire
thing?). There's not much useful conversation to be had with that
approach. And with drivers I mean kernel + userspace here.

That's the other frustration part: You're trying to fix this purely in
the kernel. This is exactly one of these issues why we require open
source userspace, so that we can fix the issues correctly across the
entire stack. And meanwhile you're steadfastily refusing to even look
at that the userspace side of the picture.

Also I thought through your tlb issue, why are you even putting these
tlb flush fences into the shard dma_resv slots? If you store them
somewhere else in the amdgpu private part, the oversync issues goes
away
- in your ttm bo move callback, you can just make your bo copy job
depend on them too (you have to anyway)
- even for p2p there's not an issue here, because you have the
->move_notify callback, and can then lift the tlb flush fences from
your private place to the shared slots so the exporter can see them.

The kernel move fences otoh are a bit more nasty to wring through the
p2p dma-buf interface. That one probably needs something new.
-Daniel

>
> Regards,
> Christian.
>
> > -Daniel
> >
> >
> >
> >>> Are you bored enough to type this up for radv? I'll give Jason's kernel
> >>> stuff another review meanwhile.
> >>> -Daniel
> >>>
> >>>>>                  e->bo_va = amdgpu_vm_bo_find(vm, bo);
> >>>>>          }
> >>>>> --
> >>>>> 2.31.0
> >>>>>
> >>> --
> >>> Daniel Vetter
> >>> Software Engineer, Intel Corporation
> >>> http://blog.ffwll.ch
>
Christian König May 22, 2021, 8:30 a.m. UTC | #10
Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> [SNIP]
>> We could provide an IOCTL for the BO to change the flag.
> That's not the semantics we need.
>
>> But could we first figure out the semantics we want to use here?
>>
>> Cause I'm pretty sure we don't actually need those changes at all and as
>> said before I'm certainly NAKing things which break existing use cases.
> Please read how other drivers do this and at least _try_ to understand
> it. I'm really loosing my patience here with you NAKing patches you're
> not even understanding (or did you actually read and fully understand
> the entire story I typed up here, and your NAK is on the entire
> thing?). There's not much useful conversation to be had with that
> approach. And with drivers I mean kernel + userspace here.

Well to be honest I did fully read that, but I was just to emotionally 
attached to answer more appropriately in that moment.

And I'm sorry that I react emotional on that, but it is really 
frustrating that I'm not able to convince you that we have a major 
problem which affects all drivers and not just amdgpu.

Regarding the reason why I'm NAKing this particular patch, you are 
breaking existing uAPI for RADV with that. And as a maintainer of the 
driver I have simply no other choice than saying halt, stop we can't do 
it like this.

I'm perfectly aware that I've some holes in the understanding of how ANV 
or other Vulkan/OpenGL stacks work. But you should probably also admit 
that you have some holes how amdgpu works or otherwise I can't imagine 
why you suggest a patch which simply breaks RADV.

I mean we are working together for years now and I think you know me 
pretty well, do you really think I scream bloody hell we can't do this 
without a good reason?

So let's stop throwing halve backed solutions at each other and discuss 
what we can do to solve the different problems we are both seeing here.

> That's the other frustration part: You're trying to fix this purely in
> the kernel. This is exactly one of these issues why we require open
> source userspace, so that we can fix the issues correctly across the
> entire stack. And meanwhile you're steadfastily refusing to even look
> at that the userspace side of the picture.

Well I do fully understand the userspace side of the picture for the AMD 
stack. I just don't think we should give userspace that much control 
over the fences in the dma_resv object without untangling them from 
resource management.

And RADV is exercising exclusive sync for amdgpu already. You can do 
submission to both the GFX, Compute and SDMA queues in Vulkan and those 
currently won't over-synchronize.

When you then send a texture generated by multiple engines to the 
Compositor the kernel will correctly inserts waits for all submissions 
of the other process.

So this already works for RADV and completely without the IOCTL Jason 
proposed. IIRC we also have unit tests which exercised that feature for 
the video decoding use case long before RADV even existed.

And yes I have to admit that I haven't thought about interaction with 
other drivers when I came up with this because the rules of that 
interaction wasn't clear to me at that time.

> Also I thought through your tlb issue, why are you even putting these
> tlb flush fences into the shard dma_resv slots? If you store them
> somewhere else in the amdgpu private part, the oversync issues goes
> away
> - in your ttm bo move callback, you can just make your bo copy job
> depend on them too (you have to anyway)
> - even for p2p there's not an issue here, because you have the
> ->move_notify callback, and can then lift the tlb flush fences from
> your private place to the shared slots so the exporter can see them.

Because adding a shared fence requires that this shared fence signals 
after the exclusive fence. And this is a perfect example to explain why 
this is so problematic and also why why we currently stumble over that 
only in amdgpu.

In TTM we have a feature which allows evictions to be pipelined and 
don't wait for the evicting DMA operation. Without that driver will 
stall waiting for their allocations to finish when we need to allocate 
memory.

For certain use cases this gives you a ~20% fps increase under memory 
pressure, so it is a really important feature.

This works by adding the fence of the last eviction DMA operation to BOs 
when their backing store is newly allocated. That's what the 
ttm_bo_add_move_fence() function you stumbled over is good for: 
https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692

Now the problem is it is possible that the application is terminated 
before it can complete it's command submission. But since resource 
management only waits for the shared fences when there are some there is 
a chance that we free up memory while it is still in use.

Because of this we have some rather crude workarounds in amdgpu. For 
example IIRC we manual wait for any potential exclusive fence before 
freeing memory.

We could enable this feature for radeon and nouveau as well with an one 
line change. But that would mean we need to maintain the workarounds for 
shortcomings of the dma_resv object design in those drivers as well.

To summarize I think that adding an unbound fence to protect an object 
is a perfectly valid operation for resource management, but this is 
restricted by the needs of implicit sync at the moment.

> The kernel move fences otoh are a bit more nasty to wring through the
> p2p dma-buf interface. That one probably needs something new.

Well the p2p interface are my least concern.

Adding the move fence means that you need to touch every place we do CS 
or page flip since you now have something which is parallel to the 
explicit sync fence.

Otherwise having the move fence separately wouldn't make much sense in 
the first place if we always set it together with the exclusive fence.

Best regards and sorry for getting on your nerves so much,
Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> -Daniel
>>>
>>>
>>>
>>>>> Are you bored enough to type this up for radv? I'll give Jason's kernel
>>>>> stuff another review meanwhile.
>>>>> -Daniel
>>>>>
>>>>>>>                   e->bo_va = amdgpu_vm_bo_find(vm, bo);
>>>>>>>           }
>>>>>>> --
>>>>>>> 2.31.0
>>>>>>>
>>>>> --
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7Cf0852f38c85046ca877908d91c86a719%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637572186953277692%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Vgz%2FkXFH4CD6ktZBnxnXFhHTG5tHhN1%2BDyf7pmxak6c%3D&amp;reserved=0
>
Daniel Vetter May 25, 2021, 1:05 p.m. UTC | #11
Hi Christian,

On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> > [SNIP]
> > > We could provide an IOCTL for the BO to change the flag.
> > That's not the semantics we need.
> > 
> > > But could we first figure out the semantics we want to use here?
> > > 
> > > Cause I'm pretty sure we don't actually need those changes at all and as
> > > said before I'm certainly NAKing things which break existing use cases.
> > Please read how other drivers do this and at least _try_ to understand
> > it. I'm really loosing my patience here with you NAKing patches you're
> > not even understanding (or did you actually read and fully understand
> > the entire story I typed up here, and your NAK is on the entire
> > thing?). There's not much useful conversation to be had with that
> > approach. And with drivers I mean kernel + userspace here.
> 
> Well to be honest I did fully read that, but I was just to emotionally
> attached to answer more appropriately in that moment.
> 
> And I'm sorry that I react emotional on that, but it is really frustrating
> that I'm not able to convince you that we have a major problem which affects
> all drivers and not just amdgpu.
> 
> Regarding the reason why I'm NAKing this particular patch, you are breaking
> existing uAPI for RADV with that. And as a maintainer of the driver I have
> simply no other choice than saying halt, stop we can't do it like this.
> 
> I'm perfectly aware that I've some holes in the understanding of how ANV or
> other Vulkan/OpenGL stacks work. But you should probably also admit that you
> have some holes how amdgpu works or otherwise I can't imagine why you
> suggest a patch which simply breaks RADV.
> 
> I mean we are working together for years now and I think you know me pretty
> well, do you really think I scream bloody hell we can't do this without a
> good reason?
> 
> So let's stop throwing halve backed solutions at each other and discuss what
> we can do to solve the different problems we are both seeing here.

Well this was meant to be a goal post/semantics discussion starter. Yes
the patch breaks performance (but not correctness) for amdgpu, but it also
contains my suggestion for how to fix that issue (in text form at least).

Plus a plan how to roll it out so that anyone who cares doesn't hit the
perf issues this patch can cause.

Also the overall series is really meant as a subsystem wide assessment of
the status quo. Aside from a correctness issue Lucas spotted in my panfrost
patches no substantial input from others on this yet unfortunately. I need
to poke more people I think.

Anyway since the plan as a text didn't stick I'm typing up now something
more detailed in form of amdgpu patches. Maybe Bas can do the radv
conversion too.

It won't be complete by far either (I'm not working for amd after all
...), I'll leave out the opengl/media side entirely. But if this works for
radv is should be a useful blueprint for gl/media too (with some changes
in the interfaces, but not really the exposed semantics).

> > That's the other frustration part: You're trying to fix this purely in
> > the kernel. This is exactly one of these issues why we require open
> > source userspace, so that we can fix the issues correctly across the
> > entire stack. And meanwhile you're steadfastily refusing to even look
> > at that the userspace side of the picture.
> 
> Well I do fully understand the userspace side of the picture for the AMD
> stack. I just don't think we should give userspace that much control over
> the fences in the dma_resv object without untangling them from resource
> management.
> 
> And RADV is exercising exclusive sync for amdgpu already. You can do
> submission to both the GFX, Compute and SDMA queues in Vulkan and those
> currently won't over-synchronize.
> 
> When you then send a texture generated by multiple engines to the Compositor
> the kernel will correctly inserts waits for all submissions of the other
> process.
> 
> So this already works for RADV and completely without the IOCTL Jason
> proposed. IIRC we also have unit tests which exercised that feature for the
> video decoding use case long before RADV even existed.

Yeah multiple engines on the producer side works fine with the current
scheme you have (if we ignore for now that the way amdgpu uses dma_resv is
different from other drivers by not setting the exclusive slots for
producers).

Where it breaks down is you have overlapping reads once a frame is
generated, on either side. E.g.

- compositors read the buffer as consumer
- but also producer reads the buffer again (maybe reference frame for
  media, or maybe for some post-processing like motion blurr or whatever).

Then your current scheme really badly oversyncs, while other drivers
don't have this issue. Fixing this for amdgpu will have quite a big impact
on how dma_resv will be used by amdgpu, and that's why I think before
we've looked at this it doesn't make sense to look at changing dma_resv in
big ways.

I do think the AMDGPU_SYNC_NE_OWNER trick is pretty neat, and should help
with some very non-explicit userspace (gl, maybe also media but at least
ours is internally moving to a more explicit model I think) when you split
the workload over multiple engines. For vulkan I think the right model is
AMDGPU_SYNC_EXPLICIT, plus something like what Jason has.

Except there's going to be quite some twist, but I think that's best
explained in patches since text clearly doesn't work well.

> And yes I have to admit that I haven't thought about interaction with other
> drivers when I came up with this because the rules of that interaction
> wasn't clear to me at that time.

Ok I think we're at least distilling the core of our disagreement here,
that's some progress:

- my take: to make implicit sync work well and avoid oversync issues,
  userspace, and currently amgpu lacks these uapis so needs to get those
  added.

- your take: please lets not give so much control to userspace

Tbh I'm not yet fully understanding your concern here, but I do agree that
there's a lot of things that need to be carefully audited here, at least
in amdgpu. From my driver wide audit I do think in general playing clever
tricks is ok, since drivers treat the exclusive fence as just a special
type of shared fence and don't e.g. ignore the shard fences if an
exclusive one is present. For the generic helper version of this see
drm_gem_fence_array_add_implicit(), but afaiui amdgpu works the same (or
at least similar enough), as do other drivers playing funny games.

> > Also I thought through your tlb issue, why are you even putting these
> > tlb flush fences into the shard dma_resv slots? If you store them
> > somewhere else in the amdgpu private part, the oversync issues goes
> > away
> > - in your ttm bo move callback, you can just make your bo copy job
> > depend on them too (you have to anyway)
> > - even for p2p there's not an issue here, because you have the
> > ->move_notify callback, and can then lift the tlb flush fences from
> > your private place to the shared slots so the exporter can see them.
> 
> Because adding a shared fence requires that this shared fence signals after
> the exclusive fence. And this is a perfect example to explain why this is so
> problematic and also why why we currently stumble over that only in amdgpu.

So I also have vague collections that we agreed this is required, but I
don't think it's the case. I'm at least pretty sure we already have
multiple drivers which violate this (non of them using TTM yet).

> In TTM we have a feature which allows evictions to be pipelined and don't
> wait for the evicting DMA operation. Without that driver will stall waiting
> for their allocations to finish when we need to allocate memory.
> 
> For certain use cases this gives you a ~20% fps increase under memory
> pressure, so it is a really important feature.

Yeah that's something I'm banging my head against right now a bit for my
amdgpu demo patch series.

> This works by adding the fence of the last eviction DMA operation to BOs
> when their backing store is newly allocated. That's what the
> ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> 
> Now the problem is it is possible that the application is terminated before
> it can complete it's command submission. But since resource management only
> waits for the shared fences when there are some there is a chance that we
> free up memory while it is still in use.

Hm where is this code? Would help in my audit that I wanted to do this
week? If you look at most other places like
drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
treat the shared fences special and always also include the exclusive one.

> Because of this we have some rather crude workarounds in amdgpu. For example
> IIRC we manual wait for any potential exclusive fence before freeing memory.
> 
> We could enable this feature for radeon and nouveau as well with an one line
> change. But that would mean we need to maintain the workarounds for
> shortcomings of the dma_resv object design in those drivers as well.
> 
> To summarize I think that adding an unbound fence to protect an object is a
> perfectly valid operation for resource management, but this is restricted by
> the needs of implicit sync at the moment.

Hm how are unbound fences (what do you mean with this exactly) come into
play here now? I think you lost me here on the last 1-2 paragraphs, before
that I think I followed.

> > The kernel move fences otoh are a bit more nasty to wring through the
> > p2p dma-buf interface. That one probably needs something new.
> 
> Well the p2p interface are my least concern.
> 
> Adding the move fence means that you need to touch every place we do CS or
> page flip since you now have something which is parallel to the explicit
> sync fence.
> 
> Otherwise having the move fence separately wouldn't make much sense in the
> first place if we always set it together with the exclusive fence.

Yeah it's a bunch of work. But for i915 this is the model we have, so we
have to do it anyway, so I have really good excuse here to do that ttm
audit.

> Best regards and sorry for getting on your nerves so much,

tbh I've also been rather thinly nerved on this. One side was that I spent
the last 1+ years having an eerily similar with i915-gem team about how a
single driver can't just have totally different rules for core stuff like
dma_resv/fence/locks, and unfortunately that entire story went really,
horribly wrong :-/ So I'm very much "oh noes pls not again".

But also the long w/e here helped, yay :-)

I think there's a few questions here that we can ping/pong a bit more, but
I think for the next meaningful round I need to get this draft set of
patches a bit into shape here, and audit more code. I think hopefully
early next week I'll have something which isn't too much full of holes
which should help in moving forward on this discussion.

Cheers, Daniel
Daniel Vetter May 25, 2021, 3:23 p.m. UTC | #12
On Tue, May 25, 2021 at 5:05 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Daniel,
>
> Am 25.05.21 um 15:05 schrieb Daniel Vetter:
> > Hi Christian,
> >
> > On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> >> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> >> This works by adding the fence of the last eviction DMA operation to BOs
> >> when their backing store is newly allocated. That's what the
> >> ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> >>
> >> Now the problem is it is possible that the application is terminated before
> >> it can complete it's command submission. But since resource management only
> >> waits for the shared fences when there are some there is a chance that we
> >> free up memory while it is still in use.
> > Hm where is this code? Would help in my audit that I wanted to do this
> > week? If you look at most other places like
> > drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> > treat the shared fences special and always also include the exclusive one.
>
> See amdgpu_gem_object_close():
>
> ...
>          fence = dma_resv_get_excl(bo->tbo.base.resv);
>          if (fence) {
>                  amdgpu_bo_fence(bo, fence, true);
>                  fence = NULL;
>          }
> ...
>
> We explicitly added that because resource management of some other
> driver was going totally bananas without that.
>
> But I'm not sure which one that was. Maybe dig a bit in the git and
> mailing history of that.

Hm I looked and it's

commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Mar 12 12:03:34 2020 +0100

   drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4

That sounded more like amdgpu itself needing this, not another driver?

But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
right fencing mode for gpu pte clearing, so I'm really not sure what
the bug was that you worked around here?The implementation boils down
to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
there's nothing else that I could find in public history at least, no
references to bug reports or anything. I think you need to dig
internally, because as-is I'm not seeing the problem here.

Or am I missing something here?
-Daniel
Christian König May 26, 2021, 1:32 p.m. UTC | #13
Am 25.05.21 um 17:23 schrieb Daniel Vetter:
> On Tue, May 25, 2021 at 5:05 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Hi Daniel,
>>
>> Am 25.05.21 um 15:05 schrieb Daniel Vetter:
>>> Hi Christian,
>>>
>>> On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
>>>> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
>>>> This works by adding the fence of the last eviction DMA operation to BOs
>>>> when their backing store is newly allocated. That's what the
>>>> ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
>>>>
>>>> Now the problem is it is possible that the application is terminated before
>>>> it can complete it's command submission. But since resource management only
>>>> waits for the shared fences when there are some there is a chance that we
>>>> free up memory while it is still in use.
>>> Hm where is this code? Would help in my audit that I wanted to do this
>>> week? If you look at most other places like
>>> drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
>>> treat the shared fences special and always also include the exclusive one.
>> See amdgpu_gem_object_close():
>>
>> ...
>>           fence = dma_resv_get_excl(bo->tbo.base.resv);
>>           if (fence) {
>>                   amdgpu_bo_fence(bo, fence, true);
>>                   fence = NULL;
>>           }
>> ...
>>
>> We explicitly added that because resource management of some other
>> driver was going totally bananas without that.
>>
>> But I'm not sure which one that was. Maybe dig a bit in the git and
>> mailing history of that.
> Hm I looked and it's
>
> commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
> Author: Christian König <christian.koenig@amd.com>
> Date:   Thu Mar 12 12:03:34 2020 +0100
>
>     drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
>
> That sounded more like amdgpu itself needing this, not another driver?

No, that patch was just a follow up moving the functionality around.

> But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
> right fencing mode for gpu pte clearing, so I'm really not sure what
> the bug was that you worked around here?The implementation boils down
> to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
> there's nothing else that I could find in public history at least, no
> references to bug reports or anything. I think you need to dig
> internally, because as-is I'm not seeing the problem here.
>
> Or am I missing something here?

See the code here for example: 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361

Nouveau assumes that when a shared fence is present it doesn't need to 
wait for the exclusive one because the shared are always supposed to 
finish after the exclusive one.

But for page table unmap fences that isn't true and we ran into a really 
nasty and hard to reproduce bug because of this.

I think it would be much more defensive if we could say that we always 
wait for the exclusive fence and fix the use case in nouveau and double 
check if somebody else does stuff like that as well.

Christian.

> -Daniel
Daniel Vetter May 26, 2021, 1:51 p.m. UTC | #14
On Wed, May 26, 2021 at 3:32 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 25.05.21 um 17:23 schrieb Daniel Vetter:
> > On Tue, May 25, 2021 at 5:05 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Hi Daniel,
> >>
> >> Am 25.05.21 um 15:05 schrieb Daniel Vetter:
> >>> Hi Christian,
> >>>
> >>> On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> >>>> Am 21.05.21 um 20:31 schrieb Daniel Vetter:
> >>>> This works by adding the fence of the last eviction DMA operation to BOs
> >>>> when their backing store is newly allocated. That's what the
> >>>> ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> >>>>
> >>>> Now the problem is it is possible that the application is terminated before
> >>>> it can complete it's command submission. But since resource management only
> >>>> waits for the shared fences when there are some there is a chance that we
> >>>> free up memory while it is still in use.
> >>> Hm where is this code? Would help in my audit that I wanted to do this
> >>> week? If you look at most other places like
> >>> drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> >>> treat the shared fences special and always also include the exclusive one.
> >> See amdgpu_gem_object_close():
> >>
> >> ...
> >>           fence = dma_resv_get_excl(bo->tbo.base.resv);
> >>           if (fence) {
> >>                   amdgpu_bo_fence(bo, fence, true);
> >>                   fence = NULL;
> >>           }
> >> ...
> >>
> >> We explicitly added that because resource management of some other
> >> driver was going totally bananas without that.
> >>
> >> But I'm not sure which one that was. Maybe dig a bit in the git and
> >> mailing history of that.
> > Hm I looked and it's
> >
> > commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
> > Author: Christian König <christian.koenig@amd.com>
> > Date:   Thu Mar 12 12:03:34 2020 +0100
> >
> >     drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
> >
> > That sounded more like amdgpu itself needing this, not another driver?
>
> No, that patch was just a follow up moving the functionality around.

That patch added the "add exclusive fence to shared slots before
amdgpu_vm_clear_freed() call", which I thought was at least part of
your fix.

> > But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
> > right fencing mode for gpu pte clearing, so I'm really not sure what
> > the bug was that you worked around here?The implementation boils down
> > to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
> > there's nothing else that I could find in public history at least, no
> > references to bug reports or anything. I think you need to dig
> > internally, because as-is I'm not seeing the problem here.
> >
> > Or am I missing something here?
>
> See the code here for example:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361
>
> Nouveau assumes that when a shared fence is present it doesn't need to
> wait for the exclusive one because the shared are always supposed to
> finish after the exclusive one.
>
> But for page table unmap fences that isn't true and we ran into a really
> nasty and hard to reproduce bug because of this.
>
> I think it would be much more defensive if we could say that we always
> wait for the exclusive fence and fix the use case in nouveau and double
> check if somebody else does stuff like that as well.

Yeah most other drivers do the defensive thing here. I think it would
be good to standardize on that. I'll add that to my list and do more
auditing.
-Daniel
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 88a24a0b5691..cc8426e1e8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -617,8 +617,8 @@  static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	amdgpu_bo_list_for_each_entry(e, p->bo_list) {
 		struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
 
-		/* Make sure we use the exclusive slot for shared BOs */
-		if (bo->prime_shared_count)
+		/* Make sure we use the exclusive slot for all potentially shared BOs */
+		if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
 			e->tv.num_shared = 0;
 		e->bo_va = amdgpu_vm_bo_find(vm, bo);
 	}