[RFC] amdgpu: Add a context flag to disable implicit sync

Message ID 20240807153941.3668940-1-faith.ekstrand@collabora.com (mailing list archive)
State New, archived
Series [RFC] amdgpu: Add a context flag to disable implicit sync

Commit Message

Faith Ekstrand Aug. 7, 2024, 3:39 p.m. UTC
Previously, AMDGPU_GEM_CREATE_EXPLICIT_SYNC was used to disable implicit
synchronization on BOs when explicit synchronization can be used.  The
problem is that this flag is per-BO and affects all amdgpu users in the
system, not just the usermode driver which sets it.  This can lead to
some unintended consequences for userspace if not used carefully.

Since the introduction of DMA_BUF_IOCTL_EXPORT_SYNC_FILE and
DMA_BUF_IOCTL_IMPORT_SYNC_FILE, many userspace window system components
have grown the ability to convert between the Vulkan explicit sync model
and the legacy implicit sync model used by X11 and Wayland in the past.
This allows both old and new components to exist simultaneously and talk
to each other.  In particular, XWayland is able to convert between the
two to let Vulkan apps work seamlessly with older X11 compositors that
aren't aware of explicit synchronization.  This is rapidly becoming the
backbone of synchronization in the Linux window system space.
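
For illustration, a minimal userspace sketch of this conversion (the
structs, flags, and ioctls are the existing UAPI from <linux/dma-buf.h>;
the helper names are only illustrative):

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Pull the BO's current implicit fences out as a sync_file fd.
 * DMA_BUF_SYNC_WRITE asks for all fences, as if we were about to write;
 * DMA_BUF_SYNC_READ would ask only for the write fences. */
static int export_implicit_fences(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_WRITE,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                return -1;

        return args.fd;
}

/* Install an explicit sync_file as an implicit write fence on the BO so
 * that implicit-sync consumers (e.g. an old X11 compositor) wait on it. */
static int import_explicit_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_import_sync_file args = {
                .flags = DMA_BUF_SYNC_WRITE,
                .fd = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}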

Unfortunately, AMDGPU_GEM_CREATE_EXPLICIT_SYNC breaks this because it
disables all implicit synchronization on the given BO, regardless of
which userspace driver is rendering with it and regardless of the
assumptions made by the client application.  In particular, this is
causing issues with RADV and radeonsi.  RADV sets the flag to disable
implicit sync on window system buffers so that it can safely have them
resident at all times without causing internal over-synchronization.
The BO is then handed off to a compositor which composites using
radeonsi.  If the compositor uses the EGL_ANDROID_native_fence_sync
extension to pass explicit sync files through to radeonsi, then
everything is fine.  However, if that buffer ever gets handed to an
instance of radeonsi with any assumption of implicit synchronization,
radeonsi won't be able to sync on the BO because RADV disabled implicit sync
on that BO system-wide.  It doesn't matter whether some window system
component used DMA_BUF_IOCTL_IMPORT_SYNC_FILE to set the appropriate
fence on the BO, amdgpu will ignore it.

This new flag disables implicit sync on the context rather than the BO.
This way RADV can disable implicit sync (which is what RADV has always
wanted) without affecting other components in the system.  If RADV (or
some other driver) wants implicit sync on some BO, it can use
DMA_BUF_IOCTL_EXPORT_SYNC_FILE and DMA_BUF_IOCTL_IMPORT_SYNC_FILE to
manually synchronize with other implicit-sync components.  This is the
approach we've taken with NVK/nouveau and ANV/xe, and it is similar to
the approach taken by ANV/i915; it works well for those drivers.
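
For reference, a minimal sketch of how a usermode driver could allocate
such a context through libdrm (AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC is
the flag added by this patch; the helper name is only illustrative):

#include <stdint.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

static int alloc_explicit_sync_ctx(int drm_fd, uint32_t *ctx_id)
{
        union drm_amdgpu_ctx args = {0};

        args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
        args.in.priority = AMDGPU_CTX_PRIORITY_NORMAL;
        /* New in this patch: all sync for this context is explicit. */
        args.in.flags = AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC;

        if (drmIoctl(drm_fd, DRM_IOCTL_AMDGPU_CTX, &args))
                return -1;

        *ctx_id = args.out.alloc.ctx_id;
        return 0;
}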

Ideally, I would like to see something like this back-ported to at least
the kernel where DMA_BUF_IOCTL_IMPORT/EXPORT_SYNC_FILE were introduced
so that we don't have to wait another year for the fix to reach users.
However, I understand that back-porting UAPI is problematic and I'll
leave that decision up to the amdgpu maintainers.  Michel suggested that
a new CTX_OP would make more sense if we want to back-port it, but the
create flag made more sense to me from an API design PoV.

Signed-off-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Michel Dänzer <mdaenzer@redhat.com>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 12 ++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  7 +++++++
 include/uapi/drm/amdgpu_drm.h           | 12 +++++++++++-
 4 files changed, 28 insertions(+), 6 deletions(-)

Comments

Joshua Ashton Aug. 7, 2024, 7:23 p.m. UTC | #1
I was thinking about this more recently. I was initially considering "maybe this should be a per-BO import," but I couldn't think of anything in the GL model that would actually benefit, given it's not "true" bindless and there's no update-after-bind there.

Worth others more familiar with GL asking that question to themselves also. I am definitely not totally up on what's possible there.

Overall, I think I am OK with this approach, even though I think mixing implicit and explicit sync is gross, and I want the pain that is implicit sync to just go away forever. :-)

- Joshie 
Faith Ekstrand Aug. 7, 2024, 8:25 p.m. UTC | #2
On Wed, Aug 7, 2024 at 2:23 PM Joshua Ashton <joshua@froggi.es> wrote:

> I was thinking about this more recently. I was initially considering
> "maybe this should be a per-BO import," but I couldn't think of anything in
> the GL model that would actually benefit, given it's not "true" bindless and
> there's no update-after-bind there.
>

That's also an option and it's the way it works on i915. However, then you
have to pass lists of things to the kernel and that's kinda gross. If we
need it, we can do that. Otherwise, I think a more global thing is fine.  I
think Bas is currently investigating a per-submit approach which is a tad
different from either but should also work okay.


> Worth others more familiar with GL asking that question to themselves
> also. I am definitely not totally up on what's possible there.
>
> Overall, I think I am OK with this approach, even though I think mixing
> implicit and explicit sync is gross, and I want the pain that is implicit
> sync to just go away forever. :-)
>

So say we all...

~Faith



> - Joshie 
Bas Nieuwenhuizen Aug. 7, 2024, 8:33 p.m. UTC | #3
On Wed, Aug 7, 2024 at 10:25 PM Faith Ekstrand <faith@gfxstrand.net> wrote:

> On Wed, Aug 7, 2024 at 2:23 PM Joshua Ashton <joshua@froggi.es> wrote:
>
>> I was thinking about this more recently. I was initially considering
>> "maybe this should be a per-BO import," but I couldn't think of anything in
>> the GL model that would actually benefit, given it's not "true" bindless and
>> there's no update-after-bind there.
>>
>
> That's also an option and it's the way it works on i915. However, then you
> have to pass lists of things to the kernel and that's kinda gross. If we
> need it, we can do that. Otherwise, I think a more global thing is fine.  I
> think Bas is currently investigating a per-submit approach which is a tad
> different from either but should also work okay.
>
>

Yeah, I'm working on a per-submit thing (also using BOOKKEEP fences instead
of using the EXPLICIT wait mode to ensure we also don't add implicit
fences).

We do have a per-BO list on submission already, so we could add things
there; it is just very annoying to implement because, at the point where
we currently do the fence wait/signal, we have lost the association with
the BO list. Given that I don't see a use case anytime soon (there are
some theoretical ones, like radeonsi starting to do READ usage instead
of RW usage with extra tracking), I feel it isn't worth the added
implementation complexity.


>> Worth others more familiar with GL asking that question to themselves also.
>> I am definitely not totally up on what's possible there.
>>
>> Overall, I think I am OK with this approach, even though I think mixing
>> implicit and explicit sync is gross, and I want the pain that is implicit
>> sync to just go away forever. :-)
>>
>
> So say we all...
>
> ~Faith
>
>
>
>> - Joshie 
Michel Dänzer Aug. 8, 2024, 7:57 a.m. UTC | #4
On 2024-08-07 21:23, Joshua Ashton wrote:
> I was thinking about this more recently. I was initially considering "maybe this should be a per-BO import," but I couldn't think of anything in the GL model that would actually benefit, given it's not "true" bindless and there's no update-after-bind there.
> 
> Worth others more familiar with GL asking that question to themselves also. I am definitely not totally up on what's possible there.
> 
> Overall, I think I am OK with this approach, even though I think mixing implicit and explicit sync is gross, and I want the pain that is implicit sync to just go away forever. :-)

It can never go away, at least not in the drivers which have ever supported it. We can never break compatibility.
Christian König Aug. 19, 2024, 2:51 p.m. UTC | #5
On 07.08.24 at 22:33, Bas Nieuwenhuizen wrote:
> On Wed, Aug 7, 2024 at 10:25 PM Faith Ekstrand <faith@gfxstrand.net> 
> wrote:
>
>     On Wed, Aug 7, 2024 at 2:23 PM Joshua Ashton <joshua@froggi.es> wrote:
>
>         I was thinking about this more recently. I was initially
>         considering "maybe this should be a per-BO import," but I
>         couldn't think of anything in the GL model that would actually
>         benefit, given it's not "true" bindless and there's no
>         update-after-bind there.
>
>
>     That's also an option and it's the way it works on i915. However,
>     then you have to pass lists of things to the kernel and that's
>     kinda gross. If we need it, we can do that. Otherwise, I think a
>     more global thing is fine.  I think Bas is currently investigating
>     a per-submit approach which is a tad different from either but
>     should also work okay.
>
>
> Yeah, I'm working on a per-submit thing (also using BOOKKEEP fences 
> instead of using the EXPLICIT wait mode to ensure we also don't add 
> implicit fences).

Yeah, agreed. Your implementation with the per-submission flag and using
BOOKKEEP actually sounds like the right thing to do to me as well.

We need to keep in mind that synchronization goes in both ways, e.g. 
explicit->implicit as well as implicit->explicit.

I would rather like to keep the implicit->explicit handling (which this
patch here completely disables) and only allow the explicit->implicit
handling (which is achieved by using BOOKKEEP fences).

This way it is possible that we still over-synchronize, for example when
a double-buffered BO is re-used by an explicit client and an implicit
display server, but that's probably not something we should optimize in
the first place.

Regards,
Christian.

>
> We do have a per-BO list on submission already, so we could add things
> there; it is just very annoying to implement because, at the point where
> we currently do the fence wait/signal, we have lost the association with
> the BO list. Given that I don't see a use case anytime soon (there are
> some theoretical ones, like radeonsi starting to do READ usage instead
> of RW usage with extra tracking), I feel it isn't worth the added
> implementation complexity.
>
>
>         Worth others more familiar with GL asking that question to
>         themselves also. I am definitely not totally up on what's
>         possible there.
>
>         Overall, I think I am OK with this approach, even though I
>         think mixing implicit and explicit sync is gross, and I want
>         the pain that is implicit sync to just go away forever. :-)
>
>
>     So say we all...
>
>     ~Faith
>
>         - Joshie 
Bas Nieuwenhuizen Aug. 19, 2024, 2:59 p.m. UTC | #6
On Mon, Aug 19, 2024 at 4:51 PM Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> On 07.08.24 at 22:33, Bas Nieuwenhuizen wrote:
>
> On Wed, Aug 7, 2024 at 10:25 PM Faith Ekstrand <faith@gfxstrand.net>
> wrote:
>
>> On Wed, Aug 7, 2024 at 2:23 PM Joshua Ashton <joshua@froggi.es> wrote:
>>
>>> I was thinking about this more recently. I was initially considering
>>> "maybe this should be a per-BO import," but I couldn't think of anything in
>>> the GL model that would actually benefit, given it's not "true" bindless and
>>> there's no update-after-bind there.
>>>
>>
>> That's also an option and it's the way it works on i915. However, then
>> you have to pass lists of things to the kernel and that's kinda gross. If
>> we need it, we can do that. Otherwise, I think a more global thing is
>> fine.  I think Bas is currently investigating a per-submit approach which
>> is a tad different from either but should also work okay.
>>
>>
>
> Yeah, I'm working on a per-submit thing (also using BOOKKEEP fences
> instead of using the EXPLICIT wait mode to ensure we also don't add
> implicit fences).
>
>
> Yeah, agreed. Your implementation with the per-submission flag and using
> BOOKKEEP actually sounds like the right thing to do to me as well.
>
> We need to keep in mind that synchronization goes in both ways, e.g.
> explicit->implicit as well as implicit->explicit.
>
> I would rather like to keep the implicit->explicit handling (which this
> patch here completely disables) and only allow the explicit->implicit
> handling (which is achieved by using BOOKKEEP fences).
>
> This way it is possible that we still over-synchronize, for example when
> a double-buffered BO is re-used by an explicit client and an implicit
> display server, but that's probably not something we should optimize in
> the first place.
>

This over-synchronization actually happens easily, as in bindless Vulkan
we have to mark all buffers as "used". We have some hacks to avoid the
worst of it, but it can be pretty meh.

In my series on the ML[1] I think I actually got both sides by waiting on
KERNEL fences only and setting BOOKKEEP fences. (Yeah it actually ends up
kinda orthogonal on the sync mode but it is what it is ...).

- Bas

[1] https://patchwork.freedesktop.org/series/137014/
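
For illustration, roughly what the wait/signal sides of that approach
look like in dma_resv terms (a hedged sketch, not the code from the
series; it assumes the BO's reservation lock is held and a fence slot was
reserved with dma_resv_reserve_fences()):

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

static int sketch_explicit_sync_submit(struct dma_resv *resv,
                                       struct dma_fence *job_fence)
{
        struct dma_resv_iter cursor;
        struct dma_fence *fence;
        long r;

        /* Wait side: only KERNEL-usage fences (mandatory kernel work such
         * as buffer moves), never other clients' implicit fences. */
        dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_KERNEL, fence) {
                r = dma_fence_wait(fence, true);
                if (r)
                        return r;
        }

        /* Signal side: a BOOKKEEP fence is invisible to implicit sync, so
         * this submission never becomes an implicit dependency. */
        dma_resv_add_fence(resv, job_fence, DMA_RESV_USAGE_BOOKKEEP);
        return 0;
}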


> Regards,
> Christian.
>
>
> We do have a per-BO list on submission already, so we could add things
> there; it is just very annoying to implement because, at the point where
> we currently do the fence wait/signal, we have lost the association with
> the BO list. Given that I don't see a use case anytime soon (there are
> some theoretical ones, like radeonsi starting to do READ usage instead
> of RW usage with extra tracking), I feel it isn't worth the added
> implementation complexity.
>
>
>>> Worth others more familiar with GL asking that question to themselves
>>> also. I am definitely not totally up on what's possible there.
>>>
>>> Overall, I think I am OK with this approach, even though I think mixing
>>> implicit and explicit sync is gross, and I want the pain that is implicit
>>> sync to just go away forever. :-)
>>>
>>
>> So say we all...
>>
>> ~Faith
>>
>>
>>
>>> - Joshie 
Faith Ekstrand Aug. 19, 2024, 4 p.m. UTC | #7
On Mon, Aug 19, 2024 at 10:00 AM Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
wrote:

>
>
> On Mon, Aug 19, 2024 at 4:51 PM Christian König <
> ckoenig.leichtzumerken@gmail.com> wrote:
>
>> On 07.08.24 at 22:33, Bas Nieuwenhuizen wrote:
>>
>> On Wed, Aug 7, 2024 at 10:25 PM Faith Ekstrand <faith@gfxstrand.net>
>> wrote:
>>
>>> On Wed, Aug 7, 2024 at 2:23 PM Joshua Ashton <joshua@froggi.es> wrote:
>>>
>>>> I was thinking about this more recently. I was initially considering
>>>> "maybe this should be a per-BO import," but I couldn't think of anything in
>>>> the GL model that would actually benefit, given it's not "true" bindless and
>>>> there's no update-after-bind there.
>>>>
>>>
>>> That's also an option and it's the way it works on i915. However, then
>>> you have to pass lists of things to the kernel and that's kinda gross. If
>>> we need it, we can do that. Otherwise, I think a more global thing is
>>> fine.  I think Bas is currently investigating a per-submit approach which
>>> is a tad different from either but should also work okay.
>>>
>>>
>>
>> Yeah, I'm working on a per-submit thing (also using BOOKKEEP fences
>> instead of using the EXPLICIT wait mode to ensure we also don't add
>> implicit fences).
>>
>>
>> Yeah, agreed. Your implementation with the per-submission flag and using
>> BOOKKEEP actually sounds like the right thing to do to me as well.
>>
>> We need to keep in mind that synchronization goes in both ways, e.g.
>> explicit->implicit as well as implicit->explicit.
>>
>> I would rather like to keep the implicit->explicit handling (which this
>> patch here completely disables) and only allow the explicit->implicit
>> handling (which is achieved by using BOOKKEEP fences).
>>
>> This way it is possible that we still over-synchronize, for example when
>> a double-buffered BO is re-used by an explicit client and an implicit
>> display server, but that's probably not something we should optimize in
>> the first place.
>>
>
> This over-synchronization actually happens easily, as in bindless Vulkan
> we have to mark all buffers as "used". We have some hacks to avoid the
> worst of it, but it can be pretty meh.
>

Yeah, this case is actually really important. When I initially did the
dma-buf fence import/export work on Intel, it was a massive speed-up in
DOOM 2016, precisely from removing this bit of over-sync.

~Faith



> In my series on the ML[1] I think I actually got both sides by waiting on
> KERNEL fences only and setting BOOKKEEP fences. (Yeah it actually ends up
> kinda orthogonal on the sync mode but it is what it is ...).
>
> - Bas
>
> [1] https://patchwork.freedesktop.org/series/137014/
>
>
>> Regards,
>> Christian.
>>
>>
>> We do have a per-BO list on submission already, so we could add things
>> there; it is just very annoying to implement because, at the point where
>> we currently do the fence wait/signal, we have lost the association with
>> the BO list. Given that I don't see a use case anytime soon (there are
>> some theoretical ones, like radeonsi starting to do READ usage instead
>> of RW usage with extra tracking), I feel it isn't worth the added
>> implementation complexity.
>>
>>
>>>> Worth others more familiar with GL asking that question to themselves
>>>> also. I am definitely not totally up on what's possible there.
>>>>
>>>> Overall, I think I am OK with this approach, even though I think mixing
>>>> implicit and explicit sync is gross, and I want the pain that is implicit
>>>> sync to just go away forever. :-)
>>>>
>>>
>>> So say we all...
>>>
>>> ~Faith
>>>
>>>
>>>
>>>> - Joshie 

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index ec888fc6ead8..8410b4426541 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1196,7 +1196,8 @@  static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
 		struct dma_resv *resv = bo->tbo.base.resv;
 		enum amdgpu_sync_mode sync_mode;
 
-		sync_mode = amdgpu_bo_explicit_sync(bo) ?
+		sync_mode = (amdgpu_ctx_explicit_sync(p->ctx) ||
+			     amdgpu_bo_explicit_sync(bo)) ?
 			AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
 		r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
 				     &fpriv->vm);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 5cb33ac99f70..a304740ccedf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -318,7 +318,8 @@  static int amdgpu_ctx_get_stable_pstate(struct amdgpu_ctx *ctx,
 }
 
 static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
-			   struct drm_file *filp, struct amdgpu_ctx *ctx)
+			   uint32_t flags, struct drm_file *filp,
+			   struct amdgpu_ctx *ctx)
 {
 	struct amdgpu_fpriv *fpriv = filp->driver_priv;
 	u32 current_stable_pstate;
@@ -334,6 +335,7 @@  static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
 	ctx->mgr = mgr;
 	spin_lock_init(&ctx->ring_lock);
 
+	ctx->flags = flags;
 	ctx->reset_counter = atomic_read(&mgr->adev->gpu_reset_counter);
 	ctx->reset_counter_query = ctx->reset_counter;
 	ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
@@ -474,6 +476,7 @@  static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
 			    struct amdgpu_fpriv *fpriv,
 			    struct drm_file *filp,
 			    int32_t priority,
+			    uint32_t flags,
 			    uint32_t *id)
 {
 	struct amdgpu_ctx_mgr *mgr = &fpriv->ctx_mgr;
@@ -493,7 +496,7 @@  static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
 	}
 
 	*id = (uint32_t)r;
-	r = amdgpu_ctx_init(mgr, priority, filp, ctx);
+	r = amdgpu_ctx_init(mgr, priority, flags, filp, ctx);
 	if (r) {
 		idr_remove(&mgr->ctx_handles, *id);
 		*id = 0;
@@ -666,7 +669,7 @@  int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 		     struct drm_file *filp)
 {
 	int r;
-	uint32_t id, stable_pstate;
+	uint32_t id, stable_pstate, flags;
 	int32_t priority;
 
 	union drm_amdgpu_ctx *args = data;
@@ -675,6 +678,7 @@  int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 
 	id = args->in.ctx_id;
 	priority = args->in.priority;
+	flags = args->in.flags;
 
 	/* For backwards compatibility, we need to accept ioctls with garbage
 	 * in the priority field. Garbage values in the priority field, result
@@ -685,7 +689,7 @@  int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 
 	switch (args->in.op) {
 	case AMDGPU_CTX_OP_ALLOC_CTX:
-		r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, &id);
+		r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, flags, &id);
 		args->out.alloc.ctx_id = id;
 		break;
 	case AMDGPU_CTX_OP_FREE_CTX:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
index 85376baaa92f..9431c8d2ea59 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
@@ -45,6 +45,7 @@  struct amdgpu_ctx_entity {
 struct amdgpu_ctx {
 	struct kref			refcount;
 	struct amdgpu_ctx_mgr		*mgr;
+	uint32_t			flags;
 	unsigned			reset_counter;
 	unsigned			reset_counter_query;
 	uint64_t			generation;
@@ -84,6 +85,12 @@  struct dma_fence *amdgpu_ctx_get_fence(struct amdgpu_ctx *ctx,
 bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio);
 void amdgpu_ctx_priority_override(struct amdgpu_ctx *ctx, int32_t ctx_prio);
 
+static inline bool
+amdgpu_ctx_explicit_sync(struct amdgpu_ctx *ctx)
+{
+	return ctx->flags & AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC;
+}
+
 int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
 		     struct drm_file *filp);
 
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index 96e32dafd4f0..e9d87a6e3d86 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -125,7 +125,14 @@  extern "C" {
 #define AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS	(1 << 5)
 /* Flag that BO is always valid in this VM */
 #define AMDGPU_GEM_CREATE_VM_ALWAYS_VALID	(1 << 6)
-/* Flag that BO sharing will be explicitly synchronized */
+/* Flag that BO sharing will be explicitly synchronized
+ *
+ * This flag should not be used unless the client can guarantee that no
+ * other driver which ever touches this BO will ever want to use implicit
+ * synchronization as it disables implicit sync on this BO system-wide.
+ * Instead, drivers which use an explicit synchronization model should
+ * prefer AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC.
+ */
 #define AMDGPU_GEM_CREATE_EXPLICIT_SYNC		(1 << 7)
 /* Flag that indicates allocating MQD gart on GFX9, where the mtype
  * for the second page onward should be set to NC. It should never
@@ -240,6 +247,9 @@  union drm_amdgpu_bo_list {
 #define AMDGPU_CTX_OP_GET_STABLE_PSTATE	5
 #define AMDGPU_CTX_OP_SET_STABLE_PSTATE	6
 
+/* indicate that all synchronization will be explicit */
+#define AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC (1<<0)
+
 /* GPU reset status */
 #define AMDGPU_CTX_NO_RESET		0
 /* this the context caused it */