
[RFC,0/7] drm/panfrost: Add a new submit ioctl

Message ID: 20210311092539.2405596-1-boris.brezillon@collabora.com

Message

Boris Brezillon March 11, 2021, 9:25 a.m. UTC
Hello,

I've been playing with Vulkan lately and struggled quite a bit to
implement VkQueueSubmit with the submit ioctl we have. There are
several limiting factors that can be worked around if we really have to,
but I think it'd be much easier and future-proof if we introduce a new
ioctl that addresses the current limitations:

1/ There can only be one out_sync, but Vulkan might ask us to signal
   several VkSemaphores and possibly one VkFence too, both of those
   being based on sync objects in my PoC. Making out_sync an array of
   syncobjs to attach the render_done fence to would make that possible.
   The other option would be to collect syncobj updates in userspace
   in a separate thread and propagate those updates to all
   semaphores+fences waiting on those events (I think the v3dv driver
   does something like that, but I didn't spend enough time studying
   the code to be sure, so I might be wrong).

2/ Queued jobs might be executed out-of-order (unless they have
   explicit/implicit deps between them), and Vulkan asks that the out
   fence be signaled when all jobs are done. Timeline syncobjs are a
   good match for that use case. All we need to do is pass the same
   fence syncobj to all jobs being attached to a single QueueSubmit
   request, but a different point on the timeline. The syncobj
   timeline wait does the rest and guarantees that we've reached a
   given timeline point (IOW, all jobs before that point are done)
   before declaring the fence as signaled.
   One alternative would be to have dummy 'synchronization' jobs that
   don't actually execute anything on the GPU but declare a dependency
   on all other jobs that are part of the QueueSubmit request, and
   signal the out fence (the scheduler would do most of the work for
   us, all we have to do is support NULL job heads and signal the
   fence directly when that happens instead of queueing the job).

3/ The current implementation lacks information about BO access,
   so we serialize all jobs accessing the same set of BOs, even
   if those jobs might just be reading from them (which can
   happen concurrently). Other drivers pass an access type to the
   list of referenced BOs to address that. Another option would be
   to disable implicit deps (deps based on BOs) and force the driver
   to pass all deps explicitly (interestingly, some drivers have
   both the no-implicit-dep and r/w flags, probably to support
   sub-resource access, so we might want to add that one too).
   I don't see any userspace workaround to that problem, so that one
   alone would justify extending the existing ioctl or adding a new
   one.

4/ There's also the fact that submitting one job at a time adds an
   overhead when QueueSubmit is being passed more than one
   CommandBuffer. That one is less problematic, but if we're adding
   a new ioctl we'd better design it to limit the userspace -> kernel
   transition overhead (see the rough uapi sketch after this list).
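
To make the points above a bit more concrete, here's a very rough sketch
of what the uapi of such an ioctl could look like. The struct and field
names below are illustrative placeholders, not necessarily what the
patches in this series actually use:

/* Illustrative sketch only, see the "Add a new ioctl to submit batches"
 * patch for the real uapi.
 */
#define PANFROST_BO_REF_EXCLUSIVE   0x1   /* 'write' access, see 3/ */
#define PANFROST_BO_REF_SHARED      0x2   /* 'read' access, see 3/ */

struct drm_panfrost_bo_ref {
        __u32 handle;
        __u32 flags;        /* access type above */
};

struct drm_panfrost_syncobj_ref {
        __u32 handle;
        __u32 pad;
        __u64 point;        /* timeline point, 0 for binary syncobjs, see 2/ */
};

struct drm_panfrost_batch {
        __u64 jc;           /* GPU address of the job chain head */
        __u64 bo_refs;      /* array of drm_panfrost_bo_ref */
        __u32 bo_ref_count;
        __u32 in_sync_count;
        __u64 in_syncs;     /* explicit deps (syncobj handles) */
        __u64 out_syncs;    /* array of drm_panfrost_syncobj_ref, see 1/ */
        __u32 out_sync_count;
        __u32 requirements;
};

struct drm_panfrost_batch_submit {
        __u64 batches;      /* array of drm_panfrost_batch, see 4/ */
        __u32 batch_count;
        __u32 batch_stride; /* sizeof(drm_panfrost_batch), for extensibility */
};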

Right now I'm just trying to collect feedback. I don't intend to get
those patches merged until we have a userspace user, but I thought
starting the discussion early would be a good thing.

Feel free to suggest other approaches.

Regards,

Boris

Boris Brezillon (7):
  drm/panfrost: Pass a job to panfrost_{acquire,attach_object_fences}()
  drm/panfrost: Collect implicit and explicit deps in an XArray
  drm/panfrost: Move the mappings collection out of
    panfrost_lookup_bos()
  drm/panfrost: Add BO access flags to relax dependencies between jobs
  drm/panfrost: Add a new ioctl to submit batches
  drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature
  drm/panfrost: Bump minor version to reflect the feature additions

 drivers/gpu/drm/panfrost/panfrost_drv.c | 408 +++++++++++++++++++++---
 drivers/gpu/drm/panfrost/panfrost_job.c |  80 +++--
 drivers/gpu/drm/panfrost/panfrost_job.h |   8 +-
 include/uapi/drm/panfrost_drm.h         |  83 +++++
 4 files changed, 483 insertions(+), 96 deletions(-)

Comments

Steven Price March 11, 2021, 12:16 p.m. UTC | #1
On 11/03/2021 09:25, Boris Brezillon wrote:
> Hello,
> 
> I've been playing with Vulkan lately and struggled quite a bit to
> implement VkQueueSubmit with the submit ioctl we have. There are
> several limiting factors that can be worked around if we really have to,
> but I think it'd be much easier and future-proof if we introduce a new
> ioctl that addresses the current limitations:

Hi Boris,

I think what you've proposed is quite reasonable, some detailed comments 
to your points below.

> 
> 1/ There can only be one out_sync, but Vulkan might ask us to signal
>     several VkSemaphores and possibly one VkFence too, both of those
>     being based on sync objects in my PoC. Making out_sync an array of
>     syncobjs to attach the render_done fence to would make that possible.
>     The other option would be to collect syncobj updates in userspace
>     in a separate thread and propagate those updates to all
>     semaphores+fences waiting on those events (I think the v3dv driver
>     does something like that, but I didn't spend enough time studying
>     the code to be sure, so I might be wrong).

You should be able to avoid the separate thread to propagate by having a 
proxy object in user space that maps between the one outsync of the job 
and the possibly many Vulkan objects. But I've had this argument before 
with the DDK... and the upshot of it was that the Vulkan API is 
unnecessarily complex here and makes this really hard to do in practice. 
So I agree adding this capability to the kernel is likely the best approach.

> 2/ Queued jobs might be executed out-of-order (unless they have
>     explicit/implicit deps between them), and Vulkan asks that the out
>     fence be signaled when all jobs are done. Timeline syncobjs are a
>     good match for that use case. All we need to do is pass the same
>     fence syncobj to all jobs being attached to a single QueueSubmit
>     request, but a different point on the timeline. The syncobj
>     timeline wait does the rest and guarantees that we've reached a
>     given timeline point (IOW, all jobs before that point are done)
>     before declaring the fence as signaled.
>     One alternative would be to have dummy 'synchronization' jobs that
>     don't actually execute anything on the GPU but declare a dependency
>     on all other jobs that are part of the QueueSubmit request, and
>     signal the out fence (the scheduler would do most of the work for
>     us, all we have to do is support NULL job heads and signal the
>     fence directly when that happens instead of queueing the job).

I have to admit to being rather hazy on the details of timeline 
syncobjs, but I thought there was a requirement that the timeline moves 
monotonically. I.e. if you have multiple jobs signalling the same 
syncobj just with different points, then AFAIU the API requires that the 
points are triggered in order.

So I'm not sure that you've actually fixed this point - you either need 
to force an order (in which case the last job can signal the Vulkan 
fence) or you still need a dummy job to do the many-to-one dependency.

Or I may have completely misunderstood timeline syncobjs - definitely a 
possibility :)

> 3/ The current implementation lacks information about BO access,
>     so we serialize all jobs accessing the same set of BOs, even
>     if those jobs might just be reading from them (which can
>     happen concurrently). Other drivers pass an access type to the
>     list of referenced BOs to address that. Another option would be
>     to disable implicit deps (deps based on BOs) and force the driver
>     to pass all deps explicitly (interestingly, some drivers have
>     both the no-implicit-dep and r/w flags, probably to support
>     sub-resource access, so we might want to add that one too).
>     I don't see any userspace workaround to that problem, so that one
>     alone would justify extending the existing ioctl or adding a new
>     one.

Yeah - I think we need this. My only comment is that I think the 
read/write terminology may come back to bite. Better to use 'shared' and 
'exclusive' - which better matches the dma_resv_xxx APIs anyway.

Also the current code completely ignores PANFROST_BO_REF_READ. So either 
that should be defined as 0, or even better we support 3 modes:

  * Exclusive ('write' access)
  * Shared ('read' access)
  * No fence - ensures the BO is mapped, but doesn't add any implicit 
fences.

The last may make sense when doing explicit fences and e.g. doing 
front-buffer rendering with a display driver which does implicit fencing.
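
On the kernel side the three modes would roughly map onto the dma_resv
API like the sketch below (placeholder flag/helper names, not code from
this series; the shared case also needs a dma_resv_reserve_shared() slot
reserved beforehand):

/* Sketch only, placeholder names. */
static void panfrost_attach_fence(struct drm_gem_object *obj,
                                  struct dma_fence *fence, u32 flags)
{
        switch (flags & PANFROST_BO_REF_ACCESS_MASK) {
        case PANFROST_BO_REF_EXCLUSIVE:
                /* Writer: everything touching the BO later must wait. */
                dma_resv_add_excl_fence(obj->resv, fence);
                break;
        case PANFROST_BO_REF_SHARED:
                /* Reader: only later writers must wait. */
                dma_resv_add_shared_fence(obj->resv, fence);
                break;
        case PANFROST_BO_REF_NO_FENCE:
                /* BO is mapped for the job, no implicit fence added. */
                break;
        }
}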

> 4/ There's also the fact that submitting one job at a time adds an
>     overhead when QueueSubmit is being passed more than one
>     CommandBuffer. That one is less problematic, but if we're adding
>     a new ioctl we'd better design it to limit the userspace -> kernel
>     transition overhead.

I've no objection - but I doubt the performance effect is significant. I 
was pleased to see the handling of stride which makes the interface 
extendable. In particular I suspect at some point we're going to want a 
priority field in some form.

> Right now I'm just trying to collect feedback. I don't intend to get
> those patches merged until we have a userspace user, but I thought
> starting the discussion early would be a good thing.
> 
> Feel free to suggest other approaches.

Other than the above I didn't see any obvious issues, and I know the 
Vulkan API is problematic in terms of synchronisation primitives - so if 
this makes it easier to implement then it seems like a good idea to me.

Thanks,

Steve
Boris Brezillon March 11, 2021, 1 p.m. UTC | #2
Hi Steven,

On Thu, 11 Mar 2021 12:16:33 +0000
Steven Price <steven.price@arm.com> wrote:

> On 11/03/2021 09:25, Boris Brezillon wrote:
> > Hello,
> > 
> > I've been playing with Vulkan lately and struggled quite a bit to
> > implement VkQueueSubmit with the submit ioctl we have. There are
> > several limiting factors that can be worked around if we really have to,
> > but I think it'd be much easier and future-proof if we introduce a new
> > ioctl that addresses the current limitations:  
> 
> Hi Boris,
> 
> I think what you've proposed is quite reasonable, some detailed comments 
> to your points below.
> 
> > 
> > 1/ There can only be one out_sync, but Vulkan might ask us to signal
> >     several VkSemaphores and possibly one VkFence too, both of those
> >     being based on sync objects in my PoC. Making out_sync an array of
> >     syncobjs to attach the render_done fence to would make that possible.
> >     The other option would be to collect syncobj updates in userspace
> >     in a separate thread and propagate those updates to all
> >     semaphores+fences waiting on those events (I think the v3dv driver
> >     does something like that, but I didn't spend enough time studying
> >     the code to be sure, so I might be wrong).  
> 
> You should be able to avoid the separate thread to propagate by having a 
> proxy object in user space that maps between the one outsync of the job 
> and the possibly many Vulkan objects. But I've had this argument before 
> with the DDK... and the upshot of it was that the Vulkan API is 
> unnecessarily complex here and makes this really hard to do in practice. 
> So I agree adding this capability to the kernel is likely the best approach.
> 
> > 2/ Queued jobs might be executed out-of-order (unless they have
> >     explicit/implicit deps between them), and Vulkan asks that the out
> >     fence be signaled when all jobs are done. Timeline syncobjs are a
> >     good match for that use case. All we need to do is pass the same
> >     fence syncobj to all jobs being attached to a single QueueSubmit
> >     request, but a different point on the timeline. The syncobj
> >     timeline wait does the rest and guarantees that we've reached a
> >     given timeline point (IOW, all jobs before that point are done)
> >     before declaring the fence as signaled.
> >     One alternative would be to have dummy 'synchronization' jobs that
> >     don't actually execute anything on the GPU but declare a dependency
> >     on all other jobs that are part of the QueueSubmit request, and
> >     signal the out fence (the scheduler would do most of the work for
> >     us, all we have to do is support NULL job heads and signal the
> >     fence directly when that happens instead of queueing the job).  
> 
> I have to admit to being rather hazy on the details of timeline 
> syncobjs, but I thought there was a requirement that the timeline moves 
> monotonically. I.e. if you have multiple jobs signalling the same 
> syncobj just with different points, then AFAIU the API requires that the 
> points are triggered in order.

I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
might be wrong, but my understanding is that queuing fences (addition
of new points in the timeline) should happen in order, but signaling
can happen in any order. When checking for a signaled fence the
fence-chain logic starts from the last point (or from an explicit point
if you use the timeline wait flavor) and goes backward, stopping at the
first un-signaled node. If I'm correct, that means that fences that
are part of a chain can be signaled in any order.

Note that I also considered using a sync file, which has the ability to
merge fences, but that required 3 extra ioctls for each syncobj to merge
(for the export/merge/import round trip), and AFAICT, fences stay around
until the sync file is destroyed, which forces some garbage collection
if we want to keep the number of active fences low. One nice thing
about the fence-chain/syncobj-timeline logic is that signaled fences
are collected automatically when walking the chain.
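
For illustration, the userspace side of that scheme could be as simple
as the sketch below (not code from my PoC, just the libdrm wrapper
around DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT):

#include <stdint.h>
#include <xf86drm.h>

/* Each job of a vkQueueSubmit gets its own point on one timeline
 * syncobj; waiting on the last point then means "all jobs of this
 * submit are done", whatever order they actually completed in.
 */
static int wait_submit_done(int drm_fd, uint32_t timeline, uint64_t last_point)
{
        return drmSyncobjTimelineWait(drm_fd, &timeline, &last_point, 1,
                                      INT64_MAX,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL,
                                      NULL);
}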

> 
> So I'm not sure that you've actually fixed this point - you either need 
> to force an order (in which case the last job can signal the Vulkan 
> fence)

That option requires adding deps that do not really exist on the last
jobs, so I'm not sure I like it.

> or you still need a dummy job to do the many-to-one dependency.

Yeah, that's what I've considered doing before realizing timeline
syncobjs could solve this problem (assuming I got it right :-/).

> 
> Or I may have completely misunderstood timeline syncobjs - definitely a 
> possibility :)

I wouldn't pretend I got it right either :-).

> 
> > 3/ The current implementation lacks information about BO access,
> >     so we serialize all jobs accessing the same set of BOs, even
> >     if those jobs might just be reading from them (which can
> >     happen concurrently). Other drivers pass an access type to the
> >     list of referenced BOs to address that. Another option would be
> >     to disable implicit deps (deps based on BOs) and force the driver
> >     to pass all deps explicitly (interestingly, some drivers have
> >     both the no-implicit-dep and r/w flags, probably to support
> >     sub-resource access, so we might want to add that one too).
> >     I don't see any userspace workaround to that problem, so that one
> >     alone would justify extending the existing ioctl or adding a new
> >     one.  
> 
> Yeah - I think we need this. My only comment is that I think the 
> read/write terminology may come back to bite. Better to use 'shared' and 
> 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> 
> Also the current code completely ignores PANFROST_BO_REF_READ. So either 
> that should be defined as 0, or even better we support 3 modes:
> 
>   * Exclusive ('write' access)
>   * Shared ('read' access)
>   * No fence - ensures the BO is mapped, but doesn't add any implicit 
> fences.
> 
> The last may make sense when doing explicit fences and e.g. doing 
> front-buffer rendering with a display driver which does implicit fencing.

Sounds good to me.

> 
> > 4/ There's also the fact that submitting one job at a time adds an
> >     overhead when QueueSubmit is being passed more than one
> >     CommandBuffer. That one is less problematic, but if we're adding
> >     a new ioctl we'd better design it to limit the userspace -> kernel
> >     transition overhead.  
> 
> I've no objection - but I doubt the performance effect is significant. I 
> was pleased to see the handling of stride which makes the interface 
> extendable. In particular I suspect at some point we're going to want a 
> priority field in some form.
> 
> > Right now I'm just trying to collect feedback. I don't intend to get
> > those patches merged until we have a userspace user, but I thought
> > starting the discussion early would be a good thing.
> > 
> > Feel free to suggest other approaches.  
> 
> Other than the above I didn't see any obvious issues, and I know the 
> Vulkan API is problematic in terms of synchronisation primitives - so if 
> this makes it easier to implement then it seems like a good idea to me.

Thanks a lot for your review.

Regards,

Boris
Jason Ekstrand March 11, 2021, 4:58 p.m. UTC | #3
Hi all,

Dropping in where I may or may not be wanted, so feel free to ignore. :-)

On Thu, Mar 11, 2021 at 7:00 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> Hi Steven,
>
> On Thu, 11 Mar 2021 12:16:33 +0000
> Steven Price <steven.price@arm.com> wrote:
>
> > On 11/03/2021 09:25, Boris Brezillon wrote:
> > > Hello,
> > >
> > > I've been playing with Vulkan lately and struggled quite a bit to
> > > implement VkQueueSubmit with the submit ioctl we have. There are
> > > several limiting factors that can be worked around if we really have to,
> > > but I think it'd be much easier and future-proof if we introduce a new
> > > ioctl that addresses the current limitations:
> >
> > Hi Boris,
> >
> > I think what you've proposed is quite reasonable, some detailed comments
> > to your points below.
> >
> > >
> > > 1/ There can only be one out_sync, but Vulkan might ask us to signal
> > >     several VkSemaphores and possibly one VkFence too, both of those
> > >     being based on sync objects in my PoC. Making out_sync an array of
> > >     syncobjs to attach the render_done fence to would make that possible.
> > >     The other option would be to collect syncobj updates in userspace
> > >     in a separate thread and propagate those updates to all
> > >     semaphores+fences waiting on those events (I think the v3dv driver
> > >     does something like that, but I didn't spend enough time studying
> > >     the code to be sure, so I might be wrong).
> >
> > You should be able to avoid the separate thread to propagate by having a
> > proxy object in user space that maps between the one outsync of the job
> > and the possibly many Vulkan objects. But I've had this argument before
> > with the DDK... and the upshot of it was that the Vulkan API is
> > unnecessarily complex here and makes this really hard to do in practice.
> > So I agree adding this capability to the kernel is likely the best approach.

This is pretty easy to do from userspace.  Just take the out sync_file
and stuff it into each of the syncobjs using
DRM_IOCTL_SYNCOBJ_FD_TO_HANDLE.  That's all the kernel would be doing
under the hood anyway.  Having it built into your submit ioctl does
have the advantage of fewer round-trips to kernel space, though.
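
For reference, with the libdrm wrappers that round-trip looks roughly
like the sketch below (error handling mostly omitted):

#include <stdint.h>
#include <unistd.h>
#include <xf86drm.h>

/* Export the job's out syncobj as a sync_file, then import that
 * sync_file into every semaphore/fence syncobj.
 */
static int propagate_out_fence(int drm_fd, uint32_t job_out_syncobj,
                               const uint32_t *targets, unsigned count)
{
        unsigned i;
        int fd, ret;

        ret = drmSyncobjExportSyncFile(drm_fd, job_out_syncobj, &fd);
        if (ret)
                return ret;

        for (i = 0; i < count; i++) {
                ret = drmSyncobjImportSyncFile(drm_fd, targets[i], fd);
                if (ret)
                        break;
        }

        close(fd);
        return ret;
}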

> > > 2/ Queued jobs might be executed out-of-order (unless they have
> > >     explicit/implicit deps between them), and Vulkan asks that the out
> > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > >     good match for that use case. All we need to do is pass the same
> > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > >     request, but a different point on the timeline. The syncobj
> > >     timeline wait does the rest and guarantees that we've reached a
> > >     given timeline point (IOW, all jobs before that point are done)
> > >     before declaring the fence as signaled.
> > >     One alternative would be to have dummy 'synchronization' jobs that
> > >     don't actually execute anything on the GPU but declare a dependency
> > >     on all other jobs that are part of the QueueSubmit request, and
> > >     signal the out fence (the scheduler would do most of the work for
> > >     us, all we have to do is support NULL job heads and signal the
> > >     fence directly when that happens instead of queueing the job).
> >
> > I have to admit to being rather hazy on the details of timeline
> > syncobjs, but I thought there was a requirement that the timeline moves
> > monotonically. I.e. if you have multiple jobs signalling the same
> > syncobj just with different points, then AFAIU the API requires that the
> > points are triggered in order.
>
> I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> might be wrong, but my understanding is that queuing fences (addition
> of new points in the timeline) should happen in order, but signaling
> can happen in any order. When checking for a signaled fence the
> fence-chain logic starts from the last point (or from an explicit point
> if you use the timeline wait flavor) and goes backward, stopping at the
> first un-signaled node. If I'm correct, that means that fences that
> are part of a chain can be signaled in any order.

You don't even need a timeline for this.  Just have a single syncobj
per-queue and make each submit wait on it and then signal it.
Alternatively, you can just always hang on to the out-fence from the
previous submit and make the next one wait on that.  Timelines are
overkill here, IMO.
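
With panfrost's existing submit ioctl, that per-queue syncobj approach
would look roughly like this (sketch only, not actual driver code):

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include "panfrost_drm.h"   /* existing uapi header */

/* Serialize a queue by making every job wait on the queue's binary
 * syncobj and then signal it again through out_sync.
 */
static int submit_serialized(int drm_fd, uint64_t jc, uint32_t queue_syncobj)
{
        struct drm_panfrost_submit submit;
        uint32_t in_sync = queue_syncobj;

        memset(&submit, 0, sizeof(submit));
        submit.jc = jc;
        submit.in_syncs = (uintptr_t)&in_sync;  /* wait on the previous job */
        submit.in_sync_count = 1;
        submit.out_sync = queue_syncobj;        /* signal for the next one */

        return drmIoctl(drm_fd, DRM_IOCTL_PANFROST_SUBMIT, &submit);
}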

> Note that I also considered using a sync file, which has the ability to
> merge fences, but that required 3 extra ioctls for each syncobj to merge
> (for the export/merge/import round trip), and AFAICT, fences stay around
> until the sync file is destroyed, which forces some garbage collection
> if we want to keep the number of active fences low. One nice thing
> about the fence-chain/syncobj-timeline logic is that signaled fences
> are collected automatically when walking the chain.

Yeah, that's the pain when working with sync files.  This is one of
the reasons why our driver takes an arbitrary number of in/out
syncobjs.

> > So I'm not sure that you've actually fixed this point - you either need
> > to force an order (in which case the last job can signal the Vulkan
> > fence)
>
> > That option requires adding deps that do not really exist on the last
> jobs, so I'm not sure I like it.
>
> > or you still need a dummy job to do the many-to-one dependency.
>
> > Yeah, that's what I've considered doing before realizing timeline
> > syncobjs could solve this problem (assuming I got it right :-/).
>
> >
> > Or I may have completely misunderstood timeline syncobjs - definitely a
> > possibility :)
>
> I wouldn't pretend I got it right either :-).
>
> >
> > > 3/ The current implementation lacks information about BO access,
> > >     so we serialize all jobs accessing the same set of BOs, even
> > >     if those jobs might just be reading from them (which can
> > >     happen concurrently). Other drivers pass an access type to the
> > >     list of referenced BOs to address that. Another option would be
> > >     to disable implicit deps (deps based on BOs) and force the driver
> > >     to pass all deps explicitly (interestingly, some drivers have
> > >     both the no-implicit-dep and r/w flags, probably to support
> > >     sub-resource access, so we might want to add that one too).
> > >     I don't see any userspace workaround to that problem, so that one
> > >     alone would justify extending the existing ioctl or adding a new
> > >     one.
> >
> > Yeah - I think we need this. My only comment is that I think the
> > read/write terminology may come back to bite. Better to use 'shared' and
> > 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> >
> > Also the current code completely ignores PANFROST_BO_REF_READ. So either
> > that should be defined as 0, or even better we support 3 modes:
> >
> >   * Exclusive ('write' access)
> >   * Shared ('read' access)
> >   * No fence - ensures the BO is mapped, but doesn't add any implicit
> > fences.
> >
> > The last may make sense when doing explicit fences and e.g. doing
> > front-buffer rendering with a display driver which does implicit fencing.

This one's really annoying.  TBH, we've still not gotten it right on
Intel, AFAICT.  That is roughly the set of modes you need but you'll
have to watch out for window-system buffers.  RADV and ANV take
slightly different approaches here and they each have their own
problems.  On balance, I'd look at what RADV is doing rather than
ANV because ANV's approach results in some over-synchronization every time you
vkWaitForFences on the WSI fence.  I've got a patch floating round
somewhere that adds some new kernel API to make that case a bit better
but it's a snarly mess.  Sorry for being cryptic but it's a 5-page
e-mail if I type out all the annoying details. (-:

--Jason

> Sounds good to me.
>
> >
> > > 4/ There's also the fact that submitting one job at a time adds an
> > >     overhead when QueueSubmit is being passed more than one
> > >     CommandBuffer. That one is less problematic, but if we're adding
> > >     a new ioctl we'd better design it to limit the userspace -> kernel
> > >     transition overhead.
> >
> > I've no objection - but I doubt the performance effect is significant. I
> > was pleased to see the handling of stride which makes the interface
> > extendable. In particular I suspect at some point we're going to want a
> > priority field in some form.
> >
> > > Right now I'm just trying to collect feedback. I don't intend to get
> > > those patches merged until we have a userspace user, but I thought
> > > starting the discussion early would be a good thing.
> > >
> > > Feel free to suggest other approaches.
> >
> > Other than the above I didn't see any obvious issues, and I know the
> > Vulkan API is problematic in terms of synchronisation primitives - so if
> > this makes it easier to implement then it seems like a good idea to me.
>
> Thanks a lot for your review.
>
> Regards,
>
> Boris
Boris Brezillon March 11, 2021, 5:24 p.m. UTC | #4
Hi Jason,

On Thu, 11 Mar 2021 10:58:46 -0600
Jason Ekstrand <jason@jlekstrand.net> wrote:

> Hi all,
> 
> Dropping in where I may or may not be wanted, so feel free to ignore. :-)

I'm glad you decided to chime in. :-)

 
> > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > >     good match for that use case. All we need to do is pass the same
> > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > >     request, but a different point on the timeline. The syncobj
> > > >     timeline wait does the rest and guarantees that we've reached a
> > > >     given timeline point (IOW, all jobs before that point are done)
> > > >     before declaring the fence as signaled.
> > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > >     don't actually execute anything on the GPU but declare a dependency
> > > >     on all other jobs that are part of the QueueSubmit request, and
> > > >     signal the out fence (the scheduler would do most of the work for
> > > >     us, all we have to do is support NULL job heads and signal the
> > > >     fence directly when that happens instead of queueing the job).  
> > >
> > > I have to admit to being rather hazy on the details of timeline
> > > syncobjs, but I thought there was a requirement that the timeline moves
> > > monotonically. I.e. if you have multiple jobs signalling the same
> > > syncobj just with different points, then AFAIU the API requires that the
> > > points are triggered in order.  
> >
> > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > might be wrong, but my understanding is that queuing fences (addition
> > of new points in the timeline) should happen in order, but signaling
> > can happen in any order. When checking for a signaled fence the
> > fence-chain logic starts from the last point (or from an explicit point
> > if you use the timeline wait flavor) and goes backward, stopping at the
> > first un-signaled node. If I'm correct, that means that fences that
> > are part of a chain can be signaled in any order.  
> 
> You don't even need a timeline for this.  Just have a single syncobj
> per-queue and make each submit wait on it and then signal it.
> Alternatively, you can just always hang on to the out-fence from the
> previous submit and make the next one wait on that.

That's what I have right now, but it forces the serialization of all
jobs that are pushed during a submit (and there can be more than one
per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
be better to not force this serialization if we can avoid it.

> Timelines are overkill here, IMO.

Mind elaborating on why you think this is overkill? After looking at the
kernel implementation I thought using timeline syncobjs would be
pretty cheap compared to the other options I considered.

> 
> > Note that I also considered using a sync file, which has the ability to
> > merge fences, but that required 3 extra ioctls for each syncobj to merge
> > (for the export/merge/import round trip), and AFAICT, fences stay around
> > until the sync file is destroyed, which forces some garbage collection
> > if we want to keep the number of active fences low. One nice thing
> > about the fence-chain/syncobj-timeline logic is that signaled fences
> > are collected automatically when walking the chain.  
> 
> Yeah, that's the pain when working with sync files.  This is one of
> the reasons why our driver takes an arbitrary number of in/out
> syncobjs.
> 
> > > So I'm not sure that you've actually fixed this point - you either need
> > > to force an order (in which case the last job can signal the Vulkan
> > > fence)  
> >
> > That option requires adding deps that do not really exist on the last
> > jobs, so I'm not sure I like it.
> >  
> > > or you still need a dummy job to do the many-to-one dependency.  
> >
> > Yeah, that's what I've considered doing before realizing timeline
> > syncobjs could solve this problem (assuming I got it right :-/).
> >  
> > >
> > > Or I may have completely misunderstood timeline syncobjs - definitely a
> > > possibility :)  
> >
> > I wouldn't pretend I got it right either :-).
> >  
> > >  
> > > > 3/ The current implementation lacks information about BO access,
> > > >     so we serialize all jobs accessing the same set of BOs, even
> > > >     if those jobs might just be reading from them (which can
> > > >     happen concurrently). Other drivers pass an access type to the
> > > >     list of referenced BOs to address that. Another option would be
> > > >     to disable implicit deps (deps based on BOs) and force the driver
> > > >     to pass all deps explicitly (interestingly, some drivers have
> > > >     both the no-implicit-dep and r/w flags, probably to support
> > > >     sub-resource access, so we might want to add that one too).
> > > >     I don't see any userspace workaround to that problem, so that one
> > > >     alone would justify extending the existing ioctl or adding a new
> > > >     one.  
> > >
> > > Yeah - I think we need this. My only comment is that I think the
> > > read/write terminology may come back to bite. Better to use 'shared' and
> > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> > >
> > > Also the current code completely ignores PANFROST_BO_REF_READ. So either
> > > that should be defined as 0, or even better we support 3 modes:
> > >
> > >   * Exclusive ('write' access)
> > >   * Shared ('read' access)
> > >   * No fence - ensures the BO is mapped, but doesn't add any implicit
> > > fences.
> > >
> > > The last may make sense when doing explicit fences and e.g. doing
> > > front-buffer rendering with a display driver which does implicit fencing.  
> 
> This one's really annoying.  TBH, we've still not gotten it right on
> Intel, AFAICT.  That is roughly the set of modes you need but you'll
> have to watch out for window-system buffers.  RADV and ANV take
> slightly different approaches here and they each have their own
> problems.  On the balance, I'd look at what RADV is doing rather than
> ANV because ANV's results in some over-synchronization every time you
> vkWaitForFences on the WSI fence.  I've got a patch floating round
> somewhere that adds some new kernel API to make that case a bit better
> but it's a snarly mess.  Sorry for being cryptic but it's a 5-page
> e-mail if I type out all the annoying details. (-:

Ok, I'll have a look at the RADV driver.

Thanks for your feedback.

Boris
Jason Ekstrand March 11, 2021, 6:11 p.m. UTC | #5
On Thu, Mar 11, 2021 at 11:25 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> Hi Jason,
>
> On Thu, 11 Mar 2021 10:58:46 -0600
> Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> > Hi all,
> >
> > Dropping in where I may or may not be wanted, so feel free to ignore. :-)
>
> I'm glad you decided to chime in. :-)
>
>
> > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > >     good match for that use case. All we need to do is pass the same
> > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > >     request, but a different point on the timeline. The syncobj
> > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > >     before declaring the fence as signaled.
> > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > >     signal the out fence (the scheduler would do most of the work for
> > > > >     us, all we have to do is support NULL job heads and signal the
> > > > >     fence directly when that happens instead of queueing the job).
> > > >
> > > > I have to admit to being rather hazy on the details of timeline
> > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > syncobj just with different points, then AFAIU the API requires that the
> > > > points are triggered in order.
> > >
> > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > might be wrong, but my understanding is that queuing fences (addition
> > > of new points in the timeline) should happen in order, but signaling
> > > can happen in any order. When checking for a signaled fence the
> > > fence-chain logic starts from the last point (or from an explicit point
> > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > first un-signaled node. If I'm correct, that means that fences that
> > > are part of a chain can be signaled in any order.
> >
> > You don't even need a timeline for this.  Just have a single syncobj
> > per-queue and make each submit wait on it and then signal it.
> > Alternatively, you can just always hang on to the out-fence from the
> > previous submit and make the next one wait on that.
>
> That's what I have right now, but it forces the serialization of all
> jobs that are pushed during a submit (and there can be more than one
> per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> be better to not force this serialization if we can avoid it.

I'm not familiar with panfrost's needs and I don't work on a tiler and
I know there are different issues there.  But...

The Vulkan spec requires that all the submits that happen on a given
vkQueue happen in-order.  Search the spec for
"Submission order" for more details.

So, generally speaking, there are some in-order requirements there.
Again, not having a lot of tiler experience, I'm not the one to weigh
in.

> > Timelines are overkill here, IMO.
>
> Mind developing why you think this is overkill? After looking at the
> kernel implementation I thought using timeline syncobjs would be
> pretty cheap compared to the other options I considered.

If you use a regular syncobj, every time you wait on it it inserts a
dependency between the current submit and the last thing to signal it
on the CPU timeline.  The internal dma_fences will hang around
as-needed to ensure those dependencies.  If you use a timeline, you
have to also track a uint64_t to reference the current time point.
That may work if you need to sync a bunch of in-flight stuff in one
go, but if you're trying to serialize, it's just extra tracking for
no benefit.  Again, maybe there's something I'm missing and
you don't actually want to serialize.

--Jason


> >
> > > Note that I also considered using a sync file, which has the ability to
> > > merge fences, but that required 3 extra ioctls for each syncobj to merge
> > > (for the export/merge/import round trip), and AFAICT, fences stay around
> > > until the sync file is destroyed, which forces some garbage collection
> > > if we want to keep the number of active fences low. One nice thing
> > > about the fence-chain/syncobj-timeline logic is that signaled fences
> > > are collected automatically when walking the chain.
> >
> > Yeah, that's the pain when working with sync files.  This is one of
> > the reasons why our driver takes an arbitrary number of in/out
> > syncobjs.
> >
> > > > So I'm not sure that you've actually fixed this point - you either need
> > > > to force an order (in which case the last job can signal the Vulkan
> > > > fence)
> > >
> > > That option requires adding deps that do not really exist on the last
> > > jobs, so I'm not sure I like it.
> > >
> > > > or you still need a dummy job to do the many-to-one dependency.
> > >
> > > Yeah, that's what I've considered doing before realizing timeline
> > > syncobjs could solve this problem (assuming I got it right :-/).
> > >
> > > >
> > > > Or I may have completely misunderstood timeline syncobjs - definitely a
> > > > possibility :)
> > >
> > > I wouldn't pretend I got it right either :-).
> > >
> > > >
> > > > > 3/ The current implementation lacks information about BO access,
> > > > >     so we serialize all jobs accessing the same set of BOs, even
> > > > >     if those jobs might just be reading from them (which can
> > > > >     happen concurrently). Other drivers pass an access type to the
> > > > >     list of referenced BOs to address that. Another option would be
> > > > >     to disable implicit deps (deps based on BOs) and force the driver
> > > > >     to pass all deps explicitly (interestingly, some drivers have
> > > > >     both the no-implicit-dep and r/w flags, probably to support
> > > > >     sub-resource access, so we might want to add that one too).
> > > > >     I don't see any userspace workaround to that problem, so that one
> > > > >     alone would justify extending the existing ioctl or adding a new
> > > > >     one.
> > > >
> > > > Yeah - I think we need this. My only comment is that I think the
> > > > read/write terminology may come back to bite. Better to use 'shared' and
> > > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> > > >
> > > > Also the current code completely ignores PANFROST_BO_REF_READ. So either
> > > > that should be defined as 0, or even better we support 3 modes:
> > > >
> > > >   * Exclusive ('write' access)
> > > >   * Shared ('read' access)
> > > >   * No fence - ensures the BO is mapped, but doesn't add any implicit
> > > > fences.
> > > >
> > > > The last may make sense when doing explicit fences and e.g. doing
> > > > front-buffer rendering with a display driver which does implicit fencing.
> >
> > This one's really annoying.  TBH, we've still not gotten it right on
> > Intel, AFAICT.  That is roughly the set of modes you need but you'll
> > have to watch out for window-system buffers.  RADV and ANV take
> > slightly different approaches here and they each have their own
> > problems.  On the balance, I'd look at what RADV is doing rather than
> > ANV because ANV's results in some over-synchronization every time you
> > vkWaitForFences on the WSI fence.  I've got a patch floating round
> > somewhere that adds some new kernel API to make that case a bit better
> > but it's a snarly mess.  Sorry for being cryptic but it's a 5-page
> > e-mail if I type out all the annoying details. (-:
>
> Ok, I'll have a look at the RADV driver.
>
> Thanks for your feedback.
>
> Boris
Alyssa Rosenzweig March 11, 2021, 10:38 p.m. UTC | #6
> I'm not familiar with panfrost's needs and I don't work on a tiler and
> I know there are different issues there.  But...

The primary issue is we submit vertex+compute and fragment for each
batch as two disjoint jobs (with a dependency of course), reflecting the
internal hardware structure as parallel job slots. That we actually
require two ioctls() and a roundtrip for this is a design wart inherited
from early days of the kernel driver. The downstream Mali driver handles
this by allowing multiple jobs to be submitted with a single ioctl, as
Boris's patch enables. In every other respect I believe our needs are
similar to other renderonly drivers. (What does turnip do?)
Boris Brezillon March 12, 2021, 7:31 a.m. UTC | #7
On Thu, 11 Mar 2021 12:11:48 -0600
Jason Ekstrand <jason@jlekstrand.net> wrote:

> > > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > > >     good match for that use case. All we need to do is pass the same
> > > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > > >     request, but a different point on the timeline. The syncobj
> > > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > > >     before declaring the fence as signaled.
> > > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > > >     signal the out fence (the scheduler would do most of the work for
> > > > > >     us, all we have to do is support NULL job heads and signal the
> > > > > >     fence directly when that happens instead of queueing the job).  
> > > > >
> > > > > I have to admit to being rather hazy on the details of timeline
> > > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > > syncobj just with different points, then AFAIU the API requires that the
> > > > > points are triggered in order.  
> > > >
> > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > > might be wrong, but my understanding is that queuing fences (addition
> > > > of new points in the timeline) should happen in order, but signaling
> > > > can happen in any order. When checking for a signaled fence the
> > > > fence-chain logic starts from the last point (or from an explicit point
> > > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > > first un-signaled node. If I'm correct, that means that fences that
> > > > are part of a chain can be signaled in any order.  
> > >
> > > You don't even need a timeline for this.  Just have a single syncobj
> > > per-queue and make each submit wait on it and then signal it.
> > > Alternatively, you can just always hang on to the out-fence from the
> > > previous submit and make the next one wait on that.  
> >
> > That's what I have right now, but it forces the serialization of all
> > jobs that are pushed during a submit (and there can be more than one
> > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> > be better to not force this serialization if we can avoid it.  
> 
> I'm not familiar with panfrost's needs and I don't work on a tiler and
> I know there are different issues there.  But...
> 
> The Vulkan spec requires that all the submits that happen on a given
> vkQueue happen in-order.  Search the spec for
> "Submission order" for more details.

Duh, looks like I completely overlooked the "Submission order"
guarantees. This being said, even after reading this chapter multiple
times I'm not sure what kind of guarantee this gives us, given the
execution itself can be out-of-order. My understanding is that
submission order matters for implicit deps, say you have 2 distinct
VkSubmitInfo, the first one (in submission order) writing to a buffer
and the second one reading from it, you really want the first one to
be submitted first and the second one to wait on the implicit BO fence
created by the first one. If we were to submit out-of-order, this
guarantee wouldn't be met. OTOH, if we have 2 completely independent
submits, I don't really see what submission order gives us if execution
is out-of-order.

In our case, the kernel driver takes care of the submission
serialization (gathering implicit and explicit deps, queuing the job and
assigning the "done" fence to the output sync objects). Once things
are queued, it's the scheduler (drm_sched) deciding the execution
order.

> 
> So, generally speaking, there are some in-order requirements there.
> Again, not having a lot of tiler experience, I'm not the one to weigh
> in.
> 
> > > Timelines are overkill here, IMO.  
> >
> > Mind developing why you think this is overkill? After looking at the
> > kernel implementation I thought using timeline syncobjs would be
> > pretty cheap compared to the other options I considered.  
> 
> If you use a regular syncobj, every time you wait on it it inserts a
> dependency between the current submit and the last thing to signal it
> on the CPU timeline.  The internal dma_fences will hang around
> as-needed to ensure those dependencies.  If you use a timeline, you
> have to also track a uint64_t to reference the current time point.
> > That may work if you need to sync a bunch of in-flight stuff in one
> > go, but if you're trying to serialize, it's just extra tracking for
> > no benefit.  Again, maybe there's something I'm missing and
> you don't actually want to serialize.

My understanding (and I might very much be wrong here) is that using a
regular syncobj to do this actually enforces not only the submission
order but also the execution order (each job waiting on the previous
one to complete before being scheduled). The idea of the timeline
syncobj approach is that jobs that have no interdependencies can be
started in any order, the scheduler picking the first whose deps are
ready (which might not necessarily match submission order). The
timeline syncobj allows us to have one point per kernel submission and
eventually wait on the last point for the fence passed to
vkQueueSubmit(), and some specific point on the timeline for
pSignalSemaphores entries.

What's more challenging is signal operation ordering:

"
Signal operation order is a fundamental ordering in Vulkan, giving
meaning to the order in which semaphore and fence signal operations
occur when submitted to a single queue.



1.  The initial order is determined by the order in which
    vkQueueSubmit commands are executed on the host, for a single
    queue, from first to last.

2.  The order in which VkSubmitInfo structures are specified in the
    pSubmits parameter of vkQueueSubmit, from lowest index to highest.

3.  The fence signal operation defined by the fence parameter of a
    vkQueueSubmit or vkQueueBindSparse command is ordered after all
    semaphore signal operations defined by that command.
"

This means we have to add implicit dependencies on the signaling
itself. We have two options to guarantee that:

1/ Transfer one of the queue syncobj timeline points to each semaphore
   and fence after job submission (the point itself being dependent
   on the position of the submit entry in the array for semaphores, and
   the last point for the fence). The problem with this approach is
   that we now have an extra SYNCOBJ_TRANSFER call per semaphore/fence
   (see the sketch below).

2/ Add SYNC jobs (jobs that do not actually execute on the GPU, but
   serve as a synchronization point) whose responsibility would be to
   do this muxing/transfer as part of the batch submission process.
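
For reference, option 1/ would boil down to something like the sketch
below on the userspace side (using the libdrm wrapper around
DRM_IOCTL_SYNCOBJ_TRANSFER, not code from my PoC):

#include <stdint.h>
#include <xf86drm.h>

/* Copy a point of the queue timeline into a binary semaphore/fence
 * syncobj. Each pSignalSemaphores entry would get the point of its
 * VkSubmitInfo, and the VkFence the last point of the whole
 * vkQueueSubmit.
 */
static int transfer_queue_point(int drm_fd, uint32_t dst_syncobj,
                                uint32_t queue_timeline, uint64_t point)
{
        /* dst_point = 0: the destination is used as a binary syncobj. */
        return drmSyncobjTransfer(drm_fd, dst_syncobj, 0,
                                  queue_timeline, point, 0);
}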
Boris Brezillon March 12, 2021, 2:15 p.m. UTC | #8
On Thu, 11 Mar 2021 12:16:33 +0000
Steven Price <steven.price@arm.com> wrote:

> Also the current code completely ignores PANFROST_BO_REF_READ. So either 
> that should be defined as 0, or even better we support 3 modes:
> 
>   * Exclusive ('write' access)
>   * Shared ('read' access)
>   * No fence - ensures the BO is mapped, but doesn't add any implicit 
> fences.

I looked at other drivers and they seem to have this
NO_FENCE/NO_IMPLICIT flag at the job/submit level and conditioned on
the presence of an in_sync_fd file. I can see per-BO NO_FENCE being
useful for sub-buffer accesses, assuming we don't want to deal with
other deps explicitly, but we could also force the NO_IMPLICIT behavior
for the whole job and let the user deal with all deps explicitly. TBH,
it's a bit unclear to me how that feature would be used by the Vulkan
driver, so I'd rather leave that NO_FENCE use case for later.
Jason Ekstrand March 12, 2021, 3:37 p.m. UTC | #9
On Fri, Mar 12, 2021 at 1:31 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> On Thu, 11 Mar 2021 12:11:48 -0600
> Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > > > >     good match for that use case. All we need to do is pass the same
> > > > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > > > >     request, but a different point on the timeline. The syncobj
> > > > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > > > >     before declaring the fence as signaled.
> > > > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > > > >     signal the out fence (the scheduler would do most of the work for
> > > > > > >     us, all we have to do is support NULL job heads and signal the
> > > > > > >     fence directly when that happens instead of queueing the job).
> > > > > >
> > > > > > I have to admit to being rather hazy on the details of timeline
> > > > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > > > syncobj just with different points, then AFAIU the API requires that the
> > > > > > points are triggered in order.
> > > > >
> > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > > > might be wrong, but my understanding is that queuing fences (addition
> > > > > of new points in the timeline) should happen in order, but signaling
> > > > > can happen in any order. When checking for a signaled fence the
> > > > > fence-chain logic starts from the last point (or from an explicit point
> > > > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > > > first un-signaled node. If I'm correct, that means that fences that
> > > > > are part of a chain can be signaled in any order.
> > > >
> > > > You don't even need a timeline for this.  Just have a single syncobj
> > > > per-queue and make each submit wait on it and then signal it.
> > > > Alternatively, you can just always hang on to the out-fence from the
> > > > previous submit and make the next one wait on that.
> > >
> > > That's what I have right now, but it forces the serialization of all
> > > jobs that are pushed during a submit (and there can be more than one
> > > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> > > be better to not force this serialization if we can avoid it.
> >
> > I'm not familiar with panfrost's needs and I don't work on a tiler and
> > I know there are different issues there.  But...
> >
> > The Vulkan spec requires that all the submits that happen on a given
> > vkQueue happen in-order.  Search the spec for
> > "Submission order" for more details.
>
> Duh, looks like I completely occulted the "Submission order"
> guarantees. This being said, even after reading this chapter multiple
> times I'm not sure what kind of guarantee this gives us, given the
> execution itself can be out-of-order. My understanding is that
> submission order matters for implicit deps, say you have 2 distinct
> VkSubmitInfo, the first one (in submission order) writing to a buffer
> and the second one reading from it, you really want the first one to
> be submitted first and the second one to wait on the implicit BO fence
> created by the first one. If we were to submit out-of-order, this
> guarantee wouldn't be met. OTOH, if we have 2 completely independent
> submits, I don't really see what submission order gives us if execution
> is out-of-order.

Right, this is where things get sticky.  What's guaranteed there is
submission order not execution order.  But, sadly, there's more hidden
in there than you might think.  Before we can go there, though, we
need to establish a few details of how Mali hardware works.

My understanding (feel free to correct me if I'm wrong) is that,
roughly, Mali has two engines which have to work together to render:
One engine for compute/vertex/geometry and one for binning and
fragment.  When you go to submit a draw, you fire off any geometry
work on the compute engine and the fragment work on the binning
engine.  At a tile flush boundary (or renderpass in Vulkan), you
insert a dependency between the geometry work you put on the compute
engine and the binning/fragment work.  In a GL driver, you'd probably
kick it off to the kernel at this point.

In Vulkan, things are going to get more tricky.  You can't just kick
off to the kernel at tile flush boundaries.  You also can't assume
that all vertex work within a given command buffer is independent of
all the fragment work.  Let's say you have a sequence like this:

vkBeginRenderPass() /* Writes to ImageA */
vkCmdDraw()
vkCmdDraw()
...
vkEndRenderPass()
vkPipelineBarrier(imageA /* fragment -> compute */)
vkCmdDispatch() /* reads imageA, writes BufferB */
vkBeginRenderPass() /* Writes to ImageC */
vkCmdBindVertexBuffers(bufferB)
vkCmdDraw();
...
vkEndRenderPass()

Now you have, in a single command buffer, a draw which writes to an
image which is read by a compute shader which writes to a buffer which
is read by vertex fetch for another draw.  The only way to implement
this on a Mali like device is to ping-pong back and forth between the
compute and binning engines.  It'd look something like this (assuming
my mental model):

A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F

There are a couple of options for how to do this:

 1. Each VkCommandBuffer is actually a bunch of command buffers for
each engine with dependencies.  vkQueueSubmit() calls the kernel
submit ioctl a brezillion times to submit them all.

 2. Same as 1, only the submit ioctl takes an array of things with
dependencies so that you don't have to do as many round-trips to
kernel space.

 3. Each VkCommandBuffer is two command buffers: one for compute and
one for binning, and you use some sort of HW synchronization mechanism
to handle the dependencies as you ping-pong between them.

Option 1 is the easiest to get going with if you're working on a
kernel driver that was designed for OpenGL but it has the obvious
drawback of lots of smaller command buffers and lots of submits.  It's
not going to scale well.  Option 3 is nice if the hardware can do it
(I have no idea if Mali can).
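
To make option 2 a bit more concrete, the uAPI could look roughly like
the sketch below.  Every name in it is invented purely for illustration;
the point is just the "array of jobs, each with its own deps and BO
list" idea:

/* Illustrative only -- none of these structs or ioctls exist as-is. */
#include <linux/types.h>

struct fake_job_desc {
        __u64 head;             /* GPU address of the job chain */
        __u64 in_syncs;         /* userspace array of syncobjs to wait on */
        __u32 in_sync_count;
        __u32 out_sync;         /* syncobj signaled when this job is done */
        __u64 bo_handles;       /* BOs referenced by this job */
        __u32 bo_handle_count;
        __u32 flags;            /* e.g. "this is a fragment job" */
};

struct fake_batch_submit {
        __u64 jobs;             /* userspace pointer to fake_job_desc[] */
        __u32 job_count;
        __u32 pad;
};

One vkQueueSubmit() then turns into a single ioctl carrying the whole
A..G graph, with the dependencies expressed through in_syncs/out_sync
instead of N round-trips.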

Sorry if that's all a re-hash of stuff you already know.  I'm just
trying to establish a baseline here so we're all talking about the
same things.

Ok, so what does this have to do with submission order in the spec?
The mental model of the Vulkan spec is that you have queues and you
submit commands to those queues.  Command buffers are just big
sequences of commands.  Those commands kick off work.  The kick-off
happens in-order but completion is out-of-order.  There are then
pipeline barriers which allow you to specify dependencies between
those bits of work such as "future compute work depends on previous
fragment work".  Importantly, those dependencies can even apply across
command buffers.
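
To make that concrete at the API level, the "fragment -> compute"
barrier from the sequence above is recorded right after
vkEndRenderPass(), inside the command buffer (cmdbuf and imageA are
placeholders, and only the image barrier is shown):

#include <vulkan/vulkan.h>

static void record_fragment_to_compute_barrier(VkCommandBuffer cmdbuf,
                                               VkImage imageA)
{
        VkImageMemoryBarrier barrier = {
                .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
                .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
                .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
                .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
                .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
                .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
                .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
                .image = imageA,
                .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
        };

        /* "Future compute work depends on previous fragment work on
         * imageA"; nothing here says anything about submit boundaries.
         */
        vkCmdPipelineBarrier(cmdbuf,
                             VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             0, 0, NULL, 0, NULL, 1, &barrier);
}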

So where does this leave us?  Well, it depends on your submit model
and exactly how you handle pipeline barriers that sync between
engines.  If you're taking option 3 above and doing two command
buffers for each VkCommandBuffer, then you probably want two
serialized timelines, one for each engine, and some mechanism to tell
the kernel driver "these two command buffers have to run in parallel"
so that your ping-pong works.  If you're doing 1 or 2 above, I think
you probably still want two simple syncobjs, one for each engine.  You
don't really have any need to go all that far back in history.  All
you really need to describe is "command buffer X depends on previous
compute work" or "command buffer X depends on previous binning work".
As long as your multi-submit ioctl processes the command buffers
in-order, you can do it all with two syncobjs.
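
In pseudo-code (submit_job() and its signature are completely made up,
only the bookkeeping matters), the userspace side stays tiny:

#include <stdint.h>

/* Hypothetical helper: submit one job chain, waiting on 'in_syncobj'
 * and replacing the fence in 'out_syncobj' when the job completes.
 */
void submit_job(int fd, uint64_t job_chain,
                uint32_t in_syncobj, uint32_t out_syncobj);

struct queue_syncs {
        uint32_t compute;       /* last compute/vertex job on this queue */
        uint32_t binning;       /* last binning/fragment job on this queue */
};

/* The "E" job from the example above: compute work consuming what the
 * previous fragment work produced.
 */
static void submit_compute_after_fragment(int fd, struct queue_syncs *q,
                                          uint64_t job_chain)
{
        submit_job(fd, job_chain,
                   q->binning,   /* wait on previous binning work */
                   q->compute);  /* becomes "previous compute work" */
}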

Sorry for the tome.  I hope it wasn't too wrong and was at least a
little helpful.

--Jason

> In our case, the kernel driver takes care of the submission
> serialization (gathering implicit and explicit deps, queuing the job and
> assigning the "done" fence to the output sync objects). Once things
> are queued, it's the scheduler (drm_sched) that decides the execution
> order.
>
> >
> > So, generally speaking, there are some in-order requirements there.
> > Again, not having a lot of tiler experience, I'm not the one to weigh
> > in.
> >
> > > > Timelines are overkill here, IMO.
> > >
> > > Mind developing why you think this is overkill? After looking at the
> > > kernel implementation I thought using timeline syncobjs would be
> > > pretty cheap compared to the other options I considered.
> >
> > If you use a regular syncobj, every time you wait on it, it inserts a
> > dependency between the current submit and the last thing to signal it
> > on the CPU timeline.  The internal dma_fences will hang around
> > as-needed to ensure those dependencies.  If you use a timeline, you
> > also have to track a uint64_t to reference the current time point.
> > If you need to sync a bunch of in-flight stuff at one go, that may
> > work, but if you're trying to serialize, it's just extra tracking for
> > no point.  Again, maybe there's something I'm missing and
> > you don't actually want to serialize.
>
> My understanding (and I might very much be wrong here) is that using a
> regular syncobj to do this actually enforces not only the submission
> order but also the execution order (each job waiting on the previous
> one to complete before being scheduled). The idea of the timeline
> syncobj approach is that jobs that have no inter-dependencies can be
> started in any order, the scheduler picking the first whose deps are
> ready (which might not necessarily match submission order). The
> timeline syncobj allows us to have one point per kernel-submission and
> eventually wait on the last point for the fence passed to
> vkQueueSubmit(), and some specific point on the timeline for
> pSignalSemaphores entries.
>
> What's more challenging is signal operation ordering:
>
> "
> Signal operation order is a fundamental ordering in Vulkan, giving
> meaning to the order in which semaphore and fence signal operations
> occur when submitted to a single queue.
>
>
>
> 1.  The initial order is determined by the order in which
>     vkQueueSubmit commands are executed on the host, for a single
>     queue, from first to last.
>
> 2.  The order in which VkSubmitInfo structures are specified in the
>     pSubmits parameter of vkQueueSubmit, from lowest index to highest.
>
> 3.  The fence signal operation defined by the fence parameter of a
>     vkQueueSubmit or vkQueueBindSparse command is ordered after all
>     semaphore signal operations defined by that command.
> "
>
> This means we have to add implicit dependencies on the signaling
> itself. We have two options to guarantee that:
>
> 1/ Transfer one of the queue syncobj timeline points to each semaphore
>    and fence after job submission (the point itself being dependent
>    on the position of the submit entry in the array for semaphores, and
>    the last point for the fence). Problem with this approach is that we
>    now have an extra TRANSFER_SYNCOBJ call per semaphore/fence
>
> 2/ Add SYNC jobs (jobs that do not actually execute on the GPU, but
>    serve as a synchronization point) whose responsibility would be to
>    do this muxing/transfer as part of the batch submission process.
>
Boris Brezillon March 12, 2021, 6:25 p.m. UTC | #10
On Fri, 12 Mar 2021 09:37:49 -0600
Jason Ekstrand <jason@jlekstrand.net> wrote:

> On Fri, Mar 12, 2021 at 1:31 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > On Thu, 11 Mar 2021 12:11:48 -0600
> > Jason Ekstrand <jason@jlekstrand.net> wrote:
> >  
> > > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > > > > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > > > > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > > > > > >     good match for that use case. All we need to do is pass the same
> > > > > > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > > > > > >     request, but a different point on the timeline. The syncobj
> > > > > > > >     timeline wait does the rest and guarantees that we've reached a
> > > > > > > >     given timeline point (IOW, all jobs before that point are done)
> > > > > > > >     before declaring the fence as signaled.
> > > > > > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > > > > > >     don't actually execute anything on the GPU but declare a dependency
> > > > > > > >     on all other jobs that are part of the QueueSubmit request, and
> > > > > > > >     signal the out fence (the scheduler would do most of the work for
> > > > > > > >     us, all we have to do is support NULL job heads and signal the
> > > > > > > >     fence directly when that happens instead of queueing the job).  
> > > > > > >
> > > > > > > I have to admit to being rather hazy on the details of timeline
> > > > > > > syncobjs, but I thought there was a requirement that the timeline moves
> > > > > > > monotonically. I.e. if you have multiple jobs signalling the same
> > > > > > > syncobj just with different points, then AFAIU the API requires that the
> > > > > > > points are triggered in order.  
> > > > > >
> > > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > > > > > might be wrong, but my understanding is that queuing fences (addition
> > > > > > of new points in the timeline) should happen in order, but signaling
> > > > > > can happen in any order. When checking for a signaled fence the
> > > > > > fence-chain logic starts from the last point (or from an explicit point
> > > > > > if you use the timeline wait flavor) and goes backward, stopping at the
> > > > > > first un-signaled node. If I'm correct, that means that fences that
> > > > > > are part of a chain can be signaled in any order.  
> > > > >
> > > > > You don't even need a timeline for this.  Just have a single syncobj
> > > > > per-queue and make each submit wait on it and then signal it.
> > > > > Alternatively, you can just always hang on to the out-fence from the
> > > > > previous submit and make the next one wait on that.  
> > > >
> > > > That's what I have right now, but it forces the serialization of all
> > > > jobs that are pushed during a submit (and there can be more than one
> > > > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
> > > > be better to not force this serialization if we can avoid it.  
> > >
> > > I'm not familiar with panfrost's needs and I don't work on a tiler and
> > > I know there are different issues there.  But...
> > >
> > > The Vulkan spec requires that all the submits that happen on a given
> > > vkQueue happen in-order.  Search the spec for "Submission order" for
> > > more details.
> >
> > Duh, looks like I completely overlooked the "Submission order"
> > guarantees. This being said, even after reading this chapter multiple
> > times I'm not sure what kind of guarantee this gives us, given the
> > execution itself can be out-of-order. My understanding is that
> > submission order matters for implicit deps, say you have 2 distinct
> > VkSubmitInfo, the first one (in submission order) writing to a buffer
> > and the second one reading from it, you really want the first one to
> > be submitted first and the second one to wait on the implicit BO fence
> > created by the first one. If we were to submit out-of-order, this
> > guarantee wouldn't be met. OTOH, if we have 2 completely independent
> > submits, I don't really see what submission order gives us if execution
> > is out-of-order.  
> 
> Right, this is where things get sticky.  What's guaranteed there is
> submission order, not execution order.  But, sadly, there's more hidden
> in there than you might think.  Before we can go there, though, we
> need to establish a few details of how the Mali hardware works.
> 
> My understanding (feel free to correct me if I'm wrong) is that,
> roughly, Mali has two engines which have to work together to render:
> One engine for compute/vertex/geometry and one for binning and
> fragment.  When you go to submit a draw, you fire off any geometry
> work on the compute engine and the fragment work on the binning
> engine.  At a tile flush boundary (or renderpass in Vulkan), you
> insert a dependency between the geometry work you put on the compute
> engine and the binning/fragment work.  In a GL driver, you'd probably
> kick it off to the kernel at this point.
> 
> In Vulkan, things are going to get more tricky.  You can't just kick
> off to the kernel at tile flush boundaries.  You also can't assume
> that all vertex work within a given command buffer is independent of
> all the fragment work.  Let's say you have a sequence like this:
> 
> vkBeginRenderPass() /* Writes to ImageA */
> vkCmdDraw()
> vkCmdDraw()
> ...
> vkEndRenderPass()
> vkPipelineBarrier(imageA /* fragment -> compute */)
> vkCmdDispatch() /* reads imageA, writes BufferB */
> vkBeginRenderPass() /* Writes to ImageC */
> vkCmdBindVertexBuffers(bufferB)
> vkCmdDraw();
> ...
> vkEndRenderPass()
> 
> Now you have, in a single command buffer, a draw which writes to an
> image which is read by a compute shader which writes to a buffer which
> is read by vertex fetch for another draw.  The only way to implement
> this on a Mali-like device is to ping-pong back and forth between the
> compute and binning engines.  It'd look something like this (assuming
> my mental model is right):
> 
> A: Vertex for the first draw on the compute engine
> B: Vertex for the second draw on the compute engine
> C: Fragment for the first draw on the binning engine; depends on A
> D: Fragment for the second draw on the binning engine; depends on B
> E: Compute on the compute engine; depends on C and D
> F: Vertex for the third draw on the compute engine; depends on E
> G: Fragment for the third draw on the binning engine; depends on F
> 
> There are a couple of options for how to do this:
> 
>  1. Each VkCommandBuffer is actually a bunch of command buffers for
> each engine with dependencies.  vkQueueSubmit() calls the kernel
> submit ioctl a brezillion times to submit them all.

This is what we currently have.

> 
>  2. Same as 1, only the submit ioctl takes an array of things with
> dependencies so that you don't have to do as many round-trips to
> kernel space.


This is the PoC I worked on.

> 
>  3. Each VkCommandBuffer is two command buffers: one for compute and
> one for binning, and you use some sort of HW synchronization mechanism
> to handle the dependencies as you ping-pong between them.

I didn't consider that option. We have a DOORBELL instruction on Bifrost
to wake up the CPU when the GPU wants to report something (not even
sure we have something equivalent on Midgard), but there's no
inter-queue sync mechanism AFAICT.

> 
> Option 1 is the easiest to get going with if you're working on a
> kernel driver that was designed for OpenGL but it has the obvious
> drawback of lots of smaller command buffers and lots of submits.  It's
> not going to scale well.  Option 3 is nice if the hardware can do it
> (I have no idea if Mali can).
> 
> Sorry if that's all a re-hash of stuff you already know.  I'm just
> trying to establish a baseline here so we're all talking about the
> same things.
> 
> Ok, so what does this have to do with submission order in the spec?
> The mental model of the Vulkan spec is that you have queues and you
> submit commands to those queues.  Command buffers are just big
> sequences of commands.  Those commands kick off work.  The kick-off
> happens in-order but completion is out-of-order.  There are then
> pipeline barriers which allow you to specify dependencies between
> those bits of work such as "future compute work depends on previous
> fragment work".  Importantly, those dependencies can even apply across
> command buffers.
> 
> So where does this leave us?  Well, it depends on your submit model
> and exactly how you handle pipeline barriers that sync between
> engines.  If you're taking option 3 above and doing two command
> buffers for each VkCommandBuffer, then you probably want two
> serialized timelines, one for each engine, and some mechanism to tell
> the kernel driver "these two command buffers have to run in parallel"
> so that your ping-pong works.  If you're doing 1 or 2 above, I think
> you probably still want two simple syncobjs, one for each engine.  You
> don't really have any need to go all that far back in history.  All
> you really need to describe is "command buffer X depends on previous
> compute work" or "command buffer X depends on previous binning work".

Okay, so this will effectively force in-order execution. Let's take your
previous example and add 2 more jobs at the end that have no deps on
previous commands:

vkBeginRenderPass() /* Writes to ImageA */
vkCmdDraw()
vkCmdDraw()
...
vkEndRenderPass()
vkPipelineBarrier(imageA /* fragment -> compute */)
vkCmdDispatch() /* reads imageA, writes BufferB */
vkBeginRenderPass() /* Writes to ImageC */
vkCmdBindVertexBuffers(bufferB)
vkCmdDraw();
...
vkEndRenderPass()
vkBeginRenderPass() /* Writes to ImageD */
vkCmdDraw()
...
vkEndRenderPass()

A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F
H: Vertex for the fourth draw on the compute engine
I: Fragment for the fourth draw on the binning engine

When we reach E, we might be waiting for D to finish before scheduling
the job, and because of the implicit serialization we have on the
compute queue (F implicitly depends on E, and H on F) we can't schedule
H either, which could, in theory, be started. I guess that's where the
term submission order is a bit unclear to me. The action of starting a
job sounds like execution order to me (the order you start jobs
determines the execution order since we only have one HW queue per job
type). All implicit deps have been calculated when we queued the job to
the SW queue, and I thought that would be enough to meet the submission
order requirements, but I might be wrong.

The PoC I have was trying to get rid of this explicit serialization on
the compute and fragment queues by having one syncobj timeline
(queue(<syncpoint>)) and synchronization points (Sx).

S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point
A: in-fences=<explicit_deps>, out-fences=<queue(1)>
B: in-fences=<explicit_deps>, out-fences=<queue(2)>
C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context
D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context
E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA
F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B
G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context
H: in-fences=<explicit_deps>, out-fence=<queue(8)>
I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer
S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point
# QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point

With this solution H can be started before E if the compute slot
is empty and E's implicit deps are not done. It's probably overkill,
but I thought maximizing GPU utilization was important.
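
For reference, everything around the submit itself can reuse the
existing syncobj uAPI; here is a rough sketch of two of the host-side
pieces (error handling omitted, and this is just one way my PoC could do
it, not necessarily how it should be done):

#include <stdint.h>
#include <xf86drm.h>    /* drmIoctl() + the drm_syncobj_* uAPI structs */

/* vkQueueWaitIdle(): wait for the last point submitted on the queue
 * timeline ('last_point' being the point assigned to the last job).
 */
static void queue_wait_idle(int fd, uint32_t queue_syncobj,
                            uint64_t last_point)
{
        struct drm_syncobj_timeline_wait wait = {
                .handles = (uintptr_t)&queue_syncobj,
                .points = (uintptr_t)&last_point,
                .timeout_nsec = INT64_MAX,
                .count_handles = 1,
                .flags = DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL,
        };

        drmIoctl(fd, DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT, &wait);
}

/* The TRANSFER_SYNCOBJ alternative discussed earlier in the thread (if
 * the S0/S1 sync-point jobs are not used): materialize a binary
 * VkSemaphore/VkFence payload by transferring a timeline point to it.
 */
static void signal_binary_syncobj(int fd, uint32_t queue_syncobj,
                                  uint64_t point, uint32_t dst_syncobj)
{
        struct drm_syncobj_transfer xfer = {
                .src_handle = queue_syncobj,
                .dst_handle = dst_syncobj,
                .src_point = point,
                .dst_point = 0, /* binary syncobj */
        };

        drmIoctl(fd, DRM_IOCTL_SYNCOBJ_TRANSFER, &xfer);
}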

> As long as your multi-submit ioctl processes the command buffers
> in-order, you can do it all with two syncobjs.
> 
> Sorry for the tome.  I hope it wasn't too wrong and was at least a
> little helpful.

It definitely helps, even if the concept of submission order is still
unclear to me, especially when applied to GPUs that have a single HW
queue, and where submission and execution order could be considered the
same thing.
Boris Brezillon March 12, 2021, 8:06 p.m. UTC | #11
On Fri, 12 Mar 2021 19:25:13 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> > So where does this leave us?  Well, it depends on your submit model
> > and exactly how you handle pipeline barriers that sync between
> > engines.  If you're taking option 3 above and doing two command
> > buffers for each VkCommandBuffer, then you probably want two
> > serialized timelines, one for each engine, and some mechanism to tell
> > the kernel driver "these two command buffers have to run in parallel"
> > so that your ping-pong works.  If you're doing 1 or 2 above, I think
> > you probably still want two simple syncobjs, one for each engine.  You
> > don't really have any need to go all that far back in history.  All
> > you really need to describe is "command buffer X depends on previous
> > compute work" or "command buffer X depends on previous binning work".  
> 
> Okay, so this will effectively force in-order execution. Let's take your
> previous example and add 2 more jobs at the end that have no deps on
> previous commands:
> 
> vkBeginRenderPass() /* Writes to ImageA */
> vkCmdDraw()
> vkCmdDraw()
> ...
> vkEndRenderPass()
> vkPipelineBarrier(imageA /* fragment -> compute */)
> vkCmdDispatch() /* reads imageA, writes BufferB */
> vkBeginRenderPass() /* Writes to ImageC */
> vkCmdBindVertexBuffers(bufferB)
> vkCmdDraw();
> ...
> vkEndRenderPass()
> vkBeginRenderPass() /* Writes to ImageD */
> vkCmdDraw()
> ...
> vkEndRenderPass()
> 
> A: Vertex for the first draw on the compute engine
> B: Vertex for the second draw on the compute engine
> C: Fragment for the first draw on the binning engine; depends on A
> D: Fragment for the second draw on the binning engine; depends on B
> E: Compute on the compute engine; depends on C and D
> F: Vertex for the third draw on the compute engine; depends on E
> G: Fragment for the third draw on the binning engine; depends on F
> H: Vertex for the fourth draw on the compute engine
> I: Fragment for the fourth draw on the binning engine
> 
> When we reach E, we might be waiting for D to finish before scheduling
> the job, and because of the implicit serialization we have on the
> compute queue (F implicitly depends on E, and H on F) we can't schedule
> H either, which could, in theory, be started. I guess that's where the
> term submission order is a bit unclear to me. The action of starting a
> job sounds like execution order to me (the order you start jobs
> determines the execution order since we only have one HW queue per job
> type). All implicit deps have been calculated when we queued the job to
> the SW queue, and I thought that would be enough to meet the submission
> order requirements, but I might be wrong.
> 
> The PoC I have was trying to get rid of this explicit serialization on
> the compute and fragment queues by having one syncobj timeline
> (queue(<syncpoint>)) and synchronization points (Sx).
> 
> S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point
> A: in-fences=<explicit_deps>, out-fences=<queue(1)>
> B: in-fences=<explicit_deps>, out-fences=<queue(2)>
> C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context
> D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context
> E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA
> F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B
> G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context
> H: in-fences=<explicit_deps>, out-fence=<queue(8)>
> I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer
> S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point
> # QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point
> 
> With this solution H can be started before E if the compute slot
> is empty and E's implicit deps are not done. It's probably overkill,
> but I thought maximizing GPU utilization was important.

Never mind, I forgot the drm scheduler dequeues jobs in order, so 2
syncobjs (one per queue type) is indeed the right approach.
Alyssa Rosenzweig March 12, 2021, 9:48 p.m. UTC | #12
> >  3. Each VkCommandBuffer is two command buffers: one for compute and
> > one for binning, and you use some sort of HW synchronization mechanism
> > to handle the dependencies as you ping-pong between them.
> 
> I didn't consider that option. We have a DOORBELL instruction on Bifrost
> to wake up the CPU when the GPU wants to report something (not even
> sure we have something equivalent on Midgard), but there's no
> inter-queue sync mechanism AFAICT.

I was about to say that we don't have hardware support. Even considering
DOORBELL (which AFAIK isn't piped through to anything on kbase at
least), I'm inclined to say we don't have hardware support. Option 2
(what kbase does? dunno how DDK vulkan works though) is probably our
best bet, tbh.