Message ID | 20210311092539.2405596-1-boris.brezillon@collabora.com (mailing list archive)
---|---
Series | drm/panfrost: Add a new submit ioctl
On 11/03/2021 09:25, Boris Brezillon wrote: > Hello, > > I've been playing with Vulkan lately and struggled quite a bit to > implement VkQueueSubmit with the submit ioctl we have. There are > several limiting factors that can be worked around if we really have to, > but I think it'd be much easier and future-proof if we introduce a new > ioctl that addresses the current limitations: Hi Boris, I think what you've proposed is quite reasonable, some detailed comments to your points below. > > 1/ There can only be one out_sync, but Vulkan might ask us to signal > several VkSemaphores and possibly one VkFence too, both of those > being based on sync objects in my PoC. Making out_sync an array of > syncobjs to attach the render_done fence to would make that possible. > The other option would be to collect syncobj updates in userspace > in a separate thread and propagate those updates to all > semaphores+fences waiting on those events (I think the v3dv driver > does something like that, but I didn't spend enough time studying > the code to be sure, so I might be wrong). You should be able to avoid the separate thread to propagate by having a proxy object in user space that maps between the one outsync of the job and the possibly many Vulkan objects. But I've had this argument before with the DDK... and the upshot of it was that the Vulkan API is unnecessarily complex here and makes this really hard to do in practice. So I agree adding this capability to the kernel is likely the best approach. > 2/ Queued jobs might be executed out-of-order (unless they have > explicit/implicit deps between them), and Vulkan asks that the out > fence be signaled when all jobs are done. Timeline syncobjs are a > good match for that use case. All we need to do is pass the same > fence syncobj to all jobs being attached to a single QueueSubmit > request, but a different point on the timeline. The syncobj > timeline wait does the rest and guarantees that we've reached a > given timeline point (IOW, all jobs before that point are done) > before declaring the fence as signaled. > One alternative would be to have dummy 'synchronization' jobs that > don't actually execute anything on the GPU but declare a dependency > on all other jobs that are part of the QueueSubmit request, and > signal the out fence (the scheduler would do most of the work for > us, all we have to do is support NULL job heads and signal the > fence directly when that happens instead of queueing the job). I have to admit to being rather hazy on the details of timeline syncobjs, but I thought there was a requirement that the timeline moves monotonically. I.e. if you have multiple jobs signalling the same syncobj just with different points, then AFAIU the API requires that the points are triggered in order. So I'm not sure that you've actually fixed this point - you either need to force an order (in which case the last job can signal the Vulkan fence) or you still need a dummy job to do the many-to-one dependency. Or I may have completely misunderstood timeline syncobjs - definitely a possibility :) > 3/ The current implementation lacks information about BO access, > so we serialize all jobs accessing the same set of BOs, even > if those jobs might just be reading from them (which can > happen concurrently). Other drivers pass an access type to the > list of referenced BOs to address that.
Another option would be > to disable implicit deps (deps based on BOs) and force the driver > to pass all deps explicitly (interestingly, some drivers have > both the no-implicit-dep and r/w flags, probably to support > sub-resource access, so we might want to add that one too). > I don't see any userspace workaround to that problem, so that one > alone would justify extending the existing ioctl or adding a new > one. Yeah - I think we need this. My only comment is that I think the read/write terminology may come back to bite. Better to use 'shared' and 'exclusive' - which better matches the dma_resv_xxx APIs anyway. Also the current code completely ignores PANFROST_BO_REF_READ. So either that should be defined as 0, or even better we support 3 modes: * Exclusive ('write' access) * Shared ('read' access) * No fence - ensures the BO is mapped, but doesn't add any implicit fences. The last may make sense when doing explicit fences and e.g. doing front-buffer rendering with a display driver which does implicit fencing. > 4/ There's also the fact that submitting one job at a time adds an > overhead when QueueSubmit is being passed more than one > CommandBuffer. That one is less problematic, but if we're adding > a new ioctl we'd better design it to limit the userspace -> kernel > transition overhead. I've no objection - but I doubt the performance effect is significant. I was pleased to see the handling of stride which makes the interface extendable. In particular I suspect at some point we're going to want a priority field in some form. > Right now I'm just trying to collect feedback. I don't intend to get > those patches merged until we have a userspace user, but I thought > starting the discussion early would be a good thing. > > Feel free to suggest other approaches. Other than the above I didn't see any obvious issues, and I know the Vulkan API is problematic in terms of synchronisation primitives - so if this makes it easier to implement then it seems like a good idea to me. Thanks, Steve
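[Editor's note] For concreteness, here is a rough sketch of what the three access modes Steven lists above could look like as per-BO flags in a new submit ioctl. The names, values and struct layout are made up for illustration and are not the UAPI proposed in the series.

```c
/* Illustrative only -- not the UAPI from Boris's series. */
#include <linux/types.h>

#define PANFROST_BO_REF_EXCLUSIVE   (1 << 0) /* 'write': adds an exclusive dma_resv fence */
#define PANFROST_BO_REF_SHARED      (1 << 1) /* 'read': adds a shared dma_resv fence */
#define PANFROST_BO_REF_NO_IMPLICIT (1 << 2) /* map the BO, but add no implicit fence */

struct drm_panfrost_bo_ref {
	__u32 handle; /* GEM handle of the referenced BO */
	__u32 flags;  /* exactly one of the access modes above */
};
```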
Hi Steven, On Thu, 11 Mar 2021 12:16:33 +0000 Steven Price <steven.price@arm.com> wrote: > On 11/03/2021 09:25, Boris Brezillon wrote: > > Hello, > > > > I've been playing with Vulkan lately and struggled quite a bit to > > implement VkQueueSubmit with the submit ioctl we have. There are > > several limiting factors that can be worked around if we really have to, > > but I think it'd be much easier and future-proof if we introduce a new > > ioctl that addresses the current limitations: > > Hi Boris, > > I think what you've proposed is quite reasonable, some detailed comments > to your points below. > > > > > 1/ There can only be one out_sync, but Vulkan might ask us to signal > > several VkSemaphores and possibly one VkFence too, both of those > > being based on sync objects in my PoC. Making out_sync an array of > > syncobjs to attach the render_done fence to would make that possible. > > The other option would be to collect syncobj updates in userspace > > in a separate thread and propagate those updates to all > > semaphores+fences waiting on those events (I think the v3dv driver > > does something like that, but I didn't spend enough time studying > > the code to be sure, so I might be wrong). > > You should be able to avoid the separate thread to propagate by having a > proxy object in user space that maps between the one outsync of the job > and the possibly many Vulkan objects. But I've had this argument before > with the DDK... and the upshot of it was that he Vulkan API is > unnecessarily complex here and makes this really hard to do in practise. > So I agree adding this capability to the kernel is likely the best approach. > > > 2/ Queued jobs might be executed out-of-order (unless they have > > explicit/implicit deps between them), and Vulkan asks that the out > > fence be signaled when all jobs are done. Timeline syncobjs are a > > good match for that use case. All we need to do is pass the same > > fence syncobj to all jobs being attached to a single QueueSubmit > > request, but a different point on the timeline. The syncobj > > timeline wait does the rest and guarantees that we've reached a > > given timeline point (IOW, all jobs before that point are done) > > before declaring the fence as signaled. > > One alternative would be to have dummy 'synchronization' jobs that > > don't actually execute anything on the GPU but declare a dependency > > on all other jobs that are part of the QueueSubmit request, and > > signal the out fence (the scheduler would do most of the work for > > us, all we have to do is support NULL job heads and signal the > > fence directly when that happens instead of queueing the job). > > I have to admit to being rather hazy on the details of timeline > syncobjs, but I thought there was a requirement that the timeline moves > monotonically. I.e. if you have multiple jobs signalling the same > syncobj just with different points, then AFAIU the API requires that the > points are triggered in order. I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I might be wrong, but my understanding is that queuing fences (addition of new points in the timeline) should happen in order, but signaling can happen in any order. When checking for a signaled fence the fence-chain logic starts from the last point (or from an explicit point if you use the timeline wait flavor) and goes backward, stopping at the first un-signaled node. If I'm correct, that means that fences that are part of a chain can be signaled in any order. 
Note that I also considered using a sync file, which has the ability to merge fences, but that required 3 extra ioctls for each syncobj to merge (for the export/merge/import round trip), and AFAICT, fences stay around until the sync file is destroyed, which forces some garbage collection if we want to keep the number of active fences low. One nice thing about the fence-chain/syncobj-timeline logic is that signaled fences are collected automatically when walking the chain. > > So I'm not sure that you've actually fixed this point - you either need > to force an order (in which case the last job can signal the Vulkan > fence) That option requires adding deps that do not really exist on the last jobs, so I'm not sure I like it. > or you still need a dummy job to do the many-to-one dependency. Yeah, that's what I've considered doing before realizing timeline syncobjs could solve this problem (assuming I got it right :-/). > > Or I may have completely misunderstood timeline syncobjs - definitely a > possibility :) I wouldn't pretend I got it right either :-). > > > 3/ The current implementation lacks information about BO access, > > so we serialize all jobs accessing the same set of BOs, even > > if those jobs might just be reading from them (which can > > happen concurrently). Other drivers pass an access type to the > > list of referenced BOs to address that. Another option would be > > to disable implicit deps (deps based on BOs) and force the driver > > to pass all deps explicitly (interestingly, some drivers have > > both the no-implicit-dep and r/w flags, probably to support > > sub-resource access, so we might want to add that one too). > > I don't see any userspace workaround to that problem, so that one > > alone would justify extending the existing ioctl or adding a new > > one. > > Yeah - I think we need this. My only comment is that I think the > read/write terminology may come back to bite. Better to use 'shared' and > 'exclusive' - which better matches the dma_resv_xxx APIs anyway. > > Also the current code completely ignores PANFROST_BO_REF_READ. So either > that should be defined as 0, or even better we support 3 modes: > > * Exclusive ('write' access) > * Shared ('read' access) > * No fence - ensures the BO is mapped, but doesn't add any implicit > fences. > > The last may make sense when doing explicit fences and e.g. doing > front-buffer rendering with a display driver which does implicit fencing. Sounds good to me. > > > 4/ There's also the fact that submitting one job at a time adds an > > overhead when QueueSubmit is being passed more than one > > CommandBuffer. That one is less problematic, but if we're adding > > a new ioctl we'd better design it to limit the userspace -> kernel > > transition overhead. > > I've no objection - but I doubt the performance effect is significant. I > was pleased to see the handling of stride which makes the interface > extendable. In particular I suspect at some point we're going to want a > priority field in some form. > > > Right now I'm just trying to collect feedback. I don't intend to get > > those patches merged until we have a userspace user, but I thought > > starting the discussion early would be a good thing. > > > > Feel free to suggest other approaches. > > Other than the above I didn't see any obvious issues, and I know the > Vulkan API is problematic in terms of synchronisation primitives - so if > this makes it easier to implement then it seems like a good idea to me. Thanks a lot for your review. Regards, Boris
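[Editor's note] As a concrete illustration of the scheme Boris describes above (one timeline point per queued job, with the fence/semaphores waiting on the last point), here is a rough userspace sketch using the libdrm syncobj helpers. `struct job` and `submit_job()` are hypothetical stand-ins for whatever the new submit ioctl ends up looking like.

```c
#include <stdint.h>
#include <xf86drm.h>

struct job {
	uint64_t jc; /* job chain GPU address (hypothetical) */
	/* BO list, requirements, explicit deps, ... */
};

/* Hypothetical wrapper around the new submit ioctl: queues the job and
 * attaches its "done" fence to point 'point' of 'syncobj'. */
int submit_job(int fd, const struct job *job, uint32_t syncobj, uint64_t point);

/* Queue all jobs of one vkQueueSubmit() and wait for the last timeline point. */
static int queue_submit_and_wait(int fd, uint32_t queue_syncobj,
                                 uint64_t *next_point,
                                 const struct job *jobs, unsigned job_count)
{
	uint64_t last = 0;

	for (unsigned i = 0; i < job_count; i++) {
		last = ++(*next_point);
		int ret = submit_job(fd, &jobs[i], queue_syncobj, last);
		if (ret)
			return ret;
	}

	/* The chain wait only completes once all points up to 'last' signaled. */
	return drmSyncobjTimelineWait(fd, &queue_syncobj, &last, 1, INT64_MAX,
	                              DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
}
```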
Hi all, Dropping in where I may or may not be wanted to feel free to ignore. : -) On Thu, Mar 11, 2021 at 7:00 AM Boris Brezillon <boris.brezillon@collabora.com> wrote: > > Hi Steven, > > On Thu, 11 Mar 2021 12:16:33 +0000 > Steven Price <steven.price@arm.com> wrote: > > > On 11/03/2021 09:25, Boris Brezillon wrote: > > > Hello, > > > > > > I've been playing with Vulkan lately and struggled quite a bit to > > > implement VkQueueSubmit with the submit ioctl we have. There are > > > several limiting factors that can be worked around if we really have to, > > > but I think it'd be much easier and future-proof if we introduce a new > > > ioctl that addresses the current limitations: > > > > Hi Boris, > > > > I think what you've proposed is quite reasonable, some detailed comments > > to your points below. > > > > > > > > 1/ There can only be one out_sync, but Vulkan might ask us to signal > > > several VkSemaphores and possibly one VkFence too, both of those > > > being based on sync objects in my PoC. Making out_sync an array of > > > syncobjs to attach the render_done fence to would make that possible. > > > The other option would be to collect syncobj updates in userspace > > > in a separate thread and propagate those updates to all > > > semaphores+fences waiting on those events (I think the v3dv driver > > > does something like that, but I didn't spend enough time studying > > > the code to be sure, so I might be wrong). > > > > You should be able to avoid the separate thread to propagate by having a > > proxy object in user space that maps between the one outsync of the job > > and the possibly many Vulkan objects. But I've had this argument before > > with the DDK... and the upshot of it was that he Vulkan API is > > unnecessarily complex here and makes this really hard to do in practise. > > So I agree adding this capability to the kernel is likely the best approach. This is pretty easy to do from userspace. Just take the out sync_file and stuff it into each of the syncobjs using DRM_IOCTL_SYNCOBJ_FD_TO_HANDLE. That's all the kernel would be doing under the hood anyway. Having it built into your submit ioctl does have the advantage of fewer round-trips to kernel space, though. > > > 2/ Queued jobs might be executed out-of-order (unless they have > > > explicit/implicit deps between them), and Vulkan asks that the out > > > fence be signaled when all jobs are done. Timeline syncobjs are a > > > good match for that use case. All we need to do is pass the same > > > fence syncobj to all jobs being attached to a single QueueSubmit > > > request, but a different point on the timeline. The syncobj > > > timeline wait does the rest and guarantees that we've reached a > > > given timeline point (IOW, all jobs before that point are done) > > > before declaring the fence as signaled. > > > One alternative would be to have dummy 'synchronization' jobs that > > > don't actually execute anything on the GPU but declare a dependency > > > on all other jobs that are part of the QueueSubmit request, and > > > signal the out fence (the scheduler would do most of the work for > > > us, all we have to do is support NULL job heads and signal the > > > fence directly when that happens instead of queueing the job). > > > > I have to admit to being rather hazy on the details of timeline > > syncobjs, but I thought there was a requirement that the timeline moves > > monotonically. I.e. 
if you have multiple jobs signalling the same > > syncobj just with different points, then AFAIU the API requires that the > > points are triggered in order. > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I > might be wrong, but my understanding is that queuing fences (addition > of new points in the timeline) should happen in order, but signaling > can happen in any order. When checking for a signaled fence the > fence-chain logic starts from the last point (or from an explicit point > if you use the timeline wait flavor) and goes backward, stopping at the > first un-signaled node. If I'm correct, that means that fences that > are part of a chain can be signaled in any order. You don't even need a timeline for this. Just have a single syncobj per-queue and make each submit wait on it and then signal it. Alternatively, you can just always hang on to the out-fence from the previous submit and make the next one wait on that. Timelines are overkill here, IMO. > Note that I also considered using a sync file, which has the ability to > merge fences, but that required 3 extra ioctls for each syncobj to merge > (for the export/merge/import round trip), and AFAICT, fences stay around > until the sync file is destroyed, which forces some garbage collection > if we want to keep the number of active fences low. One nice thing > about the fence-chain/syncobj-timeline logic is that signaled fences > are collected automatically when walking the chain. Yeah, that's the pain when working with sync files. This is one of the reasons why our driver takes an arbitrary number of in/out syncobjs. > > So I'm not sure that you've actually fixed this point - you either need > > to force an order (in which case the last job can signal the Vulkan > > fence) > > That options requires adding deps that do not really exist on the last > jobs, so I'm not sure I like it. > > > or you still need a dummy job to do the many-to-one dependency. > > Yeah, that's what I've considered doing before realizing timelined > syncojbs could solve this problem (assuming I got it right :-/). > > > > > Or I may have completely misunderstood timeline syncobjs - definitely a > > possibility :) > > I wouldn't pretend I got it right either :-). > > > > > > 3/ The current implementation lacks information about BO access, > > > so we serialize all jobs accessing the same set of BOs, even > > > if those jobs might just be reading from them (which can > > > happen concurrently). Other drivers pass an access type to the > > > list of referenced BOs to address that. Another option would be > > > to disable implicit deps (deps based on BOs) and force the driver > > > to pass all deps explicitly (interestingly, some drivers have > > > both the no-implicit-dep and r/w flags, probably to support > > > sub-resource access, so we might want to add that one too). > > > I don't see any userspace workaround to that problem, so that one > > > alone would justify extending the existing ioctl or adding a new > > > one. > > > > Yeah - I think we need this. My only comment is that I think the > > read/write terminology may come back to bite. Better to use 'shared' and > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway. > > > > Also the current code completely ignores PANFROST_BO_REF_READ. So either > > that should be defined as 0, or even better we support 3 modes: > > > > * Exclusive ('write' access) > > * Shared ('read' access) > > * No fence - ensures the BO is mapped, but doesn't add any implicit > > fences. 
> > > > The last may make sense when doing explicit fences and e.g. doing > > front-buffer rendering with a display driver which does implicit fencing. This one's really annoying. TBH, we've still not gotten it right on Intel, AFAICT. That is roughly the set of modes you need but you'll have to watch out for window-system buffers. RADV and ANV take slightly different approaches here and they each have their own problems. On the balance, I'd look at what RADV is doing rather than ANV because ANV's results in some over-synchronization every time you vkWaitForFences on the WSI fence. I've got a patch floating round somewhere that adds some new kernel API to make that case a bit better but it's a snarly mess. Sorry for being cryptic but it's a 5-page e-mail if I type out all the annoying details. (-: --Jason > Sounds good to me. > > > > > > 4/ There's also the fact that submitting one job at a time adds an > > > overhead when QueueSubmit is being passed more than one > > > CommandBuffer. That one is less problematic, but if we're adding > > > a new ioctl we'd better design it to limit the userspace -> kernel > > > transition overhead. > > > > I've no objection - but I doubt the performance effect is significant. I > > was pleased to see the handling of stride which makes the interface > > extendable. In particular I suspect at some point we're going to want a > > priority field in some form. > > > > > Right now I'm just trying to collect feedback. I don't intend to get > > > those patches merged until we have a userspace user, but I thought > > > starting the discussion early would be a good thing. > > > > > > Feel free to suggest other approaches. > > > > Other than the above I didn't see any obvious issues, and I know the > > Vulkan API is problematic in terms of synchronisation primitives - so if > > this makes it easier to implement then it seems like a good idea to me. > > Thanks a lot for you review. > > Regards, > > Boris > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
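[Editor's note] For reference, the userspace-only propagation Jason mentions near the top of his reply (stuffing the job's single out fence into each Vulkan syncobj via DRM_IOCTL_SYNCOBJ_FD_TO_HANDLE) boils down to something like the sketch below, using the libdrm wrappers around the export/import ioctls. All handles are assumed to already exist.

```c
#include <unistd.h>
#include <stdint.h>
#include <xf86drm.h>

/* Copy the fence currently held by 'job_out_syncobj' into every destination
 * syncobj backing a VkSemaphore/VkFence: one export plus N imports. */
static int propagate_out_fence(int fd, uint32_t job_out_syncobj,
                               const uint32_t *dst_syncobjs, unsigned count)
{
	int sync_file_fd, ret;

	ret = drmSyncobjExportSyncFile(fd, job_out_syncobj, &sync_file_fd);
	if (ret)
		return ret;

	for (unsigned i = 0; i < count && !ret; i++)
		ret = drmSyncobjImportSyncFile(fd, dst_syncobjs[i], sync_file_fd);

	close(sync_file_fd);
	return ret;
}
```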
Hi Jason, On Thu, 11 Mar 2021 10:58:46 -0600 Jason Ekstrand <jason@jlekstrand.net> wrote: > Hi all, > > Dropping in where I may or may not be wanted to feel free to ignore. : -) I'm glad you decided to chime in. :-) > > > > 2/ Queued jobs might be executed out-of-order (unless they have > > > > explicit/implicit deps between them), and Vulkan asks that the out > > > > fence be signaled when all jobs are done. Timeline syncobjs are a > > > > good match for that use case. All we need to do is pass the same > > > > fence syncobj to all jobs being attached to a single QueueSubmit > > > > request, but a different point on the timeline. The syncobj > > > > timeline wait does the rest and guarantees that we've reached a > > > > given timeline point (IOW, all jobs before that point are done) > > > > before declaring the fence as signaled. > > > > One alternative would be to have dummy 'synchronization' jobs that > > > > don't actually execute anything on the GPU but declare a dependency > > > > on all other jobs that are part of the QueueSubmit request, and > > > > signal the out fence (the scheduler would do most of the work for > > > > us, all we have to do is support NULL job heads and signal the > > > > fence directly when that happens instead of queueing the job). > > > > > > I have to admit to being rather hazy on the details of timeline > > > syncobjs, but I thought there was a requirement that the timeline moves > > > monotonically. I.e. if you have multiple jobs signalling the same > > > syncobj just with different points, then AFAIU the API requires that the > > > points are triggered in order. > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I > > might be wrong, but my understanding is that queuing fences (addition > > of new points in the timeline) should happen in order, but signaling > > can happen in any order. When checking for a signaled fence the > > fence-chain logic starts from the last point (or from an explicit point > > if you use the timeline wait flavor) and goes backward, stopping at the > > first un-signaled node. If I'm correct, that means that fences that > > are part of a chain can be signaled in any order. > > You don't even need a timeline for this. Just have a single syncobj > per-queue and make each submit wait on it and then signal it. > Alternatively, you can just always hang on to the out-fence from the > previous submit and make the next one wait on that. That's what I have right now, but it forces the serialization of all jobs that are pushed during a submit (and there can be more than one per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd be better to not force this serialization if we can avoid it. > Timelines are overkill here, IMO. Mind developing why you think this is overkill? After looking at the kernel implementation I thought using timeline syncobjs would be pretty cheap compared to the other options I considered. > > > Note that I also considered using a sync file, which has the ability to > > merge fences, but that required 3 extra ioctls for each syncobj to merge > > (for the export/merge/import round trip), and AFAICT, fences stay around > > until the sync file is destroyed, which forces some garbage collection > > if we want to keep the number of active fences low. One nice thing > > about the fence-chain/syncobj-timeline logic is that signaled fences > > are collected automatically when walking the chain. > > Yeah, that's the pain when working with sync files. 
This is one of > the reasons why our driver takes an arbitrary number of in/out > syncobjs. > > > > So I'm not sure that you've actually fixed this point - you either need > > > to force an order (in which case the last job can signal the Vulkan > > > fence) > > > > That options requires adding deps that do not really exist on the last > > jobs, so I'm not sure I like it. > > > > > or you still need a dummy job to do the many-to-one dependency. > > > > Yeah, that's what I've considered doing before realizing timelined > > syncojbs could solve this problem (assuming I got it right :-/). > > > > > > > > Or I may have completely misunderstood timeline syncobjs - definitely a > > > possibility :) > > > > I wouldn't pretend I got it right either :-). > > > > > > > > > 3/ The current implementation lacks information about BO access, > > > > so we serialize all jobs accessing the same set of BOs, even > > > > if those jobs might just be reading from them (which can > > > > happen concurrently). Other drivers pass an access type to the > > > > list of referenced BOs to address that. Another option would be > > > > to disable implicit deps (deps based on BOs) and force the driver > > > > to pass all deps explicitly (interestingly, some drivers have > > > > both the no-implicit-dep and r/w flags, probably to support > > > > sub-resource access, so we might want to add that one too). > > > > I don't see any userspace workaround to that problem, so that one > > > > alone would justify extending the existing ioctl or adding a new > > > > one. > > > > > > Yeah - I think we need this. My only comment is that I think the > > > read/write terminology may come back to bite. Better to use 'shared' and > > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway. > > > > > > Also the current code completely ignores PANFROST_BO_REF_READ. So either > > > that should be defined as 0, or even better we support 3 modes: > > > > > > * Exclusive ('write' access) > > > * Shared ('read' access) > > > * No fence - ensures the BO is mapped, but doesn't add any implicit > > > fences. > > > > > > The last may make sense when doing explicit fences and e.g. doing > > > front-buffer rendering with a display driver which does implicit fencing. > > This one's really annoying. TBH, we've still not gotten it right on > Intel, AFAICT. That is roughly the set of modes you need but you'll > have to watch out for window-system buffers. RADV and ANV take > slightly different approaches here and they each have their own > problems. On the balance, I'd look at what RADV is doing rather than > ANV because ANV's results in some over-synchronization every time you > vkWaitForFences on the WSI fence. I've got a patch floating round > somewhere that adds some new kernel API to make that case a bit better > but it's a snarly mess. Sorry for being cryptic but it's a 5-page > e-mail if I type out all the annoying details. (-: Ok, I'll have a look at the RADV driver. Thanks for your feedback. Boris
On Thu, Mar 11, 2021 at 11:25 AM Boris Brezillon <boris.brezillon@collabora.com> wrote: > > Hi Jason, > > On Thu, 11 Mar 2021 10:58:46 -0600 > Jason Ekstrand <jason@jlekstrand.net> wrote: > > > Hi all, > > > > Dropping in where I may or may not be wanted to feel free to ignore. : -) > > I'm glad you decided to chime in. :-) > > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have > > > > > explicit/implicit deps between them), and Vulkan asks that the out > > > > > fence be signaled when all jobs are done. Timeline syncobjs are a > > > > > good match for that use case. All we need to do is pass the same > > > > > fence syncobj to all jobs being attached to a single QueueSubmit > > > > > request, but a different point on the timeline. The syncobj > > > > > timeline wait does the rest and guarantees that we've reached a > > > > > given timeline point (IOW, all jobs before that point are done) > > > > > before declaring the fence as signaled. > > > > > One alternative would be to have dummy 'synchronization' jobs that > > > > > don't actually execute anything on the GPU but declare a dependency > > > > > on all other jobs that are part of the QueueSubmit request, and > > > > > signal the out fence (the scheduler would do most of the work for > > > > > us, all we have to do is support NULL job heads and signal the > > > > > fence directly when that happens instead of queueing the job). > > > > > > > > I have to admit to being rather hazy on the details of timeline > > > > syncobjs, but I thought there was a requirement that the timeline moves > > > > monotonically. I.e. if you have multiple jobs signalling the same > > > > syncobj just with different points, then AFAIU the API requires that the > > > > points are triggered in order. > > > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I > > > might be wrong, but my understanding is that queuing fences (addition > > > of new points in the timeline) should happen in order, but signaling > > > can happen in any order. When checking for a signaled fence the > > > fence-chain logic starts from the last point (or from an explicit point > > > if you use the timeline wait flavor) and goes backward, stopping at the > > > first un-signaled node. If I'm correct, that means that fences that > > > are part of a chain can be signaled in any order. > > > > You don't even need a timeline for this. Just have a single syncobj > > per-queue and make each submit wait on it and then signal it. > > Alternatively, you can just always hang on to the out-fence from the > > previous submit and make the next one wait on that. > > That's what I have right now, but it forces the serialization of all > jobs that are pushed during a submit (and there can be more than one > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd > be better to not force this serialization if we can avoid it. I'm not familiar with panfrost's needs and I don't work on a tiler and I know there are different issues there. But... The Vulkan spec requires that everything that all the submits that happen on a given vkQueue happen in-order. Search the spec for "Submission order" for more details. So, generally speaking, there are some in-order requirements there. Again, not having a lot of tiler experience, I'm not the one to weigh in. > > Timelines are overkill here, IMO. > > Mind developing why you think this is overkill? 
After looking at the > kernel implementation I thought using timeline syncobjs would be > pretty cheap compared to the other options I considered. If you use a regular syncobj, every time you wait on it it inserts a dependency between the current submit and the last thing to signal it on the CPU timeline. The internal dma_fences will hang around as-needed to ensure those dependencies. If you use a timeline, you have to also track a uint64_t to reference the current time point. This may work if you need to sync a bunch of in-flight stuff at one go, that may work but if you're trying to serialize, it's just extra tracking for no point. Again, maybe there's something I'm missing and you don't actually want to serialize. --Jason > > > > > Note that I also considered using a sync file, which has the ability to > > > merge fences, but that required 3 extra ioctls for each syncobj to merge > > > (for the export/merge/import round trip), and AFAICT, fences stay around > > > until the sync file is destroyed, which forces some garbage collection > > > if we want to keep the number of active fences low. One nice thing > > > about the fence-chain/syncobj-timeline logic is that signaled fences > > > are collected automatically when walking the chain. > > > > Yeah, that's the pain when working with sync files. This is one of > > the reasons why our driver takes an arbitrary number of in/out > > syncobjs. > > > > > > So I'm not sure that you've actually fixed this point - you either need > > > > to force an order (in which case the last job can signal the Vulkan > > > > fence) > > > > > > That options requires adding deps that do not really exist on the last > > > jobs, so I'm not sure I like it. > > > > > > > or you still need a dummy job to do the many-to-one dependency. > > > > > > Yeah, that's what I've considered doing before realizing timelined > > > syncojbs could solve this problem (assuming I got it right :-/). > > > > > > > > > > > Or I may have completely misunderstood timeline syncobjs - definitely a > > > > possibility :) > > > > > > I wouldn't pretend I got it right either :-). > > > > > > > > > > > > 3/ The current implementation lacks information about BO access, > > > > > so we serialize all jobs accessing the same set of BOs, even > > > > > if those jobs might just be reading from them (which can > > > > > happen concurrently). Other drivers pass an access type to the > > > > > list of referenced BOs to address that. Another option would be > > > > > to disable implicit deps (deps based on BOs) and force the driver > > > > > to pass all deps explicitly (interestingly, some drivers have > > > > > both the no-implicit-dep and r/w flags, probably to support > > > > > sub-resource access, so we might want to add that one too). > > > > > I don't see any userspace workaround to that problem, so that one > > > > > alone would justify extending the existing ioctl or adding a new > > > > > one. > > > > > > > > Yeah - I think we need this. My only comment is that I think the > > > > read/write terminology may come back to bite. Better to use 'shared' and > > > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway. > > > > > > > > Also the current code completely ignores PANFROST_BO_REF_READ. So either > > > > that should be defined as 0, or even better we support 3 modes: > > > > > > > > * Exclusive ('write' access) > > > > * Shared ('read' access) > > > > * No fence - ensures the BO is mapped, but doesn't add any implicit > > > > fences. 
> > > > > > > > The last may make sense when doing explicit fences and e.g. doing > > > > front-buffer rendering with a display driver which does implicit fencing. > > > > This one's really annoying. TBH, we've still not gotten it right on > > Intel, AFAICT. That is roughly the set of modes you need but you'll > > have to watch out for window-system buffers. RADV and ANV take > > slightly different approaches here and they each have their own > > problems. On the balance, I'd look at what RADV is doing rather than > > ANV because ANV's results in some over-synchronization every time you > > vkWaitForFences on the WSI fence. I've got a patch floating round > > somewhere that adds some new kernel API to make that case a bit better > > but it's a snarly mess. Sorry for being cryptic but it's a 5-page > > e-mail if I type out all the annoying details. (-: > > Ok, I'll have a look at the RADV driver. > > Thanks for your feedback. > > Boris
> I'm not familiar with panfrost's needs and I don't work on a tiler and > I know there are different issues there. But... The primary issue is we submit vertex+compute and fragment for each batch as two disjoint jobs (with a dependency of course), reflecting the internal hardware structure as parallel job slots. That we actually require two ioctls() and a roundtrip for this is a design wart inherited from early days of the kernel driver. The downstream Mali driver handles this by allowing multiple jobs to be submitted with a single ioctl, as Boris's patch enables. In every other respect I believe our needs are similar to other renderonly drivers. (What does turnip do?)
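[Editor's note] To make the "two disjoint jobs, two ioctls" point concrete, this is roughly what the current flow looks like with the existing DRM_IOCTL_PANFROST_SUBMIT, chaining the fragment job to the vertex/tiler job through a syncobj. Field usage is recalled from the current panfrost UAPI and may not be exact; error handling is minimal.

```c
#include <stdint.h>
#include <xf86drm.h>
#include <panfrost_drm.h> /* kernel UAPI header; include path varies per tree */

/* Submit one batch: vertex/tiler chain first, fragment chain second,
 * with the fragment job waiting on the vertex/tiler job via 'vt_done'. */
static int submit_batch(int fd, uint64_t vertex_jc, uint64_t fragment_jc,
                        uint32_t *bos, uint32_t bo_count,
                        uint32_t vt_done, uint32_t out_sync)
{
	struct drm_panfrost_submit vtx = {
		.jc = vertex_jc,
		.bo_handles = (uintptr_t)bos,
		.bo_handle_count = bo_count,
		.out_sync = vt_done, /* signaled when vertex/tiler work is done */
	};
	struct drm_panfrost_submit frag = {
		.jc = fragment_jc,
		.bo_handles = (uintptr_t)bos,
		.bo_handle_count = bo_count,
		.in_syncs = (uintptr_t)&vt_done, /* wait for the vertex/tiler job */
		.in_sync_count = 1,
		.out_sync = out_sync,
		.requirements = PANFROST_JD_REQ_FS,
	};

	int ret = drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &vtx);
	if (!ret)
		ret = drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &frag);
	return ret; /* two round-trips per batch, as noted above */
}
```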
On Thu, 11 Mar 2021 12:11:48 -0600 Jason Ekstrand <jason@jlekstrand.net> wrote: > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have > > > > > > explicit/implicit deps between them), and Vulkan asks that the out > > > > > > fence be signaled when all jobs are done. Timeline syncobjs are a > > > > > > good match for that use case. All we need to do is pass the same > > > > > > fence syncobj to all jobs being attached to a single QueueSubmit > > > > > > request, but a different point on the timeline. The syncobj > > > > > > timeline wait does the rest and guarantees that we've reached a > > > > > > given timeline point (IOW, all jobs before that point are done) > > > > > > before declaring the fence as signaled. > > > > > > One alternative would be to have dummy 'synchronization' jobs that > > > > > > don't actually execute anything on the GPU but declare a dependency > > > > > > on all other jobs that are part of the QueueSubmit request, and > > > > > > signal the out fence (the scheduler would do most of the work for > > > > > > us, all we have to do is support NULL job heads and signal the > > > > > > fence directly when that happens instead of queueing the job). > > > > > > > > > > I have to admit to being rather hazy on the details of timeline > > > > > syncobjs, but I thought there was a requirement that the timeline moves > > > > > monotonically. I.e. if you have multiple jobs signalling the same > > > > > syncobj just with different points, then AFAIU the API requires that the > > > > > points are triggered in order. > > > > > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I > > > > might be wrong, but my understanding is that queuing fences (addition > > > > of new points in the timeline) should happen in order, but signaling > > > > can happen in any order. When checking for a signaled fence the > > > > fence-chain logic starts from the last point (or from an explicit point > > > > if you use the timeline wait flavor) and goes backward, stopping at the > > > > first un-signaled node. If I'm correct, that means that fences that > > > > are part of a chain can be signaled in any order. > > > > > > You don't even need a timeline for this. Just have a single syncobj > > > per-queue and make each submit wait on it and then signal it. > > > Alternatively, you can just always hang on to the out-fence from the > > > previous submit and make the next one wait on that. > > > > That's what I have right now, but it forces the serialization of all > > jobs that are pushed during a submit (and there can be more than one > > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd > > be better to not force this serialization if we can avoid it. > > I'm not familiar with panfrost's needs and I don't work on a tiler and > I know there are different issues there. But... > > The Vulkan spec requires that everything that all the submits that > happen on a given vkQueue happen in-order. Search the spec for > "Submission order" for more details. Duh, looks like I completely occulted the "Submission order" guarantees. This being said, even after reading this chapter multiple times I'm not sure what kind of guarantee this gives us, given the execution itself can be out-of-order. 
My understanding is that submission order matters for implicit deps, say you have 2 distinct VkSubmitInfo, the first one (in submission order) writing to a buffer and the second one reading from it, you really want the first one to be submitted first and the second one to wait on the implicit BO fence created by the first one. If we were to submit out-of-order, this guarantee wouldn't be met. OTOH, if we have 2 completely independent submits, I don't really see what submission order gives us if execution is out-of-order. In our case, the kernel driver takes care of the submission serialization (gathering implicit and explicit deps, queuing the job and assigning the "done" fence to the output sync objects). Once things are queued, it's the scheduler (drm_sched) deciding of the execution order. > > So, generally speaking, there are some in-order requirements there. > Again, not having a lot of tiler experience, I'm not the one to weigh > in. > > > > Timelines are overkill here, IMO. > > > > Mind developing why you think this is overkill? After looking at the > > kernel implementation I thought using timeline syncobjs would be > > pretty cheap compared to the other options I considered. > > If you use a regular syncobj, every time you wait on it it inserts a > dependency between the current submit and the last thing to signal it > on the CPU timeline. The internal dma_fences will hang around > as-needed to ensure those dependencies. If you use a timeline, you > have to also track a uint64_t to reference the current time point. > This may work if you need to sync a bunch of in-flight stuff at one > go, that may work but if you're trying to serialize, it's just extra > tracking for no point. Again, maybe there's something I'm missing and > you don't actually want to serialize. My understanding (and I might very much be wrong here) is that using a regular syncobj to do this actually enforces not only the submission order but also the execution order (each job waiting on the previous one to complete before being scheduled). The idea of the timeline syncobj approach is that jobs that have no inter dependencies can be started in any order, the scheduler picking the first whose deps are ready (which might not necessarily match submission order). The timeline syncobj allows us to have one point per kernel-submission and eventually wait on the last point for the fence passed to vkSubmitQueue(), and some specific point on the timeline for pSignalSemaphores entries. What's more challenging is signal operation ordering: " Signal operation order is a fundamental ordering in Vulkan, giving meaning to the order in which semaphore and fence signal operations occur when submitted to a single queue. 1. The initial order is determined by the order in which vkQueueSubmit commands are executed on the host, for a single queue, from first to last. 2. The order in which VkSubmitInfo structures are specified in the pSubmits parameter of vkQueueSubmit, from lowest index to highest. 3. The fence signal operation defined by the fence parameter of a vkQueueSubmit or vkQueueBindSparse command is ordered after all semaphore signal operations defined by that command. " This means we have to add implicit dependencies on the signaling itself. We have two options to guarantee that: 1/ Transfer one of the queue syncobj timeline point to each semaphore and fence after job submission (the point itself being dependent on the position of the submit entry in the array for semaphores, and the last point for the fence). 
Problem with this approach is that we now have an extra TRANSFER_SYNCOBJ call per semaphore/fence.

2/ Add SYNC jobs (jobs that do not actually execute on the GPU, but serve as a synchronization point) whose responsibility would be to do this muxing/transfer as part of the batch submission process.
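[Editor's note] A sketch of what option 1 above would look like in userspace, using the libdrm wrapper for the transfer ioctl. It assumes the queue keeps a single timeline syncobj with one point per kernel submission, as described earlier in the thread; the helper name and parameters are illustrative.

```c
#include <stdint.h>
#include <xf86drm.h>

/* Option 1: after queuing the jobs, copy queue timeline points into the
 * binary syncobjs backing the semaphores and the fence. This is the extra
 * TRANSFER_SYNCOBJ call per semaphore/fence mentioned above. */
static int signal_vulkan_objects(int fd, uint32_t queue_timeline,
                                 const uint32_t *sem_syncobjs,
                                 const uint64_t *sem_points, unsigned sem_count,
                                 uint32_t fence_syncobj, uint64_t last_point)
{
	int ret = 0;

	for (unsigned i = 0; i < sem_count && !ret; i++)
		ret = drmSyncobjTransfer(fd, sem_syncobjs[i], 0,        /* dst: binary */
		                         queue_timeline, sem_points[i], /* src point */
		                         0);

	/* The fence waits for the last point of the whole vkQueueSubmit(). */
	if (!ret && fence_syncobj)
		ret = drmSyncobjTransfer(fd, fence_syncobj, 0,
		                         queue_timeline, last_point, 0);
	return ret;
}
```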
On Thu, 11 Mar 2021 12:16:33 +0000 Steven Price <steven.price@arm.com> wrote: > Also the current code completely ignores PANFROST_BO_REF_READ. So either > that should be defined as 0, or even better we support 3 modes: > > * Exclusive ('write' access) > * Shared ('read' access) > * No fence - ensures the BO is mapped, but doesn't add any implicit > fences. I looked at other drivers and they seem to have this NO_FENCE/NO_IMPLICIT flag at the job/submit level and conditioned on the presence of an in_sync_fd file. I can see per-BO NO_FENCE being useful for sub-buffer accesses, assuming we don't want to deal with other deps explicitly, but we could also force the NO_IMPLICIT behavior for the whole job and let the user deal with all deps explicitly. TBH, it's a bit unclear to me how that feature would be used by the Vulkan driver, so I'd rather leave that NO_FENCE use case for later.
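[Editor's note] For comparison with the per-BO flags sketched earlier, a job-level variant might look like the snippet below; again this is purely illustrative and not part of the proposed series.

```c
#include <linux/types.h>

/* Hypothetical job-level flag: ignore dma_resv-based (implicit) deps for the
 * whole job and rely solely on the explicit in_syncs the caller provides. */
#define PANFROST_SUBMIT_NO_IMPLICIT_DEPS (1 << 0)

struct drm_panfrost_job_desc {
	__u64 jc;            /* job chain head */
	__u64 in_syncs;      /* pointer to explicit deps (syncobj handles) */
	__u32 in_sync_count;
	__u32 flags;         /* e.g. PANFROST_SUBMIT_NO_IMPLICIT_DEPS */
};
```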
On Fri, Mar 12, 2021 at 1:31 AM Boris Brezillon <boris.brezillon@collabora.com> wrote: > > On Thu, 11 Mar 2021 12:11:48 -0600 > Jason Ekstrand <jason@jlekstrand.net> wrote: > > > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have > > > > > > > explicit/implicit deps between them), and Vulkan asks that the out > > > > > > > fence be signaled when all jobs are done. Timeline syncobjs are a > > > > > > > good match for that use case. All we need to do is pass the same > > > > > > > fence syncobj to all jobs being attached to a single QueueSubmit > > > > > > > request, but a different point on the timeline. The syncobj > > > > > > > timeline wait does the rest and guarantees that we've reached a > > > > > > > given timeline point (IOW, all jobs before that point are done) > > > > > > > before declaring the fence as signaled. > > > > > > > One alternative would be to have dummy 'synchronization' jobs that > > > > > > > don't actually execute anything on the GPU but declare a dependency > > > > > > > on all other jobs that are part of the QueueSubmit request, and > > > > > > > signal the out fence (the scheduler would do most of the work for > > > > > > > us, all we have to do is support NULL job heads and signal the > > > > > > > fence directly when that happens instead of queueing the job). > > > > > > > > > > > > I have to admit to being rather hazy on the details of timeline > > > > > > syncobjs, but I thought there was a requirement that the timeline moves > > > > > > monotonically. I.e. if you have multiple jobs signalling the same > > > > > > syncobj just with different points, then AFAIU the API requires that the > > > > > > points are triggered in order. > > > > > > > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I > > > > > might be wrong, but my understanding is that queuing fences (addition > > > > > of new points in the timeline) should happen in order, but signaling > > > > > can happen in any order. When checking for a signaled fence the > > > > > fence-chain logic starts from the last point (or from an explicit point > > > > > if you use the timeline wait flavor) and goes backward, stopping at the > > > > > first un-signaled node. If I'm correct, that means that fences that > > > > > are part of a chain can be signaled in any order. > > > > > > > > You don't even need a timeline for this. Just have a single syncobj > > > > per-queue and make each submit wait on it and then signal it. > > > > Alternatively, you can just always hang on to the out-fence from the > > > > previous submit and make the next one wait on that. > > > > > > That's what I have right now, but it forces the serialization of all > > > jobs that are pushed during a submit (and there can be more than one > > > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd > > > be better to not force this serialization if we can avoid it. > > > > I'm not familiar with panfrost's needs and I don't work on a tiler and > > I know there are different issues there. But... > > > > The Vulkan spec requires that everything that all the submits that > > happen on a given vkQueue happen in-order. Search the spec for > > "Submission order" for more details. > > Duh, looks like I completely occulted the "Submission order" > guarantees. This being said, even after reading this chapter multiple > times I'm not sure what kind of guarantee this gives us, given the > execution itself can be out-of-order. 
> My understanding is that > submission order matters for implicit deps, say you have 2 distinct > VkSubmitInfo, the first one (in submission order) writing to a buffer > and the second one reading from it, you really want the first one to > be submitted first and the second one to wait on the implicit BO fence > created by the first one. If we were to submit out-of-order, this > guarantee wouldn't be met. OTOH, if we have 2 completely independent > submits, I don't really see what submission order gives us if execution > is out-of-order.

Right, this is where things get sticky. What's guaranteed there is submission order not execution order. But, sadly, there's more hidden in there than you might think. Before we can go there, though, we need to establish a few details of how Mali hardware works.

My understanding (feel free to correct me if I'm wrong) is that, roughly, Mali has two engines which have to work together to render: One engine for compute/vertex/geometry and one for binning and fragment. When you go to submit a draw, you fire off any geometry work on the compute engine and the fragment work on the binning engine. At a tile flush boundary (or renderpass in Vulkan), you insert a dependency between the geometry work you put on the compute engine and the binning/fragment work. In a GL driver, you'd probably kick it off to the kernel at this point.

In Vulkan, things are going to get more tricky. You can't just kick off to the kernel at tile flush boundaries. You also can't assume that all vertex work within a given command buffer is independent of all the fragment work. Let's say you have a sequence like this:

vkBeginRenderPass() /* Writes to ImageA */
vkCmdDraw()
vkCmdDraw()
...
vkEndRenderPass()
vkPipelineBarrier(imageA /* fragment -> compute */)
vkCmdDispatch() /* reads imageA, writes BufferB */
vkBeginRenderPass() /* Writes to ImageC */
vkCmdBindVertexBuffers(bufferB)
vkCmdDraw();
...
vkEndRenderPass()

Now you have, in a single command buffer, a draw which writes to an image which is read by a compute shader which writes to a buffer which is read by vertex fetch for another draw. The only way to implement this on a Mali-like device is to ping-pong back and forth between the compute and binning engines. It'd look something like this (assuming my mental model):

A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F

There are a couple of options for how to do this:

1. Each VkCommandBuffer is actually a bunch of command buffers for each engine with dependencies. vkQueueSubmit() calls the kernel submit ioctl a brezillion times to submit them all.

2. Same as 1, only the submit ioctl takes an array of things with dependencies so that you don't have to do as many round-trips to kernel space.

3. Each VkCommandBuffer is two command buffers: one for compute and one for binning, and you use some sort of HW synchronization mechanism to handle the dependencies as you ping-pong between them.

Option 1 is the easiest to get going with if you're working on a kernel driver that was designed for OpenGL but it has the obvious drawback of lots of smaller command buffers and lots of submits. It's not going to scale well.
Option 3 is nice if the hardware can do it (I have no idea if Mali can). Sorry if that's all a re-hash of stuff you already know. I'm just trying to establish a baseline here so we're all talking about the same things. Ok, so what does this have to do with submission order in the spec? The mental model of the Vulkan spec is that you have queues and you submit commands to those queues. Command buffers are just big sequences of commands. Those commands kick off work. The kick-off happens in-order but completion is out-of-order. There are then pipeline barriers which allow you to specify dependencies between those bits of work such as "future compute work depends on previous fragment work". Importantly, those dependencies can even apply across command buffers. So where does this leave us? Well, it depends on your submit model and exactly how you handle pipeline barriers that sync between engines. If you're taking option 3 above and doing two command buffers for each VkCommandBuffer, then you probably want two serialized timelines, one for each engine, and some mechanism to tell the kernel driver "these two command buffers have to run in parallel" so that your ping-pong works. If you're doing 1 or 2 above, I think you probably still want two simple syncobjs, one for each engine. You don't really have any need to go all that far back in history. All you really need to describe is "command buffer X depends on previous compute work" or "command buffer X depends on previous binning work". As long as your multi-submit ioctl processes the command buffers in-order, you can do it all with two syncobjs. Sorry for the tome. I hope it wasn't too wrong and was at least a little helpful. --Jason > In our case, the kernel driver takes care of the submission > serialization (gathering implicit and explicit deps, queuing the job and > assigning the "done" fence to the output sync objects). Once things > are queued, it's the scheduler (drm_sched) deciding of the execution > order. > > > > > So, generally speaking, there are some in-order requirements there. > > Again, not having a lot of tiler experience, I'm not the one to weigh > > in. > > > > > > Timelines are overkill here, IMO. > > > > > > Mind developing why you think this is overkill? After looking at the > > > kernel implementation I thought using timeline syncobjs would be > > > pretty cheap compared to the other options I considered. > > > > If you use a regular syncobj, every time you wait on it it inserts a > > dependency between the current submit and the last thing to signal it > > on the CPU timeline. The internal dma_fences will hang around > > as-needed to ensure those dependencies. If you use a timeline, you > > have to also track a uint64_t to reference the current time point. > > This may work if you need to sync a bunch of in-flight stuff at one > > go, that may work but if you're trying to serialize, it's just extra > > tracking for no point. Again, maybe there's something I'm missing and > > you don't actually want to serialize. > > My understanding (and I might very much be wrong here) is that using a > regular syncobj to do this actually enforces not only the submission > order but also the execution order (each job waiting on the previous > one to complete before being scheduled). The idea of the timeline > syncobj approach is that jobs that have no inter dependencies can be > started in any order, the scheduler picking the first whose deps are > ready (which might not necessarily match submission order). 
The > timeline syncobj allows us to have one point per kernel-submission and > eventually wait on the last point for the fence passed to > vkSubmitQueue(), and some specific point on the timeline for > pSignalSemaphores entries. > > What's more challenging is signal operation ordering: > > " > Signal operation order is a fundamental ordering in Vulkan, giving > meaning to the order in which semaphore and fence signal operations > occur when submitted to a single queue. > > > > 1. The initial order is determined by the order in which > vkQueueSubmit commands are executed on the host, for a single > queue, from first to last. > > 2. The order in which VkSubmitInfo structures are specified in the > pSubmits parameter of vkQueueSubmit, from lowest index to highest. > > 3. The fence signal operation defined by the fence parameter of a > vkQueueSubmit or vkQueueBindSparse command is ordered after all > semaphore signal operations defined by that command. > " > > This means we have to add implicit dependencies on the signaling > itself. We have two options to guarantee that: > > 1/ Transfer one of the queue syncobj timeline point to each semaphore > and fence after job submission (the point itself being dependent > on the position of the submit entry in the array for semaphores, and > the last point for the fence). Problem with this approach is that we > now have an extra TRANSFER_SYNCOBJ call per semaphore/fence > > 2/ Add SYNC jobs (jobs that do not actually execute on the GPU, but > serve as a synchronization point) whose responsibility would be to > do this muxing/transfer as part of the batch submission process. >
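[Editor's note] To illustrate the "two simple syncobjs, one per engine" wiring Jason suggests for submit models 1 and 2, here is a minimal sketch of the dependency bookkeeping. The structs and helper are hypothetical and only show how in/out syncobjs would be picked per job; the actual submission is left to the (single or batched) submit ioctl, which must process jobs in order.

```c
#include <stdbool.h>
#include <stdint.h>

struct engine_queue {
	uint32_t compute_syncobj; /* tracks the last vertex/compute job on this queue */
	uint32_t binning_syncobj; /* tracks the last binning/fragment job on this queue */
};

struct job_desc {
	bool on_binning_engine;       /* fragment/binning vs. compute/vertex */
	bool needs_prev_compute_work; /* "depends on previous compute work" */
	bool needs_prev_binning_work; /* "depends on previous binning work" */
};

/* Pick the in/out syncobjs for one job; returns the number of in_syncs used. */
static unsigned wire_job_syncs(const struct engine_queue *q,
                               const struct job_desc *job,
                               uint32_t in_syncs[2], uint32_t *out_sync)
{
	unsigned n = 0;

	if (job->needs_prev_compute_work)
		in_syncs[n++] = q->compute_syncobj;
	if (job->needs_prev_binning_work)
		in_syncs[n++] = q->binning_syncobj;

	/* The job re-signals the syncobj of the engine it runs on. */
	*out_sync = job->on_binning_engine ? q->binning_syncobj
	                                   : q->compute_syncobj;
	return n;
}
```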
On Fri, 12 Mar 2021 09:37:49 -0600 Jason Ekstrand <jason@jlekstrand.net> wrote: > On Fri, Mar 12, 2021 at 1:31 AM Boris Brezillon > <boris.brezillon@collabora.com> wrote: > > > > On Thu, 11 Mar 2021 12:11:48 -0600 > > Jason Ekstrand <jason@jlekstrand.net> wrote: > > > > > > > > > > 2/ Queued jobs might be executed out-of-order (unless they have > > > > > > > > explicit/implicit deps between them), and Vulkan asks that the out > > > > > > > > fence be signaled when all jobs are done. Timeline syncobjs are a > > > > > > > > good match for that use case. All we need to do is pass the same > > > > > > > > fence syncobj to all jobs being attached to a single QueueSubmit > > > > > > > > request, but a different point on the timeline. The syncobj > > > > > > > > timeline wait does the rest and guarantees that we've reached a > > > > > > > > given timeline point (IOW, all jobs before that point are done) > > > > > > > > before declaring the fence as signaled. > > > > > > > > One alternative would be to have dummy 'synchronization' jobs that > > > > > > > > don't actually execute anything on the GPU but declare a dependency > > > > > > > > on all other jobs that are part of the QueueSubmit request, and > > > > > > > > signal the out fence (the scheduler would do most of the work for > > > > > > > > us, all we have to do is support NULL job heads and signal the > > > > > > > > fence directly when that happens instead of queueing the job). > > > > > > > > > > > > > > I have to admit to being rather hazy on the details of timeline > > > > > > > syncobjs, but I thought there was a requirement that the timeline moves > > > > > > > monotonically. I.e. if you have multiple jobs signalling the same > > > > > > > syncobj just with different points, then AFAIU the API requires that the > > > > > > > points are triggered in order. > > > > > > > > > > > > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I > > > > > > might be wrong, but my understanding is that queuing fences (addition > > > > > > of new points in the timeline) should happen in order, but signaling > > > > > > can happen in any order. When checking for a signaled fence, the > > > > > > fence-chain logic starts from the last point (or from an explicit point > > > > > > if you use the timeline wait flavor) and goes backward, stopping at the > > > > > > first un-signaled node. If I'm correct, that means that fences that > > > > > > are part of a chain can be signaled in any order. > > > > > > > > > > You don't even need a timeline for this. Just have a single syncobj > > > > > per-queue and make each submit wait on it and then signal it. > > > > > Alternatively, you can just always hang on to the out-fence from the > > > > > previous submit and make the next one wait on that. > > > > That's what I have right now, but it forces the serialization of all > > > > jobs that are pushed during a submit (and there can be more than one > > > > per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd > > > > be better to not force this serialization if we can avoid it. > > > I'm not familiar with panfrost's needs and I don't work on a tiler and > > > I know there are different issues there. But... > > > > > > The Vulkan spec requires that all the submits that > > > happen on a given vkQueue happen in-order. Search the spec for > > > "Submission order" for more details. > > > > Duh, looks like I completely overlooked the "Submission order" > > guarantees.
This being said, even after reading this chapter multiple > > times I'm not sure what kind of guarantee this gives us, given the > > execution itself can be out-of-order. My understanding is that > > submission order matters for implicit deps: say you have 2 distinct > > VkSubmitInfo, the first one (in submission order) writing to a buffer > > and the second one reading from it; you really want the first one to > > be submitted first and the second one to wait on the implicit BO fence > > created by the first one. If we were to submit out-of-order, this > > guarantee wouldn't be met. OTOH, if we have 2 completely independent > > submits, I don't really see what submission order gives us if execution > > is out-of-order. > > Right, this is where things get sticky. What's guaranteed there is > submission order, not execution order. But, sadly, there's more hidden > in there than you might think. Before we can go there, though, we > need to establish a few details of how Mali hardware works. > > My understanding (feel free to correct me if I'm wrong) is that, > roughly, Mali has two engines which have to work together to render: > One engine for compute/vertex/geometry and one for binning and > fragment. When you go to submit a draw, you fire off any geometry > work on the compute engine and the fragment work on the binning > engine. At a tile flush boundary (or renderpass in Vulkan), you > insert a dependency between the geometry work you put on the compute > engine and the binning/fragment work. In a GL driver, you'd probably > kick it off to the kernel at this point. > > In Vulkan, things are going to get more tricky. You can't just kick > off to the kernel at tile flush boundaries. You also can't assume > that all vertex work within a given command buffer is independent of > all the fragment work. Let's say you have a sequence like this: > > vkBeginRenderPass() /* Writes to ImageA */ > vkCmdDraw() > vkCmdDraw() > ... > vkEndRenderPass() > vkPipelineBarrier(imageA /* fragment -> compute */) > vkCmdDispatch() /* reads imageA, writes BufferB */ > vkBeginRenderPass() /* Writes to ImageC */ > vkCmdBindVertexBuffers(bufferB) > vkCmdDraw(); > ... > vkEndRenderPass() > > Now you have, in a single command buffer, a draw which writes to an > image which is read by a compute shader which writes to a buffer which > is read by vertex fetch for another draw. The only way to implement > this on a Mali-like device is to ping-pong back and forth between the > compute and binning engines. It'd look something like this (assuming > my mental model): > > A: Vertex for the first draw on the compute engine > B: Vertex for the second draw on the compute engine > C: Fragment for the first draw on the binning engine; depends on A > D: Fragment for the second draw on the binning engine; depends on B > E: Compute on the compute engine; depends on C and D > F: Vertex for the third draw on the compute engine; depends on E > G: Fragment for the third draw on the binning engine; depends on F > > There are a couple of options for how to do this: > > 1. Each VkCommandBuffer is actually a bunch of command buffers for > each engine with dependencies. vkQueueSubmit() calls the kernel > submit ioctl a brezillion times to submit them all. This is what we currently have. > > 2. Same as 1, only the submit ioctl takes an array of things with > dependencies so that you don't have to do as many round-trips to > kernel space. This is the PoC I worked on. > > 3.
Each VkCommandBuffer is two command buffers: one for compute and > one for binning, and you use some sort of HW synchronization mechanism > to handle the dependencies as you ping-pong between them. I didn't consider that option. We have a DOORBELL instruction on Bifrost to wake up the CPU when the GPU wants to report something (not even sure we have something equivalent on Midgard), but there's no inter-queue sync mechanism AFAICT. > > Option 1 is the easiest to get going with if you're working on a > kernel driver that was designed for OpenGL, but it has the obvious > drawback of lots of smaller command buffers and lots of submits. It's > not going to scale well. Option 3 is nice if the hardware can do it > (I have no idea if Mali can). > > Sorry if that's all a re-hash of stuff you already know. I'm just > trying to establish a baseline here so we're all talking about the > same things. > > Ok, so what does this have to do with submission order in the spec? > The mental model of the Vulkan spec is that you have queues and you > submit commands to those queues. Command buffers are just big > sequences of commands. Those commands kick off work. The kick-off > happens in-order but completion is out-of-order. There are then > pipeline barriers which allow you to specify dependencies between > those bits of work such as "future compute work depends on previous > fragment work". Importantly, those dependencies can even apply across > command buffers. > > So where does this leave us? Well, it depends on your submit model > and exactly how you handle pipeline barriers that sync between > engines. If you're taking option 3 above and doing two command > buffers for each VkCommandBuffer, then you probably want two > serialized timelines, one for each engine, and some mechanism to tell > the kernel driver "these two command buffers have to run in parallel" > so that your ping-pong works. If you're doing 1 or 2 above, I think > you probably still want two simple syncobjs, one for each engine. You > don't really have any need to go all that far back in history. All > you really need to describe is "command buffer X depends on previous > compute work" or "command buffer X depends on previous binning work". Okay, so this will effectively force in-order execution. Let's take your previous example and add 2 more jobs at the end that have no deps on previous commands: vkBeginRenderPass() /* Writes to ImageA */ vkCmdDraw() vkCmdDraw() ... vkEndRenderPass() vkPipelineBarrier(imageA /* fragment -> compute */) vkCmdDispatch() /* reads imageA, writes BufferB */ vkBeginRenderPass() /* Writes to ImageC */ vkCmdBindVertexBuffers(bufferB) vkCmdDraw(); ... vkEndRenderPass() vkBeginRenderPass() /* Writes to ImageD */ vkCmdDraw() ... vkEndRenderPass() A: Vertex for the first draw on the compute engine B: Vertex for the second draw on the compute engine C: Fragment for the first draw on the binning engine; depends on A D: Fragment for the second draw on the binning engine; depends on B E: Compute on the compute engine; depends on C and D F: Vertex for the third draw on the compute engine; depends on E G: Fragment for the third draw on the binning engine; depends on F H: Vertex for the fourth draw on the compute engine I: Fragment for the fourth draw on the binning engine When we reach E, we might be waiting for D to finish before scheduling the job, and because of the implicit serialization we have on the compute queue (F implicitly depends on E, and H on F) we can't schedule H either, which could, in theory, be started.
I guess that's where the term submission order is a bit unclear to me. The action of starting a job sounds like execution order to me (the order you start jobs determines the execution order since we only have one HW queue per job type). All implicit deps have been calculated when we queued the job to the SW queue, and I thought that would be enough to meet the submission order requirements, but I might be wrong. The PoC I have was trying to get rid of this explicit serialization on the compute and fragment queues by having one syncobj timeline (queue(<syncpoint>)) and synchronization points (Sx). S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point A: in-fences=<explicit_deps>, out-fences=<queue(1)> B: in-fences=<explicit_deps>, out-fences=<queue(2)> C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context H: in-fences=<explicit_deps>, out-fence=<queue(8)> I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point # QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point With this solution, H can be started before E if the compute slot is empty and E's implicit deps are not done. It's probably overkill, but I thought maximizing GPU utilization was important. > As long as your multi-submit ioctl processes the command buffers > in-order, you can do it all with two syncobjs. > > Sorry for the tome. I hope it wasn't too wrong and was at least a > little helpful. It definitely helps, even if the concept of submission order is still unclear to me, especially when applied to GPUs that have a single HW queue, and where submission and execution order could be considered the same thing.
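As a rough illustration of the "wait on the last point" idea above, a QueueWaitIdle could boil down to a single timeline wait. This is only a sketch of the idea, not the PoC's actual code, and queue_timeline/last_point are hypothetical names:

/* Sketch only: wait until every point up to last_point on the queue's
 * timeline syncobj has signaled. */
#include <stdint.h>
#include <xf86drm.h>

static int queue_wait_idle(int drm_fd, uint32_t queue_timeline,
                           uint64_t last_point, int64_t timeout_ns)
{
        uint32_t handles[1] = { queue_timeline };
        uint64_t points[1] = { last_point };

        /* WAIT_FOR_SUBMIT also covers points whose fences haven't
         * materialized yet (jobs queued but not yet submitted). */
        return drmSyncobjTimelineWait(drm_fd, handles, points, 1, timeout_ns,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL |
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      NULL);
}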
On Fri, 12 Mar 2021 19:25:13 +0100 Boris Brezillon <boris.brezillon@collabora.com> wrote: > > So where does this leave us? Well, it depends on your submit model > > and exactly how you handle pipeline barriers that sync between > > engines. If you're taking option 3 above and doing two command > > buffers for each VkCommandBuffer, then you probably want two > > serialized timelines, one for each engine, and some mechanism to tell > > the kernel driver "these two command buffers have to run in parallel" > > so that your ping-pong works. If you're doing 1 or 2 above, I think > > you probably still want two simple syncobjs, one for each engine. You > > don't really have any need to go all that far back in history. All > > you really need to describe is "command buffer X depends on previous > > compute work" or "command buffer X depends on previous binning work". > > Okay, so this will effectively force in-order execution. Let's take your > previous example and add 2 more jobs at the end that have no deps on > previous commands: > > vkBeginRenderPass() /* Writes to ImageA */ > vkCmdDraw() > vkCmdDraw() > ... > vkEndRenderPass() > vkPipelineBarrier(imageA /* fragment -> compute */) > vkCmdDispatch() /* reads imageA, writes BufferB */ > vkBeginRenderPass() /* Writes to ImageC */ > vkCmdBindVertexBuffers(bufferB) > vkCmdDraw(); > ... > vkEndRenderPass() > vkBeginRenderPass() /* Writes to ImageD */ > vkCmdDraw() > ... > vkEndRenderPass() > > A: Vertex for the first draw on the compute engine > B: Vertex for the second draw on the compute engine > C: Fragment for the first draw on the binning engine; depends on A > D: Fragment for the second draw on the binning engine; depends on B > E: Compute on the compute engine; depends on C and D > F: Vertex for the third draw on the compute engine; depends on E > G: Fragment for the third draw on the binning engine; depends on F > H: Vertex for the fourth draw on the compute engine > I: Fragment for the fourth draw on the binning engine > > When we reach E, we might be waiting for D to finish before scheduling > the job, and because of the implicit serialization we have on the > compute queue (F implicitly depends on E, and H on F) we can't schedule > H either, which could, in theory, be started. I guess that's where the > term submission order is a bit unclear to me. The action of starting a > job sounds like execution order to me (the order you start jobs > determines the execution order since we only have one HW queue per job > type). All implicit deps have been calculated when we queued the job to > the SW queue, and I thought that would be enough to meet the submission > order requirements, but I might be wrong. > > The PoC I have was trying to get rid of this explicit serialization on > the compute and fragment queues by having one syncobj timeline > (queue(<syncpoint>)) and synchronization points (Sx).
> > S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point > A: in-fences=<explicit_deps>, out-fences=<queue(1)> > B: in-fences=<explicit_deps>, out-fences=<queue(2)> > C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context > D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context > E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA > F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B > G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context > H: in-fences=<explicit_deps>, out-fence=<queue(8)> > I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer > S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point > # QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point > > With this solution H can be started before E if the compute slot > is empty and E's implicit deps are not done. It's probably overkill, > but I thought maximizing GPU utilization was important. Nevermind, I forgot the drm scheduler was dequeuing jobs in order, so 2 syncobjs (one per queue type) is indeed the right approach.
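For reference, a rough sketch of that "one binary syncobj per queue type" scheme on top of the existing submit ioctl; the queue bookkeeping and helper names are invented for illustration, and only struct drm_panfrost_submit, PANFROST_JD_REQ_FS and DRM_IOCTL_PANFROST_SUBMIT come from the real uapi:

/* Sketch, not real driver code: one binary syncobj per engine. Each job
 * waits on the engines it depends on and replaces the fence held by the
 * syncobj of the engine it runs on. The include path of panfrost_drm.h
 * depends on your setup. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <drm/panfrost_drm.h>

struct example_queue {
        int fd;
        uint32_t compute_sync;  /* tracks the last vertex/compute job */
        uint32_t fragment_sync; /* tracks the last fragment job */
};

static int example_submit_job(struct example_queue *q, uint64_t jc,
                              bool is_fragment, bool dep_on_compute,
                              bool dep_on_fragment, uint32_t *bos,
                              uint32_t bo_count)
{
        struct drm_panfrost_submit submit;
        uint32_t in_syncs[2];
        uint32_t in_sync_count = 0;

        if (dep_on_compute)
                in_syncs[in_sync_count++] = q->compute_sync;
        if (dep_on_fragment)
                in_syncs[in_sync_count++] = q->fragment_sync;

        memset(&submit, 0, sizeof(submit));
        submit.jc = jc;
        submit.in_syncs = (uintptr_t)in_syncs;
        submit.in_sync_count = in_sync_count;
        /* The job's done-fence replaces whatever was in the engine's
         * syncobj, so the next job depending on that engine waits on
         * everything submitted to it so far. */
        submit.out_sync = is_fragment ? q->fragment_sync : q->compute_sync;
        submit.bo_handles = (uintptr_t)bos;
        submit.bo_handle_count = bo_count;
        submit.requirements = is_fragment ? PANFROST_JD_REQ_FS : 0;

        return drmIoctl(q->fd, DRM_IOCTL_PANFROST_SUBMIT, &submit);
}

Since the drm scheduler dequeues each queue's jobs in submission order (the point made above), same-engine ordering comes for free and the two syncobjs mostly carry the "depends on previous work on the other engine" cases.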
> > 3. Each VkCommandBuffer is two command buffers: one for compute and > > one for binning, and you use some sort of HW synchronization mechanism > > to handle the dependencies as you ping-pong between them. > > I didn't consider that option. We have a DOORBELL instruction on Bifrost > to wake up the CPU when the GPU wants to report something (not even > sure we have something equivalent on Midgard), but there's no > inter-queue sync mechanism AFAICT. I was about to say that we don't have hardware support, and even considering DOORBELL (which AFAIK isn't piped through to anything on kbase at least), I'm still inclined to say we don't. Option 2 (what kbase does? dunno how DDK Vulkan works though) is probably our best bet, tbh.
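To make "option 2" a bit more concrete, here is a purely hypothetical shape for an array-based submit; none of these names come from the actual series, they only illustrate the idea of one ioctl carrying several jobs, each with its own sync and BO lists:

/* Hypothetical sketch only; the real proposal may look nothing like this. */
#include <stdint.h>

struct example_job {
        uint64_t jc;             /* GPU address of the job chain */
        uint64_t in_syncs;       /* user pointer to u32 syncobj handles */
        uint32_t in_sync_count;
        uint32_t out_sync;       /* syncobj receiving the done fence */
        uint64_t bo_handles;     /* user pointer to u32 BO handles */
        uint32_t bo_handle_count;
        uint32_t requirements;   /* e.g. fragment vs vertex/compute */
};

struct example_batch_submit {
        uint64_t jobs;           /* user pointer to struct example_job[] */
        uint32_t job_count;
        uint32_t pad;
};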