Message ID | 20210520190007.534046-5-jason@jlekstrand.net (mailing list archive)
---|---
State | New, archived
Series | dma-buf: Add an API for exporting sync files (v8)
Hi, On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote: > In the Vulkan world, this is useful for dealing with the out-fence from > vkQueuePresent. Current Linux window-systems (X11, Wayland, etc.) all > rely on dma-buf implicit sync. Since Vulkan is an explicit sync API, we > get a set of fences (VkSemaphores) in vkQueuePresent and have to stash > those as an exclusive (write) fence on the dma-buf. We handle it in > Mesa today with the above mentioned dummy submit trick. This ioctl > would allow us to set it directly without the dummy submit. > > This may also open up possibilities for GPU drivers to move away from > implicit sync for their kernel driver uAPI and instead provide sync > files and rely on dma-buf import/export for communicating with other > implicit sync clients. > > We make the explicit choice here to only allow setting RW fences which > translates to an exclusive fence on the dma_resv. There's no use for > read-only fences for communicating with other implicit sync userspace > and any such attempts are likely to be racy at best. I think I almost follow, but I'm not quite seeing the race you allude to. Let me talk through my understanding so it's hopefully more clear for others as a summary. Please correct me on anything I've misunderstood or just missed completely. (I thought when I wrote this intro that the email might be relatively snappy, but it's really long and takes in a lot of breadth. Sorry.) So as far as I'm reading this patchset and the Mesa MR, this _only_ concerns the out-fence (i.e. compositor -> client AcquireNextImage semaphore/fence) so far. The in-fence (client->compositor QueuePresent wait-semaphores/fences) is handled by the driver ensuring that an exclusive resv is placed on the union of the signal semaphores passed to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO flags, or ... either way, no problem as it should always be exclusive, and it seems pretty uncontroversial that we should pull the wait semaphore into an exclusive fence so that no downstream consumer will begin using it until the client ops have fully retired. For AcquireNextImage, your patchset exports all the fences (shared and exclusive both) on the dma_resv out into the semaphore/fence such that the client can't progress (CPU-side for VkFence, GPU-side for VkSemaphore) until all currently queued operations have completely retired. So, as long as the server ensures that all its kernel-side work is flushed before its IPC to unblock ANI (wl_buffer.release or DRI3 PresentIdle, both indicating that the client is free to reuse the buffer, subject to synchronising against implicit fences on the resv), all good: we snapshot the current resv state, wrap the relevant driver-side Vulkan primitive around it, and go back to explicit synchronisation. The client can't race generating new work against this, because it can't queue any work until ANI has returned and queued a signaling op on the semaphore/fence. So far, so good. I really like both your series up until this narrative point; as I was saying in the userspace-fence thread, the WSI<->client thread is certainly pulling a very big lever with a heavyweight transition between the two different worlds, and I like that the new export ioctl lets us be very clear about what exactly is happening under the hood. Good stuff. So, what gives with the import ioctl? 
Taking a guess at where you're going, the import ioctl is going to be used in QueuePresent just as the export ioctl is in ANI: instead of having CS flags to write into the resv exclusive slot or per-BO flags to always dump in exclusive, there's a forthcoming patch somewhere which lets drivers skip this and instead have common QueuePresent code dump the wait semaphore into the resv, so servers on the other side of an implicit-only protocol will synchronise their access against the fence imported as part of client-side QueuePresent? That makes sense to me and is nicely symmetrical, plus it gets GPU drivers further out of the business of doing magic winsys/display synchronisation, which is good. But why enforce that imported fences have to go into the exclusive slot? It makes sense from the point of view of clients doing QueueSubmit, but I think it might preclude other uses for no particularly good reason. Certainly on X11 there's no real use for the shared slot - you get into difficulties with mixed client/server rendering - but at least Wayland and GBM are always non-destructive, so conceptually have a good use for being able to import fences into the shared resv. This goes for media and KMS access as well, if you think beyond the usecase of an explicit Vulkan client sending through to a dumb implicit server which just wants to texture. Media clients are definitely a relevant usecase, both directly with V4L2/VA-API, neither of which have support for explicit synchronisation yet and (at least for V4L2) are unlikely to do so in the near future, but even more with pipeline frameworks like GStreamer and PipeWire, which don't have fundamental awareness (they're prepared to deal with deep pipelines which return at arbitrary times, but once anything is returned, it is available for immediate use). Maybe open-coding CPU-side waits in these users is the best way longer term, but for easier interop if nothing else, being able to pull shared fences in seems like it could be a win ('the compositor is still texturing from this for now, so feel free to read back from ref frames, but don't scribble over it until it's finished'). I'm slightly bemused that there's so much energy spent on designing around winsys being dumb and implicit. We did take a long time to roll out sync_file support, but that was only because we didn't quite understand all the nuances of why implicit sync is so difficult for GPU drivers on modern architectures and how it was actually a win compared to doing nothing; maybe we should have some kind of conference where we all get together and talk to each other ... ? Anyway, by the time we got to the cusp of rolling out bi-directional sync_file support, suddenly we had drm_syncobj. By the time that had semi-settled down and started to be rolled out, then we suddenly had userspace fences on the horizon. And what do we do with those? We've - certainly Weston, and I'm pretty confident I speak for Simon on the wlroots side and Jonas for Mutter - landed on accepting that we're going to have to deal with timeline semaphores and wait-before-signal; we have a credible plan (we think) for protocol to deal with at least syncobj, which should conceptually extend to userspace fences. The biggest blocker there is the syncobj uAPI. Compositors aren't Vulkan clients - we don't have one thread per group of work, because the inter-client synchronisation becomes nightmarish overhead for little gain. 
I don't think this will ever change, because the balance of work is totally different between client and compositor. Anyway, the problem with syncobj is that the ioctl to wait for a sync_file to materialise for a given timeline point only allows us to block with a timeout; this is a non-starter, because we need something which fits into epoll. The most optimal case is returning a new eventfd or similar which signals when a given timeline point becomes available or signaled, but in extremis a syncobj FD becoming readable when any activity which would change the result of any zero-timeout wait on that syncobj is more or less workable. What we do want to do though, regardless of the primitive or its semantics, is to only have to go through this once, not rework it all in six months, and have to support a bunch of diverging codepaths. So what is the optimal synchronisation primitive we should be aiming for on the winsys side? Is sync_file a good enough lowest common denominator, or should we do timeline-syncobj for fancy clients, in tandem with sync_file bolted on the side for media and ancient GPUs? Or are userspace fences going to give us some new primitive? And if so, can we please talk about those semantics now, so we don't end up with three synchronisation mechanisms which are all sort of but not really alike? As far as I can tell, the three relevant vendors with (more or less) upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit more up in the air as the hardware and specs aren't currently available to the upstream development teams, but hopefully we can bring them into this conversation. I think it's looking like we're optimising for all of the following, without leaving anyone out in the cold: 1. Modern userspace-fencing hardware running on a userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the big three Wayland compositors which handle enough synchronisation logic internally that the choice of underlying composition/presentation API is immaterial (for which any new API is important) 2. Modern userspace-fencing hardware running on Xwayland, for which we'll inevitably have to type out new DRI3 synchronisation, but derived purely from the Wayland protocols so it can be passed through quite directly, and with any mixed X11<->user buffer client interaction (e.g. subwindows) being penalised by a long blocking wait in the X server 3. sync_file-oriented hardware, for which we need to do ~nothing 4. Modern userspace-fencing hardware and APIs which need to interop with media units, for all four combinations of client/server source/sink; for some APIs like Vulkan Video synchronisation is not a problem, for others like VA-API/V4L2/GStreamer we're probably going to need to live with the implicit semantics for the foreseeable future, which means we should do the right thing for them up front, because why would you even be playing games if you're not streaming them to Twitch at the same time? (Personally I'm much happier that no-one other than my friends who already know I'm terrible can see me playing CoD, but y'know, kids these days ...) The fifth case I've left out is clients who want to smash Vulkan content directly into the display. For games and Kodi-like UI I'm just going to ignore this, because (maybe I'm biased, but) using KMS directly is difficult enough that you shouldn't do it and just use a winsys because we've spent a very long time dealing with these problems for you. 
The remaining usecase there is XR, but their uses are very different, and OpenXR already deals with a _lot_ of this for you, with the runtimes having sufficiently sophisticated synchronisation internally that whatever we come up with won't be onerous for them to implement. Either way, integration with KMS/GBM has the same problem as Wayland, in that unless you fully encapsulate it in something like VK_KHR_display, you don't get to throw a thread under the bus to do delayed submit, because you need to actually return something to the caller. Ultimately, I think I'm leaning towards agreeing with Christian that I would like to see a good holistic model for how this works in a variety of usecases before we progress new uAPI, but also to agreeing with you and Dan in that how amdgpu currently implements things is currently objectively very wrong (despite the motivations for having implemented it that way). If there are any usecases I've missed then I'm all ears, but else I think we should take this forward by working with Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic skeletons for end-to-end synchronisation through those usecases. It's clear that this never happened for syncobj, because it's just not usable as a primitive in any compositor which exists today: let's make sure we don't get into the same trap again. If we can do this over email/GitLab then that's great, but if not, maybe we need to do a kind of mini-XDC with some kind of virtual whiteboard we can all scribble over. (As a coda to this, I'm pretty sure I understand the subtleties of the memory fences for relocation/shootdown, but it would be _really_ good to have a coherent description everyone agrees on written down somewhere, so people can understand the issues without having to read 250 emails with people at loggerheads.) Cheers, Daniel > When we got to > insert the RW fence, the actual fence we set as the new exclusive fence > is a combination of the sync_file provided by the user and all the other > fences on the dma_resv. This ensures that the newly added exclusive > fence will never signal before the old one would have and ensures that > we don't break any dma_resv contracts. We require userspace to specify > RW in the flags for symmetry with the export ioctl and in case we ever > want to support read fences in the future. > > There is one downside here that's worth documenting: If two clients > writing to the same dma-buf using this API race with each other, their > actions on the dma-buf may happen in parallel or in an undefined order. > Both with and without this API, the pattern is the same: Collect all > the fences on dma-buf, submit work which depends on said fences, and > then set a new exclusive (write) fence on the dma-buf which depends on > said work. The difference is that, when it's all handled by the GPU > driver's submit ioctl, the three operations happen atomically under the > dma_resv lock. If two userspace submits race, one will happen before > the other. You aren't guaranteed which but you are guaranteed that > they're strictly ordered. If userspace manages the fences itself, then > these three operations happen separately and the two render operations > may happen genuinely in parallel or get interleaved. However, this is a > case of userspace racing with itself. As long as we ensure userspace > can't back the kernel into a corner, it should be fine. > > v2 (Jason Ekstrand): > - Use a wrapper dma_fence_array of all fences including the new one > when importing an exclusive fence. 
> > v3 (Jason Ekstrand): > - Lock around setting shared fences as well as exclusive > - Mark SIGNAL_SYNC_FILE as a read-write ioctl. > - Initialize ret to 0 in dma_buf_wait_sync_file > > v4 (Jason Ekstrand): > - Use the new dma_resv_get_singleton helper > > v5 (Jason Ekstrand): > - Rename the IOCTLs to import/export rather than wait/signal > - Drop the WRITE flag and always get/set the exclusive fence > > v5 (Jason Ekstrand): > - Split import and export into separate patches > - New commit message > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> > --- > drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++ > include/uapi/linux/dma-buf.h | 1 + > 2 files changed, 33 insertions(+) > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > index 7679562b57bfc..c9d6b9195c00c 100644 > --- a/drivers/dma-buf/dma-buf.c > +++ b/drivers/dma-buf/dma-buf.c > @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf, > put_unused_fd(fd); > return ret; > } > + > +static long dma_buf_import_sync_file(struct dma_buf *dmabuf, > + const void __user *user_data) > +{ > + struct dma_buf_sync_file arg; > + struct dma_fence *fence, *singleton = NULL; > + int ret = 0; > + > + if (copy_from_user(&arg, user_data, sizeof(arg))) > + return -EFAULT; > + > + if (arg.flags != DMA_BUF_SYNC_RW) > + return -EINVAL; > + > + fence = sync_file_get_fence(arg.fd); > + if (!fence) > + return -EINVAL; > + > + dma_resv_lock(dmabuf->resv, NULL); > + > + ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton); > + if (!ret && singleton) > + dma_resv_add_excl_fence(dmabuf->resv, singleton); > + > + dma_resv_unlock(dmabuf->resv); > + > + dma_fence_put(fence); > + > + return ret; > +} > #endif > > static long dma_buf_ioctl(struct file *file, > @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file, > #if IS_ENABLED(CONFIG_SYNC_FILE) > case DMA_BUF_IOCTL_EXPORT_SYNC_FILE: > return dma_buf_export_sync_file(dmabuf, (void __user *)arg); > + case DMA_BUF_IOCTL_IMPORT_SYNC_FILE: > + return dma_buf_import_sync_file(dmabuf, (const void __user *)arg); > #endif > > default: > diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h > index f902cadcbdb56..75fdde4800267 100644 > --- a/include/uapi/linux/dma-buf.h > +++ b/include/uapi/linux/dma-buf.h > @@ -70,5 +70,6 @@ struct dma_buf_sync_file { > #define DMA_BUF_SET_NAME_A _IOW(DMA_BUF_BASE, 1, u32) > #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, u64) > #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file) > +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync) > > #endif > -- > 2.31.1 >
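[For reference, a minimal userspace sketch of how a WSI might drive the two ioctls discussed above, assuming the uAPI from this series: struct dma_buf_sync_file carries a flags word plus a sync_file fd, and the export ioctl (from the earlier patch in the series) returns the new fd in that field. Error handling and the Vulkan plumbing are omitted, and the header is the patched uapi one.]

#include <sys/ioctl.h>
#include <linux/dma-buf.h>   /* patched uapi header from this series */

/* AcquireNextImage side: snapshot the implicit fences into a sync_file. */
static int wsi_export_implicit_fences(int dmabuf_fd)
{
    struct dma_buf_sync_file arg = { .flags = DMA_BUF_SYNC_RW, .fd = -1 };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
        return -1;
    return arg.fd;   /* wrap this in the VkFence/VkSemaphore payload */
}

/* QueuePresent side: push the signal semaphore's sync_file back into the resv. */
static int wsi_import_explicit_fence(int dmabuf_fd, int sync_file_fd)
{
    struct dma_buf_sync_file arg = { .flags = DMA_BUF_SYNC_RW, .fd = sync_file_fd };

    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}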
Hi guys, separating that discussion out since Daniel had a rather interesting idea here. On 22.05.21 at 22:05, Daniel Stone wrote: > [SNIP] > Anyway, the problem with syncobj is that the ioctl to wait for a > sync_file to materialise for a given timeline point only allows us to > block with a timeout; this is a non-starter, because we need something > which fits into epoll. The most optimal case is returning a new > eventfd or similar which signals when a given timeline point becomes > available or signaled, but in extremis a syncobj FD becoming readable > when any activity which would change the result of any zero-timeout > wait on that syncobj is more or less workable. > I think the tricky part is to epoll for a certain value. Not sure how eventfd is supposed to work, but IIRC we don't have the functionality to poll for a certain value/offset etc. to become available. We could of course create a separate fd for each requested value to poll for though, but that sounds like a bit much overhead to me. Apart from that this is a really interesting idea. Regards, Christian.
Hi Christian, On Sun, 23 May 2021 at 18:16, Christian König <christian.koenig@amd.com> wrote: > Am 22.05.21 um 22:05 schrieb Daniel Stone: > > Anyway, the problem with syncobj is that the ioctl to wait for a > > sync_file to materialise for a given timeline point only allows us to > > block with a timeout; this is a non-starter, because we need something > > which fits into epoll. The most optimal case is returning a new > > eventfd or similar which signals when a given timeline point becomes > > available or signaled, but in extremis a syncobj FD becoming readable > > when any activity which would change the result of any zero-timeout > > wait on that syncobj is more or less workable. > > I think the tricky part is to epoll for a certain value. > > Not sure how eventfd is supposed to work, but IIRC we don't have the > functionality to poll for a certain value/offset etc to become available. > > We could of course create a separate fd for each requested value to poll > for thought, but that sounds like a bit much overhead to me. Yeah, I understand the point; something like an eventfd is exactly what you think, that we would have to materialise a new FD for it. On the other hand, as soon as the sync point becomes available, the winsys will want to immediately extract a sync_file from it, so I don't think FD amplification is a big issue. If it looks like being a problem in practice, we could look at providing a FD which starts off inert, and only later becomes a sync_file after it polls readable, but that sounds like something we really want to avoid if we can. Cheers, Daniel
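[For reference, a sketch of the blocking wait Daniel is describing, against today's drm_syncobj uAPI; the handle and point values are placeholders. With DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE the ioctl can only block for a timeout or fail, so an epoll-driven compositor has nothing to watch.]

#include <stdint.h>
#include <xf86drm.h>

static int wait_point_available(int drm_fd, uint32_t syncobj, uint64_t point)
{
    struct drm_syncobj_timeline_wait wait = {
        .handles = (uintptr_t)&syncobj,
        .points = (uintptr_t)&point,
        .timeout_nsec = 0,   /* any larger value just blocks the event loop */
        .count_handles = 1,
        .flags = DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
    };

    /* Fails with ETIME if no fence has materialised yet; there is no fd
     * the compositor could add to its epoll set to learn when that changes. */
    return drmIoctl(drm_fd, DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT, &wait);
}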
On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote: > > Hi, > > On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote: > > In the Vulkan world, this is useful for dealing with the out-fence from > > vkQueuePresent. Current Linux window-systems (X11, Wayland, etc.) all > > rely on dma-buf implicit sync. Since Vulkan is an explicit sync API, we > > get a set of fences (VkSemaphores) in vkQueuePresent and have to stash > > those as an exclusive (write) fence on the dma-buf. We handle it in > > Mesa today with the above mentioned dummy submit trick. This ioctl > > would allow us to set it directly without the dummy submit. > > > > This may also open up possibilities for GPU drivers to move away from > > implicit sync for their kernel driver uAPI and instead provide sync > > files and rely on dma-buf import/export for communicating with other > > implicit sync clients. > > > > We make the explicit choice here to only allow setting RW fences which > > translates to an exclusive fence on the dma_resv. There's no use for > > read-only fences for communicating with other implicit sync userspace > > and any such attempts are likely to be racy at best. > > I think I almost follow, but I'm not quite seeing the race you allude > to. Let me talk through my understanding so it's hopefully more clear > for others as a summary. Please correct me on anything I've > misunderstood or just missed completely. (I thought when I wrote this > intro that the email might be relatively snappy, but it's really long > and takes in a lot of breadth. Sorry.) > > So as far as I'm reading this patchset and the Mesa MR, this _only_ > concerns the out-fence (i.e. compositor -> client AcquireNextImage > semaphore/fence) so far. The in-fence (client->compositor QueuePresent > wait-semaphores/fences) is handled by the driver ensuring that an > exclusive resv is placed on the union of the signal semaphores passed > to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO > flags, or ... either way, no problem as it should always be exclusive, > and it seems pretty uncontroversial that we should pull the wait > semaphore into an exclusive fence so that no downstream consumer will > begin using it until the client ops have fully retired. > > For AcquireNextImage, your patchset exports all the fences (shared and > exclusive both) on the dma_resv out into the semaphore/fence such that > the client can't progress (CPU-side for VkFence, GPU-side for > VkSemaphore) until all currently queued operations have completely > retired. So, as long as the server ensures that all its kernel-side > work is flushed before its IPC to unblock ANI (wl_buffer.release or > DRI3 PresentIdle, both indicating that the client is free to reuse the > buffer, subject to synchronising against implicit fences on the resv), > all good: we snapshot the current resv state, wrap the relevant > driver-side Vulkan primitive around it, and go back to explicit > synchronisation. The client can't race generating new work against > this, because it can't queue any work until ANI has returned and > queued a signaling op on the semaphore/fence. > > So far, so good. I really like both your series up until this > narrative point; as I was saying in the userspace-fence thread, the > WSI<->client thread is certainly pulling a very big lever with a > heavyweight transition between the two different worlds, and I like > that the new export ioctl lets us be very clear about what exactly is > happening under the hood. 
Good stuff. Glad to hear. If you turned that into RBs on the first three patches in this series and all but the last patch on the Mesa MR, it'd make me even happier. :-D At this point, I think everyone is pretty happy with the first three patches and the export ioctl. In the Vulkan WSI code, it solves a significant over-synchronization issue for ANV. Also, the uAPI shouldn't be controversial at all because it's identical to poll() except that it gives you a FD you can poll on later to get the result of the poll as it would be now. I think if we get some Mesa reviews, we should be able to land those. It's import that's trickier. > So, what gives with the import ioctl? Taking a guess at where you're > going, the import ioctl is going to be used in QueuePresent just as > the export ioctl is in ANI: instead of having CS flags to write into > the resv exclusive slot or per-BO flags to always dump in exclusive, > there's a forthcoming patch somewhere which lets drivers skip this and > instead have common QueuePresent code dump the wait semaphore into the > resv, so servers on the other side of an implicit-only protocol will > synchronise their access against the fence imported as part of > client-side QueuePresent? Correct. And the patch isn't forthcoming. It already exists as the top patch in the Mesa MR (!4037). > That makes sense to me and is nicely symmetrical, plus it gets GPU > drivers further out of the business of doing magic winsys/display > synchronisation, which is good. But why enforce that imported fences > have to go into the exclusive slot? It makes sense from the point of > view of clients doing QueueSubmit, but I think it might preclude other > uses for no particularly good reason. Mostly, I was trying to limit the scope a bit. Import is tricky and, to be honest, I'm still not 100% sure that it's safe. I probably should have left RFC on this patch. As long as we keep everything inside the kernel, there's a requirement that any work which triggers any fence on a dma_resv waits on the exclusive fence, if any. Work which sets the exclusive fence has to wait on all fences. With the import ioctl, we can guarantee that the fences fire in the right order by making an array fence containing the new fence and all other fences and using that as the exclusive fence. What we can't do, however, is ensure that the work which triggers the fence actually blocks on ANY of the fences on the dma_resv. Hrm... I think I've now convinced myself that importing shared fences is no more dangerous than importing an exclusive fence because they share the same set of problems. Unfortunately, I've unconvinced myself of the safety of either. I've got to think some more.... The most convincing argument I can make to myself is that this ioctl is like having a one-shot little queue that contains tiny little work jobs which wait on whatever sync_file you provide, do nothing, and then signal. That should be safe, right? > Certainly on X11 there's no real use for the shared slot - you get > into difficulties with mixed client/server rendering - but at least > Wayland and GBM are always non-destructive, so conceptually have a > good use for being able to import fences into the shared resv. This > goes for media and KMS access as well, if you think beyond the usecase > of an explicit Vulkan client sending through to a dumb implicit server > which just wants to texture. 
> > Media clients are definitely a relevant usecase, both directly with > V4L2/VA-API, neither of which have support for explicit > synchronisation yet and (at least for V4L2) are unlikely to do so in > the near future, but even more with pipeline frameworks like GStreamer > and PipeWire, which don't have fundamental awareness (they're prepared > to deal with deep pipelines which return at arbitrary times, but once > anything is returned, it is available for immediate use). Maybe > open-coding CPU-side waits in these users is the best way longer term, > but for easier interop if nothing else, being able to pull shared > fences in seems like it could be a win ('the compositor is still > texturing from this for now, so feel free to read back from ref > frames, but don't scribble over it until it's finished'). > > I'm slightly bemused that there's so much energy spent on designing > around winsys being dumb and implicit. I'd like to address this one as it's a comment you've made several times. Once you've fixed raw X11 (not just XWayland) and a new release has been made (hah!) and is shipping in distros with said support, then we can talk. Sorry if that comes off as overly snarky but that's reality that we (driver devs) are living with. To make things even worse, when we're in Vulkan land (as opposed to GL), we can't tell up-front whether or not our window-system supports foo fences and adjust accordingly. We have to start up, begin rendering, and only later figure out "oops, this one goes to X11". We really can't say things like "when running on modern Wayland, do things the new way" because Vulkan doesn't have a concept of "running on" a window system. FWIW, I do have a Mesa MR somewhere which adds explicit sync for Vulkan+Wayland when the wayland protocols are there. I don't remember why it didn't land. Maybe lack of acquire fence support? I think the protocol issues have been fixed, so we should up-rev the requirements as needed and land it. > We did take a long time to roll > out sync_file support, but that was only because we didn't quite > understand all the nuances of why implicit sync is so difficult for > GPU drivers on modern architectures and how it was actually a win > compared to doing nothing; maybe we should have some kind of > conference where we all get together and talk to each other ... ? > Anyway, by the time we got to the cusp of rolling out bi-directional > sync_file support, suddenly we had drm_syncobj. By the time that had > semi-settled down and started to be rolled out, then we suddenly had > userspace fences on the horizon. And what do we do with those? You're not wrong.... And that's the second reason for the gymnastics above. Ever since Vulkan launched, we've been fumbling around trying to figure out how to best do synchronization. I'm reasonably certain that userspace memory fences are the future but I'm much less certain about the path to get there. It's been a process of patient experimentation so far and I think we're getting closer. Syncobj, timeline syncobj, etc. have all been steps along that path. I've been hesitant to ask the window-system people to draft piles of extensions until we're sure we've found "the one". It's bad enough iterating in kernel-space and userspace without bringing Wayland protocol into each iteration step. For that reason, one of my goals along this process has been to try and decouple things as much as we can. 
So, in summary, none of this is because I think that window systems are dumb and implicit or for any lack of respect for the people that work on them. I assume that X11 will always be dumb and implicit. (I'd love to be proven wrong!) For Wayland (and XWayland, probably), I assume the people are very smart and active and will implement whatever APIs we (the driver people) need as long as they're reasonable. I just don't know what to ask for yet. > We've - certainly Weston, and I'm pretty confident I speak for Simon > on the wlroots side and Jonas for Mutter - landed on accepting that > we're going to have to deal with timeline semaphores and > wait-before-signal; we have a credible plan (we think) for protocol to > deal with at least syncobj, which should conceptually extend to > userspace fences. The biggest blocker there is the syncobj uAPI. > Compositors aren't Vulkan clients - we don't have one thread per group > of work, because the inter-client synchronisation becomes nightmarish > overhead for little gain. I don't think this will ever change, because > the balance of work is totally different between client and > compositor. > > Anyway, the problem with syncobj is that the ioctl to wait for a > sync_file to materialise for a given timeline point only allows us to > block with a timeout; this is a non-starter, because we need something > which fits into epoll. The most optimal case is returning a new > eventfd or similar which signals when a given timeline point becomes > available or signaled, but in extremis a syncobj FD becoming readable > when any activity which would change the result of any zero-timeout > wait on that syncobj is more or less workable. Right. You could call this an oversight. Really, though, it was because the use-cases we were interested in were ones where a wait with a timeout was perfectly acceptable and where the extra overhead of setting an epoll with ioctls wasn't. It shouldn't be too hard to wire up if that helps (cross your fingers). But.... If we go the direction of userspace memory fences (which I think is likely), there is no such thing as "wait for the fence to materialize". The work is either done or it isn't. There is no enforceable concept of "about to be done". The word "enforceable" is important there. We could add such a concept but it'd be dependent on the userspace client (not driver) to let us know when it's just about ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if they break the contract. Maybe possible but there's some design work required there. The other option is to make the compositors handle this new way of thinking more thoroughly and eat the latency hit. > What we do want to do though, regardless of the primitive or its > semantics, is to only have to go through this once, not rework it all > in six months, and have to support a bunch of diverging codepaths. So > what is the optimal synchronisation primitive we should be aiming for > on the winsys side? Is sync_file a good enough lowest common > denominator, or should we do timeline-syncobj for fancy clients, in > tandem with sync_file bolted on the side for media and ancient GPUs? > Or are userspace fences going to give us some new primitive? And if > so, can we please talk about those semantics now, so we don't end up > with three synchronisation mechanisms which are all sort of but not > really alike? As I said above, I think we're getting close but I don't think we're there yet. 
> As far as I can tell, the three relevant vendors with (more or less) > upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit > more up in the air as the hardware and specs aren't currently > available to the upstream development teams, but hopefully we can > bring them into this conversation. I think it's looking like we're > optimising for all of the following, without leaving anyone out in the > cold: > > 1. Modern userspace-fencing hardware running on a > userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the > big three Wayland compositors which handle enough synchronisation > logic internally that the choice of underlying > composition/presentation API is immaterial (for which any new API is > important) > 2. Modern userspace-fencing hardware running on Xwayland, for which > we'll inevitably have to type out new DRI3 synchronsiation, but > derived purely from the Wayland protocols so it can be passed through > quite directly, and with any mixed X11<->user buffer client > interaction (e.g. subwindows) being penalised by a long blocking wait > in the X server > 3. sync_file-oriented hardware, for which we need to do ~nothing > 4. Modern userspace-fencing hardware and APIs which need to interop > with media units, for all four combinations of client/server > source/sink; for some APIs like Vulkan Video synchronisation is not a > problem, for others like VA-API/V4L2/GStreamer we're probably need > going to live with the implicit semantics for the foreseeable future, > which means we should do the right thing for them up front, because > why would you even be playing games if you're not streaming them to > Twitch at the same time? (Personally I'm much happier that no-one > other than my friends who already know I'm terrible can see me playing > CoD, but y'know, kids these days ...) > > The fifth case I've left out is clients who want to smash Vulkan > content directly into the display. For games and Kodi-like UI I'm just > going to ignore this, because (maybe I'm biased, but) using KMS > directly is difficult enough that you shouldn't do it and just use a > winsys because we've spent a very long time dealing with these > problems for you. The remaining usecase there is XR, but their uses > are very different, and OpenXR already deals with a _lot_ of this for > you, with the runtimes having sufficiently sophisticated > synchronisation internally that whatever we come up with won't be > onerous for them to implement. Either way, integration with KMS/GBM > has the same problem as Wayland, in that unless you fully encapsulate > it in something like VK_KHR_display, you don't get to throw a thread > under the bus to do delayed submit, because you need to actually > return something to the caller. You're missing a use-case: Modern userspace-fencing hardware running on bare X11. I don't know that we should optimize for this case, so to speak, but it has to work non-terribly. Also, as I alluded to above, I really don't want to be maintaining two drivers forever: One for X11 and ancient Wayland and one for modern Wayland. We need to be able to modernize the driver and still support the old window-systems. Unless you can promise me that X11 and KDE/Wayland will either go away or else get modernized, I need a fall-back plan. And even if you make me said promise, I'm not going to believe you. :-P 8-9 years ago, I was one of the people making those promises. Now I'm writing X11 winsys code for drivers. And... now I'm re-thinking some of my life choices.... 
> Ultimately, I think I'm leaning towards agreeing with Christian that I > would like to see a good holistic model for how this works in a > variety of usecases before we progress new uAPI, but also to agreeing > with you and Dan in that how amdgpu currently implements things is > currently objectively very wrong (despite the motivations for having > implemented it that way). > > If there are any usecases I've missed then I'm all ears, but else I > think we should take this forward by working with > Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic > skeletons for end-to-end synchronisation through those usecases. It's > clear that this never happened for syncobj, because it's just not > usable as a primitive in any compositor which exists today: let's make > sure we don't get into the same trap again. If we can do this over > email/GitLab then that's great, but if not, maybe we need to do a kind > of mini-XDC with some kind of virtual whiteboard we can all scribble > over. I think what we're looking at is roughly three steps, the first two of which are mostly there on the Wayland side: 1. Implicit sync. We know what this one is. glFlush/vkQueueSubmit on the one side, start texturing on the other, and it all works. 2. Immutable SW fences, i.e. sync_file. This is where we have a fence object that gets returned from the flush/submit and can be passed into the texture op to provide a dependency. Syncobj may be useful here but only as a container. If a sync_file is a dma_fence*, a syncobj should be thought of as a dma_fence**. This may be useful if we want to retrofit sync_file into X11 where the current DRI3 explicit sync stuff is based on creating an object and then re-using it for every present. 3. Userspace memory fences. Note that timeline syncobj is NOT in that list. IMO, all the "wait for submit" stuff is an implementation detail we needed in order to get the timeline semantics on top of immutable SW fences. Under the hood it's all dma_fence; this just gives us a shareable container so we can implement VK_KHR_timeline_semaphore with sharing. I really don't want to make Wayland protocol around it if memory fences are the final solution. > (As a coda to this, I'm pretty sure I understand the subtleties of the > memory fences for relocation/shootdown, but it would be _really_ good > to have a coherent description everyone agrees on written down > somewhere, so people can understand the issues without having to read > 250 emails with people at loggerheads.) Let me give it a try. I define a userspace memory fence as follows: - The fence object is fundamentally a bit of mappable gpu-accessible memory which stores a 64-bit counter value which is used as a timeline. (On Windows, they require it to live in system memory.) - For sharable memory fences, each one probably has to go in its own page. :-( - The value is initialized to 0. - To signal the fence, someone (GPU or CPU) writes a new 64-bit value. - Waits on the fence are waits for it to be >= some value. Depending on circumstances, the wait may be a 32-bit comparison and may be >= with wrap-around. For the purposes of window-system protocol, I think we can assume 64-bit >= with no wrap-around. There are a few very important points worth noting about them, however: - No one except the userspace client (not driver!) 
has any idea when/if a value will be signaled - The value may never be signaled - Someone may submit a wait before someone submits a signal; we have to deal with that - There is no concept of "about to be signaled" - For the kernel/firmware GPU scheduler handling waits, this means it just keeps trying to run work and hopes the wait eventually unblocks. It also means we need to totally decouple kernel synchronization for memory management purposes from synchronization for work scheduling. - For a compositor, I'm still not 100% sure what this means. TBD, I think. There are some ways to work around some of these issues. Windows has a few tricks which we might be able to steal if we want. Why would anyone want such a horrid thing? Application developers absolutely love them. They can write massively multi-threaded apps with piles of work queues that require very little cross-thread synchronization and the GPU scheduler sorts it all out for them in the end. If you're a 3D game engine developer, this timeline model is very powerful. If you're a driver or window-system developer, you really have to just embrace the lack of knowledge and trust apps. Oof... That got long. I hope it was informative. --Jason > Cheers, > Daniel > > > > > > > > > > When we got to > > insert the RW fence, the actual fence we set as the new exclusive fence > > is a combination of the sync_file provided by the user and all the other > > fences on the dma_resv. This ensures that the newly added exclusive > > fence will never signal before the old one would have and ensures that > > we don't break any dma_resv contracts. We require userspace to specify > > RW in the flags for symmetry with the export ioctl and in case we ever > > want to support read fences in the future. > > > > There is one downside here that's worth documenting: If two clients > > writing to the same dma-buf using this API race with each other, their > > actions on the dma-buf may happen in parallel or in an undefined order. > > Both with and without this API, the pattern is the same: Collect all > > the fences on dma-buf, submit work which depends on said fences, and > > then set a new exclusive (write) fence on the dma-buf which depends on > > said work. The difference is that, when it's all handled by the GPU > > driver's submit ioctl, the three operations happen atomically under the > > dma_resv lock. If two userspace submits race, one will happen before > > the other. You aren't guaranteed which but you are guaranteed that > > they're strictly ordered. If userspace manages the fences itself, then > > these three operations happen separately and the two render operations > > may happen genuinely in parallel or get interleaved. However, this is a > > case of userspace racing with itself. As long as we ensure userspace > > can't back the kernel into a corner, it should be fine. > > > > v2 (Jason Ekstrand): > > - Use a wrapper dma_fence_array of all fences including the new one > > when importing an exclusive fence. > > > > v3 (Jason Ekstrand): > > - Lock around setting shared fences as well as exclusive > > - Mark SIGNAL_SYNC_FILE as a read-write ioctl. 
> > - Initialize ret to 0 in dma_buf_wait_sync_file > > > > v4 (Jason Ekstrand): > > - Use the new dma_resv_get_singleton helper > > > > v5 (Jason Ekstrand): > > - Rename the IOCTLs to import/export rather than wait/signal > > - Drop the WRITE flag and always get/set the exclusive fence > > > > v5 (Jason Ekstrand): > > - Split import and export into separate patches > > - New commit message > > > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> > > --- > > drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++ > > include/uapi/linux/dma-buf.h | 1 + > > 2 files changed, 33 insertions(+) > > > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > > index 7679562b57bfc..c9d6b9195c00c 100644 > > --- a/drivers/dma-buf/dma-buf.c > > +++ b/drivers/dma-buf/dma-buf.c > > @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf, > > put_unused_fd(fd); > > return ret; > > } > > + > > +static long dma_buf_import_sync_file(struct dma_buf *dmabuf, > > + const void __user *user_data) > > +{ > > + struct dma_buf_sync_file arg; > > + struct dma_fence *fence, *singleton = NULL; > > + int ret = 0; > > + > > + if (copy_from_user(&arg, user_data, sizeof(arg))) > > + return -EFAULT; > > + > > + if (arg.flags != DMA_BUF_SYNC_RW) > > + return -EINVAL; > > + > > + fence = sync_file_get_fence(arg.fd); > > + if (!fence) > > + return -EINVAL; > > + > > + dma_resv_lock(dmabuf->resv, NULL); > > + > > + ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton); > > + if (!ret && singleton) > > + dma_resv_add_excl_fence(dmabuf->resv, singleton); > > + > > + dma_resv_unlock(dmabuf->resv); > > + > > + dma_fence_put(fence); > > + > > + return ret; > > +} > > #endif > > > > static long dma_buf_ioctl(struct file *file, > > @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file, > > #if IS_ENABLED(CONFIG_SYNC_FILE) > > case DMA_BUF_IOCTL_EXPORT_SYNC_FILE: > > return dma_buf_export_sync_file(dmabuf, (void __user *)arg); > > + case DMA_BUF_IOCTL_IMPORT_SYNC_FILE: > > + return dma_buf_import_sync_file(dmabuf, (const void __user *)arg); > > #endif > > > > default: > > diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h > > index f902cadcbdb56..75fdde4800267 100644 > > --- a/include/uapi/linux/dma-buf.h > > +++ b/include/uapi/linux/dma-buf.h > > @@ -70,5 +70,6 @@ struct dma_buf_sync_file { > > #define DMA_BUF_SET_NAME_A _IOW(DMA_BUF_BASE, 1, u32) > > #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, u64) > > #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file) > > +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync) > > > > #endif > > -- > > 2.31.1 > >
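[A toy sketch of the userspace memory-fence semantics Jason describes above. The struct and helper names are invented purely for illustration; how the page is allocated, shared and made GPU-visible, and how a real waiter sleeps rather than spins, are deliberately left out.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct umem_fence {
    _Atomic uint64_t value;   /* starts at 0; one page per fence if shared */
};

/* Signal: publish that all work up to 'point' has completed. */
static void umem_fence_signal(struct umem_fence *f, uint64_t point)
{
    atomic_store_explicit(&f->value, point, memory_order_release);
}

/* Zero-timeout check: has 'point' been reached?  There is no way to ask
 * whether it is "about to be" reached; the work is done or it isn't. */
static bool umem_fence_test(const struct umem_fence *f, uint64_t point)
{
    return atomic_load_explicit(&f->value, memory_order_acquire) >= point;
}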
On Sun, May 23, 2021 at 11:34 PM Daniel Stone <daniel@fooishbar.org> wrote: > > Hi Christian, > > On Sun, 23 May 2021 at 18:16, Christian König <christian.koenig@amd.com> wrote: > > Am 22.05.21 um 22:05 schrieb Daniel Stone: > > > Anyway, the problem with syncobj is that the ioctl to wait for a > > > sync_file to materialise for a given timeline point only allows us to > > > block with a timeout; this is a non-starter, because we need something > > > which fits into epoll. The most optimal case is returning a new > > > eventfd or similar which signals when a given timeline point becomes > > > available or signaled, but in extremis a syncobj FD becoming readable > > > when any activity which would change the result of any zero-timeout > > > wait on that syncobj is more or less workable. > > > > I think the tricky part is to epoll for a certain value. > > > > Not sure how eventfd is supposed to work, but IIRC we don't have the > > functionality to poll for a certain value/offset etc to become available. > > > > We could of course create a separate fd for each requested value to poll > > for thought, but that sounds like a bit much overhead to me. > > Yeah, I understand the point; something like an eventfd is exactly > what you think, that we would have to materialise a new FD for it. On > the other hand, as soon as the sync point becomes available, the > winsys will want to immediately extract a sync_file from it, so I > don't think FD amplification is a big issue. If it looks like being a > problem in practice, we could look at providing a FD which starts off > inert, and only later becomes a sync_file after it polls readable, but > that sounds like something we really want to avoid if we can. We can't change personalities of filp in the kernel because there's no locking for that. But we can add a future fence state to sync_file for this, and then make sure sync_file_get_fence() fails on them as long as the fence hasn't materialized. That's required because current uapi is that sync_file is never a future fence. Also opens the door for that, which might be useful for all our userspace fences ideas. Then we could support poll on that sync_file to wait for the fence to materialize using POLLPRI. man poll suggests this isn't the dumbest thing to go with, and grep in the kernel shows plenty of users (it's EPOLLPRI internally). With this we'd just need a new flag for allowing this mode in sync_handle_to_fd_ioctl, and a fairly small amount of typing to glue it together. -Daniel
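[A rough compositor-side sketch of the POLLPRI idea above. None of this is existing uAPI: the future-fence sync_file fd and the EPOLLPRI-on-materialise behaviour are exactly what is being proposed here, not something the kernel provides today.]

#include <sys/epoll.h>

static int watch_future_fence(int epoll_fd, int future_sync_file_fd, void *userdata)
{
    struct epoll_event ev = {
        .events = EPOLLPRI,   /* proposed: signals once the fence has materialised */
        .data.ptr = userdata,
    };

    /* Once this fires, the compositor would extract the now-real fence
     * (e.g. to import it into a dma-buf resv) and carry on as usual. */
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, future_sync_file_fd, &ev);
}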
On Mon, May 24, 2021 at 7:11 PM Jason Ekstrand <jason@jlekstrand.net> wrote: > > On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote: > > > > Hi, > > > > On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote: > > > In the Vulkan world, this is useful for dealing with the out-fence from > > > vkQueuePresent. Current Linux window-systems (X11, Wayland, etc.) all > > > rely on dma-buf implicit sync. Since Vulkan is an explicit sync API, we > > > get a set of fences (VkSemaphores) in vkQueuePresent and have to stash > > > those as an exclusive (write) fence on the dma-buf. We handle it in > > > Mesa today with the above mentioned dummy submit trick. This ioctl > > > would allow us to set it directly without the dummy submit. > > > > > > This may also open up possibilities for GPU drivers to move away from > > > implicit sync for their kernel driver uAPI and instead provide sync > > > files and rely on dma-buf import/export for communicating with other > > > implicit sync clients. > > > > > > We make the explicit choice here to only allow setting RW fences which > > > translates to an exclusive fence on the dma_resv. There's no use for > > > read-only fences for communicating with other implicit sync userspace > > > and any such attempts are likely to be racy at best. > > > > I think I almost follow, but I'm not quite seeing the race you allude > > to. Let me talk through my understanding so it's hopefully more clear > > for others as a summary. Please correct me on anything I've > > misunderstood or just missed completely. (I thought when I wrote this > > intro that the email might be relatively snappy, but it's really long > > and takes in a lot of breadth. Sorry.) > > > > So as far as I'm reading this patchset and the Mesa MR, this _only_ > > concerns the out-fence (i.e. compositor -> client AcquireNextImage > > semaphore/fence) so far. The in-fence (client->compositor QueuePresent > > wait-semaphores/fences) is handled by the driver ensuring that an > > exclusive resv is placed on the union of the signal semaphores passed > > to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO > > flags, or ... either way, no problem as it should always be exclusive, > > and it seems pretty uncontroversial that we should pull the wait > > semaphore into an exclusive fence so that no downstream consumer will > > begin using it until the client ops have fully retired. > > > > For AcquireNextImage, your patchset exports all the fences (shared and > > exclusive both) on the dma_resv out into the semaphore/fence such that > > the client can't progress (CPU-side for VkFence, GPU-side for > > VkSemaphore) until all currently queued operations have completely > > retired. So, as long as the server ensures that all its kernel-side > > work is flushed before its IPC to unblock ANI (wl_buffer.release or > > DRI3 PresentIdle, both indicating that the client is free to reuse the > > buffer, subject to synchronising against implicit fences on the resv), > > all good: we snapshot the current resv state, wrap the relevant > > driver-side Vulkan primitive around it, and go back to explicit > > synchronisation. The client can't race generating new work against > > this, because it can't queue any work until ANI has returned and > > queued a signaling op on the semaphore/fence. > > > > So far, so good. 
I really like both your series up until this > > narrative point; as I was saying in the userspace-fence thread, the > > WSI<->client thread is certainly pulling a very big lever with a > > heavyweight transition between the two different worlds, and I like > > that the new export ioctl lets us be very clear about what exactly is > > happening under the hood. Good stuff. > > Glad to hear. If you turned that into RBs on the first three patches > in this series and all but the last patch on the Mesa MR, it'd make me > even happier. :-D > > At this point, I think everyone is pretty happy with the first three > patches and the export ioctl. In the Vulkan WSI code, it solves a > significant over-synchronization issue for ANV. Also, the uAPI > shouldn't be controversial at all because it's identical to poll() > except that it gives you a FD you can poll on later to get the result > of the poll as it would be now. I think if we get some Mesa reviews, > we should be able to land those. It's import that's trickier. > > > So, what gives with the import ioctl? Taking a guess at where you're > > going, the import ioctl is going to be used in QueuePresent just as > > the export ioctl is in ANI: instead of having CS flags to write into > > the resv exclusive slot or per-BO flags to always dump in exclusive, > > there's a forthcoming patch somewhere which lets drivers skip this and > > instead have common QueuePresent code dump the wait semaphore into the > > resv, so servers on the other side of an implicit-only protocol will > > synchronise their access against the fence imported as part of > > client-side QueuePresent? > > Correct. And the patch isn't forthcoming. It already exists as the > top patch in the Mesa MR (!4037). > > > That makes sense to me and is nicely symmetrical, plus it gets GPU > > drivers further out of the business of doing magic winsys/display > > synchronisation, which is good. But why enforce that imported fences > > have to go into the exclusive slot? It makes sense from the point of > > view of clients doing QueueSubmit, but I think it might preclude other > > uses for no particularly good reason. > > Mostly, I was trying to limit the scope a bit. Import is tricky and, > to be honest, I'm still not 100% sure that it's safe. I probably > should have left RFC on this patch. > > As long as we keep everything inside the kernel, there's a requirement > that any work which triggers any fence on a dma_resv waits on the > exclusive fence, if any. Work which sets the exclusive fence has to > wait on all fences. With the import ioctl, we can guarantee that the > fences fire in the right order by making an array fence containing the > new fence and all other fences and using that as the exclusive fence. > What we can't do, however, is ensure that the work which triggers the > fence actually blocks on ANY of the fences on the dma_resv. > > Hrm... I think I've now convinced myself that importing shared fences > is no more dangerous than importing an exclusive fence because they > share the same set of problems. Unfortunately, I've unconvinced > myself of the safety of either. I've got to think some more.... > > The most convincing argument I can make to myself is that this ioctl > is like having a one-shot little queue that contains tiny little work > jobs which wait on whatever sync_file you provide, do nothing, and > then signal. That should be safe, right? Yup. Same for shared. If that breaks a driver, then things are pretty badly broken already. 
What I'm not so sure about is completely unsynced behaviour between exclusive and shared slot. Internally I think drivers just wait for both (but I haven't done a full audit), see drm_gem_fence_array_add_implicit(). And the only thing that's fully requiring that you never ignore the exclusive fence is dynamic dma-buf sharing (atm only amdgpu supports that), at least until we've split the exclusive slot in dma_resv into the userspace implicit sync and kernel-internal memory management part. So within a driver, as long as the buffer isn't shared as dynamic dma-buf (non-dynamic is fine, those are pinned, so never any concern about the buffer moving around), then I think drivers can attach fences however they deem like, and the strict ordering just oversynchronizes. But for something cross driver like this it probably makes sense to (for now) uphold the rule that if there's both shared and exclusive slots present, then the shared fences are each guaranteed to get signalled after the exclusive one. This is also overkill I think, and something we might want to lift later on - I'm mildly afraid it will cause pointless oversync since we end up mixing exclusive/shared slots. Especially in cases where both sides of a pipeline do a lot of readbacks or rendered buffers (e.g. codec reference frames on one side and some post-processing on the other). Anyway I think it's fixable later on. > > Certainly on X11 there's no real use for the shared slot - you get > > into difficulties with mixed client/server rendering - but at least > > Wayland and GBM are always non-destructive, so conceptually have a > > good use for being able to import fences into the shared resv. This > > goes for media and KMS access as well, if you think beyond the usecase > > of an explicit Vulkan client sending through to a dumb implicit server > > which just wants to texture. > > > > Media clients are definitely a relevant usecase, both directly with > > V4L2/VA-API, neither of which have support for explicit > > synchronisation yet and (at least for V4L2) are unlikely to do so in > > the near future, but even more with pipeline frameworks like GStreamer > > and PipeWire, which don't have fundamental awareness (they're prepared > > to deal with deep pipelines which return at arbitrary times, but once > > anything is returned, it is available for immediate use). Maybe > > open-coding CPU-side waits in these users is the best way longer term, > > but for easier interop if nothing else, being able to pull shared > > fences in seems like it could be a win ('the compositor is still > > texturing from this for now, so feel free to read back from ref > > frames, but don't scribble over it until it's finished'). > > > > I'm slightly bemused that there's so much energy spent on designing > > around winsys being dumb and implicit. > > I'd like to address this one as it's a comment you've made several > times. Once you've fixed raw X11 (not just XWayland) and a new > release has been made (hah!) and is shipping in distros with said > support, then we can talk. Sorry if that comes off as overly snarky > but that's reality that we (driver devs) are living with. To make > things even worse, when we're in Vulkan land (as opposed to GL), we > can't tell up-front whether or not our window-system supports foo > fences and adjust accordingly. We have to start up, begin rendering, > and only later figure out "oops, this one goes to X11". 
We really > can't say things like "when running on modern Wayland, do things the > new way" because Vulkan doesn't have a concept of "running on" a > window system. > > FWIW, I do have a Mesa MR somewhere which adds explicit sync for > Vulkan+Wayland when the wayland protocols are there. I don't remember > why it didn't land. Maybe lack of acquire fence support? I think the > protocol issues have been fixed, so we should up-rev the requirements > as needed and land it. > > > We did take a long time to roll > > out sync_file support, but that was only because we didn't quite > > understand all the nuances of why implicit sync is so difficult for > > GPU drivers on modern architectures and how it was actually a win > > compared to doing nothing; maybe we should have some kind of > > conference where we all get together and talk to each other ... ? > > Anyway, by the time we got to the cusp of rolling out bi-directional > > sync_file support, suddenly we had drm_syncobj. By the time that had > > semi-settled down and started to be rolled out, then we suddenly had > > userspace fences on the horizon. And what do we do with those? > > You're not wrong.... And that's the second reason for the gymnastics > above. Ever since Vulkan launched, we've been fumbling around trying > to figure out how to best do synchronization. 'm reasonably certain > that userspace memory fences are the future but I'm much less certain > about the path to get there. It's been a process of patient > experimentation so far and I think we're getting closer. Syncobj, > timeline syncobj, etc. have all been steps along that path. I've been > hesitant to ask the window-system people to draft piles of extensions > until we're sure we've found "the one". It's bad enough iterating in > kernel-space and userspace without bringing Wayland protocol into each > iteration step. For that reason, one of my goals along this process > has been to try and decouple things as much as we can. > > So, in summary, none of this is because I think that window systems > are dumb and implicit or for any lack of respect for the people that > work on them. I assume that X11 will always be dumb and implicit. > (I'd love to be proven wrong!) For Wayland (and XWayland, probably), > I assume the people are very smart and active and will implement > whatever APIs we (the driver people) need as long as they're > reasonable. I just don't know what to ask for yet. > > > We've - certainly Weston, and I'm pretty confident I speak for Simon > > on the wlroots side and Jonas for Mutter - landed on accepting that > > we're going to have to deal with timeline semaphores and > > wait-before-signal; we have a credible plan (we think) for protocol to > > deal with at least syncobj, which should conceptually extend to > > userspace fences. The biggest blocker there is the syncobj uAPI. > > Compositors aren't Vulkan clients - we don't have one thread per group > > of work, because the inter-client synchronisation becomes nightmarish > > overhead for little gain. I don't think this will ever change, because > > the balance of work is totally different between client and > > compositor. > > > > Anyway, the problem with syncobj is that the ioctl to wait for a > > sync_file to materialise for a given timeline point only allows us to > > block with a timeout; this is a non-starter, because we need something > > which fits into epoll. 
The most optimal case is returning a new > > eventfd or similar which signals when a given timeline point becomes > > available or signaled, but in extremis a syncobj FD becoming readable > > when any activity which would change the result of any zero-timeout > > wait on that syncobj is more or less workable. > > Right. You could call this an oversight. Really, though, it was > because the use-cases we were interested in were ones where a wait > with a timeout was perfectly acceptable and where the extra overhead > of setting an epoll with ioctls wasn't. It shouldn't be too hard to > wire up if that helps (cross your fingers). But.... > > If we go the direction of userspace memory fences (which I think is > likely), there is no such thing as "wait for the fence to > materialize". The work is either done or it isn't. There is no > enforceable concept of "about to be done". The word "enforceable" is > important there. We could add such a concept but it'd be dependent on > the userspace client (not driver) to let us know when it's just about > ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if > they break the contract. Maybe possible but there's some design work > required there. The other option is to make the compositors handle > this new way of thinking more thoroughly and eat the latency hit. > > > What we do want to do though, regardless of the primitive or its > > semantics, is to only have to go through this once, not rework it all > > in six months, and have to support a bunch of diverging codepaths. So > > what is the optimal synchronisation primitive we should be aiming for > > on the winsys side? Is sync_file a good enough lowest common > > denominator, or should we do timeline-syncobj for fancy clients, in > > tandem with sync_file bolted on the side for media and ancient GPUs? > > Or are userspace fences going to give us some new primitive? And if > > so, can we please talk about those semantics now, so we don't end up > > with three synchronisation mechanisms which are all sort of but not > > really alike? > > As I said above, I think we're getting close but I don't think we're there yet. > > > As far as I can tell, the three relevant vendors with (more or less) > > upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit > > more up in the air as the hardware and specs aren't currently > > available to the upstream development teams, but hopefully we can > > bring them into this conversation. I think it's looking like we're > > optimising for all of the following, without leaving anyone out in the > > cold: > > > > 1. Modern userspace-fencing hardware running on a > > userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the > > big three Wayland compositors which handle enough synchronisation > > logic internally that the choice of underlying > > composition/presentation API is immaterial (for which any new API is > > important) > > 2. Modern userspace-fencing hardware running on Xwayland, for which > > we'll inevitably have to type out new DRI3 synchronsiation, but > > derived purely from the Wayland protocols so it can be passed through > > quite directly, and with any mixed X11<->user buffer client > > interaction (e.g. subwindows) being penalised by a long blocking wait > > in the X server > > 3. sync_file-oriented hardware, for which we need to do ~nothing > > 4. 
Modern userspace-fencing hardware and APIs which need to interop > > with media units, for all four combinations of client/server > > source/sink; for some APIs like Vulkan Video synchronisation is not a > > problem, for others like VA-API/V4L2/GStreamer we're probably need > > going to live with the implicit semantics for the foreseeable future, > > which means we should do the right thing for them up front, because > > why would you even be playing games if you're not streaming them to > > Twitch at the same time? (Personally I'm much happier that no-one > > other than my friends who already know I'm terrible can see me playing > > CoD, but y'know, kids these days ...) > > > > The fifth case I've left out is clients who want to smash Vulkan > > content directly into the display. For games and Kodi-like UI I'm just > > going to ignore this, because (maybe I'm biased, but) using KMS > > directly is difficult enough that you shouldn't do it and just use a > > winsys because we've spent a very long time dealing with these > > problems for you. The remaining usecase there is XR, but their uses > > are very different, and OpenXR already deals with a _lot_ of this for > > you, with the runtimes having sufficiently sophisticated > > synchronisation internally that whatever we come up with won't be > > onerous for them to implement. Either way, integration with KMS/GBM > > has the same problem as Wayland, in that unless you fully encapsulate > > it in something like VK_KHR_display, you don't get to throw a thread > > under the bus to do delayed submit, because you need to actually > > return something to the caller. > > You're missing a use-case: Modern userspace-fencing hardware running > on bare X11. I don't know that we should optimize for this case, so > to speak, but it has to work non-terribly. Also, as I alluded to > above, I really don't want to be maintaining two drivers forever: One > for X11 and ancient Wayland and one for modern Wayland. We need to be > able to modernize the driver and still support the old window-systems. > Unless you can promise me that X11 and KDE/Wayland will either go away > or else get modernized, I need a fall-back plan. And even if you make > me said promise, I'm not going to believe you. :-P 8-9 years ago, I > was one of the people making those promises. Now I'm writing X11 > winsys code for drivers. And... now I'm re-thinking some of my life > choices.... > > > Ultimately, I think I'm leaning towards agreeing with Christian that I > > would like to see a good holistic model for how this works in a > > variety of usecases before we progress new uAPI, but also to agreeing > > with you and Dan in that how amdgpu currently implements things is > > currently objectively very wrong (despite the motivations for having > > implemented it that way). > > > > If there are any usecases I've missed then I'm all ears, but else I > > think we should take this forward by working with > > Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic > > skeletons for end-to-end synchronisation through those usecases. It's > > clear that this never happened for syncobj, because it's just not > > usable as a primitive in any compositor which exists today: let's make > > sure we don't get into the same trap again. If we can do this over > > email/GitLab then that's great, but if not, maybe we need to do a kind > > of mini-XDC with some kind of virtual whiteboard we can all scribble > > over. 
> > I think what we're looking at is roughly three steps, the first two of > which are mostly there on the Wayland side: > > 1. Implicit sync. We know what this one is. glFlush/vkQueueSubmit > on the one side, start texturing on the other, and it all works. > > 2. Immutable SW fences, i.e. sync_file. This is where we have a > fence object that gets returned from the flush/submit and can be > passed into the texture op to provide a dependency. Syncobj may be > useful here but only as a container. If a sync_file is a dma_fence*, > a syncobj should be thought of as a dma_fence**. This may be useful > if we want to retrofit sync_file into X11 where the current DRI3 > explicit sync stuff is based on creating an object and then re-using > it for every present. > > 3. Userspace memory fences. > > Note that timeline syncobj is NOT in that list. IMO, all the "wait > for submit" stuff is an implementation detail we needed in order to > get the timeline semantics on top of immutable SW fences. Under the > hood it's all dma_fence; this just gives us a shareable container so > we can implement VK_KHR_timeline_semaphore with sharing. I really > don't want to make Wayland protocol around it if memory fences are the > final solution. Note that I expect there will be gpus for a very long time which wont do userspace fences. They might do memory fences, but not the full userspace submit story. Establishing the container object aka drm_syncobj for that case I think still makes sense, since it's something they can fully support. Also I do think even full userspace submit hw can fully support dma_fence model (at least I haven't seen hw yet where that shouldn't work), you "just" lose stuff like hw page fault&repair on the gpu. > > (As a coda to this, I'm pretty sure I understand the subtleties of the > > memory fences for relocation/shootdown, but it would be _really_ good > > to have a coherent description everyone agrees on written down > > somewhere, so people can understand the issues without having to read > > 250 emails with people at loggerheads.) > > Let me give it a try. I define a userspace memory fence as follows: > - The fence object is fundamentally a bit of mappable gpu-accessible > memory which stores a 64-bit counter value which is used as a > timeline. (On Windows, they require it to live in system memory.) > - For sharable memory fences, each one probably has to go in its own page. :-( > - The value is initialized to 0. > - To signal the fence, someone (GPU or CPU) writes a new 64-bit value. > - Waits on the fence are waits for it to be >= some value. > > Depending on circumstances, the wait may be a 32-bit comparison and > may be >= with wrap-around. For the purposes of window-system > protocol, I think we can assume 64-bit >= with no wrap-around. > > There are a few very important points worth noting about them, however: > - No one except the userspace client (not driver!) has any idea > when/if a value will be signaled > - The value may never be signaled > - Someone may submit a wait before someone submits a signal; we have > to deal with that > - There is no concept of "about to be signaled" > - For the kernel/firmware GPU scheduler handling waits, this means it > just keeps trying to run work and hopes the wait eventually unblocks. > It also means we need to totally decouple kernel synchronization for > memory management purposes from synchronization for work scheduling. > - For a compositor, I'm still not 100% sure what this means. TBD, I think. 
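A CPU-side illustration of the memory fence semantics just described might look like the sketch below: a 64-bit counter in (conceptually GPU-accessible) memory, signalled by writing a new value, waited on by checking value >= point. This only illustrates the semantics from the definition above; it is not a proposal for the actual primitive, and the names are made up for the example.

/* Userspace memory fence semantics as defined above: 64-bit counter,
 * initialised to 0, signal by writing a new value, wait for >= point. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct umf {
        /* In practice this lives in mappable, GPU-accessible memory
         * (and for sharing, probably in its own page). */
        _Atomic uint64_t value;   /* initialised to 0 */
};

static void umf_signal(struct umf *f, uint64_t point)
{
        /* GPU or CPU writes the new 64-bit value; release ordering so prior
         * work is visible to whoever observes the new value. */
        atomic_store_explicit(&f->value, point, memory_order_release);
}

static bool umf_is_signaled(struct umf *f, uint64_t point)
{
        /* A wait is "value >= point".  Note there is no way to distinguish
         * "not yet submitted" from "submitted but not yet done" - the core
         * problem discussed in this thread. */
        return atomic_load_explicit(&f->value, memory_order_acquire) >= point;
}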
> > There are some ways to work around some of these issues. Windows has > a few tricks which we might be able to steal if we want. > > Why would anyone want such a horrid thing? Application developers > absolutely love them. They can write massively multi-threaded apps > with piles of work queues that require very little cross-thread > synchronization and the GPU scheduler sorts it all out for them in the > end. If you're a 3D game engine developer, this timeline model is > very powerful. If you're a driver or window-system developer, you > really have to just embrace the lack of knowledge and trust apps. > > Oof... That got long. I hope it was informative. Yeah maybe smashing dma_fence based syncobj into the same interface as userspace memory fences isn't the best idea ... -Daniel > > --Jason > > > > Cheers, > > Daniel > > > > > > > > > > > > > > > > > > > When we got to > > > insert the RW fence, the actual fence we set as the new exclusive fence > > > is a combination of the sync_file provided by the user and all the other > > > fences on the dma_resv. This ensures that the newly added exclusive > > > fence will never signal before the old one would have and ensures that > > > we don't break any dma_resv contracts. We require userspace to specify > > > RW in the flags for symmetry with the export ioctl and in case we ever > > > want to support read fences in the future. > > > > > > There is one downside here that's worth documenting: If two clients > > > writing to the same dma-buf using this API race with each other, their > > > actions on the dma-buf may happen in parallel or in an undefined order. > > > Both with and without this API, the pattern is the same: Collect all > > > the fences on dma-buf, submit work which depends on said fences, and > > > then set a new exclusive (write) fence on the dma-buf which depends on > > > said work. The difference is that, when it's all handled by the GPU > > > driver's submit ioctl, the three operations happen atomically under the > > > dma_resv lock. If two userspace submits race, one will happen before > > > the other. You aren't guaranteed which but you are guaranteed that > > > they're strictly ordered. If userspace manages the fences itself, then > > > these three operations happen separately and the two render operations > > > may happen genuinely in parallel or get interleaved. However, this is a > > > case of userspace racing with itself. As long as we ensure userspace > > > can't back the kernel into a corner, it should be fine. > > > > > > v2 (Jason Ekstrand): > > > - Use a wrapper dma_fence_array of all fences including the new one > > > when importing an exclusive fence. > > > > > > v3 (Jason Ekstrand): > > > - Lock around setting shared fences as well as exclusive > > > - Mark SIGNAL_SYNC_FILE as a read-write ioctl. 
> > > - Initialize ret to 0 in dma_buf_wait_sync_file > > > > > > v4 (Jason Ekstrand): > > > - Use the new dma_resv_get_singleton helper > > > > > > v5 (Jason Ekstrand): > > > - Rename the IOCTLs to import/export rather than wait/signal > > > - Drop the WRITE flag and always get/set the exclusive fence > > > > > > v5 (Jason Ekstrand): > > > - Split import and export into separate patches > > > - New commit message > > > > > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> > > > --- > > > drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++ > > > include/uapi/linux/dma-buf.h | 1 + > > > 2 files changed, 33 insertions(+) > > > > > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > > > index 7679562b57bfc..c9d6b9195c00c 100644 > > > --- a/drivers/dma-buf/dma-buf.c > > > +++ b/drivers/dma-buf/dma-buf.c > > > @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf, > > > put_unused_fd(fd); > > > return ret; > > > } > > > + > > > +static long dma_buf_import_sync_file(struct dma_buf *dmabuf, > > > + const void __user *user_data) > > > +{ > > > + struct dma_buf_sync_file arg; > > > + struct dma_fence *fence, *singleton = NULL; > > > + int ret = 0; > > > + > > > + if (copy_from_user(&arg, user_data, sizeof(arg))) > > > + return -EFAULT; > > > + > > > + if (arg.flags != DMA_BUF_SYNC_RW) > > > + return -EINVAL; > > > + > > > + fence = sync_file_get_fence(arg.fd); > > > + if (!fence) > > > + return -EINVAL; > > > + > > > + dma_resv_lock(dmabuf->resv, NULL); > > > + > > > + ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton); > > > + if (!ret && singleton) > > > + dma_resv_add_excl_fence(dmabuf->resv, singleton); > > > + > > > + dma_resv_unlock(dmabuf->resv); > > > + > > > + dma_fence_put(fence); > > > + > > > + return ret; > > > +} > > > #endif > > > > > > static long dma_buf_ioctl(struct file *file, > > > @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file, > > > #if IS_ENABLED(CONFIG_SYNC_FILE) > > > case DMA_BUF_IOCTL_EXPORT_SYNC_FILE: > > > return dma_buf_export_sync_file(dmabuf, (void __user *)arg); > > > + case DMA_BUF_IOCTL_IMPORT_SYNC_FILE: > > > + return dma_buf_import_sync_file(dmabuf, (const void __user *)arg); > > > #endif > > > > > > default: > > > diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h > > > index f902cadcbdb56..75fdde4800267 100644 > > > --- a/include/uapi/linux/dma-buf.h > > > +++ b/include/uapi/linux/dma-buf.h > > > @@ -70,5 +70,6 @@ struct dma_buf_sync_file { > > > #define DMA_BUF_SET_NAME_A _IOW(DMA_BUF_BASE, 1, u32) > > > #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, u64) > > > #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file) > > > +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync) > > > > > > #endif > > > -- > > > 2.31.1 > > >
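To make the import side ("the one-shot little queue") concrete, common WSI code could do roughly the following at QueuePresent time: obtain a sync_file for the client's rendering and attach it to the dma-buf with the import ioctl from this patch. Getting the sync_file from a VkSemaphore with a SYNC_FD payload via vkGetSemaphoreFdKHR is just one possible way and is shown here as an assumption; error handling and the real WSI plumbing are omitted.

/* Sketch of the QueuePresent-side import path using the ioctl proposed in
 * this patch.  Not the actual Mesa code. */
#include <linux/dma-buf.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <vulkan/vulkan.h>

static int attach_present_fence(VkDevice dev, VkSemaphore render_done,
                                int dmabuf_fd)
{
        const VkSemaphoreGetFdInfoKHR get_fd = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                .semaphore = render_done,
                .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        };
        struct dma_buf_sync_file arg = { .flags = DMA_BUF_SYNC_RW };
        int ret;

        /* A real implementation resolves vkGetSemaphoreFdKHR through
         * vkGetDeviceProcAddr; called directly here for brevity. */
        if (vkGetSemaphoreFdKHR(dev, &get_fd, &arg.fd) != VK_SUCCESS)
                return -1;

        /* The kernel folds this together with the existing resv fences, so
         * the new exclusive fence can never signal before the old ones. */
        ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
        close(arg.fd);
        return ret;
}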
On Sunday, May 23rd, 2021 at 11:34 PM, Daniel Stone <daniel@fooishbar.org> wrote: > On the other hand, as soon as the sync point becomes available, the > winsys will want to immediately extract a sync_file from it With today's APIs, yes, but if we ever get drm_syncobj timeline support in KMS/EGL(/Vulkan?) maybe this won't be necessary at some point.
Hey, On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote: > On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote: > > So far, so good. I really like both your series up until this > > narrative point; as I was saying in the userspace-fence thread, the > > WSI<->client thread is certainly pulling a very big lever with a > > heavyweight transition between the two different worlds, and I like > > that the new export ioctl lets us be very clear about what exactly is > > happening under the hood. Good stuff. > > Glad to hear. If you turned that into RBs on the first three patches > in this series and all but the last patch on the Mesa MR, it'd make me > even happier. :-D > > At this point, I think everyone is pretty happy with the first three > patches and the export ioctl. In the Vulkan WSI code, it solves a > significant over-synchronization issue for ANV. Also, the uAPI > shouldn't be controversial at all because it's identical to poll() > except that it gives you a FD you can poll on later to get the result > of the poll as it would be now. I think if we get some Mesa reviews, > we should be able to land those. It's import that's trickier. Agree. Have a few more burning things on my list (not least fd.o infra this week) but will get to it so we can land. (Clarified on IRC that my description above was accurate to the best of our shared understanding, so we're all on the same page here.) > > That makes sense to me and is nicely symmetrical, plus it gets GPU > > drivers further out of the business of doing magic winsys/display > > synchronisation, which is good. But why enforce that imported fences > > have to go into the exclusive slot? It makes sense from the point of > > view of clients doing QueueSubmit, but I think it might preclude other > > uses for no particularly good reason. > > Mostly, I was trying to limit the scope a bit. Import is tricky and, > to be honest, I'm still not 100% sure that it's safe. I probably > should have left RFC on this patch. > > As long as we keep everything inside the kernel, there's a requirement > that any work which triggers any fence on a dma_resv waits on the > exclusive fence, if any. Work which sets the exclusive fence has to > wait on all fences. With the import ioctl, we can guarantee that the > fences fire in the right order by making an array fence containing the > new fence and all other fences and using that as the exclusive fence. > What we can't do, however, is ensure that the work which triggers the > fence actually blocks on ANY of the fences on the dma_resv. > > Hrm... I think I've now convinced myself that importing shared fences > is no more dangerous than importing an exclusive fence because they > share the same set of problems. Unfortunately, I've unconvinced > myself of the safety of either. I've got to think some more.... > > The most convincing argument I can make to myself is that this ioctl > is like having a one-shot little queue that contains tiny little work > jobs which wait on whatever sync_file you provide, do nothing, and > then signal. That should be safe, right? Yeah, I don't think there's any difference between shared and exclusive wrt safety. The difference lies in, well, exclusive putting a hard serialisation barrier between everything which comes before and everything that comes after, and shared being more relaxed to allow for reads to retire in parallel. As said below, I think there's a good argument for the latter once you get out of the very straightforward uses. 
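The exclusive/shared distinction being discussed maps onto the dma_resv API roughly as below. This is a minimal kernel-side sketch using the interface names from around the time of this series; a real driver does this from its submit path with its own job fence, and locking/error handling is reduced to the bare minimum.

/* Readers add shared fences so they can retire in parallel; a writer sets
 * the exclusive fence, which serialises against everything before it. */
#include <linux/dma-resv.h>

static int attach_read_fence(struct dma_resv *resv, struct dma_fence *fence)
{
        int ret;

        dma_resv_lock(resv, NULL);
        ret = dma_resv_reserve_shared(resv, 1);
        if (!ret)
                dma_resv_add_shared_fence(resv, fence); /* reads may overlap */
        dma_resv_unlock(resv);
        return ret;
}

static void attach_write_fence(struct dma_resv *resv, struct dma_fence *fence)
{
        dma_resv_lock(resv, NULL);
        /* Hard barrier: everything queued later must wait for this. */
        dma_resv_add_excl_fence(resv, fence);
        dma_resv_unlock(resv);
}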
One of the arguments for these ioctls is to eliminate oversync, but then the import ioctl mandates oversync in the case where the consumer only does non-destructive reads - which is the case for the vast majority of users! > > Certainly on X11 there's no real use for the shared slot - you get > > into difficulties with mixed client/server rendering - but at least > > Wayland and GBM are always non-destructive, so conceptually have a > > good use for being able to import fences into the shared resv. This > > goes for media and KMS access as well, if you think beyond the usecase > > of an explicit Vulkan client sending through to a dumb implicit server > > which just wants to texture. > > > > Media clients are definitely a relevant usecase, both directly with > > V4L2/VA-API, neither of which have support for explicit > > synchronisation yet and (at least for V4L2) are unlikely to do so in > > the near future, but even more with pipeline frameworks like GStreamer > > and PipeWire, which don't have fundamental awareness (they're prepared > > to deal with deep pipelines which return at arbitrary times, but once > > anything is returned, it is available for immediate use). Maybe > > open-coding CPU-side waits in these users is the best way longer term, > > but for easier interop if nothing else, being able to pull shared > > fences in seems like it could be a win ('the compositor is still > > texturing from this for now, so feel free to read back from ref > > frames, but don't scribble over it until it's finished'). > > > > I'm slightly bemused that there's so much energy spent on designing > > around winsys being dumb and implicit. > > I'd like to address this one as it's a comment you've made several > times. Once you've fixed raw X11 (not just XWayland) and a new > release has been made (hah!) and is shipping in distros with said > support, then we can talk. Sorry if that comes off as overly snarky > but that's reality that we (driver devs) are living with. To make > things even worse, when we're in Vulkan land (as opposed to GL), we > can't tell up-front whether or not our window-system supports foo > fences and adjust accordingly. We have to start up, begin rendering, > and only later figure out "oops, this one goes to X11". We really > can't say things like "when running on modern Wayland, do things the > new way" because Vulkan doesn't have a concept of "running on" a > window system. Hey, no offence taken, and even if there was, there's no point denying the reality that we don't still have Wayland absolutely everywhere, and even when we do, there's still a Streams-sized elephant in the room that no-one wants to talk about. Ultimately though, there will always be two codepaths. Xorg (as opposed to Xwayland) is currently unmaintained, and I expect it to remain so forever. The last times we tried to plumb synchronisation through to native-Xorg, it collapsed in conflicting requests for rewriting the presentation mechanism (which very few people understand and even fewer people can make work reliably), and I don't see that changing any time soon. Plus the single biggest remaining use for native-Xorg is that it implements tearflips by blitting to a single internal frontbuffer which is currently being scanned out, which is going to be ... difficult ... to square with the goal of actually synchronising things. What I mean though is that I don't think it makes sense to have three design points. 
Previously, the only effort put into explicit synchronisation on the winsys side was using it was a means to surface the timings for both tracing and being able to calibrate compositor repaints. Given that we need to support interop with the implicit world forever, and given that there was no explicit interaction with the implicit resv slots (solved by this new uAPI), and given that the sync_file vs. drm_syncobj vs. ?? userspace fence story wasn't clear, it didn't seem like a practical benefit to sink time into supporting it, just that it ticked a 'yep we're doing things explicitly' box, which was not practically useful since EGL doesn't let you opt out of implicit semantics anyway. Now with the details having come out of AMD/Arm/Intel about future gens, and the driver-side pain being properly understood, as well as the new uAPI actually giving us clear benefit, there's a good reason to work on it. So we'll do that on the winsys side, and support the new explicitly-managed-implicit-slot world, and everyone's lives will be better. Which gives us two design points: unaware implicit-only users (Xorg, old Wayland, old GPU, media), and fully-aware extremely-explicit users (Mutter, Weston, wlroots, new GPU). Having the new uAPI makes the transitions clear and explicit, as opposed to the previous halfway house where we'd dutifully import and export sync_files, but the kernel would just do the exact same resv bookkeeping anyway as if we were implicit. > FWIW, I do have a Mesa MR somewhere which adds explicit sync for > Vulkan+Wayland when the wayland protocols are there. I don't remember > why it didn't land. Maybe lack of acquire fence support? I think the > protocol issues have been fixed, so we should up-rev the requirements > as needed and land it. Depends what you mean by acquire-fence? Assuming you mean ANI (compositor release -> client reuse), we never hooked that up in Weston because we were waiting for things to settle down with e.g. syncobj, but I guess we've just written ourselves into a catch-22 there. Something to fix. In the meantime we do have the timeline-syncobj protocol typed out, but it's not useful to us until it works with epoll. > > We did take a long time to roll > > out sync_file support, but that was only because we didn't quite > > understand all the nuances of why implicit sync is so difficult for > > GPU drivers on modern architectures and how it was actually a win > > compared to doing nothing; maybe we should have some kind of > > conference where we all get together and talk to each other ... ? > > Anyway, by the time we got to the cusp of rolling out bi-directional > > sync_file support, suddenly we had drm_syncobj. By the time that had > > semi-settled down and started to be rolled out, then we suddenly had > > userspace fences on the horizon. And what do we do with those? > > You're not wrong.... And that's the second reason for the gymnastics > above. Ever since Vulkan launched, we've been fumbling around trying > to figure out how to best do synchronization. 'm reasonably certain > that userspace memory fences are the future but I'm much less certain > about the path to get there. It's been a process of patient > experimentation so far and I think we're getting closer. Syncobj, > timeline syncobj, etc. have all been steps along that path. I've been > hesitant to ask the window-system people to draft piles of extensions > until we're sure we've found "the one". 
It's bad enough iterating in > kernel-space and userspace without bringing Wayland protocol into each > iteration step. For that reason, one of my goals along this process > has been to try and decouple things as much as we can. > > So, in summary, none of this is because I think that window systems > are dumb and implicit or for any lack of respect for the people that > work on them. I assume that X11 will always be dumb and implicit. > (I'd love to be proven wrong!) For Wayland (and XWayland, probably), > I assume the people are very smart and active and will implement > whatever APIs we (the driver people) need as long as they're > reasonable. I just don't know what to ask for yet. Typing out timeline-syncobj support gets us like 90% of the way to userspace fences. At the moment, the protocol only has the concept of fully-materialised fences which will definitely retire at some point; typing out timeline-syncobj gives us the two concepts of syncobj/val and not-yet-materialised fences (i.e. wait-before-signal). The former is just typing, whereas the latter is a fairly violent compositor-internal change to decouple the scene graph across contexts - stalling your desktop for a second because a fence takes too long to retire is not nice, but stalling your desktop infinitely because the client gave you an infinite WBS is less acceptable. Once we've done that, userspace fence support is slightly annoying to type out (new polling mechanism, compositor balancing burning CPU polling excessively vs. getting your VRR frames in as early as possible), but the protocol sed job and polling switch-up is nothing compared to being WBS-aware. So I think it does make perfect sense for us to get typing on timeline-syncobj, and then all our compositors are 95% of the way to being ready for userspace fences. > > We've - certainly Weston, and I'm pretty confident I speak for Simon > > on the wlroots side and Jonas for Mutter - landed on accepting that > > we're going to have to deal with timeline semaphores and > > wait-before-signal; we have a credible plan (we think) for protocol to > > deal with at least syncobj, which should conceptually extend to > > userspace fences. The biggest blocker there is the syncobj uAPI. > > Compositors aren't Vulkan clients - we don't have one thread per group > > of work, because the inter-client synchronisation becomes nightmarish > > overhead for little gain. I don't think this will ever change, because > > the balance of work is totally different between client and > > compositor. > > > > Anyway, the problem with syncobj is that the ioctl to wait for a > > sync_file to materialise for a given timeline point only allows us to > > block with a timeout; this is a non-starter, because we need something > > which fits into epoll. The most optimal case is returning a new > > eventfd or similar which signals when a given timeline point becomes > > available or signaled, but in extremis a syncobj FD becoming readable > > when any activity which would change the result of any zero-timeout > > wait on that syncobj is more or less workable. > > Right. You could call this an oversight. Really, though, it was > because the use-cases we were interested in were ones where a wait > with a timeout was perfectly acceptable and where the extra overhead > of setting an epoll with ioctls wasn't. It shouldn't be too hard to > wire up if that helps (cross your fingers). But.... 
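For reference, the wait being discussed here is the existing timeline wait ioctl, which only offers a blocking wait with an (absolute) timeout - nothing yields an FD a compositor can hand to epoll. A minimal sketch of that existing uAPI, with error handling omitted:

/* The only existing way to wait for a timeline point to materialise and
 * signal: a blocking ioctl with a timeout.  This is the gap for
 * epoll-driven compositors discussed above. */
#include <stdint.h>
#include <xf86drm.h>

static int wait_timeline_point(int drm_fd, uint32_t syncobj_handle,
                               uint64_t point, int64_t abs_timeout_nsec)
{
        struct drm_syncobj_timeline_wait wait = {
                .handles = (uintptr_t)&syncobj_handle,
                .points = (uintptr_t)&point,
                .timeout_nsec = abs_timeout_nsec, /* absolute CLOCK_MONOTONIC */
                .count_handles = 1,
                /* Also wait for the fence to materialise, not just signal. */
                .flags = DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
        };

        /* Blocks the calling thread until the point signals or the timeout
         * expires - exactly what a single event loop cannot afford. */
        return drmIoctl(drm_fd, DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT, &wait);
}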
> > If we go the direction of userspace memory fences (which I think is > likely), there is no such thing as "wait for the fence to > materialize". The work is either done or it isn't. There is no > enforceable concept of "about to be done". The word "enforceable" is > important there. We could add such a concept but it'd be dependent on > the userspace client (not driver) to let us know when it's just about > ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if > they break the contract. Maybe possible but there's some design work > required there. The other option is to make the compositors handle > this new way of thinking more thoroughly and eat the latency hit. There's no good answer whatsoever to combining pure userspace fences and VRR, because either you burn your CPU polling at 200Hz, or you're late getting frames to screen. But again, once we're timeline-syncobj-aware, we're pretty much already userspace-fence aware. Compositors already have very separate timed points for client buffer submit (arbitrary time), to scheduling output repaint (as needed), to launching output repaint (negative delta from deadline), to frame deadline (fixed by hardware, VRR notwithstanding). Timeline syncobjs hurt because we have to introduce an additional stage where we check if a client's fence has materialised; the surface tree has to live in limbo until they have, with the scene-graph update happening when they have. Userspace fencing just switches the trigger from 'epoll tells us the fence is materialised and/or signalled' to 'epoll tells us it's time to wake up and check the fence'. Again, just tedious typing, not actually conceptually difficult. > > As far as I can tell, the three relevant vendors with (more or less) > > upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit > > more up in the air as the hardware and specs aren't currently > > available to the upstream development teams, but hopefully we can > > bring them into this conversation. I think it's looking like we're > > optimising for all of the following, without leaving anyone out in the > > cold: > > > > [...] > > You're missing a use-case: Modern userspace-fencing hardware running > on bare X11. I don't know that we should optimize for this case, so > to speak, but it has to work non-terribly. Also, as I alluded to > above, I really don't want to be maintaining two drivers forever: One > for X11 and ancient Wayland and one for modern Wayland. We need to be > able to modernize the driver and still support the old window-systems. > Unless you can promise me that X11 and KDE/Wayland will either go away > or else get modernized, I need a fall-back plan. And even if you make > me said promise, I'm not going to believe you. :-P 8-9 years ago, I > was one of the people making those promises. Now I'm writing X11 > winsys code for drivers. And... now I'm re-thinking some of my life > choices.... KWinFT is based on wlroots, so it gets everything for free. You're right about X11, but weirdly this is the one case that's really easy. X11 WSI already throws a client submit thread under the bus for asynchronous presentation. This invalidates every possible guarantee: you can't do vkQueuePresentKHR() followed by native X11 rendering, because the queue may not have flushed out to the server. So just do the same thing: use the present thread that already exists, but make it spin on completion instead, and then the server doesn't need to do any synchronisation on QueuePresent, just on ANI. 
And you don't get eaten by Xorg grues. > > Ultimately, I think I'm leaning towards agreeing with Christian that I > > would like to see a good holistic model for how this works in a > > variety of usecases before we progress new uAPI, but also to agreeing > > with you and Dan in that how amdgpu currently implements things is > > currently objectively very wrong (despite the motivations for having > > implemented it that way). > > > > If there are any usecases I've missed then I'm all ears, but else I > > think we should take this forward by working with > > Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic > > skeletons for end-to-end synchronisation through those usecases. It's > > clear that this never happened for syncobj, because it's just not > > usable as a primitive in any compositor which exists today: let's make > > sure we don't get into the same trap again. If we can do this over > > email/GitLab then that's great, but if not, maybe we need to do a kind > > of mini-XDC with some kind of virtual whiteboard we can all scribble > > over. > > I think what we're looking at is roughly three steps, the first two of > which are mostly there on the Wayland side: > > 1. Implicit sync. We know what this one is. glFlush/vkQueueSubmit > on the one side, start texturing on the other, and it all works. Done. > 2. Immutable SW fences, i.e. sync_file. This is where we have a > fence object that gets returned from the flush/submit and can be > passed into the texture op to provide a dependency. Syncobj may be > useful here but only as a container. If a sync_file is a dma_fence*, > a syncobj should be thought of as a dma_fence**. This may be useful > if we want to retrofit sync_file into X11 where the current DRI3 > explicit sync stuff is based on creating an object and then re-using > it for every present. Mostly done for Wayland, but as above we never bothered typing out the sync_file return path in any implementation. > 3. Userspace memory fences. > > Note that timeline syncobj is NOT in that list. IMO, all the "wait > for submit" stuff is an implementation detail we needed in order to > get the timeline semantics on top of immutable SW fences. Under the > hood it's all dma_fence; this just gives us a shareable container so > we can implement VK_KHR_timeline_semaphore with sharing. I really > don't want to make Wayland protocol around it if memory fences are the > final solution. Typing out the Wayland protocol isn't the hard bit. If we just need to copy and sed syncobj to weirdsyncobj, no problem really, and it gives us a six-month head start on painful compositor-internal surgery whilst we work on common infrastructure to ship userspace fences around (mappable dmabuf with the sync bracketing? FD where every read() gives you the current value? memfd? other?). > > (As a coda to this, I'm pretty sure I understand the subtleties of the > > memory fences for relocation/shootdown, but it would be _really_ good > > to have a coherent description everyone agrees on written down > > somewhere, so people can understand the issues without having to read > > 250 emails with people at loggerheads.) > > Let me give it a try. I define a userspace memory fence as follows: > [...] Thanks, and I completely agree with your description. What I meant though was the 'relocation fences' Christian has for amdgpu, and Intel semi-has with ttm_bo->moving. This is semi-orthogonal, but understanding how it's supposed to interact is pretty critical I think. 
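To make the "syncobj as a dma_fence**" container point above concrete, the existing libdrm wrappers already allow treating a (binary) syncobj as a reusable slot: import a sync_file as its current payload, and later export whatever fence it currently holds. A sketch, purely to illustrate the container idea:

/* drm_syncobj as a dma_fence** container, using existing libdrm calls. */
#include <stdint.h>
#include <xf86drm.h>

static int snapshot_via_syncobj(int drm_fd, int sync_file_in, int *sync_file_out)
{
        uint32_t syncobj;
        int ret;

        ret = drmSyncobjCreate(drm_fd, 0, &syncobj);
        if (ret)
                return ret;

        /* The syncobj holds a pointer to "the current fence"; importing a
         * sync_file replaces that payload... */
        ret = drmSyncobjImportSyncFile(drm_fd, syncobj, sync_file_in);
        if (!ret)
                /* ...and exporting snapshots whatever it points at now. */
                ret = drmSyncobjExportSyncFile(drm_fd, syncobj, sync_file_out);

        drmSyncobjDestroy(drm_fd, syncobj);
        return ret;
}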
> There are some ways to work around some of these issues. Windows has > a few tricks which we might be able to steal if we want. > > Why would anyone want such a horrid thing? Application developers > absolutely love them. They can write massively multi-threaded apps > with piles of work queues that require very little cross-thread > synchronization and the GPU scheduler sorts it all out for them in the > end. If you're a 3D game engine developer, this timeline model is > very powerful. If you're a driver or window-system developer, you > really have to just embrace the lack of knowledge and trust apps. > > Oof... That got long. I hope it was informative. It was! Thanks for the write-up. Cheers, Daniel
On 5/24/21 8:11 PM, Jason Ekstrand wrote: > On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote: >> >> Hi, >> >> On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@jlekstrand.net> wrote: >>> In the Vulkan world, this is useful for dealing with the out-fence from >>> vkQueuePresent. Current Linux window-systems (X11, Wayland, etc.) all >>> rely on dma-buf implicit sync. Since Vulkan is an explicit sync API, we >>> get a set of fences (VkSemaphores) in vkQueuePresent and have to stash >>> those as an exclusive (write) fence on the dma-buf. We handle it in >>> Mesa today with the above mentioned dummy submit trick. This ioctl >>> would allow us to set it directly without the dummy submit. >>> >>> This may also open up possibilities for GPU drivers to move away from >>> implicit sync for their kernel driver uAPI and instead provide sync >>> files and rely on dma-buf import/export for communicating with other >>> implicit sync clients. >>> >>> We make the explicit choice here to only allow setting RW fences which >>> translates to an exclusive fence on the dma_resv. There's no use for >>> read-only fences for communicating with other implicit sync userspace >>> and any such attempts are likely to be racy at best. >> >> I think I almost follow, but I'm not quite seeing the race you allude >> to. Let me talk through my understanding so it's hopefully more clear >> for others as a summary. Please correct me on anything I've >> misunderstood or just missed completely. (I thought when I wrote this >> intro that the email might be relatively snappy, but it's really long >> and takes in a lot of breadth. Sorry.) >> >> So as far as I'm reading this patchset and the Mesa MR, this _only_ >> concerns the out-fence (i.e. compositor -> client AcquireNextImage >> semaphore/fence) so far. The in-fence (client->compositor QueuePresent >> wait-semaphores/fences) is handled by the driver ensuring that an >> exclusive resv is placed on the union of the signal semaphores passed >> to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO >> flags, or ... either way, no problem as it should always be exclusive, >> and it seems pretty uncontroversial that we should pull the wait >> semaphore into an exclusive fence so that no downstream consumer will >> begin using it until the client ops have fully retired. >> >> For AcquireNextImage, your patchset exports all the fences (shared and >> exclusive both) on the dma_resv out into the semaphore/fence such that >> the client can't progress (CPU-side for VkFence, GPU-side for >> VkSemaphore) until all currently queued operations have completely >> retired. So, as long as the server ensures that all its kernel-side >> work is flushed before its IPC to unblock ANI (wl_buffer.release or >> DRI3 PresentIdle, both indicating that the client is free to reuse the >> buffer, subject to synchronising against implicit fences on the resv), >> all good: we snapshot the current resv state, wrap the relevant >> driver-side Vulkan primitive around it, and go back to explicit >> synchronisation. The client can't race generating new work against >> this, because it can't queue any work until ANI has returned and >> queued a signaling op on the semaphore/fence. >> >> So far, so good. 
I really like both your series up until this >> narrative point; as I was saying in the userspace-fence thread, the >> WSI<->client thread is certainly pulling a very big lever with a >> heavyweight transition between the two different worlds, and I like >> that the new export ioctl lets us be very clear about what exactly is >> happening under the hood. Good stuff. > > Glad to hear. If you turned that into RBs on the first three patches > in this series and all but the last patch on the Mesa MR, it'd make me > even happier. :-D > > At this point, I think everyone is pretty happy with the first three > patches and the export ioctl. In the Vulkan WSI code, it solves a > significant over-synchronization issue for ANV. Also, the uAPI > shouldn't be controversial at all because it's identical to poll() > except that it gives you a FD you can poll on later to get the result > of the poll as it would be now. I think if we get some Mesa reviews, > we should be able to land those. It's import that's trickier. > >> So, what gives with the import ioctl? Taking a guess at where you're >> going, the import ioctl is going to be used in QueuePresent just as >> the export ioctl is in ANI: instead of having CS flags to write into >> the resv exclusive slot or per-BO flags to always dump in exclusive, >> there's a forthcoming patch somewhere which lets drivers skip this and >> instead have common QueuePresent code dump the wait semaphore into the >> resv, so servers on the other side of an implicit-only protocol will >> synchronise their access against the fence imported as part of >> client-side QueuePresent? > > Correct. And the patch isn't forthcoming. It already exists as the > top patch in the Mesa MR (!4037). > >> That makes sense to me and is nicely symmetrical, plus it gets GPU >> drivers further out of the business of doing magic winsys/display >> synchronisation, which is good. But why enforce that imported fences >> have to go into the exclusive slot? It makes sense from the point of >> view of clients doing QueueSubmit, but I think it might preclude other >> uses for no particularly good reason. > > Mostly, I was trying to limit the scope a bit. Import is tricky and, > to be honest, I'm still not 100% sure that it's safe. I probably > should have left RFC on this patch. > > As long as we keep everything inside the kernel, there's a requirement > that any work which triggers any fence on a dma_resv waits on the > exclusive fence, if any. Work which sets the exclusive fence has to > wait on all fences. With the import ioctl, we can guarantee that the > fences fire in the right order by making an array fence containing the > new fence and all other fences and using that as the exclusive fence. > What we can't do, however, is ensure that the work which triggers the > fence actually blocks on ANY of the fences on the dma_resv. > > Hrm... I think I've now convinced myself that importing shared fences > is no more dangerous than importing an exclusive fence because they > share the same set of problems. Unfortunately, I've unconvinced > myself of the safety of either. I've got to think some more.... > > The most convincing argument I can make to myself is that this ioctl > is like having a one-shot little queue that contains tiny little work > jobs which wait on whatever sync_file you provide, do nothing, and > then signal. That should be safe, right? 
> >> Certainly on X11 there's no real use for the shared slot - you get >> into difficulties with mixed client/server rendering - but at least >> Wayland and GBM are always non-destructive, so conceptually have a >> good use for being able to import fences into the shared resv. This >> goes for media and KMS access as well, if you think beyond the usecase >> of an explicit Vulkan client sending through to a dumb implicit server >> which just wants to texture. >> >> Media clients are definitely a relevant usecase, both directly with >> V4L2/VA-API, neither of which have support for explicit >> synchronisation yet and (at least for V4L2) are unlikely to do so in >> the near future, but even more with pipeline frameworks like GStreamer >> and PipeWire, which don't have fundamental awareness (they're prepared >> to deal with deep pipelines which return at arbitrary times, but once >> anything is returned, it is available for immediate use). Maybe >> open-coding CPU-side waits in these users is the best way longer term, >> but for easier interop if nothing else, being able to pull shared >> fences in seems like it could be a win ('the compositor is still >> texturing from this for now, so feel free to read back from ref >> frames, but don't scribble over it until it's finished'). >> >> I'm slightly bemused that there's so much energy spent on designing >> around winsys being dumb and implicit. > > I'd like to address this one as it's a comment you've made several > times. Once you've fixed raw X11 (not just XWayland) and a new > release has been made (hah!) and is shipping in distros with said > support, then we can talk. Sorry if that comes off as overly snarky > but that's reality that we (driver devs) are living with. To make > things even worse, when we're in Vulkan land (as opposed to GL), we > can't tell up-front whether or not our window-system supports foo > fences and adjust accordingly. We have to start up, begin rendering, > and only later figure out "oops, this one goes to X11". We really > can't say things like "when running on modern Wayland, do things the > new way" because Vulkan doesn't have a concept of "running on" a > window system. > > FWIW, I do have a Mesa MR somewhere which adds explicit sync for > Vulkan+Wayland when the wayland protocols are there. I don't remember > why it didn't land. Maybe lack of acquire fence support? I think the > protocol issues have been fixed, so we should up-rev the requirements > as needed and land it. > >> We did take a long time to roll >> out sync_file support, but that was only because we didn't quite >> understand all the nuances of why implicit sync is so difficult for >> GPU drivers on modern architectures and how it was actually a win >> compared to doing nothing; maybe we should have some kind of >> conference where we all get together and talk to each other ... ? >> Anyway, by the time we got to the cusp of rolling out bi-directional >> sync_file support, suddenly we had drm_syncobj. By the time that had >> semi-settled down and started to be rolled out, then we suddenly had >> userspace fences on the horizon. And what do we do with those? > > You're not wrong.... And that's the second reason for the gymnastics > above. Ever since Vulkan launched, we've been fumbling around trying > to figure out how to best do synchronization. 'm reasonably certain > that userspace memory fences are the future but I'm much less certain > about the path to get there. 
It's been a process of patient > experimentation so far and I think we're getting closer. Syncobj, > timeline syncobj, etc. have all been steps along that path. I've been > hesitant to ask the window-system people to draft piles of extensions > until we're sure we've found "the one". It's bad enough iterating in > kernel-space and userspace without bringing Wayland protocol into each > iteration step. For that reason, one of my goals along this process > has been to try and decouple things as much as we can. > > So, in summary, none of this is because I think that window systems > are dumb and implicit or for any lack of respect for the people that > work on them. I assume that X11 will always be dumb and implicit. > (I'd love to be proven wrong!) For Wayland (and XWayland, probably), > I assume the people are very smart and active and will implement > whatever APIs we (the driver people) need as long as they're > reasonable. I just don't know what to ask for yet. > >> We've - certainly Weston, and I'm pretty confident I speak for Simon >> on the wlroots side and Jonas for Mutter - landed on accepting that >> we're going to have to deal with timeline semaphores and >> wait-before-signal; we have a credible plan (we think) for protocol to >> deal with at least syncobj, which should conceptually extend to >> userspace fences. The biggest blocker there is the syncobj uAPI. >> Compositors aren't Vulkan clients - we don't have one thread per group >> of work, because the inter-client synchronisation becomes nightmarish >> overhead for little gain. I don't think this will ever change, because >> the balance of work is totally different between client and >> compositor. >> >> Anyway, the problem with syncobj is that the ioctl to wait for a >> sync_file to materialise for a given timeline point only allows us to >> block with a timeout; this is a non-starter, because we need something >> which fits into epoll. The most optimal case is returning a new >> eventfd or similar which signals when a given timeline point becomes >> available or signaled, but in extremis a syncobj FD becoming readable >> when any activity which would change the result of any zero-timeout >> wait on that syncobj is more or less workable. > > Right. You could call this an oversight. Really, though, it was > because the use-cases we were interested in were ones where a wait > with a timeout was perfectly acceptable and where the extra overhead > of setting an epoll with ioctls wasn't. It shouldn't be too hard to > wire up if that helps (cross your fingers). But.... > > If we go the direction of userspace memory fences (which I think is > likely), there is no such thing as "wait for the fence to > materialize". The work is either done or it isn't. There is no > enforceable concept of "about to be done". The word "enforceable" is > important there. We could add such a concept but it'd be dependent on > the userspace client (not driver) to let us know when it's just about > ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if > they break the contract. Maybe possible but there's some design work > required there. The other option is to make the compositors handle > this new way of thinking more thoroughly and eat the latency hit. > >> What we do want to do though, regardless of the primitive or its >> semantics, is to only have to go through this once, not rework it all >> in six months, and have to support a bunch of diverging codepaths. 
So >> what is the optimal synchronisation primitive we should be aiming for >> on the winsys side? Is sync_file a good enough lowest common >> denominator, or should we do timeline-syncobj for fancy clients, in >> tandem with sync_file bolted on the side for media and ancient GPUs? >> Or are userspace fences going to give us some new primitive? And if >> so, can we please talk about those semantics now, so we don't end up >> with three synchronisation mechanisms which are all sort of but not >> really alike? > > As I said above, I think we're getting close but I don't think we're there yet. > >> As far as I can tell, the three relevant vendors with (more or less) >> upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit >> more up in the air as the hardware and specs aren't currently >> available to the upstream development teams, but hopefully we can >> bring them into this conversation. I think it's looking like we're >> optimising for all of the following, without leaving anyone out in the >> cold: >> >> 1. Modern userspace-fencing hardware running on a >> userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the >> big three Wayland compositors which handle enough synchronisation >> logic internally that the choice of underlying >> composition/presentation API is immaterial (for which any new API is >> important) >> 2. Modern userspace-fencing hardware running on Xwayland, for which >> we'll inevitably have to type out new DRI3 synchronsiation, but >> derived purely from the Wayland protocols so it can be passed through >> quite directly, and with any mixed X11<->user buffer client >> interaction (e.g. subwindows) being penalised by a long blocking wait >> in the X server >> 3. sync_file-oriented hardware, for which we need to do ~nothing >> 4. Modern userspace-fencing hardware and APIs which need to interop >> with media units, for all four combinations of client/server >> source/sink; for some APIs like Vulkan Video synchronisation is not a >> problem, for others like VA-API/V4L2/GStreamer we're probably need >> going to live with the implicit semantics for the foreseeable future, >> which means we should do the right thing for them up front, because >> why would you even be playing games if you're not streaming them to >> Twitch at the same time? (Personally I'm much happier that no-one >> other than my friends who already know I'm terrible can see me playing >> CoD, but y'know, kids these days ...) >> >> The fifth case I've left out is clients who want to smash Vulkan >> content directly into the display. For games and Kodi-like UI I'm just >> going to ignore this, because (maybe I'm biased, but) using KMS >> directly is difficult enough that you shouldn't do it and just use a >> winsys because we've spent a very long time dealing with these >> problems for you. The remaining usecase there is XR, but their uses >> are very different, and OpenXR already deals with a _lot_ of this for >> you, with the runtimes having sufficiently sophisticated >> synchronisation internally that whatever we come up with won't be >> onerous for them to implement. Either way, integration with KMS/GBM >> has the same problem as Wayland, in that unless you fully encapsulate >> it in something like VK_KHR_display, you don't get to throw a thread >> under the bus to do delayed submit, because you need to actually >> return something to the caller. > > You're missing a use-case: Modern userspace-fencing hardware running > on bare X11. 
I don't know that we should optimize for this case, so > to speak, but it has to work non-terribly. Also, as I alluded to > above, I really don't want to be maintaining two drivers forever: One > for X11 and ancient Wayland and one for modern Wayland. We need to be > able to modernize the driver and still support the old window-systems. > Unless you can promise me that X11 and KDE/Wayland will either go away > or else get modernized, I need a fall-back plan. And even if you make Hi, I don't know why you specifically mentioned KDE/Wayland, but we are on board with explicit sync. :) Cheers, Vlad > me said promise, I'm not going to believe you. :-P 8-9 years ago, I > was one of the people making those promises. Now I'm writing X11 > winsys code for drivers. And... now I'm re-thinking some of my life > choices.... > >> Ultimately, I think I'm leaning towards agreeing with Christian that I >> would like to see a good holistic model for how this works in a >> variety of usecases before we progress new uAPI, but also to agreeing >> with you and Dan in that how amdgpu currently implements things is >> currently objectively very wrong (despite the motivations for having >> implemented it that way). >> >> If there are any usecases I've missed then I'm all ears, but else I >> think we should take this forward by working with >> Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic >> skeletons for end-to-end synchronisation through those usecases. It's >> clear that this never happened for syncobj, because it's just not >> usable as a primitive in any compositor which exists today: let's make >> sure we don't get into the same trap again. If we can do this over >> email/GitLab then that's great, but if not, maybe we need to do a kind >> of mini-XDC with some kind of virtual whiteboard we can all scribble >> over. > > I think what we're looking at is roughly three steps, the first two of > which are mostly there on the Wayland side: > > 1. Implicit sync. We know what this one is. glFlush/vkQueueSubmit > on the one side, start texturing on the other, and it all works. > > 2. Immutable SW fences, i.e. sync_file. This is where we have a > fence object that gets returned from the flush/submit and can be > passed into the texture op to provide a dependency. Syncobj may be > useful here but only as a container. If a sync_file is a dma_fence*, > a syncobj should be thought of as a dma_fence**. This may be useful > if we want to retrofit sync_file into X11 where the current DRI3 > explicit sync stuff is based on creating an object and then re-using > it for every present. > > 3. Userspace memory fences. > > Note that timeline syncobj is NOT in that list. IMO, all the "wait > for submit" stuff is an implementation detail we needed in order to > get the timeline semantics on top of immutable SW fences. Under the > hood it's all dma_fence; this just gives us a shareable container so > we can implement VK_KHR_timeline_semaphore with sharing. I really > don't want to make Wayland protocol around it if memory fences are the > final solution. > > >> (As a coda to this, I'm pretty sure I understand the subtleties of the >> memory fences for relocation/shootdown, but it would be _really_ good >> to have a coherent description everyone agrees on written down >> somewhere, so people can understand the issues without having to read >> 250 emails with people at loggerheads.) > > Let me give it a try. 
I define a userspace memory fence as follows: > - The fence object is fundamentally a bit of mappable gpu-accessible > memory which stores a 64-bit counter value which is used as a > timeline. (On Windows, they require it to live in system memory.) > - For sharable memory fences, each one probably has to go in its own page. :-( > - The value is initialized to 0. > - To signal the fence, someone (GPU or CPU) writes a new 64-bit value. > - Waits on the fence are waits for it to be >= some value. > > Depending on circumstances, the wait may be a 32-bit comparison and > may be >= with wrap-around. For the purposes of window-system > protocol, I think we can assume 64-bit >= with no wrap-around. > > There are a few very important points worth noting about them, however: > - No one except the userspace client (not driver!) has any idea > when/if a value will be signaled > - The value may never be signaled > - Someone may submit a wait before someone submits a signal; we have > to deal with that > - There is no concept of "about to be signaled" > - For the kernel/firmware GPU scheduler handling waits, this means it > just keeps trying to run work and hopes the wait eventually unblocks. > It also means we need to totally decouple kernel synchronization for > memory management purposes from synchronization for work scheduling. > - For a compositor, I'm still not 100% sure what this means. TBD, I think. > > There are some ways to work around some of these issues. Windows has > a few tricks which we might be able to steal if we want. > > Why would anyone want such a horrid thing? Application developers > absolutely love them. They can write massively multi-threaded apps > with piles of work queues that require very little cross-thread > synchronization and the GPU scheduler sorts it all out for them in the > end. If you're a 3D game engine developer, this timeline model is > very powerful. If you're a driver or window-system developer, you > really have to just embrace the lack of knowledge and trust apps. > > Oof... That got long. I hope it was informative. > > --Jason > > >> Cheers, >> Daniel >> >> >> >> >> >> >> >> >>> When we got to >>> insert the RW fence, the actual fence we set as the new exclusive fence >>> is a combination of the sync_file provided by the user and all the other >>> fences on the dma_resv. This ensures that the newly added exclusive >>> fence will never signal before the old one would have and ensures that >>> we don't break any dma_resv contracts. We require userspace to specify >>> RW in the flags for symmetry with the export ioctl and in case we ever >>> want to support read fences in the future. >>> >>> There is one downside here that's worth documenting: If two clients >>> writing to the same dma-buf using this API race with each other, their >>> actions on the dma-buf may happen in parallel or in an undefined order. >>> Both with and without this API, the pattern is the same: Collect all >>> the fences on dma-buf, submit work which depends on said fences, and >>> then set a new exclusive (write) fence on the dma-buf which depends on >>> said work. The difference is that, when it's all handled by the GPU >>> driver's submit ioctl, the three operations happen atomically under the >>> dma_resv lock. If two userspace submits race, one will happen before >>> the other. You aren't guaranteed which but you are guaranteed that >>> they're strictly ordered. 
If userspace manages the fences itself, then >>> these three operations happen separately and the two render operations >>> may happen genuinely in parallel or get interleaved. However, this is a >>> case of userspace racing with itself. As long as we ensure userspace >>> can't back the kernel into a corner, it should be fine. >>> >>> v2 (Jason Ekstrand): >>> - Use a wrapper dma_fence_array of all fences including the new one >>> when importing an exclusive fence. >>> >>> v3 (Jason Ekstrand): >>> - Lock around setting shared fences as well as exclusive >>> - Mark SIGNAL_SYNC_FILE as a read-write ioctl. >>> - Initialize ret to 0 in dma_buf_wait_sync_file >>> >>> v4 (Jason Ekstrand): >>> - Use the new dma_resv_get_singleton helper >>> >>> v5 (Jason Ekstrand): >>> - Rename the IOCTLs to import/export rather than wait/signal >>> - Drop the WRITE flag and always get/set the exclusive fence >>> >>> v5 (Jason Ekstrand): >>> - Split import and export into separate patches >>> - New commit message >>> >>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> >>> --- >>> drivers/dma-buf/dma-buf.c | 32 ++++++++++++++++++++++++++++++++ >>> include/uapi/linux/dma-buf.h | 1 + >>> 2 files changed, 33 insertions(+) >>> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c >>> index 7679562b57bfc..c9d6b9195c00c 100644 >>> --- a/drivers/dma-buf/dma-buf.c >>> +++ b/drivers/dma-buf/dma-buf.c >>> @@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf, >>> put_unused_fd(fd); >>> return ret; >>> } >>> + >>> +static long dma_buf_import_sync_file(struct dma_buf *dmabuf, >>> + const void __user *user_data) >>> +{ >>> + struct dma_buf_sync_file arg; >>> + struct dma_fence *fence, *singleton = NULL; >>> + int ret = 0; >>> + >>> + if (copy_from_user(&arg, user_data, sizeof(arg))) >>> + return -EFAULT; >>> + >>> + if (arg.flags != DMA_BUF_SYNC_RW) >>> + return -EINVAL; >>> + >>> + fence = sync_file_get_fence(arg.fd); >>> + if (!fence) >>> + return -EINVAL; >>> + >>> + dma_resv_lock(dmabuf->resv, NULL); >>> + >>> + ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton); >>> + if (!ret && singleton) >>> + dma_resv_add_excl_fence(dmabuf->resv, singleton); >>> + >>> + dma_resv_unlock(dmabuf->resv); >>> + >>> + dma_fence_put(fence); >>> + >>> + return ret; >>> +} >>> #endif >>> >>> static long dma_buf_ioctl(struct file *file, >>> @@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file, >>> #if IS_ENABLED(CONFIG_SYNC_FILE) >>> case DMA_BUF_IOCTL_EXPORT_SYNC_FILE: >>> return dma_buf_export_sync_file(dmabuf, (void __user *)arg); >>> + case DMA_BUF_IOCTL_IMPORT_SYNC_FILE: >>> + return dma_buf_import_sync_file(dmabuf, (const void __user *)arg); >>> #endif >>> >>> default: >>> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h >>> index f902cadcbdb56..75fdde4800267 100644 >>> --- a/include/uapi/linux/dma-buf.h >>> +++ b/include/uapi/linux/dma-buf.h >>> @@ -70,5 +70,6 @@ struct dma_buf_sync_file { >>> #define DMA_BUF_SET_NAME_A _IOW(DMA_BUF_BASE, 1, u32) >>> #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, u64) >>> #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file) >>> +#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync) >>> >>> #endif >>> -- >>> 2.31.1 >>>
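For illustration, here is a minimal userspace sketch of driving the import ioctl above. It assumes the uAPI header from this series, with struct dma_buf_sync_file carrying a __u32 flags and a __s32 sync_file fd, matching the handler shown in the patch; it is a sketch of the intended usage, not final uAPI.

    /*
     * Userspace sketch, assuming the uAPI proposed in this series:
     * struct dma_buf_sync_file holds a __u32 flags and a __s32 sync_file fd.
     */
    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    static int dmabuf_import_write_fence(int dmabuf_fd, int sync_file_fd)
    {
            struct dma_buf_sync_file arg = {
                    .flags = DMA_BUF_SYNC_RW,  /* the only value this patch accepts */
                    .fd    = sync_file_fd,
            };

            /*
             * On success, the sync_file is folded together with the fences
             * already on the dma_resv and becomes the new exclusive fence.
             */
            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
    }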
On Wed, May 26, 2021 at 1:09 PM Daniel Stone <daniel@fooishbar.org> wrote: > > Hey, > > On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote: > > On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@fooishbar.org> wrote: > > > So far, so good. I really like both your series up until this > > > narrative point; as I was saying in the userspace-fence thread, the > > > WSI<->client thread is certainly pulling a very big lever with a > > > heavyweight transition between the two different worlds, and I like > > > that the new export ioctl lets us be very clear about what exactly is > > > happening under the hood. Good stuff. > > > > Glad to hear. If you turned that into RBs on the first three patches > > in this series and all but the last patch on the Mesa MR, it'd make me > > even happier. :-D > > > > At this point, I think everyone is pretty happy with the first three > > patches and the export ioctl. In the Vulkan WSI code, it solves a > > significant over-synchronization issue for ANV. Also, the uAPI > > shouldn't be controversial at all because it's identical to poll() > > except that it gives you a FD you can poll on later to get the result > > of the poll as it would be now. I think if we get some Mesa reviews, > > we should be able to land those. It's import that's trickier. > > Agree. Have a few more burning things on my list (not least fd.o infra > this week) but will get to it so we can land. > > (Clarified on IRC that my description above was accurate to the best > of our shared understanding, so we're all on the same page here.) fd.o work is much appreciated! > > > That makes sense to me and is nicely symmetrical, plus it gets GPU > > > drivers further out of the business of doing magic winsys/display > > > synchronisation, which is good. But why enforce that imported fences > > > have to go into the exclusive slot? It makes sense from the point of > > > view of clients doing QueueSubmit, but I think it might preclude other > > > uses for no particularly good reason. > > > > Mostly, I was trying to limit the scope a bit. Import is tricky and, > > to be honest, I'm still not 100% sure that it's safe. I probably > > should have left RFC on this patch. > > > > As long as we keep everything inside the kernel, there's a requirement > > that any work which triggers any fence on a dma_resv waits on the > > exclusive fence, if any. Work which sets the exclusive fence has to > > wait on all fences. With the import ioctl, we can guarantee that the > > fences fire in the right order by making an array fence containing the > > new fence and all other fences and using that as the exclusive fence. > > What we can't do, however, is ensure that the work which triggers the > > fence actually blocks on ANY of the fences on the dma_resv. > > > > Hrm... I think I've now convinced myself that importing shared fences > > is no more dangerous than importing an exclusive fence because they > > share the same set of problems. Unfortunately, I've unconvinced > > myself of the safety of either. I've got to think some more.... > > > > The most convincing argument I can make to myself is that this ioctl > > is like having a one-shot little queue that contains tiny little work > > jobs which wait on whatever sync_file you provide, do nothing, and > > then signal. That should be safe, right? > > Yeah, I don't think there's any difference between shared and > exclusive wrt safety. 
The difference lies in, well, exclusive putting > a hard serialisation barrier between everything which comes before and > everything that comes after, and shared being more relaxed to allow > for reads to retire in parallel. > > As said below, I think there's a good argument for the latter once you > get out of the very straightforward uses. One of the arguments for > these ioctls is to eliminate oversync, but then the import ioctl > mandates oversync in the case where the consumer only does > non-destructive reads - which is the case for the vast majority of > users! Just wanted to comment on this: Right now we always attach a shared end-of-batch fence to every dma_resv. So reads are automatically and always synced. So in that sense having an explicit ioctl to set the read fence is not really useful, since at most you just make everything worse. Until we have userspace memory fences this is also pretty much guaranteed to stay like that, except if we maybe split up the shared fences into "relevant for implicit sync" and "not so relevant for implicit sync". amdgpu has an interesting model there with only engaging in implicit sync if it's a fence from a different drm_file (but they lack the distinction between read/write and oversync due to that in other cases, oops). I haven't wrapped my head around how to glue those two approaches together. So anyway right now all we really have on the table for explicit control of the implicit fences is really just: - should I stall for the exclusive fence/all fences/not at all - should I set the exclusive fence And from my additional auditing this week we already have plenty of small bugs in this area :-( So lifting restrictions on what you can do with handling implicit fences more explicitly is going to be major surgery ... Implicit fencing is bound to stay a very funny IPC for fences for quite some time, I think. -Daniel
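As a concrete illustration of the export side being "poll() that hands you an FD for later": the exported fd is an ordinary sync_file, so it can be dropped into an existing epoll loop and becomes readable once the snapshotted fences have signaled. A sketch, again assuming this series' uAPI header:

    #include <sys/epoll.h>
    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    static int watch_dmabuf_idle(int epoll_fd, int dmabuf_fd)
    {
            struct dma_buf_sync_file arg = {
                    .flags = DMA_BUF_SYNC_RW,  /* snapshot both write and read fences */
                    .fd    = -1,
            };
            struct epoll_event ev = { .events = EPOLLIN };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
                    return -1;

            /* The sync_file fd raises EPOLLIN once its fences have signaled. */
            ev.data.fd = arg.fd;
            return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, arg.fd, &ev);
    }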
Hey, On Wed, 26 May 2021 at 13:35, Daniel Vetter <daniel@ffwll.ch> wrote: > On Wed, May 26, 2021 at 1:09 PM Daniel Stone <daniel@fooishbar.org> wrote: > > Yeah, I don't think there's any difference between shared and > > exclusive wrt safety. The difference lies in, well, exclusive putting > > a hard serialisation barrier between everything which comes before and > > everything that comes after, and shared being more relaxed to allow > > for reads to retire in parallel. > > > > As said below, I think there's a good argument for the latter once you > > get out of the very straightforward uses. One of the arguments for > > these ioctls is to eliminate oversync, but then the import ioctl > > mandates oversync in the case where the consumer only does > > non-destructive reads - which is the case for the vast majority of > > users! > > Just wanted to comment on this: Right now we attach always attach a > shared end-of-batch fence to every dma_resv. So reads are > automatically and always synced. So in that sense having an explicit > ioct to set the read fence is not really useful, since at most you > just make everything worse. Are you saying that if a compositor imports a client-provided dmabuf as an EGLImage to use as a source texture for its rendering, and then provides it to VA-API or V4L2 to use as a media encode source (both purely read-only ops), that these will both serialise against each other? Like, my media decode job won't begin execution until the composition read has fully retired? If so, a) good lord that hurts, and b) what are shared fences actually ... for? Cheers, Daniel
On Wed, May 26, 2021 at 02:08:19PM +0100, Daniel Stone wrote: > Hey, > > On Wed, 26 May 2021 at 13:35, Daniel Vetter <daniel@ffwll.ch> wrote: > > On Wed, May 26, 2021 at 1:09 PM Daniel Stone <daniel@fooishbar.org> wrote: > > > Yeah, I don't think there's any difference between shared and > > > exclusive wrt safety. The difference lies in, well, exclusive putting > > > a hard serialisation barrier between everything which comes before and > > > everything that comes after, and shared being more relaxed to allow > > > for reads to retire in parallel. > > > > > > As said below, I think there's a good argument for the latter once you > > > get out of the very straightforward uses. One of the arguments for > > > these ioctls is to eliminate oversync, but then the import ioctl > > > mandates oversync in the case where the consumer only does > > > non-destructive reads - which is the case for the vast majority of > > > users! > > > > Just wanted to comment on this: Right now we attach always attach a > > shared end-of-batch fence to every dma_resv. So reads are > > automatically and always synced. So in that sense having an explicit > > ioct to set the read fence is not really useful, since at most you > > just make everything worse. > > Are you saying that if a compositor imports a client-provided dmabuf > as an EGLImage to use as a source texture for its rendering, and then > provides it to VA-API or V4L2 to use as a media encode source (both > purely read-only ops), that these will both serialise against each > other? Like, my media decode job won't begin execution until the > composition read has fully retired? > > If so, a) good lord that hurts, and b) what are shared fences actually ... for? Shared is shared, I just meant to say that we always add the shared fence. So an explicit ioctl to add more shared fences is kinda pointless. So yeah on a good driver this will run in parallel. On a not-so-good driver (which currently includes amdgpu and panfrost) this will serialize, because those drivers don't have the concept of a non-exclusive fence for such shared buffers (amdgpu does not sync internally, but will sync as soon as it's cross-drm_file). -Daniel
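For readers less familiar with the kernel side of this: the read/write distinction being discussed maps onto the dma_resv bookkeeping drivers already do. A sketch of that bookkeeping with the current in-tree API, not taken from any particular driver:

    #include <linux/dma-resv.h>
    #include <linux/dma-fence.h>

    /* A reader adds a shared fence; other readers may proceed in parallel. */
    static int track_read(struct dma_resv *resv, struct dma_fence *fence)
    {
            int ret;

            dma_resv_lock(resv, NULL);
            ret = dma_resv_reserve_shared(resv, 1);
            if (!ret)
                    dma_resv_add_shared_fence(resv, fence);
            dma_resv_unlock(resv);
            return ret;
    }

    /* A writer replaces the exclusive fence; later work must wait for it. */
    static void track_write(struct dma_resv *resv, struct dma_fence *fence)
    {
            dma_resv_lock(resv, NULL);
            dma_resv_add_excl_fence(resv, fence);
            dma_resv_unlock(resv);
    }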
Hi, Just making sure this thread doesn't get too short ... On Wed, 26 May 2021 at 12:08, Daniel Stone <daniel@fooishbar.org> wrote: > On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote: > > I'd like to address this one as it's a comment you've made several > > times. Once you've fixed raw X11 (not just XWayland) and a new > > release has been made (hah!) and is shipping in distros with said > > support, then we can talk. Sorry if that comes off as overly snarky > > but that's reality that we (driver devs) are living with. To make > > things even worse, when we're in Vulkan land (as opposed to GL), we > > can't tell up-front whether or not our window-system supports foo > > fences and adjust accordingly. We have to start up, begin rendering, > > and only later figure out "oops, this one goes to X11". We really > > can't say things like "when running on modern Wayland, do things the > > new way" because Vulkan doesn't have a concept of "running on" a > > window system. > > Hey, no offence taken, and even if there was, there's no point denying > the reality that we don't still have Wayland absolutely everywhere, > and even when we do, there's still a Streams-sized elephant in the > room that no-one wants to talk about. > > Ultimately though, there will always be two codepaths. Xorg (as > opposed to Xwayland) is currently unmaintained, and I expect it to > remain so forever. The last times we tried to plumb synchronisation > through to native-Xorg, it collapsed in conflicting requests for > rewriting the presentation mechanism (which very few people understand > and even fewer people can make work reliably), and I don't see that > changing any time soon. Plus the single biggest remaining use for > native-Xorg is that it implements tearflips by blitting to a single > internal frontbuffer which is currently being scanned out, which is > going to be ... difficult ... to square with the goal of actually > synchronising things. > > What I mean though is that I don't think it makes sense to have three > design points. > > Previously, the only effort put into explicit synchronisation on the > winsys side was using it was a means to surface the timings for both > tracing and being able to calibrate compositor repaints. Given that we > need to support interop with the implicit world forever, and given > that there was no explicit interaction with the implicit resv slots > (solved by this new uAPI), and given that the sync_file vs. > drm_syncobj vs. ?? userspace fence story wasn't clear, it didn't seem > like a practical benefit to sink time into supporting it, just that it > ticked a 'yep we're doing things explicitly' box, which was not > practically useful since EGL doesn't let you opt out of implicit > semantics anyway. > > Now with the details having come out of AMD/Arm/Intel about future > gens, and the driver-side pain being properly understood, as well as > the new uAPI actually giving us clear benefit, there's a good reason > to work on it. So we'll do that on the winsys side, and support the > new explicitly-managed-implicit-slot world, and everyone's lives will > be better. Which gives us two design points: unaware implicit-only > users (Xorg, old Wayland, old GPU, media), and fully-aware > extremely-explicit users (Mutter, Weston, wlroots, new GPU). 
So I said why I think X11 is totally fine (sync on client, let the server stay dumb and unsynchronised) for the new world, but not why I'm so confident that it's totally fine for Wayland and we can do it quickly, despite previously banging on about how much work it was and why it was so difficult. You already know this Jason, but for the rest of the class ... All Wayland compositors I've ever seen have a strong separation between the protocol-visible objects, and per-output repaint loops, not least because this is strongly encoded into the protocol. Creating a surface, attaching a buffer to it, and committing the surface state, will only update a compositor list of the current state of the protocol-visible objects. At 'a good time' for each output (often next vblank minus a few milliseconds), the compositor will effectively snapshot that state, generate a scene graph from it, and go through repaint for that output (GPU composition and/or KMS planes and/or media encode and/or RDP etc) to turn that to light. This is unlike X11 where you can post rendering commands and then call XGetImage to pull the pixel result back. We also explicitly chose not to follow the X11 window of a global co-ordinate space with windows explicitly treed down from a shared root window: inter-surface operations are very much disconnected from each other with no ordering guarantee, and since clients are totally isolated from each other, there are no inter-client operations. So that already gives us a clean break between clients posting state and visible effect, with a lot of latitude on timing: you post some updates, and a compositor which users actually want will make those updates real at some point, but nothing is guaranteed. There is one cut-out from inter-surface operations, which is our subsurface protocol. An example of this is having your browser UI as the primary window, which embeds a separate rendered window for each tab, which further embeds a window with your pop-over ad^W^Wmedia content. To make resizing and scrolling work without being all Flash-on-Navigator jank, subsurfaces have a 'synchronised' mode, where clients can make protocol-visible updates to their surfaces, but the updates will not be recorded into the visible scene graph until explicitly released by the topmost parent - think of it like a display controller's 'go bit' or updating the tail pointer on a ring buffer, where we record updates but they're stashed away until everything has become coherent. Every useful compositor already implements this subsurface protocol as well, which gives us a _further_ gap between protocol-visible client-recorded state and the scene graph which is used for painting outputs. Handling wait-before-sync is still a lot of typing, and difficult for corner cases, and will need experimentation in the real world before we can all land on useful common semantics, but we've already done the most conceptually difficult work, which was to separate client/protocol-visible object state from the scene graph that gets generated/snapshotted on every output update. So that's why I'm bullish on how quickly we can move on the Wayland side. It's a lot of typing, but we already have two clean breaks from recorded client surface state -> scene graph -> pixels lit up, with each break already handling fragmentation between different surface groups rather than being a global freeze/thaw. 
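To make the commit/snapshot split above concrete, here is a deliberately abstract sketch (not lifted from Weston, wlroots, or Mutter) of the two-stage state handling Wayland compositors already do:

    #include <stdbool.h>

    struct surface_state {
            int buffer;          /* stand-in for the attached wl_buffer */
            int acquire_point;   /* stand-in for whatever sync primitive wins; -1 if none */
    };

    struct surface {
            struct surface_state pending;   /* written by protocol dispatch */
            struct surface_state current;   /* latched on wl_surface.commit */
    };

    /* Protocol dispatch: commit only latches protocol-visible state. */
    static void surface_commit(struct surface *s)
    {
            s->current = s->pending;
    }

    /*
     * Per-output repaint, typically at vblank minus a few milliseconds:
     * snapshot whatever is current into the scene graph for this frame.
     */
    static void output_snapshot(const struct surface *s,
                                struct surface_state *scene_entry)
    {
            *scene_entry = s->current;
    }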
(Having typed all of that, I'm reminded that we never finished up the EGL extension to opt out of implicit sync on a per-EGLImage basis, which was mostly blocked on explicit sync being pretty immature in drivers and none of the uAPI fitting together e.g. per-CS vs. per-BO ... now seems like a pretty apt time to revive that too.) Cheers, Daniel
On Wed, 26 May 2021 at 14:44, Daniel Vetter <daniel@ffwll.ch> wrote: > On Wed, May 26, 2021 at 02:08:19PM +0100, Daniel Stone wrote: > > Are you saying that if a compositor imports a client-provided dmabuf > > as an EGLImage to use as a source texture for its rendering, and then > > provides it to VA-API or V4L2 to use as a media encode source (both > > purely read-only ops), that these will both serialise against each > > other? Like, my media decode job won't begin execution until the > > composition read has fully retired? > > > > If so, a) good lord that hurts, and b) what are shared fences actually ... for? > > Shared is shared, I just meant to say that we always add the shared fence. > So an explicit ioctl to add more shared fences is kinda pointless. > > So yeah on a good driver this will run in parallel. On a not-so-good > driver (which currently includes amdgpu and panfrost) this will serialize, > because those drivers don't have the concept of a non-exclusive fence for > such shared buffers (amdgpu does not sync internally, but will sync as > soon as it's cross-drm_file). When you say 'we always add the shared fence', add it to ... where? And which shared fence? (I'm going to use 'fence' below to refer to anything from literal sync_file to timeline-syncobj to userspace fence.) I'll admit that I've typed out an argument twice for always export from excl+shared, and always import to excl, results in oversync. And I keep tying myself in knots trying to do it. It's arguably slightly contrived, but here's my third attempt ... Vulkan Wayland client, full-flying-car-sync Wayland protocol, Vulkan-based compositor. Part of the contract when the server exposes that protocol is that it guarantees to do explicit sync in both directions, so the client provides a fence at QueueSubmit time and the server provides one back when releasing the image for return to ANI. Neither side ever record fences into the dma_resv because they've opted out by being fully explicit-aware. Now add media encode out on the side because you're streaming. The compositor knows this is a transition between explicit and implicit worlds, so it imports the client's fence into the exclusive dma_resv slot, which makes sense: the media encode has to sync against the client work, but is indifferent to the parallel compositor work. The shared fence is exported back out so the compositor can union the encode-finished fence with its composition-finished fence to send back to the client with release/ANI. Now add a second media encode because you want a higher-quality local capture to upload to YouTube later on. The compositor can do the exact same import/export dance, and the two encodes can safely run in parallel. Which is good. Where it starts to become complex is: what if your compositor is fully explicit-aware but your clients aren't, so your compositor has more import/export points to record into the resv? What if you aren't actually a compositor but a full-blown media pipeline, where you have a bunch of threads all launching reads in parallel, to the extent where it's not practical to manage implicit/explicit transitions globally, but each thread has to more pessimistically import and export around each access? 
I can make the relatively simple usecases work, but it really feels like in practice we'll end up with massive oversync in some fairly complex usecases, and we'll regret not having had it from the start, plus people will just rely on implicit sync for longer because it has better (more parallel) semantics in some usecases. Cheers, Daniel
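On the "union the encode-finished fence with the composition-finished fence" step: for the sync_file case that union already exists as uapi, SYNC_IOC_MERGE in <linux/sync_file.h>. A sketch, assuming both fences are available to the compositor as sync_file fds:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/sync_file.h>

    static int merge_release_fences(int compose_done_fd, int encode_done_fd)
    {
            struct sync_merge_data merge = {
                    .fd2 = encode_done_fd,
            };

            strcpy(merge.name, "release");
            if (ioctl(compose_done_fd, SYNC_IOC_MERGE, &merge) < 0)
                    return -1;

            /* New fd which signals only once both inputs have signaled. */
            return merge.fence;
    }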
On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote: > On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@jlekstrand.net> wrote: > > 3. Userspace memory fences. > > > > Note that timeline syncobj is NOT in that list. IMO, all the "wait > > for submit" stuff is an implementation detail we needed in order to > > get the timeline semantics on top of immutable SW fences. Under the > > hood it's all dma_fence; this just gives us a shareable container so > > we can implement VK_KHR_timeline_semaphore with sharing. I really > > don't want to make Wayland protocol around it if memory fences are the > > final solution. > > Typing out the Wayland protocol isn't the hard bit. If we just need to > copy and sed syncobj to weirdsyncobj, no problem really, and it gives > us a six-month head start on painful compositor-internal surgery > whilst we work on common infrastructure to ship userspace fences > around (mappable dmabuf with the sync bracketing? FD where every > read() gives you the current value? memfd? other?). I feel like I should elaborate more about timelines. In my earlier reply, my commentary about timeline syncobj was mostly focused around helping people avoid typing. That's not really the full story, though, and I hope more context will help. First, let me say that timeline syncobj was designed as a mechanism to implement VK_KHR_timeline_semaphore without inserting future fences into the kernel. It's entirely designed around the needs of Vulkan drivers, not really as a window-system primitive. The semantics are designed around one driver communicating to another that new fences have been added and it's safe to kick off more rendering. I'm not convinced that it's the right object for window-systems and I'm also not convinced that it's a good idea to try and make a version of it that's a wrapper around a userspace memory fence. (I'm going to start typing UMF for userspace memory fence because it's long to type out.) Why? Well, the fundamental problem with timelines in general is trying to figure out when it's about to be done. But timeline syncobj solves this for us! It gives us this fancy super-useful ioctl! Right? Uh.... not as well as I'd like. Let's say we make a timeline syncobj that's a wrapper around a userspace memory fence. What do we do with that ioctl? As I mentioned above, the kernel doesn't have any clue when it will be triggered so that ioctl turns into an actual wait. That's no good because it creates unnecessary stalls. There's another potential solution here: Have each UMF be two timelines: submitted and completed. At the start of every batch that's supposed to trigger a UMF, we set the "submitted" side and then, when it completes, we set the "completed" side. Ok, great, now we can get at the "about to be done" with the submitted side, implement the ioctl, and we're all good, right? Sadly, no. There's no guarantee about how long a "batch" takes. So there's no universal timeout the kernel can apply. Also, if it does time out, the kernel doesn't know who to blame for the timeout and how to prevent itself from getting in trouble again. The compositor does so, in theory, given the right ioctls, it could detect the -ETIME and kill that client. Not a great solution. The best option I've been able to come up with for this is some sort of client-provided signal. Something where it says, as part of submit or somewhere else, "I promise I'll be done soon" where that promise comes with dire consequences if it's not. 
At that point, we can turn the UMF and a particular wait value into a one-shot fence like a dma_fence or sync_file, or signal a syncobj on it. If it ever times out, we kick their context. In Vulkan terminology, they get VK_ERROR_DEVICE_LOST. There are two important bits here: First, is that it's based on a client-provided thing. With a fully timeline model and wait-before-signal, we can't infer when something is about to be done. Only the client knows when it submitted its last node in the dependency graph and the whole mess is unblocked. Second, is that the dma_fence is created within the client's driver context. If it's created compositor-side, the kernel doesn't know who to blame if things go badly. If we create it in the client, it's pretty easy to make context death on -ETIME part of the contract. (Before danvet jumps in here and rants about how UMF -> dma_fence isn't possible, I haven't forgotten. I'm pretending, for now, that we've solved some of those problems.) Another option is to just stall on the UMF until it's done. Yeah, kind-of terrible and high-latency, but it always works and doesn't involve any complex logic to kill clients. If a client never gets around to signaling a fence, it just never repaints. The compositor keeps going like nothing's wrong. Maybe, if the client submits lots of frames without ever triggering, it'll hit some max queue depth somewhere and kill it but that's it. More likely, the client's vkAcquireNextImage will start timing out and it'll crash. I suspect where we might actually land is some combination of the two depending on client choice. If the client wants to be dumb, it gets the high-latency always-works path. If the client really wants lowest-latency VRR, it has to take the smarter path and risk VK_ERROR_DEVICE_LOST if it misses too far. But the point of all of this is, neither of the above two paths have anything to do with the compositor calling a "wait for submit" ioctl. Building a design around that and baking it into protocol is, IMO, a mistake. I don't see any valid way to handle this mess without "wait for sumbit" either not existing or existing only client-side for the purposes of WSI. --Jason
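To pin down the semantics being argued about, here is a CPU-only illustration of the userspace memory fence model described earlier in the thread (64-bit counter, signal by storing a new value, wait for value >= point). It is purely didactic; a real UMF would be written by GPU engines and waited on with something smarter than a polling loop:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <sched.h>

    struct umf {
            _Atomic uint64_t value;   /* starts at 0, only ever increases */
    };

    static void umf_signal(struct umf *f, uint64_t point)
    {
            /* Release: whoever observes >= point also sees the finished work. */
            atomic_store_explicit(&f->value, point, memory_order_release);
    }

    static void umf_wait(struct umf *f, uint64_t point)
    {
            /*
             * Nobody can tell us when, or whether, this will complete: the
             * only primitive is "keep checking until value >= point".
             */
            while (atomic_load_explicit(&f->value, memory_order_acquire) < point)
                    sched_yield();
    }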
On Wed, May 26, 2021 at 5:13 PM Daniel Stone <daniel@fooishbar.org> wrote: > On Wed, 26 May 2021 at 14:44, Daniel Vetter <daniel@ffwll.ch> wrote: > > On Wed, May 26, 2021 at 02:08:19PM +0100, Daniel Stone wrote: > > > Are you saying that if a compositor imports a client-provided dmabuf > > > as an EGLImage to use as a source texture for its rendering, and then > > > provides it to VA-API or V4L2 to use as a media encode source (both > > > purely read-only ops), that these will both serialise against each > > > other? Like, my media decode job won't begin execution until the > > > composition read has fully retired? > > > > > > If so, a) good lord that hurts, and b) what are shared fences actually ... for? > > > > Shared is shared, I just meant to say that we always add the shared fence. > > So an explicit ioctl to add more shared fences is kinda pointless. > > > > So yeah on a good driver this will run in parallel. On a not-so-good > > driver (which currently includes amdgpu and panfrost) this will serialize, > > because those drivers don't have the concept of a non-exclusive fence for > > such shared buffers (amdgpu does not sync internally, but will sync as > > soon as it's cross-drm_file). > > When you say 'we always add the shared fence', add it to ... where? > And which shared fence? (I'm going to use 'fence' below to refer to > anything from literal sync_file to timeline-syncobj to userspace > fence.) In the current model, every time you submit anything to the gpu, we create a dma_fence to track this work. This dma_fence is attached as a shared fence to the dma_resv obj of every object in your working set. Clarifications you = both userspace or kernel, anything really, including fun stuff like writing PTEs, or clearing PTEs and then flushing TLBs working set = depends, but can be anything from "really just the buffers the current gpu submission uses" to "everything bound into a given gpu VM" This is the fence I'm talking about here. Since you can't escape this (not unless we do direct userspace submit with userspace memory fences) and since there's no distinction of the shared fences into "relevant for implicit sync" and "not relevant for implicit sync" there's really not much point in adding implicit read fences. For now at least, we might want to change this eventually. > I'll admit that I've typed out an argument twice for always export > from excl+shared, and always import to excl, results in oversync. And > I keep tying myself in knots trying to do it. It's arguably slightly > contrived, but here's my third attempt ... > > Vulkan Wayland client, full-flying-car-sync Wayland protocol, > Vulkan-based compositor. Part of the contract when the server exposes > that protocol is that it guarantees to do explicit sync in both > directions, so the client provides a fence at QueueSubmit time and the > server provides one back when releasing the image for return to ANI. > Neither side ever record fences into the dma_resv because they've > opted out by being fully explicit-aware. > > Now add media encode out on the side because you're streaming. The > compositor knows this is a transition between explicit and implicit > worlds, so it imports the client's fence into the exclusive dma_resv > slot, which makes sense: the media encode has to sync against the > client work, but is indifferent to the parallel compositor work. 
The > shared fence is exported back out so the compositor can union the > encode-finished fence with its composition-finished fence to send back > to the client with release/ANI. > > Now add a second media encode because you want a higher-quality local > capture to upload to YouTube later on. The compositor can do the exact > same import/export dance, and the two encodes can safely run in > parallel. Which is good. So the example which works is really clear ... > Where it starts to become complex is: what if your compositor is fully > explicit-aware but your clients aren't, so your compositor has more > import/export points to record into the resv? What if you aren't > actually a compositor but a full-blown media pipeline, where you have > a bunch of threads all launching reads in parallel, to the extent > where it's not practical to manage implicit/explicit transitions > globally, but each thread has to more pessimistically import and > export around each access? ... but the example where we oversync is hand-waving? :-P > I can make the relatively simple usecases work, but it really feels > like in practice we'll end up with massive oversync in some fairly > complex usecases, and we'll regret not having had it from the start, > plus people will just rely on implicit sync for longer because it has > better (more parallel) semantics in some usecases. Things fall apart in implicit sync if you have more than one logical writer into the same buffer. Trivial example is two images in one buffer, but you could also do funky stuff like interleaved/tiled rendering with _indepedent_ consumers. If the consumers are not independent, then you can again just stuff the two writer fences into the exclusive slot with the new ioctl (they'll get merged without additional overhead into one fence array fence). And the fundamental thing is: This is just not possible with implicit sync. There's only one fence slot (even if that resolves to an array of fences for all the producers), so anytime you do multiple independent things in the same buffer you either: - must split the buffers so there's again a clear&unique handoff at each stage of the pipeline - or use explicit sync So in your example, options are - per-client buffers, which you then blend into a composite buffer to handle the N implicit fences from N buffers into a single implicit fence for libva conversion. This single buffer then also allows you to again fan out to M libva encoders, or whatever it is that you fancy - explicit fencing and clients render into a single buffer with no copying, and libva encodes from that single buffer (but again needs explicit fences or it all comes crashing down) There's really no option C where you somehow do multiple implicitly fenced things into a single buffer and expect it to work out in parallel. -Daniel
Hey, On Wed, 26 May 2021 at 17:53, Daniel Vetter <daniel@ffwll.ch> wrote: > On Wed, May 26, 2021 at 5:13 PM Daniel Stone <daniel@fooishbar.org> wrote: > > > Shared is shared, I just meant to say that we always add the shared fence. > > > So an explicit ioctl to add more shared fences is kinda pointless. > > > > > > So yeah on a good driver this will run in parallel. On a not-so-good > > > driver (which currently includes amdgpu and panfrost) this will serialize, > > > because those drivers don't have the concept of a non-exclusive fence for > > > such shared buffers (amdgpu does not sync internally, but will sync as > > > soon as it's cross-drm_file). > > > > When you say 'we always add the shared fence', add it to ... where? > > And which shared fence? (I'm going to use 'fence' below to refer to > > anything from literal sync_file to timeline-syncobj to userspace > > fence.) > > In the current model, every time you submit anything to the gpu, we > create a dma_fence to track this work. This dma_fence is attached as a > shared fence to the dma_resv obj of every object in your working set. > Clarifications > you = both userspace or kernel, anything really, including fun stuff > like writing PTEs, or clearing PTEs and then flushing TLBs > working set = depends, but can be anything from "really just the > buffers the current gpu submission uses" to "everything bound into a > given gpu VM" > > This is the fence I'm talking about here. > > Since you can't escape this (not unless we do direct userspace submit > with userspace memory fences) and since there's no distinction of the > shared fences into "relevant for implicit sync" and "not relevant for > implicit sync" there's really not much point in adding implicit read > fences. For now at least, we might want to change this eventually. Yeah, I agree. My own clarification is that I'm talking about an explicit-first world, where synchronisation is done primarily through unknowable UMF, and falling back to implicit sync is a painful and expensive operation that we only do when we need to. So, definitely not on every CS (command submission aka execbuf aka vkQueueSubmit aka glFlush). > > I'll admit that I've typed out an argument twice for always export > > from excl+shared, and always import to excl, results in oversync. And > > I keep tying myself in knots trying to do it. It's arguably slightly > > contrived, but here's my third attempt ... > > > > Vulkan Wayland client, full-flying-car-sync Wayland protocol, > > Vulkan-based compositor. Part of the contract when the server exposes > > that protocol is that it guarantees to do explicit sync in both > > directions, so the client provides a fence at QueueSubmit time and the > > server provides one back when releasing the image for return to ANI. > > Neither side ever record fences into the dma_resv because they've > > opted out by being fully explicit-aware. > > > > Now add media encode out on the side because you're streaming. The > > compositor knows this is a transition between explicit and implicit > > worlds, so it imports the client's fence into the exclusive dma_resv > > slot, which makes sense: the media encode has to sync against the > > client work, but is indifferent to the parallel compositor work. The > > shared fence is exported back out so the compositor can union the > > encode-finished fence with its composition-finished fence to send back > > to the client with release/ANI. 
> > > > Now add a second media encode because you want a higher-quality local > > capture to upload to YouTube later on. The compositor can do the exact > > same import/export dance, and the two encodes can safely run in > > parallel. Which is good. > > So the example which works is really clear ... > > > Where it starts to become complex is: what if your compositor is fully > > explicit-aware but your clients aren't, so your compositor has more > > import/export points to record into the resv? What if you aren't > > actually a compositor but a full-blown media pipeline, where you have > > a bunch of threads all launching reads in parallel, to the extent > > where it's not practical to manage implicit/explicit transitions > > globally, but each thread has to more pessimistically import and > > export around each access? > > ... but the example where we oversync is hand-waving? > > :-P Hey, I said I tied myself into knots! Maybe it's because my brain is too deeply baked into implicit sync, maybe it's because the problem cases aren't actually problems. Who knows. I think what it comes down to is that we make it workable for (at least current-generation, before someone bakes it into Unity) Wayland compositors to work well with these modal switches, but really difficult for more complex and variable pipeline frameworks like GStreamer or PipeWire to work with it. > > I can make the relatively simple usecases work, but it really feels > > like in practice we'll end up with massive oversync in some fairly > > complex usecases, and we'll regret not having had it from the start, > > plus people will just rely on implicit sync for longer because it has > > better (more parallel) semantics in some usecases. > > Things fall apart in implicit sync if you have more than one logical > writer into the same buffer. Trivial example is two images in one > buffer, but you could also do funky stuff like interleaved/tiled > rendering with _indepedent_ consumers. If the consumers are not > independent, then you can again just stuff the two writer fences into > the exclusive slot with the new ioctl (they'll get merged without > additional overhead into one fence array fence). > > And the fundamental thing is: This is just not possible with implicit > sync. There's only one fence slot (even if that resolves to an array > of fences for all the producers), so anytime you do multiple > independent things in the same buffer you either: > - must split the buffers so there's again a clear&unique handoff at > each stage of the pipeline > - or use explicit sync Yeah no argument, this doesn't work & can't work in implicit sync. But what I'm talking about is having a single writer (serialised) and multiple readers (in parallel). Readers add to the shared slot, serialising their access against all earlier exclusive fences, and writers add to the exclusive slot, serialising their access against all earlier fences (both exclusive and shared). So if import can only add to the exclusive slot, then we can end up potentially serialising readers against each other. We want readers to land in the shared slot to be able to parallelise against each other and let writers serialise after them, no? > So in your example, options are > - per-client buffers, which you then blend into a composite buffer to > handle the N implicit fences from N buffers into a single implicit > fence for libva conversion. 
This single buffer then also allows you to > again fan out to M libva encoders, or whatever it is that you fancy > - explicit fencing and clients render into a single buffer with no > copying, and libva encodes from that single buffer (but again needs > explicit fences or it all comes crashing down) > > There's really no option C where you somehow do multiple implicitly > fenced things into a single buffer and expect it to work out in > parallel. All of my examples above are a single client buffer (GPU source which places a fence into the exclusive slot for when the colour buffer contents are fully realised), just working its way through multiple stages and APIs. Like, your single Dota2 window ends up in a Vulkan-based Wayland compositor, a pure VA-API encode stream to write high-quality AV1 to disk, and also an EGL pipeline which overlays your awesome logo and webcam stream before VA-API encoding to a lower-quality H.264 stream for Twitch. This isn't a convoluted example, it's literally what the non-geriatric millennials do all day. It's a lot of potential boundaries between implicit & explicit world, and if we've learned one thing from modifiers it's that we probably shouldn't underthink the boundaries. So: 1. Does every CS generate the appropriate resv entries (exclusive for write, shared for read) for every access to every buffer? I think the answer has to be no, because it's not necessarily viable in future. 2. If every CS doesn't generate the appropriate resv entries, do we go for the middle ground where we keep interactions with implicit sync implicit (e.g. every client API accessing any externally-visible BO populates the appropriate resv slot, but internal-only buffers get to skip it), or do we surface them and make it explicit (e.g. the Wayland explicit-sync protocol is a contract between client/compositor that the client doesn't have to populate the resv slots, because the compositor will ensure every access it makes is appropriate synchronised)? I think the latter, because the halfway house sounds really painful for questionable if any benefit, and makes it maybe impossible for us to one day deprecate implicit. 3. If we do surface everything and make userspace handle the implicit/explicit boundaries, do we make every explicit -> implicit boundary (via the import ioctl) populate the exclusive slot or allow it to choose? I think allow it to choose, because I don't understand what the restriction buys us. 4. Can the combination of dynamic modifier negotiation and explicit synchronisation let us deliver the EGLStreams promise before EGLStreams can? :) Cheers, Daniel
Hey, On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote: > On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote: > > Typing out the Wayland protocol isn't the hard bit. If we just need to > > copy and sed syncobj to weirdsyncobj, no problem really, and it gives > > us a six-month head start on painful compositor-internal surgery > > whilst we work on common infrastructure to ship userspace fences > > around (mappable dmabuf with the sync bracketing? FD where every > > read() gives you the current value? memfd? other?). > > I feel like I should elaborate more about timelines. In my earlier > reply, my commentary about timeline syncobj was mostly focused around > helping people avoid typing. That's not really the full story, > though, and I hope more context will help. > > First, let me say that timeline syncobj was designed as a mechanism to > implement VK_KHR_timeline_semaphore without inserting future fences > into the kernel. It's entirely designed around the needs of Vulkan > drivers, not really as a window-system primitive. The semantics are > designed around one driver communicating to another that new fences > have been added and it's safe to kick off more rendering. I'm not > convinced that it's the right object for window-systems and I'm also > not convinced that it's a good idea to try and make a version of it > that's a wrapper around a userspace memory fence. (I'm going to start > typing UMF for userspace memory fence because it's long to type out.) > > Why? Well, the fundamental problem with timelines in general is > trying to figure out when it's about to be done. But timeline syncobj > solves this for us! It gives us this fancy super-useful ioctl! > Right? Uh.... not as well as I'd like. Let's say we make a timeline > syncobj that's a wrapper around a userspace memory fence. What do we > do with that ioctl? As I mentioned above, the kernel doesn't have any > clue when it will be triggered so that ioctl turns into an actual > wait. That's no good because it creates unnecessary stalls. Yeah, I'm assuming that UMF will be a separate primitive. No problem. I also think that your submitted/completed thing is a non-problem: at this stage we're just throwing up our hands and admitting that we're letting userspace tie itself in knots, and giving it the tools to tie a sufficiently un-streetwise compositor in knots too. We're already crossing that Rubicon, so let's just embrace it and not try to design it out. Us compositors can handle the scheduling, really. > There's another potential solution here: Have each UMF be two > timelines: submitted and completed. At the start of every batch > that's supposed to trigger a UMF, we set the "submitted" side and > then, when it completes, we set the "completed" side. Ok, great, now > we can get at the "about to be done" with the submitted side, > implement the ioctl, and we're all good, right? Sadly, no. There's > no guarantee about how long a "batch" takes. So there's no universal > timeout the kernel can apply. Also, if it does time out, the kernel > doesn't know who to blame for the timeout and how to prevent itself > from getting in trouble again. The compositor does so, in theory, > given the right ioctls, it could detect the -ETIME and kill that > client. Not a great solution. > > The best option I've been able to come up with for this is some sort > of client-provided signal. 
Something where it says, as part of submit > or somewhere else, "I promise I'll be done soon" where that promise > comes with dire consequences if it's not. At that point, we can turn > the UMF and a particular wait value into a one-shot fence like a > dma_fence or sync_file, or signal a syncobj on it. If it ever times > out, we kick their context. In Vulkan terminology, they get > VK_ERROR_DEVICE_LOST. There are two important bits here: First, is > that it's based on a client-provided thing. With a fully timeline > model and wait-before-signal, we can't infer when something is about > to be done. Only the client knows when it submitted its last node in > the dependency graph and the whole mess is unblocked. Second, is that > the dma_fence is created within the client's driver context. If it's > created compositor-side, the kernel doesn't know who to blame if > things go badly. If we create it in the client, it's pretty easy to > make context death on -ETIME part of the contract. > > (Before danvet jumps in here and rants about how UMF -> dma_fence > isn't possible, I haven't forgotten. I'm pretending, for now, that > we've solved some of those problems.) Funny how we've come full circle to the original proposal here ... If we really want a kernel primitive for this - and I think it's a good idea, since can help surface 'badness' in a way which is observable by e.g. session managers in a way analogous to cgroup stats and controls - how about this for a counter-proposal? Client exports a FD for its context/queue and sends it to winsys as part of setup, compositor can ioctl() on that to kill it, which lands in the same zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger hammer than per-sync-point primitives, but you as a client have to accept the social contract that if you want to participate in a session, your context has to be making forward progress and you aren't writing cheques the compositor can't cash. I'm also going to pre-emptively agree with other-Dan; I'm extremely wary of anything which tries to make UMF look even a little bit like sync_file. The requirements to support them are so wildly different that I'd almost rather a totally orthogonal interface so that there's no danger of confusing the two. Us sophisticates on this thread can eat the mild annoyance of typing out separate codepaths, but it's much worse for anyone else who may look at a UMF wolf in dma_fence sheep's clothing then only later be substantially more annoyed when they realise that it's not anything like they thought it was. So let's keep sync_file for what it is, and for UMF since the usage is so radically different, build out whatever we do around making the uAPI as useful as possible for what we want to do with it. The real complexity in handling the difference between UMF and 'real' fences is in how they behave, not in how they look. > Another option is to just stall on the UMF until it's done. Yeah, > kind-of terrible and high-latency, but it always works and doesn't > involve any complex logic to kill clients. If a client never gets > around to signaling a fence, it just never repaints. The compositor > keeps going like nothing's wrong. Maybe, if the client submits lots > of frames without ever triggering, it'll hit some max queue depth > somewhere and kill it but that's it. More likely, the client's > vkAcquireNextImage will start timing out and it'll crash. > > I suspect where we might actually land is some combination of the two > depending on client choice. 
If the client wants to be dumb, it gets > the high-latency always-works path. If the client really wants > lowest-latency VRR, it has to take the smarter path and risk > VK_ERROR_DEVICE_LOST if it misses too far. We already have to handle unresponsive clients. If your browser livelocks today (say because it's Chrome and you hotunplug your monitor at the wrong time with active media playback in an inactive tab in an inactive window ... hypothetically), your Wayland server notices that it isn't responding to pings, throws up the 'do you want to force-quit?' dialog and kills the client; it's actually really simple logic. So we just hook unsignaled fences up to the same. (And, if we have the context-kill primitive, trigger that on our way out.) So yeah, we already have all the complexity points to put particular surface trees in limbo (thanks to subsurface sync mode), we already have all the complexity points to separate realised surface trees from pixels on screen, and we already have the complexity points for different parts of the surface trees being rendered at different times. Checking on fence progression is just a little more typing around those interface points which already exist, and zapping clients is utterly trivial. > But the point of all of this is, neither of the above two paths have > anything to do with the compositor calling a "wait for submit" ioctl. > Building a design around that and baking it into protocol is, IMO, a > mistake. I don't see any valid way to handle this mess without "wait > for submit" either not existing or existing only client-side for the > purposes of WSI. I'm still on the fence (sorry) about a wait-before-submit ioctl. For the sync_file-based timeline syncobjs that we have today, yes it is helpful, and we do already have it, it's just the wrong shape in being sleep rather than epoll. For UMF, taking it as a given that the kernel really has no visibility at all into syncpoint progress, then the kernel is conceptually a worse place to spin-sleep than userspace is, because why account the CPU burn to a kthread rather than a real PID, and lose latency/efficiency on context switches when you do wake up? But also, the kernel is conceptually the best place to spin-sleep, because it can fuse waits and do better wakeup quantisation than userspace can. And I'm still hopeful that the IHVs and Windows can both step away from the postmodern 'synchronisation doesn't mean anything anymore, just poll in a lap-burning loop' approach that we've been presented (sorry) with, where we at least get doorbells which allow the kernel to do polling much smarter than quantising timers ('this fence might not have triggered yet, but _something_ happened which might have triggered it so why not check?'). On the other other hand, the only winsys case for burning poll in a tight loop is flipping as quickly as possible straight to a VRR display. In that case, you're definitely running on mains power so you're just melting the polar ice caps rather than your lap, and you've got everything fully lit up anyway so the power cost of polling is immaterial. For FRR, the compositor already has a fixed deadline at which it will wake up and make a hard binding decision about which image to present - this includes XR as well. So we don't have to worry about optimising a polling loop, because there isn't one: we wake up once, we check once, and if the client's missed then too bad, try again next frame.
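For reference, the "sleep rather than epoll" shape being described is roughly the sketch below, using libdrm's existing timeline-syncobj wait. The helper name and the surrounding compositor plumbing are assumptions for illustration, not anything from this series; the point is simply that the existing primitive is a blocking wait against an absolute deadline rather than an fd the compositor can drop into its event loop.

#include <stdint.h>
#include <xf86drm.h>    /* drmSyncobjTimelineWait(), DRM_SYNCOBJ_WAIT_FLAGS_* */

/* Sleep until a fence has been *submitted* for (syncobj, point), or until
 * the absolute CLOCK_MONOTONIC deadline passes.  Returns 0 on success and
 * a negative error (-ETIME on timeout) otherwise.  Note that this parks a
 * thread; there is nothing here to hand to epoll. */
static int wait_for_submit(int drm_fd, uint32_t syncobj, uint64_t point,
                           int64_t abs_timeout_ns)
{
        uint32_t first_signaled;

        return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1,
                                      abs_timeout_ns,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      &first_signaled);
}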
As you can see, much like userspace memory fences, my position on which way to go here is not knowable upfront, and depends on when exactly you observe it. Hopefully someone can come back with an argument compelling enough either way that I have something better to do than to try to pun my way out of having more hands than Ganesh. I don't think it's material to the design or implementation of winsys support though. Cheers, Daniel
On May 26, 2021 13:15:08 Daniel Stone <daniel@fooishbar.org> wrote: > Hey, > > On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote: >> On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote: >>> Typing out the Wayland protocol isn't the hard bit. If we just need to >>> copy and sed syncobj to weirdsyncobj, no problem really, and it gives >>> us a six-month head start on painful compositor-internal surgery >>> whilst we work on common infrastructure to ship userspace fences >>> around (mappable dmabuf with the sync bracketing? FD where every >>> read() gives you the current value? memfd? other?). >> >> I feel like I should elaborate more about timelines. In my earlier >> reply, my commentary about timeline syncobj was mostly focused around >> helping people avoid typing. That's not really the full story, >> though, and I hope more context will help. >> >> First, let me say that timeline syncobj was designed as a mechanism to >> implement VK_KHR_timeline_semaphore without inserting future fences >> into the kernel. It's entirely designed around the needs of Vulkan >> drivers, not really as a window-system primitive. The semantics are >> designed around one driver communicating to another that new fences >> have been added and it's safe to kick off more rendering. I'm not >> convinced that it's the right object for window-systems and I'm also >> not convinced that it's a good idea to try and make a version of it >> that's a wrapper around a userspace memory fence. (I'm going to start >> typing UMF for userspace memory fence because it's long to type out.) >> >> Why? Well, the fundamental problem with timelines in general is >> trying to figure out when it's about to be done. But timeline syncobj >> solves this for us! It gives us this fancy super-useful ioctl! >> Right? Uh.... not as well as I'd like. Let's say we make a timeline >> syncobj that's a wrapper around a userspace memory fence. What do we >> do with that ioctl? As I mentioned above, the kernel doesn't have any >> clue when it will be triggered so that ioctl turns into an actual >> wait. That's no good because it creates unnecessary stalls. > > Yeah, I'm assuming that UMF will be a separate primitive. No problem. > I also think that your submitted/completed thing is a non-problem: at > this stage we're just throwing up our hands and admitting that we're > letting userspace tie itself in knots, and giving it the tools to tie > a sufficiently un-streetwise compositor in knots too. We're already > crossing that Rubicon, so let's just embrace it and not try to design > it out. Us compositors can handle the scheduling, really. Ok, good. I think we're on the same page. > >> There's another potential solution here: Have each UMF be two >> timelines: submitted and completed. At the start of every batch >> that's supposed to trigger a UMF, we set the "submitted" side and >> then, when it completes, we set the "completed" side. Ok, great, now >> we can get at the "about to be done" with the submitted side, >> implement the ioctl, and we're all good, right? Sadly, no. There's >> no guarantee about how long a "batch" takes. So there's no universal >> timeout the kernel can apply. Also, if it does time out, the kernel >> doesn't know who to blame for the timeout and how to prevent itself >> from getting in trouble again. The compositor does so, in theory, >> given the right ioctls, it could detect the -ETIME and kill that >> client. Not a great solution. 
>> >> The best option I've been able to come up with for this is some sort >> of client-provided signal. Something where it says, as part of submit >> or somewhere else, "I promise I'll be done soon" where that promise >> comes with dire consequences if it's not. At that point, we can turn >> the UMF and a particular wait value into a one-shot fence like a >> dma_fence or sync_file, or signal a syncobj on it. If it ever times >> out, we kick their context. In Vulkan terminology, they get >> VK_ERROR_DEVICE_LOST. There are two important bits here: First, is >> that it's based on a client-provided thing. With a fully timeline >> model and wait-before-signal, we can't infer when something is about >> to be done. Only the client knows when it submitted its last node in >> the dependency graph and the whole mess is unblocked. Second, is that >> the dma_fence is created within the client's driver context. If it's >> created compositor-side, the kernel doesn't know who to blame if >> things go badly. If we create it in the client, it's pretty easy to >> make context death on -ETIME part of the contract. >> >> (Before danvet jumps in here and rants about how UMF -> dma_fence >> isn't possible, I haven't forgotten. I'm pretending, for now, that >> we've solved some of those problems.) > > Funny how we've come full circle to the original proposal here ... > > If we really want a kernel primitive for this - and I think it's a > good idea, since it can help surface 'badness' in a way which is > observable by e.g. session managers, analogous to cgroup stats > and controls - how about this for a counter-proposal? Client exports a > FD for its context/queue and sends it to winsys as part of setup, > compositor can ioctl() on that to kill it, which lands in the same > zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger > hammer than per-sync-point primitives, but you as a client have to > accept the social contract that if you want to participate in a > session, your context has to be making forward progress and you aren't > writing cheques the compositor can't cash. The compositor already has that. It can kick the client's Wayland protocol connection. Banning the context from the kernel might be nice too but kicking it is probably sufficient. Side-note to danvet: Do we need a plan for UMF with persistent contexts? My gut says that's a very bad idea but this made me think I should at least pose the question. > I'm also going to pre-emptively agree with other-Dan; I'm extremely > wary of anything which tries to make UMF look even a little bit like > sync_file. The requirements to support them are so wildly different > that I'd almost rather a totally orthogonal interface so that there's > no danger of confusing the two. Us sophisticates on this thread can > eat the mild annoyance of typing out separate codepaths, but it's much > worse for anyone else who may look at a UMF wolf in dma_fence sheep's > clothing and then only later be substantially more annoyed when they > realise that it's not anything like they thought it was. > > So let's keep sync_file for what it is, and for UMF, since the usage is > so radically different, build out whatever we do around making the > uAPI as useful as possible for what we want to do with it. The real > complexity in handling the difference between UMF and 'real' fences is > in how they behave, not in how they look. Sounds good. > >> Another option is to just stall on the UMF until it's done.
Yeah, >> kind-of terrible and high-latency, but it always works and doesn't >> involve any complex logic to kill clients. If a client never gets >> around to signaling a fence, it just never repaints. The compositor >> keeps going like nothing's wrong. Maybe, if the client submits lots >> of frames without ever triggering, it'll hit some max queue depth >> somewhere and kill it but that's it. More likely, the client's >> vkAcquireNextImage will start timing out and it'll crash. >> >> I suspect where we might actually land is some combination of the two >> depending on client choice. If the client wants to be dumb, it gets >> the high-latency always-works path. If the client really wants >> lowest-latency VRR, it has to take the smarter path and risk >> VK_ERROR_DEVICE_LOST if it misses too far. > > We already have to handle unresponsive clients. If your browser > livelocks today (say because it's Chrome and you hotunplug your > monitor at the wrong time with active media playback in an inactive > tab in an inactive window ... hypothetically), That's an oddly specific hypothetical... > yourr Wayland server > notices that it isn't responding to pings, throws up the 'do you want > to force-quit?' dialog and kills the client; it's actually really > simple logic. So we just hook unsignaled fences up to the same. (And, > if we have the context-kill primitive, trigger that on our way out.) > > So yeah, we already have all the complexity points to put particular > surface trees in limbo (thanks to subsurface sync mode), we already > have all the complexity points to separate realised surface trees from > pixels on screen, and we already have the complexity points for > different parts of the surface trees being rendered at different > times. Checking on fence progression is just a little more typing > around those interface points which already exist, and zapping clients > is utterly trivial.
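To make the "hook unsignaled fences up to the same logic" point concrete for the dma_fence/sync_file world we have today: sync_file fds are pollable, so a compositor can check fence progress without blocking and reuse its existing unresponsive-client kill path. A minimal sketch, with made-up helper names and whatever deadline policy the compositor already applies:

#include <poll.h>
#include <stdbool.h>
#include <wayland-server-core.h>

/* A sync_file fd reports POLLIN once its fence has signaled, so this is a
 * non-blocking "has it signaled yet?" check. */
static bool fence_signaled(int sync_file_fd)
{
        struct pollfd pfd = { .fd = sync_file_fd, .events = POLLIN };

        return poll(&pfd, 1, 0) > 0;
}

/* Called from the compositor's repaint/timer tick for each client buffer
 * still waiting on a fence.  'deadline_passed' is whatever policy the
 * compositor already uses for clients that stop answering pings. */
static void check_client_fence(struct wl_client *client, int sync_file_fd,
                               bool deadline_passed)
{
        if (deadline_passed && !fence_signaled(sync_file_fd))
                wl_client_destroy(client); /* same path as the force-quit dialog */
}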
Am 26.05.21 um 18:52 schrieb Daniel Vetter: > [SNIP] >> I can make the relatively simple usecases work, but it really feels >> like in practice we'll end up with massive oversync in some fairly >> complex usecases, and we'll regret not having had it from the start, >> plus people will just rely on implicit sync for longer because it has >> better (more parallel) semantics in some usecases. > Things fall apart in implicit sync if you have more than one logical > writer into the same buffer. Trivial example is two images in one > buffer, but you could also do funky stuff like interleaved/tiled > rendering with _independent_ consumers. If the consumers are not > independent, then you can again just stuff the two writer fences into > the exclusive slot with the new ioctl (they'll get merged without > additional overhead into one fence array fence). > > And the fundamental thing is: This is just not possible with implicit > sync. There's only one fence slot (even if that resolves to an array > of fences for all the producers), so anytime you do multiple > independent things in the same buffer you either: > - must split the buffers so there's again a clear&unique handoff at > each stage of the pipeline > - or use explicit sync Well, exactly that is the problem we had with amdgpu and why we came up with the special handling there. And you don't even need two images in one buffer, just special hardware which handles multiple writers gracefully is sufficient. The simplest example is a depth buffer, but we also have things like ordered append for ring buffers. > So in your example, options are > - per-client buffers, which you then blend into a composite buffer to > handle the N implicit fences from N buffers into a single implicit > fence for libva conversion. This single buffer then also allows you to > again fan out to M libva encoders, or whatever it is that you fancy > - explicit fencing and clients render into a single buffer with no > copying, and libva encodes from that single buffer (but again needs > explicit fences or it all comes crashing down) > > There's really no option C where you somehow do multiple implicitly > fenced things into a single buffer and expect it to work out in > parallel. You could also fall back to a dummy submission, e.g. compose the image with multiple engines in parallel and then make a single dummy submission to collect all the shared fences into the single exclusive fence. But this needs an extra IOCTL and unfortunately the stack above also needs to know when to make that dummy submission. Christian. > -Daniel
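A sketch of the "stuff the two writer fences into the exclusive slot with the new ioctl" option, assuming the uAPI proposed in this series (struct and ioctl names as in the patch further down; error handling omitted). Because each import folds the incoming sync_file together with everything already on the dma_resv, importing each writer's sync_file in turn leaves a single exclusive fence covering all of them:

#include <linux/dma-buf.h>      /* with this series applied */
#include <sys/ioctl.h>

/* Attach one writer's sync_file as (part of) the exclusive fence on the
 * dma-buf.  The kernel combines it with every fence already on the
 * dma_resv, so calling this once per writer ends up with a single
 * exclusive fence covering all of them. */
static int attach_writer_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_sync_file arg = {
                .flags = DMA_BUF_SYNC_RW,
                .fd = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}

So for the multi-engine compose case above, the sequence would just be attach_writer_fence(dmabuf_fd, engine_a_fd); attach_writer_fence(dmabuf_fd, engine_b_fd); and any downstream implicit-sync consumer then waits for both writers, with no dummy submission involved.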
On Wed, May 26, 2021 at 11:32:49PM -0500, Jason Ekstrand wrote: > On May 26, 2021 13:15:08 Daniel Stone <daniel@fooishbar.org> wrote: > > > Hey, > > > > On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote: > > > On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote: > > > > Typing out the Wayland protocol isn't the hard bit. If we just need to > > > > copy and sed syncobj to weirdsyncobj, no problem really, and it gives > > > > us a six-month head start on painful compositor-internal surgery > > > > whilst we work on common infrastructure to ship userspace fences > > > > around (mappable dmabuf with the sync bracketing? FD where every > > > > read() gives you the current value? memfd? other?). > > > > > > I feel like I should elaborate more about timelines. In my earlier > > > reply, my commentary about timeline syncobj was mostly focused around > > > helping people avoid typing. That's not really the full story, > > > though, and I hope more context will help. > > > > > > First, let me say that timeline syncobj was designed as a mechanism to > > > implement VK_KHR_timeline_semaphore without inserting future fences > > > into the kernel. It's entirely designed around the needs of Vulkan > > > drivers, not really as a window-system primitive. The semantics are > > > designed around one driver communicating to another that new fences > > > have been added and it's safe to kick off more rendering. I'm not > > > convinced that it's the right object for window-systems and I'm also > > > not convinced that it's a good idea to try and make a version of it > > > that's a wrapper around a userspace memory fence. (I'm going to start > > > typing UMF for userspace memory fence because it's long to type out.) > > > > > > Why? Well, the fundamental problem with timelines in general is > > > trying to figure out when it's about to be done. But timeline syncobj > > > solves this for us! It gives us this fancy super-useful ioctl! > > > Right? Uh.... not as well as I'd like. Let's say we make a timeline > > > syncobj that's a wrapper around a userspace memory fence. What do we > > > do with that ioctl? As I mentioned above, the kernel doesn't have any > > > clue when it will be triggered so that ioctl turns into an actual > > > wait. That's no good because it creates unnecessary stalls. > > > > Yeah, I'm assuming that UMF will be a separate primitive. No problem. > > I also think that your submitted/completed thing is a non-problem: at > > this stage we're just throwing up our hands and admitting that we're > > letting userspace tie itself in knots, and giving it the tools to tie > > a sufficiently un-streetwise compositor in knots too. We're already > > crossing that Rubicon, so let's just embrace it and not try to design > > it out. Us compositors can handle the scheduling, really. > > Ok, good. I think we're on the same page. > > > > > > There's another potential solution here: Have each UMF be two > > > timelines: submitted and completed. At the start of every batch > > > that's supposed to trigger a UMF, we set the "submitted" side and > > > then, when it completes, we set the "completed" side. Ok, great, now > > > we can get at the "about to be done" with the submitted side, > > > implement the ioctl, and we're all good, right? Sadly, no. There's > > > no guarantee about how long a "batch" takes. So there's no universal > > > timeout the kernel can apply. 
Also, if it does time out, the kernel > > > doesn't know who to blame for the timeout and how to prevent itself > > > from getting in trouble again. The compositor does so, in theory, > > > given the right ioctls, it could detect the -ETIME and kill that > > > client. Not a great solution. > > > > > > The best option I've been able to come up with for this is some sort > > > of client-provided signal. Something where it says, as part of submit > > > or somewhere else, "I promise I'll be done soon" where that promise > > > comes with dire consequences if it's not. At that point, we can turn > > > the UMF and a particular wait value into a one-shot fence like a > > > dma_fence or sync_file, or signal a syncobj on it. If it ever times > > > out, we kick their context. In Vulkan terminology, they get > > > VK_ERROR_DEVICE_LOST. There are two important bits here: First, is > > > that it's based on a client-provided thing. With a fully timeline > > > model and wait-before-signal, we can't infer when something is about > > > to be done. Only the client knows when it submitted its last node in > > > the dependency graph and the whole mess is unblocked. Second, is that > > > the dma_fence is created within the client's driver context. If it's > > > created compositor-side, the kernel doesn't know who to blame if > > > things go badly. If we create it in the client, it's pretty easy to > > > make context death on -ETIME part of the contract. > > > > > > (Before danvet jumps in here and rants about how UMF -> dma_fence > > > isn't possible, I haven't forgotten. I'm pretending, for now, that > > > we've solved some of those problems.) > > > > Funny how we've come full circle to the original proposal here ... > > > > If we really want a kernel primitive for this - and I think it's a > > good idea, since can help surface 'badness' in a way which is > > observable by e.g. session managers in a way analogous to cgroup stats > > and controls - how about this for a counter-proposal? Client exports a > > FD for its context/queue and sends it to winsys as part of setup, > > compositor can ioctl() on that to kill it, which lands in the same > > zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger > > hammer than per-sync-point primitives, but you as a client have to > > accept the social contract that if you want to participate in a > > session, your context has to be making forward progress and you aren't > > writing cheques the compositor can't cash. > > The compositor already has that. It can kick the client's Wayland protocol > connection. Banning the context from the kernel might be nice too but > kicking it is probably sufficient. > > Side-note to danvet: Do we need a plan for UMF with persistent contexts? My > gut says that's a very bad idea but this made me think I should say least > pose the question. Yeah UMF ctx mode needs to require non-persisten ctx mode too or it just goes horribly wrong everywhere. > > I'm also going to pre-emptively agree with other-Dan; I'm extremely > > wary of anything which tries to make UMF look even a little bit like > > sync_file. The requirements to support them are so wildly different > > that I'd almost rather a totally orthogonal interface so that there's > > no danger of confusing the two. 
Us sophisticates on this thread can > > eat the mild annoyance of typing out separate codepaths, but it's much > > worse for anyone else who may look at a UMF wolf in dma_fence sheep's > > clothing then only later be substantially more annoyed when they > > realise that it's not anything like they thought it was. > > > > So let's keep sync_file for what it is, and for UMF since the usage is > > so radically different, build out whatever we do around making the > > uAPI as useful as possible for what we want to do with it. The real > > complexity in handling the difference between UMF and 'real' fences is > > in how they behave, not in how they look. > > Sounds good. > > > > > > Another option is to just stall on the UMF until it's done. Yeah, > > > kind-of terrible and high-latency, but it always works and doesn't > > > involve any complex logic to kill clients. If a client never gets > > > around to signaling a fence, it just never repaints. The compositor > > > keeps going like nothing's wrong. Maybe, if the client submits lots > > > of frames without ever triggering, it'll hit some max queue depth > > > somewhere and kill it but that's it. More likely, the client's > > > vkAcquireNextImage will start timing out and it'll crash. > > > > > > I suspect where we might actually land is some combination of the two > > > depending on client choice. If the client wants to be dumb, it gets > > > the high-latency always-works path. If the client really wants > > > lowest-latency VRR, it has to take the smarter path and risk > > > VK_ERROR_DEVICE_LOST if it misses too far. > > > > We already have to handle unresponsive clients. If your browser > > livelocks today (say because it's Chrome and you hotunplug your > > monitor at the wrong time with active media playback in an inactive > > tab in an inactive window ... hypothetically), > > That's an oddly specific hypothetical... > > > yourr Wayland server > > notices that it isn't responding to pings, throws up the 'do you want > > to force-quit?' dialog and kills the client; it's actually really > > simple logic. So we just hook unsignaled fences up to the same. (And, > > if we have the context-kill primitive, trigger that on our way out.) > > > > So yeah, we already have all the complexity points to put particular > > surface trees in limbo (thanks to subsurface sync mode), we already > > have all the complexity points to separate realised surface trees from > > pixels on screen, and we already have the complexity points for > > different parts of the surface trees being rendered at different > > times. Checking on fence progression is just a little more typing > > around those interface points which already exist, and zapping clients > > is utterly trivial. > >
Am 27.05.21 um 12:19 schrieb Daniel Vetter: > On Wed, May 26, 2021 at 11:32:49PM -0500, Jason Ekstrand wrote: >> On May 26, 2021 13:15:08 Daniel Stone <daniel@fooishbar.org> wrote: >> >>> Hey, >>> >>> On Wed, 26 May 2021 at 16:24, Jason Ekstrand <jason@jlekstrand.net> wrote: >>>> On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@fooishbar.org> wrote: >>>>> Typing out the Wayland protocol isn't the hard bit. If we just need to >>>>> copy and sed syncobj to weirdsyncobj, no problem really, and it gives >>>>> us a six-month head start on painful compositor-internal surgery >>>>> whilst we work on common infrastructure to ship userspace fences >>>>> around (mappable dmabuf with the sync bracketing? FD where every >>>>> read() gives you the current value? memfd? other?). >>>> I feel like I should elaborate more about timelines. In my earlier >>>> reply, my commentary about timeline syncobj was mostly focused around >>>> helping people avoid typing. That's not really the full story, >>>> though, and I hope more context will help. >>>> >>>> First, let me say that timeline syncobj was designed as a mechanism to >>>> implement VK_KHR_timeline_semaphore without inserting future fences >>>> into the kernel. It's entirely designed around the needs of Vulkan >>>> drivers, not really as a window-system primitive. The semantics are >>>> designed around one driver communicating to another that new fences >>>> have been added and it's safe to kick off more rendering. I'm not >>>> convinced that it's the right object for window-systems and I'm also >>>> not convinced that it's a good idea to try and make a version of it >>>> that's a wrapper around a userspace memory fence. (I'm going to start >>>> typing UMF for userspace memory fence because it's long to type out.) >>>> >>>> Why? Well, the fundamental problem with timelines in general is >>>> trying to figure out when it's about to be done. But timeline syncobj >>>> solves this for us! It gives us this fancy super-useful ioctl! >>>> Right? Uh.... not as well as I'd like. Let's say we make a timeline >>>> syncobj that's a wrapper around a userspace memory fence. What do we >>>> do with that ioctl? As I mentioned above, the kernel doesn't have any >>>> clue when it will be triggered so that ioctl turns into an actual >>>> wait. That's no good because it creates unnecessary stalls. >>> Yeah, I'm assuming that UMF will be a separate primitive. No problem. >>> I also think that your submitted/completed thing is a non-problem: at >>> this stage we're just throwing up our hands and admitting that we're >>> letting userspace tie itself in knots, and giving it the tools to tie >>> a sufficiently un-streetwise compositor in knots too. We're already >>> crossing that Rubicon, so let's just embrace it and not try to design >>> it out. Us compositors can handle the scheduling, really. >> Ok, good. I think we're on the same page. >> >>>> There's another potential solution here: Have each UMF be two >>>> timelines: submitted and completed. At the start of every batch >>>> that's supposed to trigger a UMF, we set the "submitted" side and >>>> then, when it completes, we set the "completed" side. Ok, great, now >>>> we can get at the "about to be done" with the submitted side, >>>> implement the ioctl, and we're all good, right? Sadly, no. There's >>>> no guarantee about how long a "batch" takes. So there's no universal >>>> timeout the kernel can apply. 
Also, if it does time out, the kernel >>>> doesn't know who to blame for the timeout and how to prevent itself >>>> from getting in trouble again. The compositor does so, in theory, >>>> given the right ioctls, it could detect the -ETIME and kill that >>>> client. Not a great solution. >>>> >>>> The best option I've been able to come up with for this is some sort >>>> of client-provided signal. Something where it says, as part of submit >>>> or somewhere else, "I promise I'll be done soon" where that promise >>>> comes with dire consequences if it's not. At that point, we can turn >>>> the UMF and a particular wait value into a one-shot fence like a >>>> dma_fence or sync_file, or signal a syncobj on it. If it ever times >>>> out, we kick their context. In Vulkan terminology, they get >>>> VK_ERROR_DEVICE_LOST. There are two important bits here: First, is >>>> that it's based on a client-provided thing. With a fully timeline >>>> model and wait-before-signal, we can't infer when something is about >>>> to be done. Only the client knows when it submitted its last node in >>>> the dependency graph and the whole mess is unblocked. Second, is that >>>> the dma_fence is created within the client's driver context. If it's >>>> created compositor-side, the kernel doesn't know who to blame if >>>> things go badly. If we create it in the client, it's pretty easy to >>>> make context death on -ETIME part of the contract. >>>> >>>> (Before danvet jumps in here and rants about how UMF -> dma_fence >>>> isn't possible, I haven't forgotten. I'm pretending, for now, that >>>> we've solved some of those problems.) >>> Funny how we've come full circle to the original proposal here ... >>> >>> If we really want a kernel primitive for this - and I think it's a >>> good idea, since can help surface 'badness' in a way which is >>> observable by e.g. session managers in a way analogous to cgroup stats >>> and controls - how about this for a counter-proposal? Client exports a >>> FD for its context/queue and sends it to winsys as part of setup, >>> compositor can ioctl() on that to kill it, which lands in the same >>> zap/zap/zap/zap/ban codepath as GPU hangs do today. It's a bigger >>> hammer than per-sync-point primitives, but you as a client have to >>> accept the social contract that if you want to participate in a >>> session, your context has to be making forward progress and you aren't >>> writing cheques the compositor can't cash. >> The compositor already has that. It can kick the client's Wayland protocol >> connection. Banning the context from the kernel might be nice too but >> kicking it is probably sufficient. >> >> Side-note to danvet: Do we need a plan for UMF with persistent contexts? My >> gut says that's a very bad idea but this made me think I should say least >> pose the question. > Yeah UMF ctx mode needs to require non-persisten ctx mode too or it just > goes horribly wrong everywhere. > >>> I'm also going to pre-emptively agree with other-Dan; I'm extremely >>> wary of anything which tries to make UMF look even a little bit like >>> sync_file. The requirements to support them are so wildly different >>> that I'd almost rather a totally orthogonal interface so that there's >>> no danger of confusing the two. 
Us sophisticates on this thread can >>> eat the mild annoyance of typing out separate codepaths, but it's much >>> worse for anyone else who may look at a UMF wolf in dma_fence sheep's >>> clothing then only later be substantially more annoyed when they >>> realise that it's not anything like they thought it was. >>> >>> So let's keep sync_file for what it is, and for UMF since the usage is >>> so radically different, build out whatever we do around making the >>> uAPI as useful as possible for what we want to do with it. The real >>> complexity in handling the difference between UMF and 'real' fences is >>> in how they behave, not in how they look. >> Sounds good. >> >>>> Another option is to just stall on the UMF until it's done. Yeah, >>>> kind-of terrible and high-latency, but it always works and doesn't >>>> involve any complex logic to kill clients. If a client never gets >>>> around to signaling a fence, it just never repaints. The compositor >>>> keeps going like nothing's wrong. Maybe, if the client submits lots >>>> of frames without ever triggering, it'll hit some max queue depth >>>> somewhere and kill it but that's it. More likely, the client's >>>> vkAcquireNextImage will start timing out and it'll crash. >>>> >>>> I suspect where we might actually land is some combination of the two >>>> depending on client choice. If the client wants to be dumb, it gets >>>> the high-latency always-works path. If the client really wants >>>> lowest-latency VRR, it has to take the smarter path and risk >>>> VK_ERROR_DEVICE_LOST if it misses too far. >>> We already have to handle unresponsive clients. If your browser >>> livelocks today (say because it's Chrome and you hotunplug your >>> monitor at the wrong time with active media playback in an inactive >>> tab in an inactive window ... hypothetically), >> That's an oddly specific hypothetical... >> >>> yourr Wayland server >>> notices that it isn't responding to pings, throws up the 'do you want >>> to force-quit?' dialog and kills the client; it's actually really >>> simple logic. So we just hook unsignaled fences up to the same. (And, >>> if we have the context-kill primitive, trigger that on our way out.) >>> >>> So yeah, we already have all the complexity points to put particular >>> surface trees in limbo (thanks to subsurface sync mode), we already >>> have all the complexity points to separate realised surface trees from >>> pixels on screen, and we already have the complexity points for >>> different parts of the surface trees being rendered at different >>> times. Checking on fence progression is just a little more typing >>> around those interface points which already exist, and zapping clients >>> is utterly trivial. >>
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7679562b57bfc..c9d6b9195c00c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
 	put_unused_fd(fd);
 	return ret;
 }
+
+static long dma_buf_import_sync_file(struct dma_buf *dmabuf,
+				     const void __user *user_data)
+{
+	struct dma_buf_sync_file arg;
+	struct dma_fence *fence, *singleton = NULL;
+	int ret = 0;
+
+	if (copy_from_user(&arg, user_data, sizeof(arg)))
+		return -EFAULT;
+
+	if (arg.flags != DMA_BUF_SYNC_RW)
+		return -EINVAL;
+
+	fence = sync_file_get_fence(arg.fd);
+	if (!fence)
+		return -EINVAL;
+
+	dma_resv_lock(dmabuf->resv, NULL);
+
+	ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
+	if (!ret && singleton)
+		dma_resv_add_excl_fence(dmabuf->resv, singleton);
+
+	dma_resv_unlock(dmabuf->resv);
+
+	dma_fence_put(fence);
+
+	return ret;
+}
 #endif
 
 static long dma_buf_ioctl(struct file *file,
@@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file,
 #if IS_ENABLED(CONFIG_SYNC_FILE)
 	case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
 		return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
+	case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
+		return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
 #endif
 
 	default:
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
index f902cadcbdb56..75fdde4800267 100644
--- a/include/uapi/linux/dma-buf.h
+++ b/include/uapi/linux/dma-buf.h
@@ -70,5 +70,6 @@ struct dma_buf_sync_file {
 #define DMA_BUF_SET_NAME_A	_IOW(DMA_BUF_BASE, 1, u32)
 #define DMA_BUF_SET_NAME_B	_IOW(DMA_BUF_BASE, 1, u64)
 #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE	_IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
+#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE	_IOW(DMA_BUF_BASE, 3, struct dma_buf_sync)
 
 #endif
This patch is analogous to the previous sync file export patch in that it allows you to import a sync_file into a dma-buf. Unlike the previous patch, however, this does add genuinely new functionality to dma-buf. Without this, the only way to attach a sync_file to a dma-buf is to submit a batch to your driver of choice which waits on the sync_file and claims to write to the dma-buf. Even if said batch is a no-op, a submit is typically way more overhead than just attaching a fence. A submit may also imply extra synchronization with other work because it happens on a hardware queue.

In the Vulkan world, this is useful for dealing with the out-fence from vkQueuePresent. Current Linux window-systems (X11, Wayland, etc.) all rely on dma-buf implicit sync. Since Vulkan is an explicit sync API, we get a set of fences (VkSemaphores) in vkQueuePresent and have to stash those as an exclusive (write) fence on the dma-buf. We handle it in Mesa today with the above mentioned dummy submit trick. This ioctl would allow us to set it directly without the dummy submit.

This may also open up possibilities for GPU drivers to move away from implicit sync for their kernel driver uAPI and instead provide sync files and rely on dma-buf import/export for communicating with other implicit sync clients.

We make the explicit choice here to only allow setting RW fences which translates to an exclusive fence on the dma_resv. There's no use for read-only fences for communicating with other implicit sync userspace and any such attempts are likely to be racy at best.

When we go to insert the RW fence, the actual fence we set as the new exclusive fence is a combination of the sync_file provided by the user and all the other fences on the dma_resv. This ensures that the newly added exclusive fence will never signal before the old one would have and ensures that we don't break any dma_resv contracts. We require userspace to specify RW in the flags for symmetry with the export ioctl and in case we ever want to support read fences in the future.

There is one downside here that's worth documenting: If two clients writing to the same dma-buf using this API race with each other, their actions on the dma-buf may happen in parallel or in an undefined order. Both with and without this API, the pattern is the same: Collect all the fences on dma-buf, submit work which depends on said fences, and then set a new exclusive (write) fence on the dma-buf which depends on said work. The difference is that, when it's all handled by the GPU driver's submit ioctl, the three operations happen atomically under the dma_resv lock. If two userspace submits race, one will happen before the other. You aren't guaranteed which but you are guaranteed that they're strictly ordered. If userspace manages the fences itself, then these three operations happen separately and the two render operations may happen genuinely in parallel or get interleaved. However, this is a case of userspace racing with itself. As long as we ensure userspace can't back the kernel into a corner, it should be fine.

v2 (Jason Ekstrand):
 - Use a wrapper dma_fence_array of all fences including the new one when
   importing an exclusive fence.

v3 (Jason Ekstrand):
 - Lock around setting shared fences as well as exclusive
 - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
 - Initialize ret to 0 in dma_buf_wait_sync_file

v4 (Jason Ekstrand):
 - Use the new dma_resv_get_singleton helper

v5 (Jason Ekstrand):
 - Rename the IOCTLs to import/export rather than wait/signal
 - Drop the WRITE flag and always get/set the exclusive fence

v5 (Jason Ekstrand):
 - Split import and export into separate patches
 - New commit message

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
---
 drivers/dma-buf/dma-buf.c    | 32 ++++++++++++++++++++++++++++++++
 include/uapi/linux/dma-buf.h |  1 +
 2 files changed, 33 insertions(+)
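For illustration only, a sketch of how a Vulkan WSI might use this ioctl in place of the dummy submit described above: export the vkQueuePresentKHR wait semaphore as a sync FD via VK_KHR_external_semaphore_fd and attach it to the swapchain image's dma-buf as the new exclusive fence. Struct and ioctl names follow this series; the vkGetSemaphoreFdKHR pointer is assumed to have been resolved with vkGetDeviceProcAddr, and error handling is trimmed.

#include <linux/dma-buf.h>      /* with this series applied */
#include <sys/ioctl.h>
#include <unistd.h>
#include <vulkan/vulkan.h>

static int attach_present_fence(VkDevice dev,
                                PFN_vkGetSemaphoreFdKHR get_semaphore_fd,
                                VkSemaphore present_wait_semaphore,
                                int dmabuf_fd)
{
        const VkSemaphoreGetFdInfoKHR get_fd_info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                .semaphore = present_wait_semaphore,
                .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
        };
        struct dma_buf_sync_file arg = { .flags = DMA_BUF_SYNC_RW };
        int sync_fd = -1;
        int ret;

        /* Export the present-wait semaphore's payload as a sync FD ... */
        if (get_semaphore_fd(dev, &get_fd_info, &sync_fd) != VK_SUCCESS)
                return -1;

        /* ... and stash it as the exclusive fence on the swapchain image's
         * dma-buf so implicit-sync consumers (the compositor) wait on it. */
        arg.fd = sync_fd;
        ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
        close(sync_fd);
        return ret;
}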