diff mbox series

[RFC,v3,1/3] drm/doc/rfc: VM_BIND feature design document

Message ID 20220517183212.20274-2-niranjana.vishwanathapura@intel.com (mailing list archive)
State New, archived
Headers show
Series drm/doc/rfc: i915 VM_BIND feature design + uapi | expand

Commit Message

Niranjana Vishwanathapura May 17, 2022, 6:32 p.m. UTC
VM_BIND design document with description of intended use cases.

v2: Add more documentation and format as per review comments
    from Daniel.

Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 310 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

Comments

Zanoni, Paulo R May 19, 2022, 10:52 p.m. UTC | #1
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +The DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
> +objects (BOs) or sections of a BO at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without the user having to provide a list of all required
> +mappings during each submission (as required by the older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
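As an illustration of the uapi described above, here is a minimal sketch of how a UMD might
bind a section of a BO. The ioctl and structure layout are defined by the i915_vm_bind.h
header in patch 3/3 of this series, so the field names below are assumptions made only for
this sketch:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Hypothetical argument layout; the authoritative definition lives in the
     * proposed i915_vm_bind.h (patch 3/3 of this series). */
    struct sketch_i915_vm_bind {
            uint32_t vm_id;       /* address space to bind into (from VM_CREATE) */
            uint32_t handle;      /* GEM BO handle */
            uint64_t start;       /* GPU virtual address of the mapping */
            uint64_t offset;      /* offset into the BO, enables partial binding */
            uint64_t length;      /* size of the mapping */
            uint64_t flags;
            uint64_t extensions;  /* e.g. drm_i915_vm_bind_ext_timeline_fences */
    };

    /* Persistently map 'length' bytes of 'bo', starting at 'bo_offset', at GPU
     * virtual address 'gpu_va' in VM 'vm_id'. The mapping then stays valid
     * across subsequent execbuff calls until explicitly unbound. */
    static int vm_bind_section(int drm_fd, uint32_t vm_id, uint32_t bo,
                               uint64_t bo_offset, uint64_t length,
                               uint64_t gpu_va)
    {
            struct sketch_i915_vm_bind bind;

            memset(&bind, 0, sizeof(bind));
            bind.vm_id = vm_id;
            bind.handle = bo;
            bind.start = gpu_va;
            bind.offset = bo_offset;
            bind.length = length;
            /* DRM_IOCTL_I915_GEM_VM_BIND is provided by the proposed uapi
             * header; it is not part of released kernel headers yet. */
            return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
    }

Aliasing is then simply two such calls mapping the same handle/offset at two different
GPU virtual addresses, and partial binding is a call whose offset/length cover only part
of the BO.
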
> +The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
> +The user has to opt in to the VM_BIND mode of binding for an address space (VM)
> +at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> +
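A rough sketch of that opt-in sequence is below; I915_PARAM_HAS_VM_BIND and
I915_VM_CREATE_FLAGS_USE_VM_BIND come from the proposed uapi rather than released
headers, while the getparam and vm_control plumbing is existing uapi:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* I915_PARAM_HAS_VM_BIND and I915_VM_CREATE_FLAGS_USE_VM_BIND are from the
     * proposed uapi (patch 3/3), not from released headers. */
    static int create_vm_bind_vm(int drm_fd, uint32_t *vm_id)
    {
            int has_vm_bind = 0;
            struct drm_i915_getparam gp = {
                    .param = I915_PARAM_HAS_VM_BIND,
                    .value = &has_vm_bind,
            };
            struct drm_i915_gem_vm_control vm_create;

            if (ioctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp) || !has_vm_bind)
                    return -1;      /* kernel does not support VM_BIND */

            memset(&vm_create, 0, sizeof(vm_create));
            /* Opt in to VM_BIND mode at VM creation time. */
            vm_create.flags = I915_VM_CREATE_FLAGS_USE_VM_BIND;
            if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create))
                    return -1;

            *vm_id = vm_create.vm_id;
            return 0;
    }
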
> +The VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping
> +in an async worker. The binding and unbinding will work like a special GPU
> +engine. The binding and unbinding operations are serialized and will wait on
> +specified input fences before the operation and will signal the output fences
> +upon the completion of the operation. Due to serialization, completion of an
> +operation also indicates that all previous operations are complete.
> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support the older execbuff mode of
> +binding. In VM_BIND mode, the execbuff ioctl will not accept any execlist and
> +hence there is no support for implicit sync. It is expected that the below work
> +will be able to support the requirements of object dependency setting in all
> +use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)

I would really like to have more details here. The link provided points
to new ioctls and we're not very familiar with those yet, so I think
you should really clarify the interaction between the new additions
here. Having some sample code would be really nice too.

For Mesa at least (and I believe for the other drivers too) we always
have a few exported buffers in every execbuf call, and we rely on the
implicit synchronization provided by execbuf to make sure everything
works. The execbuf ioctl also has some code to flush caches during
implicit synchronization AFAIR, so I would guess we rely on it too and
whatever else the Kernel does. Is that covered by the new ioctls?

In addition, as far as I remember, one of the big improvements of
vm_bind was that it would help reduce ioctl latency and cpu overhead.
But if making execbuf faster comes at the cost of requiring additional
ioctl calls for implicit synchronization, which is required on every
execbuf call, then I wonder if we'll even get any faster at all.
Comparing old execbuf vs plain new execbuf without the new required
ioctls won't make sense.

But maybe I'm wrong and we won't need to call these new ioctls around
every single execbuf ioctl we submit? Again, more clarification and
some code examples here would be really nice. This is a big change on
an important part of the API, we should clarify the new expected usage.
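
For reference, the series linked above adds DMA_BUF_IOCTL_EXPORT_SYNC_FILE and
DMA_BUF_IOCTL_IMPORT_SYNC_FILE ioctls on the dma-buf fd. A rough sketch of how a UMD
could bracket an execbuf with them to get the equivalent of today's implicit sync for an
exported BO (names are from that proposal and may change before it lands):

    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>  /* assumed to carry the proposed sync-file ioctls */

    /* Before submitting: pull the BO's current implicit fences out as a
     * sync_file fd, to be passed to execbuf as an explicit in-fence. */
    static int export_implicit_fences(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file args = {
                    .flags = DMA_BUF_SYNC_RW,
                    .fd = -1,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                    return -1;
            return args.fd;     /* sync_file fd wrapping the BO's fences */
    }

    /* After submitting: push the job's out-fence (a sync_file fd) back into the
     * BO, so other processes relying on implicit sync still see the dependency. */
    static int import_job_fence(int dmabuf_fd, int sync_file_fd)
    {
            struct dma_buf_import_sync_file args = {
                    .flags = DMA_BUF_SYNC_WRITE,
                    .fd = sync_file_fd,
            };

            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
    }

Whether these extra ioctls are needed around every execbuf, or only for the few exported
BOs per submission, is exactly the performance question raised above.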

> +
> +This also means we need an execbuff extension to pass in the batch
> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
> +
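A rough sketch of what such an execbuff call might look like, with an empty execlist and
the batch GPU address passed through the extension chain; the layout and id of the
batch-addresses extension below are guesses for illustration, the authoritative definition
is in the uapi patch:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Hypothetical layout of the proposed extension; field names are assumptions. */
    struct sketch_execbuffer_ext_batch_addresses {
            struct i915_user_extension base;
            uint64_t count;          /* number of batches */
            uint64_t batch_addrs;    /* pointer to array of batch GPU VAs */
    };

    static int submit_vm_bind_mode(int drm_fd, uint32_t ctx_id, uint64_t batch_va)
    {
            struct sketch_execbuffer_ext_batch_addresses ext;
            struct drm_i915_gem_execbuffer2 execbuf;
            uint64_t addrs[1] = { batch_va };

            memset(&ext, 0, sizeof(ext));
            /* Extension id as defined by the proposed uapi (assumed name). */
            ext.base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
            ext.count = 1;
            ext.batch_addrs = (uintptr_t)addrs;

            memset(&execbuf, 0, sizeof(execbuf));
            execbuf.buffers_ptr = 0;        /* no execlist in VM_BIND mode */
            execbuf.buffer_count = 0;
            execbuf.flags = I915_EXEC_USE_EXTENSIONS;
            execbuf.cliprects_ptr = (uintptr_t)&ext;  /* extension chain */
            execbuf.rsvd1 = ctx_id;
            return ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
    }
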
> +If execlist support in the execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then that support can be added later.

IMHO we really need to sort this and check all the assumptions before
we commit to any interface. Again, implicit synchronization is
something we rely on during *every* execbuf ioctl for most workloads.


> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence, VA assignment and eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
> +be using the i915_vma active reference tracking. It will instead use the
> +dma-resv object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path, like relocations, VA evictions,
> +the vma lookup table, implicit sync, vma active reference tracking etc., is not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from the older method, and they should be moved to separate files.

I seem to recall some conversations where we were told a bunch of
ioctls would stop working or make no sense to call when using vm_bind.
Can we please get a complete list of those? Bonus points if the Kernel
starts telling us we just called something that makes no sense.

> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence, these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +The VM_BIND feature introduces an optimization where the user can create a BO
> +which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
> +during BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
> +on the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence, during each execbuff
> +submission, they need only one dma-resv fence list update. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t. the number of VM private BOs.

I know we already discussed this, but just to document it publicly: the
ideal case for user space would be that every BO is created as private
but then we'd have an ioctl to convert it to non-private (without the
need to have a non-private->private interface).

An explanation on why we can't have an ioctl to mark as exported a
buffer that was previously vm_private would be really appreciated.

Thanks,
Paulo
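
For illustration, creating a VM-private BO would presumably chain the proposed
I915_GEM_CREATE_EXT_VM_PRIVATE extension into the existing gem_create_ext ioctl, along
these lines (the extension's layout here is an assumption; only the flag name comes from
the document above):

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Hypothetical extension payload: assume it only needs the target vm_id. */
    struct sketch_gem_create_ext_vm_private {
            struct i915_user_extension base;
            uint32_t vm_id;          /* VM this BO is private to */
            uint32_t pad;
    };

    static int create_vm_private_bo(int drm_fd, uint32_t vm_id, uint64_t size,
                                    uint32_t *handle)
    {
            struct sketch_gem_create_ext_vm_private vm_priv;
            struct drm_i915_gem_create_ext create;

            memset(&vm_priv, 0, sizeof(vm_priv));
            /* Extension id from the proposed uapi (not in released headers). */
            vm_priv.base.name = I915_GEM_CREATE_EXT_VM_PRIVATE;
            vm_priv.vm_id = vm_id;

            memset(&create, 0, sizeof(create));
            create.size = size;
            create.extensions = (uintptr_t)&vm_priv;  /* chain the extension */

            if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create))
                    return -1;

            *handle = create.handle;  /* can only ever be bound in 'vm_id' */
            return 0;
    }
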


> +
> +VM_BIND locking hierarchy
> +--------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults manage
> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +The VM_BIND locking order is as follows.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding does not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path does not take any of these
> +locks. There we will simply smash the new batch buffer address into the ring and
> +then tell the scheduler to run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in the execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and need to
> +be pinned in memory when the VM is made active (i.e., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences need to be added to all VM_BIND mapped objects. During each execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from the VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for the VM's private objects, as they use
> +the shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use the dma-resv APIs for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
> +older i915_vma active reference tracking, which is deprecated. This should be
> +easier to get working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting the TTM backend for igfx.
> +
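As a rough sketch of the above in terms of the generic dma-resv helpers the text refers
to (illustrative only, not actual i915 code):

    #include <linux/dma-resv.h>
    #include <linux/dma-fence.h>

    /* Attach a request fence to a VM_BIND mapped object with BOOKKEEP usage so
     * it does not cause over-sync. Explicit/implicit dependency setting would
     * instead use DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE. */
    static int vm_bind_attach_fence(struct dma_resv *resv, struct dma_fence *fence)
    {
            int ret;

            ret = dma_resv_lock(resv, NULL);
            if (ret)
                    return ret;
            ret = dma_resv_reserve_fences(resv, 1);
            if (!ret)
                    dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_BOOKKEEP);
            dma_resv_unlock(resv);
            return ret;
    }

    /* Activeness is then answered purely through dma-resv: BOOKKEEP usage
     * includes all fences, so "anything still pending?" is just this test. */
    static bool vm_bind_object_is_active(struct dma_resv *resv)
    {
            return !dma_resv_test_signaled(resv, DMA_RESV_USAGE_BOOKKEEP);
    }
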
> +Evictable page table allocations
> +---------------------------------
> +Make page table allocations evictable and manage them similarly to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (the differences here are that page table pages will not have an i915_vma
> +structure, and after swapping pages back in, the parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. A User/Memory fence is an <address, value> pair. To signal the
> +user fence, the specified value will be written at the specified virtual address
> +and the waiting process will be woken up. The user can wait on a user fence
> +with the gem_wait_user_fence ioctl.
> +
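Conceptually, the wait side reduces to the following check (pure illustration of the
<address, value> semantics with one possible comparison op; the real interface is the
proposed gem_wait_user_fence ioctl, which sleeps in the kernel instead of polling):

    #include <stdbool.h>
    #include <stdint.h>

    /* Signaling means the producer (GPU or KMD) wrote 'value' to '*addr'; the
     * proposed ioctl additionally takes a comparison op and a timeout and
     * blocks until the condition holds. '>=' here is just one possible op. */
    static bool user_fence_signaled(const volatile uint64_t *addr, uint64_t value)
    {
            return __atomic_load_n(addr, __ATOMIC_ACQUIRE) >= value;
    }
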
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of the next level batch. The completion of the very first level
> +batch needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of the execbuff ioctl, so that the KMD can set up the command
> +streamer to signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When the VM_BIND ioctl is provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of the user/memory fence also indicates the completion of all
> +previous binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in a reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fences, and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in the user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt
> +in to this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
> +context creation. The dma-fence based user interfaces like the gem_wait ioctl
> +and the execbuff out fence are not allowed on long running contexts. Implicit
> +sync is not valid as well, and is anyway not supported in VM_BIND mode.
> +
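A minimal sketch of that opt-in, assuming the proposed flag ends up in
drm_i915_gem_context_create_ext.flags; everything else below is existing uapi:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING is from the proposed uapi. */
    static int create_long_running_ctx(int drm_fd, uint32_t *ctx_id)
    {
            struct drm_i915_gem_context_create_ext create;

            memset(&create, 0, sizeof(create));
            create.flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING; /* compute opt-in */
            if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create))
                    return -1;

            *ctx_id = create.ctx_id;
            return 0;
    }
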
> +Where GPU page faults are not available, the kernel driver, upon buffer
> +invalidation, will initiate a suspend (preemption) of the long running context
> +with a dma-fence attached to it. Upon completion of that suspend fence, it will
> +finish the invalidation, revalidate the BO and then resume the compute context.
> +This is done by having a per-context preempt fence (also called suspend fence)
> +proxying as the i915_request fence. This suspend fence is enabled when someone
> +tries to wait on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence and the resume work
> +for the compute mode contexts can be tricky to get right, it is better to
> +add this support in the drm scheduler so that multiple drivers can make use of it.
> +That means it will have a dependency on the i915 drm scheduler conversion with
> +the GuC scheduler backend. This should be fine, as the plan is to support compute
> +mode contexts only with the GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows the compute UMD to directly submit GPU jobs instead of going through
> +the execbuff ioctl. This is made possible by VM_BIND not being synchronized
> +against execbuff. VM_BIND allows bind/unbind of the mappings required for the
> +directly submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With the debug event interface, a user space process (debugger) is able to keep
> +track of and act upon resources created by another process (debugged) and
> +attached to the GPU via the vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults, when supported (in the future), will only be supported in
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
> +binding will require using dma-fence to ensure residency, the GPU page fault
> +mode, when supported, will not use any dma-fence, as residency is purely managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows setting hints per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hints will be even more relevant with the
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +The VM_BIND interface can be used to map system memory directly (without the
> +gem BO abstraction) using the HMM interface. SVM is only supported with GPU
> +page faults enabled.
> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding, which comes with its own
> +use cases and locking requirements, requires proper integration with the
> +existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are a few things that have been identified and are being looked into.
> +
> +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
> +  does not use it, and the complexity it brings in is probably more than the
> +  performance advantage we get in the legacy execbuff case.
> +- Remove vma->open_count counting.
> +- Remove i915_vma active reference tracking. The VM_BIND feature will not be
> +  using it. Instead, use the underlying BO's dma-resv fence list to determine
> +  if an i915_vma is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
>  
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
Niranjana Vishwanathapura May 23, 2022, 7:05 p.m. UTC | #2
On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>
>I would really like to have more details here. The link provided points
>to new ioctls and we're not very familiar with those yet, so I think
>you should really clarify the interaction between the new additions
>here. Having some sample code would be really nice too.
>
>For Mesa at least (and I believe for the other drivers too) we always
>have a few exported buffers in every execbuf call, and we rely on the
>implicit synchronization provided by execbuf to make sure everything
>works. The execbuf ioctl also has some code to flush caches during
>implicit synchronization AFAIR, so I would guess we rely on it too and
>whatever else the Kernel does. Is that covered by the new ioctls?
>
>In addition, as far as I remember, one of the big improvements of
>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>But if making execbuf faster comes at the cost of requiring additional
>ioctl calls for implicit synchronization, which is required on every
>execbuf call, then I wonder if we'll even get any faster at all.
>Comparing old execbuf vs plain new execbuf without the new required
>ioctls won't make sense.
>
>But maybe I'm wrong and we won't need to call these new ioctls around
>every single execbuf ioctl we submit? Again, more clarification and
>some code examples here would be really nice. This is a big change on
>an important part of the API, we should clarify the new expected usage.
>

Thanks Paulo for the comments.

In VM_BIND mode, the only reason we would need execlist support in the
execbuff path is for implicit synchronization. And AFAIK, this work
from Jason is expected to replace implicit synchronization with new ioctls.
Hence, VM_BIND mode will not need execlist support at all.

Based on comments from Daniel and my offline sync with Jason, this
new mechanism from Jason is expected to work for vl. For gl, there is a
question of whether it will be performant or not, but it is worth trying
that first. Only if it is not performant for gl would we consider
adding implicit sync support back for VM_BIND mode.

Daniel, Jason, Ken, any thoughts you can add here?

>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>
>IMHO we really need to sort this and check all the assumptions before
>we commit to any interface. Again, implicit synchronization is
>something we rely on during *every* execbuf ioctl for most workloads.
>

Daniel's earlier feedback was that it is worth Mesa trying this new
mechanism for gl and seeing if it works. We want to avoid adding
execlist support for implicit sync in vm_bind mode from the beginning
if it is going to be deemed unnecessary.

>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>
>I seem to recall some conversations where we were told a bunch of
>ioctls would stop working or make no sense to call when using vm_bind.
>Can we please get a complete list of those? Bonus points if the Kernel
>starts telling us we just called something that makes no sense.
>

Which ioctls are you talking about here?
We do not support the GEM_WAIT ioctl, but that is only for compute mode (which is
already documented in this patch).

>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>
>I know we already discussed this, but just to document it publicly: the
>ideal case for user space would be that every BO is created as private
>but then we'd have an ioctl to convert it to non-private (without the
>need to have a non-private->private interface).
>
>An explanation on why we can't have an ioctl to mark as exported a
>buffer that was previously vm_private would be really appreciated.
>

Ok, I can add some notes on that.
The reason is that this requires changing the dma-resv object for the
gem object, and hence the object locking also. This will add complications
as we have to sync with any pending operations. It might be easier for
UMDs to do it themselves by copying the object contents to a new object.
Niranjana

>Thanks,
>Paulo
>
>
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of these
>> +locks. There we will simply smash the new batch buffer address into the ring and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence) proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broder i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>
Niranjana Vishwanathapura May 23, 2022, 7:08 p.m. UTC | #3
On Mon, May 23, 2022 at 12:05:05PM -0700, Niranjana Vishwanathapura wrote:
>On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>>>VM_BIND design document with description of intended use cases.
>>>
>>>v2: Add more documentation and format as per review comments
>>>    from Daniel.
>>>
>>>Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>>>---
>>>
>>>diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>new file mode 100644
>>>index 000000000000..f1be560d313c
>>>--- /dev/null
>>>+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>>>@@ -0,0 +1,304 @@
>>>+==========================================
>>>+I915 VM_BIND feature design and use cases
>>>+==========================================
>>>+
>>>+VM_BIND feature
>>>+================
>>>+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>>>+objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>>>+specified address space (VM). These mappings (also referred to as persistent
>>>+mappings) will be persistent across multiple GPU submissions (execbuff calls)
>>>+issued by the UMD, without user having to provide a list of all required
>>>+mappings during each submission (as required by older execbuff mode).
>>>+
>>>+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
>>>+to specify how the binding/unbinding should sync with other operations
>>>+like the GPU job submission. These fences will be timeline 'drm_syncobj's
>>>+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>>>+For Compute contexts, they will be user/memory fences (See struct
>>>+drm_i915_vm_bind_ext_user_fence).
>>>+
>>>+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>>>+User has to opt-in for VM_BIND mode of binding for an address space (VM)
>>>+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>>>+
>>>+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>+async worker. The binding and unbinding will work like a special GPU engine.
>>>+The binding and unbinding operations are serialized and will wait on specified
>>>+input fences before the operation and will signal the output fences upon the
>>>+completion of the operation. Due to serialization, completion of an operation
>>>+will also indicate that all previous operations are also complete.
>>>+
>>>+VM_BIND features include:
>>>+
>>>+* Multiple Virtual Address (VA) mappings can map to the same physical pages
>>>+  of an object (aliasing).
>>>+* VA mapping can map to a partial section of the BO (partial binding).
>>>+* Support capture of persistent mappings in the dump upon GPU error.
>>>+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>>>+  use cases will be helpful.
>>>+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>>>+* Support for userptr gem objects (no special uapi is required for this).
>>>+
>>>+Execbuff ioctl in VM_BIND mode
>>>+-------------------------------
>>>+The execbuff ioctl handling in VM_BIND mode differs significantly from the
>>>+older method. A VM in VM_BIND mode will not support older execbuff mode of
>>>+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>>>+no support for implicit sync. It is expected that the below work will be able
>>>+to support requirements of object dependency setting in all use cases:
>>>+
>>>+"dma-buf: Add an API for exporting sync files"
>>>+(https://lwn.net/Articles/859290/)
>>
>>I would really like to have more details here. The link provided points
>>to new ioctls and we're not very familiar with those yet, so I think
>>you should really clarify the interaction between the new additions
>>here. Having some sample code would be really nice too.
>>
>>For Mesa at least (and I believe for the other drivers too) we always
>>have a few exported buffers in every execbuf call, and we rely on the
>>implicit synchronization provided by execbuf to make sure everything
>>works. The execbuf ioctl also has some code to flush caches during
>>implicit synchronization AFAIR, so I would guess we rely on it too and
>>whatever else the Kernel does. Is that covered by the new ioctls?
>>
>>In addition, as far as I remember, one of the big improvements of
>>vm_bind was that it would help reduce ioctl latency and cpu overhead.
>>But if making execbuf faster comes at the cost of requiring additional
>>ioctl calls for implicit synchronization, which is required on every
>>execbuf call, then I wonder if we'll even get any faster at all.
>>Comparing old execbuf vs plain new execbuf without the new required
>>ioctls won't make sense.
>>
>>But maybe I'm wrong and we won't need to call these new ioctls around
>>every single execbuf ioctl we submit? Again, more clarification and
>>some code examples here would be really nice. This is a big change on
>>an important part of the API, we should clarify the new expected usage.
>>
>
>Thanks Paulo for the comments.
>
>In VM_BIND mode, the only reason we would need execlist support in
>execbuff path is for implicit synchronization. And AFAIK, this work
>from Jason is expected to replace implicit synchronization with new ioctls.
>Hence, VM_BIND mode will not be needing execlist support at all.
>
>Based on comments from Daniel and my offline sync with Jason, this
>new mechanism from Jason is expected to work for vl. For gl, there is a
>question of whether it will be performant or not. But it is worth trying
>that first. If it is not performant for gl, then only we can consider
>adding implicit sync support back for VM_BIND mode.
>
>Daniel, Jason, Ken, any thoughts you can add here?

CC'ing Ken.

>
>>>+
>>>+This also means, we need an execbuff extension to pass in the batch
>>>+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>>>+
>>>+If at all execlist support in execbuff ioctl is deemed necessary for
>>>+implicit sync in certain use cases, then support can be added later.
>>
>>IMHO we really need to sort this and check all the assumptions before
>>we commit to any interface. Again, implicit synchronization is
>>something we rely on during *every* execbuf ioctl for most workloads.
>>
>
>Daniel's earlier feedback was that it is worth Mesa trying this new
>mechanism for gl and see it that works. We want to avoid supporting
>execlist support for implicit sync in vm_bind mode from the beginning
>if it is going to be deemed not necessary.
>
>>
>>>+In VM_BIND mode, VA allocation is completely managed by the user instead of
>>>+the i915 driver. Hence all VA assignment, eviction are not applicable in
>>>+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>>>+be using the i915_vma active reference tracking. It will instead use dma-resv
>>>+object for that (See `VM_BIND dma_resv usage`_).
>>>+
>>>+So, a lot of existing code in the execbuff path like relocations, VA evictions,
>>>+vma lookup table, implicit sync, vma active reference tracking etc., are not
>>>+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>>>+by clearly separating out the functionalities where the VM_BIND mode differs
>>>+from older method and they should be moved to separate files.
>>
>>I seem to recall some conversations where we were told a bunch of
>>ioctls would stop working or make no sense to call when using vm_bind.
>>Can we please get a complete list of those? Bonus points if the Kernel
>>starts telling us we just called something that makes no sense.
>>
>
>Which ioctls you are talking about here?
>We do not support GEM_WAIT ioctls, but that is only for compute mode (which is
>already documented in this patch).
>
>>>+
>>>+VM_PRIVATE objects
>>>+-------------------
>>>+By default, BOs can be mapped on multiple VMs and can also be dma-buf
>>>+exported. Hence these BOs are referred to as Shared BOs.
>>>+During each execbuff submission, the request fence must be added to the
>>>+dma-resv fence list of all shared BOs mapped on the VM.
>>>+
>>>+VM_BIND feature introduces an optimization where user can create BO which
>>>+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>>>+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>>>+the VM they are private to and can't be dma-buf exported.
>>>+All private BOs of a VM share the dma-resv object. Hence during each execbuff
>>>+submission, they need only one dma-resv fence list updated. Thus, the fast
>>>+path (where required mappings are already bound) submission latency is O(1)
>>>+w.r.t the number of VM private BOs.
>>
>>I know we already discussed this, but just to document it publicly: the
>>ideal case for user space would be that every BO is created as private
>>but then we'd have an ioctl to convert it to non-private (without the
>>need to have a non-private->private interface).
>>
>>An explanation on why we can't have an ioctl to mark as exported a
>>buffer that was previously vm_private would be really appreciated.
>>
>
>Ok, I can some notes on that.
>The reason being the fact that this require changing the dma-resv object
>for gem object, hence the object locking also. This will add complications
>as we have to sync with any pending operations. It might be easier for
>UMDs to do it themselves by copying the object contexts to a new object.
>
>Niranjana
>
>>Thanks,
>>Paulo
>>
>>
>>>+
>>>+VM_BIND locking hirarchy
>>>+-------------------------
>>>+The locking design here supports the older (execlist based) execbuff mode, the
>>>+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>>>+system allocator support (See `Shared Virtual Memory (SVM) support`_).
>>>+The older execbuff mode and the newer VM_BIND mode without page faults manages
>>>+residency of backing storage using dma_fence. The VM_BIND mode with page faults
>>>+and the system allocator support do not use any dma_fence at all.
>>>+
>>>+VM_BIND locking order is as below.
>>>+
>>>+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>>>+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>>>+   mapping.
>>>+
>>>+   In future, when GPU page faults are supported, we can potentially use a
>>>+   rwsem instead, so that multiple page fault handlers can take the read side
>>>+   lock to lookup the mapping and hence can run in parallel.
>>>+   The older execbuff mode of binding do not need this lock.
>>>+
>>>+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>>>+   be held while binding/unbinding a vma in the async worker and while updating
>>>+   dma-resv fence list of an object. Note that private BOs of a VM will all
>>>+   share a dma-resv object.
>>>+
>>>+   The future system allocator support will use the HMM prescribed locking
>>>+   instead.
>>>+
>>>+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>>>+   invalidated vmas (due to eviction and userptr invalidation) etc.
>>>+
>>>+When GPU page faults are supported, the execbuff path do not take any of these
>>>+locks. There we will simply smash the new batch buffer address into the ring and
>>>+then tell the scheduler run that. The lock taking only happens from the page
>>>+fault handler, where we take lock-A in read mode, whichever lock-B we need to
>>>+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>>>+system allocator) and some additional locks (lock-D) for taking care of page
>>>+table races. Page fault mode should not need to ever manipulate the vm lists,
>>>+so won't ever need lock-C.
>>>+
>>>+VM_BIND LRU handling
>>>+---------------------
>>>+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>>>+performance degradation. We will also need support for bulk LRU movement of
>>>+VM_BIND objects to avoid additional latencies in execbuff path.
>>>+
>>>+The page table pages are similar to VM_BIND mapped objects (See
>>>+`Evictable page table allocations`_) and are maintained per VM and needs to
>>>+be pinned in memory when VM is made active (ie., upon an execbuff call with
>>>+that VM). So, bulk LRU movement of page table pages is also needed.
>>>+
>>>+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>>>+over to the ttm LRU in some fashion to make sure we once again have a reasonable
>>>+and consistent memory aging and reclaim architecture.
>>>+
>>>+VM_BIND dma_resv usage
>>>+-----------------------
>>>+Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>>>+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>>>+over sync (See enum dma_resv_usage). One can override it with either
>>>+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>>>+setting (either through explicit or implicit mechanism).
>>>+
>>>+When vm_bind is called for a non-private object while the VM is already
>>>+active, the fences need to be copied from VM's shared dma-resv object
>>>+(common to all private objects of the VM) to this non-private object.
>>>+If this results in performance degradation, then some optimization will
>>>+be needed here. This is not a problem for VM's private objects as they use
>>>+shared dma-resv object which is always updated on each execbuff submission.
>>>+
>>>+Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>>>+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>>>+older i915_vma active reference tracking which is deprecated. This should be
>>>+easier to get it working with the current TTM backend. We can remove the
>>>+i915_vma active reference tracking fully while supporting TTM backend for igfx.
>>>+
>>>+Evictable page table allocations
>>>+---------------------------------
>>>+Make pagetable allocations evictable and manage them similar to VM_BIND
>>>+mapped objects. Page table pages are similar to persistent mappings of a
>>>+VM (difference here are that the page table pages will not have an i915_vma
>>>+structure and after swapping pages back in, parent page link needs to be
>>>+updated).
>>>+
>>>+Mesa use case
>>>+--------------
>>>+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>>>+hence improving performance of CPU-bound applications. It also allows us to
>>>+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>>>+reducing CPU overhead becomes more impactful.
>>>+
>>>+
>>>+VM_BIND Compute support
>>>+========================
>>>+
>>>+User/Memory Fence
>>>+------------------
>>>+The idea is to take a user specified virtual address and install an interrupt
>>>+handler to wake up the current task when the memory location passes the user
>>>+supplied filter. User/Memory fence is a <address, value> pair. To signal the
>>>+user fence, specified value will be written at the specified virtual address
>>>+and wakeup the waiting process. User can wait on a user fence with the
>>>+gem_wait_user_fence ioctl.
>>>+
>>>+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>>>+interrupt within their batches after updating the value to have sub-batch
>>>+precision on the wakeup. Each batch can signal a user fence to indicate
>>>+the completion of the next level batch. The completion of the very first level
>>>+batch needs to be signaled by the command streamer. The user must provide the
>>>+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>>>+extension of execbuff ioctl, so that KMD can setup the command streamer to
>>>+signal it.
>>>+
>>>+User/Memory fence can also be supplied to the kernel driver to signal/wake up
>>>+the user process after completion of an asynchronous operation.
>>>+
>>>+When the VM_BIND ioctl is provided with a user/memory fence via the
>>>+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon completion
>>>+of the binding of that mapping. All async binds/unbinds are serialized, hence
>>>+signaling of the user/memory fence also indicates the completion of all previous
>>>+binds/unbinds.
>>>+
>>>+This feature will be derived from the below original work:
>>>+https://patchwork.freedesktop.org/patch/349417/
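As a purely conceptual sketch of the <address, value> pair described above
(the struct and helpers below are illustrative placeholders, not the proposed
uapi; the real definitions live in the RFC's i915_vm_bind.h):

#include <stdint.h>

struct user_fence {                /* hypothetical, for illustration only */
        uint64_t *addr;            /* memory location visible to GPU and CPU */
        uint64_t value;            /* value the signaler writes there        */
};

/* Signaler side (command streamer via MI_FLUSH/PIPE_CONTROL, or the KMD): */
static void user_fence_signal(struct user_fence *f)
{
        __atomic_store_n(f->addr, f->value, __ATOMIC_RELEASE);
        /* ...followed by a notify interrupt to wake up the waiter. */
}

/* Waiter side; in practice this would go through the proposed
 * gem_wait_user_fence ioctl (with a timeout) rather than busy-polling. */
static void user_fence_wait(struct user_fence *f)
{
        while (__atomic_load_n(f->addr, __ATOMIC_ACQUIRE) != f->value)
                ; /* spin - illustration only */
}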
>>>+
>>>+Long running Compute contexts
>>>+------------------------------
>>>+Usage of dma-fence expects that it completes in a reasonable amount of time.
>>>+Compute, on the other hand, can be long running. Hence it is appropriate for
>>>+compute to use user/memory fences, and dma-fence usage will be limited to
>>>+in-kernel consumption only. This requires an execbuff uapi extension to pass
>>>+in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
>>>+opt-in for this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
>>>+during context creation. The dma-fence based user interfaces like the gem_wait
>>>+ioctl and execbuff out fence are not allowed on long running contexts. Implicit
>>>+sync is not valid either and is anyway not supported in VM_BIND mode.
>>>+
>>>+Where GPU page faults are not available, the kernel driver, upon buffer
>>>+invalidation, will initiate a suspend (preemption) of the long running context
>>>+with a dma-fence attached to it. Upon completion of that suspend fence, it will
>>>+finish the invalidation, revalidate the BO and then resume the compute context.
>>>+This is done by having a per-context preempt fence (also called suspend fence)
>>>+proxying as the i915_request fence. This suspend fence is enabled when someone
>>>+tries to wait on it, which then triggers the context preemption.
>>>+
>>>+As this support for context suspension using a preempt fence and the resume
>>>+work for compute mode contexts can be tricky to get right, it is better to add
>>>+this support in the drm scheduler so that multiple drivers can make use of it.
>>>+That means it will have a dependency on the i915 drm scheduler conversion with GuC
>>>+scheduler backend. This should be fine, as the plan is to support compute mode
>>>+contexts only with GuC scheduler backend (at least initially). This is much
>>>+easier to support with VM_BIND mode compared to the current heavier execbuff
>>>+path resource attachment.
>>>+
>>>+Low Latency Submission
>>>+-----------------------
>>>+Allows compute UMD to directly submit GPU jobs instead of through execbuff
>>>+ioctl. This is made possible by VM_BIND not being synchronized against
>>>+execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>>>+submitted jobs.
>>>+
>>>+Other VM_BIND use cases
>>>+========================
>>>+
>>>+Debugger
>>>+---------
>>>+With the debug event interface, a user space process (the debugger) is able to
>>>+keep track of and act upon resources created by another process (the debuggee)
>>>+and attached to the GPU via the vm_bind interface.
>>>+
>>>+GPU page faults
>>>+----------------
>>>+GPU page faults, when supported (in the future), will only be supported in
>>>+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>>>+binding will require using dma-fence to ensure residency, the GPU page fault
>>>+mode, when supported, will not use any dma-fence as residency is purely managed
>>>+by installing and removing/invalidating page table entries.
>>>+
>>>+Page level hints settings
>>>+--------------------------
>>>+VM_BIND allows setting any hints per mapping instead of per BO.
>>>+Possible hints include read-only mapping, placement and atomicity.
>>>+Sub-BO level placement hint will be even more relevant with
>>>+upcoming GPU on-demand page fault support.
>>>+
>>>+Page level Cache/CLOS settings
>>>+-------------------------------
>>>+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>>>+
>>>+Shared Virtual Memory (SVM) support
>>>+------------------------------------
>>>+VM_BIND interface can be used to map system memory directly (without gem BO
>>>+abstraction) using the HMM interface. SVM is only supported with GPU page
>>>+faults enabled.
>>>+
>>>+
>>>+Broader i915 cleanups
>>>+=====================
>>>+Supporting this whole new vm_bind mode of binding, which comes with its own
>>>+use cases and locking requirements, requires proper integration with the
>>>+existing i915 driver. This calls for some broader i915 driver
>>>+cleanups/simplifications for maintainability of the driver going forward.
>>>+Here are a few things that have been identified and are being looked into.
>>>+
>>>+- Remove vma lookup cache (eb->gem_context->handles_vma). The VM_BIND feature
>>>+  does not use it, and the complexity it brings in is probably more than the
>>>+  performance advantage we get in the legacy execbuff case.
>>>+- Remove vma->open_count counting.
>>>+- Remove i915_vma active reference tracking. The VM_BIND feature will not be
>>>+  using it. Instead, use the underlying BO's dma-resv fence list to determine
>>>+  if an i915_vma is active or not.
>>>+
>>>+
>>>+VM_BIND UAPI
>>>+=============
>>>+
>>>+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>>>diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>>>index 91e93a705230..7d10c36b268d 100644
>>>--- a/Documentation/gpu/rfc/index.rst
>>>+++ b/Documentation/gpu/rfc/index.rst
>>>@@ -23,3 +23,7 @@ host such documentation:
>>> .. toctree::
>>>
>>>     i915_scheduler.rst
>>>+
>>>+.. toctree::
>>>+
>>>+    i915_vm_bind.rst
>>
Lionel Landwerlin May 24, 2022, 10:08 a.m. UTC | #4
On 20/05/2022 01:52, Zanoni, Paulo R wrote:
> On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>      from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> ---
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
> I would really like to have more details here. The link provided points
> to new ioctls and we're not very familiar with those yet, so I think
> you should really clarify the interaction between the new additions
> here. Having some sample code would be really nice too.
>
> For Mesa at least (and I believe for the other drivers too) we always
> have a few exported buffers in every execbuf call, and we rely on the
> implicit synchronization provided by execbuf to make sure everything
> works. The execbuf ioctl also has some code to flush caches during
> implicit synchronization AFAIR, so I would guess we rely on it too and
> whatever else the Kernel does. Is that covered by the new ioctls?
>
> In addition, as far as I remember, one of the big improvements of
> vm_bind was that it would help reduce ioctl latency and cpu overhead.
> But if making execbuf faster comes at the cost of requiring additional
> ioctl calls for implicit synchronization, which is required on every
> execbuf call, then I wonder if we'll even get any faster at all.
> Comparing old execbuf vs plain new execbuf without the new required
> ioctls won't make sense.
> But maybe I'm wrong and we won't need to call these new ioctls around
> every single execbuf ioctl we submit? Again, more clarification and
> some code examples here would be really nice. This is a big change on
> an important part of the API, we should clarify the new expected usage.


Hey Paulo,


I think in the case of X11/Wayland, we'll be doing 1 or 2 extra ioctls 
per frame which seems pretty reasonable.

Essentially we need to set the dependencies on the buffer we're going to 
tell the display engine (gnome-shell/kde/bare-display-hw) to use.


In the Vulkan case, we're trading building execbuffer lists of 
potentially thousands of buffers for every single submission versus 1 or 
2 ioctls for a single item when doing vkQueuePresent() (which happens 
less often than we do execbuffer ioctls).

That seems like a good trade off and doesn't look like a lot more work 
than explicit fencing where we would have to send associated fences.


Here is the Mesa MR associated with this : 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
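As a rough example of those 1 or 2 extra ioctls, assuming the uapi from the
"dma-buf: Add an API for exporting sync files" series linked above (names are
taken from that proposal and may change before it lands):

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Before handing the buffer to the compositor: export the fences a reader
 * must wait on (i.e. our pending rendering) as a sync_file fd. */
static int export_render_fence(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_READ,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                return -1;
        return args.fd;     /* pass this to the compositor */
}

/* When the compositor hands back its release fence: import it as a read
 * fence so our future rendering (writes) waits on it. */
static int import_compositor_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_import_sync_file args = {
                .flags = DMA_BUF_SYNC_READ,
                .fd = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}

The export would happen around presentation time and the import once the
buffer comes back, rather than around every single execbuffer call.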


-Lionel


>
>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
> IMHO we really need to sort this and check all the assumptions before
> we commit to any interface. Again, implicit synchronization is
> something we rely on during *every* execbuf ioctl for most workloads.
>
>
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
> I seem to recall some conversations where we were told a bunch of
> ioctls would stop working or make no sense to call when using vm_bind.
> Can we please get a complete list of those? Bonus points if the Kernel
> starts telling us we just called something that makes no sense.
>
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
> I know we already discussed this, but just to document it publicly: the
> ideal case for user space would be that every BO is created as private
> but then we'd have an ioctl to convert it to non-private (without the
> need to have a non-private->private interface).
>
> An explanation on why we can't have an ioctl to mark as exported a
> buffer that was previously vm_private would be really appreciated.
>
> Thanks,
> Paulo
>
>
>> +
>> +VM_BIND locking hierarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of these
>> +locks. There we will simply smash the new batch buffer address into the ring and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence) proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broader i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>   .. toctree::
>>   
>>       i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
Lionel Landwerlin June 1, 2022, 2:25 p.m. UTC | #5
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

I guess we should avoid saying "will immediately start 
binding/unbinding" if there are fences involved.

And the fact that it's happening in an async worker seem to imply it's 
not immediate.


I have a question on the behavior of the bind operation when no input 
fence is provided. Let say I do :

VM_BIND (out_fence=fence1)

VM_BIND (out_fence=fence2)

VM_BIND (out_fence=fence3)


In what order are the fences going to be signaled?

In the order of VM_BIND ioctls? Or out of order?

Because you wrote "serialized", I assume it's: in order.


One thing I didn't realize is that because we only get one "VM_BIND" 
engine, there is a disconnect from the Vulkan specification.

In Vulkan VM_BIND operations are serialized but per engine.

So you could have something like this :

VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)

VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


fence1 is not signaled

fence3 is signaled

So the second VM_BIND will proceed before the first VM_BIND.
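For reference, this is the per-queue sparse binding pattern on the Vulkan
side; a minimal sketch in C, with the queue, semaphore and buffer-bind handles
assumed to already exist:

#include <vulkan/vulkan.h>

/* Two sparse binds on two different queues, each with its own wait/signal
 * semaphores, mirroring the two VM_BIND calls above. Each queue serializes
 * its own binds, but nothing orders the two queues against each other: if
 * fence3 signals first, the second bind may complete before the first. */
static void sparse_bind_on_two_queues(VkQueue rcs_like_queue, VkQueue ccs_like_queue,
                                      VkSemaphore fence1, VkSemaphore fence2,
                                      VkSemaphore fence3, VkSemaphore fence4,
                                      const VkSparseBufferMemoryBindInfo *bind_a,
                                      const VkSparseBufferMemoryBindInfo *bind_b)
{
        VkBindSparseInfo info1 = {
                .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
                .waitSemaphoreCount = 1,   .pWaitSemaphores = &fence1,
                .bufferBindCount = 1,      .pBufferBinds = bind_a,
                .signalSemaphoreCount = 1, .pSignalSemaphores = &fence2,
        };
        VkBindSparseInfo info2 = {
                .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
                .waitSemaphoreCount = 1,   .pWaitSemaphores = &fence3,
                .bufferBindCount = 1,      .pBufferBinds = bind_b,
                .signalSemaphoreCount = 1, .pSignalSemaphores = &fence4,
        };

        vkQueueBindSparse(rcs_like_queue, 1, &info1, VK_NULL_HANDLE);  /* "rcs0" */
        vkQueueBindSparse(ccs_like_queue, 1, &info2, VK_NULL_HANDLE);  /* "ccs0" */
}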


I guess we can deal with that scenario in userspace by doing the wait 
ourselves in one thread per engine.

But then it makes the VM_BIND input fences useless.


Daniel: what do you think? Should we rework this or just deal with wait 
fences in userspace?


Sorry I noticed this late.


-Lionel
Matthew Brost June 1, 2022, 8:28 p.m. UTC | #6
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seem to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let say I do :
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized I assume it's : in order
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engines.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel : what do you think? Should be rework this or just deal with wait
> fences in userspace?
> 

My opinion is to rework this but make the ordering via an engine param optional.

e.g. A VM can be configured so all binds are ordered within the VM

e.g. A VM can be configured so all binds accept an engine argument (in
the case of the i915 this is likely a gem context handle) and binds are
ordered with respect to that engine.

This gives UMDs options, as the latter likely consumes more KMD resources,
so if a different UMD can live with binds being ordered within the VM
it can use a mode consuming fewer resources.
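A purely hypothetical sketch of what such a uapi shape could look like (all
names are invented here; this is not the RFC's actual struct):

#include <linux/types.h>

#define SKETCH_VM_BIND_ORDER_VM     0   /* all binds ordered within the VM  */
#define SKETCH_VM_BIND_ORDER_QUEUE  1   /* binds ordered per queue_id below */

struct sketch_vm_bind {
        __u32 vm_id;       /* address space to bind into                     */
        __u32 queue_id;    /* e.g. a gem context handle; ignored when the VM
                            * was created with SKETCH_VM_BIND_ORDER_VM       */
        __u32 handle;      /* GEM object handle                              */
        __u32 pad;
        __u64 start;       /* GPU virtual address                            */
        __u64 offset;      /* offset into the object                         */
        __u64 length;      /* length of the mapping                          */
        __u64 flags;
        __u64 extensions;  /* in/out fence extensions chained here           */
};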

Matt

> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
>
Matthew Brost June 1, 2022, 9:18 p.m. UTC | #7
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > +async worker. The binding and unbinding will work like a special GPU engine.
> > +The binding and unbinding operations are serialized and will wait on specified
> > +input fences before the operation and will signal the output fences upon the
> > +completion of the operation. Due to serialization, completion of an operation
> > +will also indicate that all previous operations are also complete.
> 
> I guess we should avoid saying "will immediately start binding/unbinding" if
> there are fences involved.
> 
> And the fact that it's happening in an async worker seem to imply it's not
> immediate.
> 
> 
> I have a question on the behavior of the bind operation when no input fence
> is provided. Let say I do :
> 
> VM_BIND (out_fence=fence1)
> 
> VM_BIND (out_fence=fence2)
> 
> VM_BIND (out_fence=fence3)
> 
> 
> In what order are the fences going to be signaled?
> 
> In the order of VM_BIND ioctls? Or out of order?
> 
> Because you wrote "serialized I assume it's : in order
> 
> 
> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> there is a disconnect from the Vulkan specification.
> 
> In Vulkan VM_BIND operations are serialized but per engine.
> 
> So you could have something like this :
> 
> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> 
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 

Question - let's say this is done after the above operations:

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)

Is the exec ordered with respect to the binds (i.e. would fence3 & 4 be
signaled before the exec starts)?

Matt

> 
> fence1 is not signaled
> 
> fence3 is signaled
> 
> So the second VM_BIND will proceed before the first VM_BIND.
> 
> 
> I guess we can deal with that scenario in userspace by doing the wait
> ourselves in one thread per engines.
> 
> But then it makes the VM_BIND input fences useless.
> 
> 
> Daniel : what do you think? Should be rework this or just deal with wait
> fences in userspace?
> 
> 
> Sorry I noticed this late.
> 
> 
> -Lionel
> 
>
Zeng, Oak June 2, 2022, 2:13 a.m. UTC | #8
Regards,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Niranjana Vishwanathapura
> Sent: May 17, 2022 2:32 PM
> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
>     from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura
> <niranjana.vishwanathapura@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst   |   2 +
>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> +++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst        |   4 +
>  3 files changed, 310 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> 
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> api/dma-buf.rst
> index 36a76cbe9095..64cb924ec5bb 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
> 
> +.. _indefinite_dma_fences:
> +
>  Indefinite DMA Fences
>  ~~~~~~~~~~~~~~~~~~~~~
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index 000000000000..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==========================================
> +I915 VM_BIND feature design and use cases
> +==========================================
> +
> +VM_BIND feature
> +================
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct
> drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
> an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

Hi,

Is the user required to wait for the out fence to be signaled before submitting a GPU job using the vm_bind address?
Or is the user required to order the GPU job so that it runs after the vm_bind out fence is signaled?

I think there could be different behavior on a non-faultable platform and a faultable platform. For example, on a non-faultable
platform the GPU job is required to be ordered after the vm_bind out fence signaling, while on a faultable platform there is no such
restriction since the vm_bind can be finished in the fault handler?

Should we document such things?

Regards,
Oak 


> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +-------------------------------
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> +(https://lwn.net/Articles/859290/)
> +
> +This also means, we need an execbuff extension to pass in the batch
> +buffer addresses (See struct
> drm_i915_gem_execbuffer_ext_batch_addresses).
> +
> +If at all execlist support in execbuff ioctl is deemed necessary for
> +implicit sync in certain use cases, then support can be added later.
> +
> +In VM_BIND mode, VA allocation is completely managed by the user instead of
> +the i915 driver. Hence all VA assignment, eviction are not applicable in
> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
> not
> +be using the i915_vma active reference tracking. It will instead use dma-resv
> +object for that (See `VM_BIND dma_resv usage`_).
> +
> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
> +vma lookup table, implicit sync, vma active reference tracking etc., are not
> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
> +by clearly separating out the functionalities where the VM_BIND mode differs
> +from older method and they should be moved to separate files.
> +
> +VM_PRIVATE objects
> +-------------------
> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> +exported. Hence these BOs are referred to as Shared BOs.
> +During each execbuff submission, the request fence must be added to the
> +dma-resv fence list of all shared BOs mapped on the VM.
> +
> +VM_BIND feature introduces an optimization where user can create BO which
> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
> during
> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
> +the VM they are private to and can't be dma-buf exported.
> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> +submission, they need only one dma-resv fence list updated. Thus, the fast
> +path (where required mappings are already bound) submission latency is O(1)
> +w.r.t the number of VM private BOs.
> +
> +VM_BIND locking hierarchy
> +-------------------------
> +The locking design here supports the older (execlist based) execbuff mode, the
> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
> future
> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
> +The older execbuff mode and the newer VM_BIND mode without page faults
> manages
> +residency of backing storage using dma_fence. The VM_BIND mode with page
> faults
> +and the system allocator support do not use any dma_fence at all.
> +
> +VM_BIND locking order is as below.
> +
> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
> +   mapping.
> +
> +   In future, when GPU page faults are supported, we can potentially use a
> +   rwsem instead, so that multiple page fault handlers can take the read side
> +   lock to lookup the mapping and hence can run in parallel.
> +   The older execbuff mode of binding do not need this lock.
> +
> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
> +   be held while binding/unbinding a vma in the async worker and while updating
> +   dma-resv fence list of an object. Note that private BOs of a VM will all
> +   share a dma-resv object.
> +
> +   The future system allocator support will use the HMM prescribed locking
> +   instead.
> +
> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
> +   invalidated vmas (due to eviction and userptr invalidation) etc.
> +
> +When GPU page faults are supported, the execbuff path do not take any of
> these
> +locks. There we will simply smash the new batch buffer address into the ring
> and
> +then tell the scheduler run that. The lock taking only happens from the page
> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
> +system allocator) and some additional locks (lock-D) for taking care of page
> +table races. Page fault mode should not need to ever manipulate the vm lists,
> +so won't ever need lock-C.
> +
> +VM_BIND LRU handling
> +---------------------
> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
> +performance degradation. We will also need support for bulk LRU movement of
> +VM_BIND objects to avoid additional latencies in execbuff path.
> +
> +The page table pages are similar to VM_BIND mapped objects (See
> +`Evictable page table allocations`_) and are maintained per VM and needs to
> +be pinned in memory when VM is made active (ie., upon an execbuff call with
> +that VM). So, bulk LRU movement of page table pages is also needed.
> +
> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
> +over to the ttm LRU in some fashion to make sure we once again have a
> reasonable
> +and consistent memory aging and reclaim architecture.
> +
> +VM_BIND dma_resv usage
> +-----------------------
> +Fences needs to be added to all VM_BIND mapped objects. During each
> execbuff
> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
> prevent
> +over sync (See enum dma_resv_usage). One can override it with either
> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
> dependency
> +setting (either through explicit or implicit mechanism).
> +
> +When vm_bind is called for a non-private object while the VM is already
> +active, the fences need to be copied from VM's shared dma-resv object
> +(common to all private objects of the VM) to this non-private object.
> +If this results in performance degradation, then some optimization will
> +be needed here. This is not a problem for VM's private objects as they use
> +shared dma-resv object which is always updated on each execbuff submission.
> +
> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
> the
> +older i915_vma active reference tracking which is deprecated. This should be
> +easier to get it working with the current TTM backend. We can remove the
> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
> +
> +Evictable page table allocations
> +---------------------------------
> +Make pagetable allocations evictable and manage them similar to VM_BIND
> +mapped objects. Page table pages are similar to persistent mappings of a
> +VM (difference here are that the page table pages will not have an i915_vma
> +structure and after swapping pages back in, parent page link needs to be
> +updated).
> +
> +Mesa use case
> +--------------
> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
> Iris),
> +hence improving performance of CPU-bound applications. It also allows us to
> +implement Vulkan's Sparse Resources. With increasing GPU hardware
> performance,
> +reducing CPU overhead becomes more impactful.
> +
> +
> +VM_BIND Compute support
> +========================
> +
> +User/Memory Fence
> +------------------
> +The idea is to take a user specified virtual address and install an interrupt
> +handler to wake up the current task when the memory location passes the user
> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
> +user fence, specified value will be written at the specified virtual address
> +and wakeup the waiting process. User can wait on a user fence with the
> +gem_wait_user_fence ioctl.
> +
> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> +interrupt within their batches after updating the value to have sub-batch
> +precision on the wakeup. Each batch can signal a user fence to indicate
> +the completion of next level batch. The completion of very first level batch
> +needs to be signaled by the command streamer. The user must provide the
> +user/memory fence for this via the
> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
> +extension of execbuff ioctl, so that KMD can setup the command streamer to
> +signal it.
> +
> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> +the user process after completion of an asynchronous operation.
> +
> +When VM_BIND ioctl was provided with a user/memory fence via the
> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
> completion
> +of binding of that mapping. All async binds/unbinds are serialized, hence
> +signaling of user/memory fence also indicate the completion of all previous
> +binds/unbinds.
> +
> +This feature will be derived from the below original work:
> +https://patchwork.freedesktop.org/patch/349417/
> +
> +Long running Compute contexts
> +------------------------------
> +Usage of dma-fence expects that they complete in reasonable amount of time.
> +Compute on the other hand can be long running. Hence it is appropriate for
> +compute to use user/memory fence and dma-fence usage will be limited to
> +in-kernel consumption only. This requires an execbuff uapi extension to pass
> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
> opt-in
> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
> during
> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
> +execbuff out fence are not allowed on long running contexts. Implicit sync is
> +not valid as well and is anyway not supported in VM_BIND mode.
> +
> +Where GPU page faults are not available, kernel driver upon buffer invalidation
> +will initiate a suspend (preemption) of long running context with a dma-fence
> +attached to it. And upon completion of that suspend fence, finish the
> +invalidation, revalidate the BO and then resume the compute context. This is
> +done by having a per-context preempt fence (also called suspend fence)
> proxying
> +as i915_request fence. This suspend fence is enabled when someone tries to
> wait
> +on it, which then triggers the context preemption.
> +
> +As this support for context suspension using a preempt fence and the resume
> work
> +for the compute mode contexts can get tricky to get it right, it is better to
> +add this support in drm scheduler so that multiple drivers can make use of it.
> +That means, it will have a dependency on i915 drm scheduler conversion with
> GuC
> +scheduler backend. This should be fine, as the plan is to support compute mode
> +contexts only with GuC scheduler backend (at least initially). This is much
> +easier to support with VM_BIND mode compared to the current heavier
> execbuff
> +path resource attachment.
> +
> +Low Latency Submission
> +-----------------------
> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
> +ioctl. This is made possible by VM_BIND is not being synchronized against
> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
> +submitted jobs.
> +
> +Other VM_BIND use cases
> +========================
> +
> +Debugger
> +---------
> +With debug event interface user space process (debugger) is able to keep track
> +of and act upon resources created by another process (debugged) and attached
> +to GPU via vm_bind interface.
> +
> +GPU page faults
> +----------------
> +GPU page faults when supported (in future), will only be supported in the
> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
> mode of
> +binding will require using dma-fence to ensure residency, the GPU page faults
> +mode when supported, will not use any dma-fence as residency is purely
> managed
> +by installing and removing/invalidating page table entries.
> +
> +Page level hints settings
> +--------------------------
> +VM_BIND allows any hints setting per mapping instead of per BO.
> +Possible hints include read-only mapping, placement and atomicity.
> +Sub-BO level placement hint will be even more relevant with
> +upcoming GPU on-demand page fault support.
> +
> +Page level Cache/CLOS settings
> +-------------------------------
> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> +
> +Shared Virtual Memory (SVM) support
> +------------------------------------
> +VM_BIND interface can be used to map system memory directly (without gem
> BO
> +abstraction) using the HMM interface. SVM is only supported with GPU page
> +faults enabled.
> +
> +
> +Broader i915 cleanups
> +=====================
> +Supporting this whole new vm_bind mode of binding which comes with its own
> +use cases to support and the locking requirements requires proper integration
> +with the existing i915 driver. This calls for some broader i915 driver
> +cleanups/simplifications for maintainability of the driver going forward.
> +Here are few things identified and are being looked into.
> +
> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
> feature
> +  do not use it and complexity it brings in is probably more than the
> +  performance advantage we get in legacy execbuff case.
> +- Remove vma->open_count counting
> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
> using
> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
> +  is active or not.
> +
> +
> +VM_BIND UAPI
> +=============
> +
> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 91e93a705230..7d10c36b268d 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -23,3 +23,7 @@ host such documentation:
>  .. toctree::
> 
>      i915_scheduler.rst
> +
> +.. toctree::
> +
> +    i915_vm_bind.rst
> --
> 2.21.0.rc0.32.g243a4c7e27
Lionel Landwerlin June 2, 2022, 5:42 a.m. UTC | #9
On 02/06/2022 00:18, Matthew Brost wrote:
> On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>> +async worker. The binding and unbinding will work like a special GPU engine.
>>> +The binding and unbinding operations are serialized and will wait on specified
>>> +input fences before the operation and will signal the output fences upon the
>>> +completion of the operation. Due to serialization, completion of an operation
>>> +will also indicate that all previous operations are also complete.
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>
>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized I assume it's : in order
>>
>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
> Question - let's say this done after the above operations:
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>
> Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> signaled before the exec starts)?
>
> Matt


Hi Matt,

From the Vulkan point of view, everything is serialized within an 
engine (we map that to a VkQueue).

So with :

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

EXEC completes first then VM_BIND executes.


To be even clearer :

EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


EXEC will wait until fence2 is signaled.
Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.

It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.

-Lionel


>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>
Matthew Brost June 2, 2022, 4:22 p.m. UTC | #10
On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> > > > +async worker. The binding and unbinding will work like a special GPU engine.
> > > > +The binding and unbinding operations are serialized and will wait on specified
> > > > +input fences before the operation and will signal the output fences upon the
> > > > +completion of the operation. Due to serialization, completion of an operation
> > > > +will also indicate that all previous operations are also complete.
> > > I guess we should avoid saying "will immediately start binding/unbinding" if
> > > there are fences involved.
> > > 
> > > And the fact that it's happening in an async worker seem to imply it's not
> > > immediate.
> > > 
> > > 
> > > I have a question on the behavior of the bind operation when no input fence
> > > is provided. Let say I do :
> > > 
> > > VM_BIND (out_fence=fence1)
> > > 
> > > VM_BIND (out_fence=fence2)
> > > 
> > > VM_BIND (out_fence=fence3)
> > > 
> > > 
> > > In what order are the fences going to be signaled?
> > > 
> > > In the order of VM_BIND ioctls? Or out of order?
> > > 
> > > Because you wrote "serialized I assume it's : in order
> > > 
> > > 
> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
> > > there is a disconnect from the Vulkan specification.
> > > 
> > > In Vulkan VM_BIND operations are serialized but per engine.
> > > 
> > > So you could have something like this :
> > > 
> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> > > 
> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> > > 
> > Question - let's say this done after the above operations:
> > 
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> > 
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> > 
> > Matt
> 
> 
> Hi Matt,
> 
> From the vulkan point of view, everything is serialized within an engine (we
> map that to a VkQueue).
> 
> So with :
> 
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> EXEC completes first then VM_BIND executes.
> 
> 
> To be even clearer :
> 
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> 
> 
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
> 
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
> 

Yea this makes sense. I think of VM_BINDs as more or less just another
version of an EXEC and this fits with that.

In practice I don't think we can share a ring but we should be able to
present an engine (again likely a gem context in i915) to the user that
orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.

Hopefully Niranjana + Daniel agree.

Matt

> -Lionel
> 
> 
> > 
> > > fence1 is not signaled
> > > 
> > > fence3 is signaled
> > > 
> > > So the second VM_BIND will proceed before the first VM_BIND.
> > > 
> > > 
> > > I guess we can deal with that scenario in userspace by doing the wait
> > > ourselves in one thread per engines.
> > > 
> > > But then it makes the VM_BIND input fences useless.
> > > 
> > > 
> > > Daniel : what do you think? Should be rework this or just deal with wait
> > > fences in userspace?
> > > 
> > > 
> > > Sorry I noticed this late.
> > > 
> > > 
> > > -Lionel
> > > 
> > > 
>
Niranjana Vishwanathapura June 2, 2022, 8:11 p.m. UTC | #11
On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > +async worker. The binding and unbinding will work like a special GPU engine.
>> > +The binding and unbinding operations are serialized and will wait on specified
>> > +input fences before the operation and will signal the output fences upon the
>> > +completion of the operation. Due to serialization, completion of an operation
>> > +will also indicate that all previous operations are also complete.
>>
>> I guess we should avoid saying "will immediately start binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to imply it's not
>> immediate.
>>

Ok, will fix.
This was added because in the earlier design binding was deferred until the next execbuff.
But now it is non-deferred (immediate in that sense). But yeah, this is confusing
and I will fix it.

>>
>> I have a question on the behavior of the bind operation when no input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized I assume it's : in order
>>

Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use
the same queue and hence are ordered.

>>
>> One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal with wait
>> fences in userspace?
>>
>
>My opinion is rework this but make the ordering via an engine param optional.
>
>e.g. A VM can be configured so all binds are ordered within the VM
>
>e.g. A VM can be configured so all binds accept an engine argument (in
>the case of the i915 likely this is a gem context handle) and binds
>ordered with respect to that engine.
>
>This gives UMDs options as the later likely consumes more KMD resources
>so if a different UMD can live with binds being ordered within the VM
>they can use a mode consuming less resources.
>

I think we need to be careful here if we are looking for some out of
(submission) order completion of vm_bind/unbind.
In-order completion means that, in a batch of binds and unbinds to be
completed in-order, the user only needs to specify an in-fence for the
first bind/unbind call and the out-fence for the last bind/unbind
call. Also, the VA released by an unbind call can be re-used by
any subsequent bind call in that in-order batch.

These things will break if binding/unbinding were allowed to go
out of (submission) order, and the user would need to be extra careful
not to run into premature triggering of the out-fence, binds failing
because the VA is still in use, etc.
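
As a minimal sketch of such an in-order batch (vm_bind()/vm_unbind() below
are hypothetical stand-ins for the future ioctls, not existing uapi): only
the first call carries the in-fence and only the last carries the out-fence,
and the VA freed by the unbind can be reused by a later bind because the
queue completes in submission order.

#include <stdint.h>
#include <stddef.h>

struct fence { uint32_t syncobj; uint64_t point; };     /* illustrative only */

/* Hypothetical wrappers; NULL means "no fence" in this sketch. */
static int vm_unbind(int fd, uint32_t vm, uint64_t va, uint64_t len,
                     const struct fence *in, const struct fence *out)
{ (void)fd; (void)vm; (void)va; (void)len; (void)in; (void)out; return 0; }

static int vm_bind(int fd, uint32_t vm, uint32_t bo, uint64_t va, uint64_t len,
                   const struct fence *in, const struct fence *out)
{ (void)fd; (void)vm; (void)bo; (void)va; (void)len; (void)in; (void)out; return 0; }

void rebind_range(int fd, uint32_t vm, uint32_t new_bo, uint64_t va,
                  uint64_t len, const struct fence *in, const struct fence *out)
{
        vm_unbind(fd, vm, va, len, in, NULL);           /* waits on 'in' */
        vm_bind(fd, vm, new_bo, va, len, NULL, out);    /* reuses the freed VA;
                                                           'out' signals when
                                                           both ops are done */
}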

Also, VM_BIND binds the provided mapping on the specified address space
(VM). So, the uapi is not engine/context specific.

We can however add a 'queue' to the uapi, which can be one of the
pre-defined queues:
I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)

KMD will spawn an async work queue for each queue, which will only
bind the mappings on that queue in the order of submission.
The user can assign a queue per engine or anything like that.

But again here, the user needs to be careful not to deadlock these
queues with a circular dependency of fences.
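
Purely as an illustration of the idea (this is not a proposed uapi layout),
the 'queue' could simply be an index in the ioctl argument selecting which
in-order async bind worker handles the operation:

#include <linux/types.h>

#define I915_VM_BIND_QUEUE_0            0
#define I915_VM_BIND_QUEUE_1            1
/* ... */
#define I915_VM_BIND_NUM_QUEUES         8       /* arbitrary illustrative limit */

struct hypothetical_i915_gem_vm_bind {
        __u32 vm_id;            /* address space (VM) to bind into */
        __u32 queue_idx;        /* one of I915_VM_BIND_QUEUE_*, chosen by the UMD */
        __u64 start;            /* GPU virtual address of the mapping */
        __u64 offset;           /* offset into the BO */
        __u64 length;           /* length of the mapping */
        __u32 handle;           /* GEM BO handle */
        __u32 flags;
        __u64 extensions;       /* in/out fence extensions would chain here */
};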

I prefer adding this later as an extension, based on whether it
really helps with the implementation.

Daniel, any thoughts?

Niranjana

>Matt
>
>>
>> Sorry I noticed this late.
>>
>>
>> -Lionel
>>
>>
Bas Nieuwenhuizen June 2, 2022, 8:16 p.m. UTC | #12
On Thu, Jun 2, 2022 at 7:42 AM Lionel Landwerlin
<lionel.g.landwerlin@intel.com> wrote:
>
> On 02/06/2022 00:18, Matthew Brost wrote:
> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
> >>> +async worker. The binding and unbinding will work like a special GPU engine.
> >>> +The binding and unbinding operations are serialized and will wait on specified
> >>> +input fences before the operation and will signal the output fences upon the
> >>> +completion of the operation. Due to serialization, completion of an operation
> >>> +will also indicate that all previous operations are also complete.
> >> I guess we should avoid saying "will immediately start binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seem to imply it's not
> >> immediate.
> >>
> >>
> >> I have a question on the behavior of the bind operation when no input fence
> >> is provided. Let say I do :
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized I assume it's : in order
> >>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND" engine,
> >> there is a disconnect from the Vulkan specification.

Note that in Vulkan not every queue has to support sparse binding, so
one could consider a dedicated sparse binding only queue family.

> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> > Question - let's say this done after the above operations:
> >
> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> >
> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
> > signaled before the exec starts)?
> >
> > Matt
>
>
> Hi Matt,
>
>  From the vulkan point of view, everything is serialized within an
> engine (we map that to a VkQueue).
>
> So with :
>
> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
> EXEC completes first then VM_BIND executes.
>
>
> To be even clearer :
>
> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>
>
> EXEC will wait until fence2 is signaled.
> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
>
> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
>
> -Lionel
>
>
> >
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engines.
> >>
> >> But then it makes the VM_BIND input fences useless.

I posed the same question on my series for AMD
(https://patchwork.freedesktop.org/series/104578/), albeit for
slightly different reasons: if one creates a new VkMemory object, you
generally want that mapped ASAP, as you can't track (in a
VK_KHR_descriptor_indexing world) whether the next submit is going to
use this VkMemory object and hence you have to assume the worst (i.e. wait
till the map/bind is complete before executing the next submission).
If all binds/unbinds (or maps/unmaps) happen in-order, that means an
operation with input fences could delay stuff we want ASAP.

Of course waiting in userspace does have disadvantages:

1) More overhead between fence signalling and the operation,
potentially causing slightly bigger GPU bubbles.
2) You can't get an out fence early. Within the driver we can mostly
work around this but sync_fd exports, WSI and such will be messy.
3) Moving the queue to a thread might make things slightly less ideal
due to scheduling delays.

Removing the in-order behaviour in the kernel generally seems like
madness to me, as it is very hard to keep track of the state of the
virtual address space (to e.g. track unmapping stuff before freeing
memory or moving memory around).

The one game I tried (FH5 over vkd3d-proton) does sparse mapping as follows:

separate queue:
1) 0 cmdbuffer submit with 0 input semaphores and 1 output semaphore
2) sparse bind with input semaphore from 1 and 1 output semaphore
3) 0 cmdbuffer submit with input semaphore from 2 and 1 output fence
4) wait on that fence on the CPU

which works very well if we just wait for the sparse bind input
semaphore in userspace, but I'm still working on seeing if this is the
common use case or an outlier.
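
From the application's point of view, that pattern looks roughly like the
sketch below (error handling and the actual VkSparseBufferMemoryBindInfo
entries omitted; the queue, semaphores and fence are assumed to already
exist, and the queue to support sparse binding):

#include <stdint.h>
#include <vulkan/vulkan.h>

void sparse_bind_pattern(VkDevice device, VkQueue sparse_queue,
                         VkSemaphore sem1, VkSemaphore sem2, VkFence fence)
{
        /* 1) empty submit, signals sem1 */
        VkSubmitInfo submit1 = {
                .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                .signalSemaphoreCount = 1,
                .pSignalSemaphores = &sem1,
        };
        vkQueueSubmit(sparse_queue, 1, &submit1, VK_NULL_HANDLE);

        /* 2) sparse bind, waits on sem1, signals sem2 (actual binds omitted) */
        VkBindSparseInfo bind = {
                .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
                .waitSemaphoreCount = 1,
                .pWaitSemaphores = &sem1,
                .signalSemaphoreCount = 1,
                .pSignalSemaphores = &sem2,
        };
        vkQueueBindSparse(sparse_queue, 1, &bind, VK_NULL_HANDLE);

        /* 3) empty submit, waits on sem2, signals the fence */
        VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
        VkSubmitInfo submit2 = {
                .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                .waitSemaphoreCount = 1,
                .pWaitSemaphores = &sem2,
                .pWaitDstStageMask = &wait_stage,
        };
        vkQueueSubmit(sparse_queue, 1, &submit2, fence);

        /* 4) CPU waits for the whole chain to complete */
        vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
}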



> >>
> >>
> >> Daniel : what do you think? Should be rework this or just deal with wait
> >> fences in userspace?
> >>
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>
Niranjana Vishwanathapura June 2, 2022, 8:24 p.m. UTC | #13
On Thu, Jun 02, 2022 at 09:22:46AM -0700, Matthew Brost wrote:
>On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
>> On 02/06/2022 00:18, Matthew Brost wrote:
>> > On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> > > On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > > > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> > > > +async worker. The binding and unbinding will work like a special GPU engine.
>> > > > +The binding and unbinding operations are serialized and will wait on specified
>> > > > +input fences before the operation and will signal the output fences upon the
>> > > > +completion of the operation. Due to serialization, completion of an operation
>> > > > +will also indicate that all previous operations are also complete.
>> > > I guess we should avoid saying "will immediately start binding/unbinding" if
>> > > there are fences involved.
>> > >
>> > > And the fact that it's happening in an async worker seem to imply it's not
>> > > immediate.
>> > >
>> > >
>> > > I have a question on the behavior of the bind operation when no input fence
>> > > is provided. Let say I do :
>> > >
>> > > VM_BIND (out_fence=fence1)
>> > >
>> > > VM_BIND (out_fence=fence2)
>> > >
>> > > VM_BIND (out_fence=fence3)
>> > >
>> > >
>> > > In what order are the fences going to be signaled?
>> > >
>> > > In the order of VM_BIND ioctls? Or out of order?
>> > >
>> > > Because you wrote "serialized I assume it's : in order
>> > >
>> > >
>> > > One thing I didn't realize is that because we only get one "VM_BIND" engine,
>> > > there is a disconnect from the Vulkan specification.
>> > >
>> > > In Vulkan VM_BIND operations are serialized but per engine.
>> > >
>> > > So you could have something like this :
>> > >
>> > > VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>> > >
>> > > VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>> > >
>> > Question - let's say this done after the above operations:
>> >
>> > EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> >
>> > Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
>> > signaled before the exec starts)?
>> >
>> > Matt
>>
>>
>> Hi Matt,
>>
>> From the vulkan point of view, everything is serialized within an engine (we
>> map that to a VkQueue).
>>
>> So with :
>>
>> EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>> EXEC completes first then VM_BIND executes.
>>
>>
>> To be even clearer :
>>
>> EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> EXEC will wait until fence2 is signaled.
>> Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
>>
>> It would kind of like having the VM_BIND operation be another batch executed from the ringbuffer buffer.
>>
>
>Yea this makes sense. I think of VM_BINDs as more or less just another
>version of an EXEC and this fits with that.
>

Note that VM_BIND itself can bind while an EXEC (GPU job) is running
(say, getting binds ready for the next submission). It is up to the user,
though, how to use it.

>In practice I don't think we can share a ring but we should be able to
>present an engine (again likely a gem context in i915) to the user that
>orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.
>

I have responded in the other thread on this.

Niranjana

>Hopefully Niranjana + Daniel agree.
>
>Matt
>
>> -Lionel
>>
>>
>> >
>> > > fence1 is not signaled
>> > >
>> > > fence3 is signaled
>> > >
>> > > So the second VM_BIND will proceed before the first VM_BIND.
>> > >
>> > >
>> > > I guess we can deal with that scenario in userspace by doing the wait
>> > > ourselves in one thread per engines.
>> > >
>> > > But then it makes the VM_BIND input fences useless.
>> > >
>> > >
>> > > Daniel : what do you think? Should be rework this or just deal with wait
>> > > fences in userspace?
>> > >
>> > >
>> > > Sorry I noticed this late.
>> > >
>> > >
>> > > -Lionel
>> > >
>> > >
>>
Jason Ekstrand June 2, 2022, 8:35 p.m. UTC | #14
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the
> mapping in an
> >> > +async worker. The binding and unbinding will work like a special GPU
> engine.
> >> > +The binding and unbinding operations are serialized and will wait on
> specified
> >> > +input fences before the operation and will signal the output fences
> upon the
> >> > +completion of the operation. Due to serialization, completion of an
> operation
> >> > +will also indicate that all previous operations are also complete.
> >>
> >> I guess we should avoid saying "will immediately start
> binding/unbinding" if
> >> there are fences involved.
> >>
> >> And the fact that it's happening in an async worker seem to imply it's
> not
> >> immediate.
> >>
>
> Ok, will fix.
> This was added because in earlier design binding was deferred until next
> execbuff.
> But now it is non-deferred (immediate in that sense). But yah, this is
> confusing
> and will fix it.
>
> >>
> >> I have a question on the behavior of the bind operation when no input
> fence
> >> is provided. Let say I do :
> >>
> >> VM_BIND (out_fence=fence1)
> >>
> >> VM_BIND (out_fence=fence2)
> >>
> >> VM_BIND (out_fence=fence3)
> >>
> >>
> >> In what order are the fences going to be signaled?
> >>
> >> In the order of VM_BIND ioctls? Or out of order?
> >>
> >> Because you wrote "serialized I assume it's : in order
> >>
>
> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will
> use
> the same queue and hence are ordered.
>
> >>
> >> One thing I didn't realize is that because we only get one "VM_BIND"
> engine,
> >> there is a disconnect from the Vulkan specification.
> >>
> >> In Vulkan VM_BIND operations are serialized but per engine.
> >>
> >> So you could have something like this :
> >>
> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>
> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>
> >>
> >> fence1 is not signaled
> >>
> >> fence3 is signaled
> >>
> >> So the second VM_BIND will proceed before the first VM_BIND.
> >>
> >>
> >> I guess we can deal with that scenario in userspace by doing the wait
> >> ourselves in one thread per engines.
> >>
> >> But then it makes the VM_BIND input fences useless.
> >>
> >>
> >> Daniel : what do you think? Should be rework this or just deal with wait
> >> fences in userspace?
> >>
> >
> >My opinion is rework this but make the ordering via an engine param
> optional.
> >
> >e.g. A VM can be configured so all binds are ordered within the VM
> >
> >e.g. A VM can be configured so all binds accept an engine argument (in
> >the case of the i915 likely this is a gem context handle) and binds
> >ordered with respect to that engine.
> >
> >This gives UMDs options as the later likely consumes more KMD resources
> >so if a different UMD can live with binds being ordered within the VM
> >they can use a mode consuming less resources.
> >
>
> I think we need to be careful here if we are looking for some out of
> (submission) order completion of vm_bind/unbind.
> In-order completion means, in a batch of binds and unbinds to be
> completed in-order, user only needs to specify in-fence for the
> first bind/unbind call and the our-fence for the last bind/unbind
> call. Also, the VA released by an unbind call can be re-used by
> any subsequent bind call in that in-order batch.
>
> These things will break if binding/unbinding were to be allowed to
> go out of order (of submission) and user need to be extra careful
> not to run into pre-mature triggereing of out-fence and bind failing
> as VA is still in use etc.
>
> Also, VM_BIND binds the provided mapping on the specified address space
> (VM). So, the uapi is not engine/context specific.
>
> We can however add a 'queue' to the uapi which can be one from the
> pre-defined queues,
> I915_VM_BIND_QUEUE_0
> I915_VM_BIND_QUEUE_1
> ...
> I915_VM_BIND_QUEUE_(N-1)
>
> KMD will spawn an async work queue for each queue which will only
> bind the mappings on that queue in the order of submission.
> User can assign the queue to per engine or anything like that.
>
> But again here, user need to be careful and not deadlock these
> queues with circular dependency of fences.
>
> I prefer adding this later an as extension based on whether it
> is really helping with the implementation.
>

I can tell you right now that having everything on a single in-order queue
will not get us the perf we want.  What Vulkan really wants is one of two
things:

 1. No implicit ordering of VM_BIND ops.  They just happen in whatever
order their dependencies are resolved, and we ensure ordering ourselves by
having a syncobj in the VkQueue.

 2. The ability to create multiple VM_BIND queues.  We need at least 2 but
I don't see why there needs to be a limit besides the limits the i915 API
already has on the number of engines.  Vulkan could expose multiple sparse
binding queues to the client if it's not arbitrarily limited.

Why?  Because Vulkan has two basic kinds of bind operations and we don't
want any dependencies between them:

 1. Immediate.  These happen right after BO creation or maybe as part of
vkBindImageMemory() or vkBindBufferMemory().  These don't happen on a queue
and we don't want them serialized with anything.  To synchronize with
submit, we'll have a syncobj in the VkDevice which is signaled by all
immediate bind operations and make submits wait on it.

 2. Queued (sparse): These happen on a VkQueue which may be the same as a
render/compute queue or may be its own queue.  It's up to us what we want
to advertise.  From the Vulkan API PoV, this is like any other queue.
Operations on it wait on and signal semaphores.  If we have a VM_BIND
engine, we'd provide syncobjs to wait and signal just like we do in
execbuf().

The important thing is that we don't want one type of operation to block on
the other.  If immediate binds are blocking on sparse binds, it's going to
cause over-synchronization issues.
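
As a rough sketch of the immediate-bind side of this (assuming the future
VM_BIND ioctl can signal a timeline drm_syncobj out-fence;
immediate_vm_bind() below is a hypothetical stand-in, not existing uapi),
the UMD only needs a device-wide syncobj and a point counter:

#include <stdint.h>

struct umd_device {
        int      drm_fd;
        uint32_t bind_timeline;         /* one drm_syncobj per VkDevice */
        uint64_t bind_point;            /* last point an immediate bind signals */
};

/* Hypothetical: VM_BIND with no in-fence, out-fence = (bind_timeline, point). */
static int immediate_vm_bind(struct umd_device *dev, uint32_t vm, uint32_t bo,
                             uint64_t va, uint64_t len, uint64_t signal_point)
{
        (void)dev; (void)vm; (void)bo; (void)va; (void)len; (void)signal_point;
        return 0;       /* stand-in for the future ioctl */
}

/* Called from vkBindBufferMemory()/vkBindImageMemory(): never waits, only signals. */
uint64_t bind_memory_immediate(struct umd_device *dev, uint32_t vm,
                               uint32_t bo, uint64_t va, uint64_t len)
{
        uint64_t point = ++dev->bind_point;

        immediate_vm_bind(dev, vm, bo, va, len, point);
        return point;
}

/*
 * Every execbuf then adds (bind_timeline, bind_point) to its wait list, so
 * submissions never run ahead of an in-flight immediate bind, while sparse
 * queue binds use their own VkQueue semaphores and never block on this.
 */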

In terms of the internal implementation, I know that there's going to be a
lock on the VM and that we can't actually do these things in parallel.
That's fine.  Once the dma_fences have signaled and we're unblocked to do
the bind operation, I don't care if there's a bit of synchronization due to
locking.  That's expected.  What we can't afford to have is an immediate
bind operation suddenly blocking on a sparse operation which is blocked on
a compute job that's going to run for another 5ms.

For reference, Windows solves this by allowing arbitrarily many paging
queues (what they call a VM_BIND engine/queue).  That design works pretty
well and solves the problems in question.  Again, we could just make
everything out-of-order and require using syncobjs to order things as
userspace wants. That'd be fine too.

One more note while I'm here: danvet said something on IRC about VM_BIND
queues waiting for syncobjs to materialize.  We don't really want/need
this.  We already have all the machinery in userspace to handle
wait-before-signal and waiting for syncobj fences to materialize and that
machinery is on by default.  It would actually take MORE work in Mesa to
turn it off and take advantage of the kernel being able to wait for
syncobjs to materialize.  Also, getting that right is ridiculously hard and
I really don't want to get it wrong in kernel space.  When we do memory
fences, wait-before-signal will be a thing.  We don't need to try and make
it a thing for syncobj.

--Jason


> Daniel, any thoughts?
>
> Niranjana
>
> >Matt
> >
> >>
> >> Sorry I noticed this late.
> >>
> >>
> >> -Lionel
> >>
> >>
>
Niranjana Vishwanathapura June 2, 2022, 8:48 p.m. UTC | #15
On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
>
>
>Regards,
>Oak
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> Niranjana Vishwanathapura
>> Sent: May 17, 2022 2:32 PM
>> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
>> Daniel <daniel.vetter@intel.com>
>> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
>> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
>> <chris.p.wilson@intel.com>; christian.koenig@amd.com
>> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
>>
>> VM_BIND design document with description of intended use cases.
>>
>> v2: Add more documentation and format as per review comments
>>     from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com>
>> ---
>>  Documentation/driver-api/dma-buf.rst   |   2 +
>>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
>> +++++++++++++++++++++++++
>>  Documentation/gpu/rfc/index.rst        |   4 +
>>  3 files changed, 310 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>>
>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
>> api/dma-buf.rst
>> index 36a76cbe9095..64cb924ec5bb 100644
>> --- a/Documentation/driver-api/dma-buf.rst
>> +++ b/Documentation/driver-api/dma-buf.rst
>> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>>  .. kernel-doc:: include/linux/sync_file.h
>>     :internal:
>>
>> +.. _indefinite_dma_fences:
>> +
>>  Indefinite DMA Fences
>>  ~~~~~~~~~~~~~~~~~~~~~
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
>> b/Documentation/gpu/rfc/i915_vm_bind.rst
>> new file mode 100644
>> index 000000000000..f1be560d313c
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> @@ -0,0 +1,304 @@
>> +==========================================
>> +I915 VM_BIND feature design and use cases
>> +==========================================
>> +
>> +VM_BIND feature
>> +================
>> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
>> buffer
>> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
>> +specified address space (VM). These mappings (also referred to as persistent
>> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
>> +issued by the UMD, without user having to provide a list of all required
>> +mappings during each submission (as required by older execbuff mode).
>> +
>> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
>> +to specify how the binding/unbinding should sync with other operations
>> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
>> +for non-Compute contexts (See struct
>> drm_i915_vm_bind_ext_timeline_fences).
>> +For Compute contexts, they will be user/memory fences (See struct
>> +drm_i915_vm_bind_ext_user_fence).
>> +
>> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
>> extension.
>> +
>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
>> an
>> +async worker. The binding and unbinding will work like a special GPU engine.
>> +The binding and unbinding operations are serialized and will wait on specified
>> +input fences before the operation and will signal the output fences upon the
>> +completion of the operation. Due to serialization, completion of an operation
>> +will also indicate that all previous operations are also complete.
>
>Hi,
>
>Is user required to wait for the out fence be signaled before submit a gpu job using the vm_bind address?
>Or is user required to order the gpu job to make gpu job run after vm_bind out fence signaled?
>

Thanks Oak,
Either should be fine; it is up to the user how to use the vm_bind/unbind out-fence.

>I think there could be different behavior on a non-faultable platform and a faultable platform, such as on a non-faultable
>Platform, gpu job is required to be order after vm_bind out fence signaling; and on a faultable platform, there is no such
>Restriction since vm bind can be finished in the fault handler?
>

With the GPU page fault handler, the out-fence won't be needed, as residency is
purely managed by the page fault handler populating the page tables (there is a
mention of it in the GPU Page Faults section below).

>Should we document such thing?
>

We don't talk much about the GPU page faults case in this document, as that may
warrant a separate RFC when we add page fault support. We did mention it
in a couple of places to ensure our locking design here is extensible to the gpu
page faults case.

Niranjana

>Regards,
>Oak
>
>
>> +
>> +VM_BIND features include:
>> +
>> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
>> +  of an object (aliasing).
>> +* VA mapping can map to a partial section of the BO (partial binding).
>> +* Support capture of persistent mappings in the dump upon GPU error.
>> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> +  use cases will be helpful.
>> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
>> +* Support for userptr gem objects (no special uapi is required for this).
>> +
>> +Execbuff ioctl in VM_BIND mode
>> +-------------------------------
>> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
>> +older method. A VM in VM_BIND mode will not support older execbuff mode of
>> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
>> +no support for implicit sync. It is expected that the below work will be able
>> +to support requirements of object dependency setting in all use cases:
>> +
>> +"dma-buf: Add an API for exporting sync files"
>> +(https://lwn.net/Articles/859290/)
>> +
>> +This also means, we need an execbuff extension to pass in the batch
>> +buffer addresses (See struct
>> drm_i915_gem_execbuffer_ext_batch_addresses).
>> +
>> +If at all execlist support in execbuff ioctl is deemed necessary for
>> +implicit sync in certain use cases, then support can be added later.
>> +
>> +In VM_BIND mode, VA allocation is completely managed by the user instead of
>> +the i915 driver. Hence all VA assignment, eviction are not applicable in
>> +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will
>> not
>> +be using the i915_vma active reference tracking. It will instead use dma-resv
>> +object for that (See `VM_BIND dma_resv usage`_).
>> +
>> +So, a lot of existing code in the execbuff path like relocations, VA evictions,
>> +vma lookup table, implicit sync, vma active reference tracking etc., are not
>> +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up
>> +by clearly separating out the functionalities where the VM_BIND mode differs
>> +from older method and they should be moved to separate files.
>> +
>> +VM_PRIVATE objects
>> +-------------------
>> +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> +exported. Hence these BOs are referred to as Shared BOs.
>> +During each execbuff submission, the request fence must be added to the
>> +dma-resv fence list of all shared BOs mapped on the VM.
>> +
>> +VM_BIND feature introduces an optimization where user can create BO which
>> +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag
>> during
>> +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> +the VM they are private to and can't be dma-buf exported.
>> +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> +submission, they need only one dma-resv fence list updated. Thus, the fast
>> +path (where required mappings are already bound) submission latency is O(1)
>> +w.r.t the number of VM private BOs.
>> +
>> +VM_BIND locking hirarchy
>> +-------------------------
>> +The locking design here supports the older (execlist based) execbuff mode, the
>> +newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible
>> future
>> +system allocator support (See `Shared Virtual Memory (SVM) support`_).
>> +The older execbuff mode and the newer VM_BIND mode without page faults
>> manages
>> +residency of backing storage using dma_fence. The VM_BIND mode with page
>> faults
>> +and the system allocator support do not use any dma_fence at all.
>> +
>> +VM_BIND locking order is as below.
>> +
>> +1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
>> +   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
>> +   mapping.
>> +
>> +   In future, when GPU page faults are supported, we can potentially use a
>> +   rwsem instead, so that multiple page fault handlers can take the read side
>> +   lock to lookup the mapping and hence can run in parallel.
>> +   The older execbuff mode of binding do not need this lock.
>> +
>> +2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
>> +   be held while binding/unbinding a vma in the async worker and while updating
>> +   dma-resv fence list of an object. Note that private BOs of a VM will all
>> +   share a dma-resv object.
>> +
>> +   The future system allocator support will use the HMM prescribed locking
>> +   instead.
>> +
>> +3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
>> +   invalidated vmas (due to eviction and userptr invalidation) etc.
>> +
>> +When GPU page faults are supported, the execbuff path do not take any of
>> these
>> +locks. There we will simply smash the new batch buffer address into the ring
>> and
>> +then tell the scheduler run that. The lock taking only happens from the page
>> +fault handler, where we take lock-A in read mode, whichever lock-B we need to
>> +find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
>> +system allocator) and some additional locks (lock-D) for taking care of page
>> +table races. Page fault mode should not need to ever manipulate the vm lists,
>> +so won't ever need lock-C.
>> +
>> +VM_BIND LRU handling
>> +---------------------
>> +We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
>> +performance degradation. We will also need support for bulk LRU movement of
>> +VM_BIND objects to avoid additional latencies in execbuff path.
>> +
>> +The page table pages are similar to VM_BIND mapped objects (See
>> +`Evictable page table allocations`_) and are maintained per VM and needs to
>> +be pinned in memory when VM is made active (ie., upon an execbuff call with
>> +that VM). So, bulk LRU movement of page table pages is also needed.
>> +
>> +The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
>> +over to the ttm LRU in some fashion to make sure we once again have a
>> reasonable
>> +and consistent memory aging and reclaim architecture.
>> +
>> +VM_BIND dma_resv usage
>> +-----------------------
>> +Fences needs to be added to all VM_BIND mapped objects. During each
>> execbuff
>> +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to
>> prevent
>> +over sync (See enum dma_resv_usage). One can override it with either
>> +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object
>> dependency
>> +setting (either through explicit or implicit mechanism).
>> +
>> +When vm_bind is called for a non-private object while the VM is already
>> +active, the fences need to be copied from VM's shared dma-resv object
>> +(common to all private objects of the VM) to this non-private object.
>> +If this results in performance degradation, then some optimization will
>> +be needed here. This is not a problem for VM's private objects as they use
>> +shared dma-resv object which is always updated on each execbuff submission.
>> +
>> +Also, in VM_BIND mode, use dma-resv apis for determining object activeness
>> +(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use
>> the
>> +older i915_vma active reference tracking which is deprecated. This should be
>> +easier to get it working with the current TTM backend. We can remove the
>> +i915_vma active reference tracking fully while supporting TTM backend for igfx.
>> +
>> +Evictable page table allocations
>> +---------------------------------
>> +Make pagetable allocations evictable and manage them similar to VM_BIND
>> +mapped objects. Page table pages are similar to persistent mappings of a
>> +VM (difference here are that the page table pages will not have an i915_vma
>> +structure and after swapping pages back in, parent page link needs to be
>> +updated).
>> +
>> +Mesa use case
>> +--------------
>> +VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
>> Iris),
>> +hence improving performance of CPU-bound applications. It also allows us to
>> +implement Vulkan's Sparse Resources. With increasing GPU hardware
>> performance,
>> +reducing CPU overhead becomes more impactful.
>> +
>> +
>> +VM_BIND Compute support
>> +========================
>> +
>> +User/Memory Fence
>> +------------------
>> +The idea is to take a user specified virtual address and install an interrupt
>> +handler to wake up the current task when the memory location passes the user
>> +supplied filter. User/Memory fence is a <address, value> pair. To signal the
>> +user fence, specified value will be written at the specified virtual address
>> +and wakeup the waiting process. User can wait on a user fence with the
>> +gem_wait_user_fence ioctl.
>> +
>> +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> +interrupt within their batches after updating the value to have sub-batch
>> +precision on the wakeup. Each batch can signal a user fence to indicate
>> +the completion of next level batch. The completion of very first level batch
>> +needs to be signaled by the command streamer. The user must provide the
>> +user/memory fence for this via the
>> DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> +signal it.
>> +
>> +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> +the user process after completion of an asynchronous operation.
>> +
>> +When VM_BIND ioctl was provided with a user/memory fence via the
>> +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
>> completion
>> +of binding of that mapping. All async binds/unbinds are serialized, hence
>> +signaling of user/memory fence also indicate the completion of all previous
>> +binds/unbinds.
>> +
>> +This feature will be derived from the below original work:
>> +https://patchwork.freedesktop.org/patch/349417/
>> +
>> +Long running Compute contexts
>> +------------------------------
>> +Usage of dma-fence expects that they complete in reasonable amount of time.
>> +Compute on the other hand can be long running. Hence it is appropriate for
>> +compute to use user/memory fence and dma-fence usage will be limited to
>> +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> +in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
>> opt-in
>> +for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
>> during
>> +context creation. The dma-fence based user interfaces like gem_wait ioctl and
>> +execbuff out fence are not allowed on long running contexts. Implicit sync is
>> +not valid as well and is anyway not supported in VM_BIND mode.
>> +
>> +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> +will initiate a suspend (preemption) of long running context with a dma-fence
>> +attached to it. And upon completion of that suspend fence, finish the
>> +invalidation, revalidate the BO and then resume the compute context. This is
>> +done by having a per-context preempt fence (also called suspend fence)
>> proxying
>> +as i915_request fence. This suspend fence is enabled when someone tries to
>> wait
>> +on it, which then triggers the context preemption.
>> +
>> +As this support for context suspension using a preempt fence and the resume
>> work
>> +for the compute mode contexts can get tricky to get it right, it is better to
>> +add this support in drm scheduler so that multiple drivers can make use of it.
>> +That means, it will have a dependency on i915 drm scheduler conversion with
>> GuC
>> +scheduler backend. This should be fine, as the plan is to support compute mode
>> +contexts only with GuC scheduler backend (at least initially). This is much
>> +easier to support with VM_BIND mode compared to the current heavier
>> execbuff
>> +path resource attachment.
>> +
>> +Low Latency Submission
>> +-----------------------
>> +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> +ioctl. This is made possible by VM_BIND is not being synchronized against
>> +execbuff. VM_BIND allows bind/unbind of mappings required for the directly
>> +submitted jobs.
>> +
>> +Other VM_BIND use cases
>> +========================
>> +
>> +Debugger
>> +---------
>> +With debug event interface user space process (debugger) is able to keep track
>> +of and act upon resources created by another process (debugged) and attached
>> +to GPU via vm_bind interface.
>> +
>> +GPU page faults
>> +----------------
>> +GPU page faults when supported (in future), will only be supported in the
>> +VM_BIND mode. While both the older execbuff mode and the newer VM_BIND
>> mode of
>> +binding will require using dma-fence to ensure residency, the GPU page faults
>> +mode when supported, will not use any dma-fence as residency is purely
>> managed
>> +by installing and removing/invalidating page table entries.
>> +
>> +Page level hints settings
>> +--------------------------
>> +VM_BIND allows any hints setting per mapping instead of per BO.
>> +Possible hints include read-only mapping, placement and atomicity.
>> +Sub-BO level placement hint will be even more relevant with
>> +upcoming GPU on-demand page fault support.
>> +
>> +Page level Cache/CLOS settings
>> +-------------------------------
>> +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> +
>> +Shared Virtual Memory (SVM) support
>> +------------------------------------
>> +VM_BIND interface can be used to map system memory directly (without gem
>> BO
>> +abstraction) using the HMM interface. SVM is only supported with GPU page
>> +faults enabled.
>> +
>> +
>> +Broder i915 cleanups
>> +=====================
>> +Supporting this whole new vm_bind mode of binding which comes with its own
>> +use cases to support and the locking requirements requires proper integration
>> +with the existing i915 driver. This calls for some broader i915 driver
>> +cleanups/simplifications for maintainability of the driver going forward.
>> +Here are few things identified and are being looked into.
>> +
>> +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND
>> feature
>> +  do not use it and complexity it brings in is probably more than the
>> +  performance advantage we get in legacy execbuff case.
>> +- Remove vma->open_count counting
>> +- Remove i915_vma active reference tracking. VM_BIND feature will not be
>> using
>> +  it. Instead use underlying BO's dma-resv fence list to determine if a i915_vma
>> +  is active or not.
>> +
>> +
>> +VM_BIND UAPI
>> +=============
>> +
>> +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> index 91e93a705230..7d10c36b268d 100644
>> --- a/Documentation/gpu/rfc/index.rst
>> +++ b/Documentation/gpu/rfc/index.rst
>> @@ -23,3 +23,7 @@ host such documentation:
>>  .. toctree::
>>
>>      i915_scheduler.rst
>> +
>> +.. toctree::
>> +
>> +    i915_vm_bind.rst
>> --
>> 2.21.0.rc0.32.g243a4c7e27
>
Lionel Landwerlin June 3, 2022, 7:20 a.m. UTC | #16
On 02/06/2022 23:35, Jason Ekstrand wrote:
> On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura 
> <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>     >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding the mapping in an
>     >> > +async worker. The binding and unbinding will work like a
>     special GPU engine.
>     >> > +The binding and unbinding operations are serialized and will
>     wait on specified
>     >> > +input fences before the operation and will signal the output
>     fences upon the
>     >> > +completion of the operation. Due to serialization,
>     completion of an operation
>     >> > +will also indicate that all previous operations are also
>     complete.
>     >>
>     >> I guess we should avoid saying "will immediately start
>     binding/unbinding" if
>     >> there are fences involved.
>     >>
>     >> And the fact that it's happening in an async worker seem to
>     imply it's not
>     >> immediate.
>     >>
>
>     Ok, will fix.
>     This was added because in earlier design binding was deferred
>     until next execbuff.
>     But now it is non-deferred (immediate in that sense). But yah,
>     this is confusing
>     and will fix it.
>
>     >>
>     >> I have a question on the behavior of the bind operation when no
>     input fence
>     >> is provided. Let say I do :
>     >>
>     >> VM_BIND (out_fence=fence1)
>     >>
>     >> VM_BIND (out_fence=fence2)
>     >>
>     >> VM_BIND (out_fence=fence3)
>     >>
>     >>
>     >> In what order are the fences going to be signaled?
>     >>
>     >> In the order of VM_BIND ioctls? Or out of order?
>     >>
>     >> Because you wrote "serialized I assume it's : in order
>     >>
>
>     Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind will use
>     the same queue and hence are ordered.
>
>     >>
>     >> One thing I didn't realize is that because we only get one
>     "VM_BIND" engine,
>     >> there is a disconnect from the Vulkan specification.
>     >>
>     >> In Vulkan VM_BIND operations are serialized but per engine.
>     >>
>     >> So you could have something like this :
>     >>
>     >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >>
>     >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >>
>     >>
>     >> fence1 is not signaled
>     >>
>     >> fence3 is signaled
>     >>
>     >> So the second VM_BIND will proceed before the first VM_BIND.
>     >>
>     >>
>     >> I guess we can deal with that scenario in userspace by doing
>     the wait
>     >> ourselves in one thread per engines.
>     >>
>     >> But then it makes the VM_BIND input fences useless.
>     >>
>     >>
>     >> Daniel : what do you think? Should be rework this or just deal
>     with wait
>     >> fences in userspace?
>     >>
>     >
>     >My opinion is rework this but make the ordering via an engine
>     param optional.
>     >
>     >e.g. A VM can be configured so all binds are ordered within the VM
>     >
>     >e.g. A VM can be configured so all binds accept an engine
>     argument (in
>     >the case of the i915 likely this is a gem context handle) and binds
>     >ordered with respect to that engine.
>     >
>     >This gives UMDs options as the later likely consumes more KMD
>     resources
>     >so if a different UMD can live with binds being ordered within the VM
>     >they can use a mode consuming less resources.
>     >
>
>     I think we need to be careful here if we are looking for some out of
>     (submission) order completion of vm_bind/unbind.
>     In-order completion means, in a batch of binds and unbinds to be
>     completed in-order, user only needs to specify in-fence for the
>     first bind/unbind call and the our-fence for the last bind/unbind
>     call. Also, the VA released by an unbind call can be re-used by
>     any subsequent bind call in that in-order batch.
>
>     These things will break if binding/unbinding were to be allowed to
>     go out of order (of submission) and user need to be extra careful
>     not to run into pre-mature triggereing of out-fence and bind failing
>     as VA is still in use etc.
>
>     Also, VM_BIND binds the provided mapping on the specified address
>     space
>     (VM). So, the uapi is not engine/context specific.
>
>     We can however add a 'queue' to the uapi which can be one from the
>     pre-defined queues,
>     I915_VM_BIND_QUEUE_0
>     I915_VM_BIND_QUEUE_1
>     ...
>     I915_VM_BIND_QUEUE_(N-1)
>
>     KMD will spawn an async work queue for each queue which will only
>     bind the mappings on that queue in the order of submission.
>     User can assign the queue to per engine or anything like that.
>
>     But again here, user need to be careful and not deadlock these
>     queues with circular dependency of fences.
>
>     I prefer adding this later an as extension based on whether it
>     is really helping with the implementation.
>
>
> I can tell you right now that having everything on a single in-order 
> queue will not get us the perf we want. What vulkan really wants is 
> one of two things:
>
>  1. No implicit ordering of VM_BIND ops.  They just happen in whatever 
> their dependencies are resolved and we ensure ordering ourselves by 
> having a syncobj in the VkQueue.
>
>  2. The ability to create multiple VM_BIND queues.  We need at least 2 
> but I don't see why there needs to be a limit besides the limits the 
> i915 API already has on the number of engines.  Vulkan could expose 
> multiple sparse binding queues to the client if it's not arbitrarily 
> limited.
>
> Why?  Because Vulkan has two basic kind of bind operations and we 
> don't want any dependencies between them:
>
>  1. Immediate.  These happen right after BO creation or maybe as part 
> of vkBindImageMemory() or VkBindBufferMemory().  These don't happen on 
> a queue and we don't want them serialized with anything.  To 
> synchronize with submit, we'll have a syncobj in the VkDevice which is 
> signaled by all immediate bind operations and make submits wait on it.
>
>  2. Queued (sparse): These happen on a VkQueue which may be the same 
> as a render/compute queue or may be its own queue.  It's up to us what 
> we want to advertise.  From the Vulkan API PoV, this is like any other 
> queue.  Operations on it wait on and signal semaphores.  If we have a 
> VM_BIND engine, we'd provide syncobjs to wait and signal just like we 
> do in execbuf().
>
> The important thing is that we don't want one type of operation to 
> block on the other.  If immediate binds are blocking on sparse binds, 
> it's going to cause over-synchronization issues.
>
> In terms of the internal implementation, I know that there's going to 
> be a lock on the VM and that we can't actually do these things in 
> parallel.  That's fine.  Once the dma_fences have signaled and we're 
> unblocked to do the bind operation, I don't care if there's a bit of 
> synchronization due to locking.  That's expected.  What we can't 
> afford to have is an immediate bind operation suddenly blocking on a 
> sparse operation which is blocked on a compute job that's going to run 
> for another 5ms.
>
> For reference, Windows solves this by allowing arbitrarily many paging 
> queues (what they call a VM_BIND engine/queue).  That design works 
> pretty well and solves the problems in question.  Again, we could just 
> make everything out-of-order and require using syncobjs to order 
> things as userspace wants. That'd be fine too.
>
> One more note while I'm here: danvet said something on IRC about 
> VM_BIND queues waiting for syncobjs to materialize.  We don't really 
> want/need this.  We already have all the machinery in userspace to 
> handle wait-before-signal and waiting for syncobj fences to 
> materialize and that machinery is on by default.  It would actually 
> take MORE work in Mesa to turn it off and take advantage of the kernel 
> being able to wait for syncobjs to materialize.  Also, getting that 
> right is ridiculously hard and I really don't want to get it wrong in 
> kernel space. When we do memory fences, wait-before-signal will be a 
> thing.  We don't need to try and make it a thing for syncobj.
>
> --Jason


Thanks Jason,


I missed the bit in the Vulkan spec that we're allowed to have a sparse 
queue that does not implement either graphics or compute operations:

    "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
    support in queue families that also include

      graphics and compute support, other implementations may only
    expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue

      family."


So it can all be a vm_bind engine that just does bind/unbind 
operations.
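
Purely as an illustration, such a family would be advertised with only the
sparse-binding capability bit set, e.g. (hypothetical values, not taken from
any particular implementation):

#include <vulkan/vulkan.h>

/* Hypothetical properties for a sparse-binding-only queue family. */
static const VkQueueFamilyProperties sparse_only_family = {
        .queueFlags = VK_QUEUE_SPARSE_BINDING_BIT,
        .queueCount = 1,
        .timestampValidBits = 0,
        .minImageTransferGranularity = { 0, 0, 0 }, /* no transfer ops here */
};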


But yes we need another engine for the immediate/non-sparse operations.


-Lionel


>     Daniel, any thoughts?
>
>     Niranjana
>
>     >Matt
>     >
>     >>
>     >> Sorry I noticed this late.
>     >>
>     >>
>     >> -Lionel
>     >>
>     >>
>
Niranjana Vishwanathapura June 3, 2022, 11:51 p.m. UTC | #17
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>   On 02/06/2022 23:35, Jason Ekstrand wrote:
>
>     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     <niranjana.vishwanathapura@intel.com> wrote:
>
>       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
>       the mapping in an
>       >> > +async worker. The binding and unbinding will work like a special
>       GPU engine.
>       >> > +The binding and unbinding operations are serialized and will
>       wait on specified
>       >> > +input fences before the operation and will signal the output
>       fences upon the
>       >> > +completion of the operation. Due to serialization, completion of
>       an operation
>       >> > +will also indicate that all previous operations are also
>       complete.
>       >>
>       >> I guess we should avoid saying "will immediately start
>       binding/unbinding" if
>       >> there are fences involved.
>       >>
>       >> And the fact that it's happening in an async worker seem to imply
>       it's not
>       >> immediate.
>       >>
>
>       Ok, will fix.
>       This was added because in earlier design binding was deferred until
>       next execbuff.
>       But now it is non-deferred (immediate in that sense). But yah, this is
>       confusing
>       and will fix it.
>
>       >>
>       >> I have a question on the behavior of the bind operation when no
>       input fence
>       >> is provided. Let say I do :
>       >>
>       >> VM_BIND (out_fence=fence1)
>       >>
>       >> VM_BIND (out_fence=fence2)
>       >>
>       >> VM_BIND (out_fence=fence3)
>       >>
>       >>
>       >> In what order are the fences going to be signaled?
>       >>
>       >> In the order of VM_BIND ioctls? Or out of order?
>       >>
>       >> Because you wrote "serialized I assume it's : in order
>       >>
>
>       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>       will use
>       the same queue and hence are ordered.
>
>       >>
>       >> One thing I didn't realize is that because we only get one
>       "VM_BIND" engine,
>       >> there is a disconnect from the Vulkan specification.
>       >>
>       >> In Vulkan VM_BIND operations are serialized but per engine.
>       >>
>       >> So you could have something like this :
>       >>
>       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>       >>
>       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>       >>
>       >>
>       >> fence1 is not signaled
>       >>
>       >> fence3 is signaled
>       >>
>       >> So the second VM_BIND will proceed before the first VM_BIND.
>       >>
>       >>
>       >> I guess we can deal with that scenario in userspace by doing the
>       wait
>       >> ourselves in one thread per engines.
>       >>
>       >> But then it makes the VM_BIND input fences useless.
>       >>
>       >>
>       >> Daniel : what do you think? Should be rework this or just deal with
>       wait
>       >> fences in userspace?
>       >>
>       >
>       >My opinion is rework this but make the ordering via an engine param
>       optional.
>       >
>       >e.g. A VM can be configured so all binds are ordered within the VM
>       >
>       >e.g. A VM can be configured so all binds accept an engine argument
>       (in
>       >the case of the i915 likely this is a gem context handle) and binds
>       >ordered with respect to that engine.
>       >
>       >This gives UMDs options as the later likely consumes more KMD
>       resources
>       >so if a different UMD can live with binds being ordered within the VM
>       >they can use a mode consuming less resources.
>       >
>
>       I think we need to be careful here if we are looking for some out of
>       (submission) order completion of vm_bind/unbind.
>       In-order completion means, in a batch of binds and unbinds to be
>       completed in-order, user only needs to specify in-fence for the
>       first bind/unbind call and the our-fence for the last bind/unbind
>       call. Also, the VA released by an unbind call can be re-used by
>       any subsequent bind call in that in-order batch.
>
>       These things will break if binding/unbinding were to be allowed to
>       go out of order (of submission) and user need to be extra careful
>       not to run into pre-mature triggereing of out-fence and bind failing
>       as VA is still in use etc.
>
>       Also, VM_BIND binds the provided mapping on the specified address
>       space
>       (VM). So, the uapi is not engine/context specific.
>
>       We can however add a 'queue' to the uapi which can be one from the
>       pre-defined queues,
>       I915_VM_BIND_QUEUE_0
>       I915_VM_BIND_QUEUE_1
>       ...
>       I915_VM_BIND_QUEUE_(N-1)
>
>       KMD will spawn an async work queue for each queue which will only
>       bind the mappings on that queue in the order of submission.
>       User can assign the queue to per engine or anything like that.
>
>       But again here, user need to be careful and not deadlock these
>       queues with circular dependency of fences.
>
>       I prefer adding this later an as extension based on whether it
>       is really helping with the implementation.
>
>     I can tell you right now that having everything on a single in-order
>     queue will not get us the perf we want.  What vulkan really wants is one
>     of two things:
>      1. No implicit ordering of VM_BIND ops.  They just happen in whatever
>     their dependencies are resolved and we ensure ordering ourselves by
>     having a syncobj in the VkQueue.
>      2. The ability to create multiple VM_BIND queues.  We need at least 2
>     but I don't see why there needs to be a limit besides the limits the
>     i915 API already has on the number of engines.  Vulkan could expose
>     multiple sparse binding queues to the client if it's not arbitrarily
>     limited.

Thanks Jason, Lionel.

Jason, what are you referring to when you say "limits the i915 API already
has on the number of engines"? I am not sure if there is such an uapi today.

I am trying to see how many queues we need and don't want it to be arbitrarily
large and unduly blow up memory usage and complexity in the i915 driver.

>     Why?  Because Vulkan has two basic kind of bind operations and we don't
>     want any dependencies between them:
>      1. Immediate.  These happen right after BO creation or maybe as part of
>     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
>     queue and we don't want them serialized with anything.  To synchronize
>     with submit, we'll have a syncobj in the VkDevice which is signaled by
>     all immediate bind operations and make submits wait on it.
>      2. Queued (sparse): These happen on a VkQueue which may be the same as
>     a render/compute queue or may be its own queue.  It's up to us what we
>     want to advertise.  From the Vulkan API PoV, this is like any other
>     queue.  Operations on it wait on and signal semaphores.  If we have a
>     VM_BIND engine, we'd provide syncobjs to wait and signal just like we do
>     in execbuf().
>     The important thing is that we don't want one type of operation to block
>     on the other.  If immediate binds are blocking on sparse binds, it's
>     going to cause over-synchronization issues.
>     In terms of the internal implementation, I know that there's going to be
>     a lock on the VM and that we can't actually do these things in
>     parallel.  That's fine.  Once the dma_fences have signaled and we're

That's correct. It is like a single VM_BIND engine with multiple queues
feeding into it.
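
As an aside, here is a tiny self-contained userspace model (plain C with
pthreads) of that picture. It is purely illustrative and not kernel code;
all names are made up. It shows two in-order queues whose workers only
serialize at the actual "bind" step via a shared per-VM lock, so a queue
that is waiting on its dependencies (simulated here with a sleep) never
holds up the other queue.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_QUEUES	2
#define OPS_PER_QUEUE	3

struct vm {
	pthread_mutex_t lock;		/* models the per-VM bind lock */
};

struct bind_queue {
	struct vm *vm;
	int idx;
	useconds_t fence_wait_us;	/* simulated in-fence wait per op */
};

static void do_bind(struct vm *vm, int queue, int op)
{
	/* The actual bind (PTE update) is serialized across all queues... */
	pthread_mutex_lock(&vm->lock);
	printf("queue %d: bind op %d complete\n", queue, op);
	pthread_mutex_unlock(&vm->lock);
}

static void *queue_worker(void *data)
{
	struct bind_queue *q = data;

	/* ...but each queue only orders its own operations. */
	for (int op = 0; op < OPS_PER_QUEUE; op++) {
		if (q->fence_wait_us)
			usleep(q->fence_wait_us);	/* "wait on in-fence" */
		do_bind(q->vm, q->idx, op);
	}
	return NULL;
}

int main(void)
{
	struct vm vm = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct bind_queue queues[NUM_QUEUES] = {
		{ .vm = &vm, .idx = 0, .fence_wait_us = 100000 },
		{ .vm = &vm, .idx = 1, .fence_wait_us = 0 },
	};
	pthread_t workers[NUM_QUEUES];

	for (int i = 0; i < NUM_QUEUES; i++)
		pthread_create(&workers[i], NULL, queue_worker, &queues[i]);
	for (int i = 0; i < NUM_QUEUES; i++)
		pthread_join(workers[i], NULL);
	return 0;
}

Built with "cc -pthread", queue 1 finishes all of its binds while queue 0
is still waiting, which is the kind of independence being asked for here.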

>     unblocked to do the bind operation, I don't care if there's a bit of
>     synchronization due to locking.  That's expected.  What we can't afford
>     to have is an immediate bind operation suddenly blocking on a sparse
>     operation which is blocked on a compute job that's going to run for
>     another 5ms.

As the VM_BIND queue is per VM, a VM_BIND on one VM doesn't block VM_BIND
on other VMs. I am not sure about the use cases here, but just wanted to clarify.

Niranjana

>     For reference, Windows solves this by allowing arbitrarily many paging
>     queues (what they call a VM_BIND engine/queue).  That design works
>     pretty well and solves the problems in question.  Again, we could just
>     make everything out-of-order and require using syncobjs to order things
>     as userspace wants. That'd be fine too.
>     One more note while I'm here: danvet said something on IRC about VM_BIND
>     queues waiting for syncobjs to materialize.  We don't really want/need
>     this.  We already have all the machinery in userspace to handle
>     wait-before-signal and waiting for syncobj fences to materialize and
>     that machinery is on by default.  It would actually take MORE work in
>     Mesa to turn it off and take advantage of the kernel being able to wait
>     for syncobjs to materialize.  Also, getting that right is ridiculously
>     hard and I really don't want to get it wrong in kernel space.  When we
>     do memory fences, wait-before-signal will be a thing.  We don't need to
>     try and make it a thing for syncobj.
>     --Jason
>
>   Thanks Jason,
>
>   I missed the bit in the Vulkan spec that we're allowed to have a sparse
>   queue that does not implement either graphics or compute operations :
>
>     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
>     support in queue families that also include
>
>      graphics and compute support, other implementations may only expose a
>     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>
>      family."
>
>   So it can all be all a vm_bind engine that just does bind/unbind
>   operations.
>
>   But yes we need another engine for the immediate/non-sparse operations.
>
>   -Lionel
>
>      
>
>       Daniel, any thoughts?
>
>       Niranjana
>
>       >Matt
>       >
>       >>
>       >> Sorry I noticed this late.
>       >>
>       >>
>       >> -Lionel
>       >>
>       >>
Zeng, Oak June 6, 2022, 8:45 p.m. UTC | #18
Regards,
Oak

> -----Original Message-----
> From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> Sent: June 2, 2022 4:49 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel <daniel.vetter@intel.com>; Brost, Matthew <matthew.brost@intel.com>;
> Hellstrom, Thomas <thomas.hellstrom@intel.com>; jason@jlekstrand.net;
> Wilson, Chris P <chris.p.wilson@intel.com>; christian.koenig@amd.com
> Subject: Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
> >
> >
> >Regards,
> >Oak
> >
> >> -----Original Message-----
> >> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> >> Niranjana Vishwanathapura
> >> Sent: May 17, 2022 2:32 PM
> >> To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> >> Daniel <daniel.vetter@intel.com>
> >> Cc: Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> >> <thomas.hellstrom@intel.com>; jason@jlekstrand.net; Wilson, Chris P
> >> <chris.p.wilson@intel.com>; christian.koenig@amd.com
> >> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> >>
> >> VM_BIND design document with description of intended use cases.
> >>
> >> v2: Add more documentation and format as per review comments
> >>     from Daniel.
> >>
> >> Signed-off-by: Niranjana Vishwanathapura
> >> <niranjana.vishwanathapura@intel.com>
> >> ---
> >>  Documentation/driver-api/dma-buf.rst   |   2 +
> >>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> >> +++++++++++++++++++++++++
> >>  Documentation/gpu/rfc/index.rst        |   4 +
> >>  3 files changed, 310 insertions(+)
> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >>
> >> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> >> api/dma-buf.rst
> >> index 36a76cbe9095..64cb924ec5bb 100644
> >> --- a/Documentation/driver-api/dma-buf.rst
> >> +++ b/Documentation/driver-api/dma-buf.rst
> >> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
> >>  .. kernel-doc:: include/linux/sync_file.h
> >>     :internal:
> >>
> >> +.. _indefinite_dma_fences:
> >> +
> >>  Indefinite DMA Fences
> >>  ~~~~~~~~~~~~~~~~~~~~~
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> >> b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> new file mode 100644
> >> index 000000000000..f1be560d313c
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> @@ -0,0 +1,304 @@
> >> +==========================================
> >> +I915 VM_BIND feature design and use cases
> >> +==========================================
> >> +
> >> +VM_BIND feature
> >> +================
> >> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> >> buffer
> >> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> >> +specified address space (VM). These mappings (also referred to as persistent
> >> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> >> +issued by the UMD, without user having to provide a list of all required
> >> +mappings during each submission (as required by older execbuff mode).
> >> +
> >> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> >> +to specify how the binding/unbinding should sync with other operations
> >> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> >> +for non-Compute contexts (See struct
> >> drm_i915_vm_bind_ext_timeline_fences).
> >> +For Compute contexts, they will be user/memory fences (See struct
> >> +drm_i915_vm_bind_ext_user_fence).
> >> +
> >> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> >> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> >> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> >> extension.
> >> +
> >> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the
> mapping in
> >> an
> >> +async worker. The binding and unbinding will work like a special GPU engine.
> >> +The binding and unbinding operations are serialized and will wait on specified
> >> +input fences before the operation and will signal the output fences upon the
> >> +completion of the operation. Due to serialization, completion of an operation
> >> +will also indicate that all previous operations are also complete.
> >
> >Hi,
> >
> >Is user required to wait for the out fence be signaled before submit a gpu job
> using the vm_bind address?
> >Or is user required to order the gpu job to make gpu job run after vm_bind out
> fence signaled?
> >
> 
> Thanks Oak,
> Either should be fine and up to user how to use vm_bind/unbind out-fence.
> 
> >I think there could be different behavior on a non-faultable platform and a
> faultable platform, such as on a non-faultable
> >Platform, gpu job is required to be order after vm_bind out fence signaling; and
> on a faultable platform, there is no such
> >Restriction since vm bind can be finished in the fault handler?
> >
> 
> With GPU page faults handler, out fence won't be needed as residency is
> purely managed by page fault handler populating page tables (there is a
> mention of it in GPU Page Faults section below).
> 
> >Should we document such thing?
> >
> 
> We don't talk much about GPU page faults case in this document as that may
> warrent a separate rfc when we add page faults support. We did mention it
> in couple places to ensure our locking design here is extensible to gpu
> page faults case.

Ok, that makes sense to me. Thanks for explaining.

Regards,
Oak

> 
> Niranjana
> 
> >Regards,
> >Oak
Jason Ekstrand June 7, 2022, 5:12 p.m. UTC | #19
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >
> >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> >     <niranjana.vishwanathapura@intel.com> wrote:
> >
> >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
> >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
> >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >       >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
> >       the mapping in an
> >       >> > +async worker. The binding and unbinding will work like a
> special
> >       GPU engine.
> >       >> > +The binding and unbinding operations are serialized and will
> >       wait on specified
> >       >> > +input fences before the operation and will signal the output
> >       fences upon the
> >       >> > +completion of the operation. Due to serialization,
> completion of
> >       an operation
> >       >> > +will also indicate that all previous operations are also
> >       complete.
> >       >>
> >       >> I guess we should avoid saying "will immediately start
> >       binding/unbinding" if
> >       >> there are fences involved.
> >       >>
> >       >> And the fact that it's happening in an async worker seem to
> imply
> >       it's not
> >       >> immediate.
> >       >>
> >
> >       Ok, will fix.
> >       This was added because in earlier design binding was deferred until
> >       next execbuff.
> >       But now it is non-deferred (immediate in that sense). But yah,
> this is
> >       confusing
> >       and will fix it.
> >
> >       >>
> >       >> I have a question on the behavior of the bind operation when no
> >       input fence
> >       >> is provided. Let say I do :
> >       >>
> >       >> VM_BIND (out_fence=fence1)
> >       >>
> >       >> VM_BIND (out_fence=fence2)
> >       >>
> >       >> VM_BIND (out_fence=fence3)
> >       >>
> >       >>
> >       >> In what order are the fences going to be signaled?
> >       >>
> >       >> In the order of VM_BIND ioctls? Or out of order?
> >       >>
> >       >> Because you wrote "serialized I assume it's : in order
> >       >>
> >
> >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
> unbind
> >       will use
> >       the same queue and hence are ordered.
> >
> >       >>
> >       >> One thing I didn't realize is that because we only get one
> >       "VM_BIND" engine,
> >       >> there is a disconnect from the Vulkan specification.
> >       >>
> >       >> In Vulkan VM_BIND operations are serialized but per engine.
> >       >>
> >       >> So you could have something like this :
> >       >>
> >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >       >>
> >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >       >>
> >       >>
> >       >> fence1 is not signaled
> >       >>
> >       >> fence3 is signaled
> >       >>
> >       >> So the second VM_BIND will proceed before the first VM_BIND.
> >       >>
> >       >>
> >       >> I guess we can deal with that scenario in userspace by doing the
> >       wait
> >       >> ourselves in one thread per engines.
> >       >>
> >       >> But then it makes the VM_BIND input fences useless.
> >       >>
> >       >>
> >       >> Daniel : what do you think? Should be rework this or just deal
> with
> >       wait
> >       >> fences in userspace?
> >       >>
> >       >
> >       >My opinion is rework this but make the ordering via an engine
> param
> >       optional.
> >       >
> >       >e.g. A VM can be configured so all binds are ordered within the VM
> >       >
> >       >e.g. A VM can be configured so all binds accept an engine argument
> >       (in
> >       >the case of the i915 likely this is a gem context handle) and
> binds
> >       >ordered with respect to that engine.
> >       >
> >       >This gives UMDs options as the later likely consumes more KMD
> >       resources
> >       >so if a different UMD can live with binds being ordered within
> the VM
> >       >they can use a mode consuming less resources.
> >       >
> >
> >       I think we need to be careful here if we are looking for some out
> of
> >       (submission) order completion of vm_bind/unbind.
> >       In-order completion means, in a batch of binds and unbinds to be
> >       completed in-order, user only needs to specify in-fence for the
> >       first bind/unbind call and the our-fence for the last bind/unbind
> >       call. Also, the VA released by an unbind call can be re-used by
> >       any subsequent bind call in that in-order batch.
> >
> >       These things will break if binding/unbinding were to be allowed to
> >       go out of order (of submission) and user need to be extra careful
> >       not to run into pre-mature triggereing of out-fence and bind
> failing
> >       as VA is still in use etc.
> >
> >       Also, VM_BIND binds the provided mapping on the specified address
> >       space
> >       (VM). So, the uapi is not engine/context specific.
> >
> >       We can however add a 'queue' to the uapi which can be one from the
> >       pre-defined queues,
> >       I915_VM_BIND_QUEUE_0
> >       I915_VM_BIND_QUEUE_1
> >       ...
> >       I915_VM_BIND_QUEUE_(N-1)
> >
> >       KMD will spawn an async work queue for each queue which will only
> >       bind the mappings on that queue in the order of submission.
> >       User can assign the queue to per engine or anything like that.
> >
> >       But again here, user need to be careful and not deadlock these
> >       queues with circular dependency of fences.
> >
> >       I prefer adding this later an as extension based on whether it
> >       is really helping with the implementation.
> >
> >     I can tell you right now that having everything on a single in-order
> >     queue will not get us the perf we want.  What vulkan really wants is
> one
> >     of two things:
> >      1. No implicit ordering of VM_BIND ops.  They just happen in
> whatever
> >     their dependencies are resolved and we ensure ordering ourselves by
> >     having a syncobj in the VkQueue.
> >      2. The ability to create multiple VM_BIND queues.  We need at least
> 2
> >     but I don't see why there needs to be a limit besides the limits the
> >     i915 API already has on the number of engines.  Vulkan could expose
> >     multiple sparse binding queues to the client if it's not arbitrarily
> >     limited.
>
> Thanks Jason, Lionel.
>
> Jason, what are you referring to when you say "limits the i915 API already
> has on the number of engines"? I am not sure if there is such an uapi
> today.
>

There's a limit of something like 64 total engines today based on the
number of bits we can cram into the exec flags in execbuffer2.  I think
someone had an extended version that allowed more but I ripped it out
because no one was using it.  Of course, execbuffer3 might not have that
problem at all.

> I am trying to see how many queues we need and don't want it to be
> arbitrarily
> large and unduely blow up memory usage and complexity in i915 driver.
>

I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
could imagine a client wanting to create more than 1 sparse queue, in which
case it'll be N+1, but that's unlikely.  As far as complexity goes, once
you allow two, I don't think the complexity is going up by allowing N.  As
for memory usage, creating more queues means more memory.  That's a
trade-off that userspace can make.  Again, the expected number here is 1 or
2 in the vast majority of cases so I don't think you need to worry.


> >     Why?  Because Vulkan has two basic kind of bind operations and we
> don't
> >     want any dependencies between them:
> >      1. Immediate.  These happen right after BO creation or maybe as
> part of
> >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a
> >     queue and we don't want them serialized with anything.  To
> synchronize
> >     with submit, we'll have a syncobj in the VkDevice which is signaled
> by
> >     all immediate bind operations and make submits wait on it.
> >      2. Queued (sparse): These happen on a VkQueue which may be the same
> as
> >     a render/compute queue or may be its own queue.  It's up to us what
> we
> >     want to advertise.  From the Vulkan API PoV, this is like any other
> >     queue.  Operations on it wait on and signal semaphores.  If we have a
> >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
> we do
> >     in execbuf().
> >     The important thing is that we don't want one type of operation to
> block
> >     on the other.  If immediate binds are blocking on sparse binds, it's
> >     going to cause over-synchronization issues.
> >     In terms of the internal implementation, I know that there's going
> to be
> >     a lock on the VM and that we can't actually do these things in
> >     parallel.  That's fine.  Once the dma_fences have signaled and we're
>
> Thats correct. It is like a single VM_BIND engine with multiple queues
> feeding to it.
>

Right.  As long as the queues themselves are independent and can block on
dma_fences without holding up other queues, I think we're fine.


> >     unblocked to do the bind operation, I don't care if there's a bit of
> >     synchronization due to locking.  That's expected.  What we can't
> afford
> >     to have is an immediate bind operation suddenly blocking on a sparse
> >     operation which is blocked on a compute job that's going to run for
> >     another 5ms.
>
> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND
> on other VMs. I am not sure about usecases here, but just wanted to
> clarify.
>

Yes, that's what I would expect.

--Jason



Niranjana Vishwanathapura June 7, 2022, 6:18 p.m. UTC | #20
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>   On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>   <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>     >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >
>     >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >     <niranjana.vishwanathapura@intel.com> wrote:
>     >
>     >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>     >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>     wrote:
>     >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     binding/unbinding
>     >       the mapping in an
>     >       >> > +async worker. The binding and unbinding will work like a
>     special
>     >       GPU engine.
>     >       >> > +The binding and unbinding operations are serialized and
>     will
>     >       wait on specified
>     >       >> > +input fences before the operation and will signal the
>     output
>     >       fences upon the
>     >       >> > +completion of the operation. Due to serialization,
>     completion of
>     >       an operation
>     >       >> > +will also indicate that all previous operations are also
>     >       complete.
>     >       >>
>     >       >> I guess we should avoid saying "will immediately start
>     >       binding/unbinding" if
>     >       >> there are fences involved.
>     >       >>
>     >       >> And the fact that it's happening in an async worker seem to
>     imply
>     >       it's not
>     >       >> immediate.
>     >       >>
>     >
>     >       Ok, will fix.
>     >       This was added because in earlier design binding was deferred
>     until
>     >       next execbuff.
>     >       But now it is non-deferred (immediate in that sense). But yah,
>     this is
>     >       confusing
>     >       and will fix it.
>     >
>     >       >>
>     >       >> I have a question on the behavior of the bind operation when
>     no
>     >       input fence
>     >       >> is provided. Let say I do :
>     >       >>
>     >       >> VM_BIND (out_fence=fence1)
>     >       >>
>     >       >> VM_BIND (out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (out_fence=fence3)
>     >       >>
>     >       >>
>     >       >> In what order are the fences going to be signaled?
>     >       >>
>     >       >> In the order of VM_BIND ioctls? Or out of order?
>     >       >>
>     >       >> Because you wrote "serialized I assume it's : in order
>     >       >>
>     >
>     >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>     unbind
>     >       will use
>     >       the same queue and hence are ordered.
>     >
>     >       >>
>     >       >> One thing I didn't realize is that because we only get one
>     >       "VM_BIND" engine,
>     >       >> there is a disconnect from the Vulkan specification.
>     >       >>
>     >       >> In Vulkan VM_BIND operations are serialized but per engine.
>     >       >>
>     >       >> So you could have something like this :
>     >       >>
>     >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>     >       >>
>     >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>     >       >>
>     >       >>
>     >       >> fence1 is not signaled
>     >       >>
>     >       >> fence3 is signaled
>     >       >>
>     >       >> So the second VM_BIND will proceed before the first VM_BIND.
>     >       >>
>     >       >>
>     >       >> I guess we can deal with that scenario in userspace by doing
>     the
>     >       wait
>     >       >> ourselves in one thread per engines.
>     >       >>
>     >       >> But then it makes the VM_BIND input fences useless.
>     >       >>
>     >       >>
>     >       >> Daniel : what do you think? Should be rework this or just
>     deal with
>     >       wait
>     >       >> fences in userspace?
>     >       >>
>     >       >
>     >       >My opinion is rework this but make the ordering via an engine
>     param
>     >       optional.
>     >       >
>     >       >e.g. A VM can be configured so all binds are ordered within the
>     VM
>     >       >
>     >       >e.g. A VM can be configured so all binds accept an engine
>     argument
>     >       (in
>     >       >the case of the i915 likely this is a gem context handle) and
>     binds
>     >       >ordered with respect to that engine.
>     >       >
>     >       >This gives UMDs options as the later likely consumes more KMD
>     >       resources
>     >       >so if a different UMD can live with binds being ordered within
>     the VM
>     >       >they can use a mode consuming less resources.
>     >       >
>     >
>     >       I think we need to be careful here if we are looking for some
>     out of
>     >       (submission) order completion of vm_bind/unbind.
>     >       In-order completion means, in a batch of binds and unbinds to be
>     >       completed in-order, user only needs to specify in-fence for the
>     >       first bind/unbind call and the our-fence for the last
>     bind/unbind
>     >       call. Also, the VA released by an unbind call can be re-used by
>     >       any subsequent bind call in that in-order batch.
>     >
>     >       These things will break if binding/unbinding were to be allowed
>     to
>     >       go out of order (of submission) and user need to be extra
>     careful
>     >       not to run into pre-mature triggereing of out-fence and bind
>     failing
>     >       as VA is still in use etc.
>     >
>     >       Also, VM_BIND binds the provided mapping on the specified
>     address
>     >       space
>     >       (VM). So, the uapi is not engine/context specific.
>     >
>     >       We can however add a 'queue' to the uapi which can be one from
>     the
>     >       pre-defined queues,
>     >       I915_VM_BIND_QUEUE_0
>     >       I915_VM_BIND_QUEUE_1
>     >       ...
>     >       I915_VM_BIND_QUEUE_(N-1)
>     >
>     >       KMD will spawn an async work queue for each queue which will
>     only
>     >       bind the mappings on that queue in the order of submission.
>     >       User can assign the queue to per engine or anything like that.
>     >
>     >       But again here, user need to be careful and not deadlock these
>     >       queues with circular dependency of fences.
>     >
>     >       I prefer adding this later an as extension based on whether it
>     >       is really helping with the implementation.
>     >
>     >     I can tell you right now that having everything on a single
>     in-order
>     >     queue will not get us the perf we want.  What vulkan really wants
>     is one
>     >     of two things:
>     >      1. No implicit ordering of VM_BIND ops.  They just happen in
>     whatever
>     >     their dependencies are resolved and we ensure ordering ourselves
>     by
>     >     having a syncobj in the VkQueue.
>     >      2. The ability to create multiple VM_BIND queues.  We need at
>     least 2
>     >     but I don't see why there needs to be a limit besides the limits
>     the
>     >     i915 API already has on the number of engines.  Vulkan could
>     expose
>     >     multiple sparse binding queues to the client if it's not
>     arbitrarily
>     >     limited.
>
>     Thanks Jason, Lionel.
>
>     Jason, what are you referring to when you say "limits the i915 API
>     already
>     has on the number of engines"? I am not sure if there is such an uapi
>     today.
>
>   There's a limit of something like 64 total engines today based on the
>   number of bits we can cram into the exec flags in execbuffer2.  I think
>   someone had an extended version that allowed more but I ripped it out
>   because no one was using it.  Of course, execbuffer3 might not have that
>   problem at all.
>

Thanks Jason.
Ok, I am not sure which exec flag that is, but yeah, execbuffer3 probably
will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
and somehow export it to the user (I am thinking of embedding it in
I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
queues).
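
(Just to make the proposed encoding concrete, here is a rough userspace
sketch of decoding it. The param does not exist in i915_drm.h today, so
the define below is only a placeholder, and the bit layout is simply the
one described above.)

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#ifndef I915_PARAM_HAS_VM_BIND
#define I915_PARAM_HAS_VM_BIND	57	/* placeholder value, RFC-only param */
#endif

/*
 * bits[0]   -> VM_BIND supported
 * bits[1-3] -> 'n', number of VM_BIND queues = 2^n
 * Returns the number of queues, or 0 if VM_BIND is not supported.
 */
static unsigned int vm_bind_num_queues(int drm_fd)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	if (ioctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return 0;
	if (!(value & 0x1))
		return 0;

	return 1u << ((value >> 1) & 0x7);
}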

>     I am trying to see how many queues we need and don't want it to be
>     arbitrarily
>     large and unduely blow up memory usage and complexity in i915 driver.
>
>   I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>   could imagine a client wanting to create more than 1 sparse queue in which
>   case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
>   you allow two, I don't think the complexity is going up by allowing N.  As
>   for memory usage, creating more queues means more memory.  That's a
>   trade-off that userspace can make.  Again, the expected number here is 1
>   or 2 in the vast majority of cases so I don't think you need to worry.
>    

Ok, will start with n=3, meaning 8 queues.
That would require us to create 8 workqueues.
We can change 'n' later if required.
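
For illustration only, usage with such a queue field might look roughly
like the sketch below. None of this is the uapi from this series (v3 has
no queue member; it is only being discussed in this thread), so the
snippet defines its own stand-in struct instead of using real headers.

#include <stdint.h>

#define I915_VM_BIND_QUEUE_0	0	/* hypothetical, as proposed above */
#define I915_VM_BIND_QUEUE_1	1

struct hypothetical_vm_bind {
	uint32_t vm_id;		/* VM to bind into */
	uint32_t handle;	/* GEM BO handle */
	uint64_t start;		/* GPU virtual address of the mapping */
	uint64_t offset;	/* offset into the BO */
	uint64_t length;	/* size of the mapping */
	uint32_t queue_idx;	/* which in-order bind queue to use */
	uint64_t extensions;	/* in/out fences would be chained here */
};

int main(void)
{
	/* Immediate binds (vkBindImageMemory() and friends) go to their
	 * own queue with no in-fence, so they never sit behind sparse
	 * binds that are waiting on GPU work. */
	struct hypothetical_vm_bind immediate = {
		.vm_id = 1, .handle = 2,
		.start = 0x100000, .length = 0x10000,
		.queue_idx = I915_VM_BIND_QUEUE_0,
	};

	/* Sparse binds driven by a VkQueue use a separate queue and are
	 * ordered only against other operations on that queue. */
	struct hypothetical_vm_bind sparse = {
		.vm_id = 1, .handle = 3,
		.start = 0x200000, .offset = 0x40000, .length = 0x10000,
		.queue_idx = I915_VM_BIND_QUEUE_1,
	};

	(void)immediate;
	(void)sparse;
	/* Each of these would then be passed to the VM_BIND ioctl. */
	return 0;
}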

Niranjana

Niranjana Vishwanathapura June 7, 2022, 9:32 p.m. UTC | #21
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>  On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>  <niranjana.vishwanathapura@intel.com> wrote:
>>
>>    On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>    >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>    >
>>    >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>    >     <niranjana.vishwanathapura@intel.com> wrote:
>>    >
>>    >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>>    >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>    wrote:
>>    >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>    >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>    binding/unbinding
>>    >       the mapping in an
>>    >       >> > +async worker. The binding and unbinding will work like a
>>    special
>>    >       GPU engine.
>>    >       >> > +The binding and unbinding operations are serialized and
>>    will
>>    >       wait on specified
>>    >       >> > +input fences before the operation and will signal the
>>    output
>>    >       fences upon the
>>    >       >> > +completion of the operation. Due to serialization,
>>    completion of
>>    >       an operation
>>    >       >> > +will also indicate that all previous operations are also
>>    >       complete.
>>    >       >>
>>    >       >> I guess we should avoid saying "will immediately start
>>    >       binding/unbinding" if
>>    >       >> there are fences involved.
>>    >       >>
>>    >       >> And the fact that it's happening in an async worker seem to
>>    imply
>>    >       it's not
>>    >       >> immediate.
>>    >       >>
>>    >
>>    >       Ok, will fix.
>>    >       This was added because in earlier design binding was deferred
>>    until
>>    >       next execbuff.
>>    >       But now it is non-deferred (immediate in that sense). But yah,
>>    this is
>>    >       confusing
>>    >       and will fix it.
>>    >
>>    >       >>
>>    >       >> I have a question on the behavior of the bind operation when
>>    no
>>    >       input fence
>>    >       >> is provided. Let say I do :
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence1)
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence2)
>>    >       >>
>>    >       >> VM_BIND (out_fence=fence3)
>>    >       >>
>>    >       >>
>>    >       >> In what order are the fences going to be signaled?
>>    >       >>
>>    >       >> In the order of VM_BIND ioctls? Or out of order?
>>    >       >>
>>    >       >> Because you wrote "serialized I assume it's : in order
>>    >       >>
>>    >
>>    >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>>    unbind
>>    >       will use
>>    >       the same queue and hence are ordered.
>>    >
>>    >       >>
>>    >       >> One thing I didn't realize is that because we only get one
>>    >       "VM_BIND" engine,
>>    >       >> there is a disconnect from the Vulkan specification.
>>    >       >>
>>    >       >> In Vulkan VM_BIND operations are serialized but per engine.
>>    >       >>
>>    >       >> So you could have something like this :
>>    >       >>
>>    >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>    >       >>
>>    >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>    >       >>
>>    >       >>
>>    >       >> fence1 is not signaled
>>    >       >>
>>    >       >> fence3 is signaled
>>    >       >>
>>    >       >> So the second VM_BIND will proceed before the first VM_BIND.
>>    >       >>
>>    >       >>
>>    >       >> I guess we can deal with that scenario in userspace by doing
>>    the
>>    >       wait
>>    >       >> ourselves in one thread per engines.
>>    >       >>
>>    >       >> But then it makes the VM_BIND input fences useless.
>>    >       >>
>>    >       >>
>>    >       >> Daniel : what do you think? Should be rework this or just
>>    deal with
>>    >       wait
>>    >       >> fences in userspace?
>>    >       >>
>>    >       >
>>    >       >My opinion is rework this but make the ordering via an engine
>>    param
>>    >       optional.
>>    >       >
>>    >       >e.g. A VM can be configured so all binds are ordered within the
>>    VM
>>    >       >
>>    >       >e.g. A VM can be configured so all binds accept an engine
>>    argument
>>    >       (in
>>    >       >the case of the i915 likely this is a gem context handle) and
>>    binds
>>    >       >ordered with respect to that engine.
>>    >       >
>>    >       >This gives UMDs options as the later likely consumes more KMD
>>    >       resources
>>    >       >so if a different UMD can live with binds being ordered within
>>    the VM
>>    >       >they can use a mode consuming less resources.
>>    >       >
>>    >
>>    >       I think we need to be careful here if we are looking for some
>>    out of
>>    >       (submission) order completion of vm_bind/unbind.
>>    >       In-order completion means, in a batch of binds and unbinds to be
>>    >       completed in-order, user only needs to specify in-fence for the
>>    >       first bind/unbind call and the our-fence for the last
>>    bind/unbind
>>    >       call. Also, the VA released by an unbind call can be re-used by
>>    >       any subsequent bind call in that in-order batch.
>>    >
>>    >       These things will break if binding/unbinding were to be allowed
>>    to
>>    >       go out of order (of submission) and user need to be extra
>>    careful
>>    >       not to run into pre-mature triggereing of out-fence and bind
>>    failing
>>    >       as VA is still in use etc.
>>    >
>>    >       Also, VM_BIND binds the provided mapping on the specified
>>    address
>>    >       space
>>    >       (VM). So, the uapi is not engine/context specific.
>>    >
>>    >       We can however add a 'queue' to the uapi which can be one from
>>    the
>>    >       pre-defined queues,
>>    >       I915_VM_BIND_QUEUE_0
>>    >       I915_VM_BIND_QUEUE_1
>>    >       ...
>>    >       I915_VM_BIND_QUEUE_(N-1)
>>    >
>>    >       KMD will spawn an async work queue for each queue which will
>>    only
>>    >       bind the mappings on that queue in the order of submission.
>>    >       User can assign the queue to per engine or anything like that.
>>    >
>>    >       But again here, user need to be careful and not deadlock these
>>    >       queues with circular dependency of fences.
>>    >
>>    >       I prefer adding this later an as extension based on whether it
>>    >       is really helping with the implementation.
>>    >
>>    >     I can tell you right now that having everything on a single
>>    in-order
>>    >     queue will not get us the perf we want.  What vulkan really wants
>>    is one
>>    >     of two things:
>>    >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>    whatever
>>    >     their dependencies are resolved and we ensure ordering ourselves
>>    by
>>    >     having a syncobj in the VkQueue.
>>    >      2. The ability to create multiple VM_BIND queues.  We need at
>>    least 2
>>    >     but I don't see why there needs to be a limit besides the limits
>>    the
>>    >     i915 API already has on the number of engines.  Vulkan could
>>    expose
>>    >     multiple sparse binding queues to the client if it's not
>>    arbitrarily
>>    >     limited.
>>
>>    Thanks Jason, Lionel.
>>
>>    Jason, what are you referring to when you say "limits the i915 API
>>    already
>>    has on the number of engines"? I am not sure if there is such an uapi
>>    today.
>>
>>  There's a limit of something like 64 total engines today based on the
>>  number of bits we can cram into the exec flags in execbuffer2.  I think
>>  someone had an extended version that allowed more but I ripped it out
>>  because no one was using it.  Of course, execbuffer3 might not have that
>>  problem at all.
>>
>
>Thanks Jason.
>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
>and somehow export it to user (I am thinking of embedding it in
>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>queues.

Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3
will also have. So, we can simply define in the vm_bind/unbind structures:

#define I915_VM_BIND_MAX_QUEUE   64
        __u32 queue;

I think that will keep things simple.
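
For illustration only, here is a rough sketch of how that queue index could
sit in a bind structure. The surrounding field names are assumptions for the
example, not the final uapi:

#include <linux/types.h>

#define I915_VM_BIND_MAX_QUEUE   64

/* Hypothetical layout, for illustration only. */
struct sketch_vm_bind {
	__u32 vm_id;      /* address space (VM) to bind into */
	__u32 queue;      /* bind queue index, must be < I915_VM_BIND_MAX_QUEUE */
	__u32 handle;     /* GEM object handle */
	__u32 pad;        /* keep 64-bit fields aligned */
	__u64 start;      /* GPU virtual address of the mapping */
	__u64 offset;     /* offset into the object */
	__u64 length;     /* length of the mapping */
	__u64 flags;
	__u64 extensions; /* chained extensions (e.g. in/out fences) */
};

Each bind queue then only orders operations submitted with the same queue
index, and the lookup stays a trivial array index on the KMD side.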

Niranjana

>
>>    I am trying to see how many queues we need and don't want it to be
>>    arbitrarily
>>    large and unduely blow up memory usage and complexity in i915 driver.
>>
>>  I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>>  could imagine a client wanting to create more than 1 sparse queue in which
>>  case, it'll be N+1 but that's unlikely.  As far as complexity goes, once
>>  you allow two, I don't think the complexity is going up by allowing N.  As
>>  for memory usage, creating more queues means more memory.  That's a
>>  trade-off that userspace can make.  Again, the expected number here is 1
>>  or 2 in the vast majority of cases so I don't think you need to worry.
>
>Ok, will start with n=3 meaning 8 queues.
>That would require us create 8 workqueues.
>We can change 'n' later if required.
>
>Niranjana
>
>>
>>    >     Why?  Because Vulkan has two basic kind of bind operations and we
>>    don't
>>    >     want any dependencies between them:
>>    >      1. Immediate.  These happen right after BO creation or maybe as
>>    part of
>>    >     vkBindImageMemory() or VkBindBufferMemory().  These don't happen
>>    on a
>>    >     queue and we don't want them serialized with anything.  To
>>    synchronize
>>    >     with submit, we'll have a syncobj in the VkDevice which is
>>    signaled by
>>    >     all immediate bind operations and make submits wait on it.
>>    >      2. Queued (sparse): These happen on a VkQueue which may be the
>>    same as
>>    >     a render/compute queue or may be its own queue.  It's up to us
>>    what we
>>    >     want to advertise.  From the Vulkan API PoV, this is like any
>>    other
>>    >     queue.  Operations on it wait on and signal semaphores.  If we
>>    have a
>>    >     VM_BIND engine, we'd provide syncobjs to wait and signal just like
>>    we do
>>    >     in execbuf().
>>    >     The important thing is that we don't want one type of operation to
>>    block
>>    >     on the other.  If immediate binds are blocking on sparse binds,
>>    it's
>>    >     going to cause over-synchronization issues.
>>    >     In terms of the internal implementation, I know that there's going
>>    to be
>>    >     a lock on the VM and that we can't actually do these things in
>>    >     parallel.  That's fine.  Once the dma_fences have signaled and
>>    we're
>>
>>    Thats correct. It is like a single VM_BIND engine with multiple queues
>>    feeding to it.
>>
>>  Right.  As long as the queues themselves are independent and can block on
>>  dma_fences without holding up other queues, I think we're fine.
>>
>>    >     unblocked to do the bind operation, I don't care if there's a bit
>>    of
>>    >     synchronization due to locking.  That's expected.  What we can't
>>    afford
>>    >     to have is an immediate bind operation suddenly blocking on a
>>    sparse
>>    >     operation which is blocked on a compute job that's going to run
>>    for
>>    >     another 5ms.
>>
>>    As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>    VM_BIND
>>    on other VMs. I am not sure about usecases here, but just wanted to
>>    clarify.
>>
>>  Yes, that's what I would expect.
>>  --Jason
>>
>>    Niranjana
>>
>>    >     For reference, Windows solves this by allowing arbitrarily many
>>    paging
>>    >     queues (what they call a VM_BIND engine/queue).  That design works
>>    >     pretty well and solves the problems in question.  Again, we could
>>    just
>>    >     make everything out-of-order and require using syncobjs to order
>>    things
>>    >     as userspace wants. That'd be fine too.
>>    >     One more note while I'm here: danvet said something on IRC about
>>    VM_BIND
>>    >     queues waiting for syncobjs to materialize.  We don't really
>>    want/need
>>    >     this.  We already have all the machinery in userspace to handle
>>    >     wait-before-signal and waiting for syncobj fences to materialize
>>    and
>>    >     that machinery is on by default.  It would actually take MORE work
>>    in
>>    >     Mesa to turn it off and take advantage of the kernel being able to
>>    wait
>>    >     for syncobjs to materialize.  Also, getting that right is
>>    ridiculously
>>    >     hard and I really don't want to get it wrong in kernel 
>>space.     When we
>>    >     do memory fences, wait-before-signal will be a thing.  We don't
>>    need to
>>    >     try and make it a thing for syncobj.
>>    >     --Jason
>>    >
>>    >   Thanks Jason,
>>    >
>>    >   I missed the bit in the Vulkan spec that we're allowed to have a
>>    sparse
>>    >   queue that does not implement either graphics or compute operations
>>    :
>>    >
>>    >     "While some implementations may include
>>    VK_QUEUE_SPARSE_BINDING_BIT
>>    >     support in queue families that also include
>>    >
>>    >      graphics and compute support, other implementations may only
>>    expose a
>>    >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>    >
>>    >      family."
>>    >
>>    >   So it can all be all a vm_bind engine that just does bind/unbind
>>    >   operations.
>>    >
>>    >   But yes we need another engine for the immediate/non-sparse
>>    operations.
>>    >
>>    >   -Lionel
>>    >
>>    >         >
>>    >       Daniel, any thoughts?
>>    >
>>    >       Niranjana
>>    >
>>    >       >Matt
>>    >       >
>>    >       >>
>>    >       >> Sorry I noticed this late.
>>    >       >>
>>    >       >>
>>    >       >> -Lionel
>>    >       >>
>>    >       >>
Tvrtko Ursulin June 8, 2022, 7:33 a.m. UTC | #22
On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>> On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>  On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>  <niranjana.vishwanathapura@intel.com> wrote:
>>>
>>>    On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>>    >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>    >
>>>    >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>>    >     <niranjana.vishwanathapura@intel.com> wrote:
>>>    >
>>>    >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost 
>>> wrote:
>>>    >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>>    wrote:
>>>    >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>    >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>    binding/unbinding
>>>    >       the mapping in an
>>>    >       >> > +async worker. The binding and unbinding will work 
>>> like a
>>>    special
>>>    >       GPU engine.
>>>    >       >> > +The binding and unbinding operations are serialized and
>>>    will
>>>    >       wait on specified
>>>    >       >> > +input fences before the operation and will signal the
>>>    output
>>>    >       fences upon the
>>>    >       >> > +completion of the operation. Due to serialization,
>>>    completion of
>>>    >       an operation
>>>    >       >> > +will also indicate that all previous operations are 
>>> also
>>>    >       complete.
>>>    >       >>
>>>    >       >> I guess we should avoid saying "will immediately start
>>>    >       binding/unbinding" if
>>>    >       >> there are fences involved.
>>>    >       >>
>>>    >       >> And the fact that it's happening in an async worker 
>>> seem to
>>>    imply
>>>    >       it's not
>>>    >       >> immediate.
>>>    >       >>
>>>    >
>>>    >       Ok, will fix.
>>>    >       This was added because in earlier design binding was deferred
>>>    until
>>>    >       next execbuff.
>>>    >       But now it is non-deferred (immediate in that sense). But 
>>> yah,
>>>    this is
>>>    >       confusing
>>>    >       and will fix it.
>>>    >
>>>    >       >>
>>>    >       >> I have a question on the behavior of the bind operation 
>>> when
>>>    no
>>>    >       input fence
>>>    >       >> is provided. Let say I do :
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence1)
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence2)
>>>    >       >>
>>>    >       >> VM_BIND (out_fence=fence3)
>>>    >       >>
>>>    >       >>
>>>    >       >> In what order are the fences going to be signaled?
>>>    >       >>
>>>    >       >> In the order of VM_BIND ioctls? Or out of order?
>>>    >       >>
>>>    >       >> Because you wrote "serialized I assume it's : in order
>>>    >       >>
>>>    >
>>>    >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind 
>>> and
>>>    unbind
>>>    >       will use
>>>    >       the same queue and hence are ordered.
>>>    >
>>>    >       >>
>>>    >       >> One thing I didn't realize is that because we only get one
>>>    >       "VM_BIND" engine,
>>>    >       >> there is a disconnect from the Vulkan specification.
>>>    >       >>
>>>    >       >> In Vulkan VM_BIND operations are serialized but per 
>>> engine.
>>>    >       >>
>>>    >       >> So you could have something like this :
>>>    >       >>
>>>    >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>    >       >>
>>>    >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>    >       >>
>>>    >       >>
>>>    >       >> fence1 is not signaled
>>>    >       >>
>>>    >       >> fence3 is signaled
>>>    >       >>
>>>    >       >> So the second VM_BIND will proceed before the first 
>>> VM_BIND.
>>>    >       >>
>>>    >       >>
>>>    >       >> I guess we can deal with that scenario in userspace by 
>>> doing
>>>    the
>>>    >       wait
>>>    >       >> ourselves in one thread per engines.
>>>    >       >>
>>>    >       >> But then it makes the VM_BIND input fences useless.
>>>    >       >>
>>>    >       >>
>>>    >       >> Daniel : what do you think? Should be rework this or just
>>>    deal with
>>>    >       wait
>>>    >       >> fences in userspace?
>>>    >       >>
>>>    >       >
>>>    >       >My opinion is rework this but make the ordering via an 
>>> engine
>>>    param
>>>    >       optional.
>>>    >       >
>>>    >       >e.g. A VM can be configured so all binds are ordered 
>>> within the
>>>    VM
>>>    >       >
>>>    >       >e.g. A VM can be configured so all binds accept an engine
>>>    argument
>>>    >       (in
>>>    >       >the case of the i915 likely this is a gem context handle) 
>>> and
>>>    binds
>>>    >       >ordered with respect to that engine.
>>>    >       >
>>>    >       >This gives UMDs options as the later likely consumes more 
>>> KMD
>>>    >       resources
>>>    >       >so if a different UMD can live with binds being ordered 
>>> within
>>>    the VM
>>>    >       >they can use a mode consuming less resources.
>>>    >       >
>>>    >
>>>    >       I think we need to be careful here if we are looking for some
>>>    out of
>>>    >       (submission) order completion of vm_bind/unbind.
>>>    >       In-order completion means, in a batch of binds and unbinds 
>>> to be
>>>    >       completed in-order, user only needs to specify in-fence 
>>> for the
>>>    >       first bind/unbind call and the our-fence for the last
>>>    bind/unbind
>>>    >       call. Also, the VA released by an unbind call can be 
>>> re-used by
>>>    >       any subsequent bind call in that in-order batch.
>>>    >
>>>    >       These things will break if binding/unbinding were to be 
>>> allowed
>>>    to
>>>    >       go out of order (of submission) and user need to be extra
>>>    careful
>>>    >       not to run into pre-mature triggereing of out-fence and bind
>>>    failing
>>>    >       as VA is still in use etc.
>>>    >
>>>    >       Also, VM_BIND binds the provided mapping on the specified
>>>    address
>>>    >       space
>>>    >       (VM). So, the uapi is not engine/context specific.
>>>    >
>>>    >       We can however add a 'queue' to the uapi which can be one 
>>> from
>>>    the
>>>    >       pre-defined queues,
>>>    >       I915_VM_BIND_QUEUE_0
>>>    >       I915_VM_BIND_QUEUE_1
>>>    >       ...
>>>    >       I915_VM_BIND_QUEUE_(N-1)
>>>    >
>>>    >       KMD will spawn an async work queue for each queue which will
>>>    only
>>>    >       bind the mappings on that queue in the order of submission.
>>>    >       User can assign the queue to per engine or anything like 
>>> that.
>>>    >
>>>    >       But again here, user need to be careful and not deadlock 
>>> these
>>>    >       queues with circular dependency of fences.
>>>    >
>>>    >       I prefer adding this later an as extension based on 
>>> whether it
>>>    >       is really helping with the implementation.
>>>    >
>>>    >     I can tell you right now that having everything on a single
>>>    in-order
>>>    >     queue will not get us the perf we want.  What vulkan really 
>>> wants
>>>    is one
>>>    >     of two things:
>>>    >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>>    whatever
>>>    >     their dependencies are resolved and we ensure ordering 
>>> ourselves
>>>    by
>>>    >     having a syncobj in the VkQueue.
>>>    >      2. The ability to create multiple VM_BIND queues.  We need at
>>>    least 2
>>>    >     but I don't see why there needs to be a limit besides the 
>>> limits
>>>    the
>>>    >     i915 API already has on the number of engines.  Vulkan could
>>>    expose
>>>    >     multiple sparse binding queues to the client if it's not
>>>    arbitrarily
>>>    >     limited.
>>>
>>>    Thanks Jason, Lionel.
>>>
>>>    Jason, what are you referring to when you say "limits the i915 API
>>>    already
>>>    has on the number of engines"? I am not sure if there is such an uapi
>>>    today.
>>>
>>>  There's a limit of something like 64 total engines today based on the
>>>  number of bits we can cram into the exec flags in execbuffer2.  I think
>>>  someone had an extended version that allowed more but I ripped it out
>>>  because no one was using it.  Of course, execbuffer3 might not have 
>>> that
>>>  problem at all.
>>>
>>
>> Thanks Jason.
>> Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
>> will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
>> and somehow export it to user (I am thinking of embedding it in
>> I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>> queues.
> 
> Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
> will also have. So, we can simply define in vm_bind/unbind structures,
> 
> #define I915_VM_BIND_MAX_QUEUE   64
>         __u32 queue;
> 
> I think that will keep things simple.

Hmmm? What does the execbuf2 limit have to do with how many engines the
hardware can have? I suggest not doing that.

The change which added this:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

to context creation needs to be undone, so that users can create engine
maps with all hardware engines and execbuf3 can access them all.

Regards,

Tvrtko

> 
> Niranjana
> 
>>
>>>    I am trying to see how many queues we need and don't want it to be
>>>    arbitrarily
>>>    large and unduely blow up memory usage and complexity in i915 driver.
>>>
>>>  I expect a Vulkan driver to use at most 2 in the vast majority of 
>>> cases. I
>>>  could imagine a client wanting to create more than 1 sparse queue in 
>>> which
>>>  case, it'll be N+1 but that's unlikely.  As far as complexity goes, 
>>> once
>>>  you allow two, I don't think the complexity is going up by allowing 
>>> N.  As
>>>  for memory usage, creating more queues means more memory.  That's a
>>>  trade-off that userspace can make.  Again, the expected number here 
>>> is 1
>>>  or 2 in the vast majority of cases so I don't think you need to worry.
>>
>> Ok, will start with n=3 meaning 8 queues.
>> That would require us create 8 workqueues.
>> We can change 'n' later if required.
>>
>> Niranjana
>>
>>>
>>>    >     Why?  Because Vulkan has two basic kind of bind operations 
>>> and we
>>>    don't
>>>    >     want any dependencies between them:
>>>    >      1. Immediate.  These happen right after BO creation or 
>>> maybe as
>>>    part of
>>>    >     vkBindImageMemory() or VkBindBufferMemory().  These don't 
>>> happen
>>>    on a
>>>    >     queue and we don't want them serialized with anything.  To
>>>    synchronize
>>>    >     with submit, we'll have a syncobj in the VkDevice which is
>>>    signaled by
>>>    >     all immediate bind operations and make submits wait on it.
>>>    >      2. Queued (sparse): These happen on a VkQueue which may be the
>>>    same as
>>>    >     a render/compute queue or may be its own queue.  It's up to us
>>>    what we
>>>    >     want to advertise.  From the Vulkan API PoV, this is like any
>>>    other
>>>    >     queue.  Operations on it wait on and signal semaphores.  If we
>>>    have a
>>>    >     VM_BIND engine, we'd provide syncobjs to wait and signal 
>>> just like
>>>    we do
>>>    >     in execbuf().
>>>    >     The important thing is that we don't want one type of 
>>> operation to
>>>    block
>>>    >     on the other.  If immediate binds are blocking on sparse binds,
>>>    it's
>>>    >     going to cause over-synchronization issues.
>>>    >     In terms of the internal implementation, I know that there's 
>>> going
>>>    to be
>>>    >     a lock on the VM and that we can't actually do these things in
>>>    >     parallel.  That's fine.  Once the dma_fences have signaled and
>>>    we're
>>>
>>>    Thats correct. It is like a single VM_BIND engine with multiple 
>>> queues
>>>    feeding to it.
>>>
>>>  Right.  As long as the queues themselves are independent and can 
>>> block on
>>>  dma_fences without holding up other queues, I think we're fine.
>>>
>>>    >     unblocked to do the bind operation, I don't care if there's 
>>> a bit
>>>    of
>>>    >     synchronization due to locking.  That's expected.  What we 
>>> can't
>>>    afford
>>>    >     to have is an immediate bind operation suddenly blocking on a
>>>    sparse
>>>    >     operation which is blocked on a compute job that's going to run
>>>    for
>>>    >     another 5ms.
>>>
>>>    As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>>    VM_BIND
>>>    on other VMs. I am not sure about usecases here, but just wanted to
>>>    clarify.
>>>
>>>  Yes, that's what I would expect.
>>>  --Jason
>>>
>>>    Niranjana
>>>
>>>    >     For reference, Windows solves this by allowing arbitrarily many
>>>    paging
>>>    >     queues (what they call a VM_BIND engine/queue).  That design 
>>> works
>>>    >     pretty well and solves the problems in question.  Again, we 
>>> could
>>>    just
>>>    >     make everything out-of-order and require using syncobjs to 
>>> order
>>>    things
>>>    >     as userspace wants. That'd be fine too.
>>>    >     One more note while I'm here: danvet said something on IRC 
>>> about
>>>    VM_BIND
>>>    >     queues waiting for syncobjs to materialize.  We don't really
>>>    want/need
>>>    >     this.  We already have all the machinery in userspace to handle
>>>    >     wait-before-signal and waiting for syncobj fences to 
>>> materialize
>>>    and
>>>    >     that machinery is on by default.  It would actually take 
>>> MORE work
>>>    in
>>>    >     Mesa to turn it off and take advantage of the kernel being 
>>> able to
>>>    wait
>>>    >     for syncobjs to materialize.  Also, getting that right is
>>>    ridiculously
>>>    >     hard and I really don't want to get it wrong in kernel 
>>> space.     When we
>>>    >     do memory fences, wait-before-signal will be a thing.  We don't
>>>    need to
>>>    >     try and make it a thing for syncobj.
>>>    >     --Jason
>>>    >
>>>    >   Thanks Jason,
>>>    >
>>>    >   I missed the bit in the Vulkan spec that we're allowed to have a
>>>    sparse
>>>    >   queue that does not implement either graphics or compute 
>>> operations
>>>    :
>>>    >
>>>    >     "While some implementations may include
>>>    VK_QUEUE_SPARSE_BINDING_BIT
>>>    >     support in queue families that also include
>>>    >
>>>    >      graphics and compute support, other implementations may only
>>>    expose a
>>>    >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>    >
>>>    >      family."
>>>    >
>>>    >   So it can all be all a vm_bind engine that just does bind/unbind
>>>    >   operations.
>>>    >
>>>    >   But yes we need another engine for the immediate/non-sparse
>>>    operations.
>>>    >
>>>    >   -Lionel
>>>    >
>>>    >         >
>>>    >       Daniel, any thoughts?
>>>    >
>>>    >       Niranjana
>>>    >
>>>    >       >Matt
>>>    >       >
>>>    >       >>
>>>    >       >> Sorry I noticed this late.
>>>    >       >>
>>>    >       >>
>>>    >       >> -Lionel
>>>    >       >>
>>>    >       >>
Niranjana Vishwanathapura June 8, 2022, 9:44 p.m. UTC | #23
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>
>
>On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>
>>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>   >
>>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>>>>   >
>>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew 
>>>>Brost wrote:
>>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
>>>>   wrote:
>>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>   binding/unbinding
>>>>   >       the mapping in an
>>>>   >       >> > +async worker. The binding and unbinding will 
>>>>work like a
>>>>   special
>>>>   >       GPU engine.
>>>>   >       >> > +The binding and unbinding operations are serialized and
>>>>   will
>>>>   >       wait on specified
>>>>   >       >> > +input fences before the operation and will signal the
>>>>   output
>>>>   >       fences upon the
>>>>   >       >> > +completion of the operation. Due to serialization,
>>>>   completion of
>>>>   >       an operation
>>>>   >       >> > +will also indicate that all previous operations 
>>>>are also
>>>>   >       complete.
>>>>   >       >>
>>>>   >       >> I guess we should avoid saying "will immediately start
>>>>   >       binding/unbinding" if
>>>>   >       >> there are fences involved.
>>>>   >       >>
>>>>   >       >> And the fact that it's happening in an async 
>>>>worker seem to
>>>>   imply
>>>>   >       it's not
>>>>   >       >> immediate.
>>>>   >       >>
>>>>   >
>>>>   >       Ok, will fix.
>>>>   >       This was added because in earlier design binding was deferred
>>>>   until
>>>>   >       next execbuff.
>>>>   >       But now it is non-deferred (immediate in that sense). 
>>>>But yah,
>>>>   this is
>>>>   >       confusing
>>>>   >       and will fix it.
>>>>   >
>>>>   >       >>
>>>>   >       >> I have a question on the behavior of the bind 
>>>>operation when
>>>>   no
>>>>   >       input fence
>>>>   >       >> is provided. Let say I do :
>>>>   >       >>
>>>>   >       >> VM_BIND (out_fence=fence1)
>>>>   >       >>
>>>>   >       >> VM_BIND (out_fence=fence2)
>>>>   >       >>
>>>>   >       >> VM_BIND (out_fence=fence3)
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> In what order are the fences going to be signaled?
>>>>   >       >>
>>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>>>   >       >>
>>>>   >       >> Because you wrote "serialized I assume it's : in order
>>>>   >       >>
>>>>   >
>>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that 
>>>>bind and
>>>>   unbind
>>>>   >       will use
>>>>   >       the same queue and hence are ordered.
>>>>   >
>>>>   >       >>
>>>>   >       >> One thing I didn't realize is that because we only get one
>>>>   >       "VM_BIND" engine,
>>>>   >       >> there is a disconnect from the Vulkan specification.
>>>>   >       >>
>>>>   >       >> In Vulkan VM_BIND operations are serialized but 
>>>>per engine.
>>>>   >       >>
>>>>   >       >> So you could have something like this :
>>>>   >       >>
>>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>>   >       >>
>>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> fence1 is not signaled
>>>>   >       >>
>>>>   >       >> fence3 is signaled
>>>>   >       >>
>>>>   >       >> So the second VM_BIND will proceed before the 
>>>>first VM_BIND.
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> I guess we can deal with that scenario in 
>>>>userspace by doing
>>>>   the
>>>>   >       wait
>>>>   >       >> ourselves in one thread per engines.
>>>>   >       >>
>>>>   >       >> But then it makes the VM_BIND input fences useless.
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> Daniel : what do you think? Should be rework this or just
>>>>   deal with
>>>>   >       wait
>>>>   >       >> fences in userspace?
>>>>   >       >>
>>>>   >       >
>>>>   >       >My opinion is rework this but make the ordering via 
>>>>an engine
>>>>   param
>>>>   >       optional.
>>>>   >       >
>>>>   >       >e.g. A VM can be configured so all binds are ordered 
>>>>within the
>>>>   VM
>>>>   >       >
>>>>   >       >e.g. A VM can be configured so all binds accept an engine
>>>>   argument
>>>>   >       (in
>>>>   >       >the case of the i915 likely this is a gem context 
>>>>handle) and
>>>>   binds
>>>>   >       >ordered with respect to that engine.
>>>>   >       >
>>>>   >       >This gives UMDs options as the later likely consumes 
>>>>more KMD
>>>>   >       resources
>>>>   >       >so if a different UMD can live with binds being 
>>>>ordered within
>>>>   the VM
>>>>   >       >they can use a mode consuming less resources.
>>>>   >       >
>>>>   >
>>>>   >       I think we need to be careful here if we are looking for some
>>>>   out of
>>>>   >       (submission) order completion of vm_bind/unbind.
>>>>   >       In-order completion means, in a batch of binds and 
>>>>unbinds to be
>>>>   >       completed in-order, user only needs to specify 
>>>>in-fence for the
>>>>   >       first bind/unbind call and the our-fence for the last
>>>>   bind/unbind
>>>>   >       call. Also, the VA released by an unbind call can be 
>>>>re-used by
>>>>   >       any subsequent bind call in that in-order batch.
>>>>   >
>>>>   >       These things will break if binding/unbinding were to 
>>>>be allowed
>>>>   to
>>>>   >       go out of order (of submission) and user need to be extra
>>>>   careful
>>>>   >       not to run into pre-mature triggereing of out-fence and bind
>>>>   failing
>>>>   >       as VA is still in use etc.
>>>>   >
>>>>   >       Also, VM_BIND binds the provided mapping on the specified
>>>>   address
>>>>   >       space
>>>>   >       (VM). So, the uapi is not engine/context specific.
>>>>   >
>>>>   >       We can however add a 'queue' to the uapi which can be 
>>>>one from
>>>>   the
>>>>   >       pre-defined queues,
>>>>   >       I915_VM_BIND_QUEUE_0
>>>>   >       I915_VM_BIND_QUEUE_1
>>>>   >       ...
>>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>   >
>>>>   >       KMD will spawn an async work queue for each queue which will
>>>>   only
>>>>   >       bind the mappings on that queue in the order of submission.
>>>>   >       User can assign the queue to per engine or anything 
>>>>like that.
>>>>   >
>>>>   >       But again here, user need to be careful and not 
>>>>deadlock these
>>>>   >       queues with circular dependency of fences.
>>>>   >
>>>>   >       I prefer adding this later an as extension based on 
>>>>whether it
>>>>   >       is really helping with the implementation.
>>>>   >
>>>>   >     I can tell you right now that having everything on a single
>>>>   in-order
>>>>   >     queue will not get us the perf we want.  What vulkan 
>>>>really wants
>>>>   is one
>>>>   >     of two things:
>>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen in
>>>>   whatever
>>>>   >     their dependencies are resolved and we ensure ordering 
>>>>ourselves
>>>>   by
>>>>   >     having a syncobj in the VkQueue.
>>>>   >      2. The ability to create multiple VM_BIND queues.  We need at
>>>>   least 2
>>>>   >     but I don't see why there needs to be a limit besides 
>>>>the limits
>>>>   the
>>>>   >     i915 API already has on the number of engines.  Vulkan could
>>>>   expose
>>>>   >     multiple sparse binding queues to the client if it's not
>>>>   arbitrarily
>>>>   >     limited.
>>>>
>>>>   Thanks Jason, Lionel.
>>>>
>>>>   Jason, what are you referring to when you say "limits the i915 API
>>>>   already
>>>>   has on the number of engines"? I am not sure if there is such an uapi
>>>>   today.
>>>>
>>>> There's a limit of something like 64 total engines today based on the
>>>> number of bits we can cram into the exec flags in execbuffer2.  I think
>>>> someone had an extended version that allowed more but I ripped it out
>>>> because no one was using it.  Of course, execbuffer3 might not 
>>>>have that
>>>> problem at all.
>>>>
>>>
>>>Thanks Jason.
>>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
>>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
>>>and somehow export it to user (I am thinking of embedding it in
>>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>>>queues.
>>
>>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
>>will also have. So, we can simply define in vm_bind/unbind structures,
>>
>>#define I915_VM_BIND_MAX_QUEUE   64
>>        __u32 queue;
>>
>>I think that will keep things simple.
>
>Hmmm? What does execbuf2 limit has to do with how many engines 
>hardware can have? I suggest not to do that.
>
>Change with added this:
>
>	if (set.num_engines > I915_EXEC_RING_MASK + 1)
>		return -EINVAL;
>
>To context creation needs to be undone and so let users create engine 
>maps with all hardware engines, and let execbuf3 access them all.
>

The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuf3 as
well. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
make it N+1).
But, as discussed in another thread of this RFC series, we are planning to
drop I915_EXEC_RING_MASK in execbuf3. So, there won't be any uapi that limits
the number of engines (and hence the number of vm_bind queues that need to
be supported).

If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx),
then we need a hashmap to look up the queue (a wq, work_item and a linked
list) from the user-specified queue index.
The other option is to put some hard limit (say 64 or 65) and use an array
of queues in the VM (each created upon first use). I prefer this; a rough
sketch of that option is below.
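
For illustration only, a minimal sketch of that array-of-queues option (the
names and structure are made up for this example, not actual i915 code),
with each bind queue backed by an ordered workqueue created on first use:

#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/types.h>
#include <linux/err.h>

#define VM_BIND_MAX_QUEUE	64

struct sketch_vm {
	struct mutex lock;
	/* one ordered workqueue per bind queue, created on first use */
	struct workqueue_struct *bind_wq[VM_BIND_MAX_QUEUE];
};

/* Look up (or lazily create) the workqueue for a user-supplied queue index. */
static struct workqueue_struct *
sketch_get_bind_queue(struct sketch_vm *vm, u32 queue_idx)
{
	struct workqueue_struct *wq;

	if (queue_idx >= VM_BIND_MAX_QUEUE)
		return ERR_PTR(-EINVAL);

	mutex_lock(&vm->lock);
	wq = vm->bind_wq[queue_idx];
	if (!wq) {
		/* ordered: binds on this queue complete in submission order */
		wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
		if (!wq)
			wq = ERR_PTR(-ENOMEM);
		else
			vm->bind_wq[queue_idx] = wq;
	}
	mutex_unlock(&vm->lock);

	return wq;
}

With the hashmap option, the fixed array above would instead become an xarray
or hashtable keyed by queue_idx, at the cost of the extra lookup and
allocation bookkeeping.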

Niranjana

>Regards,
>
>Tvrtko
>
>>
>>Niranjana
>>
>>>
>>>>   I am trying to see how many queues we need and don't want it to be
>>>>   arbitrarily
>>>>   large and unduely blow up memory usage and complexity in i915 driver.
>>>>
>>>> I expect a Vulkan driver to use at most 2 in the vast majority 
>>>>of cases. I
>>>> could imagine a client wanting to create more than 1 sparse 
>>>>queue in which
>>>> case, it'll be N+1 but that's unlikely.  As far as complexity 
>>>>goes, once
>>>> you allow two, I don't think the complexity is going up by 
>>>>allowing N.  As
>>>> for memory usage, creating more queues means more memory.  That's a
>>>> trade-off that userspace can make.  Again, the expected number 
>>>>here is 1
>>>> or 2 in the vast majority of cases so I don't think you need to worry.
>>>
>>>Ok, will start with n=3 meaning 8 queues.
>>>That would require us create 8 workqueues.
>>>We can change 'n' later if required.
>>>
>>>Niranjana
>>>
>>>>
>>>>   >     Why?  Because Vulkan has two basic kind of bind 
>>>>operations and we
>>>>   don't
>>>>   >     want any dependencies between them:
>>>>   >      1. Immediate.  These happen right after BO creation or 
>>>>maybe as
>>>>   part of
>>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These 
>>>>don't happen
>>>>   on a
>>>>   >     queue and we don't want them serialized with anything.  To
>>>>   synchronize
>>>>   >     with submit, we'll have a syncobj in the VkDevice which is
>>>>   signaled by
>>>>   >     all immediate bind operations and make submits wait on it.
>>>>   >      2. Queued (sparse): These happen on a VkQueue which may be the
>>>>   same as
>>>>   >     a render/compute queue or may be its own queue.  It's up to us
>>>>   what we
>>>>   >     want to advertise.  From the Vulkan API PoV, this is like any
>>>>   other
>>>>   >     queue.  Operations on it wait on and signal semaphores.  If we
>>>>   have a
>>>>   >     VM_BIND engine, we'd provide syncobjs to wait and 
>>>>signal just like
>>>>   we do
>>>>   >     in execbuf().
>>>>   >     The important thing is that we don't want one type of 
>>>>operation to
>>>>   block
>>>>   >     on the other.  If immediate binds are blocking on sparse binds,
>>>>   it's
>>>>   >     going to cause over-synchronization issues.
>>>>   >     In terms of the internal implementation, I know that 
>>>>there's going
>>>>   to be
>>>>   >     a lock on the VM and that we can't actually do these things in
>>>>   >     parallel.  That's fine.  Once the dma_fences have signaled and
>>>>   we're
>>>>
>>>>   Thats correct. It is like a single VM_BIND engine with 
>>>>multiple queues
>>>>   feeding to it.
>>>>
>>>> Right.  As long as the queues themselves are independent and 
>>>>can block on
>>>> dma_fences without holding up other queues, I think we're fine.
>>>>
>>>>   >     unblocked to do the bind operation, I don't care if 
>>>>there's a bit
>>>>   of
>>>>   >     synchronization due to locking.  That's expected.  What 
>>>>we can't
>>>>   afford
>>>>   >     to have is an immediate bind operation suddenly blocking on a
>>>>   sparse
>>>>   >     operation which is blocked on a compute job that's going to run
>>>>   for
>>>>   >     another 5ms.
>>>>
>>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>>>>   VM_BIND
>>>>   on other VMs. I am not sure about usecases here, but just wanted to
>>>>   clarify.
>>>>
>>>> Yes, that's what I would expect.
>>>> --Jason
>>>>
>>>>   Niranjana
>>>>
>>>>   >     For reference, Windows solves this by allowing arbitrarily many
>>>>   paging
>>>>   >     queues (what they call a VM_BIND engine/queue).  That 
>>>>design works
>>>>   >     pretty well and solves the problems in question.  
>>>>Again, we could
>>>>   just
>>>>   >     make everything out-of-order and require using syncobjs 
>>>>to order
>>>>   things
>>>>   >     as userspace wants. That'd be fine too.
>>>>   >     One more note while I'm here: danvet said something on 
>>>>IRC about
>>>>   VM_BIND
>>>>   >     queues waiting for syncobjs to materialize.  We don't really
>>>>   want/need
>>>>   >     this.  We already have all the machinery in userspace to handle
>>>>   >     wait-before-signal and waiting for syncobj fences to 
>>>>materialize
>>>>   and
>>>>   >     that machinery is on by default.  It would actually 
>>>>take MORE work
>>>>   in
>>>>   >     Mesa to turn it off and take advantage of the kernel 
>>>>being able to
>>>>   wait
>>>>   >     for syncobjs to materialize.  Also, getting that right is
>>>>   ridiculously
>>>>   >     hard and I really don't want to get it wrong in kernel 
>>>>space.     When we
>>>>   >     do memory fences, wait-before-signal will be a thing.  We don't
>>>>   need to
>>>>   >     try and make it a thing for syncobj.
>>>>   >     --Jason
>>>>   >
>>>>   >   Thanks Jason,
>>>>   >
>>>>   >   I missed the bit in the Vulkan spec that we're allowed to have a
>>>>   sparse
>>>>   >   queue that does not implement either graphics or compute 
>>>>operations
>>>>   :
>>>>   >
>>>>   >     "While some implementations may include
>>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>   >     support in queue families that also include
>>>>   >
>>>>   >      graphics and compute support, other implementations may only
>>>>   expose a
>>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>   >
>>>>   >      family."
>>>>   >
>>>>   >   So it can all be all a vm_bind engine that just does bind/unbind
>>>>   >   operations.
>>>>   >
>>>>   >   But yes we need another engine for the immediate/non-sparse
>>>>   operations.
>>>>   >
>>>>   >   -Lionel
>>>>   >
>>>>   >         >
>>>>   >       Daniel, any thoughts?
>>>>   >
>>>>   >       Niranjana
>>>>   >
>>>>   >       >Matt
>>>>   >       >
>>>>   >       >>
>>>>   >       >> Sorry I noticed this late.
>>>>   >       >>
>>>>   >       >>
>>>>   >       >> -Lionel
>>>>   >       >>
>>>>   >       >>
Jason Ekstrand June 8, 2022, 9:55 p.m. UTC | #24
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura@intel.com> wrote:

> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >
> >
> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
> wrote:
> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
> >>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>
> >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>   >
> >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
> >>>>   >
> >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
> >>>>Brost wrote:
> >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
> >>>>   wrote:
> >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>   binding/unbinding
> >>>>   >       the mapping in an
> >>>>   >       >> > +async worker. The binding and unbinding will
> >>>>work like a
> >>>>   special
> >>>>   >       GPU engine.
> >>>>   >       >> > +The binding and unbinding operations are serialized
> and
> >>>>   will
> >>>>   >       wait on specified
> >>>>   >       >> > +input fences before the operation and will signal the
> >>>>   output
> >>>>   >       fences upon the
> >>>>   >       >> > +completion of the operation. Due to serialization,
> >>>>   completion of
> >>>>   >       an operation
> >>>>   >       >> > +will also indicate that all previous operations
> >>>>are also
> >>>>   >       complete.
> >>>>   >       >>
> >>>>   >       >> I guess we should avoid saying "will immediately start
> >>>>   >       binding/unbinding" if
> >>>>   >       >> there are fences involved.
> >>>>   >       >>
> >>>>   >       >> And the fact that it's happening in an async
> >>>>worker seem to
> >>>>   imply
> >>>>   >       it's not
> >>>>   >       >> immediate.
> >>>>   >       >>
> >>>>   >
> >>>>   >       Ok, will fix.
> >>>>   >       This was added because in earlier design binding was
> deferred
> >>>>   until
> >>>>   >       next execbuff.
> >>>>   >       But now it is non-deferred (immediate in that sense).
> >>>>But yah,
> >>>>   this is
> >>>>   >       confusing
> >>>>   >       and will fix it.
> >>>>   >
> >>>>   >       >>
> >>>>   >       >> I have a question on the behavior of the bind
> >>>>operation when
> >>>>   no
> >>>>   >       input fence
> >>>>   >       >> is provided. Let say I do :
> >>>>   >       >>
> >>>>   >       >> VM_BIND (out_fence=fence1)
> >>>>   >       >>
> >>>>   >       >> VM_BIND (out_fence=fence2)
> >>>>   >       >>
> >>>>   >       >> VM_BIND (out_fence=fence3)
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> In what order are the fences going to be signaled?
> >>>>   >       >>
> >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
> >>>>   >       >>
> >>>>   >       >> Because you wrote "serialized I assume it's : in order
> >>>>   >       >>
> >>>>   >
> >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
> >>>>bind and
> >>>>   unbind
> >>>>   >       will use
> >>>>   >       the same queue and hence are ordered.
> >>>>   >
> >>>>   >       >>
> >>>>   >       >> One thing I didn't realize is that because we only get
> one
> >>>>   >       "VM_BIND" engine,
> >>>>   >       >> there is a disconnect from the Vulkan specification.
> >>>>   >       >>
> >>>>   >       >> In Vulkan VM_BIND operations are serialized but
> >>>>per engine.
> >>>>   >       >>
> >>>>   >       >> So you could have something like this :
> >>>>   >       >>
> >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>>>   >       >>
> >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> fence1 is not signaled
> >>>>   >       >>
> >>>>   >       >> fence3 is signaled
> >>>>   >       >>
> >>>>   >       >> So the second VM_BIND will proceed before the
> >>>>first VM_BIND.
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> I guess we can deal with that scenario in
> >>>>userspace by doing
> >>>>   the
> >>>>   >       wait
> >>>>   >       >> ourselves in one thread per engines.
> >>>>   >       >>
> >>>>   >       >> But then it makes the VM_BIND input fences useless.
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> Daniel : what do you think? Should be rework this or just
> >>>>   deal with
> >>>>   >       wait
> >>>>   >       >> fences in userspace?
> >>>>   >       >>
> >>>>   >       >
> >>>>   >       >My opinion is rework this but make the ordering via
> >>>>an engine
> >>>>   param
> >>>>   >       optional.
> >>>>   >       >
> >>>>   >       >e.g. A VM can be configured so all binds are ordered
> >>>>within the
> >>>>   VM
> >>>>   >       >
> >>>>   >       >e.g. A VM can be configured so all binds accept an engine
> >>>>   argument
> >>>>   >       (in
> >>>>   >       >the case of the i915 likely this is a gem context
> >>>>handle) and
> >>>>   binds
> >>>>   >       >ordered with respect to that engine.
> >>>>   >       >
> >>>>   >       >This gives UMDs options as the later likely consumes
> >>>>more KMD
> >>>>   >       resources
> >>>>   >       >so if a different UMD can live with binds being
> >>>>ordered within
> >>>>   the VM
> >>>>   >       >they can use a mode consuming less resources.
> >>>>   >       >
> >>>>   >
> >>>>   >       I think we need to be careful here if we are looking for
> some
> >>>>   out of
> >>>>   >       (submission) order completion of vm_bind/unbind.
> >>>>   >       In-order completion means, in a batch of binds and
> >>>>unbinds to be
> >>>>   >       completed in-order, user only needs to specify
> >>>>in-fence for the
> >>>>   >       first bind/unbind call and the our-fence for the last
> >>>>   bind/unbind
> >>>>   >       call. Also, the VA released by an unbind call can be
> >>>>re-used by
> >>>>   >       any subsequent bind call in that in-order batch.
> >>>>   >
> >>>>   >       These things will break if binding/unbinding were to
> >>>>be allowed
> >>>>   to
> >>>>   >       go out of order (of submission) and user need to be extra
> >>>>   careful
> >>>>   >       not to run into pre-mature triggereing of out-fence and bind
> >>>>   failing
> >>>>   >       as VA is still in use etc.
> >>>>   >
> >>>>   >       Also, VM_BIND binds the provided mapping on the specified
> >>>>   address
> >>>>   >       space
> >>>>   >       (VM). So, the uapi is not engine/context specific.
> >>>>   >
> >>>>   >       We can however add a 'queue' to the uapi which can be
> >>>>one from
> >>>>   the
> >>>>   >       pre-defined queues,
> >>>>   >       I915_VM_BIND_QUEUE_0
> >>>>   >       I915_VM_BIND_QUEUE_1
> >>>>   >       ...
> >>>>   >       I915_VM_BIND_QUEUE_(N-1)
> >>>>   >
> >>>>   >       KMD will spawn an async work queue for each queue which will
> >>>>   only
> >>>>   >       bind the mappings on that queue in the order of submission.
> >>>>   >       User can assign the queue to per engine or anything
> >>>>like that.
> >>>>   >
> >>>>   >       But again here, user need to be careful and not
> >>>>deadlock these
> >>>>   >       queues with circular dependency of fences.
> >>>>   >
> >>>>   >       I prefer adding this later an as extension based on
> >>>>whether it
> >>>>   >       is really helping with the implementation.
> >>>>   >
> >>>>   >     I can tell you right now that having everything on a single
> >>>>   in-order
> >>>>   >     queue will not get us the perf we want.  What vulkan
> >>>>really wants
> >>>>   is one
> >>>>   >     of two things:
> >>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen in
> >>>>   whatever
> >>>>   >     their dependencies are resolved and we ensure ordering
> >>>>ourselves
> >>>>   by
> >>>>   >     having a syncobj in the VkQueue.
> >>>>   >      2. The ability to create multiple VM_BIND queues.  We need at
> >>>>   least 2
> >>>>   >     but I don't see why there needs to be a limit besides
> >>>>the limits
> >>>>   the
> >>>>   >     i915 API already has on the number of engines.  Vulkan could
> >>>>   expose
> >>>>   >     multiple sparse binding queues to the client if it's not
> >>>>   arbitrarily
> >>>>   >     limited.
> >>>>
> >>>>   Thanks Jason, Lionel.
> >>>>
> >>>>   Jason, what are you referring to when you say "limits the i915 API
> >>>>   already
> >>>>   has on the number of engines"? I am not sure if there is such an
> uapi
> >>>>   today.
> >>>>
> >>>> There's a limit of something like 64 total engines today based on the
> >>>> number of bits we can cram into the exec flags in execbuffer2.  I
> think
> >>>> someone had an extended version that allowed more but I ripped it out
> >>>> because no one was using it.  Of course, execbuffer3 might not
> >>>>have that
> >>>> problem at all.
> >>>>
> >>>
> >>>Thanks Jason.
> >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
> >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
> >>>and somehow export it to user (I am thinking of embedding it in
> >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
> >>>queues.
> >>
> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which
> execbuf3
>

Yup!  That's exactly the limit I was talking about.


> >>will also have. So, we can simply define in vm_bind/unbind structures,
> >>
> >>#define I915_VM_BIND_MAX_QUEUE   64
> >>        __u32 queue;
> >>
> >>I think that will keep things simple.
> >
> >Hmmm? What does execbuf2 limit has to do with how many engines
> >hardware can have? I suggest not to do that.
> >
> >Change with added this:
> >
> >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >               return -EINVAL;
> >
> >To context creation needs to be undone and so let users create engine
> >maps with all hardware engines, and let execbuf3 access them all.
> >
>
> Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also.
> Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
> make it N+1).
> But, as discussed in other thread of this RFC series, we are planning
> to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
> any uapi that limits the number of engines (and hence the vm_bind queues
> need to be supported).
>
> If we leave the number of vm_bind queues to be arbitrarily large
> (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
> work_item and a linked list) lookup from the user specified queue index.
> Other option is to just put some hard limit (say 64 or 65) and use
> an array of queues in VM (each created upon first use). I prefer this.
>

I don't get why a VM_BIND queue is any different from any other queue or
userspace-visible kernel object.  But I'll leave those details up to danvet
or whoever else might be reviewing the implementation.

--Jason



>
> Niranjana
>
> >Regards,
> >
> >Tvrtko
> >
> >>
> >>Niranjana
> >>
> >>>
> >>>>   I am trying to see how many queues we need and don't want it to be
> >>>>   arbitrarily
> >>>>   large and unduely blow up memory usage and complexity in i915
> driver.
> >>>>
> >>>> I expect a Vulkan driver to use at most 2 in the vast majority
> >>>>of cases. I
> >>>> could imagine a client wanting to create more than 1 sparse
> >>>>queue in which
> >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
> >>>>goes, once
> >>>> you allow two, I don't think the complexity is going up by
> >>>>allowing N.  As
> >>>> for memory usage, creating more queues means more memory.  That's a
> >>>> trade-off that userspace can make.  Again, the expected number
> >>>>here is 1
> >>>> or 2 in the vast majority of cases so I don't think you need to worry.
> >>>
> >>>Ok, will start with n=3 meaning 8 queues.
> >>>That would require us create 8 workqueues.
> >>>We can change 'n' later if required.
> >>>
> >>>Niranjana
> >>>
> >>>>
> >>>>   >     Why?  Because Vulkan has two basic kind of bind
> >>>>operations and we
> >>>>   don't
> >>>>   >     want any dependencies between them:
> >>>>   >      1. Immediate.  These happen right after BO creation or
> >>>>maybe as
> >>>>   part of
> >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
> >>>>don't happen
> >>>>   on a
> >>>>   >     queue and we don't want them serialized with anything.  To
> >>>>   synchronize
> >>>>   >     with submit, we'll have a syncobj in the VkDevice which is
> >>>>   signaled by
> >>>>   >     all immediate bind operations and make submits wait on it.
> >>>>   >      2. Queued (sparse): These happen on a VkQueue which may be
> the
> >>>>   same as
> >>>>   >     a render/compute queue or may be its own queue.  It's up to us
> >>>>   what we
> >>>>   >     want to advertise.  From the Vulkan API PoV, this is like any
> >>>>   other
> >>>>   >     queue.  Operations on it wait on and signal semaphores.  If we
> >>>>   have a
> >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
> >>>>signal just like
> >>>>   we do
> >>>>   >     in execbuf().
> >>>>   >     The important thing is that we don't want one type of
> >>>>operation to
> >>>>   block
> >>>>   >     on the other.  If immediate binds are blocking on sparse
> binds,
> >>>>   it's
> >>>>   >     going to cause over-synchronization issues.
> >>>>   >     In terms of the internal implementation, I know that
> >>>>there's going
> >>>>   to be
> >>>>   >     a lock on the VM and that we can't actually do these things in
> >>>>   >     parallel.  That's fine.  Once the dma_fences have signaled and
> >>>>   we're
> >>>>
> >>>>   Thats correct. It is like a single VM_BIND engine with
> >>>>multiple queues
> >>>>   feeding to it.
> >>>>
> >>>> Right.  As long as the queues themselves are independent and
> >>>>can block on
> >>>> dma_fences without holding up other queues, I think we're fine.
> >>>>
> >>>>   >     unblocked to do the bind operation, I don't care if
> >>>>there's a bit
> >>>>   of
> >>>>   >     synchronization due to locking.  That's expected.  What
> >>>>we can't
> >>>>   afford
> >>>>   >     to have is an immediate bind operation suddenly blocking on a
> >>>>   sparse
> >>>>   >     operation which is blocked on a compute job that's going to
> run
> >>>>   for
> >>>>   >     another 5ms.
> >>>>
> >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
> >>>>   VM_BIND
> >>>>   on other VMs. I am not sure about usecases here, but just wanted to
> >>>>   clarify.
> >>>>
> >>>> Yes, that's what I would expect.
> >>>> --Jason
> >>>>
> >>>>   Niranjana
> >>>>
> >>>>   >     For reference, Windows solves this by allowing arbitrarily
> many
> >>>>   paging
> >>>>   >     queues (what they call a VM_BIND engine/queue).  That
> >>>>design works
> >>>>   >     pretty well and solves the problems in question.
> >>>>Again, we could
> >>>>   just
> >>>>   >     make everything out-of-order and require using syncobjs
> >>>>to order
> >>>>   things
> >>>>   >     as userspace wants. That'd be fine too.
> >>>>   >     One more note while I'm here: danvet said something on
> >>>>IRC about
> >>>>   VM_BIND
> >>>>   >     queues waiting for syncobjs to materialize.  We don't really
> >>>>   want/need
> >>>>   >     this.  We already have all the machinery in userspace to
> handle
> >>>>   >     wait-before-signal and waiting for syncobj fences to
> >>>>materialize
> >>>>   and
> >>>>   >     that machinery is on by default.  It would actually
> >>>>take MORE work
> >>>>   in
> >>>>   >     Mesa to turn it off and take advantage of the kernel
> >>>>being able to
> >>>>   wait
> >>>>   >     for syncobjs to materialize.  Also, getting that right is
> >>>>   ridiculously
> >>>>   >     hard and I really don't want to get it wrong in kernel
> >>>>space.     When we
> >>>>   >     do memory fences, wait-before-signal will be a thing.  We
> don't
> >>>>   need to
> >>>>   >     try and make it a thing for syncobj.
> >>>>   >     --Jason
> >>>>   >
> >>>>   >   Thanks Jason,
> >>>>   >
> >>>>   >   I missed the bit in the Vulkan spec that we're allowed to have a
> >>>>   sparse
> >>>>   >   queue that does not implement either graphics or compute
> >>>>operations
> >>>>   :
> >>>>   >
> >>>>   >     "While some implementations may include
> >>>>   VK_QUEUE_SPARSE_BINDING_BIT
> >>>>   >     support in queue families that also include
> >>>>   >
> >>>>   >      graphics and compute support, other implementations may only
> >>>>   expose a
> >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>   >
> >>>>   >      family."
> >>>>   >
> >>>>   >   So it can all be all a vm_bind engine that just does bind/unbind
> >>>>   >   operations.
> >>>>   >
> >>>>   >   But yes we need another engine for the immediate/non-sparse
> >>>>   operations.
> >>>>   >
> >>>>   >   -Lionel
> >>>>   >
> >>>>   >         >
> >>>>   >       Daniel, any thoughts?
> >>>>   >
> >>>>   >       Niranjana
> >>>>   >
> >>>>   >       >Matt
> >>>>   >       >
> >>>>   >       >>
> >>>>   >       >> Sorry I noticed this late.
> >>>>   >       >>
> >>>>   >       >>
> >>>>   >       >> -Lionel
> >>>>   >       >>
> >>>>   >       >>
>
Niranjana Vishwanathapura June 8, 2022, 10:48 p.m. UTC | #25
On Wed, Jun 08, 2022 at 04:55:38PM -0500, Jason Ekstrand wrote:
>   On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>   <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>     >
>     >
>     >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>     >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
>     wrote:
>     >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>     >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>     >>>> <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>
>     >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>     wrote:
>     >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >>>>   >
>     >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>   >
>     >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>     >>>>Brost wrote:
>     >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>     Landwerlin
>     >>>>   wrote:
>     >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>     >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     >>>>   binding/unbinding
>     >>>>   >       the mapping in an
>     >>>>   >       >> > +async worker. The binding and unbinding will
>     >>>>work like a
>     >>>>   special
>     >>>>   >       GPU engine.
>     >>>>   >       >> > +The binding and unbinding operations are serialized
>     and
>     >>>>   will
>     >>>>   >       wait on specified
>     >>>>   >       >> > +input fences before the operation and will signal
>     the
>     >>>>   output
>     >>>>   >       fences upon the
>     >>>>   >       >> > +completion of the operation. Due to serialization,
>     >>>>   completion of
>     >>>>   >       an operation
>     >>>>   >       >> > +will also indicate that all previous operations
>     >>>>are also
>     >>>>   >       complete.
>     >>>>   >       >>
>     >>>>   >       >> I guess we should avoid saying "will immediately start
>     >>>>   >       binding/unbinding" if
>     >>>>   >       >> there are fences involved.
>     >>>>   >       >>
>     >>>>   >       >> And the fact that it's happening in an async
>     >>>>worker seem to
>     >>>>   imply
>     >>>>   >       it's not
>     >>>>   >       >> immediate.
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Ok, will fix.
>     >>>>   >       This was added because in earlier design binding was
>     deferred
>     >>>>   until
>     >>>>   >       next execbuff.
>     >>>>   >       But now it is non-deferred (immediate in that sense).
>     >>>>But yah,
>     >>>>   this is
>     >>>>   >       confusing
>     >>>>   >       and will fix it.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> I have a question on the behavior of the bind
>     >>>>operation when
>     >>>>   no
>     >>>>   >       input fence
>     >>>>   >       >> is provided. Let say I do :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence1)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence3)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> In what order are the fences going to be signaled?
>     >>>>   >       >>
>     >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>     >>>>   >       >>
>     >>>>   >       >> Because you wrote "serialized I assume it's : in order
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>     >>>>bind and
>     >>>>   unbind
>     >>>>   >       will use
>     >>>>   >       the same queue and hence are ordered.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> One thing I didn't realize is that because we only get
>     one
>     >>>>   >       "VM_BIND" engine,
>     >>>>   >       >> there is a disconnect from the Vulkan specification.
>     >>>>   >       >>
>     >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>     >>>>per engine.
>     >>>>   >       >>
>     >>>>   >       >> So you could have something like this :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>     out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>     out_fence=fence4)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> fence1 is not signaled
>     >>>>   >       >>
>     >>>>   >       >> fence3 is signaled
>     >>>>   >       >>
>     >>>>   >       >> So the second VM_BIND will proceed before the
>     >>>>first VM_BIND.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> I guess we can deal with that scenario in
>     >>>>userspace by doing
>     >>>>   the
>     >>>>   >       wait
>     >>>>   >       >> ourselves in one thread per engines.
>     >>>>   >       >>
>     >>>>   >       >> But then it makes the VM_BIND input fences useless.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> Daniel : what do you think? Should be rework this or
>     just
>     >>>>   deal with
>     >>>>   >       wait
>     >>>>   >       >> fences in userspace?
>     >>>>   >       >>
>     >>>>   >       >
>     >>>>   >       >My opinion is rework this but make the ordering via
>     >>>>an engine
>     >>>>   param
>     >>>>   >       optional.
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds are ordered
>     >>>>within the
>     >>>>   VM
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds accept an
>     engine
>     >>>>   argument
>     >>>>   >       (in
>     >>>>   >       >the case of the i915 likely this is a gem context
>     >>>>handle) and
>     >>>>   binds
>     >>>>   >       >ordered with respect to that engine.
>     >>>>   >       >
>     >>>>   >       >This gives UMDs options as the later likely consumes
>     >>>>more KMD
>     >>>>   >       resources
>     >>>>   >       >so if a different UMD can live with binds being
>     >>>>ordered within
>     >>>>   the VM
>     >>>>   >       >they can use a mode consuming less resources.
>     >>>>   >       >
>     >>>>   >
>     >>>>   >       I think we need to be careful here if we are looking for
>     some
>     >>>>   out of
>     >>>>   >       (submission) order completion of vm_bind/unbind.
>     >>>>   >       In-order completion means, in a batch of binds and
>     >>>>unbinds to be
>     >>>>   >       completed in-order, user only needs to specify
>     >>>>in-fence for the
>     >>>>   >       first bind/unbind call and the our-fence for the last
>     >>>>   bind/unbind
>     >>>>   >       call. Also, the VA released by an unbind call can be
>     >>>>re-used by
>     >>>>   >       any subsequent bind call in that in-order batch.
>     >>>>   >
>     >>>>   >       These things will break if binding/unbinding were to
>     >>>>be allowed
>     >>>>   to
>     >>>>   >       go out of order (of submission) and user need to be extra
>     >>>>   careful
>     >>>>   >       not to run into pre-mature triggereing of out-fence and
>     bind
>     >>>>   failing
>     >>>>   >       as VA is still in use etc.
>     >>>>   >
>     >>>>   >       Also, VM_BIND binds the provided mapping on the specified
>     >>>>   address
>     >>>>   >       space
>     >>>>   >       (VM). So, the uapi is not engine/context specific.
>     >>>>   >
>     >>>>   >       We can however add a 'queue' to the uapi which can be
>     >>>>one from
>     >>>>   the
>     >>>>   >       pre-defined queues,
>     >>>>   >       I915_VM_BIND_QUEUE_0
>     >>>>   >       I915_VM_BIND_QUEUE_1
>     >>>>   >       ...
>     >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>     >>>>   >
>     >>>>   >       KMD will spawn an async work queue for each queue which
>     will
>     >>>>   only
>     >>>>   >       bind the mappings on that queue in the order of
>     submission.
>     >>>>   >       User can assign the queue to per engine or anything
>     >>>>like that.
>     >>>>   >
>     >>>>   >       But again here, user need to be careful and not
>     >>>>deadlock these
>     >>>>   >       queues with circular dependency of fences.
>     >>>>   >
>     >>>>   >       I prefer adding this later an as extension based on
>     >>>>whether it
>     >>>>   >       is really helping with the implementation.
>     >>>>   >
>     >>>>   >     I can tell you right now that having everything on a single
>     >>>>   in-order
>     >>>>   >     queue will not get us the perf we want.  What vulkan
>     >>>>really wants
>     >>>>   is one
>     >>>>   >     of two things:
>     >>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen
>     in
>     >>>>   whatever
>     >>>>   >     their dependencies are resolved and we ensure ordering
>     >>>>ourselves
>     >>>>   by
>     >>>>   >     having a syncobj in the VkQueue.
>     >>>>   >      2. The ability to create multiple VM_BIND queues.  We need
>     at
>     >>>>   least 2
>     >>>>   >     but I don't see why there needs to be a limit besides
>     >>>>the limits
>     >>>>   the
>     >>>>   >     i915 API already has on the number of engines.  Vulkan
>     could
>     >>>>   expose
>     >>>>   >     multiple sparse binding queues to the client if it's not
>     >>>>   arbitrarily
>     >>>>   >     limited.
>     >>>>
>     >>>>   Thanks Jason, Lionel.
>     >>>>
>     >>>>   Jason, what are you referring to when you say "limits the i915
>     API
>     >>>>   already
>     >>>>   has on the number of engines"? I am not sure if there is such an
>     uapi
>     >>>>   today.
>     >>>>
>     >>>> There's a limit of something like 64 total engines today based on
>     the
>     >>>> number of bits we can cram into the exec flags in execbuffer2.  I
>     think
>     >>>> someone had an extended version that allowed more but I ripped it
>     out
>     >>>> because no one was using it.  Of course, execbuffer3 might not
>     >>>>have that
>     >>>> problem at all.
>     >>>>
>     >>>
>     >>>Thanks Jason.
>     >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>     probably
>     >>>will not have this limiation. So, we need to define a
>     VM_BIND_MAX_QUEUE
>     >>>and somehow export it to user (I am thinking of embedding it in
>     >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning
>     2^n
>     >>>queues.
>     >>
>     >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which
>     execbuf3
>
>   Yup!  That's exactly the limit I was talking about.
>    
>
>     >>will also have. So, we can simply define in vm_bind/unbind structures,
>     >>
>     >>#define I915_VM_BIND_MAX_QUEUE   64
>     >>        __u32 queue;
>     >>
>     >>I think that will keep things simple.
>     >
>     >Hmmm? What does execbuf2 limit has to do with how many engines
>     >hardware can have? I suggest not to do that.
>     >
>     >Change with added this:
>     >
>     >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>     >               return -EINVAL;
>     >
>     >To context creation needs to be undone and so let users create engine
>     >maps with all hardware engines, and let execbuf3 access them all.
>     >
>
>     Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also.
>     Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
>     make it N+1).
>     But, as discussed in other thread of this RFC series, we are planning
>     to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>     any uapi that limits the number of engines (and hence the vm_bind queues
>     need to be supported).
>
>     If we leave the number of vm_bind queues to be arbitrarily large
>     (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>     work_item and a linked list) lookup from the user specified queue index.
>     Other option is to just put some hard limit (say 64 or 65) and use
>     an array of queues in VM (each created upon first use). I prefer this.
>
>   I don't get why a VM_BIND queue is any different from any other queue or
>   userspace-visible kernel object.  But I'll leave those details up to
>   danvet or whoever else might be reviewing the implementation.

In execbuf3, if the user-specified execbuf3.engine_id is beyond the number of
available engines on the gem context, an error is returned to the user.
In the VM_BIND case, I am not sure how to do that bounds check on the
user-specified queue_idx.

In any case, it is an implementation detail and we can use a hashmap for
the VM_BIND queues here (there might be a slight ioctl latency added due to
the hash lookup, but in the normal case it should be insignificant), which
should be Ok.
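
Just to make that concrete, below is a minimal kernel-side sketch of such a
lookup, assuming a per-VM xarray keyed by the user-supplied queue index, with
queues created lazily on first use. All structure and function names here are
hypothetical and only for illustration, not the actual implementation:

  /* Hypothetical per-VM bind queue, looked up by the user's queue index. */
  #include <linux/err.h>
  #include <linux/list.h>
  #include <linux/mutex.h>
  #include <linux/slab.h>
  #include <linux/workqueue.h>
  #include <linux/xarray.h>

  struct vm_bind_queue {                  /* hypothetical */
          struct workqueue_struct *wq;    /* ordered bind/unbind worker */
          struct list_head pending;       /* submitted work items */
  };

  /* Hypothetical per-VM container, initialized at VM creation. */
  struct vm_bind_queues {
          struct xarray xa;               /* queue_idx -> vm_bind_queue */
          struct mutex lock;              /* serializes queue creation */
  };

  /* Look up the queue for a user index, creating it on first use. */
  static struct vm_bind_queue *
  vm_bind_queue_get(struct vm_bind_queues *q, u32 queue_idx)
  {
          struct vm_bind_queue *vq = xa_load(&q->xa, queue_idx);

          if (vq)
                  return vq;              /* common case: already exists */

          vq = kzalloc(sizeof(*vq), GFP_KERNEL);
          if (!vq)
                  return ERR_PTR(-ENOMEM);

          vq->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
          if (!vq->wq) {
                  kfree(vq);
                  return ERR_PTR(-ENOMEM);
          }
          INIT_LIST_HEAD(&vq->pending);

          mutex_lock(&q->lock);
          if (xa_load(&q->xa, queue_idx)) {
                  /* Lost a creation race; use the winner's queue. */
                  destroy_workqueue(vq->wq);
                  kfree(vq);
                  vq = xa_load(&q->xa, queue_idx);
          } else {
                  xa_store(&q->xa, queue_idx, vq, GFP_KERNEL);
          }
          mutex_unlock(&q->lock);

          return vq;
  }

In the common (already created) case the cost is a single xa_load(), which
should indeed be insignificant compared to the rest of the ioctl.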

Niranjana

>   --Jason
>    
>
>     Niranjana
>
>     >Regards,
>     >
>     >Tvrtko
>     >
>     >>
>     >>Niranjana
>     >>
>     >>>
>     >>>>   I am trying to see how many queues we need and don't want it to
>     be
>     >>>>   arbitrarily
>     >>>>   large and unduely blow up memory usage and complexity in i915
>     driver.
>     >>>>
>     >>>> I expect a Vulkan driver to use at most 2 in the vast majority
>     >>>>of cases. I
>     >>>> could imagine a client wanting to create more than 1 sparse
>     >>>>queue in which
>     >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
>     >>>>goes, once
>     >>>> you allow two, I don't think the complexity is going up by
>     >>>>allowing N.  As
>     >>>> for memory usage, creating more queues means more memory.  That's a
>     >>>> trade-off that userspace can make.  Again, the expected number
>     >>>>here is 1
>     >>>> or 2 in the vast majority of cases so I don't think you need to
>     worry.
>     >>>
>     >>>Ok, will start with n=3 meaning 8 queues.
>     >>>That would require us create 8 workqueues.
>     >>>We can change 'n' later if required.
>     >>>
>     >>>Niranjana
>     >>>
>     >>>>
>     >>>>   >     Why?  Because Vulkan has two basic kind of bind
>     >>>>operations and we
>     >>>>   don't
>     >>>>   >     want any dependencies between them:
>     >>>>   >      1. Immediate.  These happen right after BO creation or
>     >>>>maybe as
>     >>>>   part of
>     >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>     >>>>don't happen
>     >>>>   on a
>     >>>>   >     queue and we don't want them serialized with anything.  To
>     >>>>   synchronize
>     >>>>   >     with submit, we'll have a syncobj in the VkDevice which is
>     >>>>   signaled by
>     >>>>   >     all immediate bind operations and make submits wait on it.
>     >>>>   >      2. Queued (sparse): These happen on a VkQueue which may be
>     the
>     >>>>   same as
>     >>>>   >     a render/compute queue or may be its own queue.  It's up to
>     us
>     >>>>   what we
>     >>>>   >     want to advertise.  From the Vulkan API PoV, this is like
>     any
>     >>>>   other
>     >>>>   >     queue.  Operations on it wait on and signal semaphores.  If
>     we
>     >>>>   have a
>     >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>     >>>>signal just like
>     >>>>   we do
>     >>>>   >     in execbuf().
>     >>>>   >     The important thing is that we don't want one type of
>     >>>>operation to
>     >>>>   block
>     >>>>   >     on the other.  If immediate binds are blocking on sparse
>     binds,
>     >>>>   it's
>     >>>>   >     going to cause over-synchronization issues.
>     >>>>   >     In terms of the internal implementation, I know that
>     >>>>there's going
>     >>>>   to be
>     >>>>   >     a lock on the VM and that we can't actually do these things
>     in
>     >>>>   >     parallel.  That's fine.  Once the dma_fences have signaled
>     and
>     >>>>   we're
>     >>>>
>     >>>>   Thats correct. It is like a single VM_BIND engine with
>     >>>>multiple queues
>     >>>>   feeding to it.
>     >>>>
>     >>>> Right.  As long as the queues themselves are independent and
>     >>>>can block on
>     >>>> dma_fences without holding up other queues, I think we're fine.
>     >>>>
>     >>>>   >     unblocked to do the bind operation, I don't care if
>     >>>>there's a bit
>     >>>>   of
>     >>>>   >     synchronization due to locking.  That's expected.  What
>     >>>>we can't
>     >>>>   afford
>     >>>>   >     to have is an immediate bind operation suddenly blocking on
>     a
>     >>>>   sparse
>     >>>>   >     operation which is blocked on a compute job that's going to
>     run
>     >>>>   for
>     >>>>   >     another 5ms.
>     >>>>
>     >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block
>     the
>     >>>>   VM_BIND
>     >>>>   on other VMs. I am not sure about usecases here, but just wanted
>     to
>     >>>>   clarify.
>     >>>>
>     >>>> Yes, that's what I would expect.
>     >>>> --Jason
>     >>>>
>     >>>>   Niranjana
>     >>>>
>     >>>>   >     For reference, Windows solves this by allowing arbitrarily
>     many
>     >>>>   paging
>     >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>     >>>>design works
>     >>>>   >     pretty well and solves the problems in question. 
>     >>>>Again, we could
>     >>>>   just
>     >>>>   >     make everything out-of-order and require using syncobjs
>     >>>>to order
>     >>>>   things
>     >>>>   >     as userspace wants. That'd be fine too.
>     >>>>   >     One more note while I'm here: danvet said something on
>     >>>>IRC about
>     >>>>   VM_BIND
>     >>>>   >     queues waiting for syncobjs to materialize.  We don't
>     really
>     >>>>   want/need
>     >>>>   >     this.  We already have all the machinery in userspace to
>     handle
>     >>>>   >     wait-before-signal and waiting for syncobj fences to
>     >>>>materialize
>     >>>>   and
>     >>>>   >     that machinery is on by default.  It would actually
>     >>>>take MORE work
>     >>>>   in
>     >>>>   >     Mesa to turn it off and take advantage of the kernel
>     >>>>being able to
>     >>>>   wait
>     >>>>   >     for syncobjs to materialize.  Also, getting that right is
>     >>>>   ridiculously
>     >>>>   >     hard and I really don't want to get it wrong in kernel
>     >>>>space.     When we
>     >>>>   >     do memory fences, wait-before-signal will be a thing.  We
>     don't
>     >>>>   need to
>     >>>>   >     try and make it a thing for syncobj.
>     >>>>   >     --Jason
>     >>>>   >
>     >>>>   >   Thanks Jason,
>     >>>>   >
>     >>>>   >   I missed the bit in the Vulkan spec that we're allowed to
>     have a
>     >>>>   sparse
>     >>>>   >   queue that does not implement either graphics or compute
>     >>>>operations
>     >>>>   :
>     >>>>   >
>     >>>>   >     "While some implementations may include
>     >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>     >>>>   >     support in queue families that also include
>     >>>>   >
>     >>>>   >      graphics and compute support, other implementations may
>     only
>     >>>>   expose a
>     >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >>>>   >
>     >>>>   >      family."
>     >>>>   >
>     >>>>   >   So it can all be all a vm_bind engine that just does
>     bind/unbind
>     >>>>   >   operations.
>     >>>>   >
>     >>>>   >   But yes we need another engine for the immediate/non-sparse
>     >>>>   operations.
>     >>>>   >
>     >>>>   >   -Lionel
>     >>>>   >
>     >>>>   >         >
>     >>>>   >       Daniel, any thoughts?
>     >>>>   >
>     >>>>   >       Niranjana
>     >>>>   >
>     >>>>   >       >Matt
>     >>>>   >       >
>     >>>>   >       >>
>     >>>>   >       >> Sorry I noticed this late.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> -Lionel
>     >>>>   >       >>
>     >>>>   >       >>
Lionel Landwerlin June 9, 2022, 2:49 p.m. UTC | #26
On 09/06/2022 00:55, Jason Ekstrand wrote:
> On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura 
> <niranjana.vishwanathapura@intel.com> wrote:
>
>     On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>     >
>     >
>     >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>     >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>     Vishwanathapura wrote:
>     >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>     >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>     >>>> <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>
>     >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>     wrote:
>     >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>     >>>>   >
>     >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>     >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>     >>>>   >
>     >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>     >>>>Brost wrote:
>     >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>     Landwerlin
>     >>>>   wrote:
>     >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>     wrote:
>     >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>     >>>>   binding/unbinding
>     >>>>   >       the mapping in an
>     >>>>   >       >> > +async worker. The binding and unbinding will
>     >>>>work like a
>     >>>>   special
>     >>>>   >       GPU engine.
>     >>>>   >       >> > +The binding and unbinding operations are
>     serialized and
>     >>>>   will
>     >>>>   >       wait on specified
>     >>>>   >       >> > +input fences before the operation and will
>     signal the
>     >>>>   output
>     >>>>   >       fences upon the
>     >>>>   >       >> > +completion of the operation. Due to
>     serialization,
>     >>>>   completion of
>     >>>>   >       an operation
>     >>>>   >       >> > +will also indicate that all previous operations
>     >>>>are also
>     >>>>   >       complete.
>     >>>>   >       >>
>     >>>>   >       >> I guess we should avoid saying "will immediately
>     start
>     >>>>   >       binding/unbinding" if
>     >>>>   >       >> there are fences involved.
>     >>>>   >       >>
>     >>>>   >       >> And the fact that it's happening in an async
>     >>>>worker seem to
>     >>>>   imply
>     >>>>   >       it's not
>     >>>>   >       >> immediate.
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Ok, will fix.
>     >>>>   >       This was added because in earlier design binding
>     was deferred
>     >>>>   until
>     >>>>   >       next execbuff.
>     >>>>   >       But now it is non-deferred (immediate in that sense).
>     >>>>But yah,
>     >>>>   this is
>     >>>>   >       confusing
>     >>>>   >       and will fix it.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> I have a question on the behavior of the bind
>     >>>>operation when
>     >>>>   no
>     >>>>   >       input fence
>     >>>>   >       >> is provided. Let say I do :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence1)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (out_fence=fence3)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> In what order are the fences going to be signaled?
>     >>>>   >       >>
>     >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>     >>>>   >       >>
>     >>>>   >       >> Because you wrote "serialized I assume it's : in
>     order
>     >>>>   >       >>
>     >>>>   >
>     >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>     >>>>bind and
>     >>>>   unbind
>     >>>>   >       will use
>     >>>>   >       the same queue and hence are ordered.
>     >>>>   >
>     >>>>   >       >>
>     >>>>   >       >> One thing I didn't realize is that because we
>     only get one
>     >>>>   >       "VM_BIND" engine,
>     >>>>   >       >> there is a disconnect from the Vulkan specification.
>     >>>>   >       >>
>     >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>     >>>>per engine.
>     >>>>   >       >>
>     >>>>   >       >> So you could have something like this :
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>     out_fence=fence2)
>     >>>>   >       >>
>     >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>     out_fence=fence4)
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> fence1 is not signaled
>     >>>>   >       >>
>     >>>>   >       >> fence3 is signaled
>     >>>>   >       >>
>     >>>>   >       >> So the second VM_BIND will proceed before the
>     >>>>first VM_BIND.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> I guess we can deal with that scenario in
>     >>>>userspace by doing
>     >>>>   the
>     >>>>   >       wait
>     >>>>   >       >> ourselves in one thread per engines.
>     >>>>   >       >>
>     >>>>   >       >> But then it makes the VM_BIND input fences useless.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> Daniel : what do you think? Should be rework
>     this or just
>     >>>>   deal with
>     >>>>   >       wait
>     >>>>   >       >> fences in userspace?
>     >>>>   >       >>
>     >>>>   >       >
>     >>>>   >       >My opinion is rework this but make the ordering via
>     >>>>an engine
>     >>>>   param
>     >>>>   >       optional.
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds are ordered
>     >>>>within the
>     >>>>   VM
>     >>>>   >       >
>     >>>>   >       >e.g. A VM can be configured so all binds accept an
>     engine
>     >>>>   argument
>     >>>>   >       (in
>     >>>>   >       >the case of the i915 likely this is a gem context
>     >>>>handle) and
>     >>>>   binds
>     >>>>   >       >ordered with respect to that engine.
>     >>>>   >       >
>     >>>>   >       >This gives UMDs options as the later likely consumes
>     >>>>more KMD
>     >>>>   >       resources
>     >>>>   >       >so if a different UMD can live with binds being
>     >>>>ordered within
>     >>>>   the VM
>     >>>>   >       >they can use a mode consuming less resources.
>     >>>>   >       >
>     >>>>   >
>     >>>>   >       I think we need to be careful here if we are
>     looking for some
>     >>>>   out of
>     >>>>   >       (submission) order completion of vm_bind/unbind.
>     >>>>   >       In-order completion means, in a batch of binds and
>     >>>>unbinds to be
>     >>>>   >       completed in-order, user only needs to specify
>     >>>>in-fence for the
>     >>>>   >       first bind/unbind call and the our-fence for the last
>     >>>>   bind/unbind
>     >>>>   >       call. Also, the VA released by an unbind call can be
>     >>>>re-used by
>     >>>>   >       any subsequent bind call in that in-order batch.
>     >>>>   >
>     >>>>   >       These things will break if binding/unbinding were to
>     >>>>be allowed
>     >>>>   to
>     >>>>   >       go out of order (of submission) and user need to be
>     extra
>     >>>>   careful
>     >>>>   >       not to run into pre-mature triggereing of out-fence
>     and bind
>     >>>>   failing
>     >>>>   >       as VA is still in use etc.
>     >>>>   >
>     >>>>   >       Also, VM_BIND binds the provided mapping on the
>     specified
>     >>>>   address
>     >>>>   >       space
>     >>>>   >       (VM). So, the uapi is not engine/context specific.
>     >>>>   >
>     >>>>   >       We can however add a 'queue' to the uapi which can be
>     >>>>one from
>     >>>>   the
>     >>>>   >       pre-defined queues,
>     >>>>   >       I915_VM_BIND_QUEUE_0
>     >>>>   >       I915_VM_BIND_QUEUE_1
>     >>>>   >       ...
>     >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>     >>>>   >
>     >>>>   >       KMD will spawn an async work queue for each queue
>     which will
>     >>>>   only
>     >>>>   >       bind the mappings on that queue in the order of
>     submission.
>     >>>>   >       User can assign the queue to per engine or anything
>     >>>>like that.
>     >>>>   >
>     >>>>   >       But again here, user need to be careful and not
>     >>>>deadlock these
>     >>>>   >       queues with circular dependency of fences.
>     >>>>   >
>     >>>>   >       I prefer adding this later an as extension based on
>     >>>>whether it
>     >>>>   >       is really helping with the implementation.
>     >>>>   >
>     >>>>   >     I can tell you right now that having everything on a
>     single
>     >>>>   in-order
>     >>>>   >     queue will not get us the perf we want.  What vulkan
>     >>>>really wants
>     >>>>   is one
>     >>>>   >     of two things:
>     >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>     happen in
>     >>>>   whatever
>     >>>>   >     their dependencies are resolved and we ensure ordering
>     >>>>ourselves
>     >>>>   by
>     >>>>   >     having a syncobj in the VkQueue.
>     >>>>   >      2. The ability to create multiple VM_BIND queues. 
>     We need at
>     >>>>   least 2
>     >>>>   >     but I don't see why there needs to be a limit besides
>     >>>>the limits
>     >>>>   the
>     >>>>   >     i915 API already has on the number of engines. 
>     Vulkan could
>     >>>>   expose
>     >>>>   >     multiple sparse binding queues to the client if it's not
>     >>>>   arbitrarily
>     >>>>   >     limited.
>     >>>>
>     >>>>   Thanks Jason, Lionel.
>     >>>>
>     >>>>   Jason, what are you referring to when you say "limits the
>     i915 API
>     >>>>   already
>     >>>>   has on the number of engines"? I am not sure if there is
>     such an uapi
>     >>>>   today.
>     >>>>
>     >>>> There's a limit of something like 64 total engines today
>     based on the
>     >>>> number of bits we can cram into the exec flags in
>     execbuffer2.  I think
>     >>>> someone had an extended version that allowed more but I
>     ripped it out
>     >>>> because no one was using it.  Of course, execbuffer3 might not
>     >>>>have that
>     >>>> problem at all.
>     >>>>
>     >>>
>     >>>Thanks Jason.
>     >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>     probably
>     >>>will not have this limiation. So, we need to define a
>     VM_BIND_MAX_QUEUE
>     >>>and somehow export it to user (I am thinking of embedding it in
>     >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>     meaning 2^n
>     >>>queues.
>     >>
>     >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f)
>     which execbuf3
>
>
> Yup!  That's exactly the limit I was talking about.
>
>     >>will also have. So, we can simply define in vm_bind/unbind
>     structures,
>     >>
>     >>#define I915_VM_BIND_MAX_QUEUE   64
>     >>        __u32 queue;
>     >>
>     >>I think that will keep things simple.
>     >
>     >Hmmm? What does execbuf2 limit has to do with how many engines
>     >hardware can have? I suggest not to do that.
>     >
>     >Change with added this:
>     >
>     >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>     >               return -EINVAL;
>     >
>     >To context creation needs to be undone and so let users create
>     engine
>     >maps with all hardware engines, and let execbuf3 access them all.
>     >
>
>     Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also.
>     Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
>     make it N+1).
>     But, as discussed in other thread of this RFC series, we are planning
>     to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>     any uapi that limits the number of engines (and hence the vm_bind
>     queues
>     need to be supported).
>
>     If we leave the number of vm_bind queues to be arbitrarily large
>     (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>     work_item and a linked list) lookup from the user specified queue
>     index.
>     Other option is to just put some hard limit (say 64 or 65) and use
>     an array of queues in VM (each created upon first use). I prefer this.
>
>
> I don't get why a VM_BIND queue is any different from any other queue 
> or userspace-visible kernel object.  But I'll leave those details up 
> to danvet or whoever else might be reviewing the implementation.
>
> --Jason


I kind of agree here. Wouldn't it be simpler to have the bind queue created
like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when
submitting.

If there is ever a possibility to have this work on the GPU, it would be
all ready.
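
As a rough userspace-facing sketch of that idea: if a VM_BIND engine class
existed, a UMD could include the bind queues in the engine map it already
builds and then refer to them by slot. Note that I915_ENGINE_CLASS_VM_BIND
below is purely hypothetical (not existing uapi); the rest uses the current
engine-map interface:

  #include <stdint.h>
  #include <drm/i915_drm.h>
  #include <xf86drm.h>

  /* Hypothetical engine class for bind queues (not existing uapi). */
  #define I915_ENGINE_CLASS_VM_BIND 5

  static int set_engine_map_with_bind_queues(int fd, uint32_t ctx_id)
  {
          /* Slot 0: render, slot 1: immediate binds, slot 2: sparse binds. */
          I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 3) = {
                  .engines = {
                          { I915_ENGINE_CLASS_RENDER,  0 },
                          { I915_ENGINE_CLASS_VM_BIND, 0 },
                          { I915_ENGINE_CLASS_VM_BIND, 1 },
                  },
          };
          struct drm_i915_gem_context_param p = {
                  .ctx_id = ctx_id,
                  .param  = I915_CONTEXT_PARAM_ENGINES,
                  .size   = sizeof(engines),
                  .value  = (uint64_t)(uintptr_t)&engines,
          };

          /* Bind/submit ioctls would then select slot 1 or 2 by queue id. */
          return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
  }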


Thanks,


-Lionel


>
>
>     Niranjana
>
>     >Regards,
>     >
>     >Tvrtko
>     >
>     >>
>     >>Niranjana
>     >>
>     >>>
>     >>>>   I am trying to see how many queues we need and don't want
>     it to be
>     >>>>   arbitrarily
>     >>>>   large and unduely blow up memory usage and complexity in
>     i915 driver.
>     >>>>
>     >>>> I expect a Vulkan driver to use at most 2 in the vast majority
>     >>>>of cases. I
>     >>>> could imagine a client wanting to create more than 1 sparse
>     >>>>queue in which
>     >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
>     >>>>goes, once
>     >>>> you allow two, I don't think the complexity is going up by
>     >>>>allowing N.  As
>     >>>> for memory usage, creating more queues means more memory. 
>     That's a
>     >>>> trade-off that userspace can make.  Again, the expected number
>     >>>>here is 1
>     >>>> or 2 in the vast majority of cases so I don't think you need
>     to worry.
>     >>>
>     >>>Ok, will start with n=3 meaning 8 queues.
>     >>>That would require us create 8 workqueues.
>     >>>We can change 'n' later if required.
>     >>>
>     >>>Niranjana
>     >>>
>     >>>>
>     >>>>   >     Why?  Because Vulkan has two basic kind of bind
>     >>>>operations and we
>     >>>>   don't
>     >>>>   >     want any dependencies between them:
>     >>>>   >      1. Immediate.  These happen right after BO creation or
>     >>>>maybe as
>     >>>>   part of
>     >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>     >>>>don't happen
>     >>>>   on a
>     >>>>   >     queue and we don't want them serialized with
>     anything.  To
>     >>>>   synchronize
>     >>>>   >     with submit, we'll have a syncobj in the VkDevice
>     which is
>     >>>>   signaled by
>     >>>>   >     all immediate bind operations and make submits wait
>     on it.
>     >>>>   >      2. Queued (sparse): These happen on a VkQueue which
>     may be the
>     >>>>   same as
>     >>>>   >     a render/compute queue or may be its own queue.  It's
>     up to us
>     >>>>   what we
>     >>>>   >     want to advertise.  From the Vulkan API PoV, this is
>     like any
>     >>>>   other
>     >>>>   >     queue.  Operations on it wait on and signal
>     semaphores.  If we
>     >>>>   have a
>     >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>     >>>>signal just like
>     >>>>   we do
>     >>>>   >     in execbuf().
>     >>>>   >     The important thing is that we don't want one type of
>     >>>>operation to
>     >>>>   block
>     >>>>   >     on the other.  If immediate binds are blocking on
>     sparse binds,
>     >>>>   it's
>     >>>>   >     going to cause over-synchronization issues.
>     >>>>   >     In terms of the internal implementation, I know that
>     >>>>there's going
>     >>>>   to be
>     >>>>   >     a lock on the VM and that we can't actually do these
>     things in
>     >>>>   >     parallel.  That's fine.  Once the dma_fences have
>     signaled and
>     >>>>   we're
>     >>>>
>     >>>>   Thats correct. It is like a single VM_BIND engine with
>     >>>>multiple queues
>     >>>>   feeding to it.
>     >>>>
>     >>>> Right.  As long as the queues themselves are independent and
>     >>>>can block on
>     >>>> dma_fences without holding up other queues, I think we're fine.
>     >>>>
>     >>>>   >     unblocked to do the bind operation, I don't care if
>     >>>>there's a bit
>     >>>>   of
>     >>>>   >     synchronization due to locking. That's expected.  What
>     >>>>we can't
>     >>>>   afford
>     >>>>   >     to have is an immediate bind operation suddenly
>     blocking on a
>     >>>>   sparse
>     >>>>   >     operation which is blocked on a compute job that's
>     going to run
>     >>>>   for
>     >>>>   >     another 5ms.
>     >>>>
>     >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't
>     block the
>     >>>>   VM_BIND
>     >>>>   on other VMs. I am not sure about usecases here, but just
>     wanted to
>     >>>>   clarify.
>     >>>>
>     >>>> Yes, that's what I would expect.
>     >>>> --Jason
>     >>>>
>     >>>>   Niranjana
>     >>>>
>     >>>>   >     For reference, Windows solves this by allowing
>     arbitrarily many
>     >>>>   paging
>     >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>     >>>>design works
>     >>>>   >     pretty well and solves the problems in question.
>     >>>>Again, we could
>     >>>>   just
>     >>>>   >     make everything out-of-order and require using syncobjs
>     >>>>to order
>     >>>>   things
>     >>>>   >     as userspace wants. That'd be fine too.
>     >>>>   >     One more note while I'm here: danvet said something on
>     >>>>IRC about
>     >>>>   VM_BIND
>     >>>>   >     queues waiting for syncobjs to materialize.  We don't
>     really
>     >>>>   want/need
>     >>>>   >     this.  We already have all the machinery in userspace
>     to handle
>     >>>>   >     wait-before-signal and waiting for syncobj fences to
>     >>>>materialize
>     >>>>   and
>     >>>>   >     that machinery is on by default.  It would actually
>     >>>>take MORE work
>     >>>>   in
>     >>>>   >     Mesa to turn it off and take advantage of the kernel
>     >>>>being able to
>     >>>>   wait
>     >>>>   >     for syncobjs to materialize. Also, getting that right is
>     >>>>   ridiculously
>     >>>>   >     hard and I really don't want to get it wrong in kernel
>     >>>>space.     When we
>     >>>>   >     do memory fences, wait-before-signal will be a
>     thing.  We don't
>     >>>>   need to
>     >>>>   >     try and make it a thing for syncobj.
>     >>>>   >     --Jason
>     >>>>   >
>     >>>>   >   Thanks Jason,
>     >>>>   >
>     >>>>   >   I missed the bit in the Vulkan spec that we're allowed
>     to have a
>     >>>>   sparse
>     >>>>   >   queue that does not implement either graphics or compute
>     >>>>operations
>     >>>>   :
>     >>>>   >
>     >>>>   >     "While some implementations may include
>     >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>     >>>>   >     support in queue families that also include
>     >>>>   >
>     >>>>   >      graphics and compute support, other implementations
>     may only
>     >>>>   expose a
>     >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>     >>>>   >
>     >>>>   >      family."
>     >>>>   >
>     >>>>   >   So it can all be all a vm_bind engine that just does
>     bind/unbind
>     >>>>   >   operations.
>     >>>>   >
>     >>>>   >   But yes we need another engine for the immediate/non-sparse
>     >>>>   operations.
>     >>>>   >
>     >>>>   >   -Lionel
>     >>>>   >
>     >>>>   >         >
>     >>>>   >       Daniel, any thoughts?
>     >>>>   >
>     >>>>   >       Niranjana
>     >>>>   >
>     >>>>   >       >Matt
>     >>>>   >       >
>     >>>>   >       >>
>     >>>>   >       >> Sorry I noticed this late.
>     >>>>   >       >>
>     >>>>   >       >>
>     >>>>   >       >> -Lionel
>     >>>>   >       >>
>     >>>>   >       >>
>
Niranjana Vishwanathapura June 9, 2022, 7:31 p.m. UTC | #27
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>
>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>     <niranjana.vishwanathapura@intel.com> wrote:
>
>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>       >
>       >
>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
>       wrote:
>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>       >>>>
>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>       wrote:
>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>       >>>>   >
>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>       >>>>   >     <niranjana.vishwanathapura@intel.com> wrote:
>       >>>>   >
>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>       >>>>Brost wrote:
>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>       Landwerlin
>       >>>>   wrote:
>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>       wrote:
>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>       >>>>   binding/unbinding
>       >>>>   >       the mapping in an
>       >>>>   >       >> > +async worker. The binding and unbinding will
>       >>>>work like a
>       >>>>   special
>       >>>>   >       GPU engine.
>       >>>>   >       >> > +The binding and unbinding operations are
>       serialized and
>       >>>>   will
>       >>>>   >       wait on specified
>       >>>>   >       >> > +input fences before the operation and will signal
>       the
>       >>>>   output
>       >>>>   >       fences upon the
>       >>>>   >       >> > +completion of the operation. Due to
>       serialization,
>       >>>>   completion of
>       >>>>   >       an operation
>       >>>>   >       >> > +will also indicate that all previous operations
>       >>>>are also
>       >>>>   >       complete.
>       >>>>   >       >>
>       >>>>   >       >> I guess we should avoid saying "will immediately
>       start
>       >>>>   >       binding/unbinding" if
>       >>>>   >       >> there are fences involved.
>       >>>>   >       >>
>       >>>>   >       >> And the fact that it's happening in an async
>       >>>>worker seem to
>       >>>>   imply
>       >>>>   >       it's not
>       >>>>   >       >> immediate.
>       >>>>   >       >>
>       >>>>   >
>       >>>>   >       Ok, will fix.
>       >>>>   >       This was added because in earlier design binding was
>       deferred
>       >>>>   until
>       >>>>   >       next execbuff.
>       >>>>   >       But now it is non-deferred (immediate in that sense).
>       >>>>But yah,
>       >>>>   this is
>       >>>>   >       confusing
>       >>>>   >       and will fix it.
>       >>>>   >
>       >>>>   >       >>
>       >>>>   >       >> I have a question on the behavior of the bind
>       >>>>operation when
>       >>>>   no
>       >>>>   >       input fence
>       >>>>   >       >> is provided. Let say I do :
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (out_fence=fence1)
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (out_fence=fence2)
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (out_fence=fence3)
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> In what order are the fences going to be signaled?
>       >>>>   >       >>
>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>       >>>>   >       >>
>       >>>>   >       >> Because you wrote "serialized I assume it's : in
>       order
>       >>>>   >       >>
>       >>>>   >
>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>       >>>>bind and
>       >>>>   unbind
>       >>>>   >       will use
>       >>>>   >       the same queue and hence are ordered.
>       >>>>   >
>       >>>>   >       >>
>       >>>>   >       >> One thing I didn't realize is that because we only
>       get one
>       >>>>   >       "VM_BIND" engine,
>       >>>>   >       >> there is a disconnect from the Vulkan specification.
>       >>>>   >       >>
>       >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>       >>>>per engine.
>       >>>>   >       >>
>       >>>>   >       >> So you could have something like this :
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>       out_fence=fence2)
>       >>>>   >       >>
>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>       out_fence=fence4)
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> fence1 is not signaled
>       >>>>   >       >>
>       >>>>   >       >> fence3 is signaled
>       >>>>   >       >>
>       >>>>   >       >> So the second VM_BIND will proceed before the
>       >>>>first VM_BIND.
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> I guess we can deal with that scenario in
>       >>>>userspace by doing
>       >>>>   the
>       >>>>   >       wait
>       >>>>   >       >> ourselves in one thread per engines.
>       >>>>   >       >>
>       >>>>   >       >> But then it makes the VM_BIND input fences useless.
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> Daniel : what do you think? Should be rework this or
>       just
>       >>>>   deal with
>       >>>>   >       wait
>       >>>>   >       >> fences in userspace?
>       >>>>   >       >>
>       >>>>   >       >
>       >>>>   >       >My opinion is rework this but make the ordering via
>       >>>>an engine
>       >>>>   param
>       >>>>   >       optional.
>       >>>>   >       >
>       >>>>   >       >e.g. A VM can be configured so all binds are ordered
>       >>>>within the
>       >>>>   VM
>       >>>>   >       >
>       >>>>   >       >e.g. A VM can be configured so all binds accept an
>       engine
>       >>>>   argument
>       >>>>   >       (in
>       >>>>   >       >the case of the i915 likely this is a gem context
>       >>>>handle) and
>       >>>>   binds
>       >>>>   >       >ordered with respect to that engine.
>       >>>>   >       >
>       >>>>   >       >This gives UMDs options as the later likely consumes
>       >>>>more KMD
>       >>>>   >       resources
>       >>>>   >       >so if a different UMD can live with binds being
>       >>>>ordered within
>       >>>>   the VM
>       >>>>   >       >they can use a mode consuming less resources.
>       >>>>   >       >
>       >>>>   >
>       >>>>   >       I think we need to be careful here if we are looking
>       for some
>       >>>>   out of
>       >>>>   >       (submission) order completion of vm_bind/unbind.
>       >>>>   >       In-order completion means, in a batch of binds and
>       >>>>unbinds to be
>       >>>>   >       completed in-order, user only needs to specify
>       >>>>in-fence for the
>       >>>>   >       first bind/unbind call and the our-fence for the last
>       >>>>   bind/unbind
>       >>>>   >       call. Also, the VA released by an unbind call can be
>       >>>>re-used by
>       >>>>   >       any subsequent bind call in that in-order batch.
>       >>>>   >
>       >>>>   >       These things will break if binding/unbinding were to
>       >>>>be allowed
>       >>>>   to
>       >>>>   >       go out of order (of submission) and user need to be
>       extra
>       >>>>   careful
>       >>>>   >       not to run into pre-mature triggereing of out-fence and
>       bind
>       >>>>   failing
>       >>>>   >       as VA is still in use etc.
>       >>>>   >
>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>       specified
>       >>>>   address
>       >>>>   >       space
>       >>>>   >       (VM). So, the uapi is not engine/context specific.
>       >>>>   >
>       >>>>   >       We can however add a 'queue' to the uapi which can be
>       >>>>one from
>       >>>>   the
>       >>>>   >       pre-defined queues,
>       >>>>   >       I915_VM_BIND_QUEUE_0
>       >>>>   >       I915_VM_BIND_QUEUE_1
>       >>>>   >       ...
>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>       >>>>   >
>       >>>>   >       KMD will spawn an async work queue for each queue which
>       will
>       >>>>   only
>       >>>>   >       bind the mappings on that queue in the order of
>       submission.
>       >>>>   >       User can assign the queue to per engine or anything
>       >>>>like that.
>       >>>>   >
>       >>>>   >       But again here, user need to be careful and not
>       >>>>deadlock these
>       >>>>   >       queues with circular dependency of fences.
>       >>>>   >
>       >>>>   >       I prefer adding this later an as extension based on
>       >>>>whether it
>       >>>>   >       is really helping with the implementation.
>       >>>>   >
>       >>>>   >     I can tell you right now that having everything on a
>       single
>       >>>>   in-order
>       >>>>   >     queue will not get us the perf we want.  What vulkan
>       >>>>really wants
>       >>>>   is one
>       >>>>   >     of two things:
>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>       happen in
>       >>>>   whatever
>       >>>>   >     their dependencies are resolved and we ensure ordering
>       >>>>ourselves
>       >>>>   by
>       >>>>   >     having a syncobj in the VkQueue.
>       >>>>   >      2. The ability to create multiple VM_BIND queues.  We
>       need at
>       >>>>   least 2
>       >>>>   >     but I don't see why there needs to be a limit besides
>       >>>>the limits
>       >>>>   the
>       >>>>   >     i915 API already has on the number of engines.  Vulkan
>       could
>       >>>>   expose
>       >>>>   >     multiple sparse binding queues to the client if it's not
>       >>>>   arbitrarily
>       >>>>   >     limited.
>       >>>>
>       >>>>   Thanks Jason, Lionel.
>       >>>>
>       >>>>   Jason, what are you referring to when you say "limits the i915
>       API
>       >>>>   already
>       >>>>   has on the number of engines"? I am not sure if there is such
>       an uapi
>       >>>>   today.
>       >>>>
>       >>>> There's a limit of something like 64 total engines today based on
>       the
>       >>>> number of bits we can cram into the exec flags in execbuffer2.  I
>       think
>       >>>> someone had an extended version that allowed more but I ripped it
>       out
>       >>>> because no one was using it.  Of course, execbuffer3 might not
>       >>>>have that
>       >>>> problem at all.
>       >>>>
>       >>>
>       >>>Thanks Jason.
>       >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>       probably
>       >>>will not have this limiation. So, we need to define a
>       VM_BIND_MAX_QUEUE
>       >>>and somehow export it to user (I am thinking of embedding it in
>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>       meaning 2^n
>       >>>queues.
>       >>
>       >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which
>       execbuf3
>
>     Yup!  That's exactly the limit I was talking about.
>      
>
>       >>will also have. So, we can simply define in vm_bind/unbind
>       structures,
>       >>
>       >>#define I915_VM_BIND_MAX_QUEUE   64
>       >>        __u32 queue;
>       >>
>       >>I think that will keep things simple.
>       >
>       >Hmmm? What does execbuf2 limit has to do with how many engines
>       >hardware can have? I suggest not to do that.
>       >
>       >Change with added this:
>       >
>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>       >               return -EINVAL;
>       >
>       >To context creation needs to be undone and so let users create engine
>       >maps with all hardware engines, and let execbuf3 access them all.
>       >
>
>       Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also.
>       Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
>       make it N+1).
>       But, as discussed in other thread of this RFC series, we are planning
>       to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>       any uapi that limits the number of engines (and hence the vm_bind
>       queues
>       need to be supported).
>
>       If we leave the number of vm_bind queues to be arbitrarily large
>       (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>       work_item and a linked list) lookup from the user specified queue
>       index.
>       Other option is to just put some hard limit (say 64 or 65) and use
>       an array of queues in VM (each created upon first use). I prefer this.
>
>     I don't get why a VM_BIND queue is any different from any other queue or
>     userspace-visible kernel object.  But I'll leave those details up to
>     danvet or whoever else might be reviewing the implementation.
>     --Jason
>
>   I kind of agree here. Wouldn't be simpler to have the bind queue created
>   like the others when we build the engine map?
>
>   For userspace it's then just matter of selecting the right queue ID when
>   submitting.
>
>   If there is ever a possibility to have this work on the GPU, it would be
>   all ready.
>

I did sync offline with Matt Brost on this.
We can add a VM_BIND engine class and let the user create VM_BIND engines (queues).
The problem is that in i915 the engine creation interface is bound to gem_context.
So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for a
proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an
interface to the VM (address space) and has nothing to do with gem_context.
Another problem is that if two VMs bind with the same user-defined engine, binding
on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on
its in_fence).

So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind
ioctl, and make the queues per VM.

Niranjana
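
For illustration only, a rough sketch of how such a per-VM 'queue' index could sit
in the vm_bind ioctl arguments; the field names and layout below are assumptions
for discussion, not the actual uapi proposed in this series:

#include <linux/types.h>

struct vm_bind_args {                   /* hypothetical, for illustration */
        __u32 vm_id;                    /* target address space (VM) */
        __u32 queue_idx;                /* per-VM bind queue index */
        __u64 start;                    /* GPU virtual address to bind at */
        __u64 offset;                   /* offset into the GEM object */
        __u64 length;                   /* length of the mapping */
        __u32 handle;                   /* GEM buffer object handle */
        __u32 flags;
        __u64 extensions;               /* e.g. in/out fence extensions */
};

Binds submitted with the same (vm_id, queue_idx) pair would complete in submission
order, while different queues on the same VM (or on different VMs) can make
progress independently.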

>   Thanks,
>
>   -Lionel
>
>      
>
>       Niranjana
>
>       >Regards,
>       >
>       >Tvrtko
>       >
>       >>
>       >>Niranjana
>       >>
>       >>>
>       >>>>   I am trying to see how many queues we need and don't want it to
>       be
>       >>>>   arbitrarily
>       >>>>   large and unduely blow up memory usage and complexity in i915
>       driver.
>       >>>>
>       >>>> I expect a Vulkan driver to use at most 2 in the vast majority
>       >>>>of cases. I
>       >>>> could imagine a client wanting to create more than 1 sparse
>       >>>>queue in which
>       >>>> case, it'll be N+1 but that's unlikely.  As far as complexity
>       >>>>goes, once
>       >>>> you allow two, I don't think the complexity is going up by
>       >>>>allowing N.  As
>       >>>> for memory usage, creating more queues means more memory.  That's
>       a
>       >>>> trade-off that userspace can make.  Again, the expected number
>       >>>>here is 1
>       >>>> or 2 in the vast majority of cases so I don't think you need to
>       worry.
>       >>>
>       >>>Ok, will start with n=3 meaning 8 queues.
>       >>>That would require us create 8 workqueues.
>       >>>We can change 'n' later if required.
>       >>>
>       >>>Niranjana
>       >>>
>       >>>>
>       >>>>   >     Why?  Because Vulkan has two basic kind of bind
>       >>>>operations and we
>       >>>>   don't
>       >>>>   >     want any dependencies between them:
>       >>>>   >      1. Immediate.  These happen right after BO creation or
>       >>>>maybe as
>       >>>>   part of
>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>       >>>>don't happen
>       >>>>   on a
>       >>>>   >     queue and we don't want them serialized with anything. 
>       To
>       >>>>   synchronize
>       >>>>   >     with submit, we'll have a syncobj in the VkDevice which
>       is
>       >>>>   signaled by
>       >>>>   >     all immediate bind operations and make submits wait on
>       it.
>       >>>>   >      2. Queued (sparse): These happen on a VkQueue which may
>       be the
>       >>>>   same as
>       >>>>   >     a render/compute queue or may be its own queue.  It's up
>       to us
>       >>>>   what we
>       >>>>   >     want to advertise.  From the Vulkan API PoV, this is like
>       any
>       >>>>   other
>       >>>>   >     queue.  Operations on it wait on and signal semaphores. 
>       If we
>       >>>>   have a
>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>       >>>>signal just like
>       >>>>   we do
>       >>>>   >     in execbuf().
>       >>>>   >     The important thing is that we don't want one type of
>       >>>>operation to
>       >>>>   block
>       >>>>   >     on the other.  If immediate binds are blocking on sparse
>       binds,
>       >>>>   it's
>       >>>>   >     going to cause over-synchronization issues.
>       >>>>   >     In terms of the internal implementation, I know that
>       >>>>there's going
>       >>>>   to be
>       >>>>   >     a lock on the VM and that we can't actually do these
>       things in
>       >>>>   >     parallel.  That's fine.  Once the dma_fences have
>       signaled and
>       >>>>   we're
>       >>>>
>       >>>>   Thats correct. It is like a single VM_BIND engine with
>       >>>>multiple queues
>       >>>>   feeding to it.
>       >>>>
>       >>>> Right.  As long as the queues themselves are independent and
>       >>>>can block on
>       >>>> dma_fences without holding up other queues, I think we're fine.
>       >>>>
>       >>>>   >     unblocked to do the bind operation, I don't care if
>       >>>>there's a bit
>       >>>>   of
>       >>>>   >     synchronization due to locking.  That's expected.  What
>       >>>>we can't
>       >>>>   afford
>       >>>>   >     to have is an immediate bind operation suddenly blocking
>       on a
>       >>>>   sparse
>       >>>>   >     operation which is blocked on a compute job that's going
>       to run
>       >>>>   for
>       >>>>   >     another 5ms.
>       >>>>
>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block
>       the
>       >>>>   VM_BIND
>       >>>>   on other VMs. I am not sure about usecases here, but just
>       wanted to
>       >>>>   clarify.
>       >>>>
>       >>>> Yes, that's what I would expect.
>       >>>> --Jason
>       >>>>
>       >>>>   Niranjana
>       >>>>
>       >>>>   >     For reference, Windows solves this by allowing
>       arbitrarily many
>       >>>>   paging
>       >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>       >>>>design works
>       >>>>   >     pretty well and solves the problems in question. 
>       >>>>Again, we could
>       >>>>   just
>       >>>>   >     make everything out-of-order and require using syncobjs
>       >>>>to order
>       >>>>   things
>       >>>>   >     as userspace wants. That'd be fine too.
>       >>>>   >     One more note while I'm here: danvet said something on
>       >>>>IRC about
>       >>>>   VM_BIND
>       >>>>   >     queues waiting for syncobjs to materialize.  We don't
>       really
>       >>>>   want/need
>       >>>>   >     this.  We already have all the machinery in userspace to
>       handle
>       >>>>   >     wait-before-signal and waiting for syncobj fences to
>       >>>>materialize
>       >>>>   and
>       >>>>   >     that machinery is on by default.  It would actually
>       >>>>take MORE work
>       >>>>   in
>       >>>>   >     Mesa to turn it off and take advantage of the kernel
>       >>>>being able to
>       >>>>   wait
>       >>>>   >     for syncobjs to materialize.  Also, getting that right is
>       >>>>   ridiculously
>       >>>>   >     hard and I really don't want to get it wrong in kernel
>       >>>>space.     When we
>       >>>>   >     do memory fences, wait-before-signal will be a thing.  We
>       don't
>       >>>>   need to
>       >>>>   >     try and make it a thing for syncobj.
>       >>>>   >     --Jason
>       >>>>   >
>       >>>>   >   Thanks Jason,
>       >>>>   >
>       >>>>   >   I missed the bit in the Vulkan spec that we're allowed to
>       have a
>       >>>>   sparse
>       >>>>   >   queue that does not implement either graphics or compute
>       >>>>operations
>       >>>>   :
>       >>>>   >
>       >>>>   >     "While some implementations may include
>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>       >>>>   >     support in queue families that also include
>       >>>>   >
>       >>>>   >      graphics and compute support, other implementations may
>       only
>       >>>>   expose a
>       >>>>   >     VK_QUEUE_SPARSE_BINDING_BIT-only queue
>       >>>>   >
>       >>>>   >      family."
>       >>>>   >
>       >>>>   >   So it can all be all a vm_bind engine that just does
>       bind/unbind
>       >>>>   >   operations.
>       >>>>   >
>       >>>>   >   But yes we need another engine for the immediate/non-sparse
>       >>>>   operations.
>       >>>>   >
>       >>>>   >   -Lionel
>       >>>>   >
>       >>>>   >         >
>       >>>>   >       Daniel, any thoughts?
>       >>>>   >
>       >>>>   >       Niranjana
>       >>>>   >
>       >>>>   >       >Matt
>       >>>>   >       >
>       >>>>   >       >>
>       >>>>   >       >> Sorry I noticed this late.
>       >>>>   >       >>
>       >>>>   >       >>
>       >>>>   >       >> -Lionel
>       >>>>   >       >>
>       >>>>   >       >>
Lionel Landwerlin June 10, 2022, 6:53 a.m. UTC | #28
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>>
>>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>     <niranjana.vishwanathapura@intel.com> wrote:
>>
>>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>       >
>>       >
>>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>> Vishwanathapura
>>       wrote:
>>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>       >>>>
>>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>>       wrote:
>>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>       >>>>   >
>>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>> Vishwanathapura
>>       >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>       >>>>   >
>>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>       >>>>Brost wrote:
>>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>       Landwerlin
>>       >>>>   wrote:
>>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>       wrote:
>>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>       >>>>   binding/unbinding
>>       >>>>   >       the mapping in an
>>       >>>>   >       >> > +async worker. The binding and unbinding will
>>       >>>>work like a
>>       >>>>   special
>>       >>>>   >       GPU engine.
>>       >>>>   >       >> > +The binding and unbinding operations are
>>       serialized and
>>       >>>>   will
>>       >>>>   >       wait on specified
>>       >>>>   >       >> > +input fences before the operation and will 
>> signal
>>       the
>>       >>>>   output
>>       >>>>   >       fences upon the
>>       >>>>   >       >> > +completion of the operation. Due to
>>       serialization,
>>       >>>>   completion of
>>       >>>>   >       an operation
>>       >>>>   >       >> > +will also indicate that all previous 
>> operations
>>       >>>>are also
>>       >>>>   >       complete.
>>       >>>>   >       >>
>>       >>>>   >       >> I guess we should avoid saying "will immediately
>>       start
>>       >>>>   >       binding/unbinding" if
>>       >>>>   >       >> there are fences involved.
>>       >>>>   >       >>
>>       >>>>   >       >> And the fact that it's happening in an async
>>       >>>>worker seem to
>>       >>>>   imply
>>       >>>>   >       it's not
>>       >>>>   >       >> immediate.
>>       >>>>   >       >>
>>       >>>>   >
>>       >>>>   >       Ok, will fix.
>>       >>>>   >       This was added because in earlier design binding 
>> was
>>       deferred
>>       >>>>   until
>>       >>>>   >       next execbuff.
>>       >>>>   >       But now it is non-deferred (immediate in that 
>> sense).
>>       >>>>But yah,
>>       >>>>   this is
>>       >>>>   >       confusing
>>       >>>>   >       and will fix it.
>>       >>>>   >
>>       >>>>   >       >>
>>       >>>>   >       >> I have a question on the behavior of the bind
>>       >>>>operation when
>>       >>>>   no
>>       >>>>   >       input fence
>>       >>>>   >       >> is provided. Let say I do :
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence1)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence2)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (out_fence=fence3)
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> In what order are the fences going to be 
>> signaled?
>>       >>>>   >       >>
>>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>       >>>>   >       >>
>>       >>>>   >       >> Because you wrote "serialized I assume it's : in
>>       order
>>       >>>>   >       >>
>>       >>>>   >
>>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note 
>> that
>>       >>>>bind and
>>       >>>>   unbind
>>       >>>>   >       will use
>>       >>>>   >       the same queue and hence are ordered.
>>       >>>>   >
>>       >>>>   >       >>
>>       >>>>   >       >> One thing I didn't realize is that because we 
>> only
>>       get one
>>       >>>>   >       "VM_BIND" engine,
>>       >>>>   >       >> there is a disconnect from the Vulkan 
>> specification.
>>       >>>>   >       >>
>>       >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>>       >>>>per engine.
>>       >>>>   >       >>
>>       >>>>   >       >> So you could have something like this :
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>       out_fence=fence2)
>>       >>>>   >       >>
>>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>       out_fence=fence4)
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> fence1 is not signaled
>>       >>>>   >       >>
>>       >>>>   >       >> fence3 is signaled
>>       >>>>   >       >>
>>       >>>>   >       >> So the second VM_BIND will proceed before the
>>       >>>>first VM_BIND.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> I guess we can deal with that scenario in
>>       >>>>userspace by doing
>>       >>>>   the
>>       >>>>   >       wait
>>       >>>>   >       >> ourselves in one thread per engines.
>>       >>>>   >       >>
>>       >>>>   >       >> But then it makes the VM_BIND input fences 
>> useless.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> Daniel : what do you think? Should be rework 
>> this or
>>       just
>>       >>>>   deal with
>>       >>>>   >       wait
>>       >>>>   >       >> fences in userspace?
>>       >>>>   >       >>
>>       >>>>   >       >
>>       >>>>   >       >My opinion is rework this but make the ordering 
>> via
>>       >>>>an engine
>>       >>>>   param
>>       >>>>   >       optional.
>>       >>>>   >       >
>>       >>>>   >       >e.g. A VM can be configured so all binds are 
>> ordered
>>       >>>>within the
>>       >>>>   VM
>>       >>>>   >       >
>>       >>>>   >       >e.g. A VM can be configured so all binds accept an
>>       engine
>>       >>>>   argument
>>       >>>>   >       (in
>>       >>>>   >       >the case of the i915 likely this is a gem context
>>       >>>>handle) and
>>       >>>>   binds
>>       >>>>   >       >ordered with respect to that engine.
>>       >>>>   >       >
>>       >>>>   >       >This gives UMDs options as the later likely 
>> consumes
>>       >>>>more KMD
>>       >>>>   >       resources
>>       >>>>   >       >so if a different UMD can live with binds being
>>       >>>>ordered within
>>       >>>>   the VM
>>       >>>>   >       >they can use a mode consuming less resources.
>>       >>>>   >       >
>>       >>>>   >
>>       >>>>   >       I think we need to be careful here if we are 
>> looking
>>       for some
>>       >>>>   out of
>>       >>>>   >       (submission) order completion of vm_bind/unbind.
>>       >>>>   >       In-order completion means, in a batch of binds and
>>       >>>>unbinds to be
>>       >>>>   >       completed in-order, user only needs to specify
>>       >>>>in-fence for the
>>       >>>>   >       first bind/unbind call and the our-fence for the 
>> last
>>       >>>>   bind/unbind
>>       >>>>   >       call. Also, the VA released by an unbind call 
>> can be
>>       >>>>re-used by
>>       >>>>   >       any subsequent bind call in that in-order batch.
>>       >>>>   >
>>       >>>>   >       These things will break if binding/unbinding 
>> were to
>>       >>>>be allowed
>>       >>>>   to
>>       >>>>   >       go out of order (of submission) and user need to be
>>       extra
>>       >>>>   careful
>>       >>>>   >       not to run into pre-mature triggereing of 
>> out-fence and
>>       bind
>>       >>>>   failing
>>       >>>>   >       as VA is still in use etc.
>>       >>>>   >
>>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>>       specified
>>       >>>>   address
>>       >>>>   >       space
>>       >>>>   >       (VM). So, the uapi is not engine/context specific.
>>       >>>>   >
>>       >>>>   >       We can however add a 'queue' to the uapi which 
>> can be
>>       >>>>one from
>>       >>>>   the
>>       >>>>   >       pre-defined queues,
>>       >>>>   >       I915_VM_BIND_QUEUE_0
>>       >>>>   >       I915_VM_BIND_QUEUE_1
>>       >>>>   >       ...
>>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>       >>>>   >
>>       >>>>   >       KMD will spawn an async work queue for each 
>> queue which
>>       will
>>       >>>>   only
>>       >>>>   >       bind the mappings on that queue in the order of
>>       submission.
>>       >>>>   >       User can assign the queue to per engine or anything
>>       >>>>like that.
>>       >>>>   >
>>       >>>>   >       But again here, user need to be careful and not
>>       >>>>deadlock these
>>       >>>>   >       queues with circular dependency of fences.
>>       >>>>   >
>>       >>>>   >       I prefer adding this later an as extension based on
>>       >>>>whether it
>>       >>>>   >       is really helping with the implementation.
>>       >>>>   >
>>       >>>>   >     I can tell you right now that having everything on a
>>       single
>>       >>>>   in-order
>>       >>>>   >     queue will not get us the perf we want.  What vulkan
>>       >>>>really wants
>>       >>>>   is one
>>       >>>>   >     of two things:
>>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>       happen in
>>       >>>>   whatever
>>       >>>>   >     their dependencies are resolved and we ensure 
>> ordering
>>       >>>>ourselves
>>       >>>>   by
>>       >>>>   >     having a syncobj in the VkQueue.
>>       >>>>   >      2. The ability to create multiple VM_BIND 
>> queues.  We
>>       need at
>>       >>>>   least 2
>>       >>>>   >     but I don't see why there needs to be a limit besides
>>       >>>>the limits
>>       >>>>   the
>>       >>>>   >     i915 API already has on the number of engines.  
>> Vulkan
>>       could
>>       >>>>   expose
>>       >>>>   >     multiple sparse binding queues to the client if 
>> it's not
>>       >>>>   arbitrarily
>>       >>>>   >     limited.
>>       >>>>
>>       >>>>   Thanks Jason, Lionel.
>>       >>>>
>>       >>>>   Jason, what are you referring to when you say "limits 
>> the i915
>>       API
>>       >>>>   already
>>       >>>>   has on the number of engines"? I am not sure if there is 
>> such
>>       an uapi
>>       >>>>   today.
>>       >>>>
>>       >>>> There's a limit of something like 64 total engines today 
>> based on
>>       the
>>       >>>> number of bits we can cram into the exec flags in 
>> execbuffer2.  I
>>       think
>>       >>>> someone had an extended version that allowed more but I 
>> ripped it
>>       out
>>       >>>> because no one was using it.  Of course, execbuffer3 might 
>> not
>>       >>>>have that
>>       >>>> problem at all.
>>       >>>>
>>       >>>
>>       >>>Thanks Jason.
>>       >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>>       probably
>>       >>>will not have this limiation. So, we need to define a
>>       VM_BIND_MAX_QUEUE
>>       >>>and somehow export it to user (I am thinking of embedding it in
>>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>       meaning 2^n
>>       >>>queues.
>>       >>
>>       >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) 
>> which
>>       execbuf3
>>
>>     Yup!  That's exactly the limit I was talking about.
>>
>>       >>will also have. So, we can simply define in vm_bind/unbind
>>       structures,
>>       >>
>>       >>#define I915_VM_BIND_MAX_QUEUE   64
>>       >>        __u32 queue;
>>       >>
>>       >>I think that will keep things simple.
>>       >
>>       >Hmmm? What does execbuf2 limit has to do with how many engines
>>       >hardware can have? I suggest not to do that.
>>       >
>>       >Change with added this:
>>       >
>>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>       >               return -EINVAL;
>>       >
>>       >To context creation needs to be undone and so let users create 
>> engine
>>       >maps with all hardware engines, and let execbuf3 access them all.
>>       >
>>
>>       Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to 
>> execbuff3 also.
>>       Hence, I was using the same limit for VM_BIND queues (64, or 65 
>> if we
>>       make it N+1).
>>       But, as discussed in other thread of this RFC series, we are 
>> planning
>>       to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>>       any uapi that limits the number of engines (and hence the vm_bind
>>       queues
>>       need to be supported).
>>
>>       If we leave the number of vm_bind queues to be arbitrarily large
>>       (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>>       work_item and a linked list) lookup from the user specified queue
>>       index.
>>       Other option is to just put some hard limit (say 64 or 65) and use
>>       an array of queues in VM (each created upon first use). I 
>> prefer this.
>>
>>     I don't get why a VM_BIND queue is any different from any other 
>> queue or
>>     userspace-visible kernel object.  But I'll leave those details up to
>>     danvet or whoever else might be reviewing the implementation.
>>     --Jason
>>
>>   I kind of agree here. Wouldn't be simpler to have the bind queue 
>> created
>>   like the others when we build the engine map?
>>
>>   For userspace it's then just matter of selecting the right queue ID 
>> when
>>   submitting.
>>
>>   If there is ever a possibility to have this work on the GPU, it 
>> would be
>>   all ready.
>>
>
> I did sync offline with Matt Brost on this.
> We can add a VM_BIND engine class and let user create VM_BIND engines 
> (queues).
> The problem is, in i915 engine creating interface is bound to 
> gem_context.
> So, in vm_bind ioctl, we would need both context_id and queue_idx for 
> proper
> lookup of the user created engine. This is bit ackward as vm_bind is an
> interface to VM (address space) and has nothing to do with gem_context.


A gem_context has a single vm object, right?

It is set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.

So it's just like picking up the vm the way it's done at execbuffer time
right now : eb->context->vm
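
For reference, roughly how userspace associates a VM with a gem_context today; a
simplified sketch using the existing uapi, with error handling omitted:

#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Create a VM and attach it to an existing gem_context, so that execbuffer
 * later picks it up through the context (eb->context->vm).
 */
static __u32 create_and_attach_vm(int fd, __u32 ctx_id)
{
        struct drm_i915_gem_vm_control vm_ctl = { 0 };
        struct drm_i915_gem_context_param param = { 0 };

        drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_ctl);

        param.ctx_id = ctx_id;
        param.param = I915_CONTEXT_PARAM_VM;
        param.value = vm_ctl.vm_id;
        drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &param);

        return vm_ctl.vm_id;
}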


> Another problem is, if two VMs are binding with the same defined engine,
> binding on VM1 can get unnecessary blocked by binding on VM2 (which 
> may be
> waiting on its in_fence).


Maybe I'm missing something, but how can you have 2 vm objects with a 
single gem_context right now?


>
> So, my preference here is to just add a 'u32 queue' index in 
> vm_bind/unbind
> ioctl, and the queues are per VM.
>
> Niranjana
>
>>   Thanks,
>>
>>   -Lionel
>>
>>
>>       Niranjana
>>
>>       >Regards,
>>       >
>>       >Tvrtko
>>       >
>>       >>
>>       >>Niranjana
>>       >>
>>       >>>
>>       >>>>   I am trying to see how many queues we need and don't 
>> want it to
>>       be
>>       >>>>   arbitrarily
>>       >>>>   large and unduely blow up memory usage and complexity in 
>> i915
>>       driver.
>>       >>>>
>>       >>>> I expect a Vulkan driver to use at most 2 in the vast 
>> majority
>>       >>>>of cases. I
>>       >>>> could imagine a client wanting to create more than 1 sparse
>>       >>>>queue in which
>>       >>>> case, it'll be N+1 but that's unlikely. As far as complexity
>>       >>>>goes, once
>>       >>>> you allow two, I don't think the complexity is going up by
>>       >>>>allowing N.  As
>>       >>>> for memory usage, creating more queues means more memory.  
>> That's
>>       a
>>       >>>> trade-off that userspace can make. Again, the expected number
>>       >>>>here is 1
>>       >>>> or 2 in the vast majority of cases so I don't think you 
>> need to
>>       worry.
>>       >>>
>>       >>>Ok, will start with n=3 meaning 8 queues.
>>       >>>That would require us create 8 workqueues.
>>       >>>We can change 'n' later if required.
>>       >>>
>>       >>>Niranjana
>>       >>>
>>       >>>>
>>       >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>       >>>>operations and we
>>       >>>>   don't
>>       >>>>   >     want any dependencies between them:
>>       >>>>   >      1. Immediate.  These happen right after BO 
>> creation or
>>       >>>>maybe as
>>       >>>>   part of
>>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>       >>>>don't happen
>>       >>>>   on a
>>       >>>>   >     queue and we don't want them serialized with 
>> anything.       To
>>       >>>>   synchronize
>>       >>>>   >     with submit, we'll have a syncobj in the VkDevice 
>> which
>>       is
>>       >>>>   signaled by
>>       >>>>   >     all immediate bind operations and make submits 
>> wait on
>>       it.
>>       >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>> which may
>>       be the
>>       >>>>   same as
>>       >>>>   >     a render/compute queue or may be its own queue.  
>> It's up
>>       to us
>>       >>>>   what we
>>       >>>>   >     want to advertise.  From the Vulkan API PoV, this 
>> is like
>>       any
>>       >>>>   other
>>       >>>>   >     queue.  Operations on it wait on and signal 
>> semaphores.       If we
>>       >>>>   have a
>>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>       >>>>signal just like
>>       >>>>   we do
>>       >>>>   >     in execbuf().
>>       >>>>   >     The important thing is that we don't want one type of
>>       >>>>operation to
>>       >>>>   block
>>       >>>>   >     on the other.  If immediate binds are blocking on 
>> sparse
>>       binds,
>>       >>>>   it's
>>       >>>>   >     going to cause over-synchronization issues.
>>       >>>>   >     In terms of the internal implementation, I know that
>>       >>>>there's going
>>       >>>>   to be
>>       >>>>   >     a lock on the VM and that we can't actually do these
>>       things in
>>       >>>>   >     parallel.  That's fine.  Once the dma_fences have
>>       signaled and
>>       >>>>   we're
>>       >>>>
>>       >>>>   Thats correct. It is like a single VM_BIND engine with
>>       >>>>multiple queues
>>       >>>>   feeding to it.
>>       >>>>
>>       >>>> Right.  As long as the queues themselves are independent and
>>       >>>>can block on
>>       >>>> dma_fences without holding up other queues, I think we're 
>> fine.
>>       >>>>
>>       >>>>   >     unblocked to do the bind operation, I don't care if
>>       >>>>there's a bit
>>       >>>>   of
>>       >>>>   >     synchronization due to locking.  That's expected.  
>> What
>>       >>>>we can't
>>       >>>>   afford
>>       >>>>   >     to have is an immediate bind operation suddenly 
>> blocking
>>       on a
>>       >>>>   sparse
>>       >>>>   >     operation which is blocked on a compute job that's 
>> going
>>       to run
>>       >>>>   for
>>       >>>>   >     another 5ms.
>>       >>>>
>>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>> doesn't block
>>       the
>>       >>>>   VM_BIND
>>       >>>>   on other VMs. I am not sure about usecases here, but just
>>       wanted to
>>       >>>>   clarify.
>>       >>>>
>>       >>>> Yes, that's what I would expect.
>>       >>>> --Jason
>>       >>>>
>>       >>>>   Niranjana
>>       >>>>
>>       >>>>   >     For reference, Windows solves this by allowing
>>       arbitrarily many
>>       >>>>   paging
>>       >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>>       >>>>design works
>>       >>>>   >     pretty well and solves the problems in question. 
>>       >>>>Again, we could
>>       >>>>   just
>>       >>>>   >     make everything out-of-order and require using 
>> syncobjs
>>       >>>>to order
>>       >>>>   things
>>       >>>>   >     as userspace wants. That'd be fine too.
>>       >>>>   >     One more note while I'm here: danvet said 
>> something on
>>       >>>>IRC about
>>       >>>>   VM_BIND
>>       >>>>   >     queues waiting for syncobjs to materialize.  We don't
>>       really
>>       >>>>   want/need
>>       >>>>   >     this.  We already have all the machinery in 
>> userspace to
>>       handle
>>       >>>>   >     wait-before-signal and waiting for syncobj fences to
>>       >>>>materialize
>>       >>>>   and
>>       >>>>   >     that machinery is on by default.  It would actually
>>       >>>>take MORE work
>>       >>>>   in
>>       >>>>   >     Mesa to turn it off and take advantage of the kernel
>>       >>>>being able to
>>       >>>>   wait
>>       >>>>   >     for syncobjs to materialize. Also, getting that 
>> right is
>>       >>>>   ridiculously
>>       >>>>   >     hard and I really don't want to get it wrong in 
>> kernel
>>       >>>>space.     When we
>>       >>>>   >     do memory fences, wait-before-signal will be a 
>> thing.  We
>>       don't
>>       >>>>   need to
>>       >>>>   >     try and make it a thing for syncobj.
>>       >>>>   >     --Jason
>>       >>>>   >
>>       >>>>   >   Thanks Jason,
>>       >>>>   >
>>       >>>>   >   I missed the bit in the Vulkan spec that we're 
>> allowed to
>>       have a
>>       >>>>   sparse
>>       >>>>   >   queue that does not implement either graphics or 
>> compute
>>       >>>>operations
>>       >>>>   :
>>       >>>>   >
>>       >>>>   >     "While some implementations may include
>>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>       >>>>   >     support in queue families that also include
>>       >>>>   >
>>       >>>>   >      graphics and compute support, other 
>> implementations may
>>       only
>>       >>>>   expose a
>>       >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>       >>>>   >
>>       >>>>   >      family."
>>       >>>>   >
>>       >>>>   >   So it can all be all a vm_bind engine that just does
>>       bind/unbind
>>       >>>>   >   operations.
>>       >>>>   >
>>       >>>>   >   But yes we need another engine for the 
>> immediate/non-sparse
>>       >>>>   operations.
>>       >>>>   >
>>       >>>>   >   -Lionel
>>       >>>>   >
>>       >>>>   >         >
>>       >>>>   >       Daniel, any thoughts?
>>       >>>>   >
>>       >>>>   >       Niranjana
>>       >>>>   >
>>       >>>>   >       >Matt
>>       >>>>   >       >
>>       >>>>   >       >>
>>       >>>>   >       >> Sorry I noticed this late.
>>       >>>>   >       >>
>>       >>>>   >       >>
>>       >>>>   >       >> -Lionel
>>       >>>>   >       >>
>>       >>>>   >       >>
Niranjana Vishwanathapura June 10, 2022, 7:54 a.m. UTC | #29
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>
>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>    <niranjana.vishwanathapura@intel.com> wrote:
>>>
>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>      >
>>>      >
>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>Vishwanathapura
>>>      wrote:
>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>      >>>>
>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
>>>      wrote:
>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>      >>>>   >
>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>Vishwanathapura
>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>      >>>>   >
>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>      >>>>Brost wrote:
>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>      Landwerlin
>>>      >>>>   wrote:
>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>      wrote:
>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>      >>>>   binding/unbinding
>>>      >>>>   >       the mapping in an
>>>      >>>>   >       >> > +async worker. The binding and unbinding will
>>>      >>>>work like a
>>>      >>>>   special
>>>      >>>>   >       GPU engine.
>>>      >>>>   >       >> > +The binding and unbinding operations are
>>>      serialized and
>>>      >>>>   will
>>>      >>>>   >       wait on specified
>>>      >>>>   >       >> > +input fences before the operation and 
>>>will signal
>>>      the
>>>      >>>>   output
>>>      >>>>   >       fences upon the
>>>      >>>>   >       >> > +completion of the operation. Due to
>>>      serialization,
>>>      >>>>   completion of
>>>      >>>>   >       an operation
>>>      >>>>   >       >> > +will also indicate that all previous 
>>>operations
>>>      >>>>are also
>>>      >>>>   >       complete.
>>>      >>>>   >       >>
>>>      >>>>   >       >> I guess we should avoid saying "will immediately
>>>      start
>>>      >>>>   >       binding/unbinding" if
>>>      >>>>   >       >> there are fences involved.
>>>      >>>>   >       >>
>>>      >>>>   >       >> And the fact that it's happening in an async
>>>      >>>>worker seem to
>>>      >>>>   imply
>>>      >>>>   >       it's not
>>>      >>>>   >       >> immediate.
>>>      >>>>   >       >>
>>>      >>>>   >
>>>      >>>>   >       Ok, will fix.
>>>      >>>>   >       This was added because in earlier design 
>>>binding was
>>>      deferred
>>>      >>>>   until
>>>      >>>>   >       next execbuff.
>>>      >>>>   >       But now it is non-deferred (immediate in that 
>>>sense).
>>>      >>>>But yah,
>>>      >>>>   this is
>>>      >>>>   >       confusing
>>>      >>>>   >       and will fix it.
>>>      >>>>   >
>>>      >>>>   >       >>
>>>      >>>>   >       >> I have a question on the behavior of the bind
>>>      >>>>operation when
>>>      >>>>   no
>>>      >>>>   >       input fence
>>>      >>>>   >       >> is provided. Let say I do :
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> In what order are the fences going to be 
>>>signaled?
>>>      >>>>   >       >>
>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out of order?
>>>      >>>>   >       >>
>>>      >>>>   >       >> Because you wrote "serialized I assume it's : in
>>>      order
>>>      >>>>   >       >>
>>>      >>>>   >
>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. 
>>>Note that
>>>      >>>>bind and
>>>      >>>>   unbind
>>>      >>>>   >       will use
>>>      >>>>   >       the same queue and hence are ordered.
>>>      >>>>   >
>>>      >>>>   >       >>
>>>      >>>>   >       >> One thing I didn't realize is that because 
>>>we only
>>>      get one
>>>      >>>>   >       "VM_BIND" engine,
>>>      >>>>   >       >> there is a disconnect from the Vulkan 
>>>specification.
>>>      >>>>   >       >>
>>>      >>>>   >       >> In Vulkan VM_BIND operations are serialized but
>>>      >>>>per engine.
>>>      >>>>   >       >>
>>>      >>>>   >       >> So you could have something like this :
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>      out_fence=fence2)
>>>      >>>>   >       >>
>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>      out_fence=fence4)
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> fence1 is not signaled
>>>      >>>>   >       >>
>>>      >>>>   >       >> fence3 is signaled
>>>      >>>>   >       >>
>>>      >>>>   >       >> So the second VM_BIND will proceed before the
>>>      >>>>first VM_BIND.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> I guess we can deal with that scenario in
>>>      >>>>userspace by doing
>>>      >>>>   the
>>>      >>>>   >       wait
>>>      >>>>   >       >> ourselves in one thread per engines.
>>>      >>>>   >       >>
>>>      >>>>   >       >> But then it makes the VM_BIND input fences 
>>>useless.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> Daniel : what do you think? Should be 
>>>rework this or
>>>      just
>>>      >>>>   deal with
>>>      >>>>   >       wait
>>>      >>>>   >       >> fences in userspace?
>>>      >>>>   >       >>
>>>      >>>>   >       >
>>>      >>>>   >       >My opinion is rework this but make the 
>>>ordering via
>>>      >>>>an engine
>>>      >>>>   param
>>>      >>>>   >       optional.
>>>      >>>>   >       >
>>>      >>>>   >       >e.g. A VM can be configured so all binds are 
>>>ordered
>>>      >>>>within the
>>>      >>>>   VM
>>>      >>>>   >       >
>>>      >>>>   >       >e.g. A VM can be configured so all binds accept an
>>>      engine
>>>      >>>>   argument
>>>      >>>>   >       (in
>>>      >>>>   >       >the case of the i915 likely this is a gem context
>>>      >>>>handle) and
>>>      >>>>   binds
>>>      >>>>   >       >ordered with respect to that engine.
>>>      >>>>   >       >
>>>      >>>>   >       >This gives UMDs options as the later likely 
>>>consumes
>>>      >>>>more KMD
>>>      >>>>   >       resources
>>>      >>>>   >       >so if a different UMD can live with binds being
>>>      >>>>ordered within
>>>      >>>>   the VM
>>>      >>>>   >       >they can use a mode consuming less resources.
>>>      >>>>   >       >
>>>      >>>>   >
>>>      >>>>   >       I think we need to be careful here if we are 
>>>looking
>>>      for some
>>>      >>>>   out of
>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
>>>      >>>>   >       In-order completion means, in a batch of binds and
>>>      >>>>unbinds to be
>>>      >>>>   >       completed in-order, user only needs to specify
>>>      >>>>in-fence for the
>>>      >>>>   >       first bind/unbind call and the our-fence for 
>>>the last
>>>      >>>>   bind/unbind
>>>      >>>>   >       call. Also, the VA released by an unbind call 
>>>can be
>>>      >>>>re-used by
>>>      >>>>   >       any subsequent bind call in that in-order batch.
>>>      >>>>   >
>>>      >>>>   >       These things will break if binding/unbinding 
>>>were to
>>>      >>>>be allowed
>>>      >>>>   to
>>>      >>>>   >       go out of order (of submission) and user need to be
>>>      extra
>>>      >>>>   careful
>>>      >>>>   >       not to run into pre-mature triggereing of 
>>>out-fence and
>>>      bind
>>>      >>>>   failing
>>>      >>>>   >       as VA is still in use etc.
>>>      >>>>   >
>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>      specified
>>>      >>>>   address
>>>      >>>>   >       space
>>>      >>>>   >       (VM). So, the uapi is not engine/context specific.
>>>      >>>>   >
>>>      >>>>   >       We can however add a 'queue' to the uapi 
>>>which can be
>>>      >>>>one from
>>>      >>>>   the
>>>      >>>>   >       pre-defined queues,
>>>      >>>>   >       I915_VM_BIND_QUEUE_0
>>>      >>>>   >       I915_VM_BIND_QUEUE_1
>>>      >>>>   >       ...
>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>      >>>>   >
>>>      >>>>   >       KMD will spawn an async work queue for each 
>>>queue which
>>>      will
>>>      >>>>   only
>>>      >>>>   >       bind the mappings on that queue in the order of
>>>      submission.
>>>      >>>>   >       User can assign the queue to per engine or anything
>>>      >>>>like that.
>>>      >>>>   >
>>>      >>>>   >       But again here, user need to be careful and not
>>>      >>>>deadlock these
>>>      >>>>   >       queues with circular dependency of fences.
>>>      >>>>   >
>>>      >>>>   >       I prefer adding this later an as extension based on
>>>      >>>>whether it
>>>      >>>>   >       is really helping with the implementation.
>>>      >>>>   >
>>>      >>>>   >     I can tell you right now that having everything on a
>>>      single
>>>      >>>>   in-order
>>>      >>>>   >     queue will not get us the perf we want.  What vulkan
>>>      >>>>really wants
>>>      >>>>   is one
>>>      >>>>   >     of two things:
>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>      happen in
>>>      >>>>   whatever
>>>      >>>>   >     their dependencies are resolved and we ensure 
>>>ordering
>>>      >>>>ourselves
>>>      >>>>   by
>>>      >>>>   >     having a syncobj in the VkQueue.
>>>      >>>>   >      2. The ability to create multiple VM_BIND 
>>>queues.  We
>>>      need at
>>>      >>>>   least 2
>>>      >>>>   >     but I don't see why there needs to be a limit besides
>>>      >>>>the limits
>>>      >>>>   the
>>>      >>>>   >     i915 API already has on the number of engines.  
>>>Vulkan
>>>      could
>>>      >>>>   expose
>>>      >>>>   >     multiple sparse binding queues to the client if 
>>>it's not
>>>      >>>>   arbitrarily
>>>      >>>>   >     limited.
>>>      >>>>
>>>      >>>>   Thanks Jason, Lionel.
>>>      >>>>
>>>      >>>>   Jason, what are you referring to when you say "limits 
>>>the i915
>>>      API
>>>      >>>>   already
>>>      >>>>   has on the number of engines"? I am not sure if there 
>>>is such
>>>      an uapi
>>>      >>>>   today.
>>>      >>>>
>>>      >>>> There's a limit of something like 64 total engines 
>>>today based on
>>>      the
>>>      >>>> number of bits we can cram into the exec flags in 
>>>execbuffer2.  I
>>>      think
>>>      >>>> someone had an extended version that allowed more but I 
>>>ripped it
>>>      out
>>>      >>>> because no one was using it.  Of course, execbuffer3 
>>>might not
>>>      >>>>have that
>>>      >>>> problem at all.
>>>      >>>>
>>>      >>>
>>>      >>>Thanks Jason.
>>>      >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>>>      probably
>>>      >>>will not have this limiation. So, we need to define a
>>>      VM_BIND_MAX_QUEUE
>>>      >>>and somehow export it to user (I am thinking of embedding it in
>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>      meaning 2^n
>>>      >>>queues.
>>>      >>
>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK 
>>>(0x3f) which
>>>      execbuf3
>>>
>>>    Yup!  That's exactly the limit I was talking about.
>>>
>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>      structures,
>>>      >>
>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>      >>        __u32 queue;
>>>      >>
>>>      >>I think that will keep things simple.
>>>      >
>>>      >Hmmm? What does execbuf2 limit has to do with how many engines
>>>      >hardware can have? I suggest not to do that.
>>>      >
>>>      >Change with added this:
>>>      >
>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>      >               return -EINVAL;
>>>      >
>>>      >To context creation needs to be undone and so let users 
>>>create engine
>>>      >maps with all hardware engines, and let execbuf3 access them all.
>>>      >
>>>
>>>      Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to 
>>>execbuff3 also.
>>>      Hence, I was using the same limit for VM_BIND queues (64, or 
>>>65 if we
>>>      make it N+1).
>>>      But, as discussed in other thread of this RFC series, we are 
>>>planning
>>>      to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>>>      any uapi that limits the number of engines (and hence the vm_bind
>>>      queues
>>>      need to be supported).
>>>
>>>      If we leave the number of vm_bind queues to be arbitrarily large
>>>      (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
>>>      work_item and a linked list) lookup from the user specified queue
>>>      index.
>>>      Other option is to just put some hard limit (say 64 or 65) and use
>>>      an array of queues in VM (each created upon first use). I 
>>>prefer this.
>>>
>>>    I don't get why a VM_BIND queue is any different from any 
>>>other queue or
>>>    userspace-visible kernel object.  But I'll leave those details up to
>>>    danvet or whoever else might be reviewing the implementation.
>>>    --Jason
>>>
>>>  I kind of agree here. Wouldn't be simpler to have the bind queue 
>>>created
>>>  like the others when we build the engine map?
>>>
>>>  For userspace it's then just matter of selecting the right queue 
>>>ID when
>>>  submitting.
>>>
>>>  If there is ever a possibility to have this work on the GPU, it 
>>>would be
>>>  all ready.
>>>
>>
>>I did sync offline with Matt Brost on this.
>>We can add a VM_BIND engine class and let user create VM_BIND 
>>engines (queues).
>>The problem is, in i915 engine creating interface is bound to 
>>gem_context.
>>So, in vm_bind ioctl, we would need both context_id and queue_idx 
>>for proper
>>lookup of the user created engine. This is bit ackward as vm_bind is an
>>interface to VM (address space) and has nothing to do with gem_context.
>
>
>A gem_context has a single vm object right?
>
>Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
>
>So it's just like picking up the vm like it's done at execbuffer time 
>right now : eb->context->vm
>

Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND
ioctl and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained
from the context?
I think the interface is clean as an interface to the VM. It is only that we
don't have a clean way to create a raw VM_BIND engine (not associated with
any context) with the i915 uapi.
Maybe we can add such an interface, but I don't think it is worth it
(we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned
above).
Does anyone have any thoughts?
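
To make the trade-off concrete, a rough sketch of the two lookup paths being
discussed; this is pseudocode, and the function and field names are illustrative
rather than actual i915 internals:

/* (a) vm_bind as a VM interface: ioctl takes a vm_id plus a per-VM queue index */
vm    = vm_lookup(file_priv, args->vm_id);
queue = vm->bind_queues[args->queue_idx];      /* created on first use */

/* (b) vm_bind tied to a gem_context engine: ioctl takes ctx_id plus engine index */
ctx   = context_lookup(file_priv, args->ctx_id);
vm    = ctx->vm;                               /* VM comes from the context */
queue = ctx->engines[args->engine_idx];        /* may end up shared across VMs */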

>
>>Another problem is, if two VMs are binding with the same defined engine,
>>binding on VM1 can get unnecessary blocked by binding on VM2 (which 
>>may be
>>waiting on its in_fence).
>
>
>Maybe I'm missing something, but how can you have 2 vm objects with a 
>single gem_context right now?
>

No, we don't have 2 VMs for a gem_context.
Say ctx1 is created with vm1 and ctx2 with vm2.
The first vm_bind call is for vm1 with q_idx 1 in ctx1's engine map.
The second vm_bind call is for vm2 with q_idx 2 in ctx2's engine map.
If those two queue indices point to the same underlying vm_bind engine,
then the second vm_bind call gets blocked until the first vm_bind call's
'in' fence is triggered and its bind completes.

With per-VM queues, this is not a problem, as two VMs will never end up
sharing the same queue.
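
A minimal sketch of what per-VM bind queues could look like on the kernel side,
assuming a small fixed queue count; the names are illustrative, not the actual
implementation:

#include <linux/list.h>
#include <linux/workqueue.h>

#define VM_BIND_MAX_QUEUE 8                     /* e.g. n=3 as discussed earlier */

struct vm_bind_queue {
        struct workqueue_struct *wq;            /* serializes binds on this queue */
        struct list_head pending;               /* submitted bind/unbind work items */
};

struct vm_sketch {                              /* stands in for the i915 address space */
        /* ... existing VM state ... */
        struct vm_bind_queue *queues[VM_BIND_MAX_QUEUE]; /* allocated on first use */
};

Since each VM owns its queues, a bind waiting on vm1's in-fence never sits in
front of a bind queued on vm2.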

BTW, I just posted an updated PATCH series.
https://www.spinics.net/lists/dri-devel/msg350483.html

Niranjana

>
>>
>>So, my preference here is to just add a 'u32 queue' index in 
>>vm_bind/unbind
>>ioctl, and the queues are per VM.
>>
>>Niranjana
>>
>>>  Thanks,
>>>
>>>  -Lionel
>>>
>>>
>>>      Niranjana
>>>
>>>      >Regards,
>>>      >
>>>      >Tvrtko
>>>      >
>>>      >>
>>>      >>Niranjana
>>>      >>
>>>      >>>
>>>      >>>>   I am trying to see how many queues we need and don't 
>>>want it to
>>>      be
>>>      >>>>   arbitrarily
>>>      >>>>   large and unduely blow up memory usage and complexity 
>>>in i915
>>>      driver.
>>>      >>>>
>>>      >>>> I expect a Vulkan driver to use at most 2 in the vast 
>>>majority
>>>      >>>>of cases. I
>>>      >>>> could imagine a client wanting to create more than 1 sparse
>>>      >>>>queue in which
>>>      >>>> case, it'll be N+1 but that's unlikely. As far as complexity
>>>      >>>>goes, once
>>>      >>>> you allow two, I don't think the complexity is going up by
>>>      >>>>allowing N.  As
>>>      >>>> for memory usage, creating more queues means more 
>>>memory.  That's
>>>      a
>>>      >>>> trade-off that userspace can make. Again, the expected number
>>>      >>>>here is 1
>>>      >>>> or 2 in the vast majority of cases so I don't think you 
>>>need to
>>>      worry.
>>>      >>>
>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>      >>>That would require us create 8 workqueues.
>>>      >>>We can change 'n' later if required.
>>>      >>>
>>>      >>>Niranjana
>>>      >>>
>>>      >>>>
>>>      >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>>      >>>>operations and we
>>>      >>>>   don't
>>>      >>>>   >     want any dependencies between them:
>>>      >>>>   >      1. Immediate.  These happen right after BO 
>>>creation or
>>>      >>>>maybe as
>>>      >>>>   part of
>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>      >>>>don't happen
>>>      >>>>   on a
>>>      >>>>   >     queue and we don't want them serialized with 
>>>anything.       To
>>>      >>>>   synchronize
>>>      >>>>   >     with submit, we'll have a syncobj in the 
>>>VkDevice which
>>>      is
>>>      >>>>   signaled by
>>>      >>>>   >     all immediate bind operations and make submits 
>>>wait on
>>>      it.
>>>      >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>>>which may
>>>      be the
>>>      >>>>   same as
>>>      >>>>   >     a render/compute queue or may be its own 
>>>queue.  It's up
>>>      to us
>>>      >>>>   what we
>>>      >>>>   >     want to advertise.  From the Vulkan API PoV, 
>>>this is like
>>>      any
>>>      >>>>   other
>>>      >>>>   >     queue.  Operations on it wait on and signal 
>>>semaphores.       If we
>>>      >>>>   have a
>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>      >>>>signal just like
>>>      >>>>   we do
>>>      >>>>   >     in execbuf().
>>>      >>>>   >     The important thing is that we don't want one type of
>>>      >>>>operation to
>>>      >>>>   block
>>>      >>>>   >     on the other.  If immediate binds are blocking 
>>>on sparse
>>>      binds,
>>>      >>>>   it's
>>>      >>>>   >     going to cause over-synchronization issues.
>>>      >>>>   >     In terms of the internal implementation, I know that
>>>      >>>>there's going
>>>      >>>>   to be
>>>      >>>>   >     a lock on the VM and that we can't actually do these
>>>      things in
>>>      >>>>   >     parallel.  That's fine.  Once the dma_fences have
>>>      signaled and
>>>      >>>>   we're
>>>      >>>>
>>>      >>>>   Thats correct. It is like a single VM_BIND engine with
>>>      >>>>multiple queues
>>>      >>>>   feeding to it.
>>>      >>>>
>>>      >>>> Right.  As long as the queues themselves are independent and
>>>      >>>>can block on
>>>      >>>> dma_fences without holding up other queues, I think 
>>>we're fine.
>>>      >>>>
>>>      >>>>   >     unblocked to do the bind operation, I don't care if
>>>      >>>>there's a bit
>>>      >>>>   of
>>>      >>>>   >     synchronization due to locking.  That's 
>>>expected.  What
>>>      >>>>we can't
>>>      >>>>   afford
>>>      >>>>   >     to have is an immediate bind operation suddenly 
>>>blocking
>>>      on a
>>>      >>>>   sparse
>>>      >>>>   >     operation which is blocked on a compute job 
>>>that's going
>>>      to run
>>>      >>>>   for
>>>      >>>>   >     another 5ms.
>>>      >>>>
>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>doesn't block
>>>      the
>>>      >>>>   VM_BIND
>>>      >>>>   on other VMs. I am not sure about usecases here, but just
>>>      wanted to
>>>      >>>>   clarify.
>>>      >>>>
>>>      >>>> Yes, that's what I would expect.
>>>      >>>> --Jason
>>>      >>>>
>>>      >>>>   Niranjana
>>>      >>>>
>>>      >>>>   >     For reference, Windows solves this by allowing
>>>      arbitrarily many
>>>      >>>>   paging
>>>      >>>>   >     queues (what they call a VM_BIND engine/queue).  That
>>>      >>>>design works
>>>      >>>>   >     pretty well and solves the problems in 
>>>question.       >>>>Again, we could
>>>      >>>>   just
>>>      >>>>   >     make everything out-of-order and require using 
>>>syncobjs
>>>      >>>>to order
>>>      >>>>   things
>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>      >>>>   >     One more note while I'm here: danvet said 
>>>something on
>>>      >>>>IRC about
>>>      >>>>   VM_BIND
>>>      >>>>   >     queues waiting for syncobjs to materialize.  We don't
>>>      really
>>>      >>>>   want/need
>>>      >>>>   >     this.  We already have all the machinery in 
>>>userspace to
>>>      handle
>>>      >>>>   >     wait-before-signal and waiting for syncobj fences to
>>>      >>>>materialize
>>>      >>>>   and
>>>      >>>>   >     that machinery is on by default.  It would actually
>>>      >>>>take MORE work
>>>      >>>>   in
>>>      >>>>   >     Mesa to turn it off and take advantage of the kernel
>>>      >>>>being able to
>>>      >>>>   wait
>>>      >>>>   >     for syncobjs to materialize. Also, getting that 
>>>right is
>>>      >>>>   ridiculously
>>>      >>>>   >     hard and I really don't want to get it wrong in 
>>>kernel
>>>      >>>>space.     When we
>>>      >>>>   >     do memory fences, wait-before-signal will be a 
>>>thing.  We
>>>      don't
>>>      >>>>   need to
>>>      >>>>   >     try and make it a thing for syncobj.
>>>      >>>>   >     --Jason
>>>      >>>>   >
>>>      >>>>   >   Thanks Jason,
>>>      >>>>   >
>>>      >>>>   >   I missed the bit in the Vulkan spec that we're 
>>>allowed to
>>>      have a
>>>      >>>>   sparse
>>>      >>>>   >   queue that does not implement either graphics or 
>>>compute
>>>      >>>>operations
>>>      >>>>   :
>>>      >>>>   >
>>>      >>>>   >     "While some implementations may include
>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>      >>>>   >     support in queue families that also include
>>>      >>>>   >
>>>      >>>>   >      graphics and compute support, other 
>>>implementations may
>>>      only
>>>      >>>>   expose a
>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>      >>>>   >
>>>      >>>>   >      family."
>>>      >>>>   >
>>>      >>>>   >   So it can all be all a vm_bind engine that just does
>>>      bind/unbind
>>>      >>>>   >   operations.
>>>      >>>>   >
>>>      >>>>   >   But yes we need another engine for the 
>>>immediate/non-sparse
>>>      >>>>   operations.
>>>      >>>>   >
>>>      >>>>   >   -Lionel
>>>      >>>>   >
>>>      >>>>   >         >
>>>      >>>>   >       Daniel, any thoughts?
>>>      >>>>   >
>>>      >>>>   >       Niranjana
>>>      >>>>   >
>>>      >>>>   >       >Matt
>>>      >>>>   >       >
>>>      >>>>   >       >>
>>>      >>>>   >       >> Sorry I noticed this late.
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>>>      >>>>   >       >> -Lionel
>>>      >>>>   >       >>
>>>      >>>>   >       >>
>
>
Lionel Landwerlin June 10, 2022, 8:18 a.m. UTC | #30
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>> On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>   On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>
>>>>     On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>     <niranjana.vishwanathapura@intel.com> wrote:
>>>>
>>>>       On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>>       >
>>>>       >
>>>>       >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>       >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>> Vishwanathapura
>>>>       wrote:
>>>>       >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand 
>>>> wrote:
>>>>       >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>>       >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>       >>>>
>>>>       >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel 
>>>> Landwerlin
>>>>       wrote:
>>>>       >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>       >>>>   >
>>>>       >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>> Vishwanathapura
>>>>       >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>       >>>>   >
>>>>       >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>       >>>>Brost wrote:
>>>>       >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>>       Landwerlin
>>>>       >>>>   wrote:
>>>>       >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>       wrote:
>>>>       >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>       >>>>   binding/unbinding
>>>>       >>>>   >       the mapping in an
>>>>       >>>>   >       >> > +async worker. The binding and unbinding 
>>>> will
>>>>       >>>>work like a
>>>>       >>>>   special
>>>>       >>>>   >       GPU engine.
>>>>       >>>>   >       >> > +The binding and unbinding operations are
>>>>       serialized and
>>>>       >>>>   will
>>>>       >>>>   >       wait on specified
>>>>       >>>>   >       >> > +input fences before the operation and 
>>>> will signal
>>>>       the
>>>>       >>>>   output
>>>>       >>>>   >       fences upon the
>>>>       >>>>   >       >> > +completion of the operation. Due to
>>>>       serialization,
>>>>       >>>>   completion of
>>>>       >>>>   >       an operation
>>>>       >>>>   >       >> > +will also indicate that all previous 
>>>> operations
>>>>       >>>>are also
>>>>       >>>>   >       complete.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I guess we should avoid saying "will 
>>>> immediately
>>>>       start
>>>>       >>>>   >       binding/unbinding" if
>>>>       >>>>   >       >> there are fences involved.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> And the fact that it's happening in an async
>>>>       >>>>worker seem to
>>>>       >>>>   imply
>>>>       >>>>   >       it's not
>>>>       >>>>   >       >> immediate.
>>>>       >>>>   >       >>
>>>>       >>>>   >
>>>>       >>>>   >       Ok, will fix.
>>>>       >>>>   >       This was added because in earlier design 
>>>> binding was
>>>>       deferred
>>>>       >>>>   until
>>>>       >>>>   >       next execbuff.
>>>>       >>>>   >       But now it is non-deferred (immediate in that 
>>>> sense).
>>>>       >>>>But yah,
>>>>       >>>>   this is
>>>>       >>>>   >       confusing
>>>>       >>>>   >       and will fix it.
>>>>       >>>>   >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I have a question on the behavior of the bind
>>>>       >>>>operation when
>>>>       >>>>   no
>>>>       >>>>   >       input fence
>>>>       >>>>   >       >> is provided. Let say I do :
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence1)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence2)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (out_fence=fence3)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In what order are the fences going to be 
>>>> signaled?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In the order of VM_BIND ioctls? Or out of 
>>>> order?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Because you wrote "serialized I assume it's 
>>>> : in
>>>>       order
>>>>       >>>>   >       >>
>>>>       >>>>   >
>>>>       >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. 
>>>> Note that
>>>>       >>>>bind and
>>>>       >>>>   unbind
>>>>       >>>>   >       will use
>>>>       >>>>   >       the same queue and hence are ordered.
>>>>       >>>>   >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> One thing I didn't realize is that because 
>>>> we only
>>>>       get one
>>>>       >>>>   >       "VM_BIND" engine,
>>>>       >>>>   >       >> there is a disconnect from the Vulkan 
>>>> specification.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> In Vulkan VM_BIND operations are serialized 
>>>> but
>>>>       >>>>per engine.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> So you could have something like this :
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>       out_fence=fence2)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>       out_fence=fence4)
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> fence1 is not signaled
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> fence3 is signaled
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> So the second VM_BIND will proceed before the
>>>>       >>>>first VM_BIND.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> I guess we can deal with that scenario in
>>>>       >>>>userspace by doing
>>>>       >>>>   the
>>>>       >>>>   >       wait
>>>>       >>>>   >       >> ourselves in one thread per engines.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> But then it makes the VM_BIND input fences 
>>>> useless.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Daniel : what do you think? Should be 
>>>> rework this or
>>>>       just
>>>>       >>>>   deal with
>>>>       >>>>   >       wait
>>>>       >>>>   >       >> fences in userspace?
>>>>       >>>>   >       >>
>>>>       >>>>   >       >
>>>>       >>>>   >       >My opinion is rework this but make the 
>>>> ordering via
>>>>       >>>>an engine
>>>>       >>>>   param
>>>>       >>>>   >       optional.
>>>>       >>>>   >       >
>>>>       >>>>   >       >e.g. A VM can be configured so all binds are 
>>>> ordered
>>>>       >>>>within the
>>>>       >>>>   VM
>>>>       >>>>   >       >
>>>>       >>>>   >       >e.g. A VM can be configured so all binds 
>>>> accept an
>>>>       engine
>>>>       >>>>   argument
>>>>       >>>>   >       (in
>>>>       >>>>   >       >the case of the i915 likely this is a gem 
>>>> context
>>>>       >>>>handle) and
>>>>       >>>>   binds
>>>>       >>>>   >       >ordered with respect to that engine.
>>>>       >>>>   >       >
>>>>       >>>>   >       >This gives UMDs options as the later likely 
>>>> consumes
>>>>       >>>>more KMD
>>>>       >>>>   >       resources
>>>>       >>>>   >       >so if a different UMD can live with binds being
>>>>       >>>>ordered within
>>>>       >>>>   the VM
>>>>       >>>>   >       >they can use a mode consuming less resources.
>>>>       >>>>   >       >
>>>>       >>>>   >
>>>>       >>>>   >       I think we need to be careful here if we are 
>>>> looking
>>>>       for some
>>>>       >>>>   out of
>>>>       >>>>   >       (submission) order completion of vm_bind/unbind.
>>>>       >>>>   >       In-order completion means, in a batch of binds 
>>>> and
>>>>       >>>>unbinds to be
>>>>       >>>>   >       completed in-order, user only needs to specify
>>>>       >>>>in-fence for the
>>>>       >>>>   >       first bind/unbind call and the our-fence for 
>>>> the last
>>>>       >>>>   bind/unbind
>>>>       >>>>   >       call. Also, the VA released by an unbind call 
>>>> can be
>>>>       >>>>re-used by
>>>>       >>>>   >       any subsequent bind call in that in-order batch.
>>>>       >>>>   >
>>>>       >>>>   >       These things will break if binding/unbinding 
>>>> were to
>>>>       >>>>be allowed
>>>>       >>>>   to
>>>>       >>>>   >       go out of order (of submission) and user need 
>>>> to be
>>>>       extra
>>>>       >>>>   careful
>>>>       >>>>   >       not to run into pre-mature triggereing of 
>>>> out-fence and
>>>>       bind
>>>>       >>>>   failing
>>>>       >>>>   >       as VA is still in use etc.
>>>>       >>>>   >
>>>>       >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>>       specified
>>>>       >>>>   address
>>>>       >>>>   >       space
>>>>       >>>>   >       (VM). So, the uapi is not engine/context 
>>>> specific.
>>>>       >>>>   >
>>>>       >>>>   >       We can however add a 'queue' to the uapi which 
>>>> can be
>>>>       >>>>one from
>>>>       >>>>   the
>>>>       >>>>   >       pre-defined queues,
>>>>       >>>>   >       I915_VM_BIND_QUEUE_0
>>>>       >>>>   >       I915_VM_BIND_QUEUE_1
>>>>       >>>>   >       ...
>>>>       >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>       >>>>   >
>>>>       >>>>   >       KMD will spawn an async work queue for each 
>>>> queue which
>>>>       will
>>>>       >>>>   only
>>>>       >>>>   >       bind the mappings on that queue in the order of
>>>>       submission.
>>>>       >>>>   >       User can assign the queue to per engine or 
>>>> anything
>>>>       >>>>like that.
>>>>       >>>>   >
>>>>       >>>>   >       But again here, user need to be careful and not
>>>>       >>>>deadlock these
>>>>       >>>>   >       queues with circular dependency of fences.
>>>>       >>>>   >
>>>>       >>>>   >       I prefer adding this later an as extension 
>>>> based on
>>>>       >>>>whether it
>>>>       >>>>   >       is really helping with the implementation.
>>>>       >>>>   >
>>>>       >>>>   >     I can tell you right now that having everything 
>>>> on a
>>>>       single
>>>>       >>>>   in-order
>>>>       >>>>   >     queue will not get us the perf we want.  What 
>>>> vulkan
>>>>       >>>>really wants
>>>>       >>>>   is one
>>>>       >>>>   >     of two things:
>>>>       >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>>       happen in
>>>>       >>>>   whatever
>>>>       >>>>   >     their dependencies are resolved and we ensure 
>>>> ordering
>>>>       >>>>ourselves
>>>>       >>>>   by
>>>>       >>>>   >     having a syncobj in the VkQueue.
>>>>       >>>>   >      2. The ability to create multiple VM_BIND 
>>>> queues.  We
>>>>       need at
>>>>       >>>>   least 2
>>>>       >>>>   >     but I don't see why there needs to be a limit 
>>>> besides
>>>>       >>>>the limits
>>>>       >>>>   the
>>>>       >>>>   >     i915 API already has on the number of engines.  
>>>> Vulkan
>>>>       could
>>>>       >>>>   expose
>>>>       >>>>   >     multiple sparse binding queues to the client if 
>>>> it's not
>>>>       >>>>   arbitrarily
>>>>       >>>>   >     limited.
>>>>       >>>>
>>>>       >>>>   Thanks Jason, Lionel.
>>>>       >>>>
>>>>       >>>>   Jason, what are you referring to when you say "limits 
>>>> the i915
>>>>       API
>>>>       >>>>   already
>>>>       >>>>   has on the number of engines"? I am not sure if there 
>>>> is such
>>>>       an uapi
>>>>       >>>>   today.
>>>>       >>>>
>>>>       >>>> There's a limit of something like 64 total engines today 
>>>> based on
>>>>       the
>>>>       >>>> number of bits we can cram into the exec flags in 
>>>> execbuffer2.  I
>>>>       think
>>>>       >>>> someone had an extended version that allowed more but I 
>>>> ripped it
>>>>       out
>>>>       >>>> because no one was using it.  Of course, execbuffer3 
>>>> might not
>>>>       >>>>have that
>>>>       >>>> problem at all.
>>>>       >>>>
>>>>       >>>
>>>>       >>>Thanks Jason.
>>>>       >>>Ok, I am not sure which exec flag is that, but yah, 
>>>> execbuffer3
>>>>       probably
>>>>       >>>will not have this limiation. So, we need to define a
>>>>       VM_BIND_MAX_QUEUE
>>>>       >>>and somehow export it to user (I am thinking of embedding 
>>>> it in
>>>>       >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>>       meaning 2^n
>>>>       >>>queues.
>>>>       >>
>>>>       >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) 
>>>> which
>>>>       execbuf3
>>>>
>>>>     Yup!  That's exactly the limit I was talking about.
>>>>
>>>>       >>will also have. So, we can simply define in vm_bind/unbind
>>>>       structures,
>>>>       >>
>>>>       >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>       >>        __u32 queue;
>>>>       >>
>>>>       >>I think that will keep things simple.
>>>>       >
>>>>       >Hmmm? What does execbuf2 limit has to do with how many engines
>>>>       >hardware can have? I suggest not to do that.
>>>>       >
>>>>       >Change with added this:
>>>>       >
>>>>       >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>       >               return -EINVAL;
>>>>       >
>>>>       >To context creation needs to be undone and so let users 
>>>> create engine
>>>>       >maps with all hardware engines, and let execbuf3 access them 
>>>> all.
>>>>       >
>>>>
>>>>       Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to 
>>>> execbuff3 also.
>>>>       Hence, I was using the same limit for VM_BIND queues (64, or 
>>>> 65 if we
>>>>       make it N+1).
>>>>       But, as discussed in other thread of this RFC series, we are 
>>>> planning
>>>>       to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>>>>       any uapi that limits the number of engines (and hence the 
>>>> vm_bind
>>>>       queues
>>>>       need to be supported).
>>>>
>>>>       If we leave the number of vm_bind queues to be arbitrarily large
>>>>       (__u32 queue_idx) then, we need to have a hashmap for queue 
>>>> (a wq,
>>>>       work_item and a linked list) lookup from the user specified 
>>>> queue
>>>>       index.
>>>>       Other option is to just put some hard limit (say 64 or 65) 
>>>> and use
>>>>       an array of queues in VM (each created upon first use). I 
>>>> prefer this.
>>>>
>>>>     I don't get why a VM_BIND queue is any different from any other 
>>>> queue or
>>>>     userspace-visible kernel object.  But I'll leave those details 
>>>> up to
>>>>     danvet or whoever else might be reviewing the implementation.
>>>>     --Jason
>>>>
>>>>   I kind of agree here. Wouldn't be simpler to have the bind queue 
>>>> created
>>>>   like the others when we build the engine map?
>>>>
>>>>   For userspace it's then just matter of selecting the right queue 
>>>> ID when
>>>>   submitting.
>>>>
>>>>   If there is ever a possibility to have this work on the GPU, it 
>>>> would be
>>>>   all ready.
>>>>
>>>
>>> I did sync offline with Matt Brost on this.
>>> We can add a VM_BIND engine class and let user create VM_BIND 
>>> engines (queues).
>>> The problem is, in i915 engine creating interface is bound to 
>>> gem_context.
>>> So, in vm_bind ioctl, we would need both context_id and queue_idx 
>>> for proper
>>> lookup of the user created engine. This is bit ackward as vm_bind is an
>>> interface to VM (address space) and has nothing to do with gem_context.
>>
>>
>> A gem_context has a single vm object right?
>>
>> Set through I915_CONTEXT_PARAM_VM at creation or given a default one 
>> if not.
>>
>> So it's just like picking up the vm like it's done at execbuffer time 
>> right now : eb->context->vm
>>
>
> Are you suggesting replacing 'vm_id' with 'context_id' in the 
> VM_BIND/UNBIND
> ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be 
> obtained
> from the context?


Yes, because if we go for engines, they're associated with a context and 
so also associated with the VM bound to the context.
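
(Illustrative aside: a minimal sketch of this point in plain C, with
made-up type names rather than the actual i915 code. The idea is that a
gem_context holds exactly one VM, so a context-based bind path can
always recover the address space from the context, just as execbuf's
eb->context->vm lookup does today.)

  #include <stddef.h>

  struct vm      { int id; };           /* stand-in for an address space (VM) */
  struct context { struct vm *vm; };    /* a gem_context holds a single VM    */

  /* With a context-based bind interface the VM never needs to be passed
   * explicitly; it is recovered from the context, as execbuf does today.
   */
  static struct vm *bind_resolve_vm(const struct context *ctx)
  {
          return ctx ? ctx->vm : NULL;
  }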


> I think the interface is clean as a interface to VM. It is only that we
> don't have a clean way to create a raw VM_BIND engine (not associated 
> with
> any context) with i915 uapi.
> May be we can add such an interface, but I don't think that is worth it
> (we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I 
> mentioned
> above).
> Anyone has any thoughts?
>
>>
>>> Another problem is, if two VMs are binding with the same defined 
>>> engine,
>>> binding on VM1 can get unnecessary blocked by binding on VM2 (which 
>>> may be
>>> waiting on its in_fence).
>>
>>
>> Maybe I'm missing something, but how can you have 2 vm objects with a 
>> single gem_context right now?
>>
>
> No, we don't have 2 VMs for a gem_context.
> Say if ctx1 with vm1 and ctx2 with vm2.
> First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If 
> those two queue indicies points to same underlying vm_bind engine,
> then the second vm_bind call gets blocked until the first vm_bind call's
> 'in' fence is triggered and bind completes.
>
> With per VM queues, this is not a problem as two VMs will not endup
> sharing same queue.
>
> BTW, I just posted a updated PATCH series.
> https://www.spinics.net/lists/dri-devel/msg350483.html
>
> Niranjana
>
>>
>>>
>>> So, my preference here is to just add a 'u32 queue' index in 
>>> vm_bind/unbind
>>> ioctl, and the queues are per VM.
>>>
>>> Niranjana
>>>
>>>>   Thanks,
>>>>
>>>>   -Lionel
>>>>
>>>>
>>>>       Niranjana
>>>>
>>>>       >Regards,
>>>>       >
>>>>       >Tvrtko
>>>>       >
>>>>       >>
>>>>       >>Niranjana
>>>>       >>
>>>>       >>>
>>>>       >>>>   I am trying to see how many queues we need and don't 
>>>> want it to
>>>>       be
>>>>       >>>>   arbitrarily
>>>>       >>>>   large and unduely blow up memory usage and complexity 
>>>> in i915
>>>>       driver.
>>>>       >>>>
>>>>       >>>> I expect a Vulkan driver to use at most 2 in the vast 
>>>> majority
>>>>       >>>>of cases. I
>>>>       >>>> could imagine a client wanting to create more than 1 sparse
>>>>       >>>>queue in which
>>>>       >>>> case, it'll be N+1 but that's unlikely. As far as 
>>>> complexity
>>>>       >>>>goes, once
>>>>       >>>> you allow two, I don't think the complexity is going up by
>>>>       >>>>allowing N.  As
>>>>       >>>> for memory usage, creating more queues means more 
>>>> memory.  That's
>>>>       a
>>>>       >>>> trade-off that userspace can make. Again, the expected 
>>>> number
>>>>       >>>>here is 1
>>>>       >>>> or 2 in the vast majority of cases so I don't think you 
>>>> need to
>>>>       worry.
>>>>       >>>
>>>>       >>>Ok, will start with n=3 meaning 8 queues.
>>>>       >>>That would require us create 8 workqueues.
>>>>       >>>We can change 'n' later if required.
>>>>       >>>
>>>>       >>>Niranjana
>>>>       >>>
>>>>       >>>>
>>>>       >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>>>       >>>>operations and we
>>>>       >>>>   don't
>>>>       >>>>   >     want any dependencies between them:
>>>>       >>>>   >      1. Immediate.  These happen right after BO 
>>>> creation or
>>>>       >>>>maybe as
>>>>       >>>>   part of
>>>>       >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>>       >>>>don't happen
>>>>       >>>>   on a
>>>>       >>>>   >     queue and we don't want them serialized with 
>>>> anything.       To
>>>>       >>>>   synchronize
>>>>       >>>>   >     with submit, we'll have a syncobj in the 
>>>> VkDevice which
>>>>       is
>>>>       >>>>   signaled by
>>>>       >>>>   >     all immediate bind operations and make submits 
>>>> wait on
>>>>       it.
>>>>       >>>>   >      2. Queued (sparse): These happen on a VkQueue 
>>>> which may
>>>>       be the
>>>>       >>>>   same as
>>>>       >>>>   >     a render/compute queue or may be its own queue.  
>>>> It's up
>>>>       to us
>>>>       >>>>   what we
>>>>       >>>>   >     want to advertise.  From the Vulkan API PoV, 
>>>> this is like
>>>>       any
>>>>       >>>>   other
>>>>       >>>>   >     queue.  Operations on it wait on and signal 
>>>> semaphores.       If we
>>>>       >>>>   have a
>>>>       >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>>       >>>>signal just like
>>>>       >>>>   we do
>>>>       >>>>   >     in execbuf().
>>>>       >>>>   >     The important thing is that we don't want one 
>>>> type of
>>>>       >>>>operation to
>>>>       >>>>   block
>>>>       >>>>   >     on the other.  If immediate binds are blocking 
>>>> on sparse
>>>>       binds,
>>>>       >>>>   it's
>>>>       >>>>   >     going to cause over-synchronization issues.
>>>>       >>>>   >     In terms of the internal implementation, I know 
>>>> that
>>>>       >>>>there's going
>>>>       >>>>   to be
>>>>       >>>>   >     a lock on the VM and that we can't actually do 
>>>> these
>>>>       things in
>>>>       >>>>   >     parallel.  That's fine. Once the dma_fences have
>>>>       signaled and
>>>>       >>>>   we're
>>>>       >>>>
>>>>       >>>>   Thats correct. It is like a single VM_BIND engine with
>>>>       >>>>multiple queues
>>>>       >>>>   feeding to it.
>>>>       >>>>
>>>>       >>>> Right.  As long as the queues themselves are independent 
>>>> and
>>>>       >>>>can block on
>>>>       >>>> dma_fences without holding up other queues, I think 
>>>> we're fine.
>>>>       >>>>
>>>>       >>>>   >     unblocked to do the bind operation, I don't care if
>>>>       >>>>there's a bit
>>>>       >>>>   of
>>>>       >>>>   >     synchronization due to locking.  That's 
>>>> expected.  What
>>>>       >>>>we can't
>>>>       >>>>   afford
>>>>       >>>>   >     to have is an immediate bind operation suddenly 
>>>> blocking
>>>>       on a
>>>>       >>>>   sparse
>>>>       >>>>   >     operation which is blocked on a compute job 
>>>> that's going
>>>>       to run
>>>>       >>>>   for
>>>>       >>>>   >     another 5ms.
>>>>       >>>>
>>>>       >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>> doesn't block
>>>>       the
>>>>       >>>>   VM_BIND
>>>>       >>>>   on other VMs. I am not sure about usecases here, but just
>>>>       wanted to
>>>>       >>>>   clarify.
>>>>       >>>>
>>>>       >>>> Yes, that's what I would expect.
>>>>       >>>> --Jason
>>>>       >>>>
>>>>       >>>>   Niranjana
>>>>       >>>>
>>>>       >>>>   >     For reference, Windows solves this by allowing
>>>>       arbitrarily many
>>>>       >>>>   paging
>>>>       >>>>   >     queues (what they call a VM_BIND engine/queue).  
>>>> That
>>>>       >>>>design works
>>>>       >>>>   >     pretty well and solves the problems in question. 
>>>>       >>>>Again, we could
>>>>       >>>>   just
>>>>       >>>>   >     make everything out-of-order and require using 
>>>> syncobjs
>>>>       >>>>to order
>>>>       >>>>   things
>>>>       >>>>   >     as userspace wants. That'd be fine too.
>>>>       >>>>   >     One more note while I'm here: danvet said 
>>>> something on
>>>>       >>>>IRC about
>>>>       >>>>   VM_BIND
>>>>       >>>>   >     queues waiting for syncobjs to materialize.  We 
>>>> don't
>>>>       really
>>>>       >>>>   want/need
>>>>       >>>>   >     this.  We already have all the machinery in 
>>>> userspace to
>>>>       handle
>>>>       >>>>   >     wait-before-signal and waiting for syncobj 
>>>> fences to
>>>>       >>>>materialize
>>>>       >>>>   and
>>>>       >>>>   >     that machinery is on by default.  It would actually
>>>>       >>>>take MORE work
>>>>       >>>>   in
>>>>       >>>>   >     Mesa to turn it off and take advantage of the 
>>>> kernel
>>>>       >>>>being able to
>>>>       >>>>   wait
>>>>       >>>>   >     for syncobjs to materialize. Also, getting that 
>>>> right is
>>>>       >>>>   ridiculously
>>>>       >>>>   >     hard and I really don't want to get it wrong in 
>>>> kernel
>>>>       >>>>space.   When we
>>>>       >>>>   >     do memory fences, wait-before-signal will be a 
>>>> thing.  We
>>>>       don't
>>>>       >>>>   need to
>>>>       >>>>   >     try and make it a thing for syncobj.
>>>>       >>>>   >     --Jason
>>>>       >>>>   >
>>>>       >>>>   >   Thanks Jason,
>>>>       >>>>   >
>>>>       >>>>   >   I missed the bit in the Vulkan spec that we're 
>>>> allowed to
>>>>       have a
>>>>       >>>>   sparse
>>>>       >>>>   >   queue that does not implement either graphics or 
>>>> compute
>>>>       >>>>operations
>>>>       >>>>   :
>>>>       >>>>   >
>>>>       >>>>   >     "While some implementations may include
>>>>       >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>       >>>>   >     support in queue families that also include
>>>>       >>>>   >
>>>>       >>>>   >      graphics and compute support, other 
>>>> implementations may
>>>>       only
>>>>       >>>>   expose a
>>>>       >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>       >>>>   >
>>>>       >>>>   >      family."
>>>>       >>>>   >
>>>>       >>>>   >   So it can all be all a vm_bind engine that just does
>>>>       bind/unbind
>>>>       >>>>   >   operations.
>>>>       >>>>   >
>>>>       >>>>   >   But yes we need another engine for the 
>>>> immediate/non-sparse
>>>>       >>>>   operations.
>>>>       >>>>   >
>>>>       >>>>   >   -Lionel
>>>>       >>>>   >
>>>>       >>>>   >         >
>>>>       >>>>   >       Daniel, any thoughts?
>>>>       >>>>   >
>>>>       >>>>   >       Niranjana
>>>>       >>>>   >
>>>>       >>>>   >       >Matt
>>>>       >>>>   >       >
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> Sorry I noticed this late.
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>>>       >>>>   >       >> -Lionel
>>>>       >>>>   >       >>
>>>>       >>>>   >       >>
>>
>>
Niranjana Vishwanathapura June 10, 2022, 5:42 p.m. UTC | #31
On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>>
>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>>    <niranjana.vishwanathapura@intel.com> wrote:
>>>>>
>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>>>>>      >
>>>>>      >
>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
>>>>>Vishwanathapura
>>>>>      wrote:
>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason 
>>>>>Ekstrand wrote:
>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>      >>>>
>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel 
>>>>>Landwerlin
>>>>>      wrote:
>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>>      >>>>   >
>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
>>>>>Vishwanathapura
>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>>      >>>>   >
>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>>      >>>>Brost wrote:
>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
>>>>>      Landwerlin
>>>>>      >>>>   wrote:
>>>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>>      wrote:
>>>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>>      >>>>   binding/unbinding
>>>>>      >>>>   >       the mapping in an
>>>>>      >>>>   >       >> > +async worker. The binding and 
>>>>>unbinding will
>>>>>      >>>>work like a
>>>>>      >>>>   special
>>>>>      >>>>   >       GPU engine.
>>>>>      >>>>   >       >> > +The binding and unbinding operations are
>>>>>      serialized and
>>>>>      >>>>   will
>>>>>      >>>>   >       wait on specified
>>>>>      >>>>   >       >> > +input fences before the operation 
>>>>>and will signal
>>>>>      the
>>>>>      >>>>   output
>>>>>      >>>>   >       fences upon the
>>>>>      >>>>   >       >> > +completion of the operation. Due to
>>>>>      serialization,
>>>>>      >>>>   completion of
>>>>>      >>>>   >       an operation
>>>>>      >>>>   >       >> > +will also indicate that all 
>>>>>previous operations
>>>>>      >>>>are also
>>>>>      >>>>   >       complete.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I guess we should avoid saying "will 
>>>>>immediately
>>>>>      start
>>>>>      >>>>   >       binding/unbinding" if
>>>>>      >>>>   >       >> there are fences involved.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> And the fact that it's happening in an async
>>>>>      >>>>worker seem to
>>>>>      >>>>   imply
>>>>>      >>>>   >       it's not
>>>>>      >>>>   >       >> immediate.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >
>>>>>      >>>>   >       Ok, will fix.
>>>>>      >>>>   >       This was added because in earlier design 
>>>>>binding was
>>>>>      deferred
>>>>>      >>>>   until
>>>>>      >>>>   >       next execbuff.
>>>>>      >>>>   >       But now it is non-deferred (immediate in 
>>>>>that sense).
>>>>>      >>>>But yah,
>>>>>      >>>>   this is
>>>>>      >>>>   >       confusing
>>>>>      >>>>   >       and will fix it.
>>>>>      >>>>   >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I have a question on the behavior of the bind
>>>>>      >>>>operation when
>>>>>      >>>>   no
>>>>>      >>>>   >       input fence
>>>>>      >>>>   >       >> is provided. Let say I do :
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In what order are the fences going to 
>>>>>be signaled?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out 
>>>>>of order?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Because you wrote "serialized I assume 
>>>>>it's : in
>>>>>      order
>>>>>      >>>>   >       >>
>>>>>      >>>>   >
>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND 
>>>>>ioctls. Note that
>>>>>      >>>>bind and
>>>>>      >>>>   unbind
>>>>>      >>>>   >       will use
>>>>>      >>>>   >       the same queue and hence are ordered.
>>>>>      >>>>   >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> One thing I didn't realize is that 
>>>>>because we only
>>>>>      get one
>>>>>      >>>>   >       "VM_BIND" engine,
>>>>>      >>>>   >       >> there is a disconnect from the Vulkan 
>>>>>specification.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> In Vulkan VM_BIND operations are 
>>>>>serialized but
>>>>>      >>>>per engine.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> So you could have something like this :
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>>      out_fence=fence2)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>>      out_fence=fence4)
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> fence1 is not signaled
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> fence3 is signaled
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> So the second VM_BIND will proceed before the
>>>>>      >>>>first VM_BIND.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> I guess we can deal with that scenario in
>>>>>      >>>>userspace by doing
>>>>>      >>>>   the
>>>>>      >>>>   >       wait
>>>>>      >>>>   >       >> ourselves in one thread per engines.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> But then it makes the VM_BIND input 
>>>>>fences useless.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Daniel : what do you think? Should be 
>>>>>rework this or
>>>>>      just
>>>>>      >>>>   deal with
>>>>>      >>>>   >       wait
>>>>>      >>>>   >       >> fences in userspace?
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >My opinion is rework this but make the 
>>>>>ordering via
>>>>>      >>>>an engine
>>>>>      >>>>   param
>>>>>      >>>>   >       optional.
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >e.g. A VM can be configured so all binds 
>>>>>are ordered
>>>>>      >>>>within the
>>>>>      >>>>   VM
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >e.g. A VM can be configured so all binds 
>>>>>accept an
>>>>>      engine
>>>>>      >>>>   argument
>>>>>      >>>>   >       (in
>>>>>      >>>>   >       >the case of the i915 likely this is a 
>>>>>gem context
>>>>>      >>>>handle) and
>>>>>      >>>>   binds
>>>>>      >>>>   >       >ordered with respect to that engine.
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >This gives UMDs options as the later 
>>>>>likely consumes
>>>>>      >>>>more KMD
>>>>>      >>>>   >       resources
>>>>>      >>>>   >       >so if a different UMD can live with binds being
>>>>>      >>>>ordered within
>>>>>      >>>>   the VM
>>>>>      >>>>   >       >they can use a mode consuming less resources.
>>>>>      >>>>   >       >
>>>>>      >>>>   >
>>>>>      >>>>   >       I think we need to be careful here if we 
>>>>>are looking
>>>>>      for some
>>>>>      >>>>   out of
>>>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
>>>>>      >>>>   >       In-order completion means, in a batch of 
>>>>>binds and
>>>>>      >>>>unbinds to be
>>>>>      >>>>   >       completed in-order, user only needs to specify
>>>>>      >>>>in-fence for the
>>>>>      >>>>   >       first bind/unbind call and the our-fence 
>>>>>for the last
>>>>>      >>>>   bind/unbind
>>>>>      >>>>   >       call. Also, the VA released by an unbind 
>>>>>call can be
>>>>>      >>>>re-used by
>>>>>      >>>>   >       any subsequent bind call in that in-order batch.
>>>>>      >>>>   >
>>>>>      >>>>   >       These things will break if 
>>>>>binding/unbinding were to
>>>>>      >>>>be allowed
>>>>>      >>>>   to
>>>>>      >>>>   >       go out of order (of submission) and user 
>>>>>need to be
>>>>>      extra
>>>>>      >>>>   careful
>>>>>      >>>>   >       not to run into pre-mature triggereing of 
>>>>>out-fence and
>>>>>      bind
>>>>>      >>>>   failing
>>>>>      >>>>   >       as VA is still in use etc.
>>>>>      >>>>   >
>>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
>>>>>      specified
>>>>>      >>>>   address
>>>>>      >>>>   >       space
>>>>>      >>>>   >       (VM). So, the uapi is not engine/context 
>>>>>specific.
>>>>>      >>>>   >
>>>>>      >>>>   >       We can however add a 'queue' to the uapi 
>>>>>which can be
>>>>>      >>>>one from
>>>>>      >>>>   the
>>>>>      >>>>   >       pre-defined queues,
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_0
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_1
>>>>>      >>>>   >       ...
>>>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
>>>>>      >>>>   >
>>>>>      >>>>   >       KMD will spawn an async work queue for 
>>>>>each queue which
>>>>>      will
>>>>>      >>>>   only
>>>>>      >>>>   >       bind the mappings on that queue in the order of
>>>>>      submission.
>>>>>      >>>>   >       User can assign the queue to per engine 
>>>>>or anything
>>>>>      >>>>like that.
>>>>>      >>>>   >
>>>>>      >>>>   >       But again here, user need to be careful and not
>>>>>      >>>>deadlock these
>>>>>      >>>>   >       queues with circular dependency of fences.
>>>>>      >>>>   >
>>>>>      >>>>   >       I prefer adding this later an as 
>>>>>extension based on
>>>>>      >>>>whether it
>>>>>      >>>>   >       is really helping with the implementation.
>>>>>      >>>>   >
>>>>>      >>>>   >     I can tell you right now that having 
>>>>>everything on a
>>>>>      single
>>>>>      >>>>   in-order
>>>>>      >>>>   >     queue will not get us the perf we want.  
>>>>>What vulkan
>>>>>      >>>>really wants
>>>>>      >>>>   is one
>>>>>      >>>>   >     of two things:
>>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
>>>>>      happen in
>>>>>      >>>>   whatever
>>>>>      >>>>   >     their dependencies are resolved and we 
>>>>>ensure ordering
>>>>>      >>>>ourselves
>>>>>      >>>>   by
>>>>>      >>>>   >     having a syncobj in the VkQueue.
>>>>>      >>>>   >      2. The ability to create multiple VM_BIND 
>>>>>queues.  We
>>>>>      need at
>>>>>      >>>>   least 2
>>>>>      >>>>   >     but I don't see why there needs to be a 
>>>>>limit besides
>>>>>      >>>>the limits
>>>>>      >>>>   the
>>>>>      >>>>   >     i915 API already has on the number of 
>>>>>engines.  Vulkan
>>>>>      could
>>>>>      >>>>   expose
>>>>>      >>>>   >     multiple sparse binding queues to the 
>>>>>client if it's not
>>>>>      >>>>   arbitrarily
>>>>>      >>>>   >     limited.
>>>>>      >>>>
>>>>>      >>>>   Thanks Jason, Lionel.
>>>>>      >>>>
>>>>>      >>>>   Jason, what are you referring to when you say 
>>>>>"limits the i915
>>>>>      API
>>>>>      >>>>   already
>>>>>      >>>>   has on the number of engines"? I am not sure if 
>>>>>there is such
>>>>>      an uapi
>>>>>      >>>>   today.
>>>>>      >>>>
>>>>>      >>>> There's a limit of something like 64 total engines 
>>>>>today based on
>>>>>      the
>>>>>      >>>> number of bits we can cram into the exec flags in 
>>>>>execbuffer2.  I
>>>>>      think
>>>>>      >>>> someone had an extended version that allowed more 
>>>>>but I ripped it
>>>>>      out
>>>>>      >>>> because no one was using it.  Of course, 
>>>>>execbuffer3 might not
>>>>>      >>>>have that
>>>>>      >>>> problem at all.
>>>>>      >>>>
>>>>>      >>>
>>>>>      >>>Thanks Jason.
>>>>>      >>>Ok, I am not sure which exec flag is that, but yah, 
>>>>>execbuffer3
>>>>>      probably
>>>>>      >>>will not have this limiation. So, we need to define a
>>>>>      VM_BIND_MAX_QUEUE
>>>>>      >>>and somehow export it to user (I am thinking of 
>>>>>embedding it in
>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
>>>>>      meaning 2^n
>>>>>      >>>queues.
>>>>>      >>
>>>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK 
>>>>>(0x3f) which
>>>>>      execbuf3
>>>>>
>>>>>    Yup!  That's exactly the limit I was talking about.
>>>>>
>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>>>      structures,
>>>>>      >>
>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>>      >>        __u32 queue;
>>>>>      >>
>>>>>      >>I think that will keep things simple.
>>>>>      >
>>>>>      >Hmmm? What does execbuf2 limit has to do with how many engines
>>>>>      >hardware can have? I suggest not to do that.
>>>>>      >
>>>>>      >Change with added this:
>>>>>      >
>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>>      >               return -EINVAL;
>>>>>      >
>>>>>      >To context creation needs to be undone and so let users 
>>>>>create engine
>>>>>      >maps with all hardware engines, and let execbuf3 access 
>>>>>them all.
>>>>>      >
>>>>>
>>>>>      Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to 
>>>>>execbuff3 also.
>>>>>      Hence, I was using the same limit for VM_BIND queues 
>>>>>(64, or 65 if we
>>>>>      make it N+1).
>>>>>      But, as discussed in other thread of this RFC series, we 
>>>>>are planning
>>>>>      to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
>>>>>      any uapi that limits the number of engines (and hence 
>>>>>the vm_bind
>>>>>      queues
>>>>>      need to be supported).
>>>>>
>>>>>      If we leave the number of vm_bind queues to be arbitrarily large
>>>>>      (__u32 queue_idx) then, we need to have a hashmap for 
>>>>>queue (a wq,
>>>>>      work_item and a linked list) lookup from the user 
>>>>>specified queue
>>>>>      index.
>>>>>      Other option is to just put some hard limit (say 64 or 
>>>>>65) and use
>>>>>      an array of queues in VM (each created upon first use). 
>>>>>I prefer this.
>>>>>
>>>>>    I don't get why a VM_BIND queue is any different from any 
>>>>>other queue or
>>>>>    userspace-visible kernel object.  But I'll leave those 
>>>>>details up to
>>>>>    danvet or whoever else might be reviewing the implementation.
>>>>>    --Jason
>>>>>
>>>>>  I kind of agree here. Wouldn't be simpler to have the bind 
>>>>>queue created
>>>>>  like the others when we build the engine map?
>>>>>
>>>>>  For userspace it's then just matter of selecting the right 
>>>>>queue ID when
>>>>>  submitting.
>>>>>
>>>>>  If there is ever a possibility to have this work on the GPU, 
>>>>>it would be
>>>>>  all ready.
>>>>>
>>>>
>>>>I did sync offline with Matt Brost on this.
>>>>We can add a VM_BIND engine class and let user create VM_BIND 
>>>>engines (queues).
>>>>The problem is, in i915 engine creating interface is bound to 
>>>>gem_context.
>>>>So, in vm_bind ioctl, we would need both context_id and 
>>>>queue_idx for proper
>>>>lookup of the user created engine. This is bit ackward as vm_bind is an
>>>>interface to VM (address space) and has nothing to do with gem_context.
>>>
>>>
>>>A gem_context has a single vm object right?
>>>
>>>Set through I915_CONTEXT_PARAM_VM at creation or given a default 
>>>one if not.
>>>
>>>So it's just like picking up the vm like it's done at execbuffer 
>>>time right now : eb->context->vm
>>>
>>
>>Are you suggesting replacing 'vm_id' with 'context_id' in the 
>>VM_BIND/UNBIND
>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be 
>>obtained
>>from the context?
>
>
>Yes, because if we go for engines, they're associated with a context 
>and so also associated with the VM bound to the context.
>

Hmm...context doesn't sound like the right interface. It should be
VM and engine (independent of context). An engine can be a virtual or
soft engine (kernel thread), each with its own queue. We can add an
interface to create such engines (independent of context), but we are
anyway implicitly creating one when the user uses a new queue_idx. If
in future we have hardware engines for the VM_BIND operation, we can
have that explicit interface to create engine instances, and the
queue_index in vm_bind/unbind will point to those engines.
Anyone have any thoughts? Daniel?
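
(Illustrative aside: a minimal sketch of the per-VM queue scheme, with
made-up names rather than the proposed uapi or implementation. Each VM
carries a small fixed array of bind queues, the bind/unbind call selects
one via a queue index, and a queue is created implicitly the first time
its index is used. Operations on the same queue complete in submission
order; different queues, and different VMs, never block each other.)

  #include <stdint.h>
  #include <stddef.h>

  #define VM_BIND_MAX_QUEUE 8            /* e.g. 2^n queues with n = 3 */

  struct bind_queue {
          int created;                   /* stands in for a wq + work item + list */
  };

  struct vm {
          struct bind_queue queues[VM_BIND_MAX_QUEUE];   /* per VM, not per ctx */
  };

  struct vm_bind_args {
          uint64_t start, length, obj_offset;
          uint32_t handle;
          uint32_t queue_idx;            /* selects the per-VM bind queue */
  };

  /* Look up (and lazily create) the queue named by queue_idx. Operations on
   * the same queue are ordered; different queues never block each other.
   */
  static struct bind_queue *get_bind_queue(struct vm *vm, uint32_t idx)
  {
          if (idx >= VM_BIND_MAX_QUEUE)
                  return NULL;                 /* the real ioctl would return -EINVAL */
          if (!vm->queues[idx].created)
                  vm->queues[idx].created = 1; /* created on first use */
          return &vm->queues[idx];
  }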

Niranjana

>
>>I think the interface is clean as a interface to VM. It is only that we
>>don't have a clean way to create a raw VM_BIND engine (not 
>>associated with
>>any context) with i915 uapi.
>>May be we can add such an interface, but I don't think that is worth it
>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I 
>>mentioned
>>above).
>>Anyone has any thoughts?
>>
>>>
>>>>Another problem is, if two VMs are binding with the same defined 
>>>>engine,
>>>>binding on VM1 can get unnecessary blocked by binding on VM2 
>>>>(which may be
>>>>waiting on its in_fence).
>>>
>>>
>>>Maybe I'm missing something, but how can you have 2 vm objects 
>>>with a single gem_context right now?
>>>
>>
>>No, we don't have 2 VMs for a gem_context.
>>Say if ctx1 with vm1 and ctx2 with vm2.
>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If 
>>those two queue indicies points to same underlying vm_bind engine,
>>then the second vm_bind call gets blocked until the first vm_bind call's
>>'in' fence is triggered and bind completes.
>>
>>With per VM queues, this is not a problem as two VMs will not endup
>>sharing same queue.
>>
>>BTW, I just posted a updated PATCH series.
>>https://www.spinics.net/lists/dri-devel/msg350483.html
>>
>>Niranjana
>>
>>>
>>>>
>>>>So, my preference here is to just add a 'u32 queue' index in 
>>>>vm_bind/unbind
>>>>ioctl, and the queues are per VM.
>>>>
>>>>Niranjana
>>>>
>>>>>  Thanks,
>>>>>
>>>>>  -Lionel
>>>>>
>>>>>
>>>>>      Niranjana
>>>>>
>>>>>      >Regards,
>>>>>      >
>>>>>      >Tvrtko
>>>>>      >
>>>>>      >>
>>>>>      >>Niranjana
>>>>>      >>
>>>>>      >>>
>>>>>      >>>>   I am trying to see how many queues we need and 
>>>>>don't want it to
>>>>>      be
>>>>>      >>>>   arbitrarily
>>>>>      >>>>   large and unduely blow up memory usage and 
>>>>>complexity in i915
>>>>>      driver.
>>>>>      >>>>
>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the 
>>>>>vast majority
>>>>>      >>>>of cases. I
>>>>>      >>>> could imagine a client wanting to create more than 1 sparse
>>>>>      >>>>queue in which
>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as 
>>>>>complexity
>>>>>      >>>>goes, once
>>>>>      >>>> you allow two, I don't think the complexity is going up by
>>>>>      >>>>allowing N.  As
>>>>>      >>>> for memory usage, creating more queues means more 
>>>>>memory.  That's
>>>>>      a
>>>>>      >>>> trade-off that userspace can make. Again, the 
>>>>>expected number
>>>>>      >>>>here is 1
>>>>>      >>>> or 2 in the vast majority of cases so I don't think 
>>>>>you need to
>>>>>      worry.
>>>>>      >>>
>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>>>      >>>That would require us create 8 workqueues.
>>>>>      >>>We can change 'n' later if required.
>>>>>      >>>
>>>>>      >>>Niranjana
>>>>>      >>>
>>>>>      >>>>
>>>>>      >>>>   >     Why?  Because Vulkan has two basic kind of bind
>>>>>      >>>>operations and we
>>>>>      >>>>   don't
>>>>>      >>>>   >     want any dependencies between them:
>>>>>      >>>>   >      1. Immediate.  These happen right after BO 
>>>>>creation or
>>>>>      >>>>maybe as
>>>>>      >>>>   part of
>>>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
>>>>>      >>>>don't happen
>>>>>      >>>>   on a
>>>>>      >>>>   >     queue and we don't want them serialized 
>>>>>with anything.       To
>>>>>      >>>>   synchronize
>>>>>      >>>>   >     with submit, we'll have a syncobj in the 
>>>>>VkDevice which
>>>>>      is
>>>>>      >>>>   signaled by
>>>>>      >>>>   >     all immediate bind operations and make 
>>>>>submits wait on
>>>>>      it.
>>>>>      >>>>   >      2. Queued (sparse): These happen on a 
>>>>>VkQueue which may
>>>>>      be the
>>>>>      >>>>   same as
>>>>>      >>>>   >     a render/compute queue or may be its own 
>>>>>queue.  It's up
>>>>>      to us
>>>>>      >>>>   what we
>>>>>      >>>>   >     want to advertise.  From the Vulkan API 
>>>>>PoV, this is like
>>>>>      any
>>>>>      >>>>   other
>>>>>      >>>>   >     queue.  Operations on it wait on and signal 
>>>>>semaphores.       If we
>>>>>      >>>>   have a
>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
>>>>>      >>>>signal just like
>>>>>      >>>>   we do
>>>>>      >>>>   >     in execbuf().
>>>>>      >>>>   >     The important thing is that we don't want 
>>>>>one type of
>>>>>      >>>>operation to
>>>>>      >>>>   block
>>>>>      >>>>   >     on the other.  If immediate binds are 
>>>>>blocking on sparse
>>>>>      binds,
>>>>>      >>>>   it's
>>>>>      >>>>   >     going to cause over-synchronization issues.
>>>>>      >>>>   >     In terms of the internal implementation, I 
>>>>>know that
>>>>>      >>>>there's going
>>>>>      >>>>   to be
>>>>>      >>>>   >     a lock on the VM and that we can't actually 
>>>>>do these
>>>>>      things in
>>>>>      >>>>   >     parallel.  That's fine. Once the dma_fences have
>>>>>      signaled and
>>>>>      >>>>   we're
>>>>>      >>>>
>>>>>      >>>>   Thats correct. It is like a single VM_BIND engine with
>>>>>      >>>>multiple queues
>>>>>      >>>>   feeding to it.
>>>>>      >>>>
>>>>>      >>>> Right.  As long as the queues themselves are 
>>>>>independent and
>>>>>      >>>>can block on
>>>>>      >>>> dma_fences without holding up other queues, I think 
>>>>>we're fine.
>>>>>      >>>>
>>>>>      >>>>   >     unblocked to do the bind operation, I don't care if
>>>>>      >>>>there's a bit
>>>>>      >>>>   of
>>>>>      >>>>   >     synchronization due to locking.  That's 
>>>>>expected.  What
>>>>>      >>>>we can't
>>>>>      >>>>   afford
>>>>>      >>>>   >     to have is an immediate bind operation 
>>>>>suddenly blocking
>>>>>      on a
>>>>>      >>>>   sparse
>>>>>      >>>>   >     operation which is blocked on a compute job 
>>>>>that's going
>>>>>      to run
>>>>>      >>>>   for
>>>>>      >>>>   >     another 5ms.
>>>>>      >>>>
>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM 
>>>>>doesn't block
>>>>>      the
>>>>>      >>>>   VM_BIND
>>>>>      >>>>   on other VMs. I am not sure about usecases here, but just
>>>>>      wanted to
>>>>>      >>>>   clarify.
>>>>>      >>>>
>>>>>      >>>> Yes, that's what I would expect.
>>>>>      >>>> --Jason
>>>>>      >>>>
>>>>>      >>>>   Niranjana
>>>>>      >>>>
>>>>>      >>>>   >     For reference, Windows solves this by allowing
>>>>>      arbitrarily many
>>>>>      >>>>   paging
>>>>>      >>>>   >     queues (what they call a VM_BIND 
>>>>>engine/queue).  That
>>>>>      >>>>design works
>>>>>      >>>>   >     pretty well and solves the problems in 
>>>>>question.       >>>>Again, we could
>>>>>      >>>>   just
>>>>>      >>>>   >     make everything out-of-order and require 
>>>>>using syncobjs
>>>>>      >>>>to order
>>>>>      >>>>   things
>>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>>>      >>>>   >     One more note while I'm here: danvet said 
>>>>>something on
>>>>>      >>>>IRC about
>>>>>      >>>>   VM_BIND
>>>>>      >>>>   >     queues waiting for syncobjs to 
>>>>>materialize.  We don't
>>>>>      really
>>>>>      >>>>   want/need
>>>>>      >>>>   >     this.  We already have all the machinery in 
>>>>>userspace to
>>>>>      handle
>>>>>      >>>>   >     wait-before-signal and waiting for syncobj 
>>>>>fences to
>>>>>      >>>>materialize
>>>>>      >>>>   and
>>>>>      >>>>   >     that machinery is on by default.  It would actually
>>>>>      >>>>take MORE work
>>>>>      >>>>   in
>>>>>      >>>>   >     Mesa to turn it off and take advantage of 
>>>>>the kernel
>>>>>      >>>>being able to
>>>>>      >>>>   wait
>>>>>      >>>>   >     for syncobjs to materialize. Also, getting 
>>>>>that right is
>>>>>      >>>>   ridiculously
>>>>>      >>>>   >     hard and I really don't want to get it 
>>>>>wrong in kernel
>>>>>      >>>>space.   When we
>>>>>      >>>>   >     do memory fences, wait-before-signal will 
>>>>>be a thing.  We
>>>>>      don't
>>>>>      >>>>   need to
>>>>>      >>>>   >     try and make it a thing for syncobj.
>>>>>      >>>>   >     --Jason
>>>>>      >>>>   >
>>>>>      >>>>   >   Thanks Jason,
>>>>>      >>>>   >
>>>>>      >>>>   >   I missed the bit in the Vulkan spec that 
>>>>>we're allowed to
>>>>>      have a
>>>>>      >>>>   sparse
>>>>>      >>>>   >   queue that does not implement either graphics 
>>>>>or compute
>>>>>      >>>>operations
>>>>>      >>>>   :
>>>>>      >>>>   >
>>>>>      >>>>   >     "While some implementations may include
>>>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
>>>>>      >>>>   >     support in queue families that also include
>>>>>      >>>>   >
>>>>>      >>>>   >      graphics and compute support, other 
>>>>>implementations may
>>>>>      only
>>>>>      >>>>   expose a
>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>>      >>>>   >
>>>>>      >>>>   >      family."
>>>>>      >>>>   >
>>>>>      >>>>   >   So it can all be all a vm_bind engine that just does
>>>>>      bind/unbind
>>>>>      >>>>   >   operations.
>>>>>      >>>>   >
>>>>>      >>>>   >   But yes we need another engine for the 
>>>>>immediate/non-sparse
>>>>>      >>>>   operations.
>>>>>      >>>>   >
>>>>>      >>>>   >   -Lionel
>>>>>      >>>>   >
>>>>>      >>>>   >         >
>>>>>      >>>>   >       Daniel, any thoughts?
>>>>>      >>>>   >
>>>>>      >>>>   >       Niranjana
>>>>>      >>>>   >
>>>>>      >>>>   >       >Matt
>>>>>      >>>>   >       >
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> Sorry I noticed this late.
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >> -Lionel
>>>>>      >>>>   >       >>
>>>>>      >>>>   >       >>
>>>
>>>
>
Zeng, Oak June 13, 2022, 1:33 p.m. UTC | #32
Regards,
Oak

> -----Original Message-----
> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of Niranjana
> Vishwanathapura
> Sent: June 10, 2022 1:43 PM
> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Hellstrom, Thomas <thomas.hellstrom@intel.com>;
> Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> >On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> >>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> >>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> >>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
> >>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> >>>>>
> >>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> >>>>>    <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>
> >>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >>>>>      >
> >>>>>      >
> >>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> >>>>>Vishwanathapura
> >>>>>      wrote:
> >>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> >>>>>Ekstrand wrote:
> >>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
> >>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>      >>>>
> >>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> >>>>>Landwerlin
> >>>>>      wrote:
> >>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>>      >>>>   >
> >>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> >>>>>Vishwanathapura
> >>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>      >>>>   >
> >>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
> >>>>>      >>>>Brost wrote:
> >>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
> >>>>>      Landwerlin
> >>>>>      >>>>   wrote:
> >>>>>      >>>>   >       >> On 17/05/2022 21:32, Niranjana Vishwanathapura
> >>>>>      wrote:
> >>>>>      >>>>   >       >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>>      >>>>   binding/unbinding
> >>>>>      >>>>   >       the mapping in an
> >>>>>      >>>>   >       >> > +async worker. The binding and
> >>>>>unbinding will
> >>>>>      >>>>work like a
> >>>>>      >>>>   special
> >>>>>      >>>>   >       GPU engine.
> >>>>>      >>>>   >       >> > +The binding and unbinding operations are
> >>>>>      serialized and
> >>>>>      >>>>   will
> >>>>>      >>>>   >       wait on specified
> >>>>>      >>>>   >       >> > +input fences before the operation
> >>>>>and will signal
> >>>>>      the
> >>>>>      >>>>   output
> >>>>>      >>>>   >       fences upon the
> >>>>>      >>>>   >       >> > +completion of the operation. Due to
> >>>>>      serialization,
> >>>>>      >>>>   completion of
> >>>>>      >>>>   >       an operation
> >>>>>      >>>>   >       >> > +will also indicate that all
> >>>>>previous operations
> >>>>>      >>>>are also
> >>>>>      >>>>   >       complete.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I guess we should avoid saying "will
> >>>>>immediately
> >>>>>      start
> >>>>>      >>>>   >       binding/unbinding" if
> >>>>>      >>>>   >       >> there are fences involved.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> And the fact that it's happening in an async
> >>>>>      >>>>worker seem to
> >>>>>      >>>>   imply
> >>>>>      >>>>   >       it's not
> >>>>>      >>>>   >       >> immediate.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Ok, will fix.
> >>>>>      >>>>   >       This was added because in earlier design
> >>>>>binding was
> >>>>>      deferred
> >>>>>      >>>>   until
> >>>>>      >>>>   >       next execbuff.
> >>>>>      >>>>   >       But now it is non-deferred (immediate in
> >>>>>that sense).
> >>>>>      >>>>But yah,
> >>>>>      >>>>   this is
> >>>>>      >>>>   >       confusing
> >>>>>      >>>>   >       and will fix it.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I have a question on the behavior of the bind
> >>>>>      >>>>operation when
> >>>>>      >>>>   no
> >>>>>      >>>>   >       input fence
> >>>>>      >>>>   >       >> is provided. Let say I do :
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence1)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence2)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (out_fence=fence3)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In what order are the fences going to
> >>>>>be signaled?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In the order of VM_BIND ioctls? Or out
> >>>>>of order?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Because you wrote "serialized I assume
> >>>>>it's : in
> >>>>>      order
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
> >>>>>ioctls. Note that
> >>>>>      >>>>bind and
> >>>>>      >>>>   unbind
> >>>>>      >>>>   >       will use
> >>>>>      >>>>   >       the same queue and hence are ordered.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> One thing I didn't realize is that
> >>>>>because we only
> >>>>>      get one
> >>>>>      >>>>   >       "VM_BIND" engine,
> >>>>>      >>>>   >       >> there is a disconnect from the Vulkan
> >>>>>specification.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> In Vulkan VM_BIND operations are
> >>>>>serialized but
> >>>>>      >>>>per engine.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> So you could have something like this :
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (engine=rcs0, in_fence=fence1,
> >>>>>      out_fence=fence2)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> VM_BIND (engine=ccs0, in_fence=fence3,
> >>>>>      out_fence=fence4)
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> fence1 is not signaled
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> fence3 is signaled
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> So the second VM_BIND will proceed before the
> >>>>>      >>>>first VM_BIND.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> I guess we can deal with that scenario in
> >>>>>      >>>>userspace by doing
> >>>>>      >>>>   the
> >>>>>      >>>>   >       wait
> >>>>>      >>>>   >       >> ourselves in one thread per engines.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> But then it makes the VM_BIND input
> >>>>>fences useless.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Daniel : what do you think? Should be
> >>>>>rework this or
> >>>>>      just
> >>>>>      >>>>   deal with
> >>>>>      >>>>   >       wait
> >>>>>      >>>>   >       >> fences in userspace?
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >My opinion is rework this but make the
> >>>>>ordering via
> >>>>>      >>>>an engine
> >>>>>      >>>>   param
> >>>>>      >>>>   >       optional.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
> >>>>>are ordered
> >>>>>      >>>>within the
> >>>>>      >>>>   VM
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >e.g. A VM can be configured so all binds
> >>>>>accept an
> >>>>>      engine
> >>>>>      >>>>   argument
> >>>>>      >>>>   >       (in
> >>>>>      >>>>   >       >the case of the i915 likely this is a
> >>>>>gem context
> >>>>>      >>>>handle) and
> >>>>>      >>>>   binds
> >>>>>      >>>>   >       >ordered with respect to that engine.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >This gives UMDs options as the later
> >>>>>likely consumes
> >>>>>      >>>>more KMD
> >>>>>      >>>>   >       resources
> >>>>>      >>>>   >       >so if a different UMD can live with binds being
> >>>>>      >>>>ordered within
> >>>>>      >>>>   the VM
> >>>>>      >>>>   >       >they can use a mode consuming less resources.
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >
> >>>>>      >>>>   >       I think we need to be careful here if we
> >>>>>are looking
> >>>>>      for some
> >>>>>      >>>>   out of
> >>>>>      >>>>   >       (submission) order completion of vm_bind/unbind.
> >>>>>      >>>>   >       In-order completion means, in a batch of
> >>>>>binds and
> >>>>>      >>>>unbinds to be
> >>>>>      >>>>   >       completed in-order, user only needs to specify
> >>>>>      >>>>in-fence for the
> >>>>>      >>>>   >       first bind/unbind call and the out-fence
> >>>>>for the last
> >>>>>      >>>>   bind/unbind
> >>>>>      >>>>   >       call. Also, the VA released by an unbind
> >>>>>call can be
> >>>>>      >>>>re-used by
> >>>>>      >>>>   >       any subsequent bind call in that in-order batch.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       These things will break if
> >>>>>binding/unbinding were to
> >>>>>      >>>>be allowed
> >>>>>      >>>>   to
> >>>>>      >>>>   >       go out of order (of submission) and user
> >>>>>need to be
> >>>>>      extra
> >>>>>      >>>>   careful
> >>>>>      >>>>   >       not to run into premature triggering of
> >>>>>out-fence and
> >>>>>      bind
> >>>>>      >>>>   failing
> >>>>>      >>>>   >       as VA is still in use etc.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the
> >>>>>      specified
> >>>>>      >>>>   address
> >>>>>      >>>>   >       space
> >>>>>      >>>>   >       (VM). So, the uapi is not engine/context
> >>>>>specific.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       We can however add a 'queue' to the uapi
> >>>>>which can be
> >>>>>      >>>>one from
> >>>>>      >>>>   the
> >>>>>      >>>>   >       pre-defined queues,
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_0
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_1
> >>>>>      >>>>   >       ...
> >>>>>      >>>>   >       I915_VM_BIND_QUEUE_(N-1)
> >>>>>      >>>>   >
> >>>>>      >>>>   >       KMD will spawn an async work queue for
> >>>>>each queue which
> >>>>>      will
> >>>>>      >>>>   only
> >>>>>      >>>>   >       bind the mappings on that queue in the order of
> >>>>>      submission.
> >>>>>      >>>>   >       User can assign the queue to per engine
> >>>>>or anything
> >>>>>      >>>>like that.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       But again here, user need to be careful and not
> >>>>>      >>>>deadlock these
> >>>>>      >>>>   >       queues with circular dependency of fences.
> >>>>>      >>>>   >
> >>>>>      >>>>   >       I prefer adding this later as an
> >>>>>extension based on
> >>>>>      >>>>whether it
> >>>>>      >>>>   >       is really helping with the implementation.
> >>>>>      >>>>   >
> >>>>>      >>>>   >     I can tell you right now that having
> >>>>>everything on a
> >>>>>      single
> >>>>>      >>>>   in-order
> >>>>>      >>>>   >     queue will not get us the perf we want.
> >>>>>What vulkan
> >>>>>      >>>>really wants
> >>>>>      >>>>   is one
> >>>>>      >>>>   >     of two things:
> >>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just
> >>>>>      happen in
> >>>>>      >>>>   whatever
> >>>>>      >>>>   >     their dependencies are resolved and we
> >>>>>ensure ordering
> >>>>>      >>>>ourselves
> >>>>>      >>>>   by
> >>>>>      >>>>   >     having a syncobj in the VkQueue.
> >>>>>      >>>>   >      2. The ability to create multiple VM_BIND
> >>>>>queues.  We
> >>>>>      need at
> >>>>>      >>>>   least 2
> >>>>>      >>>>   >     but I don't see why there needs to be a
> >>>>>limit besides
> >>>>>      >>>>the limits
> >>>>>      >>>>   the
> >>>>>      >>>>   >     i915 API already has on the number of
> >>>>>engines.  Vulkan
> >>>>>      could
> >>>>>      >>>>   expose
> >>>>>      >>>>   >     multiple sparse binding queues to the
> >>>>>client if it's not
> >>>>>      >>>>   arbitrarily
> >>>>>      >>>>   >     limited.
> >>>>>      >>>>
> >>>>>      >>>>   Thanks Jason, Lionel.
> >>>>>      >>>>
> >>>>>      >>>>   Jason, what are you referring to when you say
> >>>>>"limits the i915
> >>>>>      API
> >>>>>      >>>>   already
> >>>>>      >>>>   has on the number of engines"? I am not sure if
> >>>>>there is such
> >>>>>      an uapi
> >>>>>      >>>>   today.
> >>>>>      >>>>
> >>>>>      >>>> There's a limit of something like 64 total engines
> >>>>>today based on
> >>>>>      the
> >>>>>      >>>> number of bits we can cram into the exec flags in
> >>>>>execbuffer2.  I
> >>>>>      think
> >>>>>      >>>> someone had an extended version that allowed more
> >>>>>but I ripped it
> >>>>>      out
> >>>>>      >>>> because no one was using it.  Of course,
> >>>>>execbuffer3 might not
> >>>>>      >>>>have that
> >>>>>      >>>> problem at all.
> >>>>>      >>>>
> >>>>>      >>>
> >>>>>      >>>Thanks Jason.
> >>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
> >>>>>execbuffer3
> >>>>>      probably
> >>>>>      >>>will not have this limitation. So, we need to define a
> >>>>>      VM_BIND_MAX_QUEUE
> >>>>>      >>>and somehow export it to user (I am thinking of
> >>>>>embedding it in
> >>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n'
> >>>>>      meaning 2^n
> >>>>>      >>>queues.
> >>>>>      >>
> >>>>>      >>Ah, I think you are talking about I915_EXEC_RING_MASK
> >>>>>(0x3f) which
> >>>>>      execbuf3
> >>>>>
> >>>>>    Yup!  That's exactly the limit I was talking about.
> >>>>>
> >>>>>      >>will also have. So, we can simply define in vm_bind/unbind
> >>>>>      structures,
> >>>>>      >>
> >>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> >>>>>      >>        __u32 queue;
> >>>>>      >>
> >>>>>      >>I think that will keep things simple.
> >>>>>      >
> >>>>>      >Hmmm? What does execbuf2 limit has to do with how many engines
> >>>>>      >hardware can have? I suggest not to do that.
> >>>>>      >
> >>>>>      >Change which added this:
> >>>>>      >
> >>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >>>>>      >               return -EINVAL;
> >>>>>      >
> >>>>>      >To context creation needs to be undone and so let users
> >>>>>create engine
> >>>>>      >maps with all hardware engines, and let execbuf3 access
> >>>>>them all.
> >>>>>      >
> >>>>>
> >>>>>      Earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to
> >>>>>execbuff3 also.
> >>>>>      Hence, I was using the same limit for VM_BIND queues
> >>>>>(64, or 65 if we
> >>>>>      make it N+1).
> >>>>>      But, as discussed in other thread of this RFC series, we
> >>>>>are planning
> >>>>>      to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be
> >>>>>      any uapi that limits the number of engines (and hence
> >>>>>the vm_bind
> >>>>>      queues
> >>>>>      need to be supported).
> >>>>>
> >>>>>      If we leave the number of vm_bind queues to be arbitrarily large
> >>>>>      (__u32 queue_idx) then, we need to have a hashmap for
> >>>>>queue (a wq,
> >>>>>      work_item and a linked list) lookup from the user
> >>>>>specified queue
> >>>>>      index.
> >>>>>      Other option is to just put some hard limit (say 64 or
> >>>>>65) and use
> >>>>>      an array of queues in VM (each created upon first use).
> >>>>>I prefer this.
> >>>>>
> >>>>>    I don't get why a VM_BIND queue is any different from any
> >>>>>other queue or
> >>>>>    userspace-visible kernel object.  But I'll leave those
> >>>>>details up to
> >>>>>    danvet or whoever else might be reviewing the implementation.
> >>>>>    --Jason
> >>>>>
> >>>>>  I kind of agree here. Wouldn't be simpler to have the bind
> >>>>>queue created
> >>>>>  like the others when we build the engine map?
> >>>>>
> >>>>>  For userspace it's then just matter of selecting the right
> >>>>>queue ID when
> >>>>>  submitting.
> >>>>>
> >>>>>  If there is ever a possibility to have this work on the GPU,
> >>>>>it would be
> >>>>>  all ready.
> >>>>>
> >>>>
> >>>>I did sync offline with Matt Brost on this.
> >>>>We can add a VM_BIND engine class and let user create VM_BIND
> >>>>engines (queues).
> >>>>The problem is, in i915 engine creating interface is bound to
> >>>>gem_context.
> >>>>So, in vm_bind ioctl, we would need both context_id and
> >>>>queue_idx for proper
> >>>>lookup of the user created engine. This is a bit awkward as vm_bind is an
> >>>>interface to VM (address space) and has nothing to do with gem_context.
> >>>
> >>>
> >>>A gem_context has a single vm object right?
> >>>
> >>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
> >>>one if not.
> >>>
> >>>So it's just like picking up the vm like it's done at execbuffer
> >>>time right now : eb->context->vm
> >>>
> >>
> >>Are you suggesting replacing 'vm_id' with 'context_id' in the
> >>VM_BIND/UNBIND
> >>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be
> >>obtained
> >>from the context?
> >
> >
> >Yes, because if we go for engines, they're associated with a context
> >and so also associated with the VM bound to the context.
> >
> 
> Hmm...context doesn't sound like the right interface. It should be
> VM and engine (independent of context). Engine can be virtual or soft
> engine (kernel thread), each with its own queue. We can add an interface
> to create such engines (independent of context). But we are anyway
> implicitly creating it when user uses a new queue_idx. If in future
> we have hardware engines for VM_BIND operation, we can have that
> explicit interface to create engine instances and the queue_index
> in vm_bind/unbind will point to those engines.
> Anyone has any thoughts? Daniel?

Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some hw resources that are used to complete a certain task. User space should only care about allocating some resources (memory, queues) and submitting tasks to queues. But user space doesn't care how a certain task is mapped to a HW context - the driver/guc should take care of this.

So a cleaner interface to me is: user space creates a vm, creates a gem object and vm_binds it to the vm; allocates queues for this vm (internally these represent compute or blitter HW; a queue can be virtual to the user); and submits tasks to those queues. A user can create multiple queues under one vm, but each queue belongs to exactly one vm.
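
To make that flow concrete, here is a rough userspace sketch. Only the VM
create and VM_BIND calls correspond to existing or RFC-proposed uapi (struct
layouts as in this series, still subject to change); the queue create/submit
calls at the end are purely hypothetical placeholders for whatever interface
comes out of this discussion:

    /* 1. Create an address space (VM), opting in to VM_BIND mode. */
    struct drm_i915_gem_vm_control vm = {
            .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);

    /* 2. Create a BO and bind (a section of) it at a GPU VA in that VM. */
    struct drm_i915_gem_create bo = { .size = 2 * 1024 * 1024 };
    ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &bo);

    struct drm_i915_gem_vm_bind bind = {
            .vm_id = vm.vm_id,
            .handle = bo.handle,
            .start = gpu_va,
            .offset = 0,
            .length = bo.size,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);

    /* 3. Allocate a queue for this VM and submit work to it.  No such
     * uapi exists today; these two calls are made-up names only to
     * illustrate the model: queues belong to the VM, and no gem context
     * appears anywhere in the flow.
     */
    q = vm_queue_create(fd, vm.vm_id);
    vm_queue_submit(fd, q, batch_gpu_va, in_fences, out_fence);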

The i915 driver/guc manages the hw compute or blitter resources, which is transparent to user space. When i915 or guc decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the vm of that queue (i.e., switched to the page tables of that vm) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/guc to execute a task using the va bound in that vm.

I didn't completely follow the discussion here. Just sharing some thoughts.

Regards,
Oak

> 
> Niranjana
> 
> >
> >>I think the interface is clean as an interface to the VM. It is only that we
> >>don't have a clean way to create a raw VM_BIND engine (not
> >>associated with
> >>any context) with i915 uapi.
> >>May be we can add such an interface, but I don't think that is worth it
> >>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
> >>mentioned
> >>above).
> >>Anyone has any thoughts?
> >>
> >>>
> >>>>Another problem is, if two VMs are binding with the same defined
> >>>>engine,
> >>>>binding on VM1 can get unnecessary blocked by binding on VM2
> >>>>(which may be
> >>>>waiting on its in_fence).
> >>>
> >>>
> >>>Maybe I'm missing something, but how can you have 2 vm objects
> >>>with a single gem_context right now?
> >>>
> >>
> >>No, we don't have 2 VMs for a gem_context.
> >>Say if ctx1 with vm1 and ctx2 with vm2.
> >>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> >>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
> >>those two queue indices point to the same underlying vm_bind engine,
> >>then the second vm_bind call gets blocked until the first vm_bind call's
> >>'in' fence is triggered and bind completes.
> >>
> >>With per VM queues, this is not a problem as two VMs will not end up
> >>sharing same queue.
> >>
> >>BTW, I just posted an updated PATCH series.
> >>https://www.spinics.net/lists/dri-devel/msg350483.html
> >>
> >>Niranjana
> >>
> >>>
> >>>>
> >>>>So, my preference here is to just add a 'u32 queue' index in
> >>>>vm_bind/unbind
> >>>>ioctl, and the queues are per VM.
> >>>>
> >>>>Niranjana
> >>>>
> >>>>>  Thanks,
> >>>>>
> >>>>>  -Lionel
> >>>>>
> >>>>>
> >>>>>      Niranjana
> >>>>>
> >>>>>      >Regards,
> >>>>>      >
> >>>>>      >Tvrtko
> >>>>>      >
> >>>>>      >>
> >>>>>      >>Niranjana
> >>>>>      >>
> >>>>>      >>>
> >>>>>      >>>>   I am trying to see how many queues we need and
> >>>>>don't want it to
> >>>>>      be
> >>>>>      >>>>   arbitrarily
> >>>>>      >>>>   large and unduly blow up memory usage and
> >>>>>complexity in i915
> >>>>>      driver.
> >>>>>      >>>>
> >>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
> >>>>>vast majority
> >>>>>      >>>>of cases. I
> >>>>>      >>>> could imagine a client wanting to create more than 1 sparse
> >>>>>      >>>>queue in which
> >>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
> >>>>>complexity
> >>>>>      >>>>goes, once
> >>>>>      >>>> you allow two, I don't think the complexity is going up by
> >>>>>      >>>>allowing N.  As
> >>>>>      >>>> for memory usage, creating more queues means more
> >>>>>memory.  That's
> >>>>>      a
> >>>>>      >>>> trade-off that userspace can make. Again, the
> >>>>>expected number
> >>>>>      >>>>here is 1
> >>>>>      >>>> or 2 in the vast majority of cases so I don't think
> >>>>>you need to
> >>>>>      worry.
> >>>>>      >>>
> >>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> >>>>>      >>>That would require us create 8 workqueues.
> >>>>>      >>>We can change 'n' later if required.
> >>>>>      >>>
> >>>>>      >>>Niranjana
> >>>>>      >>>
> >>>>>      >>>>
> >>>>>      >>>>   >     Why?  Because Vulkan has two basic kind of bind
> >>>>>      >>>>operations and we
> >>>>>      >>>>   don't
> >>>>>      >>>>   >     want any dependencies between them:
> >>>>>      >>>>   >      1. Immediate.  These happen right after BO
> >>>>>creation or
> >>>>>      >>>>maybe as
> >>>>>      >>>>   part of
> >>>>>      >>>>   >     vkBindImageMemory() or VkBindBufferMemory().  These
> >>>>>      >>>>don't happen
> >>>>>      >>>>   on a
> >>>>>      >>>>   >     queue and we don't want them serialized
> >>>>>with anything.       To
> >>>>>      >>>>   synchronize
> >>>>>      >>>>   >     with submit, we'll have a syncobj in the
> >>>>>VkDevice which
> >>>>>      is
> >>>>>      >>>>   signaled by
> >>>>>      >>>>   >     all immediate bind operations and make
> >>>>>submits wait on
> >>>>>      it.
> >>>>>      >>>>   >      2. Queued (sparse): These happen on a
> >>>>>VkQueue which may
> >>>>>      be the
> >>>>>      >>>>   same as
> >>>>>      >>>>   >     a render/compute queue or may be its own
> >>>>>queue.  It's up
> >>>>>      to us
> >>>>>      >>>>   what we
> >>>>>      >>>>   >     want to advertise.  From the Vulkan API
> >>>>>PoV, this is like
> >>>>>      any
> >>>>>      >>>>   other
> >>>>>      >>>>   >     queue.  Operations on it wait on and signal
> >>>>>semaphores.       If we
> >>>>>      >>>>   have a
> >>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to wait and
> >>>>>      >>>>signal just like
> >>>>>      >>>>   we do
> >>>>>      >>>>   >     in execbuf().
> >>>>>      >>>>   >     The important thing is that we don't want
> >>>>>one type of
> >>>>>      >>>>operation to
> >>>>>      >>>>   block
> >>>>>      >>>>   >     on the other.  If immediate binds are
> >>>>>blocking on sparse
> >>>>>      binds,
> >>>>>      >>>>   it's
> >>>>>      >>>>   >     going to cause over-synchronization issues.
> >>>>>      >>>>   >     In terms of the internal implementation, I
> >>>>>know that
> >>>>>      >>>>there's going
> >>>>>      >>>>   to be
> >>>>>      >>>>   >     a lock on the VM and that we can't actually
> >>>>>do these
> >>>>>      things in
> >>>>>      >>>>   >     parallel.  That's fine. Once the dma_fences have
> >>>>>      signaled and
> >>>>>      >>>>   we're
> >>>>>      >>>>
> >>>>>      >>>>   Thats correct. It is like a single VM_BIND engine with
> >>>>>      >>>>multiple queues
> >>>>>      >>>>   feeding to it.
> >>>>>      >>>>
> >>>>>      >>>> Right.  As long as the queues themselves are
> >>>>>independent and
> >>>>>      >>>>can block on
> >>>>>      >>>> dma_fences without holding up other queues, I think
> >>>>>we're fine.
> >>>>>      >>>>
> >>>>>      >>>>   >     unblocked to do the bind operation, I don't care if
> >>>>>      >>>>there's a bit
> >>>>>      >>>>   of
> >>>>>      >>>>   >     synchronization due to locking.  That's
> >>>>>expected.  What
> >>>>>      >>>>we can't
> >>>>>      >>>>   afford
> >>>>>      >>>>   >     to have is an immediate bind operation
> >>>>>suddenly blocking
> >>>>>      on a
> >>>>>      >>>>   sparse
> >>>>>      >>>>   >     operation which is blocked on a compute job
> >>>>>that's going
> >>>>>      to run
> >>>>>      >>>>   for
> >>>>>      >>>>   >     another 5ms.
> >>>>>      >>>>
> >>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
> >>>>>doesn't block
> >>>>>      the
> >>>>>      >>>>   VM_BIND
> >>>>>      >>>>   on other VMs. I am not sure about usecases here, but just
> >>>>>      wanted to
> >>>>>      >>>>   clarify.
> >>>>>      >>>>
> >>>>>      >>>> Yes, that's what I would expect.
> >>>>>      >>>> --Jason
> >>>>>      >>>>
> >>>>>      >>>>   Niranjana
> >>>>>      >>>>
> >>>>>      >>>>   >     For reference, Windows solves this by allowing
> >>>>>      arbitrarily many
> >>>>>      >>>>   paging
> >>>>>      >>>>   >     queues (what they call a VM_BIND
> >>>>>engine/queue).  That
> >>>>>      >>>>design works
> >>>>>      >>>>   >     pretty well and solves the problems in
> >>>>>question.       >>>>Again, we could
> >>>>>      >>>>   just
> >>>>>      >>>>   >     make everything out-of-order and require
> >>>>>using syncobjs
> >>>>>      >>>>to order
> >>>>>      >>>>   things
> >>>>>      >>>>   >     as userspace wants. That'd be fine too.
> >>>>>      >>>>   >     One more note while I'm here: danvet said
> >>>>>something on
> >>>>>      >>>>IRC about
> >>>>>      >>>>   VM_BIND
> >>>>>      >>>>   >     queues waiting for syncobjs to
> >>>>>materialize.  We don't
> >>>>>      really
> >>>>>      >>>>   want/need
> >>>>>      >>>>   >     this.  We already have all the machinery in
> >>>>>userspace to
> >>>>>      handle
> >>>>>      >>>>   >     wait-before-signal and waiting for syncobj
> >>>>>fences to
> >>>>>      >>>>materialize
> >>>>>      >>>>   and
> >>>>>      >>>>   >     that machinery is on by default.  It would actually
> >>>>>      >>>>take MORE work
> >>>>>      >>>>   in
> >>>>>      >>>>   >     Mesa to turn it off and take advantage of
> >>>>>the kernel
> >>>>>      >>>>being able to
> >>>>>      >>>>   wait
> >>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> >>>>>that right is
> >>>>>      >>>>   ridiculously
> >>>>>      >>>>   >     hard and I really don't want to get it
> >>>>>wrong in kernel
> >>>>>      >>>>space. When we
> >>>>>      >>>>   >     do memory fences, wait-before-signal will
> >>>>>be a thing.  We
> >>>>>      don't
> >>>>>      >>>>   need to
> >>>>>      >>>>   >     try and make it a thing for syncobj.
> >>>>>      >>>>   >     --Jason
> >>>>>      >>>>   >
> >>>>>      >>>>   >   Thanks Jason,
> >>>>>      >>>>   >
> >>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> >>>>>we're allowed to
> >>>>>      have a
> >>>>>      >>>>   sparse
> >>>>>      >>>>   >   queue that does not implement either graphics
> >>>>>or compute
> >>>>>      >>>>operations
> >>>>>      >>>>   :
> >>>>>      >>>>   >
> >>>>>      >>>>   >     "While some implementations may include
> >>>>>      >>>>   VK_QUEUE_SPARSE_BINDING_BIT
> >>>>>      >>>>   >     support in queue families that also include
> >>>>>      >>>>   >
> >>>>>      >>>>   >      graphics and compute support, other
> >>>>>implementations may
> >>>>>      only
> >>>>>      >>>>   expose a
> >>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>>      >>>>   >
> >>>>>      >>>>   >      family."
> >>>>>      >>>>   >
> >>>>>      >>>>   >   So it can all be all a vm_bind engine that just does
> >>>>>      bind/unbind
> >>>>>      >>>>   >   operations.
> >>>>>      >>>>   >
> >>>>>      >>>>   >   But yes we need another engine for the
> >>>>>immediate/non-sparse
> >>>>>      >>>>   operations.
> >>>>>      >>>>   >
> >>>>>      >>>>   >   -Lionel
> >>>>>      >>>>   >
> >>>>>      >>>>   >         >
> >>>>>      >>>>   >       Daniel, any thoughts?
> >>>>>      >>>>   >
> >>>>>      >>>>   >       Niranjana
> >>>>>      >>>>   >
> >>>>>      >>>>   >       >Matt
> >>>>>      >>>>   >       >
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> Sorry I noticed this late.
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >> -Lionel
> >>>>>      >>>>   >       >>
> >>>>>      >>>>   >       >>
> >>>
> >>>
> >
Niranjana Vishwanathapura June 13, 2022, 6:02 p.m. UTC | #33
On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
>
>
>Regards,
>Oak
>
>> [snip]
>
>Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some hw resources that are used to complete a certain task. User space should only care about allocating some resources (memory, queues) and submitting tasks to queues. But user space doesn't care how a certain task is mapped to a HW context - the driver/guc should take care of this.
>
>So a cleaner interface to me is: user space creates a vm, creates a gem object and vm_binds it to the vm; allocates queues for this vm (internally these represent compute or blitter HW; a queue can be virtual to the user); and submits tasks to those queues. A user can create multiple queues under one vm, but each queue belongs to exactly one vm.
>
>The i915 driver/guc manages the hw compute or blitter resources, which is transparent to user space. When i915 or guc decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the vm of that queue (i.e., switched to the page tables of that vm) - this is a context switch.
>
>From the vm_bind perspective, it simply binds a gem_object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/guc to execute a task using the va bound in that vm.
>
>I didn't completely follow the discussion here. Just sharing some thoughts.
>

Yah, I agree.

Lionel,
How about we define the queue as
union {
        __u32 queue_idx;
        __u64 rsvd;
}

If required, we can later extend this by repurposing the 'rsvd' field as a
<ctx_id, queue_idx> pair, gated by a flag.
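
As a sketch only (not final uapi), the vm_bind/unbind structs could then
carry something like:

        union {
                __u32 queue_idx;  /* per-VM bind/unbind queue to use */
                __u64 rsvd;       /* must be zero, reserves the full 64 bits */
        };

and a later extension, gated by a new flag (the name below is hypothetical,
e.g. I915_VM_BIND_QUEUE_IN_CTX), could reinterpret those 64 bits as a
<ctx_id, queue_idx> pair addressing a bind engine created in a gem context:

        struct {
                __u32 ctx_id;     /* gem context owning the bind engine */
                __u32 queue_idx;  /* engine/queue index within that context */
        };

For the cases discussed above, a Vulkan driver would then typically use one
queue_idx for immediate binds and one per sparse-binding VkQueue, i.e. the
1 or 2 queues Jason expects in the vast majority of cases.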

Niranjana

Lionel Landwerlin June 14, 2022, 7:04 a.m. UTC | #34
On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
>>
>>
>> Regards,
>> Oak
>>
>>> -----Original Message-----
>>> From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf 
>>> Of Niranjana
>>> Vishwanathapura
>>> Sent: June 10, 2022 1:43 PM
>>> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
>>> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI 
>>> developers <dri-
>>> devel@lists.freedesktop.org>; Hellstrom, Thomas 
>>> <thomas.hellstrom@intel.com>;
>>> Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
>>> <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
>>> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature 
>>> design
>>> document
>>>
>>> On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>>> >On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>> >>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>> >>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>> >>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>> >>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>> >>>>>
>>> >>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>> >>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>> >>>>>
>>> >>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin 
>>> wrote:
>>> >>>>>      >
>>> >>>>>      >
>>> >>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>> >>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>>> >>>>>Vishwanathapura
>>> >>>>>      wrote:
>>> >>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
>>> >>>>>Ekstrand wrote:
>>> >>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana 
>>> Vishwanathapura
>>> >>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>> >>>>>      >>>>
>>> >>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
>>> >>>>>Landwerlin
>>> >>>>>      wrote:
>>> >>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
>>> >>>>>Vishwanathapura
>>> >>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM -0700, 
>>> Matthew
>>> >>>>>      >>>>Brost wrote:
>>> >>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM +0300, 
>>> Lionel
>>> >>>>>      Landwerlin
>>> >>>>>      >>>>   wrote:
>>> >>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>> >>>>>      wrote:
>>> >>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
>>> >>>>>      >>>>   binding/unbinding
>>> >>>>>      >>>>   >       the mapping in an
>>> >>>>>      >>>>   > >> > +async worker. The binding and
>>> >>>>>unbinding will
>>> >>>>>      >>>>work like a
>>> >>>>>      >>>>   special
>>> >>>>>      >>>>   >       GPU engine.
>>> >>>>>      >>>>   > >> > +The binding and unbinding operations are
>>> >>>>>      serialized and
>>> >>>>>      >>>>   will
>>> >>>>>      >>>>   >       wait on specified
>>> >>>>>      >>>>   > >> > +input fences before the operation
>>> >>>>>and will signal
>>> >>>>>      the
>>> >>>>>      >>>>   output
>>> >>>>>      >>>>   >       fences upon the
>>> >>>>>      >>>>   > >> > +completion of the operation. Due to
>>> >>>>>      serialization,
>>> >>>>>      >>>>   completion of
>>> >>>>>      >>>>   >       an operation
>>> >>>>>      >>>>   > >> > +will also indicate that all
>>> >>>>>previous operations
>>> >>>>>      >>>>are also
>>> >>>>>      >>>>   > complete.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> I guess we should avoid saying "will
>>> >>>>>immediately
>>> >>>>>      start
>>> >>>>>      >>>>   > binding/unbinding" if
>>> >>>>>      >>>>   > >> there are fences involved.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> And the fact that it's happening in an async
>>> >>>>>      >>>>worker seem to
>>> >>>>>      >>>>   imply
>>> >>>>>      >>>>   >       it's not
>>> >>>>>      >>>>   > >> immediate.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       Ok, will fix.
>>> >>>>>      >>>>   >       This was added because in earlier design
>>> >>>>>binding was
>>> >>>>>      deferred
>>> >>>>>      >>>>   until
>>> >>>>>      >>>>   >       next execbuff.
>>> >>>>>      >>>>   >       But now it is non-deferred (immediate in
>>> >>>>>that sense).
>>> >>>>>      >>>>But yah,
>>> >>>>>      >>>>   this is
>>> >>>>>      >>>>   > confusing
>>> >>>>>      >>>>   >       and will fix it.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> I have a question on the behavior of the bind
>>> >>>>>      >>>>operation when
>>> >>>>>      >>>>   no
>>> >>>>>      >>>>   >       input fence
>>> >>>>>      >>>>   > >> is provided. Let say I do :
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> In what order are the fences going to
>>> >>>>>be signaled?
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
>>> >>>>>of order?
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> Because you wrote "serialized I assume
>>> >>>>>it's : in
>>> >>>>>      order
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
>>> >>>>>ioctls. Note that
>>> >>>>>      >>>>bind and
>>> >>>>>      >>>>   unbind
>>> >>>>>      >>>>   >       will use
>>> >>>>>      >>>>   >       the same queue and hence are ordered.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> One thing I didn't realize is that
>>> >>>>>because we only
>>> >>>>>      get one
>>> >>>>>      >>>>   > "VM_BIND" engine,
>>> >>>>>      >>>>   > >> there is a disconnect from the Vulkan
>>> >>>>>specification.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> In Vulkan VM_BIND operations are
>>> >>>>>serialized but
>>> >>>>>      >>>>per engine.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> So you could have something like this :
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
>>> >>>>>      out_fence=fence2)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
>>> >>>>>      out_fence=fence4)
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> fence1 is not signaled
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> fence3 is signaled
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> So the second VM_BIND will proceed before the
>>> >>>>>      >>>>first VM_BIND.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> I guess we can deal with that scenario in
>>> >>>>>      >>>>userspace by doing
>>> >>>>>      >>>>   the
>>> >>>>>      >>>>   >       wait
>>> >>>>>      >>>>   > >> ourselves in one thread per engines.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> But then it makes the VM_BIND input
>>> >>>>>fences useless.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> Daniel : what do you think? Should be
>>> >>>>>rework this or
>>> >>>>>      just
>>> >>>>>      >>>>   deal with
>>> >>>>>      >>>>   >       wait
>>> >>>>>      >>>>   > >> fences in userspace?
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   >       >My opinion is rework this but make the
>>> >>>>>ordering via
>>> >>>>>      >>>>an engine
>>> >>>>>      >>>>   param
>>> >>>>>      >>>>   > optional.
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>> >>>>>are ordered
>>> >>>>>      >>>>within the
>>> >>>>>      >>>>   VM
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>> >>>>>accept an
>>> >>>>>      engine
>>> >>>>>      >>>>   argument
>>> >>>>>      >>>>   >       (in
>>> >>>>>      >>>>   > >the case of the i915 likely this is a
>>> >>>>>gem context
>>> >>>>>      >>>>handle) and
>>> >>>>>      >>>>   binds
>>> >>>>>      >>>>   > >ordered with respect to that engine.
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >This gives UMDs options as the later
>>> >>>>>likely consumes
>>> >>>>>      >>>>more KMD
>>> >>>>>      >>>>   > resources
>>> >>>>>      >>>>   >       >so if a different UMD can live with binds 
>>> being
>>> >>>>>      >>>>ordered within
>>> >>>>>      >>>>   the VM
>>> >>>>>      >>>>   > >they can use a mode consuming less resources.
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       I think we need to be careful here if we
>>> >>>>>are looking
>>> >>>>>      for some
>>> >>>>>      >>>>   out of
>>> >>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
>>> >>>>>      >>>>   > In-order completion means, in a batch of
>>> >>>>>binds and
>>> >>>>>      >>>>unbinds to be
>>> >>>>>      >>>>   > completed in-order, user only needs to specify
>>> >>>>>      >>>>in-fence for the
>>> >>>>>      >>>>   >       first bind/unbind call and the our-fence
>>> >>>>>for the last
>>> >>>>>      >>>>   bind/unbind
>>> >>>>>      >>>>   >       call. Also, the VA released by an unbind
>>> >>>>>call can be
>>> >>>>>      >>>>re-used by
>>> >>>>>      >>>>   >       any subsequent bind call in that in-order 
>>> batch.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       These things will break if
>>> >>>>>binding/unbinding were to
>>> >>>>>      >>>>be allowed
>>> >>>>>      >>>>   to
>>> >>>>>      >>>>   >       go out of order (of submission) and user
>>> >>>>>need to be
>>> >>>>>      extra
>>> >>>>>      >>>>   careful
>>> >>>>>      >>>>   >       not to run into pre-mature triggereing of
>>> >>>>>out-fence and
>>> >>>>>      bind
>>> >>>>>      >>>>   failing
>>> >>>>>      >>>>   >       as VA is still in use etc.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       Also, VM_BIND binds the provided mapping 
>>> on the
>>> >>>>>      specified
>>> >>>>>      >>>>   address
>>> >>>>>      >>>>   >       space
>>> >>>>>      >>>>   >       (VM). So, the uapi is not engine/context
>>> >>>>>specific.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       We can however add a 'queue' to the uapi
>>> >>>>>which can be
>>> >>>>>      >>>>one from
>>> >>>>>      >>>>   the
>>> >>>>>      >>>>   > pre-defined queues,
>>> >>>>>      >>>>   > I915_VM_BIND_QUEUE_0
>>> >>>>>      >>>>   > I915_VM_BIND_QUEUE_1
>>> >>>>>      >>>>   >       ...
>>> >>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       KMD will spawn an async work queue for
>>> >>>>>each queue which
>>> >>>>>      will
>>> >>>>>      >>>>   only
>>> >>>>>      >>>>   >       bind the mappings on that queue in the 
>>> order of
>>> >>>>>      submission.
>>> >>>>>      >>>>   >       User can assign the queue to per engine
>>> >>>>>or anything
>>> >>>>>      >>>>like that.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       But again here, user need to be careful 
>>> and not
>>> >>>>>      >>>>deadlock these
>>> >>>>>      >>>>   >       queues with circular dependency of fences.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >       I prefer adding this later an as
>>> >>>>>extension based on
>>> >>>>>      >>>>whether it
>>> >>>>>      >>>>   >       is really helping with the implementation.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >     I can tell you right now that having
>>> >>>>>everything on a
>>> >>>>>      single
>>> >>>>>      >>>>   in-order
>>> >>>>>      >>>>   >     queue will not get us the perf we want.
>>> >>>>>What vulkan
>>> >>>>>      >>>>really wants
>>> >>>>>      >>>>   is one
>>> >>>>>      >>>>   >     of two things:
>>> >>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  
>>> They just
>>> >>>>>      happen in
>>> >>>>>      >>>>   whatever
>>> >>>>>      >>>>   >     their dependencies are resolved and we
>>> >>>>>ensure ordering
>>> >>>>>      >>>>ourselves
>>> >>>>>      >>>>   by
>>> >>>>>      >>>>   >     having a syncobj in the VkQueue.
>>> >>>>>      >>>>   >      2. The ability to create multiple VM_BIND
>>> >>>>>queues.  We
>>> >>>>>      need at
>>> >>>>>      >>>>   least 2
>>> >>>>>      >>>>   >     but I don't see why there needs to be a
>>> >>>>>limit besides
>>> >>>>>      >>>>the limits
>>> >>>>>      >>>>   the
>>> >>>>>      >>>>   >     i915 API already has on the number of
>>> >>>>>engines.  Vulkan
>>> >>>>>      could
>>> >>>>>      >>>>   expose
>>> >>>>>      >>>>   >     multiple sparse binding queues to the
>>> >>>>>client if it's not
>>> >>>>>      >>>>   arbitrarily
>>> >>>>>      >>>>   >     limited.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Thanks Jason, Lionel.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Jason, what are you referring to when you say
>>> >>>>>"limits the i915
>>> >>>>>      API
>>> >>>>>      >>>>   already
>>> >>>>>      >>>>   has on the number of engines"? I am not sure if
>>> >>>>>there is such
>>> >>>>>      an uapi
>>> >>>>>      >>>>   today.
>>> >>>>>      >>>>
>>> >>>>>      >>>> There's a limit of something like 64 total engines
>>> >>>>>today based on
>>> >>>>>      the
>>> >>>>>      >>>> number of bits we can cram into the exec flags in
>>> >>>>>execbuffer2.  I
>>> >>>>>      think
>>> >>>>>      >>>> someone had an extended version that allowed more
>>> >>>>>but I ripped it
>>> >>>>>      out
>>> >>>>>      >>>> because no one was using it.  Of course,
>>> >>>>>execbuffer3 might not
>>> >>>>>      >>>>have that
>>> >>>>>      >>>> problem at all.
>>> >>>>>      >>>>
>>> >>>>>      >>>
>>> >>>>>      >>>Thanks Jason.
>>> >>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
>>> >>>>>execbuffer3
>>> >>>>>      probably
>>> >>>>>      >>>will not have this limiation. So, we need to define a
>>> >>>>>      VM_BIND_MAX_QUEUE
>>> >>>>>      >>>and somehow export it to user (I am thinking of
>>> >>>>>embedding it in
>>> >>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, 
>>> bits[1-3]->'n'
>>> >>>>>      meaning 2^n
>>> >>>>>      >>>queues.
>>> >>>>>      >>
>>> >>>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK
>>> >>>>>(0x3f) which
>>> >>>>>      execbuf3
>>> >>>>>
>>> >>>>>    Yup!  That's exactly the limit I was talking about.
>>> >>>>>
>>> >>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>> >>>>>      structures,
>>> >>>>>      >>
>>> >>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>> >>>>>      >>        __u32 queue;
>>> >>>>>      >>
>>> >>>>>      >>I think that will keep things simple.
>>> >>>>>      >
>>> >>>>>      >Hmmm? What does execbuf2 limit has to do with how many 
>>> engines
>>> >>>>>      >hardware can have? I suggest not to do that.
>>> >>>>>      >
>>> >>>>>      >Change with added this:
>>> >>>>>      >
>>> >>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>> >>>>>      >               return -EINVAL;
>>> >>>>>      >
>>> >>>>>      >To context creation needs to be undone and so let users
>>> >>>>>create engine
>>> >>>>>      >maps with all hardware engines, and let execbuf3 access
>>> >>>>>them all.
>>> >>>>>      >
>>> >>>>>
>>> >>>>>      Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to
>>> >>>>>execbuff3 also.
>>> >>>>>      Hence, I was using the same limit for VM_BIND queues
>>> >>>>>(64, or 65 if we
>>> >>>>>      make it N+1).
>>> >>>>>      But, as discussed in other thread of this RFC series, we
>>> >>>>>are planning
>>> >>>>>      to drop this I915_EXEC_RING_MAP in execbuff3. So, there 
>>> won't be
>>> >>>>>      any uapi that limits the number of engines (and hence
>>> >>>>>the vm_bind
>>> >>>>>      queues
>>> >>>>>      need to be supported).
>>> >>>>>
>>> >>>>>      If we leave the number of vm_bind queues to be 
>>> arbitrarily large
>>> >>>>>      (__u32 queue_idx) then, we need to have a hashmap for
>>> >>>>>queue (a wq,
>>> >>>>>      work_item and a linked list) lookup from the user
>>> >>>>>specified queue
>>> >>>>>      index.
>>> >>>>>      Other option is to just put some hard limit (say 64 or
>>> >>>>>65) and use
>>> >>>>>      an array of queues in VM (each created upon first use).
>>> >>>>>I prefer this.
>>> >>>>>
>>> >>>>>    I don't get why a VM_BIND queue is any different from any
>>> >>>>>other queue or
>>> >>>>>    userspace-visible kernel object.  But I'll leave those
>>> >>>>>details up to
>>> >>>>>    danvet or whoever else might be reviewing the implementation.
>>> >>>>>    --Jason
>>> >>>>>
>>> >>>>>  I kind of agree here. Wouldn't be simpler to have the bind
>>> >>>>>queue created
>>> >>>>>  like the others when we build the engine map?
>>> >>>>>
>>> >>>>>  For userspace it's then just matter of selecting the right
>>> >>>>>queue ID when
>>> >>>>>  submitting.
>>> >>>>>
>>> >>>>>  If there is ever a possibility to have this work on the GPU,
>>> >>>>>it would be
>>> >>>>>  all ready.
>>> >>>>>
>>> >>>>
>>> >>>>I did sync offline with Matt Brost on this.
>>> >>>>We can add a VM_BIND engine class and let user create VM_BIND
>>> >>>>engines (queues).
>>> >>>>The problem is, in i915 engine creating interface is bound to
>>> >>>>gem_context.
>>> >>>>So, in vm_bind ioctl, we would need both context_id and
>>> >>>>queue_idx for proper
>>> >>>>lookup of the user created engine. This is bit ackward as 
>>> vm_bind is an
>>> >>>>interface to VM (address space) and has nothing to do with 
>>> gem_context.
>>> >>>
>>> >>>
>>> >>>A gem_context has a single vm object right?
>>> >>>
>>> >>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
>>> >>>one if not.
>>> >>>
>>> >>>So it's just like picking up the vm like it's done at execbuffer
>>> >>>time right now : eb->context->vm
>>> >>>
>>> >>
>>> >>Are you suggesting replacing 'vm_id' with 'context_id' in the
>>> >>VM_BIND/UNBIND
>>> >>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be
>>> >>obtained
>>> >>from the context?
>>> >
>>> >
>>> >Yes, because if we go for engines, they're associated with a context
>>> >and so also associated with the VM bound to the context.
>>> >
>>>
>>> Hmm...context doesn't sould like the right interface. It should be
>>> VM and engine (independent of context). Engine can be virtual or soft
>>> engine (kernel thread), each with its own queue. We can add an 
>>> interface
>>> to create such engines (independent of context). But we are anway
>>> implicitly creating it when user uses a new queue_idx. If in future
>>> we have hardware engines for VM_BIND operation, we can have that
>>> explicit inteface to create engine instances and the queue_index
>>> in vm_bind/unbind will point to those engines.
>>> Anyone has any thoughts? Daniel?
>>
>> Exposing gem_context or intel_context to user space is a strange 
>> concept to me. A context represent some hw resources that is used to 
>> complete certain task. User space should care allocate some resources 
>> (memory, queues) and submit tasks to queues. But user space doesn't 
>> care how certain task is mapped to a HW context - driver/guc should 
>> take care of this.
>>
>> So a cleaner interface to me is: user space create a vm,  create gem 
>> object, vm_bind it to a vm; allocate queues (internally represent 
>> compute or blitter HW. Queue can be virtual to user) for this vm; 
>> submit tasks to queues. User can create multiple queues under one vm. 
>> One queue is only for one vm.
>>
>> I915 driver/guc manage the hw compute or blitter resources which is 
>> transparent to user space. When i915 or guc decide to schedule a 
>> queue (run tasks on that queue), a HW engine will be pick up and set 
>> up properly for the vm of that queue (ie., switch to page tables of 
>> that vm) - this is a context switch.
>>
>> From vm_bind perspective, it simply bind a gem_object to a vm. 
>> Engine/queue is not a parameter to vm_bind, as any engine can be pick 
>> up by i915/guc to execute a task using the vm bound va.
>>
>> I didn't completely follow the discussion here. Just share some 
>> thoughts.
>>
>
> Yah, I agree.
>
> Lionel,
> How about we define the queue as
> union {
>        __u32 queue_idx;
>        __u64 rsvd;
> }
>
> If required, we can extend by expanding the 'rsvd' field to <ctx_id, 
> queue_idx> later
> with a flag.
>
> Niranjana


I did not really understand Oak's comment nor what you're suggesting 
here to be honest.


First, the GEM context is already exposed to userspace. It's explicitly 
created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.

We give the GEM context id in every execbuffer we do with 
drm_i915_gem_execbuffer2::rsvd1.

It's still in the new execbuffer3 proposal being discussed.


Second, the GEM context is also where we set the VM with 
I915_CONTEXT_PARAM_VM.


Third, the GEM context also has the list of engines with 
I915_CONTEXT_PARAM_ENGINES.
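
For reference, this is roughly how a VM is tied to a GEM context with 
the existing uapi (untested sketch, error handling omitted; fd and vm_id 
are placeholders, vm_id coming from GEM_VM_CREATE):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* create a context that uses an already-created VM */
    struct drm_i915_gem_context_create_ext_setparam set_vm = {
            .base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
            .param = {
                    .param = I915_CONTEXT_PARAM_VM,
                    .value = vm_id,
            },
    };
    struct drm_i915_gem_context_create_ext create = {
            .flags      = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
            .extensions = (uintptr_t)&set_vm,
    };
    ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
    /* create.ctx_id now refers to a context bound to vm_id */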


So it makes sense to me to dispatch the vm_bind operation to a GEM 
context, to a given vm_bind queue, because it's got all the information 
required (a sketch of what that could look like follows the list):

     - the list of new vm_bind queues

     - the vm that is going to be modified
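
A sketch of what that could look like, purely illustrative (not the 
struct proposed in this series, field names made up): the bind request 
would carry a <context, queue> pair instead of just a vm_id.

    /* illustrative only, not the struct from this RFC */
    struct example_vm_bind {
            __u32 ctx_id;      /* GEM context owning the vm_bind queues */
            __u32 queue_idx;   /* which of that context's bind queues   */
            __u32 handle;      /* BO to bind                            */
            __u32 pad;
            __u64 start;       /* GPU virtual address                   */
            __u64 offset;      /* offset into the BO                    */
            __u64 length;      /* length of the mapping                 */
            __u64 flags;
            __u64 extensions;  /* in/out fence extensions, etc.         */
    };

The VM to modify is then implied by the context, just like it is for 
execbuffer today.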


Otherwise where do the vm_bind queues live?

In the i915/drm fd object?

That would mean that all the GEM contexts are sharing the same vm_bind 
queues.


intel_context or GuC are internal details we're not concerned about.

I don't really see the connection with the GEM context.


Maybe Oak has a different use case than Vulkan.


-Lionel


>
>> Regards,
>> Oak
>>
>>>
>>> Niranjana
>>>
>>> >
>>> >>I think the interface is clean as a interface to VM. It is only 
>>> that we
>>> >>don't have a clean way to create a raw VM_BIND engine (not
>>> >>associated with
>>> >>any context) with i915 uapi.
>>> >>May be we can add such an interface, but I don't think that is 
>>> worth it
>>> >>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
>>> >>mentioned
>>> >>above).
>>> >>Anyone has any thoughts?
>>> >>
>>> >>>
>>> >>>>Another problem is, if two VMs are binding with the same defined
>>> >>>>engine,
>>> >>>>binding on VM1 can get unnecessary blocked by binding on VM2
>>> >>>>(which may be
>>> >>>>waiting on its in_fence).
>>> >>>
>>> >>>
>>> >>>Maybe I'm missing something, but how can you have 2 vm objects
>>> >>>with a single gem_context right now?
>>> >>>
>>> >>
>>> >>No, we don't have 2 VMs for a gem_context.
>>> >>Say if ctx1 with vm1 and ctx2 with vm2.
>>> >>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>>> >>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
>>> >>those two queue indicies points to same underlying vm_bind engine,
>>> >>then the second vm_bind call gets blocked until the first vm_bind 
>>> call's
>>> >>'in' fence is triggered and bind completes.
>>> >>
>>> >>With per VM queues, this is not a problem as two VMs will not endup
>>> >>sharing same queue.
>>> >>
>>> >>BTW, I just posted a updated PATCH series.
>>> >>https://www.spinics.net/lists/dri-devel/msg350483.html
>>> >>
>>> >>Niranjana
>>> >>
>>> >>>
>>> >>>>
>>> >>>>So, my preference here is to just add a 'u32 queue' index in
>>> >>>>vm_bind/unbind
>>> >>>>ioctl, and the queues are per VM.
>>> >>>>
>>> >>>>Niranjana
>>> >>>>
>>> >>>>>  Thanks,
>>> >>>>>
>>> >>>>>  -Lionel
>>> >>>>>
>>> >>>>>
>>> >>>>>      Niranjana
>>> >>>>>
>>> >>>>>      >Regards,
>>> >>>>>      >
>>> >>>>>      >Tvrtko
>>> >>>>>      >
>>> >>>>>      >>
>>> >>>>>      >>Niranjana
>>> >>>>>      >>
>>> >>>>>      >>>
>>> >>>>>      >>>>   I am trying to see how many queues we need and
>>> >>>>>don't want it to
>>> >>>>>      be
>>> >>>>>      >>>>   arbitrarily
>>> >>>>>      >>>>   large and unduely blow up memory usage and
>>> >>>>>complexity in i915
>>> >>>>>      driver.
>>> >>>>>      >>>>
>>> >>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
>>> >>>>>vast majority
>>> >>>>>      >>>>of cases. I
>>> >>>>>      >>>> could imagine a client wanting to create more than 1 
>>> sparse
>>> >>>>>      >>>>queue in which
>>> >>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
>>> >>>>>complexity
>>> >>>>>      >>>>goes, once
>>> >>>>>      >>>> you allow two, I don't think the complexity is going 
>>> up by
>>> >>>>>      >>>>allowing N.  As
>>> >>>>>      >>>> for memory usage, creating more queues means more
>>> >>>>>memory.  That's
>>> >>>>>      a
>>> >>>>>      >>>> trade-off that userspace can make. Again, the
>>> >>>>>expected number
>>> >>>>>      >>>>here is 1
>>> >>>>>      >>>> or 2 in the vast majority of cases so I don't think
>>> >>>>>you need to
>>> >>>>>      worry.
>>> >>>>>      >>>
>>> >>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>> >>>>>      >>>That would require us create 8 workqueues.
>>> >>>>>      >>>We can change 'n' later if required.
>>> >>>>>      >>>
>>> >>>>>      >>>Niranjana
>>> >>>>>      >>>
>>> >>>>>      >>>>
>>> >>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
>>> >>>>>      >>>>operations and we
>>> >>>>>      >>>>   don't
>>> >>>>>      >>>>   >     want any dependencies between them:
>>> >>>>>      >>>>   >      1. Immediate.  These happen right after BO
>>> >>>>>creation or
>>> >>>>>      >>>>maybe as
>>> >>>>>      >>>>   part of
>>> >>>>>      >>>>   > vkBindImageMemory() or VkBindBufferMemory().  These
>>> >>>>>      >>>>don't happen
>>> >>>>>      >>>>   on a
>>> >>>>>      >>>>   >     queue and we don't want them serialized
>>> >>>>>with anything.       To
>>> >>>>>      >>>>   synchronize
>>> >>>>>      >>>>   >     with submit, we'll have a syncobj in the
>>> >>>>>VkDevice which
>>> >>>>>      is
>>> >>>>>      >>>>   signaled by
>>> >>>>>      >>>>   >     all immediate bind operations and make
>>> >>>>>submits wait on
>>> >>>>>      it.
>>> >>>>>      >>>>   >      2. Queued (sparse): These happen on a
>>> >>>>>VkQueue which may
>>> >>>>>      be the
>>> >>>>>      >>>>   same as
>>> >>>>>      >>>>   >     a render/compute queue or may be its own
>>> >>>>>queue.  It's up
>>> >>>>>      to us
>>> >>>>>      >>>>   what we
>>> >>>>>      >>>>   >     want to advertise.  From the Vulkan API
>>> >>>>>PoV, this is like
>>> >>>>>      any
>>> >>>>>      >>>>   other
>>> >>>>>      >>>>   >     queue. Operations on it wait on and signal
>>> >>>>>semaphores.       If we
>>> >>>>>      >>>>   have a
>>> >>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to 
>>> wait and
>>> >>>>>      >>>>signal just like
>>> >>>>>      >>>>   we do
>>> >>>>>      >>>>   >     in execbuf().
>>> >>>>>      >>>>   >     The important thing is that we don't want
>>> >>>>>one type of
>>> >>>>>      >>>>operation to
>>> >>>>>      >>>>   block
>>> >>>>>      >>>>   >     on the other.  If immediate binds are
>>> >>>>>blocking on sparse
>>> >>>>>      binds,
>>> >>>>>      >>>>   it's
>>> >>>>>      >>>>   >     going to cause over-synchronization issues.
>>> >>>>>      >>>>   >     In terms of the internal implementation, I
>>> >>>>>know that
>>> >>>>>      >>>>there's going
>>> >>>>>      >>>>   to be
>>> >>>>>      >>>>   >     a lock on the VM and that we can't actually
>>> >>>>>do these
>>> >>>>>      things in
>>> >>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
>>> >>>>>      signaled and
>>> >>>>>      >>>>   we're
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Thats correct. It is like a single VM_BIND engine 
>>> with
>>> >>>>>      >>>>multiple queues
>>> >>>>>      >>>>   feeding to it.
>>> >>>>>      >>>>
>>> >>>>>      >>>> Right.  As long as the queues themselves are
>>> >>>>>independent and
>>> >>>>>      >>>>can block on
>>> >>>>>      >>>> dma_fences without holding up other queues, I think
>>> >>>>>we're fine.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   > unblocked to do the bind operation, I don't care if
>>> >>>>>      >>>>there's a bit
>>> >>>>>      >>>>   of
>>> >>>>>      >>>>   > synchronization due to locking.  That's
>>> >>>>>expected.  What
>>> >>>>>      >>>>we can't
>>> >>>>>      >>>>   afford
>>> >>>>>      >>>>   >     to have is an immediate bind operation
>>> >>>>>suddenly blocking
>>> >>>>>      on a
>>> >>>>>      >>>>   sparse
>>> >>>>>      >>>>   > operation which is blocked on a compute job
>>> >>>>>that's going
>>> >>>>>      to run
>>> >>>>>      >>>>   for
>>> >>>>>      >>>>   >     another 5ms.
>>> >>>>>      >>>>
>>> >>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
>>> >>>>>doesn't block
>>> >>>>>      the
>>> >>>>>      >>>>   VM_BIND
>>> >>>>>      >>>>   on other VMs. I am not sure about usecases here, 
>>> but just
>>> >>>>>      wanted to
>>> >>>>>      >>>>   clarify.
>>> >>>>>      >>>>
>>> >>>>>      >>>> Yes, that's what I would expect.
>>> >>>>>      >>>> --Jason
>>> >>>>>      >>>>
>>> >>>>>      >>>>   Niranjana
>>> >>>>>      >>>>
>>> >>>>>      >>>>   >     For reference, Windows solves this by allowing
>>> >>>>>      arbitrarily many
>>> >>>>>      >>>>   paging
>>> >>>>>      >>>>   >     queues (what they call a VM_BIND
>>> >>>>>engine/queue).  That
>>> >>>>>      >>>>design works
>>> >>>>>      >>>>   >     pretty well and solves the problems in
>>> >>>>>question.  Again, we could
>>> >>>>>      >>>>   just
>>> >>>>>      >>>>   >     make everything out-of-order and require
>>> >>>>>using syncobjs
>>> >>>>>      >>>>to order
>>> >>>>>      >>>>   things
>>> >>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>> >>>>>      >>>>   >     One more note while I'm here: danvet said
>>> >>>>>something on
>>> >>>>>      >>>>IRC about
>>> >>>>>      >>>>   VM_BIND
>>> >>>>>      >>>>   >     queues waiting for syncobjs to
>>> >>>>>materialize.  We don't
>>> >>>>>      really
>>> >>>>>      >>>>   want/need
>>> >>>>>      >>>>   >     this. We already have all the machinery in
>>> >>>>>userspace to
>>> >>>>>      handle
>>> >>>>>      >>>>   > wait-before-signal and waiting for syncobj
>>> >>>>>fences to
>>> >>>>>      >>>>materialize
>>> >>>>>      >>>>   and
>>> >>>>>      >>>>   >     that machinery is on by default.  It would 
>>> actually
>>> >>>>>      >>>>take MORE work
>>> >>>>>      >>>>   in
>>> >>>>>      >>>>   >     Mesa to turn it off and take advantage of
>>> >>>>>the kernel
>>> >>>>>      >>>>being able to
>>> >>>>>      >>>>   wait
>>> >>>>>      >>>>   >     for syncobjs to materialize. Also, getting
>>> >>>>>that right is
>>> >>>>>      >>>>   ridiculously
>>> >>>>>      >>>>   >     hard and I really don't want to get it
>>> >>>>>wrong in kernel
>>> >>>>>      >>>>space.  When we
>>> >>>>>      >>>>   >     do memory fences, wait-before-signal will
>>> >>>>>be a thing.  We
>>> >>>>>      don't
>>> >>>>>      >>>>   need to
>>> >>>>>      >>>>   >     try and make it a thing for syncobj.
>>> >>>>>      >>>>   >     --Jason
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   Thanks Jason,
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   I missed the bit in the Vulkan spec that
>>> >>>>>we're allowed to
>>> >>>>>      have a
>>> >>>>>      >>>>   sparse
>>> >>>>>      >>>>   >   queue that does not implement either graphics
>>> >>>>>or compute
>>> >>>>>      >>>>operations
>>> >>>>>      >>>>   :
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >     "While some implementations may include
>>> >>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
>>> >>>>>      >>>>   >     support in queue families that also include
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > graphics and compute support, other
>>> >>>>>implementations may
>>> >>>>>      only
>>> >>>>>      >>>>   expose a
>>> >>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > family."
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   So it can all be all a vm_bind engine that 
>>> just does
>>> >>>>>      bind/unbind
>>> >>>>>      >>>>   > operations.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   But yes we need another engine for the
>>> >>>>>immediate/non-sparse
>>> >>>>>      >>>>   operations.
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >   -Lionel
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   >         >
>>> >>>>>      >>>>   > Daniel, any thoughts?
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > Niranjana
>>> >>>>>      >>>>   >
>>> >>>>>      >>>>   > >Matt
>>> >>>>>      >>>>   >       >
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> Sorry I noticed this late.
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >> -Lionel
>>> >>>>>      >>>>   > >>
>>> >>>>>      >>>>   > >>
>>> >>>
>>> >>>
>>> >
Niranjana Vishwanathapura June 14, 2022, 5:01 p.m. UTC | #35
On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
>On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
>>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
>>>
>>>
>>>Regards,
>>>Oak
>>>
>>>>-----Original Message-----
>>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On 
>>>>Behalf Of Niranjana
>>>>Vishwanathapura
>>>>Sent: June 10, 2022 1:43 PM
>>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
>>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - 
>>>>DRI developers <dri-
>>>>devel@lists.freedesktop.org>; Hellstrom, Thomas 
>>>><thomas.hellstrom@intel.com>;
>>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
>>>><daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
>>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND 
>>>>feature design
>>>>document
>>>>
>>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>>>>>>
>>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>
>>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko 
>>>>Ursulin wrote:
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>>>>>>>>>Vishwanathapura
>>>>>>>>>      wrote:
>>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
>>>>>>>>>Ekstrand wrote:
>>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana 
>>>>Vishwanathapura
>>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
>>>>>>>>>Landwerlin
>>>>>>>>>      wrote:
>>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
>>>>>>>>>Vishwanathapura
>>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM 
>>>>-0700, Matthew
>>>>>>>>>      >>>>Brost wrote:
>>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM 
>>>>+0300, Lionel
>>>>>>>>>      Landwerlin
>>>>>>>>>      >>>>   wrote:
>>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>>>>>>      wrote:
>>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>>>>>>      >>>>   binding/unbinding
>>>>>>>>>      >>>>   >       the mapping in an
>>>>>>>>>      >>>>   > >> > +async worker. The binding and
>>>>>>>>>unbinding will
>>>>>>>>>      >>>>work like a
>>>>>>>>>      >>>>   special
>>>>>>>>>      >>>>   >       GPU engine.
>>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
>>>>>>>>>      serialized and
>>>>>>>>>      >>>>   will
>>>>>>>>>      >>>>   >       wait on specified
>>>>>>>>>      >>>>   > >> > +input fences before the operation
>>>>>>>>>and will signal
>>>>>>>>>      the
>>>>>>>>>      >>>>   output
>>>>>>>>>      >>>>   >       fences upon the
>>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
>>>>>>>>>      serialization,
>>>>>>>>>      >>>>   completion of
>>>>>>>>>      >>>>   >       an operation
>>>>>>>>>      >>>>   > >> > +will also indicate that all
>>>>>>>>>previous operations
>>>>>>>>>      >>>>are also
>>>>>>>>>      >>>>   > complete.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
>>>>>>>>>immediately
>>>>>>>>>      start
>>>>>>>>>      >>>>   > binding/unbinding" if
>>>>>>>>>      >>>>   > >> there are fences involved.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
>>>>>>>>>      >>>>worker seem to
>>>>>>>>>      >>>>   imply
>>>>>>>>>      >>>>   >       it's not
>>>>>>>>>      >>>>   > >> immediate.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Ok, will fix.
>>>>>>>>>      >>>>   >       This was added because in earlier design
>>>>>>>>>binding was
>>>>>>>>>      deferred
>>>>>>>>>      >>>>   until
>>>>>>>>>      >>>>   >       next execbuff.
>>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
>>>>>>>>>that sense).
>>>>>>>>>      >>>>But yah,
>>>>>>>>>      >>>>   this is
>>>>>>>>>      >>>>   > confusing
>>>>>>>>>      >>>>   >       and will fix it.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind
>>>>>>>>>      >>>>operation when
>>>>>>>>>      >>>>   no
>>>>>>>>>      >>>>   >       input fence
>>>>>>>>>      >>>>   > >> is provided. Let say I do :
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In what order are the fences going to
>>>>>>>>>be signaled?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
>>>>>>>>>of order?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Because you wrote "serialized I assume
>>>>>>>>>it's : in
>>>>>>>>>      order
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
>>>>>>>>>ioctls. Note that
>>>>>>>>>      >>>>bind and
>>>>>>>>>      >>>>   unbind
>>>>>>>>>      >>>>   >       will use
>>>>>>>>>      >>>>   >       the same queue and hence are ordered.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> One thing I didn't realize is that
>>>>>>>>>because we only
>>>>>>>>>      get one
>>>>>>>>>      >>>>   > "VM_BIND" engine,
>>>>>>>>>      >>>>   > >> there is a disconnect from the Vulkan
>>>>>>>>>specification.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are
>>>>>>>>>serialized but
>>>>>>>>>      >>>>per engine.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> So you could have something like this :
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
>>>>>>>>>      out_fence=fence2)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
>>>>>>>>>      out_fence=fence4)
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> fence1 is not signaled
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> fence3 is signaled
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the
>>>>>>>>>      >>>>first VM_BIND.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in
>>>>>>>>>      >>>>userspace by doing
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   >       wait
>>>>>>>>>      >>>>   > >> ourselves in one thread per engines.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input
>>>>>>>>>fences useless.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Daniel : what do you think? Should be
>>>>>>>>>rework this or
>>>>>>>>>      just
>>>>>>>>>      >>>>   deal with
>>>>>>>>>      >>>>   >       wait
>>>>>>>>>      >>>>   > >> fences in userspace?
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   >       >My opinion is rework this but make the
>>>>>>>>>ordering via
>>>>>>>>>      >>>>an engine
>>>>>>>>>      >>>>   param
>>>>>>>>>      >>>>   > optional.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>>>>>>>>are ordered
>>>>>>>>>      >>>>within the
>>>>>>>>>      >>>>   VM
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
>>>>>>>>>accept an
>>>>>>>>>      engine
>>>>>>>>>      >>>>   argument
>>>>>>>>>      >>>>   >       (in
>>>>>>>>>      >>>>   > >the case of the i915 likely this is a
>>>>>>>>>gem context
>>>>>>>>>      >>>>handle) and
>>>>>>>>>      >>>>   binds
>>>>>>>>>      >>>>   > >ordered with respect to that engine.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >This gives UMDs options as the later
>>>>>>>>>likely consumes
>>>>>>>>>      >>>>more KMD
>>>>>>>>>      >>>>   > resources
>>>>>>>>>      >>>>   >       >so if a different UMD can live with 
>>>>binds being
>>>>>>>>>      >>>>ordered within
>>>>>>>>>      >>>>   the VM
>>>>>>>>>      >>>>   > >they can use a mode consuming less resources.
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       I think we need to be careful here if we
>>>>>>>>>are looking
>>>>>>>>>      for some
>>>>>>>>>      >>>>   out of
>>>>>>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
>>>>>>>>>      >>>>   > In-order completion means, in a batch of
>>>>>>>>>binds and
>>>>>>>>>      >>>>unbinds to be
>>>>>>>>>      >>>>   > completed in-order, user only needs to specify
>>>>>>>>>      >>>>in-fence for the
>>>>>>>>>      >>>>   >       first bind/unbind call and the our-fence
>>>>>>>>>for the last
>>>>>>>>>      >>>>   bind/unbind
>>>>>>>>>      >>>>   >       call. Also, the VA released by an unbind
>>>>>>>>>call can be
>>>>>>>>>      >>>>re-used by
>>>>>>>>>      >>>>   >       any subsequent bind call in that 
>>>>in-order batch.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       These things will break if
>>>>>>>>>binding/unbinding were to
>>>>>>>>>      >>>>be allowed
>>>>>>>>>      >>>>   to
>>>>>>>>>      >>>>   >       go out of order (of submission) and user
>>>>>>>>>need to be
>>>>>>>>>      extra
>>>>>>>>>      >>>>   careful
>>>>>>>>>      >>>>   >       not to run into pre-mature triggereing of
>>>>>>>>>out-fence and
>>>>>>>>>      bind
>>>>>>>>>      >>>>   failing
>>>>>>>>>      >>>>   >       as VA is still in use etc.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided 
>>>>mapping on the
>>>>>>>>>      specified
>>>>>>>>>      >>>>   address
>>>>>>>>>      >>>>   >       space
>>>>>>>>>      >>>>   >       (VM). So, the uapi is not engine/context
>>>>>>>>>specific.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi
>>>>>>>>>which can be
>>>>>>>>>      >>>>one from
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   > pre-defined queues,
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
>>>>>>>>>      >>>>   >       ...
>>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       KMD will spawn an async work queue for
>>>>>>>>>each queue which
>>>>>>>>>      will
>>>>>>>>>      >>>>   only
>>>>>>>>>      >>>>   >       bind the mappings on that queue in the 
>>>>order of
>>>>>>>>>      submission.
>>>>>>>>>      >>>>   >       User can assign the queue to per engine
>>>>>>>>>or anything
>>>>>>>>>      >>>>like that.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       But again here, user need to be 
>>>>careful and not
>>>>>>>>>      >>>>deadlock these
>>>>>>>>>      >>>>   >       queues with circular dependency of fences.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >       I prefer adding this later an as
>>>>>>>>>extension based on
>>>>>>>>>      >>>>whether it
>>>>>>>>>      >>>>   >       is really helping with the implementation.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     I can tell you right now that having
>>>>>>>>>everything on a
>>>>>>>>>      single
>>>>>>>>>      >>>>   in-order
>>>>>>>>>      >>>>   >     queue will not get us the perf we want.
>>>>>>>>>What vulkan
>>>>>>>>>      >>>>really wants
>>>>>>>>>      >>>>   is one
>>>>>>>>>      >>>>   >     of two things:
>>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND 
>>>>ops.  They just
>>>>>>>>>      happen in
>>>>>>>>>      >>>>   whatever
>>>>>>>>>      >>>>   >     their dependencies are resolved and we
>>>>>>>>>ensure ordering
>>>>>>>>>      >>>>ourselves
>>>>>>>>>      >>>>   by
>>>>>>>>>      >>>>   >     having a syncobj in the VkQueue.
>>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND
>>>>>>>>>queues.  We
>>>>>>>>>      need at
>>>>>>>>>      >>>>   least 2
>>>>>>>>>      >>>>   >     but I don't see why there needs to be a
>>>>>>>>>limit besides
>>>>>>>>>      >>>>the limits
>>>>>>>>>      >>>>   the
>>>>>>>>>      >>>>   >     i915 API already has on the number of
>>>>>>>>>engines.  Vulkan
>>>>>>>>>      could
>>>>>>>>>      >>>>   expose
>>>>>>>>>      >>>>   >     multiple sparse binding queues to the
>>>>>>>>>client if it's not
>>>>>>>>>      >>>>   arbitrarily
>>>>>>>>>      >>>>   >     limited.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Thanks Jason, Lionel.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Jason, what are you referring to when you say
>>>>>>>>>"limits the i915
>>>>>>>>>      API
>>>>>>>>>      >>>>   already
>>>>>>>>>      >>>>   has on the number of engines"? I am not sure if
>>>>>>>>>there is such
>>>>>>>>>      an uapi
>>>>>>>>>      >>>>   today.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> There's a limit of something like 64 total engines
>>>>>>>>>today based on
>>>>>>>>>      the
>>>>>>>>>      >>>> number of bits we can cram into the exec flags in
>>>>>>>>>execbuffer2.  I
>>>>>>>>>      think
>>>>>>>>>      >>>> someone had an extended version that allowed more
>>>>>>>>>but I ripped it
>>>>>>>>>      out
>>>>>>>>>      >>>> because no one was using it.  Of course,
>>>>>>>>>execbuffer3 might not
>>>>>>>>>      >>>>have that
>>>>>>>>>      >>>> problem at all.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>
>>>>>>>>>      >>>Thanks Jason.
>>>>>>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
>>>>>>>>>execbuffer3
>>>>>>>>>      probably
>>>>>>>>>      >>>will not have this limiation. So, we need to define a
>>>>>>>>>      VM_BIND_MAX_QUEUE
>>>>>>>>>      >>>and somehow export it to user (I am thinking of
>>>>>>>>>embedding it in
>>>>>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, 
>>>>bits[1-3]->'n'
>>>>>>>>>      meaning 2^n
>>>>>>>>>      >>>queues.
>>>>>>>>>      >>
>>>>>>>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK
>>>>>>>>>(0x3f) which
>>>>>>>>>      execbuf3
>>>>>>>>>
>>>>>>>>>    Yup!  That's exactly the limit I was talking about.
>>>>>>>>>
>>>>>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
>>>>>>>>>      structures,
>>>>>>>>>      >>
>>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
>>>>>>>>>      >>        __u32 queue;
>>>>>>>>>      >>
>>>>>>>>>      >>I think that will keep things simple.
>>>>>>>>>      >
>>>>>>>>>      >Hmmm? What does execbuf2 limit has to do with how 
>>>>many engines
>>>>>>>>>      >hardware can have? I suggest not to do that.
>>>>>>>>>      >
>>>>>>>>>      >Change with added this:
>>>>>>>>>      >
>>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
>>>>>>>>>      >               return -EINVAL;
>>>>>>>>>      >
>>>>>>>>>      >To context creation needs to be undone and so let users
>>>>>>>>>create engine
>>>>>>>>>      >maps with all hardware engines, and let execbuf3 access
>>>>>>>>>them all.
>>>>>>>>>      >
>>>>>>>>>
>>>>>>>>>      Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to
>>>>>>>>>execbuff3 also.
>>>>>>>>>      Hence, I was using the same limit for VM_BIND queues
>>>>>>>>>(64, or 65 if we
>>>>>>>>>      make it N+1).
>>>>>>>>>      But, as discussed in other thread of this RFC series, we
>>>>>>>>>are planning
>>>>>>>>>      to drop this I915_EXEC_RING_MAP in execbuff3. So, 
>>>>there won't be
>>>>>>>>>      any uapi that limits the number of engines (and hence
>>>>>>>>>the vm_bind
>>>>>>>>>      queues
>>>>>>>>>      need to be supported).
>>>>>>>>>
>>>>>>>>>      If we leave the number of vm_bind queues to be 
>>>>arbitrarily large
>>>>>>>>>      (__u32 queue_idx) then, we need to have a hashmap for
>>>>>>>>>queue (a wq,
>>>>>>>>>      work_item and a linked list) lookup from the user
>>>>>>>>>specified queue
>>>>>>>>>      index.
>>>>>>>>>      Other option is to just put some hard limit (say 64 or
>>>>>>>>>65) and use
>>>>>>>>>      an array of queues in VM (each created upon first use).
>>>>>>>>>I prefer this.
>>>>>>>>>
>>>>>>>>>    I don't get why a VM_BIND queue is any different from any
>>>>>>>>>other queue or
>>>>>>>>>    userspace-visible kernel object.  But I'll leave those
>>>>>>>>>details up to
>>>>>>>>>    danvet or whoever else might be reviewing the implementation.
>>>>>>>>>    --Jason
>>>>>>>>>
>>>>>>>>>  I kind of agree here. Wouldn't be simpler to have the bind
>>>>>>>>>queue created
>>>>>>>>>  like the others when we build the engine map?
>>>>>>>>>
>>>>>>>>>  For userspace it's then just matter of selecting the right
>>>>>>>>>queue ID when
>>>>>>>>>  submitting.
>>>>>>>>>
>>>>>>>>>  If there is ever a possibility to have this work on the GPU,
>>>>>>>>>it would be
>>>>>>>>>  all ready.
>>>>>>>>>
>>>>>>>>
>>>>>>>>I did sync offline with Matt Brost on this.
>>>>>>>>We can add a VM_BIND engine class and let user create VM_BIND
>>>>>>>>engines (queues).
>>>>>>>>The problem is, in i915 engine creating interface is bound to
>>>>>>>>gem_context.
>>>>>>>>So, in vm_bind ioctl, we would need both context_id and
>>>>>>>>queue_idx for proper
>>>>>>>>lookup of the user created engine. This is bit ackward as 
>>>>vm_bind is an
>>>>>>>>interface to VM (address space) and has nothing to do with 
>>>>gem_context.
>>>>>>>
>>>>>>>
>>>>>>>A gem_context has a single vm object right?
>>>>>>>
>>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a default
>>>>>>>one if not.
>>>>>>>
>>>>>>>So it's just like picking up the vm like it's done at execbuffer
>>>>>>>time right now : eb->context->vm
>>>>>>>
>>>>>>
>>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the
>>>>>>VM_BIND/UNBIND
>>>>>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be
>>>>>>obtained
>>>>>>from the context?
>>>>>
>>>>>
>>>>>Yes, because if we go for engines, they're associated with a context
>>>>>and so also associated with the VM bound to the context.
>>>>>
>>>>
>>>>Hmm...context doesn't sould like the right interface. It should be
>>>>VM and engine (independent of context). Engine can be virtual or soft
>>>>engine (kernel thread), each with its own queue. We can add an 
>>>>interface
>>>>to create such engines (independent of context). But we are anway
>>>>implicitly creating it when user uses a new queue_idx. If in future
>>>>we have hardware engines for VM_BIND operation, we can have that
>>>>explicit inteface to create engine instances and the queue_index
>>>>in vm_bind/unbind will point to those engines.
>>>>Anyone has any thoughts? Daniel?
>>>
>>>Exposing gem_context or intel_context to user space is a strange 
>>>concept to me. A context represent some hw resources that is used 
>>>to complete certain task. User space should care allocate some 
>>>resources (memory, queues) and submit tasks to queues. But user 
>>>space doesn't care how certain task is mapped to a HW context - 
>>>driver/guc should take care of this.
>>>
>>>So a cleaner interface to me is: user space create a vm,  create 
>>>gem object, vm_bind it to a vm; allocate queues (internally 
>>>represent compute or blitter HW. Queue can be virtual to user) for 
>>>this vm; submit tasks to queues. User can create multiple queues 
>>>under one vm. One queue is only for one vm.
>>>
>>>I915 driver/guc manage the hw compute or blitter resources which 
>>>is transparent to user space. When i915 or guc decide to schedule 
>>>a queue (run tasks on that queue), a HW engine will be pick up and 
>>>set up properly for the vm of that queue (ie., switch to page 
>>>tables of that vm) - this is a context switch.
>>>
>>>From vm_bind perspective, it simply bind a gem_object to a vm. 
>>>Engine/queue is not a parameter to vm_bind, as any engine can be 
>>>pick up by i915/guc to execute a task using the vm bound va.
>>>
>>>I didn't completely follow the discussion here. Just share some 
>>>thoughts.
>>>
>>
>>Yah, I agree.
>>
>>Lionel,
>>How about we define the queue as
>>union {
>>       __u32 queue_idx;
>>       __u64 rsvd;
>>}
>>
>>If required, we can extend by expanding the 'rsvd' field to <ctx_id, 
>>queue_idx> later
>>with a flag.
>>
>>Niranjana
>
>
>I did not really understand Oak's comment nor what you're suggesting 
>here to be honest.
>
>
>First the GEM context is already exposed to userspace. It's explicitly 
>created by userpace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
>
>We give the GEM context id in every execbuffer we do with 
>drm_i915_gem_execbuffer2::rsvd1.
>
>It's still in the new execbuffer3 proposal being discussed.
>
>
>Second, the GEM context is also where we set the VM with 
>I915_CONTEXT_PARAM_VM.
>
>
>Third, the GEM context also has the list of engines with 
>I915_CONTEXT_PARAM_ENGINES.
>

Yes, the execbuf and engine map creation are tied to gem_context
(which probably is not the best interface).

>
>So it makes sense to me to dispatch the vm_bind operation to a GEM 
>context, to a given vm_bind queue, because it's got all the 
>information required :
>
>    - the list of new vm_bind queues
>
>    - the vm that is going to be modified
>

But the operation is performed here on the address space (VM) which
can have multiple gem_contexts referring to it. So, VM is the right
interface here. We need not 'gem_context'ify it.

All we need is multiple queue support for the address space (VM).
Going to gem_context for that just because we have engine creation
support there seems unnecessary and not correct to me.
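
Purely as a sketch of where the queues would then live on the kernel 
side (made-up names, one ordered workqueue per VM bind queue, created 
on first use as discussed above):

    #include <linux/workqueue.h>

    /* illustrative only: bind queues hang off the VM, not off a gem_context */
    #define EXAMPLE_VM_BIND_MAX_QUEUE 8     /* placeholder limit (2^n) */

    struct example_vm_bind_queue {
            /* alloc_ordered_workqueue() so binds/unbinds submitted to the
             * same queue complete in submission order */
            struct workqueue_struct *wq;
    };

    struct example_vm {                     /* stands in for the i915 VM */
            struct example_vm_bind_queue queues[EXAMPLE_VM_BIND_MAX_QUEUE];
    };

and the vm_bind/unbind ioctls would just carry <vm_id, queue_idx> to 
pick one of them.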

>
>Otherwise where do the vm_bind queues live?
>
>In the i915/drm fd object?
>
>That would mean that all the GEM contexts are sharing the same vm_bind 
>queues.
>

Not all, only the gem contexts that are using the same address space (VM).
But to me the right way to describe it would be that "VM will be using those
queues".

Niranjana

>
>intel_context or GuC are internal details we're not concerned about.
>
>I don't really see the connection with the GEM context.
>
>
>Maybe Oak has a different use case than Vulkan.
>
>
>-Lionel
>
>
>>
>>>Regards,
>>>Oak
>>>
>>>>
>>>>Niranjana
>>>>
>>>>>
>>>>>>I think the interface is clean as a interface to VM. It is 
>>>>only that we
>>>>>>don't have a clean way to create a raw VM_BIND engine (not
>>>>>>associated with
>>>>>>any context) with i915 uapi.
>>>>>>May be we can add such an interface, but I don't think that is 
>>>>worth it
>>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
>>>>>>mentioned
>>>>>>above).
>>>>>>Anyone has any thoughts?
>>>>>>
>>>>>>>
>>>>>>>>Another problem is, if two VMs are binding with the same defined
>>>>>>>>engine,
>>>>>>>>binding on VM1 can get unnecessary blocked by binding on VM2
>>>>>>>>(which may be
>>>>>>>>waiting on its in_fence).
>>>>>>>
>>>>>>>
>>>>>>>Maybe I'm missing something, but how can you have 2 vm objects
>>>>>>>with a single gem_context right now?
>>>>>>>
>>>>>>
>>>>>>No, we don't have 2 VMs for a gem_context.
>>>>>>Say if ctx1 with vm1 and ctx2 with vm2.
>>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
>>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
>>>>>>those two queue indicies points to same underlying vm_bind engine,
>>>>>>then the second vm_bind call gets blocked until the first 
>>>>vm_bind call's
>>>>>>'in' fence is triggered and bind completes.
>>>>>>
>>>>>>With per VM queues, this is not a problem as two VMs will not endup
>>>>>>sharing same queue.
>>>>>>
>>>>>>BTW, I just posted a updated PATCH series.
>>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
>>>>>>
>>>>>>Niranjana
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>So, my preference here is to just add a 'u32 queue' index in
>>>>>>>>vm_bind/unbind
>>>>>>>>ioctl, and the queues are per VM.
>>>>>>>>
>>>>>>>>Niranjana
>>>>>>>>
>>>>>>>>>  Thanks,
>>>>>>>>>
>>>>>>>>>  -Lionel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Niranjana
>>>>>>>>>
>>>>>>>>>      >Regards,
>>>>>>>>>      >
>>>>>>>>>      >Tvrtko
>>>>>>>>>      >
>>>>>>>>>      >>
>>>>>>>>>      >>Niranjana
>>>>>>>>>      >>
>>>>>>>>>      >>>
>>>>>>>>>      >>>>   I am trying to see how many queues we need and
>>>>>>>>>don't want it to
>>>>>>>>>      be
>>>>>>>>>      >>>>   arbitrarily
>>>>>>>>>      >>>>   large and unduely blow up memory usage and
>>>>>>>>>complexity in i915
>>>>>>>>>      driver.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
>>>>>>>>>vast majority
>>>>>>>>>      >>>>of cases. I
>>>>>>>>>      >>>> could imagine a client wanting to create more 
>>>>than 1 sparse
>>>>>>>>>      >>>>queue in which
>>>>>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
>>>>>>>>>complexity
>>>>>>>>>      >>>>goes, once
>>>>>>>>>      >>>> you allow two, I don't think the complexity is 
>>>>going up by
>>>>>>>>>      >>>>allowing N.  As
>>>>>>>>>      >>>> for memory usage, creating more queues means more
>>>>>>>>>memory.  That's
>>>>>>>>>      a
>>>>>>>>>      >>>> trade-off that userspace can make. Again, the
>>>>>>>>>expected number
>>>>>>>>>      >>>>here is 1
>>>>>>>>>      >>>> or 2 in the vast majority of cases so I don't think
>>>>>>>>>you need to
>>>>>>>>>      worry.
>>>>>>>>>      >>>
>>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
>>>>>>>>>      >>>That would require us create 8 workqueues.
>>>>>>>>>      >>>We can change 'n' later if required.
>>>>>>>>>      >>>
>>>>>>>>>      >>>Niranjana
>>>>>>>>>      >>>
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
>>>>>>>>>      >>>>operations and we
>>>>>>>>>      >>>>   don't
>>>>>>>>>      >>>>   >     want any dependencies between them:
>>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO
>>>>>>>>>creation or
>>>>>>>>>      >>>>maybe as
>>>>>>>>>      >>>>   part of
>>>>>>>>>      >>>>   > vkBindImageMemory() or VkBindBufferMemory().  These
>>>>>>>>>      >>>>don't happen
>>>>>>>>>      >>>>   on a
>>>>>>>>>      >>>>   >     queue and we don't want them serialized
>>>>>>>>>with anything.       To
>>>>>>>>>      >>>>   synchronize
>>>>>>>>>      >>>>   >     with submit, we'll have a syncobj in the
>>>>>>>>>VkDevice which
>>>>>>>>>      is
>>>>>>>>>      >>>>   signaled by
>>>>>>>>>      >>>>   >     all immediate bind operations and make
>>>>>>>>>submits wait on
>>>>>>>>>      it.
>>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a
>>>>>>>>>VkQueue which may
>>>>>>>>>      be the
>>>>>>>>>      >>>>   same as
>>>>>>>>>      >>>>   >     a render/compute queue or may be its own
>>>>>>>>>queue.  It's up
>>>>>>>>>      to us
>>>>>>>>>      >>>>   what we
>>>>>>>>>      >>>>   >     want to advertise.  From the Vulkan API
>>>>>>>>>PoV, this is like
>>>>>>>>>      any
>>>>>>>>>      >>>>   other
>>>>>>>>>      >>>>   >     queue. Operations on it wait on and signal
>>>>>>>>>semaphores.       If we
>>>>>>>>>      >>>>   have a
>>>>>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to 
>>>>wait and
>>>>>>>>>      >>>>signal just like
>>>>>>>>>      >>>>   we do
>>>>>>>>>      >>>>   >     in execbuf().
>>>>>>>>>      >>>>   >     The important thing is that we don't want
>>>>>>>>>one type of
>>>>>>>>>      >>>>operation to
>>>>>>>>>      >>>>   block
>>>>>>>>>      >>>>   >     on the other.  If immediate binds are
>>>>>>>>>blocking on sparse
>>>>>>>>>      binds,
>>>>>>>>>      >>>>   it's
>>>>>>>>>      >>>>   >     going to cause over-synchronization issues.
>>>>>>>>>      >>>>   >     In terms of the internal implementation, I
>>>>>>>>>know that
>>>>>>>>>      >>>>there's going
>>>>>>>>>      >>>>   to be
>>>>>>>>>      >>>>   >     a lock on the VM and that we can't actually
>>>>>>>>>do these
>>>>>>>>>      things in
>>>>>>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
>>>>>>>>>      signaled and
>>>>>>>>>      >>>>   we're
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Thats correct. It is like a single VM_BIND 
>>>>engine with
>>>>>>>>>      >>>>multiple queues
>>>>>>>>>      >>>>   feeding to it.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> Right.  As long as the queues themselves are
>>>>>>>>>independent and
>>>>>>>>>      >>>>can block on
>>>>>>>>>      >>>> dma_fences without holding up other queues, I think
>>>>>>>>>we're fine.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if
>>>>>>>>>      >>>>there's a bit
>>>>>>>>>      >>>>   of
>>>>>>>>>      >>>>   > synchronization due to locking.  That's
>>>>>>>>>expected.  What
>>>>>>>>>      >>>>we can't
>>>>>>>>>      >>>>   afford
>>>>>>>>>      >>>>   >     to have is an immediate bind operation
>>>>>>>>>suddenly blocking
>>>>>>>>>      on a
>>>>>>>>>      >>>>   sparse
>>>>>>>>>      >>>>   > operation which is blocked on a compute job
>>>>>>>>>that's going
>>>>>>>>>      to run
>>>>>>>>>      >>>>   for
>>>>>>>>>      >>>>   >     another 5ms.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM
>>>>>>>>>doesn't block
>>>>>>>>>      the
>>>>>>>>>      >>>>   VM_BIND
>>>>>>>>>      >>>>   on other VMs. I am not sure about usecases 
>>>>here, but just
>>>>>>>>>      wanted to
>>>>>>>>>      >>>>   clarify.
>>>>>>>>>      >>>>
>>>>>>>>>      >>>> Yes, that's what I would expect.
>>>>>>>>>      >>>> --Jason
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   Niranjana
>>>>>>>>>      >>>>
>>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
>>>>>>>>>      arbitrarily many
>>>>>>>>>      >>>>   paging
>>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
>>>>>>>>>engine/queue).  That
>>>>>>>>>      >>>>design works
>>>>>>>>>      >>>>   >     pretty well and solves the problems in
>>>>>>>>>question.       >>>>Again, we could
>>>>>>>>>      >>>>   just
>>>>>>>>>      >>>>   >     make everything out-of-order and require
>>>>>>>>>using syncobjs
>>>>>>>>>      >>>>to order
>>>>>>>>>      >>>>   things
>>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
>>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
>>>>>>>>>something on
>>>>>>>>>      >>>>IRC about
>>>>>>>>>      >>>>   VM_BIND
>>>>>>>>>      >>>>   >     queues waiting for syncobjs to
>>>>>>>>>materialize.  We don't
>>>>>>>>>      really
>>>>>>>>>      >>>>   want/need
>>>>>>>>>      >>>>   >     this. We already have all the machinery in
>>>>>>>>>userspace to
>>>>>>>>>      handle
>>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
>>>>>>>>>fences to
>>>>>>>>>      >>>>materialize
>>>>>>>>>      >>>>   and
>>>>>>>>>      >>>>   >     that machinery is on by default.  It 
>>>>would actually
>>>>>>>>>      >>>>take MORE work
>>>>>>>>>      >>>>   in
>>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
>>>>>>>>>the kernel
>>>>>>>>>      >>>>being able to
>>>>>>>>>      >>>>   wait
>>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
>>>>>>>>>that right is
>>>>>>>>>      >>>>   ridiculously
>>>>>>>>>      >>>>   >     hard and I really don't want to get it
>>>>>>>>>wrong in kernel
>>>>>>>>>      >>>>space.  When we
>>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
>>>>>>>>>be a thing.  We
>>>>>>>>>      don't
>>>>>>>>>      >>>>   need to
>>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
>>>>>>>>>      >>>>   >     --Jason
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   Thanks Jason,
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
>>>>>>>>>we're allowed to
>>>>>>>>>      have a
>>>>>>>>>      >>>>   sparse
>>>>>>>>>      >>>>   >   queue that does not implement either graphics
>>>>>>>>>or compute
>>>>>>>>>      >>>>operations
>>>>>>>>>      >>>>   :
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >     "While some implementations may include
>>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
>>>>>>>>>      >>>>   >     support in queue families that also include
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > graphics and compute support, other
>>>>>>>>>implementations may
>>>>>>>>>      only
>>>>>>>>>      >>>>   expose a
>>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > family."
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   So it can all be all a vm_bind engine that 
>>>>just does
>>>>>>>>>      bind/unbind
>>>>>>>>>      >>>>   > operations.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   But yes we need another engine for the
>>>>>>>>>immediate/non-sparse
>>>>>>>>>      >>>>   operations.
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >   -Lionel
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   >         >
>>>>>>>>>      >>>>   > Daniel, any thoughts?
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > Niranjana
>>>>>>>>>      >>>>   >
>>>>>>>>>      >>>>   > >Matt
>>>>>>>>>      >>>>   >       >
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> Sorry I noticed this late.
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >> -Lionel
>>>>>>>>>      >>>>   > >>
>>>>>>>>>      >>>>   > >>
>>>>>>>
>>>>>>>
>>>>>
>
>
Zeng, Oak June 14, 2022, 9:12 p.m. UTC | #36
Thanks,
Oak

> -----Original Message-----
> From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> Sent: June 14, 2022 1:02 PM
> To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-
> gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; Wilson, Chris P <chris.p.wilson@intel.com>;
> Vetter, Daniel <daniel.vetter@intel.com>; Christian König
> <christian.koenig@amd.com>
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> >>>
> >>>
> >>>Regards,
> >>>Oak
> >>>
> >>>>-----Original Message-----
> >>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On
> >>>>Behalf Of Niranjana
> >>>>Vishwanathapura
> >>>>Sent: June 10, 2022 1:43 PM
> >>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> >>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list -
> >>>>DRI developers <dri-
> >>>>devel@lists.freedesktop.org>; Hellstrom, Thomas
> >>>><thomas.hellstrom@intel.com>;
> >>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> >>>><daniel.vetter@intel.com>; Christian König
> <christian.koenig@amd.com>
> >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND
> >>>>feature design
> >>>>document
> >>>>
> >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
> >>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> >>>>>>>>>
> >>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> >>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>>>>>
> >>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
> >>>>Ursulin wrote:
> >>>>>>>>>      >
> >>>>>>>>>      >
> >>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> >>>>>>>>>Vishwanathapura
> >>>>>>>>>      wrote:
> >>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> >>>>>>>>>Ekstrand wrote:
> >>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
> >>>>Vishwanathapura
> >>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> >>>>>>>>>Landwerlin
> >>>>>>>>>      wrote:
> >>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> >>>>>>>>>Vishwanathapura
> >>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM
> >>>>-0700, Matthew
> >>>>>>>>>      >>>>Brost wrote:
> >>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM
> >>>>+0300, Lionel
> >>>>>>>>>      Landwerlin
> >>>>>>>>>      >>>>   wrote:
> >>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
> >>>>>>>>>      wrote:
> >>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>>>>>>      >>>>   binding/unbinding
> >>>>>>>>>      >>>>   >       the mapping in an
> >>>>>>>>>      >>>>   > >> > +async worker. The binding and
> >>>>>>>>>unbinding will
> >>>>>>>>>      >>>>work like a
> >>>>>>>>>      >>>>   special
> >>>>>>>>>      >>>>   >       GPU engine.
> >>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
> >>>>>>>>>      serialized and
> >>>>>>>>>      >>>>   will
> >>>>>>>>>      >>>>   >       wait on specified
> >>>>>>>>>      >>>>   > >> > +input fences before the operation
> >>>>>>>>>and will signal
> >>>>>>>>>      the
> >>>>>>>>>      >>>>   output
> >>>>>>>>>      >>>>   >       fences upon the
> >>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
> >>>>>>>>>      serialization,
> >>>>>>>>>      >>>>   completion of
> >>>>>>>>>      >>>>   >       an operation
> >>>>>>>>>      >>>>   > >> > +will also indicate that all
> >>>>>>>>>previous operations
> >>>>>>>>>      >>>>are also
> >>>>>>>>>      >>>>   > complete.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
> >>>>>>>>>immediately
> >>>>>>>>>      start
> >>>>>>>>>      >>>>   > binding/unbinding" if
> >>>>>>>>>      >>>>   > >> there are fences involved.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
> >>>>>>>>>      >>>>worker seem to
> >>>>>>>>>      >>>>   imply
> >>>>>>>>>      >>>>   >       it's not
> >>>>>>>>>      >>>>   > >> immediate.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       Ok, will fix.
> >>>>>>>>>      >>>>   >       This was added because in earlier design
> >>>>>>>>>binding was
> >>>>>>>>>      deferred
> >>>>>>>>>      >>>>   until
> >>>>>>>>>      >>>>   >       next execbuff.
> >>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
> >>>>>>>>>that sense).
> >>>>>>>>>      >>>>But yah,
> >>>>>>>>>      >>>>   this is
> >>>>>>>>>      >>>>   > confusing
> >>>>>>>>>      >>>>   >       and will fix it.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind
> >>>>>>>>>      >>>>operation when
> >>>>>>>>>      >>>>   no
> >>>>>>>>>      >>>>   >       input fence
> >>>>>>>>>      >>>>   > >> is provided. Let say I do :
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> In what order are the fences going to
> >>>>>>>>>be signaled?
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out
> >>>>>>>>>of order?
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> Because you wrote "serialized I assume
> >>>>>>>>>it's : in
> >>>>>>>>>      order
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND
> >>>>>>>>>ioctls. Note that
> >>>>>>>>>      >>>>bind and
> >>>>>>>>>      >>>>   unbind
> >>>>>>>>>      >>>>   >       will use
> >>>>>>>>>      >>>>   >       the same queue and hence are ordered.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> One thing I didn't realize is that
> >>>>>>>>>because we only
> >>>>>>>>>      get one
> >>>>>>>>>      >>>>   > "VM_BIND" engine,
> >>>>>>>>>      >>>>   > >> there is a disconnect from the Vulkan
> >>>>>>>>>specification.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are
> >>>>>>>>>serialized but
> >>>>>>>>>      >>>>per engine.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> So you could have something like this :
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1,
> >>>>>>>>>      out_fence=fence2)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3,
> >>>>>>>>>      out_fence=fence4)
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> fence1 is not signaled
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> fence3 is signaled
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the
> >>>>>>>>>      >>>>first VM_BIND.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in
> >>>>>>>>>      >>>>userspace by doing
> >>>>>>>>>      >>>>   the
> >>>>>>>>>      >>>>   >       wait
> >>>>>>>>>      >>>>   > >> ourselves in one thread per engines.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input
> >>>>>>>>>fences useless.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> Daniel : what do you think? Should be
> >>>>>>>>>rework this or
> >>>>>>>>>      just
> >>>>>>>>>      >>>>   deal with
> >>>>>>>>>      >>>>   >       wait
> >>>>>>>>>      >>>>   > >> fences in userspace?
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   >       >My opinion is rework this but make the
> >>>>>>>>>ordering via
> >>>>>>>>>      >>>>an engine
> >>>>>>>>>      >>>>   param
> >>>>>>>>>      >>>>   > optional.
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> >>>>>>>>>are ordered
> >>>>>>>>>      >>>>within the
> >>>>>>>>>      >>>>   VM
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds
> >>>>>>>>>accept an
> >>>>>>>>>      engine
> >>>>>>>>>      >>>>   argument
> >>>>>>>>>      >>>>   >       (in
> >>>>>>>>>      >>>>   > >the case of the i915 likely this is a
> >>>>>>>>>gem context
> >>>>>>>>>      >>>>handle) and
> >>>>>>>>>      >>>>   binds
> >>>>>>>>>      >>>>   > >ordered with respect to that engine.
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >This gives UMDs options as the later
> >>>>>>>>>likely consumes
> >>>>>>>>>      >>>>more KMD
> >>>>>>>>>      >>>>   > resources
> >>>>>>>>>      >>>>   >       >so if a different UMD can live with
> >>>>binds being
> >>>>>>>>>      >>>>ordered within
> >>>>>>>>>      >>>>   the VM
> >>>>>>>>>      >>>>   > >they can use a mode consuming less resources.
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       I think we need to be careful here if we
> >>>>>>>>>are looking
> >>>>>>>>>      for some
> >>>>>>>>>      >>>>   out of
> >>>>>>>>>      >>>>   > (submission) order completion of vm_bind/unbind.
> >>>>>>>>>      >>>>   > In-order completion means, in a batch of
> >>>>>>>>>binds and
> >>>>>>>>>      >>>>unbinds to be
> >>>>>>>>>      >>>>   > completed in-order, user only needs to specify
> >>>>>>>>>      >>>>in-fence for the
> >>>>>>>>>      >>>>   >       first bind/unbind call and the our-fence
> >>>>>>>>>for the last
> >>>>>>>>>      >>>>   bind/unbind
> >>>>>>>>>      >>>>   >       call. Also, the VA released by an unbind
> >>>>>>>>>call can be
> >>>>>>>>>      >>>>re-used by
> >>>>>>>>>      >>>>   >       any subsequent bind call in that
> >>>>in-order batch.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       These things will break if
> >>>>>>>>>binding/unbinding were to
> >>>>>>>>>      >>>>be allowed
> >>>>>>>>>      >>>>   to
> >>>>>>>>>      >>>>   >       go out of order (of submission) and user
> >>>>>>>>>need to be
> >>>>>>>>>      extra
> >>>>>>>>>      >>>>   careful
> >>>>>>>>>      >>>>   >       not to run into pre-mature triggereing of
> >>>>>>>>>out-fence and
> >>>>>>>>>      bind
> >>>>>>>>>      >>>>   failing
> >>>>>>>>>      >>>>   >       as VA is still in use etc.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided
> >>>>mapping on the
> >>>>>>>>>      specified
> >>>>>>>>>      >>>>   address
> >>>>>>>>>      >>>>   >       space
> >>>>>>>>>      >>>>   >       (VM). So, the uapi is not engine/context
> >>>>>>>>>specific.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi
> >>>>>>>>>which can be
> >>>>>>>>>      >>>>one from
> >>>>>>>>>      >>>>   the
> >>>>>>>>>      >>>>   > pre-defined queues,
> >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
> >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
> >>>>>>>>>      >>>>   >       ...
> >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       KMD will spawn an async work queue for
> >>>>>>>>>each queue which
> >>>>>>>>>      will
> >>>>>>>>>      >>>>   only
> >>>>>>>>>      >>>>   >       bind the mappings on that queue in the
> >>>>order of
> >>>>>>>>>      submission.
> >>>>>>>>>      >>>>   >       User can assign the queue to per engine
> >>>>>>>>>or anything
> >>>>>>>>>      >>>>like that.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       But again here, user need to be
> >>>>careful and not
> >>>>>>>>>      >>>>deadlock these
> >>>>>>>>>      >>>>   >       queues with circular dependency of fences.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >       I prefer adding this later an as
> >>>>>>>>>extension based on
> >>>>>>>>>      >>>>whether it
> >>>>>>>>>      >>>>   >       is really helping with the implementation.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >     I can tell you right now that having
> >>>>>>>>>everything on a
> >>>>>>>>>      single
> >>>>>>>>>      >>>>   in-order
> >>>>>>>>>      >>>>   >     queue will not get us the perf we want.
> >>>>>>>>>What vulkan
> >>>>>>>>>      >>>>really wants
> >>>>>>>>>      >>>>   is one
> >>>>>>>>>      >>>>   >     of two things:
> >>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND
> >>>>ops.  They just
> >>>>>>>>>      happen in
> >>>>>>>>>      >>>>   whatever
> >>>>>>>>>      >>>>   >     their dependencies are resolved and we
> >>>>>>>>>ensure ordering
> >>>>>>>>>      >>>>ourselves
> >>>>>>>>>      >>>>   by
> >>>>>>>>>      >>>>   >     having a syncobj in the VkQueue.
> >>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND
> >>>>>>>>>queues.  We
> >>>>>>>>>      need at
> >>>>>>>>>      >>>>   least 2
> >>>>>>>>>      >>>>   >     but I don't see why there needs to be a
> >>>>>>>>>limit besides
> >>>>>>>>>      >>>>the limits
> >>>>>>>>>      >>>>   the
> >>>>>>>>>      >>>>   >     i915 API already has on the number of
> >>>>>>>>>engines.  Vulkan
> >>>>>>>>>      could
> >>>>>>>>>      >>>>   expose
> >>>>>>>>>      >>>>   >     multiple sparse binding queues to the
> >>>>>>>>>client if it's not
> >>>>>>>>>      >>>>   arbitrarily
> >>>>>>>>>      >>>>   >     limited.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Thanks Jason, Lionel.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Jason, what are you referring to when you say
> >>>>>>>>>"limits the i915
> >>>>>>>>>      API
> >>>>>>>>>      >>>>   already
> >>>>>>>>>      >>>>   has on the number of engines"? I am not sure if
> >>>>>>>>>there is such
> >>>>>>>>>      an uapi
> >>>>>>>>>      >>>>   today.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> There's a limit of something like 64 total engines
> >>>>>>>>>today based on
> >>>>>>>>>      the
> >>>>>>>>>      >>>> number of bits we can cram into the exec flags in
> >>>>>>>>>execbuffer2.  I
> >>>>>>>>>      think
> >>>>>>>>>      >>>> someone had an extended version that allowed more
> >>>>>>>>>but I ripped it
> >>>>>>>>>      out
> >>>>>>>>>      >>>> because no one was using it.  Of course,
> >>>>>>>>>execbuffer3 might not
> >>>>>>>>>      >>>>have that
> >>>>>>>>>      >>>> problem at all.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>Thanks Jason.
> >>>>>>>>>      >>>Ok, I am not sure which exec flag is that, but yah,
> >>>>>>>>>execbuffer3
> >>>>>>>>>      probably
> >>>>>>>>>      >>>will not have this limiation. So, we need to define a
> >>>>>>>>>      VM_BIND_MAX_QUEUE
> >>>>>>>>>      >>>and somehow export it to user (I am thinking of
> >>>>>>>>>embedding it in
> >>>>>>>>>      >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND,
> >>>>bits[1-3]->'n'
> >>>>>>>>>      meaning 2^n
> >>>>>>>>>      >>>queues.
> >>>>>>>>>      >>
> >>>>>>>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK
> >>>>>>>>>(0x3f) which
> >>>>>>>>>      execbuf3
> >>>>>>>>>
> >>>>>>>>>    Yup!  That's exactly the limit I was talking about.
> >>>>>>>>>
> >>>>>>>>>      >>will also have. So, we can simply define in vm_bind/unbind
> >>>>>>>>>      structures,
> >>>>>>>>>      >>
> >>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> >>>>>>>>>      >>        __u32 queue;
> >>>>>>>>>      >>
> >>>>>>>>>      >>I think that will keep things simple.
> >>>>>>>>>      >
> >>>>>>>>>      >Hmmm? What does execbuf2 limit has to do with how
> >>>>many engines
> >>>>>>>>>      >hardware can have? I suggest not to do that.
> >>>>>>>>>      >
> >>>>>>>>>      >Change with added this:
> >>>>>>>>>      >
> >>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> >>>>>>>>>      >               return -EINVAL;
> >>>>>>>>>      >
> >>>>>>>>>      >To context creation needs to be undone and so let users
> >>>>>>>>>create engine
> >>>>>>>>>      >maps with all hardware engines, and let execbuf3 access
> >>>>>>>>>them all.
> >>>>>>>>>      >
> >>>>>>>>>
> >>>>>>>>>      Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to
> >>>>>>>>>execbuff3 also.
> >>>>>>>>>      Hence, I was using the same limit for VM_BIND queues
> >>>>>>>>>(64, or 65 if we
> >>>>>>>>>      make it N+1).
> >>>>>>>>>      But, as discussed in other thread of this RFC series, we
> >>>>>>>>>are planning
> >>>>>>>>>      to drop this I915_EXEC_RING_MAP in execbuff3. So,
> >>>>there won't be
> >>>>>>>>>      any uapi that limits the number of engines (and hence
> >>>>>>>>>the vm_bind
> >>>>>>>>>      queues
> >>>>>>>>>      need to be supported).
> >>>>>>>>>
> >>>>>>>>>      If we leave the number of vm_bind queues to be
> >>>>arbitrarily large
> >>>>>>>>>      (__u32 queue_idx) then, we need to have a hashmap for
> >>>>>>>>>queue (a wq,
> >>>>>>>>>      work_item and a linked list) lookup from the user
> >>>>>>>>>specified queue
> >>>>>>>>>      index.
> >>>>>>>>>      Other option is to just put some hard limit (say 64 or
> >>>>>>>>>65) and use
> >>>>>>>>>      an array of queues in VM (each created upon first use).
> >>>>>>>>>I prefer this.
> >>>>>>>>>
> >>>>>>>>>    I don't get why a VM_BIND queue is any different from any
> >>>>>>>>>other queue or
> >>>>>>>>>    userspace-visible kernel object.  But I'll leave those
> >>>>>>>>>details up to
> >>>>>>>>>    danvet or whoever else might be reviewing the
> implementation.
> >>>>>>>>>    --Jason
> >>>>>>>>>
> >>>>>>>>>  I kind of agree here. Wouldn't be simpler to have the bind
> >>>>>>>>>queue created
> >>>>>>>>>  like the others when we build the engine map?
> >>>>>>>>>
> >>>>>>>>>  For userspace it's then just matter of selecting the right
> >>>>>>>>>queue ID when
> >>>>>>>>>  submitting.
> >>>>>>>>>
> >>>>>>>>>  If there is ever a possibility to have this work on the GPU,
> >>>>>>>>>it would be
> >>>>>>>>>  all ready.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>I did sync offline with Matt Brost on this.
> >>>>>>>>We can add a VM_BIND engine class and let user create VM_BIND
> >>>>>>>>engines (queues).
> >>>>>>>>The problem is, in i915 engine creating interface is bound to
> >>>>>>>>gem_context.
> >>>>>>>>So, in vm_bind ioctl, we would need both context_id and
> >>>>>>>>queue_idx for proper
> >>>>>>>>lookup of the user created engine. This is bit ackward as
> >>>>vm_bind is an
> >>>>>>>>interface to VM (address space) and has nothing to do with
> >>>>gem_context.
> >>>>>>>
> >>>>>>>
> >>>>>>>A gem_context has a single vm object right?
> >>>>>>>
> >>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a
> default
> >>>>>>>one if not.
> >>>>>>>
> >>>>>>>So it's just like picking up the vm like it's done at execbuffer
> >>>>>>>time right now : eb->context->vm
> >>>>>>>
> >>>>>>
> >>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the
> >>>>>>VM_BIND/UNBIND
> >>>>>>ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can
> be
> >>>>>>obtained
> >>>>>>from the context?
> >>>>>
> >>>>>
> >>>>>Yes, because if we go for engines, they're associated with a context
> >>>>>and so also associated with the VM bound to the context.
> >>>>>
> >>>>
> >>>>Hmm...context doesn't sould like the right interface. It should be
> >>>>VM and engine (independent of context). Engine can be virtual or soft
> >>>>engine (kernel thread), each with its own queue. We can add an
> >>>>interface
> >>>>to create such engines (independent of context). But we are anway
> >>>>implicitly creating it when user uses a new queue_idx. If in future
> >>>>we have hardware engines for VM_BIND operation, we can have that
> >>>>explicit inteface to create engine instances and the queue_index
> >>>>in vm_bind/unbind will point to those engines.
> >>>>Anyone has any thoughts? Daniel?
> >>>
> >>>Exposing gem_context or intel_context to user space is a strange
> >>>concept to me. A context represent some hw resources that is used
> >>>to complete certain task. User space should care allocate some
> >>>resources (memory, queues) and submit tasks to queues. But user
> >>>space doesn't care how certain task is mapped to a HW context -
> >>>driver/guc should take care of this.
> >>>
> >>>So a cleaner interface to me is: user space create a vm,  create
> >>>gem object, vm_bind it to a vm; allocate queues (internally
> >>>represent compute or blitter HW. Queue can be virtual to user) for
> >>>this vm; submit tasks to queues. User can create multiple queues
> >>>under one vm. One queue is only for one vm.
> >>>
> >>>I915 driver/guc manage the hw compute or blitter resources which
> >>>is transparent to user space. When i915 or guc decide to schedule
> >>>a queue (run tasks on that queue), a HW engine will be pick up and
> >>>set up properly for the vm of that queue (ie., switch to page
> >>>tables of that vm) - this is a context switch.
> >>>
> >>>From vm_bind perspective, it simply bind a gem_object to a vm.
> >>>Engine/queue is not a parameter to vm_bind, as any engine can be
> >>>pick up by i915/guc to execute a task using the vm bound va.
> >>>
> >>>I didn't completely follow the discussion here. Just share some
> >>>thoughts.
> >>>
> >>
> >>Yah, I agree.
> >>
> >>Lionel,
> >>How about we define the queue as
> >>union {
> >>       __u32 queue_idx;
> >>       __u64 rsvd;
> >>}
> >>
> >>If required, we can extend by expanding the 'rsvd' field to <ctx_id,
> >>queue_idx> later
> >>with a flag.
> >>
> >>Niranjana
> >
> >
> >I did not really understand Oak's comment nor what you're suggesting
> >here to be honest.
> >
> >
> >First the GEM context is already exposed to userspace. It's explicitly
> >created by userpace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
> >
> >We give the GEM context id in every execbuffer we do with
> >drm_i915_gem_execbuffer2::rsvd1.
> >
> >It's still in the new execbuffer3 proposal being discussed.
> >
> >
> >Second, the GEM context is also where we set the VM with
> >I915_CONTEXT_PARAM_VM.
> >
> >
> >Third, the GEM context also has the list of engines with
> >I915_CONTEXT_PARAM_ENGINES.
> >
> 
> Yes, the execbuf and engine map creation are tied to gem_context.
> (which probably is not the best interface.)
> 
> >
> >So it makes sense to me to dispatch the vm_bind operation to a GEM
> >context, to a given vm_bind queue, because it's got all the
> >information required :
> >
> >    - the list of new vm_bind queues
> >
> >    - the vm that is going to be modified
> >
> 
> But the operation is performed here on the address space (VM) which
> can have multiple gem_contexts referring to it. So, VM is the right
> interface here. We need not 'gem_context'ify it.
> 
> All we need is multiple queue support for the address space (VM).
> Going to gem_context for that just because we have engine creation
> support there seems unnecessay and not correct to me.
> 
> >
> >Otherwise where do the vm_bind queues live?
> >
> >In the i915/drm fd object?
> >
> >That would mean that all the GEM contexts are sharing the same vm_bind
> >queues.
> >
> 
> Not all, only the gem contexts that are using the same address space (VM).
> But to me the right way to describe would be that "VM will be using those
> queues".


I hope by "queue" here you mean a HW resource  that will be later used to execute the job, for example a ccs compute engine. Of course queue can be virtual so user can create more queues than what hw physically has. 

To express the concept of "the VM will be using those queues", I think it makes sense to have a create_queue(vm) function taking a vm parameter. This means the queue is created for the purpose of submitting jobs under this VM. Later on, we can submit jobs (referring to objects vm_bound to the same vm) to the queue. The vm_bind ioctl doesn't need a queue parameter, just vm_bind(object, va, vm).

I hope the "queue" here is not the engine used to perform the vm_bind operation itself. But if you meant a queue/engine to perform vm_bind itself (vs a queue/engine for later job submission), then we can discuss more. I know xe driver have similar concept and I think align the design early can benefit the migration to xe driver.

Regards,
Oak

> 
> Niranjana
> 
> >
> >intel_context or GuC are internal details we're not concerned about.
> >
> >I don't really see the connection with the GEM context.
> >
> >
> >Maybe Oak has a different use case than Vulkan.
> >
> >
> >-Lionel
> >
> >
> >>
> >>>Regards,
> >>>Oak
> >>>
> >>>>
> >>>>Niranjana
> >>>>
> >>>>>
> >>>>>>I think the interface is clean as a interface to VM. It is
> >>>>only that we
> >>>>>>don't have a clean way to create a raw VM_BIND engine (not
> >>>>>>associated with
> >>>>>>any context) with i915 uapi.
> >>>>>>May be we can add such an interface, but I don't think that is
> >>>>worth it
> >>>>>>(we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I
> >>>>>>mentioned
> >>>>>>above).
> >>>>>>Anyone has any thoughts?
> >>>>>>
> >>>>>>>
> >>>>>>>>Another problem is, if two VMs are binding with the same defined
> >>>>>>>>engine,
> >>>>>>>>binding on VM1 can get unnecessary blocked by binding on VM2
> >>>>>>>>(which may be
> >>>>>>>>waiting on its in_fence).
> >>>>>>>
> >>>>>>>
> >>>>>>>Maybe I'm missing something, but how can you have 2 vm objects
> >>>>>>>with a single gem_context right now?
> >>>>>>>
> >>>>>>
> >>>>>>No, we don't have 2 VMs for a gem_context.
> >>>>>>Say if ctx1 with vm1 and ctx2 with vm2.
> >>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> >>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If
> >>>>>>those two queue indicies points to same underlying vm_bind engine,
> >>>>>>then the second vm_bind call gets blocked until the first
> >>>>vm_bind call's
> >>>>>>'in' fence is triggered and bind completes.
> >>>>>>
> >>>>>>With per VM queues, this is not a problem as two VMs will not endup
> >>>>>>sharing same queue.
> >>>>>>
> >>>>>>BTW, I just posted a updated PATCH series.
> >>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
> >>>>>>
> >>>>>>Niranjana
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>So, my preference here is to just add a 'u32 queue' index in
> >>>>>>>>vm_bind/unbind
> >>>>>>>>ioctl, and the queues are per VM.
> >>>>>>>>
> >>>>>>>>Niranjana
> >>>>>>>>
> >>>>>>>>>  Thanks,
> >>>>>>>>>
> >>>>>>>>>  -Lionel
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>      Niranjana
> >>>>>>>>>
> >>>>>>>>>      >Regards,
> >>>>>>>>>      >
> >>>>>>>>>      >Tvrtko
> >>>>>>>>>      >
> >>>>>>>>>      >>
> >>>>>>>>>      >>Niranjana
> >>>>>>>>>      >>
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>>   I am trying to see how many queues we need and
> >>>>>>>>>don't want it to
> >>>>>>>>>      be
> >>>>>>>>>      >>>>   arbitrarily
> >>>>>>>>>      >>>>   large and unduely blow up memory usage and
> >>>>>>>>>complexity in i915
> >>>>>>>>>      driver.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the
> >>>>>>>>>vast majority
> >>>>>>>>>      >>>>of cases. I
> >>>>>>>>>      >>>> could imagine a client wanting to create more
> >>>>than 1 sparse
> >>>>>>>>>      >>>>queue in which
> >>>>>>>>>      >>>> case, it'll be N+1 but that's unlikely. As far as
> >>>>>>>>>complexity
> >>>>>>>>>      >>>>goes, once
> >>>>>>>>>      >>>> you allow two, I don't think the complexity is
> >>>>going up by
> >>>>>>>>>      >>>>allowing N.  As
> >>>>>>>>>      >>>> for memory usage, creating more queues means more
> >>>>>>>>>memory.  That's
> >>>>>>>>>      a
> >>>>>>>>>      >>>> trade-off that userspace can make. Again, the
> >>>>>>>>>expected number
> >>>>>>>>>      >>>>here is 1
> >>>>>>>>>      >>>> or 2 in the vast majority of cases so I don't think
> >>>>>>>>>you need to
> >>>>>>>>>      worry.
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> >>>>>>>>>      >>>That would require us create 8 workqueues.
> >>>>>>>>>      >>>We can change 'n' later if required.
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>Niranjana
> >>>>>>>>>      >>>
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind
> >>>>>>>>>      >>>>operations and we
> >>>>>>>>>      >>>>   don't
> >>>>>>>>>      >>>>   >     want any dependencies between them:
> >>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO
> >>>>>>>>>creation or
> >>>>>>>>>      >>>>maybe as
> >>>>>>>>>      >>>>   part of
> >>>>>>>>>      >>>>   > vkBindImageMemory() or
> VkBindBufferMemory().  These
> >>>>>>>>>      >>>>don't happen
> >>>>>>>>>      >>>>   on a
> >>>>>>>>>      >>>>   >     queue and we don't want them serialized
> >>>>>>>>>with anything.       To
> >>>>>>>>>      >>>>   synchronize
> >>>>>>>>>      >>>>   >     with submit, we'll have a syncobj in the
> >>>>>>>>>VkDevice which
> >>>>>>>>>      is
> >>>>>>>>>      >>>>   signaled by
> >>>>>>>>>      >>>>   >     all immediate bind operations and make
> >>>>>>>>>submits wait on
> >>>>>>>>>      it.
> >>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a
> >>>>>>>>>VkQueue which may
> >>>>>>>>>      be the
> >>>>>>>>>      >>>>   same as
> >>>>>>>>>      >>>>   >     a render/compute queue or may be its own
> >>>>>>>>>queue.  It's up
> >>>>>>>>>      to us
> >>>>>>>>>      >>>>   what we
> >>>>>>>>>      >>>>   >     want to advertise.  From the Vulkan API
> >>>>>>>>>PoV, this is like
> >>>>>>>>>      any
> >>>>>>>>>      >>>>   other
> >>>>>>>>>      >>>>   >     queue. Operations on it wait on and signal
> >>>>>>>>>semaphores.       If we
> >>>>>>>>>      >>>>   have a
> >>>>>>>>>      >>>>   >     VM_BIND engine, we'd provide syncobjs to
> >>>>wait and
> >>>>>>>>>      >>>>signal just like
> >>>>>>>>>      >>>>   we do
> >>>>>>>>>      >>>>   >     in execbuf().
> >>>>>>>>>      >>>>   >     The important thing is that we don't want
> >>>>>>>>>one type of
> >>>>>>>>>      >>>>operation to
> >>>>>>>>>      >>>>   block
> >>>>>>>>>      >>>>   >     on the other.  If immediate binds are
> >>>>>>>>>blocking on sparse
> >>>>>>>>>      binds,
> >>>>>>>>>      >>>>   it's
> >>>>>>>>>      >>>>   >     going to cause over-synchronization issues.
> >>>>>>>>>      >>>>   >     In terms of the internal implementation, I
> >>>>>>>>>know that
> >>>>>>>>>      >>>>there's going
> >>>>>>>>>      >>>>   to be
> >>>>>>>>>      >>>>   >     a lock on the VM and that we can't actually
> >>>>>>>>>do these
> >>>>>>>>>      things in
> >>>>>>>>>      >>>>   > parallel.  That's fine. Once the dma_fences have
> >>>>>>>>>      signaled and
> >>>>>>>>>      >>>>   we're
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Thats correct. It is like a single VM_BIND
> >>>>engine with
> >>>>>>>>>      >>>>multiple queues
> >>>>>>>>>      >>>>   feeding to it.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> Right.  As long as the queues themselves are
> >>>>>>>>>independent and
> >>>>>>>>>      >>>>can block on
> >>>>>>>>>      >>>> dma_fences without holding up other queues, I think
> >>>>>>>>>we're fine.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if
> >>>>>>>>>      >>>>there's a bit
> >>>>>>>>>      >>>>   of
> >>>>>>>>>      >>>>   > synchronization due to locking.  That's
> >>>>>>>>>expected.  What
> >>>>>>>>>      >>>>we can't
> >>>>>>>>>      >>>>   afford
> >>>>>>>>>      >>>>   >     to have is an immediate bind operation
> >>>>>>>>>suddenly blocking
> >>>>>>>>>      on a
> >>>>>>>>>      >>>>   sparse
> >>>>>>>>>      >>>>   > operation which is blocked on a compute job
> >>>>>>>>>that's going
> >>>>>>>>>      to run
> >>>>>>>>>      >>>>   for
> >>>>>>>>>      >>>>   >     another 5ms.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one
> VM
> >>>>>>>>>doesn't block
> >>>>>>>>>      the
> >>>>>>>>>      >>>>   VM_BIND
> >>>>>>>>>      >>>>   on other VMs. I am not sure about usecases
> >>>>here, but just
> >>>>>>>>>      wanted to
> >>>>>>>>>      >>>>   clarify.
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>> Yes, that's what I would expect.
> >>>>>>>>>      >>>> --Jason
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   Niranjana
> >>>>>>>>>      >>>>
> >>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing
> >>>>>>>>>      arbitrarily many
> >>>>>>>>>      >>>>   paging
> >>>>>>>>>      >>>>   >     queues (what they call a VM_BIND
> >>>>>>>>>engine/queue).  That
> >>>>>>>>>      >>>>design works
> >>>>>>>>>      >>>>   >     pretty well and solves the problems in
> >>>>>>>>>question.       >>>>Again, we could
> >>>>>>>>>      >>>>   just
> >>>>>>>>>      >>>>   >     make everything out-of-order and require
> >>>>>>>>>using syncobjs
> >>>>>>>>>      >>>>to order
> >>>>>>>>>      >>>>   things
> >>>>>>>>>      >>>>   >     as userspace wants. That'd be fine too.
> >>>>>>>>>      >>>>   >     One more note while I'm here: danvet said
> >>>>>>>>>something on
> >>>>>>>>>      >>>>IRC about
> >>>>>>>>>      >>>>   VM_BIND
> >>>>>>>>>      >>>>   >     queues waiting for syncobjs to
> >>>>>>>>>materialize.  We don't
> >>>>>>>>>      really
> >>>>>>>>>      >>>>   want/need
> >>>>>>>>>      >>>>   >     this. We already have all the machinery in
> >>>>>>>>>userspace to
> >>>>>>>>>      handle
> >>>>>>>>>      >>>>   > wait-before-signal and waiting for syncobj
> >>>>>>>>>fences to
> >>>>>>>>>      >>>>materialize
> >>>>>>>>>      >>>>   and
> >>>>>>>>>      >>>>   >     that machinery is on by default.  It
> >>>>would actually
> >>>>>>>>>      >>>>take MORE work
> >>>>>>>>>      >>>>   in
> >>>>>>>>>      >>>>   >     Mesa to turn it off and take advantage of
> >>>>>>>>>the kernel
> >>>>>>>>>      >>>>being able to
> >>>>>>>>>      >>>>   wait
> >>>>>>>>>      >>>>   >     for syncobjs to materialize. Also, getting
> >>>>>>>>>that right is
> >>>>>>>>>      >>>>   ridiculously
> >>>>>>>>>      >>>>   >     hard and I really don't want to get it
> >>>>>>>>>wrong in kernel
> >>>>>>>>>      >>>>space.  When we
> >>>>>>>>>      >>>>   >     do memory fences, wait-before-signal will
> >>>>>>>>>be a thing.  We
> >>>>>>>>>      don't
> >>>>>>>>>      >>>>   need to
> >>>>>>>>>      >>>>   >     try and make it a thing for syncobj.
> >>>>>>>>>      >>>>   >     --Jason
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   Thanks Jason,
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that
> >>>>>>>>>we're allowed to
> >>>>>>>>>      have a
> >>>>>>>>>      >>>>   sparse
> >>>>>>>>>      >>>>   >   queue that does not implement either graphics
> >>>>>>>>>or compute
> >>>>>>>>>      >>>>operations
> >>>>>>>>>      >>>>   :
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >     "While some implementations may include
> >>>>>>>>>      >>>> VK_QUEUE_SPARSE_BINDING_BIT
> >>>>>>>>>      >>>>   >     support in queue families that also include
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > graphics and compute support, other
> >>>>>>>>>implementations may
> >>>>>>>>>      only
> >>>>>>>>>      >>>>   expose a
> >>>>>>>>>      >>>>   > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > family."
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   So it can all be all a vm_bind engine that
> >>>>just does
> >>>>>>>>>      bind/unbind
> >>>>>>>>>      >>>>   > operations.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   But yes we need another engine for the
> >>>>>>>>>immediate/non-sparse
> >>>>>>>>>      >>>>   operations.
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >   -Lionel
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   >         >
> >>>>>>>>>      >>>>   > Daniel, any thoughts?
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > Niranjana
> >>>>>>>>>      >>>>   >
> >>>>>>>>>      >>>>   > >Matt
> >>>>>>>>>      >>>>   >       >
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> Sorry I noticed this late.
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >> -Lionel
> >>>>>>>>>      >>>>   > >>
> >>>>>>>>>      >>>>   > >>
> >>>>>>>
> >>>>>>>
> >>>>>
> >
> >
Zeng, Oak June 14, 2022, 9:47 p.m. UTC | #37
Thanks,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Zeng, Oak
> Sent: June 14, 2022 5:13 PM
> To: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>;
> Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Wilson, Chris P
> <chris.p.wilson@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; Maling list - DRI developers <dri-
> devel@lists.freedesktop.org>; Vetter, Daniel <daniel.vetter@intel.com>;
> Christian König <christian.koenig@amd.com>
> Subject: RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> 
> 
> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
> > Sent: June 14, 2022 1:02 PM
> > To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> > Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-
> > gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-
> > devel@lists.freedesktop.org>; Hellstrom, Thomas
> > <thomas.hellstrom@intel.com>; Wilson, Chris P
> <chris.p.wilson@intel.com>;
> > Vetter, Daniel <daniel.vetter@intel.com>; Christian König
> > <christian.koenig@amd.com>
> > Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> > document
> >
> > On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> > >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> > >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> > >>>
> > >>>
> > >>>Regards,
> > >>>Oak
> > >>>
> > >>>>-----Original Message-----
> > >>>>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On
> > >>>>Behalf Of Niranjana
> > >>>>Vishwanathapura
> > >>>>Sent: June 10, 2022 1:43 PM
> > >>>>To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
> > >>>>Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list -
> > >>>>DRI developers <dri-
> > >>>>devel@lists.freedesktop.org>; Hellstrom, Thomas
> > >>>><thomas.hellstrom@intel.com>;
> > >>>>Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel
> > >>>><daniel.vetter@intel.com>; Christian König
> > <christian.koenig@amd.com>
> > >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND
> > >>>>feature design
> > >>>>document
> > >>>>
> > >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> > >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> > >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> > >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> > >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin
> wrote:
> > >>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> > >>>>>>>>>
> > >>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> > >>>>>>>>> <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>      On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
> > >>>>Ursulin wrote:
> > >>>>>>>>>      >
> > >>>>>>>>>      >
> > >>>>>>>>>      >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> > >>>>>>>>>      >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> > >>>>>>>>>Ekstrand wrote:
> > >>>>>>>>>      >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
> > >>>>Vishwanathapura
> > >>>>>>>>>      >>>> <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> > >>>>>>>>>Landwerlin
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >     On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>      >>>>   > <niranjana.vishwanathapura@intel.com> wrote:
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       On Wed, Jun 01, 2022 at 01:28:36PM
> > >>>>-0700, Matthew
> > >>>>>>>>>      >>>>Brost wrote:
> > >>>>>>>>>      >>>>   >       >On Wed, Jun 01, 2022 at 05:25:49PM
> > >>>>+0300, Lionel
> > >>>>>>>>>      Landwerlin
> > >>>>>>>>>      >>>>   wrote:
> > >>>>>>>>>      >>>>   > >> On 17/05/2022 21:32, Niranjana
> Vishwanathapura
> > >>>>>>>>>      wrote:
> > >>>>>>>>>      >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
> > >>>>>>>>>      >>>>   binding/unbinding
> > >>>>>>>>>      >>>>   >       the mapping in an
> > >>>>>>>>>      >>>>   > >> > +async worker. The binding and
> > >>>>>>>>>unbinding will
> > >>>>>>>>>      >>>>work like a
> > >>>>>>>>>      >>>>   special
> > >>>>>>>>>      >>>>   >       GPU engine.
> > >>>>>>>>>      >>>>   > >> > +The binding and unbinding operations are
> > >>>>>>>>>      serialized and
> > >>>>>>>>>      >>>>   will
> > >>>>>>>>>      >>>>   >       wait on specified
> > >>>>>>>>>      >>>>   > >> > +input fences before the operation
> > >>>>>>>>>and will signal
> > >>>>>>>>>      the
> > >>>>>>>>>      >>>>   output
> > >>>>>>>>>      >>>>   >       fences upon the
> > >>>>>>>>>      >>>>   > >> > +completion of the operation. Due to
> > >>>>>>>>>      serialization,
> > >>>>>>>>>      >>>>   completion of
> > >>>>>>>>>      >>>>   >       an operation
> > >>>>>>>>>      >>>>   > >> > +will also indicate that all
> > >>>>>>>>>previous operations
> > >>>>>>>>>      >>>>are also
> > >>>>>>>>>      >>>>   > complete.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I guess we should avoid saying "will
> > >>>>>>>>>immediately
> > >>>>>>>>>      start
> > >>>>>>>>>      >>>>   > binding/unbinding" if
> > >>>>>>>>>      >>>>   > >> there are fences involved.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> And the fact that it's happening in an async
> > >>>>>>>>>      >>>>worker seem to
> > >>>>>>>>>      >>>>   imply
> > >>>>>>>>>      >>>>   >       it's not
> > >>>>>>>>>      >>>>   > >> immediate.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >       Ok, will fix.
> > >>>>>>>>>      >>>>   >       This was added because in earlier design
> > >>>>>>>>>binding was
> > >>>>>>>>>      deferred
> > >>>>>>>>>      >>>>   until
> > >>>>>>>>>      >>>>   >       next execbuff.
> > >>>>>>>>>      >>>>   >       But now it is non-deferred (immediate in
> > >>>>>>>>>that sense).
> > >>>>>>>>>      >>>>But yah,
> > >>>>>>>>>      >>>>   this is
> > >>>>>>>>>      >>>>   > confusing
> > >>>>>>>>>      >>>>   >       and will fix it.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I have a question on the behavior of the bind operation when no input fence is provided. Let say I do :
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence1)
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence2)
> > >>>>>>>>>      >>>>   > >> VM_BIND (out_fence=fence3)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> In what order are the fences going to be signaled?
> > >>>>>>>>>      >>>>   > >> In the order of VM_BIND ioctls? Or out of order?
> > >>>>>>>>>      >>>>   > >> Because you wrote "serialized I assume it's : in order
> > >>>>>>>>>
> > >>>>>>>>>      >>>>   >       Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered.
> > >>>>>>>>>
> > >>>>>>>>>      >>>>   > >> One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
> > >>>>>>>>>      >>>>   > >> In Vulkan VM_BIND operations are serialized but per engine.
> > >>>>>>>>>      >>>>   > >> So you could have something like this :
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> > >>>>>>>>>      >>>>   > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> fence1 is not signaled
> > >>>>>>>>>      >>>>   > >> fence3 is signaled
> > >>>>>>>>>      >>>>   > >> So the second VM_BIND will proceed before the first VM_BIND.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engines.
> > >>>>>>>>>      >>>>   > >> But then it makes the VM_BIND input fences useless.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> Daniel : what do you think? Should be rework this or just deal with wait fences in userspace?
> > >>>>>>>>>
> > >>>>>>>>>      >>>>   >       >My opinion is rework this but make the ordering via an engine param optional.
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds are ordered within the VM
> > >>>>>>>>>      >>>>   > >e.g. A VM can be configured so all binds accept an engine argument (in the case of the i915 likely this is a gem context handle) and binds ordered with respect to that engine.
> > >>>>>>>>>      >>>>   > >This gives UMDs options as the later likely consumes more KMD resources so if a different UMD can live with binds being ordered within the VM they can use a mode consuming less resources.
> > >>>>>>>>>
> > >>>>>>>>>      >>>>   >       I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind.
> > >>>>>>>>>      >>>>   >       In-order completion means, in a batch of binds and unbinds to be completed in-order, user only needs to specify in-fence for the first bind/unbind call and the our-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch.
> > >>>>>>>>>      >>>>   >       These things will break if binding/unbinding were to be allowed to go out of order (of submission) and user need to be extra careful not to run into pre-mature triggereing of out-fence and bind failing as VA is still in use etc.
> > >>>>>>>>>      >>>>   >       Also, VM_BIND binds the provided mapping on the specified address space (VM). So, the uapi is not engine/context specific.
> > >>>>>>>>>      >>>>   >       We can however add a 'queue' to the uapi which can be one from the pre-defined queues,
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_0
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_1
> > >>>>>>>>>      >>>>   >       ...
> > >>>>>>>>>      >>>>   > I915_VM_BIND_QUEUE_(N-1)
> > >>>>>>>>>      >>>>   >       KMD will spawn an async work queue for each queue which will only bind the mappings on that queue in the order of submission.
> > >>>>>>>>>      >>>>   >       User can assign the queue to per engine or anything like that.
> > >>>>>>>>>      >>>>   >       But again here, user need to be careful and not deadlock these queues with circular dependency of fences.
> > >>>>>>>>>      >>>>   >       I prefer adding this later an as extension based on whether it is really helping with the implementation.
> > >>>>>>>>>
> > >>>>>>>>>      >>>>   >     I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What vulkan really wants is one of two things:
> > >>>>>>>>>      >>>>   >      1. No implicit ordering of VM_BIND ops.  They just happen in whatever their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
> > >>>>>>>>>      >>>>   >      2. The ability to create multiple VM_BIND queues.  We need at least 2 but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines.  Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
> > >>>>>>>>>
> > >>>>>>>>>      >>>>   Thanks Jason, Lionel.
> > >>>>>>>>>      >>>>   Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
> > >>>>>>>>>
> > >>>>>>>>>      >>>> There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2.  I think someone had an extended version that allowed more but I ripped it out because no one was using it.  Of course, execbuffer3 might not have that problem at all.
> > >>>>>>>>>
> > >>>>>>>>>      >>>Thanks Jason.
> > >>>>>>>>>      >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n queues.
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
> > >>>>>>>>>
> > >>>>>>>>>    Yup!  That's exactly the limit I was talking about.
> > >>>>>>>>>
> > >>>>>>>>>      >>will also have. So, we can simply define in vm_bind/unbind structures,
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>#define I915_VM_BIND_MAX_QUEUE   64
> > >>>>>>>>>      >>        __u32 queue;
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>I think that will keep things simple.
> > >>>>>>>>>      >
> > >>>>>>>>>      >Hmmm? What does execbuf2 limit has to do with how many engines hardware can have? I suggest not to do that.
> > >>>>>>>>>      >
> > >>>>>>>>>      >Change with added this:
> > >>>>>>>>>      >
> > >>>>>>>>>      >       if (set.num_engines > I915_EXEC_RING_MASK + 1)
> > >>>>>>>>>      >               return -EINVAL;
> > >>>>>>>>>      >
> > >>>>>>>>>      >To context creation needs to be undone and so let users create engine maps with all hardware engines, and let execbuf3 access them all.
> > >>>>>>>>>      >
> > >>>>>>>>>      Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1).
> > >>>>>>>>>      But, as discussed in other thread of this RFC series, we are planning to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the vm_bind queues need to be supported).
> > >>>>>>>>>
> > >>>>>>>>>      If we leave the number of vm_bind queues to be arbitrarily large (__u32 queue_idx) then, we need to have a hashmap for queue (a wq, work_item and a linked list) lookup from the user specified queue index.
> > >>>>>>>>>      Other option is to just put some hard limit (say 64 or 65) and use an array of queues in VM (each created upon first use). I prefer this.
> > >>>>>>>>>
> > >>>>>>>>>    I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object.  But I'll leave those details up to danvet or whoever else might be reviewing the implementation.
> > >>>>>>>>>    --Jason
> > >>>>>>>>>
> > >>>>>>>>>  I kind of agree here. Wouldn't be simpler to have the bind queue created like the others when we build the engine map?
> > >>>>>>>>>
> > >>>>>>>>>  For userspace it's then just matter of selecting the right queue ID when submitting.
> > >>>>>>>>>
> > >>>>>>>>>  If there is ever a possibility to have this work on the GPU, it would be all ready.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>I did sync offline with Matt Brost on this.
> > >>>>>>>>We can add a VM_BIND engine class and let user create VM_BIND engines (queues).
> > >>>>>>>>The problem is, in i915 engine creating interface is bound to gem_context. So, in vm_bind ioctl, we would need both context_id and queue_idx for proper lookup of the user created engine. This is bit ackward as vm_bind is an interface to VM (address space) and has nothing to do with gem_context.
> > >>>>>>>
> > >>>>>>>A gem_context has a single vm object right?
> > >>>>>>>Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
> > >>>>>>>So it's just like picking up the vm like it's done at execbuffer time right now : eb->context->vm
> > >>>>>>
> > >>>>>>Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be obtained from the context?
> > >>>>>
> > >>>>>Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
> > >>>>
> > >>>>Hmm...context doesn't sould like the right interface. It should be VM and engine (independent of context). Engine can be virtual or soft engine (kernel thread), each with its own queue. We can add an interface to create such engines (independent of context). But we are anway implicitly creating it when user uses a new queue_idx. If in future we have hardware engines for VM_BIND operation, we can have that explicit inteface to create engine instances and the queue_index in vm_bind/unbind will point to those engines.
> > >>>>Anyone has any thoughts? Daniel?
> > >>>
> > >>>Exposing gem_context or intel_context to user space is a strange concept to me. A context represent some hw resources that is used to complete certain task. User space should care allocate some resources (memory, queues) and submit tasks to queues. But user space doesn't care how certain task is mapped to a HW context - driver/guc should take care of this.
> > >>>
> > >>>So a cleaner interface to me is: user space create a vm, create gem object, vm_bind it to a vm; allocate queues (internally represent compute or blitter HW. Queue can be virtual to user) for this vm; submit tasks to queues. User can create multiple queues under one vm. One queue is only for one vm.
> > >>>
> > >>>I915 driver/guc manage the hw compute or blitter resources which is transparent to user space. When i915 or guc decide to schedule a queue (run tasks on that queue), a HW engine will be pick up and set up properly for the vm of that queue (ie., switch to page tables of that vm) - this is a context switch.
> > >>>
> > >>>From vm_bind perspective, it simply bind a gem_object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be pick up by i915/guc to execute a task using the vm bound va.
> > >>>
> > >>>I didn't completely follow the discussion here. Just share some thoughts.
> > >>>
> > >>Yah, I agree.
> > >>
> > >>Lionel,
> > >>How about we define the queue as
> > >>union {
> > >>       __u32 queue_idx;
> > >>       __u64 rsvd;
> > >>}
> > >>
> > >>If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later with a flag.
> > >>
> > >>Niranjana
> > >
> > >I did not really understand Oak's comment nor what you're suggesting here to be honest.
> > >
> > >First the GEM context is already exposed to userspace. It's explicitly created by userpace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
> > >
> > >We give the GEM context id in every execbuffer we do with drm_i915_gem_execbuffer2::rsvd1.
> > >
> > >It's still in the new execbuffer3 proposal being discussed.
> > >
> > >Second, the GEM context is also where we set the VM with I915_CONTEXT_PARAM_VM.
> > >
> > >Third, the GEM context also has the list of engines with I915_CONTEXT_PARAM_ENGINES.
> > >
> > Yes, the execbuf and engine map creation are tied to gem_context. (which probably is not the best interface.)
> >
> > >So it makes sense to me to dispatch the vm_bind operation to a GEM context, to a given vm_bind queue, because it's got all the information required :
> > >
> > >    - the list of new vm_bind queues
> > >
> > >    - the vm that is going to be modified
> > >
> > But the operation is performed here on the address space (VM) which can have multiple gem_contexts referring to it. So, VM is the right interface here. We need not 'gem_context'ify it.
> >
> > All we need is multiple queue support for the address space (VM). Going to gem_context for that just because we have engine creation support there seems unnecessay and not correct to me.
> >
> > >Otherwise where do the vm_bind queues live?
> > >
> > >In the i915/drm fd object?
> > >
> > >That would mean that all the GEM contexts are sharing the same vm_bind queues.
> > >
> > Not all, only the gem contexts that are using the same address space (VM). But to me the right way to describe would be that "VM will be using those queues".
> 
> 
> I hope by "queue" here you mean a HW resource  that will be later used to
> execute the job, for example a ccs compute engine. Of course queue can be
> virtual so user can create more queues than what hw physically has.
> 
> To express the concept of "VM will be using those queues", I think it make
> sense to have create_queue(vm) function taking a vm parameter. This
> means this queue is created for the purpose of submit job under this VM.
> Later on, we can submit job (referring to objects vm_bound to the same vm)
> to the queue. The vm_bind ioctl doesn’t need to have queue parameter, just
> vm_bind (object, va, vm).
> 
> I hope the "queue" here is not the engine used to perform the vm_bind
> operation itself. But if you meant a queue/engine to perform vm_bind itself
> (vs a queue/engine for later job submission), then we can discuss more. I
> know xe driver have similar concept and I think align the design early can
> benefit the migration to xe driver.

Oops, I read more of this thread and it turns out the vm_bind queue here is actually used to perform the vm bind/unbind operations themselves. The XE driver has a similar concept (except it is called engine_id there). So having a queue_idx parameter is closer to the xe design.

That said, I still feel having a queue_idx parameter to vm_bind is a bit awkward. Vm_bind can be performed without any GPU engine, i.e., the CPU itself can complete a vm bind as long as the CPU has access to the gpu's local memory. So the queue here has to be a virtual concept - it doesn't map directly to a GPU blitter engine.

Can someone summarize what the benefit of the queue_idx parameter is? Is it for the purpose of ordering vm_bind against later gpu jobs?

> 
> Regards,
> Oak
> 
> >
> > Niranjana
> >
> > >
> > >intel_context or GuC are internal details we're not concerned about.
> > >
> > >I don't really see the connection with the GEM context.
> > >
> > >
> > >Maybe Oak has a different use case than Vulkan.
> > >
> > >
> > >-Lionel
> > >
> > >
> > >>
> > >>>Regards,
> > >>>Oak
> > >>>
> > >>>>
> > >>>>Niranjana
> > >>>>
> > >>>>>
> > >>>>>>I think the interface is clean as a interface to VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with i915 uapi.
> > >>>>>>May be we can add such an interface, but I don't think that is worth it (we might as well just use a queue_idx in VM_BIND/UNBIND ioctl as I mentioned above).
> > >>>>>>Anyone has any thoughts?
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessary blocked by binding on VM2 (which may be waiting on its in_fence).
> > >>>>>>>
> > >>>>>>>Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
> > >>>>>>>
> > >>>>>>No, we don't have 2 VMs for a gem_context.
> > >>>>>>Say if ctx1 with vm1 and ctx2 with vm2.
> > >>>>>>First vm_bind call was for vm1 with q_idx 1 in ctx1 engine map.
> > >>>>>>Second vm_bind call was for vm2 with q_idx 2 in ctx2 engine map. If those two queue indicies points to same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and bind completes.
> > >>>>>>
> > >>>>>>With per VM queues, this is not a problem as two VMs will not endup sharing same queue.
> > >>>>>>
> > >>>>>>BTW, I just posted a updated PATCH series.
> > >>>>>>https://www.spinics.net/lists/dri-devel/msg350483.html
> > >>>>>>
> > >>>>>>Niranjana
> > >>>>>>
> > >>>>>>>
> > >>>>>>>>So, my preference here is to just add a 'u32 queue' index in vm_bind/unbind ioctl, and the queues are per VM.
> > >>>>>>>>
> > >>>>>>>>Niranjana
> > >>>>>>>>
> > >>>>>>>>>  Thanks,
> > >>>>>>>>>
> > >>>>>>>>>  -Lionel
> > >>>>>>>>>
> > >>>>>>>>>      Niranjana
> > >>>>>>>>>
> > >>>>>>>>>      >Regards,
> > >>>>>>>>>      >
> > >>>>>>>>>      >Tvrtko
> > >>>>>>>>>      >
> > >>>>>>>>>      >>Niranjana
> > >>>>>>>>>      >>
> > >>>>>>>>>      >>>>   I am trying to see how many queues we need and don't want it to be arbitrarily large and unduely blow up memory usage and complexity in i915 driver.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N.  As for memory usage, creating more queues means more memory.  That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>Ok, will start with n=3 meaning 8 queues.
> > >>>>>>>>>      >>>That would require us create 8 workqueues.
> > >>>>>>>>>      >>>We can change 'n' later if required.
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>Niranjana
> > >>>>>>>>>      >>>
> > >>>>>>>>>      >>>>   >     Why? Because Vulkan has two basic kind of bind operations and we don't want any dependencies between them:
> > >>>>>>>>>      >>>>   >      1. Immediate.  These happen right after BO creation or maybe as part of vkBindImageMemory() or VkBindBufferMemory().  These don't happen on a queue and we don't want them serialized with anything.  To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it.
> > >>>>>>>>>      >>>>   >      2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue.  It's up to us what we want to advertise.  From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores.  If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf().
> > >>>>>>>>>      >>>>   >     The important thing is that we don't want one type of operation to block on the other.  If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.
> > >>>>>>>>>      >>>>   >     In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel.  That's fine. Once the dma_fences have signaled and we're
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   Thats correct. It is like a single VM_BIND engine with multiple queues feeding to it.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> Right.  As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   > unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking.  That's expected.  What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about usecases here, but just wanted to clarify.
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>> Yes, that's what I would expect.
> > >>>>>>>>>      >>>> --Jason
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   Niranjana
> > >>>>>>>>>      >>>>
> > >>>>>>>>>      >>>>   >     For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue).  That design works pretty well and solves the problems in question.  Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.
> > >>>>>>>>>      >>>>   >     One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize.  We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize and that machinery is on by default.  It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space.  When we do memory fences, wait-before-signal will be a thing.  We don't need to try and make it a thing for syncobj.
> > >>>>>>>>>      >>>>   >     --Jason
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   Thanks Jason,
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations :
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >     "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   So it can all be all a vm_bind engine that just does bind/unbind operations.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   But yes we need another engine for the immediate/non-sparse operations.
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   >   -Lionel
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > Daniel, any thoughts?
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > Niranjana
> > >>>>>>>>>      >>>>   >
> > >>>>>>>>>      >>>>   > >Matt
> > >>>>>>>>>      >>>>   >       >
> > >>>>>>>>>      >>>>   > >> Sorry I noticed this late.
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >> -Lionel
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>>>      >>>>   > >>
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >
> > >

Patch

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 36a76cbe9095..64cb924ec5bb 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -200,6 +200,8 @@  DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
 
+.. _indefinite_dma_fences:
+
 Indefinite DMA Fences
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..f1be560d313c
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,304 @@ 
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
+objects (BOs), or sections of a BO, at specified GPU virtual addresses on a
+specified address space (VM). These mappings (also referred to as persistent
+mappings) will be persistent across multiple GPU submissions (execbuff calls)
+issued by the UMD, without the user having to provide a list of all required
+mappings during each submission (as required by the older execbuff mode).
+
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
+to specify how the binding/unbinding should sync with other operations
+like the GPU job submission. These fences will be timeline 'drm_syncobj's
+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
+For Compute contexts, they will be user/memory fences (See struct
+drm_i915_vm_bind_ext_user_fence).
+
+The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
+The user has to opt in to the VM_BIND mode of binding for an address space
+(VM) at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+
+The VM_BIND/UNBIND ioctls will immediately start binding/unbinding the mapping
+in an async worker. The binding and unbinding will work like a special GPU engine.
+The binding and unbinding operations are serialized and will wait on specified
+input fences before the operation and will signal the output fences upon the
+completion of the operation. Due to serialization, completion of an operation
+will also indicate that all previous operations are also complete.
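+
+As a rough illustration of the intended flow (the ioctl name and the layout of
+the bind request are defined in the i915_vm_bind.h header of this series; the
+struct used below is only a placeholder to show the semantics, not the final
+uapi)::
+
+  /* Opt the VM into VM_BIND mode at creation time. */
+  struct drm_i915_gem_vm_control vm_create = {
+          .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
+  };
+  ioctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create);
+
+  /* Placeholder bind request; see i915_vm_bind.h for the real layout. */
+  struct example_vm_bind {
+          __u32 vm_id;       /* VM to bind into */
+          __u32 handle;      /* GEM BO handle */
+          __u64 start;       /* GPU VA chosen by the UMD */
+          __u64 offset;      /* offset into the BO (partial binding) */
+          __u64 length;      /* length of the mapping */
+          __u64 extensions;  /* e.g. drm_i915_vm_bind_ext_timeline_fences */
+  } bind = {
+          .vm_id  = vm_create.vm_id,
+          .handle = bo_handle,
+          .start  = gpu_va,
+          .length = bo_size,
+  };
+  /* Returns once queued; the out fence supplied via the extension chain
+   * signals when this bind and all previously submitted binds/unbinds on
+   * this VM have completed. */
+  ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);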
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+
+Execbuff ioctl in VM_BIND mode
+-------------------------------
+The execbuff ioctl handling in VM_BIND mode differs significantly from the
+older method. A VM in VM_BIND mode will not support the older execbuff mode
+of binding. In VM_BIND mode, the execbuff ioctl will not accept any execlist;
+hence, there is no support for implicit sync. It is expected that the below
+work will be able to support the requirements of object dependency setting in
+all use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)
+
+This also means we need an execbuff extension to pass in the batch
+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
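+
+For example, a single-batch submission could chain that extension through the
+existing I915_EXEC_USE_EXTENSIONS mechanism roughly as follows (the extension
+layout and name value shown here are a sketch; the authoritative definition is
+in the uapi header of this series)::
+
+  /* Sketch of the proposed extension carrying batch buffer GPU VAs. */
+  struct example_execbuffer_ext_batch_addresses {
+          struct i915_user_extension base;  /* .name identifies the extension */
+          __u32 count;                      /* number of batches submitted */
+          __u64 addresses[1];               /* GPU VAs of the batch buffers */
+  } batch_ext = {
+          .count = 1,
+          .addresses = { batch_gpu_va },
+  };
+
+  struct drm_i915_gem_execbuffer2 execbuf = {
+          .buffers_ptr   = 0,               /* no execlist in VM_BIND mode */
+          .buffer_count  = 0,
+          .rsvd1         = ctx_id,
+          .flags         = I915_EXEC_USE_EXTENSIONS,
+          .cliprects_ptr = (uintptr_t)&batch_ext,
+  };
+  ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);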
+
+If execlist support in the execbuff ioctl is deemed necessary at all for
+implicit sync in certain use cases, then support can be added later.
+
+In VM_BIND mode, VA allocation is completely managed by the user instead of
+the i915 driver. Hence, VA assignment and eviction are not applicable in
+VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not
+be using the i915_vma active reference tracking. It will instead use the
+dma-resv object for that (See `VM_BIND dma_resv usage`_).
+
+So, a lot of existing code in the execbuff path like relocations, VA evictions,
+the vma lookup table, implicit sync, vma active reference tracking etc., is not
+applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up by
+clearly separating out the functionalities where the VM_BIND mode differs from
+the older method; those parts should be moved to separate files.
+
+VM_PRIVATE objects
+-------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+VM_BIND feature introduces an optimization where user can create BO which
+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
+the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during each execbuff
+submission, they need only one dma-resv fence list updated. Thus, the fast
+path (where required mappings are already bound) submission latency is O(1)
+w.r.t the number of VM private BOs.
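+
+A creation-time sketch, assuming I915_GEM_CREATE_EXT_VM_PRIVATE is a
+gem_create_ext extension carrying the target vm_id (the extension struct below
+is a placeholder standing in for the definition in the uapi header of this
+series)::
+
+  /* Placeholder for the proposed VM private create extension. */
+  struct example_create_ext_vm_private {
+          struct i915_user_extension base;  /* .name = I915_GEM_CREATE_EXT_VM_PRIVATE */
+          __u32 vm_id;                      /* VM this BO will be private to */
+          __u32 pad;
+  } vm_private = {
+          .vm_id = vm_id,
+  };
+
+  struct drm_i915_gem_create_ext create = {
+          .size = 4096,
+          .extensions = (uintptr_t)&vm_private,
+  };
+  /* The returned BO can only be mapped on vm_id, cannot be dma-buf
+   * exported, and shares the VM's common dma-resv object. */
+  ioctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);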
+
+VM_BIND locking hierarchy
+-------------------------
+The locking design here supports the older (execlist based) execbuff mode, the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
+system allocator support (See `Shared Virtual Memory (SVM) support`_).
+The older execbuff mode and the newer VM_BIND mode without page faults manage
+residency of backing storage using dma_fence. The VM_BIND mode with page faults
+and the system allocator support do not use any dma_fence at all.
+
+VM_BIND locking order is as below.
+
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
+   mapping.
+
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple page fault handlers can take the read side
+   lock to lookup the mapping and hence can run in parallel.
+   The older execbuff mode of binding does not need this lock.
+
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
+   be held while binding/unbinding a vma in the async worker and while updating
+   dma-resv fence list of an object. Note that private BOs of a VM will all
+   share a dma-resv object.
+
+   The future system allocator support will use the HMM prescribed locking
+   instead.
+
+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
+   invalidated vmas (due to eviction and userptr invalidation) etc.
+
+When GPU page faults are supported, the execbuff path does not take any of these
+locks. There we will simply smash the new batch buffer address into the ring and
+then tell the scheduler to run it. The lock taking only happens from the page
+fault handler, where we take lock-A in read mode, whichever lock-B we need to
+find the backing storage (dma_resv lock for gem objects, and hmm/core mm for
+system allocator) and some additional locks (lock-D) for taking care of page
+table races. Page fault mode should not need to ever manipulate the vm lists,
+so won't ever need lock-C.
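+
+In pseudo-code, the intended nesting for the dma-fence based modes looks like
+this (vm->vm_bind_lock and vm->vm_rebind_lock are illustrative names only)::
+
+  mutex_lock(&vm->vm_bind_lock);        /* Lock-A: protects vm_bind lists */
+  dma_resv_lock(obj->base.resv, NULL);  /* Lock-B: i915_vma state/fences  */
+  spin_lock(&vm->vm_rebind_lock);       /* Lock-C: invalidated vma lists  */
+  /* ... bind/unbind or execbuff work ... */
+  spin_unlock(&vm->vm_rebind_lock);
+  dma_resv_unlock(obj->base.resv);
+  mutex_unlock(&vm->vm_bind_lock);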
+
+VM_BIND LRU handling
+---------------------
+We need to ensure VM_BIND mapped objects are properly LRU tagged to avoid
+performance degradation. We will also need support for bulk LRU movement of
+VM_BIND objects to avoid additional latencies in execbuff path.
+
+The page table pages are similar to VM_BIND mapped objects (See
+`Evictable page table allocations`_); they are maintained per VM and need to
+be pinned in memory when the VM is made active (i.e., upon an execbuff call
+with that VM). So, bulk LRU movement of page table pages is also needed.
+
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
+over to the ttm LRU in some fashion to make sure we once again have a reasonable
+and consistent memory aging and reclaim architecture.
+
+VM_BIND dma_resv usage
+-----------------------
+Fences need to be added to all VM_BIND mapped objects. During each execbuff
+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
+over sync (See enum dma_resv_usage). One can override it with either
+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
+setting (either through explicit or implicit mechanism).
+
+When vm_bind is called for a non-private object while the VM is already
+active, the fences need to be copied from the VM's shared dma-resv object
+(common to all private objects of the VM) to this non-private object.
+If this results in performance degradation, then some optimization will
+be needed here. This is not a problem for the VM's private objects as they use
+the shared dma-resv object which is always updated on each execbuff submission.
+
+Also, in VM_BIND mode, use the dma-resv apis for determining object activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
+older i915_vma active reference tracking, which is deprecated. This should make
+it easier to get working with the current TTM backend. We can remove the
+i915_vma active reference tracking fully while supporting the TTM backend for igfx.
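+
+In kernel terms, each execbuff submission then amounts to roughly the following
+(a sketch only, assuming the relevant dma-resv locks are held and with error
+handling omitted; vm->private_resv and the shared-BO list walk are illustrative,
+while the dma-resv calls are the existing kernel API)::
+
+  /* One BOOKKEEP fence covers all VM private BOs (shared dma-resv)... */
+  dma_resv_reserve_fences(vm->private_resv, 1);
+  dma_resv_add_fence(vm->private_resv, &rq->fence, DMA_RESV_USAGE_BOOKKEEP);
+
+  /* ...plus one fence per shared (non-private) BO mapped on this VM. */
+  list_for_each_entry(vma, &vm->shared_bind_list, bind_link) {
+          dma_resv_reserve_fences(vma->obj->base.resv, 1);
+          dma_resv_add_fence(vma->obj->base.resv, &rq->fence,
+                             DMA_RESV_USAGE_BOOKKEEP);
+  }
+
+  /* Object activeness without i915_vma active reference tracking. */
+  busy = !dma_resv_test_signaled(obj->base.resv, DMA_RESV_USAGE_BOOKKEEP);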
+
+Evictable page table allocations
+---------------------------------
+Make pagetable allocations evictable and manage them similar to VM_BIND
+mapped objects. Page table pages are similar to persistent mappings of a
+VM (the difference here is that the page table pages will not have an i915_vma
+structure and, after swapping pages back in, the parent page link needs to be
+updated).
+
+Mesa use case
+--------------
+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and Iris),
+hence improving performance of CPU-bound applications. It also allows us to
+implement Vulkan's Sparse Resources. With increasing GPU hardware performance,
+reducing CPU overhead becomes more impactful.
+
+
+VM_BIND Compute support
+========================
+
+User/Memory Fence
+------------------
+The idea is to take a user specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the user
+supplied filter. A User/Memory fence is an <address, value> pair. To signal the
+user fence, the specified value is written at the specified virtual address
+and the waiting process is woken up. The user can wait on a user fence with the
+gem_wait_user_fence ioctl.
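+
+Conceptually, with the <address, value> pair supplied by the user (the wait
+ioctl name and layout below are placeholders; the actual definitions come with
+the uapi of this series, derived from the work referenced at the end of this
+section)::
+
+  /* Signaling side (KMD, or a batch via an MI_FLUSH/PIPE_CONTROL notify): */
+  *(__u64 *)fence_addr = fence_value;   /* write the value...                */
+  /* ...and wake up any task waiting on this <address, value> filter.        */
+
+  /* Waiting side (UMD), placeholder layout: */
+  struct example_wait_user_fence {
+          __u64 addr;        /* qword aligned user/memory fence address */
+          __u64 value;       /* value to compare against                */
+          __u16 op;          /* filter, e.g. EQ/NEQ/GT/GTE/LT/LTE       */
+          __u64 timeout_ns;
+  } wait = { .addr = fence_addr, .value = fence_value };
+  ioctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);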
+
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of the next level batch. The completion of the very first level
+batch needs to be signaled by the command streamer. The user must provide the
+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
+extension of the execbuff ioctl, so that the KMD can set up the command
+streamer to signal it.
+
+User/Memory fence can also be supplied to the kernel driver to signal/wake up
+the user process after completion of an asynchronous operation.
+
+When the VM_BIND ioctl is provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of binding of that mapping. All async binds/unbinds are serialized, hence
+signaling of the user/memory fence also indicates the completion of all
+previous binds/unbinds.
+
+This feature will be derived from the below original work:
+https://patchwork.freedesktop.org/patch/349417/
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fence expects that it completes in a reasonable amount of time.
+Compute, on the other hand, can be long running. Hence it is appropriate for
+compute to use user/memory fences, and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must opt-in
+for this mechanism with I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during
+context creation. The dma-fence based user interfaces like gem_wait ioctl and
+execbuff out fence are not allowed on long running contexts. Implicit sync is
+not valid as well and is anyway not supported in VM_BIND mode.
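+
+Context creation would then look roughly like this (the LONG_RUNNING flag is
+introduced by this series; the rest is the existing context create uapi)::
+
+  struct drm_i915_gem_context_create_ext ctx_create = {
+          /* Opt in to long running behaviour: dma-fence based interfaces
+           * (gem_wait, execbuff out fence) are rejected for this context
+           * and user/memory fences are used instead. */
+          .flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING,
+  };
+  ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &ctx_create);
+  /* ctx_create.ctx_id is then used for long running / low latency submission. */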
+
+Where GPU page faults are not available, the kernel driver, upon buffer
+invalidation, will initiate a suspend (preemption) of the long running context
+with a dma-fence attached to it. Upon completion of that suspend fence, it
+finishes the invalidation, revalidates the BO and then resumes the compute
+context. This is done by having a per-context preempt fence (also called a
+suspend fence) proxying as the i915_request fence. This suspend fence is
+enabled when someone tries to wait on it, which then triggers the context
+preemption.
+
+As this support for context suspension using a preempt fence and the resume work
+for the compute mode contexts can get tricky to get right, it is better to
+add this support in the drm scheduler so that multiple drivers can make use of it.
+That means, it will have a dependency on i915 drm scheduler conversion with GuC
+scheduler backend. This should be fine, as the plan is to support compute mode
+contexts only with GuC scheduler backend (at least initially). This is much
+easier to support with VM_BIND mode compared to the current heavier execbuff
+path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows the compute UMD to directly submit GPU jobs instead of going through the
+execbuff ioctl. This is made possible by VM_BIND not being synchronized against
+execbuff. VM_BIND allows bind/unbind of the mappings required for the directly
+submitted jobs.
+
+Other VM_BIND use cases
+========================
+
+Debugger
+---------
+With the debug event interface, a user space process (the debugger) is able to
+keep track of and act upon resources created by another process (the debuggee)
+and attached to the GPU via the vm_bind interface.
+
+GPU page faults
+----------------
+GPU page faults, when supported (in the future), will only be available in
+VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode of
+binding will require using dma-fence to ensure residency, the GPU page faults
+mode when supported, will not use any dma-fence as residency is purely managed
+by installing and removing/invalidating page table entries.
+
+Page level hints settings
+--------------------------
+VM_BIND allows any hints setting per mapping instead of per BO.
+Possible hints include read-only mapping, placement and atomicity.
+Sub-BO level placement hint will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+VM_BIND interface can be used to map system memory directly (without gem BO
+abstraction) using the HMM interface. SVM is only supported with GPU page
+faults enabled.
+
+
+Broader i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding, which comes with its own
+use cases and locking requirements, requires proper integration with the
+existing i915 driver. This calls for some broader i915 driver
+cleanups/simplifications for maintainability of the driver going forward.
+Here are a few things that have been identified and are being looked into.
+
+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
+  feature does not use it, and the complexity it brings in is probably more
+  than the performance advantage we get in the legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. The VM_BIND feature will not be
+  using it. Instead, use the underlying BO's dma-resv fence list to determine
+  whether an i915_vma is active or not.
+
+
+VM_BIND UAPI
+=============
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@  host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst