diff mbox series

[v4] Documentation/gpu: VM_BIND locking document

Message ID 20231115124937.6740-1-thomas.hellstrom@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [v4] Documentation/gpu: VM_BIND locking document | expand

Commit Message

Thomas Hellström Nov. 15, 2023, 12:49 p.m. UTC
Add the first version of the VM_BIND locking document which is
intended to be part of the xe driver upstreaming agreement.

The document describes and discuss the locking used during exec-
functions, evicton and for userptr gpu-vmas. Intention is to be using the
same nomenclature as the drm-vm-bind-async.rst.

v2:
- s/gvm/gpu_vm/g (Rodrigo Vivi)
- Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
  (Rodrigo Vivi)
- Adjust commit message accordingly.
- Add SPDX license header.

v3:
- Large update to align with the drm_gpuvm manager locking
- Add "Efficient userptr gpu_vma exec function iteration" section
- Add "Locking at bind- and unbind time" section.

v4:
- Fix tabs vs space errors by untabifying (Rodrigo Vivi)
- Minor style fixes and typos (Rodrigo Vivi)
- Clarify situations where stale GPU mappings are occurring and how
  access through these mappings are blocked. (Rodrigo Vivi)
- Insert into the toctree in implementation_guidelines.rst

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 Documentation/gpu/drm-vm-bind-locking.rst     | 503 ++++++++++++++++++
 .../gpu/implementation_guidelines.rst         |   1 +
 2 files changed, 504 insertions(+)
 create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst

Comments

Danilo Krummrich Nov. 15, 2023, 3:11 p.m. UTC | #1
On Wed, Nov 15, 2023 at 01:49:37PM +0100, Thomas Hellström wrote:
> Add the first version of the VM_BIND locking document which is
> intended to be part of the xe driver upstreaming agreement.
> 
> The document describes and discuss the locking used during exec-
> functions, evicton and for userptr gpu-vmas. Intention is to be using the
> same nomenclature as the drm-vm-bind-async.rst.
> 
> v2:
> - s/gvm/gpu_vm/g (Rodrigo Vivi)
> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
>   (Rodrigo Vivi)
> - Adjust commit message accordingly.
> - Add SPDX license header.
> 
> v3:
> - Large update to align with the drm_gpuvm manager locking
> - Add "Efficient userptr gpu_vma exec function iteration" section
> - Add "Locking at bind- and unbind time" section.
> 
> v4:
> - Fix tabs vs space errors by untabifying (Rodrigo Vivi)
> - Minor style fixes and typos (Rodrigo Vivi)
> - Clarify situations where stale GPU mappings are occurring and how
>   access through these mappings are blocked. (Rodrigo Vivi)
> - Insert into the toctree in implementation_guidelines.rst
> 
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  Documentation/gpu/drm-vm-bind-locking.rst     | 503 ++++++++++++++++++
>  .../gpu/implementation_guidelines.rst         |   1 +
>  2 files changed, 504 insertions(+)
>  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> 
> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
> new file mode 100644
> index 000000000000..bc701157cb34
> --- /dev/null
> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> @@ -0,0 +1,503 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +===============
> +VM_BIND locking
> +===============
> +
> +This document attempts to describe what's needed to get VM_BIND locking right,
> +including the userptr mmu_notifier locking and it will also discuss some
> +optimizations to get rid of the looping through of all userptr mappings and
> +external / shared object mappings that is needed in the simplest
> +implementation. It will also discuss some implications for faulting gpu_vms.

I'd try to avoid switching to future tense. There are a few more places
throughout the document.

> +
> +Nomenclature
> +============
> +
> +* ``Context``: GPU execution context.
> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> +  meta-data. Typically one per client (DRM file-private), or one per
> +  context.
> +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
> +  associated meta-data. The backing storage of a gpu_vma can either be
> +  a GEM object or anonymous pages mapped also into the CPU
> +  address space for the process.
> +* gpu_vm_bo: Abstracts the association of a GEM object and
> +  a VM. Note that if only one gpu_vma per vm and buffer object were
> +  allowed, the state stored with a gpu_vm_bo could just as well have
> +  been stored with the gpu_vma. For the purpose of this document, each
> +  GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo
> +  maintains a list of gpu_vmas.
> +* ``userptr gpu_vma or just userptr``: A gpu_vma, whose backing store
> +  is anonymous pages as described above.
> +* ``revalidating``: Revalidating a gpu_vma means making the latest version
> +  of the backing store resident and making sure the gpu_vma's
> +  page-table entries point to that backing store.
> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
> +  and which tracks GPU activity. When the GPU activity is finished,
> +  the dma_fence signals.
> +* ``dma_resv``: A struct dma_resv (a.k.a reservation object) that is used
> +  to track GPU activity in the form of multiple dma_fences on a
> +  gpu_vm or a GEM object. The dma_resv contains an array / list
> +  of dma_fences and a lock that needs to be held when adding
> +  additional dma_fences to the dma_resv. The lock is of a type that
> +  allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
> +* ``exec function``: An exec function is a function that revalidates all
> +  affected gpu_vmas, submits a GPU command batch and registers the
> +  dma_fence representing the GPU command's activity with all affected
> +  dma_resvs. For completeness, although not covered by this document,
> +  it's worth mentioning that an exec function may also be the
> +  revalidation worker that is used by some drivers in compute /
> +  long-running mode.
> +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem
> +  objects also share the gpu_vm's dma_resv.

What about "a GEM object which is not shared across multiple VMs" or " a GEM
object which is only mapped within a single VM"?

Also That last sentence seems to belong to 'shared object' below.

> +* ``shared object``: a.k.a external object: A GEM object which may be shared
> +  by multiple gpu_vms and whose backing storage may be shared with
> +  other drivers.
> +
> +
> +Locks used and locking orders
> +=============================

Maybe just "Locks and locking order"?

> +
> +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
> +dma_resv object and hence the dma_resv lock. So even with a huge
> +number of local GEM objects, only one lock is needed to make the exec
> +sequence atomic.
> +
> +The following locks and locking orders are used:
> +
> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is

"a rwsem"

> +  partitioned into gpu_vmas. It can also protect the gpu_vm's list of

"Protects how the gpu_vm is partitioned into gpu_vmas." might be a bit abstract.
What about "[...] protects the gpu_vm's data structure keeping track of
gpu_vmas." instead?

> +  userptr gpu_vmas. With a CPU mm analogy this would correspond to the
> +  mmap_lock.
> +* The ``userptr_seqlock``. This lock is taken in read mode for each
> +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
> +  notifier invalidation. This is not a real seqlock but described in
> +  ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side
> +  'lock' a lot like a seqcount, however this allows multiple
> +  write-sides to hold it at once...". The read side critical section
> +  is enclosed by ``mmu_interval_read_begin() /
> +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
> +  sleeping if the write side is held.
> +  The write side is held by the core mm while calling mmu interval
> +  invalidation notifiers.
> +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
> +  rebinding, and also the residency of all the gpu_vm's local GEM object.

"as well as the residency state"?

> +  Furthermore it typically protects the gpu_vm's list of evicted GEM
> +  objects and external objects.

"evicted and external GEM objects"

> +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is

"a rwsem"

> +  taken in read mode during exec and write mode during a mmu notifier
> +  invalidation. The userptr notifier lock is per gpu_vm.
> +* The gpu_vm list spinlocks. With some implementations they are needed
> +  to be able to update the gpu_vm evicted- and external object
> +  list. For those implementations, the spinlocks are grabbed when the
> +  lists are manipulated. However to avoid locking order violations
> +  with the dma_resv locks, a special scheme is needed when iterating
> +  over the lists.
> +
> +.. _gpu_vma lifetime:
> +
> +Protection and lifetime of gpu_vm_bos and gpu_vmas
> +==================================================
> +
> +The GEM object's list of gpu_vm_bos is typically protected by the
> +GEM object's dma_resv. Each gpu_vm_bo holds a reference counted pointer

You already say "typically", maybe we should add what the alternative is.

"Drivers might use a driver specific lock in order to be able to make use of
this lock within the dma-fence signalling critical section." We could also link
to the corresponding function to set a driver specific lock.

> +to the underlying GEM object, and each gpu_vma holds a reference counted
> +pointer to the gpu_vm_bo. When iterating over the GEM object's
> +list of gpu_vm_bos the gem object's dma_resv must thus be held,
> +but if it needs to be dropped during the iteration, care needs to be
> +taken so that any gpu_vm_bo, and the gpu_vm, if dereferenced
> +while the lock is dropped, do not disappear. The easiest way to avoid
> +this is to take a reference on affected objects while the dma_resv is
> +still held. If iterating over the gpu_vm_bo's gpu_vmas, even
> +greater care needs to be taken since the gpu_vmas are not
> +reference counted. If a driver accesses a gpu_vma obtained from
> +the gpu_vm_bo's list of gpu_vmas, and the GEM object's
> +dma_resv is dropped, at the very least, it should be thoroughly
> +documented how the gpu_vma is kept alive. Otherwise holding the
> +GEM object's dma_resv lock also around unlinking a gpu_vma from a
> +gpu_vm_bo will ensure that doesn't happen.
> +
> +
> +Revalidation and eviction of local objects
> +==========================================
> +
> +Revalidation
> +____________
> +With VM_BIND, all local objects need to be resident when the gpu is
> +executing using the gpu_vm, and the objects need to have valid
> +gpu_vmas set up pointing to them. Typically each gpu command buffer
> +submission is therefore preceded with a re-validation section:
> +
> +.. code-block:: C
> +
> +   dma_resv_lock(gpu_vm->resv);
> +
> +   // Validation section starts here.
> +   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
> +           validate_gem_bo(&gpu_vm_bo->gem_bo);
> +
> +           // The following list iteration needs the Gem object's
> +           // dma_resv to be held (it protects the gpu_vm_bo's list of
> +           // gpu_vmas, but since local gem objects share the gpu_vm's
> +           // dma_resv, it is already held at this point.
> +           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
> +                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
> +   }
> +
> +   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) {
> +           rebind_gpu_vma(&gpu_vma);
> +           remove_gpu_vma_from_rebind_list(&gpu_vma);
> +   }
> +   // Validation section ends here, and job submission starts.
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   dma_resv_unlock(gpu_vm->resv);
> +
> +The reason for having a separate gpu_vm rebind list is that there
> +might be userptr gpu_vmas that are not mapping a buffer object that
> +also need rebinding.
> +
> +Eviction
> +________
> +
> +Eviction of one of these local objects will then look similar to the
> +following:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo);
> +           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);
> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +Note that since the object is local to the gpu_vm, it will share the gpu_vm's
> +dma_resv lock so that ``obj->resv == gpu_vm->resv``.
> +The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list,
> +which is protected by ``gpu_vm->resv``, that is always locked while
> +evicting, due to the above equality.
> +
> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction,
> +Since the driver must ensure that the eviction blit or copy will wait
> +for GPU idle or depend on all previous GPU activity. Furthermore, any
> +subsequent attempt by the GPU to access freed memory through the
> +gpu_vma will be preceded by a new exec function, with a revalidation
> +section which will make sure all gpu_vmas are rebound. The eviction
> +code holding the object's dma_resv while revalidating will ensure a
> +new exec function may not race with the eviction. Note that this will
> +not hold true, however, if only a subsets of vmas are, due to the
> +driver implementation, selected for rebinding the next exec
> +function. Then all vmas *not* selected for rebinding needs to be
> +properly unbound before re-enabling GPU access to the VM.
> +
> +
> +Locking with external (or shared) buffer objects
> +================================================
> +
> +Since shared buffer objects may be shared by multiple gpu_vm's they
> +can't share their reservation object with a single gpu_vm, but will rather
> +have a reservation object of their own. The shared objects bound to a
> +gpu_vm using one or many gpu_vmas are therefore typically put on a
> +per-gpu_vm list which is protected by the gpu_vm's dma_resv lock. Once

Or a separate spinlock as mentioned in "Locks used and locking orders".

> +the gpu_vm's reservation object  is locked, it is safe to traverse the
 
double space

> +external object list and lock the dma_resvs of all external objects.
> +
> +At eviction time we now need to put the gpu_vm_bos of *all* gpu_vms a
> +shared object is bound to on the gpu_vm's evict list, but we can no longer
> +be certain that we hold the gpu_vm's dma_resv of all the gpu_vms the
> +object is bound to, since at eviction time we only hold the object's
> +private dma_resv.

Something seems to be weird with this sentence.

> If we have a ww_acquire context at hand at eviction
> +time we could grab the those dma_resvs but that could cause
> +expensive ww_mutex rollbacks. A simple option is to just mark the
> +gpu_vm_bos of the evicted gem object with an ``evicted`` bool that
> +is inspected the next time the corresponding gpu_vm evicted list needs
> +to be traversed. At that time the gpu_vm's dma_resv and the object's
> +dma_resv is held, and the gpu_vm_bo marked evicted, can then be added
> +to the gpu_vm's list of evicted gpu_vm_bos. The ``evicted`` bool would
> +then be protected by the object's dma_resv.

Also worth mentioning that this extra semantics doesn't apply to the spinlock
locking scheme.

> +
> +The exec function would then become
> +
> +.. code-block:: C
> +
> +   dma_resv_lock(gpu_vm->resv);
> +
> +   // External object list is protected by the gpu_vm->resv lock.
> +   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
> +           dma_resv_lock(gpu_vm_bo.gem_obj->resv);
> +           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
> +                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
> +   }
> +
> +   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
> +           validate_gem_bo(&gpu_vm_bo->gem_bo);
> +
> +           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
> +                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
> +   }
> +
> +   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) {
> +           rebind_gpu_vma(&gpu_vma);
> +           remove_gpu_vma_from_rebind_list(&gpu_vma);
> +   }
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +   dma_resv_unlock_all_resv_locks();
> +
> +And the corresponding shared-object aware eviction would look like:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
> +           if (object_is_vm_local(obj))
> +                add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
> +           else
> +                mark_gpu_vm_bo_evicted(&gpu_vm_bo);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);
> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +.. _Spinlock iteration:
> +
> +Accessing the gpu_vm's lists without the dma_resv lock held
> +===========================================================

Ok, you have a separate paragraph for this. Maybe just mention the existance of
the alternative locking scheme above and refer to this paragraph.

> +
> +Although some drivers will hold the gpu_vm's dma_resv lock when
> +accessing the gpu_vm's evict list and external objects lists, there
> +are drivers that need to access these lists without the dma_resv lock held,
> +for example due to asynchronous state updates from within the
> +dma_fence signalling critical path. In such case a spinlock can be
> +used to protect manipulation of the lists. However, since higher level
> +sleeping locks needs to be taken for each list item while iterating
> +over the lists, the items already iterated over needs to be
> +temporarily moved to a private list and the spinlock released
> +while processing each item:
> +
> +.. code block:: C
> +
> +    struct list_head still_in_list;
> +
> +    INIT_LIST_HEAD(&still_in_list);
> +
> +    spin_lock(&gpu_vm->list_lock);
> +    do {
> +            struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);
> +
> +            if (!entry)
> +                    break;
> +
> +            list_move_tail(&entry->head, &still_in_list);
> +            list_entry_get_unless_zero(entry);
> +            spin_unlock(&gpu_vm->list_lock);
> +
> +            process(entry);
> +
> +            spin_lock(&gpu_vm->list_lock);
> +            list_entry_put(entry);
> +    } while (true);
> +
> +    list_splice_tail(&still_in_list, &gpu_vm->list);
> +    spin_unlock(&gpu_vm->list_lock);
> +
> +However, due to the additional locking and atomic operations, drivers that *can*
> +avoid accessing the gpu_vm's list outside of the dma_resv lock
> +might want to avoid this iteration scheme, if the driver anticipates a
> +large number of list items. For lists where the anticipated number of
> +list items is small, list iteration doesn't happen very often, or

Why or?

> +there is a significant additional cost associated with each iteration,
> +the atomic operation overhead associated with this type of iteration
> +is, however, probably negligible. Note that if this scheme is
> +used, it is necessary to make sure this list iteration is protected by
> +an outer level lock or semaphore, since list items are temporarily
> +pulled off the list while iterating.
> +
> +TODO: Pointer to the gpuvm code implementation if this iteration and
> +how to choose either iteration scheme.

I landed the GPUVM series on Monday. Hence, we should be able to add the
corresponding links now. :-)

> +
> +userptr gpu_vmas
> +================
> +
> +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
> +GPU virtual address range, directly maps a CPU mm range of anonymous-
> +or file page-cache pages.
> +A very simple approach would be to just pin the pages using
> +pin_user_pages() at bind time and unpin them at unbind time, but this
> +creates a Denial-Of-Service vector since a single user-space process
> +would be able to pin down all of system memory, which is not
> +desirable. (For special use-cases and with proper accounting pinning might
> +still be a desirable feature, though). What we need to do in the
> +general case is to obtain a reference to the desired pages, make sure
> +we are notified using a MMU notifier just before the CPU mm unmaps the
> +pages, dirty them if they are not mapped read-only to the GPU, and
> +then drop the reference.
> +When we are notified by the MMU notifier that CPU mm is about to drop the
> +pages, we need to stop GPU access to the pages by waiting for VM idle
> +in the MMU notifier and make sure that before the next time the GPU
> +tries to access whatever is now present in the CPU mm range, we unmap
> +the old pages from the GPU page tables and repeat the process of
> +obtaining new page references. (See the :ref:`notifier example
> +<Invalidation example>` below). Note that when the core mm decides to
> +laundry pages, we get such an unmap MMU notification and can mark the
> +pages dirty again before the next GPU access. We also get similar MMU
> +notifications for NUMA accounting which the GPU driver doesn't really
> +need to care about, but so far it has proven difficult to exclude
> +certain notifications.
> +
> +Using a MMU notifier for device DMA (and other methods) is described in
> +`this document
> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
> +
> +Now the method of obtaining struct page references using
> +get_user_pages() unfortunately can't be used under a dma_resv lock
> +since that would violate the locking order of the dma_resv lock vs the
> +mmap_lock that is grabbed when resolving a CPU pagefault. This means
> +the gpu_vm's list of userptr gpu_vmas needs to be protected by an
> +outer lock.
> +
> +The MMU interval seqlock for a userptr gpu_vma is used in the following
> +way:
> +
> +.. code-block:: C
> +
> +   // Exclusive locking mode here is strictly needed only if there are
> +   // invalidated userptr vmas present, to avoid multiple userptr
> +   // revalidations.
> +   down_write(&gpu_vm->lock);
> +   retry:
> +
> +   // Note: mmu_interval_read_begin() blocks until there is no
> +   // invalidation notifier running anymore.
> +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
> +   if (seq != gpu_vma->saved_seq) {
> +           obtain_new_page_pointers(&gpu_vma);
> +           dma_resv_lock(&gpu_vm->resv);
> +           add_gpu_vma_top_revalidate_list(&gpu_vma, &gpu_vm);
> +           dma_resv_unlock(&gpu_vm->resv);
> +           gpu_vma->saved_seq = seq;
> +   }
> +
> +   // The usual revalidation goes here.
> +
> +   // Final userptr sequence validation may not happen before the
> +   // submission dma_fence is added to the gpu_vm's resv, from the POW
> +   // of the MMU invalidation notifier. Hence the
> +   // userptr_notifier_lock that will make them appear atomic.
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   down_read(&gpu_vm->userptr_notifier_lock);
> +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
> +          up_read(&gpu_vm->userptr_notifier_lock);
> +          goto retry;
> +   }
> +
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +
> +   dma_resv_unlock_all_resv_locks();
> +   up_read(&gpu_vm->userptr_notifier_lock);
> +   up_write(&gpu_vm->lock);
> +
> +The code between ``mmu_interval_read_begin()`` and the
> +``mmu_interval_read_retry()`` marks the read side critical section of
> +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
> +gpu_vma list is looped through, and the check is done for *all* of its
> +userptr gpu_vmas, although we only show a single one here.
> +
> +The userptr gpu_vma MMU invalidation notifier might be called from
> +reclaim context and, again to avoid locking order violations, we can't
> +take any dma_resv lock nor the gpu_vm->lock from within it.
> +
> +.. _Invalidation example:
> +.. code-block:: C
> +
> +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
> +  {
> +          // Make sure the exec function either sees the new sequence
> +          // and backs off or we wait for the dma-fence:
> +
> +          down_write(&gpu_vm->userptr_notifier_lock);
> +          mmu_interval_set_seq(userptr_interval, cur_seq);
> +          up_write(&gpu_vm->userptr_notifier_lock);
> +
> +          // At this point, the exec function can't succeed in
> +          // submitting a new job, because cur_seq is an invalid
> +          // sequence number and will always cause a retry. When all
> +          // invalidation callbacks, the mmu notifier core will flip
> +          // the sequence number to a valid one. However we need to
> +          // stop gpu access to the old pages here.
> +
> +          dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
> +                                false, MAX_SCHEDULE_TIMEOUT);
> +          return true;
> +  }
> +
> +When this invalidation notifier returns, the GPU can no longer be
> +accessing the old pages of the userptr gpu_vma and needs to redo the
> +page-binding before a new GPU submission can succeed.
> +
> +Efficient userptr gpu_vma exec_function iteration
> +_________________________________________________
> +
> +If the gpu_vm's list of userptr gpu_vmas becomes large, it's
> +inefficient to iterate through the complete lists of userptrs on each
> +exec function to check whether each userptr gpu_vma's saved
> +sequence number is invalid or stale. A solution to this is to put all
> +*invalidated* userptr gpu_vmas on a separate gpu_vm list and
> +only those gpu_vmas on the list are actually checked on each exec
> +function. This list will then lend itself very-well to the spinlock
> +locking scheme that is
> +:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
> +in the mmu notifier, where we add the invalidated gpu_vmas to the
> +list, it's not possible to take any outer locks like the
> +``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
> +``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
> +complete, as also mentioned in that section.
> +
> +If using an invalidated userptr list like this, the retry check in the
> +exec function trivially becomes a check for invalidated list empty.
> +
> +Locking at bind and unbind time
> +================================
> +
> +At bind time, assuming a GEM object backed gpu_vma, each
> +gpu_vma needs to be associated with a gpu_vm_bo and that
> +gpu_vm_bo in turn needs to be added to the GEM object's
> +gpu_vm_bo list, and possibly to the gpu_vm's external object
> +list. This is referred to as *linking* the gpu_vma, and typically
> +requires that the ``gpu_vm->resv`` and the GEM object's dma_resv are
> +held. When unlinking a gpu_vma the same locks are typically held,
> +and that ensures, as briefly discussed
> +:ref:`previously <gpu_vma lifetime>`, that when iterating over
> +``gpu_vmas`, either under the ``gpu_vm->resv`` or the GEM
> +object's dma_resv, that the gpu_vmas stay alive as long
> +as the lock under which we iterate are not is not released. For
> +userptr gpu_vmas it's similarly required that during unlink, the
> +outer ``gpu_vm->lock`` is held, since otherwise when iterating over
> +the invalidated userptr list as described in the previous section,
> +there is nothing keeping those userptr gpu_vmas alive.
> diff --git a/Documentation/gpu/implementation_guidelines.rst b/Documentation/gpu/implementation_guidelines.rst
> index 138e637dcc6b..dbccfa72f1c9 100644
> --- a/Documentation/gpu/implementation_guidelines.rst
> +++ b/Documentation/gpu/implementation_guidelines.rst
> @@ -7,3 +7,4 @@ Misc DRM driver uAPI- and feature implementation guidelines
>  .. toctree::
>  
>     drm-vm-bind-async
> +   drm-vm-bind-locking
> -- 
> 2.41.0
>
Thomas Hellström Nov. 15, 2023, 4:04 p.m. UTC | #2
Hi, Danilo,

On Wed, 2023-11-15 at 16:11 +0100, Danilo Krummrich wrote:
> On Wed, Nov 15, 2023 at 01:49:37PM +0100, Thomas Hellström wrote:
> > Add the first version of the VM_BIND locking document which is
> > intended to be part of the xe driver upstreaming agreement.
> > 
> > The document describes and discuss the locking used during exec-
> > functions, evicton and for userptr gpu-vmas. Intention is to be
> > using the
> > same nomenclature as the drm-vm-bind-async.rst.
> > 

Thanks for reviewing. I'll update the document accordingly except for
the s/an rwsem/a rwsem/g, I think it's "an rwsem" similarly to "an r".

/Thomas
Danilo Krummrich Nov. 15, 2023, 5:02 p.m. UTC | #3
On 11/15/23 17:04, Thomas Hellström wrote:
> Hi, Danilo,
> 
> On Wed, 2023-11-15 at 16:11 +0100, Danilo Krummrich wrote:
>> On Wed, Nov 15, 2023 at 01:49:37PM +0100, Thomas Hellström wrote:
>>> Add the first version of the VM_BIND locking document which is
>>> intended to be part of the xe driver upstreaming agreement.
>>>
>>> The document describes and discuss the locking used during exec-
>>> functions, evicton and for userptr gpu-vmas. Intention is to be
>>> using the
>>> same nomenclature as the drm-vm-bind-async.rst.
>>>
> 
> Thanks for reviewing. I'll update the document accordingly except for
> the s/an rwsem/a rwsem/g, I think it's "an rwsem" similarly to "an r".

I read it as "read-write-sem". Would you read it as "ar-double-u-sem"
then I guess?

> 
> /Thomas
>
Thomas Hellström Nov. 15, 2023, 5:18 p.m. UTC | #4
On 11/15/23 18:02, Danilo Krummrich wrote:
> On 11/15/23 17:04, Thomas Hellström wrote:
>> Hi, Danilo,
>>
>> On Wed, 2023-11-15 at 16:11 +0100, Danilo Krummrich wrote:
>>> On Wed, Nov 15, 2023 at 01:49:37PM +0100, Thomas Hellström wrote:
>>>> Add the first version of the VM_BIND locking document which is
>>>> intended to be part of the xe driver upstreaming agreement.
>>>>
>>>> The document describes and discuss the locking used during exec-
>>>> functions, evicton and for userptr gpu-vmas. Intention is to be
>>>> using the
>>>> same nomenclature as the drm-vm-bind-async.rst.
>>>>
>>
>> Thanks for reviewing. I'll update the document accordingly except for
>> the s/an rwsem/a rwsem/g, I think it's "an rwsem" similarly to "an r".
>
> I read it as "read-write-sem". Would you read it as "ar-double-u-sem"
> then I guess?
>
Yes. :)

/Thomas


>>
>> /Thomas
>>
>
Boris Brezillon Nov. 16, 2023, 9:48 a.m. UTC | #5
Hi Thomas,

On Wed, 15 Nov 2023 13:49:37 +0100
Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:

> Add the first version of the VM_BIND locking document which is
> intended to be part of the xe driver upstreaming agreement.
> 
> The document describes and discuss the locking used during exec-
> functions, evicton and for userptr gpu-vmas. Intention is to be using the
> same nomenclature as the drm-vm-bind-async.rst.
> 
> v2:
> - s/gvm/gpu_vm/g (Rodrigo Vivi)
> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
>   (Rodrigo Vivi)
> - Adjust commit message accordingly.
> - Add SPDX license header.
> 
> v3:
> - Large update to align with the drm_gpuvm manager locking
> - Add "Efficient userptr gpu_vma exec function iteration" section
> - Add "Locking at bind- and unbind time" section.
> 
> v4:
> - Fix tabs vs space errors by untabifying (Rodrigo Vivi)
> - Minor style fixes and typos (Rodrigo Vivi)
> - Clarify situations where stale GPU mappings are occurring and how
>   access through these mappings are blocked. (Rodrigo Vivi)
> - Insert into the toctree in implementation_guidelines.rst
> 
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  Documentation/gpu/drm-vm-bind-locking.rst     | 503 ++++++++++++++++++
>  .../gpu/implementation_guidelines.rst         |   1 +
>  2 files changed, 504 insertions(+)
>  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> 
> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
> new file mode 100644
> index 000000000000..bc701157cb34
> --- /dev/null
> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> @@ -0,0 +1,503 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +===============
> +VM_BIND locking
> +===============
> +
> +This document attempts to describe what's needed to get VM_BIND locking right,
> +including the userptr mmu_notifier locking and it will also discuss some
> +optimizations to get rid of the looping through of all userptr mappings and
> +external / shared object mappings that is needed in the simplest
> +implementation. It will also discuss some implications for faulting gpu_vms.
> +
> +Nomenclature
> +============
> +
> +* ``Context``: GPU execution context.
> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> +  meta-data. Typically one per client (DRM file-private), or one per
> +  context.

Should we mention that it's a driver object, likely inheriting from
drm_gpuvm?

> +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
> +  associated meta-data. The backing storage of a gpu_vma can either be
> +  a GEM object or anonymous pages mapped also into the CPU
> +  address space for the process.
> +* gpu_vm_bo: Abstracts the association of a GEM object and
> +  a VM. Note that if only one gpu_vma per vm and buffer object were
> +  allowed, the state stored with a gpu_vm_bo could just as well have
> +  been stored with the gpu_vma.

I find this note confusing, and I don't think it brings any value to
the rest of the explanation. Sure, there's a way we could have
optimized things for 1:1 VM:BO mappings, but do we care about this very
restrictive use case in this context.

> For the purpose of this document, each
> +  GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo
> +  maintains a list of gpu_vmas.
> +* ``userptr gpu_vma or just userptr``: A gpu_vma, whose backing store
> +  is anonymous pages as described above.
> +* ``revalidating``: Revalidating a gpu_vma means making the latest version
> +  of the backing store resident and making sure the gpu_vma's
> +  page-table entries point to that backing store.
> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
> +  and which tracks GPU activity. When the GPU activity is finished,
> +  the dma_fence signals.

Maybe we could point to the dma_fence doc here [1]

> +* ``dma_resv``: A struct dma_resv (a.k.a reservation object) that is used
> +  to track GPU activity in the form of multiple dma_fences on a
> +  gpu_vm or a GEM object. The dma_resv contains an array / list
> +  of dma_fences and a lock that needs to be held when adding
> +  additional dma_fences to the dma_resv. The lock is of a type that
> +  allows deadlock-safe locking of multiple dma_resvs in arbitrary order.

Same for dma_resv [2]

> +* ``exec function``: An exec function is a function that revalidates all
> +  affected gpu_vmas, submits a GPU command batch and registers the
> +  dma_fence representing the GPU command's activity with all affected
> +  dma_resvs. For completeness, although not covered by this document,
> +  it's worth mentioning that an exec function may also be the
> +  revalidation worker that is used by some drivers in compute /
> +  long-running mode.
> +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem
> +  objects also share the gpu_vm's dma_resv.
> +* ``shared object``: a.k.a external object: A GEM object which may be shared
> +  by multiple gpu_vms and whose backing storage may be shared with
> +  other drivers.

Should we name that one external object and mentions it's sometimes
also called external object. This way it matches the name used in gpuvm
implementation (extobj).

> +
> +
> +Locks used and locking orders
> +=============================
> +
> +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
> +dma_resv object and hence the dma_resv lock. So even with a huge
> +number of local GEM objects, only one lock is needed to make the exec
> +sequence atomic.
> +
> +The following locks and locking orders are used:
> +
> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is
> +  partitioned into gpu_vmas. It can also protect the gpu_vm's list of
> +  userptr gpu_vmas. With a CPU mm analogy this would correspond to the
> +  mmap_lock.

I don't see any drm_gpuvm::lock field in Danilo's latest patchset, so,
unless I missed one version, and this lock is actually provided by
drm_gpuvm, I would mention this is a driver-specific lock. This comment
applies to all the locks you describe here actually (mention which ones
are provided by drm_gpuvm, and which ones are driver-specific).

> +* The ``userptr_seqlock``. This lock is taken in read mode for each
> +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
> +  notifier invalidation. This is not a real seqlock but described in
> +  ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side
> +  'lock' a lot like a seqcount, however this allows multiple
> +  write-sides to hold it at once...". The read side critical section
> +  is enclosed by ``mmu_interval_read_begin() /
> +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
> +  sleeping if the write side is held.
> +  The write side is held by the core mm while calling mmu interval
> +  invalidation notifiers.
> +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
> +  rebinding, and also the residency of all the gpu_vm's local GEM object.
> +  Furthermore it typically protects the gpu_vm's list of evicted GEM
> +  objects and external objects.
> +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is
> +  taken in read mode during exec and write mode during a mmu notifier
> +  invalidation. The userptr notifier lock is per gpu_vm.
> +* The gpu_vm list spinlocks. With some implementations they are needed
> +  to be able to update the gpu_vm evicted- and external object
> +  list. For those implementations, the spinlocks are grabbed when the
> +  lists are manipulated. However to avoid locking order violations
> +  with the dma_resv locks, a special scheme is needed when iterating
> +  over the lists.
> +
> +.. _gpu_vma lifetime:
> +
> +Protection and lifetime of gpu_vm_bos and gpu_vmas
> +==================================================
> +
> +The GEM object's list of gpu_vm_bos is typically protected by the
> +GEM object's dma_resv.

Should we mention the driver-custom lock here, for cases where the
driver needs to link a gpu_vm_bo to the GEM list in a dma-signalling
path?

> Each gpu_vm_bo holds a reference counted pointer
> +to the underlying GEM object, and each gpu_vma holds a reference counted
> +pointer to the gpu_vm_bo. When iterating over the GEM object's
> +list of gpu_vm_bos the gem object's dma_resv must thus be held,
> +but if it needs to be dropped during the iteration, care needs to be
> +taken so that any gpu_vm_bo, and the gpu_vm, if dereferenced
> +while the lock is dropped, do not disappear. The easiest way to avoid
> +this is to take a reference on affected objects while the dma_resv is
> +still held. If iterating over the gpu_vm_bo's gpu_vmas, even
> +greater care needs to be taken since the gpu_vmas are not
> +reference counted. If a driver accesses a gpu_vma obtained from
> +the gpu_vm_bo's list of gpu_vmas, and the GEM object's
> +dma_resv is dropped, at the very least, it should be thoroughly
> +documented how the gpu_vma is kept alive. Otherwise holding the
> +GEM object's dma_resv lock also around unlinking a gpu_vma from a
> +gpu_vm_bo will ensure that doesn't happen.

If we have use cases where accessing VMAs without the GEM object
gpuva_list lock held is useful, it would probably be worth mentioning
them. If not, I'm not sure we should list it as a possibility, given
how hard it'd be to guarantee that the VMA is still valid when being
accessed in that situation (I mean, that could be done with some
driver-specific refcounting at the VMA object level, but is it really
something we want to encourage drivers to do?).

> +
> +
> +Revalidation and eviction of local objects
> +==========================================
> +
> +Revalidation
> +____________
> +With VM_BIND, all local objects need to be resident when the gpu is
> +executing using the gpu_vm, and the objects need to have valid
> +gpu_vmas set up pointing to them. Typically each gpu command buffer
> +submission is therefore preceded with a re-validation section:
> +
> +.. code-block:: C
> +
> +   dma_resv_lock(gpu_vm->resv);
> +
> +   // Validation section starts here.
> +   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
> +           validate_gem_bo(&gpu_vm_bo->gem_bo);
> +
> +           // The following list iteration needs the Gem object's
> +           // dma_resv to be held (it protects the gpu_vm_bo's list of
> +           // gpu_vmas, but since local gem objects share the gpu_vm's
> +           // dma_resv, it is already held at this point.
> +           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
> +                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
> +   }
> +
> +   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) {
> +           rebind_gpu_vma(&gpu_vma);
> +           remove_gpu_vma_from_rebind_list(&gpu_vma);
> +   }
> +   // Validation section ends here, and job submission starts.
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   dma_resv_unlock(gpu_vm->resv);

This code excerpt is lacking all the retryable resv-locking + slot
reservation steps. I mean, we'd normally use a drm_exec (or
drm_gpuvm_exec) context to acquire the VM resvs, all extobjs resvs +
all other driver internal resvs that are not attached to any of the VM
BOs. I wonder if this simplification is not going to cause trouble in
the long run, but at the very least, it should be mentioned that this
is a simplified version.

> +
> +The reason for having a separate gpu_vm rebind list is that there
> +might be userptr gpu_vmas that are not mapping a buffer object that
> +also need rebinding.
> +
> +Eviction
> +________
> +
> +Eviction of one of these local objects will then look similar to the
> +following:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo);
> +           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);

Could you expend a bit on what the eviction job does? On embedded GPUs,
where there's no VRAM, we typically don't have a GPU job to teardown
GPU mappings. We do have an asynchronous VM_BIND queue though, so I
suspect the job here is an async VM_BIND job.

Not asking you to add that to the doc, I'm just trying to figure out how
this would map to the mem-reclaim logic in panthor...

> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +Note that since the object is local to the gpu_vm, it will share the gpu_vm's
> +dma_resv lock so that ``obj->resv == gpu_vm->resv``.
> +The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list,
> +which is protected by ``gpu_vm->resv``, that is always locked while
> +evicting, due to the above equality.
> +
> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction,
> +Since the driver must ensure that the eviction blit or copy will wait
> +for GPU idle or depend on all previous GPU activity. Furthermore, any
> +subsequent attempt by the GPU to access freed memory through the
> +gpu_vma will be preceded by a new exec function, with a revalidation
> +section which will make sure all gpu_vmas are rebound. The eviction
> +code holding the object's dma_resv while revalidating will ensure a
> +new exec function may not race with the eviction. Note that this will
> +not hold true, however, if only a subsets of vmas are, due to the
> +driver implementation, selected for rebinding the next exec
> +function.

This last sentence is hard to follow.

"
Note that this will not hold true if only a subset of vmas
are selected for rebinding during the next exec call (for instance, due
to some driver decision to only partially restore VMAs).
"

> Then all vmas *not* selected for rebinding needs to be
> +properly unbound before re-enabling GPU access to the VM.

I think I get the problem, but can we have a use case where partial
VMA restoration is useful? I mean, if some VMAs are not needed, we
might be able to save MMU page table allocation/setup-time, but given
the mess it then is to track those non-live VMAs, I'm wondering if it's
leaving the door open for that, unless there's a good reason to do it.

> +
> +
> +Locking with external (or shared) buffer objects
> +================================================
> +
> +Since shared buffer objects may be shared by multiple gpu_vm's they
> +can't share their reservation object with a single gpu_vm, but will rather
> +have a reservation object of their own. The shared objects bound to a
> +gpu_vm using one or many gpu_vmas are therefore typically put on a
> +per-gpu_vm list which is protected by the gpu_vm's dma_resv lock. Once
> +the gpu_vm's reservation object  is locked, it is safe to traverse the

                                  ^ extra space

> +external object list and lock the dma_resvs of all external objects.
> +
> +At eviction time we now need to put the gpu_vm_bos of *all* gpu_vms a
> +shared object is bound to on the gpu_vm's evict list, but we can no longer
> +be certain that we hold the gpu_vm's dma_resv of all the gpu_vms the
> +object is bound to, since at eviction time we only hold the object's
> +private dma_resv. If we have a ww_acquire context at hand at eviction
> +time we could grab the those dma_resvs but that could cause
> +expensive ww_mutex rollbacks. A simple option is to just mark the
> +gpu_vm_bos of the evicted gem object with an ``evicted`` bool that
> +is inspected the next time the corresponding gpu_vm evicted list needs
> +to be traversed.

IIUC, the vm_bo is not added to the evicted list immediately in that
case, so I suspect you meant the next time the corresponding gpu_vm
*extobj* list needs to be traversed.

> At that time the gpu_vm's dma_resv and the object's
> +dma_resv is held, and the gpu_vm_bo marked evicted, can then be added
> +to the gpu_vm's list of evicted gpu_vm_bos. The ``evicted`` bool would
> +then be protected by the object's dma_resv.

I would be more asserted here:

"
Note that accesses to the ``evicted`` bool are protected by the
GEM object's dma_resv.
"


> +
> +The exec function would then become
> +
> +.. code-block:: C
> +
> +   dma_resv_lock(gpu_vm->resv);
> +
> +   // External object list is protected by the gpu_vm->resv lock.
> +   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
> +           dma_resv_lock(gpu_vm_bo.gem_obj->resv);

Ok, so it's really pseudo-code where you're assuming dma_resv_lock()
calls are actually locks with a ww_acquire context. This really needs
to be clarified to people don't copy/paste this code directly.

> +           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
> +                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
> +   }
> +
> +   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
> +           validate_gem_bo(&gpu_vm_bo->gem_bo);
> +
> +           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
> +                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
> +   }
> +
> +   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) {
> +           rebind_gpu_vma(&gpu_vma);
> +           remove_gpu_vma_from_rebind_list(&gpu_vma);
> +   }
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +   dma_resv_unlock_all_resv_locks();
> +
> +And the corresponding shared-object aware eviction would look like:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
> +           if (object_is_vm_local(obj))
> +                add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
> +           else
> +                mark_gpu_vm_bo_evicted(&gpu_vm_bo);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);
> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +.. _Spinlock iteration:
> +
> +Accessing the gpu_vm's lists without the dma_resv lock held
> +===========================================================
> +
> +Although some drivers will hold the gpu_vm's dma_resv lock when
> +accessing the gpu_vm's evict list and external objects lists, there
> +are drivers that need to access these lists without the dma_resv lock held,
> +for example due to asynchronous state updates from within the
> +dma_fence signalling critical path. In such case a spinlock can be
> +used to protect manipulation of the lists. However, since higher level
> +sleeping locks needs to be taken for each list item while iterating

                  ^ need

> +over the lists, the items already iterated over needs to be

                                                   ^ need

> +temporarily moved to a private list and the spinlock released
> +while processing each item:
> +
> +.. code block:: C
> +
> +    struct list_head still_in_list;
> +
> +    INIT_LIST_HEAD(&still_in_list);
> +
> +    spin_lock(&gpu_vm->list_lock);
> +    do {
> +            struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);
> +
> +            if (!entry)
> +                    break;
> +
> +            list_move_tail(&entry->head, &still_in_list);
> +            list_entry_get_unless_zero(entry);
> +            spin_unlock(&gpu_vm->list_lock);
> +
> +            process(entry);
> +
> +            spin_lock(&gpu_vm->list_lock);
> +            list_entry_put(entry);
> +    } while (true);
> +
> +    list_splice_tail(&still_in_list, &gpu_vm->list);
> +    spin_unlock(&gpu_vm->list_lock);
> +
> +However, due to the additional locking and atomic operations, drivers that *can*

I would drop the 'However'.

> +avoid accessing the gpu_vm's list outside of the dma_resv lock
> +might want to avoid this iteration scheme, if the driver anticipates a
> +large number of list items. For lists where the anticipated number of
> +list items is small, list iteration doesn't happen very often, or
> +there is a significant additional cost associated with each iteration,
> +the atomic operation overhead associated with this type of iteration
> +is, however, probably negligible. Note that if this scheme is

'however' can be dropped here as well.

> +used, it is necessary to make sure this list iteration is protected by
> +an outer level lock or semaphore, since list items are temporarily
> +pulled off the list while iterating.
> +
> +TODO: Pointer to the gpuvm code implementation if this iteration and
> +how to choose either iteration scheme.
> +
> +userptr gpu_vmas
> +================
> +
> +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
> +GPU virtual address range, directly maps a CPU mm range of anonymous-
> +or file page-cache pages.
> +A very simple approach would be to just pin the pages using
> +pin_user_pages() at bind time and unpin them at unbind time, but this
> +creates a Denial-Of-Service vector since a single user-space process
> +would be able to pin down all of system memory, which is not
> +desirable. (For special use-cases and with proper accounting pinning might
> +still be a desirable feature, though). What we need to do in the
> +general case is to obtain a reference to the desired pages, make sure
> +we are notified using a MMU notifier just before the CPU mm unmaps the
> +pages, dirty them if they are not mapped read-only to the GPU, and
> +then drop the reference.
> +When we are notified by the MMU notifier that CPU mm is about to drop the
> +pages, we need to stop GPU access to the pages by waiting for VM idle
> +in the MMU notifier and make sure that before the next time the GPU
> +tries to access whatever is now present in the CPU mm range, we unmap
> +the old pages from the GPU page tables and repeat the process of
> +obtaining new page references. (See the :ref:`notifier example
> +<Invalidation example>` below). Note that when the core mm decides to
> +laundry pages, we get such an unmap MMU notification and can mark the
> +pages dirty again before the next GPU access. We also get similar MMU
> +notifications for NUMA accounting which the GPU driver doesn't really
> +need to care about, but so far it has proven difficult to exclude
> +certain notifications.
> +
> +Using a MMU notifier for device DMA (and other methods) is described in
> +`this document
> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
> +
> +Now the method of obtaining struct page references using
> +get_user_pages() unfortunately can't be used under a dma_resv lock
> +since that would violate the locking order of the dma_resv lock vs the
> +mmap_lock that is grabbed when resolving a CPU pagefault. This means
> +the gpu_vm's list of userptr gpu_vmas needs to be protected by an
> +outer lock.
> +
> +The MMU interval seqlock for a userptr gpu_vma is used in the following
> +way:
> +
> +.. code-block:: C
> +
> +   // Exclusive locking mode here is strictly needed only if there are
> +   // invalidated userptr vmas present, to avoid multiple userptr
> +   // revalidations.
> +   down_write(&gpu_vm->lock);
> +   retry:
> +
> +   // Note: mmu_interval_read_begin() blocks until there is no
> +   // invalidation notifier running anymore.
> +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
> +   if (seq != gpu_vma->saved_seq) {
> +           obtain_new_page_pointers(&gpu_vma);
> +           dma_resv_lock(&gpu_vm->resv);
> +           add_gpu_vma_top_revalidate_list(&gpu_vma, &gpu_vm);
> +           dma_resv_unlock(&gpu_vm->resv);
> +           gpu_vma->saved_seq = seq;
> +   }
> +
> +   // The usual revalidation goes here.
> +
> +   // Final userptr sequence validation may not happen before the
> +   // submission dma_fence is added to the gpu_vm's resv, from the POW
> +   // of the MMU invalidation notifier. Hence the
> +   // userptr_notifier_lock that will make them appear atomic.
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   down_read(&gpu_vm->userptr_notifier_lock);
> +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
> +          up_read(&gpu_vm->userptr_notifier_lock);
> +          goto retry;
> +   }
> +
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +
> +   dma_resv_unlock_all_resv_locks();
> +   up_read(&gpu_vm->userptr_notifier_lock);
> +   up_write(&gpu_vm->lock);
> +
> +The code between ``mmu_interval_read_begin()`` and the
> +``mmu_interval_read_retry()`` marks the read side critical section of
> +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
> +gpu_vma list is looped through, and the check is done for *all* of its
> +userptr gpu_vmas, although we only show a single one here.
> +
> +The userptr gpu_vma MMU invalidation notifier might be called from
> +reclaim context and, again to avoid locking order violations, we can't
> +take any dma_resv lock nor the gpu_vm->lock from within it.
> +
> +.. _Invalidation example:
> +.. code-block:: C
> +
> +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
> +  {
> +          // Make sure the exec function either sees the new sequence
> +          // and backs off or we wait for the dma-fence:
> +
> +          down_write(&gpu_vm->userptr_notifier_lock);
> +          mmu_interval_set_seq(userptr_interval, cur_seq);
> +          up_write(&gpu_vm->userptr_notifier_lock);
> +
> +          // At this point, the exec function can't succeed in
> +          // submitting a new job, because cur_seq is an invalid
> +          // sequence number and will always cause a retry. When all
> +          // invalidation callbacks, the mmu notifier core will flip
> +          // the sequence number to a valid one. However we need to
> +          // stop gpu access to the old pages here.
> +
> +          dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
> +                                false, MAX_SCHEDULE_TIMEOUT);
> +          return true;
> +  }
> +
> +When this invalidation notifier returns, the GPU can no longer be
> +accessing the old pages of the userptr gpu_vma and needs to redo the
> +page-binding before a new GPU submission can succeed.

Don't know enough about userptr to properly review this section, sorry
:-/.

> +
> +Efficient userptr gpu_vma exec_function iteration
> +_________________________________________________
> +
> +If the gpu_vm's list of userptr gpu_vmas becomes large, it's
> +inefficient to iterate through the complete lists of userptrs on each
> +exec function to check whether each userptr gpu_vma's saved
> +sequence number is invalid or stale. A solution to this is to put all
> +*invalidated* userptr gpu_vmas on a separate gpu_vm list and
> +only those gpu_vmas on the list are actually checked on each exec
> +function. This list will then lend itself very-well to the spinlock
> +locking scheme that is
> +:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
> +in the mmu notifier, where we add the invalidated gpu_vmas to the
> +list, it's not possible to take any outer locks like the
> +``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
> +``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
> +complete, as also mentioned in that section.
> +
> +If using an invalidated userptr list like this, the retry check in the
> +exec function trivially becomes a check for invalidated list empty.
> +
> +Locking at bind and unbind time
> +================================
> +
> +At bind time, assuming a GEM object backed gpu_vma, each
> +gpu_vma needs to be associated with a gpu_vm_bo and that
> +gpu_vm_bo in turn needs to be added to the GEM object's
> +gpu_vm_bo list, and possibly to the gpu_vm's external object
> +list. This is referred to as *linking* the gpu_vma, and typically
> +requires that the ``gpu_vm->resv`` and the GEM object's dma_resv are
> +held.

Again, depends on the locking scheme you picked. While I agree that
insertion/removal of extobjs is likely to be done with the VM and GEM
resv held, link/unlink being done when the VMA is
inserted/removed in the tree, it really depends on the VM update model
chosen by the driver (update early/at-VM_BIND-ioctl-time vs
JIT/at-VM_BIND-job-exec-time).

We went over it quite a few times with Danilo and you, and I keep
forgetting why early link/deferred unlink (leaving the VMA tree update
in the dma-signalling path, but the link/unlink being done outside of
this path) wouldn't work. I think it'd be worth documenting this kind of
details, so we don't come to discuss it over and over again, just to
end with the exact same conclusion :-). Maybe we should add a "drm_gpuvm
design decisions" section gathering all those tricky/hard to grasp
details at some point.


> When unlinking a gpu_vma the same locks are typically held,
> +and that ensures, as briefly discussed
> +:ref:`previously <gpu_vma lifetime>`, that when iterating over
> +``gpu_vmas`, either under the ``gpu_vm->resv`` or the GEM
> +object's dma_resv, that the gpu_vmas stay alive as long
> +as the lock under which we iterate are not is not released. For
> +userptr gpu_vmas it's similarly required that during unlink, the
> +outer ``gpu_vm->lock`` is held, since otherwise when iterating over
> +the invalidated userptr list as described in the previous section,
> +there is nothing keeping those userptr gpu_vmas alive.
> diff --git a/Documentation/gpu/implementation_guidelines.rst b/Documentation/gpu/implementation_guidelines.rst
> index 138e637dcc6b..dbccfa72f1c9 100644
> --- a/Documentation/gpu/implementation_guidelines.rst
> +++ b/Documentation/gpu/implementation_guidelines.rst
> @@ -7,3 +7,4 @@ Misc DRM driver uAPI- and feature implementation guidelines
>  .. toctree::
>  
>     drm-vm-bind-async
> +   drm-vm-bind-locking

Looks pretty good overall. My only complaint would be that you don't
refer much to drm_gpuvm, when it's actually suppose to become the
common ground for GPU VM handling in drivers (at least that was my
understanding).

Regards,

Boris

[1]https://docs.kernel.org/driver-api/dma-buf.html#dma-fences
[2]https://docs.kernel.org/driver-api/dma-buf.html#reservation-objects
Thomas Hellström Nov. 16, 2023, 11:48 a.m. UTC | #6
Hi, Boris,

Thanks for reviewing. Some comments below:

On Thu, 2023-11-16 at 10:48 +0100, Boris Brezillon wrote:
> Hi Thomas,
> 
> On Wed, 15 Nov 2023 13:49:37 +0100
> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> 
> > Add the first version of the VM_BIND locking document which is
> > intended to be part of the xe driver upstreaming agreement.
> > 
> > The document describes and discuss the locking used during exec-
> > functions, evicton and for userptr gpu-vmas. Intention is to be
> > using the
> > same nomenclature as the drm-vm-bind-async.rst.
> > 
> > v2:
> > - s/gvm/gpu_vm/g (Rodrigo Vivi)
> > - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
> >   (Rodrigo Vivi)
> > - Adjust commit message accordingly.
> > - Add SPDX license header.
> > 
> > v3:
> > - Large update to align with the drm_gpuvm manager locking
> > - Add "Efficient userptr gpu_vma exec function iteration" section
> > - Add "Locking at bind- and unbind time" section.
> > 
> > v4:
> > - Fix tabs vs space errors by untabifying (Rodrigo Vivi)
> > - Minor style fixes and typos (Rodrigo Vivi)
> > - Clarify situations where stale GPU mappings are occurring and how
> >   access through these mappings are blocked. (Rodrigo Vivi)
> > - Insert into the toctree in implementation_guidelines.rst
> > 
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  Documentation/gpu/drm-vm-bind-locking.rst     | 503
> > ++++++++++++++++++
> >  .../gpu/implementation_guidelines.rst         |   1 +
> >  2 files changed, 504 insertions(+)
> >  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> > 
> > diff --git a/Documentation/gpu/drm-vm-bind-locking.rst
> > b/Documentation/gpu/drm-vm-bind-locking.rst
> > new file mode 100644
> > index 000000000000..bc701157cb34
> > --- /dev/null
> > +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> > @@ -0,0 +1,503 @@
> > +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> > +
> > +===============
> > +VM_BIND locking
> > +===============
> > +
> > +This document attempts to describe what's needed to get VM_BIND
> > locking right,
> > +including the userptr mmu_notifier locking and it will also
> > discuss some
> > +optimizations to get rid of the looping through of all userptr
> > mappings and
> > +external / shared object mappings that is needed in the simplest
> > +implementation. It will also discuss some implications for
> > faulting gpu_vms.
> > +
> > +Nomenclature
> > +============
> > +
> > +* ``Context``: GPU execution context.
> > +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> > +  meta-data. Typically one per client (DRM file-private), or one
> > per
> > +  context.
> 
> Should we mention that it's a driver object, likely inheriting from
> drm_gpuvm?

Yes, I can try to be a bit more drm_gpuvm-centric throughout the
document, although I want to avoid being too specific due to the
current rapid drm_gpuvm rate of change, which might also affect this
document. I guess I will have to commit for now to update the document
on each gpuvm series we land...

> 
> > +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm
> > with
> > +  associated meta-data. The backing storage of a gpu_vma can
> > either be
> > +  a GEM object or anonymous pages mapped also into the CPU
> > +  address space for the process.
> > +* gpu_vm_bo: Abstracts the association of a GEM object and
> > +  a VM. Note that if only one gpu_vma per vm and buffer object
> > were
> > +  allowed, the state stored with a gpu_vm_bo could just as well
> > have
> > +  been stored with the gpu_vma.
> 
> I find this note confusing, and I don't think it brings any value to
> the rest of the explanation. Sure, there's a way we could have
> optimized things for 1:1 VM:BO mappings, but do we care about this
> very
> restrictive use case in this context.

I'll remove that note.

> 
> > For the purpose of this document, each
> > +  GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo
> > +  maintains a list of gpu_vmas.
> > +* ``userptr gpu_vma or just userptr``: A gpu_vma, whose backing
> > store
> > +  is anonymous pages as described above.
> > +* ``revalidating``: Revalidating a gpu_vma means making the latest
> > version
> > +  of the backing store resident and making sure the gpu_vma's
> > +  page-table entries point to that backing store.
> > +* ``dma_fence``: A struct dma_fence that is similar to a struct
> > completion
> > +  and which tracks GPU activity. When the GPU activity is
> > finished,
> > +  the dma_fence signals.
> 
> Maybe we could point to the dma_fence doc here [1]
> 
> > +* ``dma_resv``: A struct dma_resv (a.k.a reservation object) that
> > is used
> > +  to track GPU activity in the form of multiple dma_fences on a
> > +  gpu_vm or a GEM object. The dma_resv contains an array / list
> > +  of dma_fences and a lock that needs to be held when adding
> > +  additional dma_fences to the dma_resv. The lock is of a type
> > that
> > +  allows deadlock-safe locking of multiple dma_resvs in arbitrary
> > order.
> 
> Same for dma_resv [2]
> 
> > +* ``exec function``: An exec function is a function that
> > revalidates all
> > +  affected gpu_vmas, submits a GPU command batch and registers the
> > +  dma_fence representing the GPU command's activity with all
> > affected
> > +  dma_resvs. For completeness, although not covered by this
> > document,
> > +  it's worth mentioning that an exec function may also be the
> > +  revalidation worker that is used by some drivers in compute /
> > +  long-running mode.
> > +* ``local object``: A GEM object which is local to a gpu_vm.
> > Shared gem
> > +  objects also share the gpu_vm's dma_resv.
> > +* ``shared object``: a.k.a external object: A GEM object which may
> > be shared
> > +  by multiple gpu_vms and whose backing storage may be shared with
> > +  other drivers.
> 
> Should we name that one external object and mentions it's sometimes
> also called external object. This way it matches the name used in
> gpuvm
> implementation (extobj).

You mean "... also called shared object"? Shure, can do that.
> 
> > +
> > +
> > +Locks used and locking orders
> > +=============================
> > +
> > +One of the benefits of VM_BIND is that local GEM objects share the
> > gpu_vm's
> > +dma_resv object and hence the dma_resv lock. So even with a huge
> > +number of local GEM objects, only one lock is needed to make the
> > exec
> > +sequence atomic.
> > +
> > +The following locks and locking orders are used:
> > +
> > +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the
> > gpu_vm is
> > +  partitioned into gpu_vmas. It can also protect the gpu_vm's list
> > of
> > +  userptr gpu_vmas. With a CPU mm analogy this would correspond to
> > the
> > +  mmap_lock.
> 
> I don't see any drm_gpuvm::lock field in Danilo's latest patchset,
> so,
> unless I missed one version, and this lock is actually provided by
> drm_gpuvm, I would mention this is a driver-specific lock. This
> comment
> applies to all the locks you describe here actually (mention which
> ones
> are provided by drm_gpuvm, and which ones are driver-specific).

These will be needed also by gpuvm when implementing userptr vmas, so I
can mention that drm_gpuvm is currently lacking a userptr
implementation, so "the locks described below are to be considered
driver-specific for now"

> 
> > +* The ``userptr_seqlock``. This lock is taken in read mode for
> > each
> > +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode
> > during mmu
> > +  notifier invalidation. This is not a real seqlock but described
> > in
> > +  ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side
> > +  'lock' a lot like a seqcount, however this allows multiple
> > +  write-sides to hold it at once...". The read side critical
> > section
> > +  is enclosed by ``mmu_interval_read_begin() /
> > +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
> > +  sleeping if the write side is held.
> > +  The write side is held by the core mm while calling mmu interval
> > +  invalidation notifiers.
> > +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of
> > gpu_vmas needing
> > +  rebinding, and also the residency of all the gpu_vm's local GEM
> > object.
> > +  Furthermore it typically protects the gpu_vm's list of evicted
> > GEM
> > +  objects and external objects.
> > +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is
> > +  taken in read mode during exec and write mode during a mmu
> > notifier
> > +  invalidation. The userptr notifier lock is per gpu_vm.
> > +* The gpu_vm list spinlocks. With some implementations they are
> > needed
> > +  to be able to update the gpu_vm evicted- and external object
> > +  list. For those implementations, the spinlocks are grabbed when
> > the
> > +  lists are manipulated. However to avoid locking order violations
> > +  with the dma_resv locks, a special scheme is needed when
> > iterating
> > +  over the lists.
> > +
> > +.. _gpu_vma lifetime:
> > +
> > +Protection and lifetime of gpu_vm_bos and gpu_vmas
> > +==================================================
> > +
> > +The GEM object's list of gpu_vm_bos is typically protected by the
> > +GEM object's dma_resv.
> 
> Should we mention the driver-custom lock here, for cases where the
> driver needs to link a gpu_vm_bo to the GEM list in a dma-signalling
> path?

Yes, can do that.

> 
> > Each gpu_vm_bo holds a reference counted pointer
> > +to the underlying GEM object, and each gpu_vma holds a reference
> > counted
> > +pointer to the gpu_vm_bo. When iterating over the GEM object's
> > +list of gpu_vm_bos the gem object's dma_resv must thus be held,
> > +but if it needs to be dropped during the iteration, care needs to
> > be
> > +taken so that any gpu_vm_bo, and the gpu_vm, if dereferenced
> > +while the lock is dropped, do not disappear. The easiest way to
> > avoid
> > +this is to take a reference on affected objects while the dma_resv
> > is
> > +still held. If iterating over the gpu_vm_bo's gpu_vmas, even
> > +greater care needs to be taken since the gpu_vmas are not
> > +reference counted. If a driver accesses a gpu_vma obtained from
> > +the gpu_vm_bo's list of gpu_vmas, and the GEM object's
> > +dma_resv is dropped, at the very least, it should be thoroughly
> > +documented how the gpu_vma is kept alive. Otherwise holding the
> > +GEM object's dma_resv lock also around unlinking a gpu_vma from a
> > +gpu_vm_bo will ensure that doesn't happen.
> 
> If we have use cases where accessing VMAs without the GEM object
> gpuva_list lock held is useful, it would probably be worth mentioning
> them. If not, I'm not sure we should list it as a possibility, given
> how hard it'd be to guarantee that the VMA is still valid when being
> accessed in that situation (I mean, that could be done with some
> driver-specific refcounting at the VMA object level, but is it really
> something we want to encourage drivers to do?).
> 
> > +
> > +
> > +Revalidation and eviction of local objects
> > +==========================================
> > +
> > +Revalidation
> > +____________
> > +With VM_BIND, all local objects need to be resident when the gpu
> > is
> > +executing using the gpu_vm, and the objects need to have valid
> > +gpu_vmas set up pointing to them. Typically each gpu command
> > buffer
> > +submission is therefore preceded with a re-validation section:
> > +
> > +.. code-block:: C
> > +
> > +   dma_resv_lock(gpu_vm->resv);
> > +
> > +   // Validation section starts here.
> > +   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list,
> > &gpu_vm_bo) {
> > +           validate_gem_bo(&gpu_vm_bo->gem_bo);
> > +
> > +           // The following list iteration needs the Gem object's
> > +           // dma_resv to be held (it protects the gpu_vm_bo's
> > list of
> > +           // gpu_vmas, but since local gem objects share the
> > gpu_vm's
> > +           // dma_resv, it is already held at this point.
> > +           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
> > +                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm-
> > >rebind_list);
> > +   }
> > +
> > +   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma)
> > {
> > +           rebind_gpu_vma(&gpu_vma);
> > +           remove_gpu_vma_from_rebind_list(&gpu_vma);
> > +   }
> > +   // Validation section ends here, and job submission starts.
> > +
> > +   add_dependencies(&gpu_job, &gpu_vm->resv);
> > +   job_dma_fence = gpu_submit(&gpu_job));
> > +
> > +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> > +   dma_resv_unlock(gpu_vm->resv);
> 
> This code excerpt is lacking all the retryable resv-locking + slot
> reservation steps. I mean, we'd normally use a drm_exec (or
> drm_gpuvm_exec) context to acquire the VM resvs, all extobjs resvs +
> all other driver internal resvs that are not attached to any of the
> VM
> BOs. I wonder if this simplification is not going to cause trouble in
> the long run, but at the very least, it should be mentioned that this
> is a simplified version.
> 
> > +
> > +The reason for having a separate gpu_vm rebind list is that there
> > +might be userptr gpu_vmas that are not mapping a buffer object
> > that
> > +also need rebinding.
> > +
> > +Eviction
> > +________
> > +
> > +Eviction of one of these local objects will then look similar to
> > the
> > +following:
> > +
> > +.. code-block:: C
> > +
> > +   obj = get_object_from_lru();
> > +
> > +   dma_resv_lock(obj->resv);
> > +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo);
> > +           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm-
> > >evict_list);
> > +
> > +   add_dependencies(&eviction_job, &obj->resv);
> > +   job_dma_fence = gpu_submit(&eviction_job);
> 
> Could you expend a bit on what the eviction job does? On embedded
> GPUs,
> where there's no VRAM, we typically don't have a GPU job to teardown
> GPU mappings. We do have an asynchronous VM_BIND queue though, so I
> suspect the job here is an async VM_BIND job.
> 
> Not asking you to add that to the doc, I'm just trying to figure out
> how
> this would map to the mem-reclaim logic in panthor...

This would typically be the blit of data from VRAM to system, but async
jobs here may also include zapping page-tables and TLB flush in case of
recoverable page-faults.

> 
> > +   add_dma_fence(&obj->resv, job_dma_fence);
> > +
> > +   dma_resv_unlock(&obj->resv);
> > +   put_object(obj);
> > +
> > +Note that since the object is local to the gpu_vm, it will share
> > the gpu_vm's
> > +dma_resv lock so that ``obj->resv == gpu_vm->resv``.
> > +The gpu_vm_bos marked for eviction are put on the gpu_vm's evict
> > list,
> > +which is protected by ``gpu_vm->resv``, that is always locked
> > while
> > +evicting, due to the above equality.
> > +
> > +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before
> > eviction,
> > +Since the driver must ensure that the eviction blit or copy will
> > wait
> > +for GPU idle or depend on all previous GPU activity. Furthermore,
> > any
> > +subsequent attempt by the GPU to access freed memory through the
> > +gpu_vma will be preceded by a new exec function, with a
> > revalidation
> > +section which will make sure all gpu_vmas are rebound. The
> > eviction
> > +code holding the object's dma_resv while revalidating will ensure
> > a
> > +new exec function may not race with the eviction. Note that this
> > will
> > +not hold true, however, if only a subsets of vmas are, due to the
> > +driver implementation, selected for rebinding the next exec
> > +function.
> 
> This last sentence is hard to follow.
> 
> "
> Note that this will not hold true if only a subset of vmas
> are selected for rebinding during the next exec call (for instance,
> due
> to some driver decision to only partially restore VMAs).
> "
> 
> > Then all vmas *not* selected for rebinding needs to be
> > +properly unbound before re-enabling GPU access to the VM.
> 
> I think I get the problem, but can we have a use case where partial
> VMA restoration is useful? I mean, if some VMAs are not needed, we
> might be able to save MMU page table allocation/setup-time, but given
> the mess it then is to track those non-live VMAs, I'm wondering if
> it's
> leaving the door open for that, unless there's a good reason to do
> it.

This is the use-case Christian has been flagging for for OpenGL and
Media where he warns that the single-vm-memory-overcommit case would
otherwise make the app crawl.

Generalized one might want to think of these as groups of (or perhaps
virtual ranges of) vmas that need to be resident for a single job
submission. Currently we assume the residency-group <-> vm mapping is
1:1, allowing for the unbind-before-eviction to be ignored, but I
figure moving forward and addressing performance problems of real-world
applications that may not always be true.

> 
> > +
> > +
> > +Locking with external (or shared) buffer objects
> > +================================================
> > +
> > +Since shared buffer objects may be shared by multiple gpu_vm's
> > they
> > +can't share their reservation object with a single gpu_vm, but
> > will rather
> > +have a reservation object of their own. The shared objects bound
> > to a
> > +gpu_vm using one or many gpu_vmas are therefore typically put on a
> > +per-gpu_vm list which is protected by the gpu_vm's dma_resv lock.
> > Once
> > +the gpu_vm's reservation object  is locked, it is safe to traverse
> > the
> 
>                                   ^ extra space
> 
> > +external object list and lock the dma_resvs of all external
> > objects.
> > +
> > +At eviction time we now need to put the gpu_vm_bos of *all*
> > gpu_vms a
> > +shared object is bound to on the gpu_vm's evict list, but we can
> > no longer
> > +be certain that we hold the gpu_vm's dma_resv of all the gpu_vms
> > the
> > +object is bound to, since at eviction time we only hold the
> > object's
> > +private dma_resv. If we have a ww_acquire context at hand at
> > eviction
> > +time we could grab the those dma_resvs but that could cause
> > +expensive ww_mutex rollbacks. A simple option is to just mark the
> > +gpu_vm_bos of the evicted gem object with an ``evicted`` bool that
> > +is inspected the next time the corresponding gpu_vm evicted list
> > needs
> > +to be traversed.
> 
> IIUC, the vm_bo is not added to the evicted list immediately in that
> case, so I suspect you meant the next time the corresponding gpu_vm
> *extobj* list needs to be traversed.

I like to think of the concept as "lock and validate the list before
traversing it" and you're right in this case, the validation occurs
slightly before we actually start looking at the evicted list. But I
could perhaps rephrase to "...bool that is inspected *before* the next
time..."

> 
> > At that time the gpu_vm's dma_resv and the object's
> > +dma_resv is held, and the gpu_vm_bo marked evicted, can then be
> > added
> > +to the gpu_vm's list of evicted gpu_vm_bos. The ``evicted`` bool
> > would
> > +then be protected by the object's dma_resv.
> 
> I would be more asserted here:
> 
> "
> Note that accesses to the ``evicted`` bool are protected by the
> GEM object's dma_resv.
> "
> 
> 
> > +
> > +The exec function would then become
> > +
> > +.. code-block:: C
> > +
> > +   dma_resv_lock(gpu_vm->resv);
> > +
> > +   // External object list is protected by the gpu_vm->resv lock.
> > +   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
> > +           dma_resv_lock(gpu_vm_bo.gem_obj->resv);
> 
> Ok, so it's really pseudo-code where you're assuming dma_resv_lock()
> calls are actually locks with a ww_acquire context. This really needs
> to be clarified to people don't copy/paste this code directly.
> 
> > +           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
> > +                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo,
> > &gpu_vm->evict_list);
> > +   }
> > +
> > +   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list,
> > &gpu_vm_bo) {
> > +           validate_gem_bo(&gpu_vm_bo->gem_bo);
> > +
> > +           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
> > +                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm-
> > >rebind_list);
> > +   }
> > +
> > +   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma)
> > {
> > +           rebind_gpu_vma(&gpu_vma);
> > +           remove_gpu_vma_from_rebind_list(&gpu_vma);
> > +   }
> > +
> > +   add_dependencies(&gpu_job, &gpu_vm->resv);
> > +   job_dma_fence = gpu_submit(&gpu_job));
> > +
> > +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> > +   for_each_shared_obj(gpu_vm, &obj)
> > +          add_dma_fence(job_dma_fence, &obj->resv);
> > +   dma_resv_unlock_all_resv_locks();
> > +
> > +And the corresponding shared-object aware eviction would look
> > like:
> > +
> > +.. code-block:: C
> > +
> > +   obj = get_object_from_lru();
> > +
> > +   dma_resv_lock(obj->resv);
> > +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
> > +           if (object_is_vm_local(obj))
> > +                add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm-
> > >evict_list);
> > +           else
> > +                mark_gpu_vm_bo_evicted(&gpu_vm_bo);
> > +
> > +   add_dependencies(&eviction_job, &obj->resv);
> > +   job_dma_fence = gpu_submit(&eviction_job);
> > +   add_dma_fence(&obj->resv, job_dma_fence);
> > +
> > +   dma_resv_unlock(&obj->resv);
> > +   put_object(obj);
> > +
> > +.. _Spinlock iteration:
> > +
> > +Accessing the gpu_vm's lists without the dma_resv lock held
> > +===========================================================
> > +
> > +Although some drivers will hold the gpu_vm's dma_resv lock when
> > +accessing the gpu_vm's evict list and external objects lists,
> > there
> > +are drivers that need to access these lists without the dma_resv
> > lock held,
> > +for example due to asynchronous state updates from within the
> > +dma_fence signalling critical path. In such case a spinlock can be
> > +used to protect manipulation of the lists. However, since higher
> > level
> > +sleeping locks needs to be taken for each list item while
> > iterating
> 
>                   ^ need
> 
> > +over the lists, the items already iterated over needs to be
> 
>                                                    ^ need
> 
> > +temporarily moved to a private list and the spinlock released
> > +while processing each item:
> > +
> > +.. code block:: C
> > +
> > +    struct list_head still_in_list;
> > +
> > +    INIT_LIST_HEAD(&still_in_list);
> > +
> > +    spin_lock(&gpu_vm->list_lock);
> > +    do {
> > +            struct list_head *entry =
> > list_first_entry_or_null(&gpu_vm->list, head);
> > +
> > +            if (!entry)
> > +                    break;
> > +
> > +            list_move_tail(&entry->head, &still_in_list);
> > +            list_entry_get_unless_zero(entry);
> > +            spin_unlock(&gpu_vm->list_lock);
> > +
> > +            process(entry);
> > +
> > +            spin_lock(&gpu_vm->list_lock);
> > +            list_entry_put(entry);
> > +    } while (true);
> > +
> > +    list_splice_tail(&still_in_list, &gpu_vm->list);
> > +    spin_unlock(&gpu_vm->list_lock);
> > +
> > +However, due to the additional locking and atomic operations,
> > drivers that *can*
> 
> I would drop the 'However'.
> 
> > +avoid accessing the gpu_vm's list outside of the dma_resv lock
> > +might want to avoid this iteration scheme, if the driver
> > anticipates a
> > +large number of list items. For lists where the anticipated number
> > of
> > +list items is small, list iteration doesn't happen very often, or
> > +there is a significant additional cost associated with each
> > iteration,
> > +the atomic operation overhead associated with this type of
> > iteration
> > +is, however, probably negligible. Note that if this scheme is
> 
> 'however' can be dropped here as well.
> 
> > +used, it is necessary to make sure this list iteration is
> > protected by
> > +an outer level lock or semaphore, since list items are temporarily
> > +pulled off the list while iterating.
> > +
> > +TODO: Pointer to the gpuvm code implementation if this iteration
> > and
> > +how to choose either iteration scheme.
> > +
> > +userptr gpu_vmas
> > +================
> > +
> > +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer
> > object to a
> > +GPU virtual address range, directly maps a CPU mm range of
> > anonymous-
> > +or file page-cache pages.
> > +A very simple approach would be to just pin the pages using
> > +pin_user_pages() at bind time and unpin them at unbind time, but
> > this
> > +creates a Denial-Of-Service vector since a single user-space
> > process
> > +would be able to pin down all of system memory, which is not
> > +desirable. (For special use-cases and with proper accounting
> > pinning might
> > +still be a desirable feature, though). What we need to do in the
> > +general case is to obtain a reference to the desired pages, make
> > sure
> > +we are notified using a MMU notifier just before the CPU mm unmaps
> > the
> > +pages, dirty them if they are not mapped read-only to the GPU, and
> > +then drop the reference.
> > +When we are notified by the MMU notifier that CPU mm is about to
> > drop the
> > +pages, we need to stop GPU access to the pages by waiting for VM
> > idle
> > +in the MMU notifier and make sure that before the next time the
> > GPU
> > +tries to access whatever is now present in the CPU mm range, we
> > unmap
> > +the old pages from the GPU page tables and repeat the process of
> > +obtaining new page references. (See the :ref:`notifier example
> > +<Invalidation example>` below). Note that when the core mm decides
> > to
> > +laundry pages, we get such an unmap MMU notification and can mark
> > the
> > +pages dirty again before the next GPU access. We also get similar
> > MMU
> > +notifications for NUMA accounting which the GPU driver doesn't
> > really
> > +need to care about, but so far it has proven difficult to exclude
> > +certain notifications.
> > +
> > +Using a MMU notifier for device DMA (and other methods) is
> > described in
> > +`this document
> > +<
> > https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-not
> > ifier-registration-with-or-without-page-faulting-hardware>`_.
> > +
> > +Now the method of obtaining struct page references using
> > +get_user_pages() unfortunately can't be used under a dma_resv lock
> > +since that would violate the locking order of the dma_resv lock vs
> > the
> > +mmap_lock that is grabbed when resolving a CPU pagefault. This
> > means
> > +the gpu_vm's list of userptr gpu_vmas needs to be protected by an
> > +outer lock.
> > +
> > +The MMU interval seqlock for a userptr gpu_vma is used in the
> > following
> > +way:
> > +
> > +.. code-block:: C
> > +
> > +   // Exclusive locking mode here is strictly needed only if there
> > are
> > +   // invalidated userptr vmas present, to avoid multiple userptr
> > +   // revalidations.
> > +   down_write(&gpu_vm->lock);
> > +   retry:
> > +
> > +   // Note: mmu_interval_read_begin() blocks until there is no
> > +   // invalidation notifier running anymore.
> > +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
> > +   if (seq != gpu_vma->saved_seq) {
> > +           obtain_new_page_pointers(&gpu_vma);
> > +           dma_resv_lock(&gpu_vm->resv);
> > +           add_gpu_vma_top_revalidate_list(&gpu_vma, &gpu_vm);
> > +           dma_resv_unlock(&gpu_vm->resv);
> > +           gpu_vma->saved_seq = seq;
> > +   }
> > +
> > +   // The usual revalidation goes here.
> > +
> > +   // Final userptr sequence validation may not happen before the
> > +   // submission dma_fence is added to the gpu_vm's resv, from the
> > POW
> > +   // of the MMU invalidation notifier. Hence the
> > +   // userptr_notifier_lock that will make them appear atomic.
> > +
> > +   add_dependencies(&gpu_job, &gpu_vm->resv);
> > +   down_read(&gpu_vm->userptr_notifier_lock);
> > +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval,
> > gpu_vma->saved_seq)) {
> > +          up_read(&gpu_vm->userptr_notifier_lock);
> > +          goto retry;
> > +   }
> > +
> > +   job_dma_fence = gpu_submit(&gpu_job));
> > +
> > +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> > +
> > +   for_each_shared_obj(gpu_vm, &obj)
> > +          add_dma_fence(job_dma_fence, &obj->resv);
> > +
> > +   dma_resv_unlock_all_resv_locks();
> > +   up_read(&gpu_vm->userptr_notifier_lock);
> > +   up_write(&gpu_vm->lock);
> > +
> > +The code between ``mmu_interval_read_begin()`` and the
> > +``mmu_interval_read_retry()`` marks the read side critical section
> > of
> > +what we call the ``userptr_seqlock``. In reality the gpu_vm's
> > userptr
> > +gpu_vma list is looped through, and the check is done for *all* of
> > its
> > +userptr gpu_vmas, although we only show a single one here.
> > +
> > +The userptr gpu_vma MMU invalidation notifier might be called from
> > +reclaim context and, again to avoid locking order violations, we
> > can't
> > +take any dma_resv lock nor the gpu_vm->lock from within it.
> > +
> > +.. _Invalidation example:
> > +.. code-block:: C
> > +
> > +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
> > +  {
> > +          // Make sure the exec function either sees the new
> > sequence
> > +          // and backs off or we wait for the dma-fence:
> > +
> > +          down_write(&gpu_vm->userptr_notifier_lock);
> > +          mmu_interval_set_seq(userptr_interval, cur_seq);
> > +          up_write(&gpu_vm->userptr_notifier_lock);
> > +
> > +          // At this point, the exec function can't succeed in
> > +          // submitting a new job, because cur_seq is an invalid
> > +          // sequence number and will always cause a retry. When
> > all
> > +          // invalidation callbacks, the mmu notifier core will
> > flip
> > +          // the sequence number to a valid one. However we need
> > to
> > +          // stop gpu access to the old pages here.
> > +
> > +          dma_resv_wait_timeout(&gpu_vm->resv,
> > DMA_RESV_USAGE_BOOKKEEP,
> > +                                false, MAX_SCHEDULE_TIMEOUT);
> > +          return true;
> > +  }
> > +
> > +When this invalidation notifier returns, the GPU can no longer be
> > +accessing the old pages of the userptr gpu_vma and needs to redo
> > the
> > +page-binding before a new GPU submission can succeed.
> 
> Don't know enough about userptr to properly review this section,
> sorry
> :-/.
> 

NP. I will address the review as stated above including all review
points where I didn't leave a comment.

Thanks,
Thomas
Bagas Sanjaya Nov. 16, 2023, 1:13 p.m. UTC | #7
On Wed, Nov 15, 2023 at 01:49:37PM +0100, Thomas Hellström wrote:
> +TODO: Pointer to the gpuvm code implementation if this iteration and

"... implementation of this iteration ..."

> +Using a MMU notifier for device DMA (and other methods) is described in
> +`this document
> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.

You can use internal linking instead:

---- >8 ----
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index d3c1f6d8c0e0ec..6b5f7e6e7155fb 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -153,6 +153,8 @@ NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
 because DAX pages do not have a separate page cache, and so "pinning" implies
 locking down file system blocks, which is not (yet) supported in that way.
 
+.. _mmu-notifier-registration-case:
+
 CASE 3: MMU notifier registration, with or without page faulting hardware
 -------------------------------------------------------------------------
 Device drivers can pin pages via get_user_pages*(), and register for mmu
diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
index bc701157cb3414..08b6a47a6e592f 100644
--- a/Documentation/gpu/drm-vm-bind-locking.rst
+++ b/Documentation/gpu/drm-vm-bind-locking.rst
@@ -366,8 +366,7 @@ need to care about, but so far it has proven difficult to exclude
 certain notifications.
 
 Using a MMU notifier for device DMA (and other methods) is described in
-`this document
-<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
+:ref:`pin_user_pages() documentation <mmu-notifier-registration-case>`.
 
 Now the method of obtaining struct page references using
 get_user_pages() unfortunately can't be used under a dma_resv lock

Thanks.
Boris Brezillon Nov. 16, 2023, 1:27 p.m. UTC | #8
On Thu, 16 Nov 2023 12:48:45 +0100
Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:

> Hi, Boris,
> 
> Thanks for reviewing. Some comments below:
> 
> On Thu, 2023-11-16 at 10:48 +0100, Boris Brezillon wrote:
> > Hi Thomas,
> > 
> > On Wed, 15 Nov 2023 13:49:37 +0100
> > Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> >   
> > > Add the first version of the VM_BIND locking document which is
> > > intended to be part of the xe driver upstreaming agreement.
> > > 
> > > The document describes and discuss the locking used during exec-
> > > functions, evicton and for userptr gpu-vmas. Intention is to be
> > > using the
> > > same nomenclature as the drm-vm-bind-async.rst.
> > > 
> > > v2:
> > > - s/gvm/gpu_vm/g (Rodrigo Vivi)
> > > - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
> > >   (Rodrigo Vivi)
> > > - Adjust commit message accordingly.
> > > - Add SPDX license header.
> > > 
> > > v3:
> > > - Large update to align with the drm_gpuvm manager locking
> > > - Add "Efficient userptr gpu_vma exec function iteration" section
> > > - Add "Locking at bind- and unbind time" section.
> > > 
> > > v4:
> > > - Fix tabs vs space errors by untabifying (Rodrigo Vivi)
> > > - Minor style fixes and typos (Rodrigo Vivi)
> > > - Clarify situations where stale GPU mappings are occurring and how
> > >   access through these mappings are blocked. (Rodrigo Vivi)
> > > - Insert into the toctree in implementation_guidelines.rst
> > > 
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > >  Documentation/gpu/drm-vm-bind-locking.rst     | 503
> > > ++++++++++++++++++
> > >  .../gpu/implementation_guidelines.rst         |   1 +
> > >  2 files changed, 504 insertions(+)
> > >  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> > > 
> > > diff --git a/Documentation/gpu/drm-vm-bind-locking.rst
> > > b/Documentation/gpu/drm-vm-bind-locking.rst
> > > new file mode 100644
> > > index 000000000000..bc701157cb34
> > > --- /dev/null
> > > +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> > > @@ -0,0 +1,503 @@
> > > +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> > > +
> > > +===============
> > > +VM_BIND locking
> > > +===============
> > > +
> > > +This document attempts to describe what's needed to get VM_BIND
> > > locking right,
> > > +including the userptr mmu_notifier locking and it will also
> > > discuss some
> > > +optimizations to get rid of the looping through of all userptr
> > > mappings and
> > > +external / shared object mappings that is needed in the simplest
> > > +implementation. It will also discuss some implications for
> > > faulting gpu_vms.
> > > +
> > > +Nomenclature
> > > +============
> > > +
> > > +* ``Context``: GPU execution context.
> > > +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> > > +  meta-data. Typically one per client (DRM file-private), or one
> > > per
> > > +  context.  
> > 
> > Should we mention that it's a driver object, likely inheriting from
> > drm_gpuvm?  
> 
> Yes, I can try to be a bit more drm_gpuvm-centric throughout the
> document, although I want to avoid being too specific due to the
> current rapid drm_gpuvm rate of change, which might also affect this
> document. I guess I will have to commit for now to update the document
> on each gpuvm series we land...

Well, I'd suggest that the one doing changes to drm_gpuvm gets to
update this document along the way, or even better, make this
documentation part of the drm_gpuvm doc, so there's no excuse to not
update it when drm_gpuvm is extended.

> >   
> > > +* ``exec function``: An exec function is a function that
> > > revalidates all
> > > +  affected gpu_vmas, submits a GPU command batch and registers the
> > > +  dma_fence representing the GPU command's activity with all
> > > affected
> > > +  dma_resvs. For completeness, although not covered by this
> > > document,
> > > +  it's worth mentioning that an exec function may also be the
> > > +  revalidation worker that is used by some drivers in compute /
> > > +  long-running mode.
> > > +* ``local object``: A GEM object which is local to a gpu_vm.
> > > Shared gem
> > > +  objects also share the gpu_vm's dma_resv.
> > > +* ``shared object``: a.k.a external object: A GEM object which may
> > > be shared
> > > +  by multiple gpu_vms and whose backing storage may be shared with
> > > +  other drivers.  
> > 
> > Should we name that one external object and mentions it's sometimes
> > also called external object. This way it matches the name used in
> > gpuvm
> > implementation (extobj).  
> 
> You mean "... also called shared object"?

Yeah, sorry.

> Shure, can do that.
> >   
> > > +
> > > +
> > > +Locks used and locking orders
> > > +=============================
> > > +
> > > +One of the benefits of VM_BIND is that local GEM objects share the
> > > gpu_vm's
> > > +dma_resv object and hence the dma_resv lock. So even with a huge
> > > +number of local GEM objects, only one lock is needed to make the
> > > exec
> > > +sequence atomic.
> > > +
> > > +The following locks and locking orders are used:
> > > +
> > > +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the
> > > gpu_vm is
> > > +  partitioned into gpu_vmas. It can also protect the gpu_vm's list
> > > of
> > > +  userptr gpu_vmas. With a CPU mm analogy this would correspond to
> > > the
> > > +  mmap_lock.  
> > 
> > I don't see any drm_gpuvm::lock field in Danilo's latest patchset,
> > so,
> > unless I missed one version, and this lock is actually provided by
> > drm_gpuvm, I would mention this is a driver-specific lock. This
> > comment
> > applies to all the locks you describe here actually (mention which
> > ones
> > are provided by drm_gpuvm, and which ones are driver-specific).  
> 
> These will be needed also by gpuvm when implementing userptr vmas, so I
> can mention that drm_gpuvm is currently lacking a userptr
> implementation, so "the locks described below are to be considered
> driver-specific for now"

Sounds good.

> > > +
> > > +The reason for having a separate gpu_vm rebind list is that there
> > > +might be userptr gpu_vmas that are not mapping a buffer object
> > > that
> > > +also need rebinding.
> > > +
> > > +Eviction
> > > +________
> > > +
> > > +Eviction of one of these local objects will then look similar to
> > > the
> > > +following:
> > > +
> > > +.. code-block:: C
> > > +
> > > +   obj = get_object_from_lru();
> > > +
> > > +   dma_resv_lock(obj->resv);
> > > +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo);
> > > +           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm-  
> > > >evict_list);  
> > > +
> > > +   add_dependencies(&eviction_job, &obj->resv);
> > > +   job_dma_fence = gpu_submit(&eviction_job);  
> > 
> > Could you expend a bit on what the eviction job does? On embedded
> > GPUs,
> > where there's no VRAM, we typically don't have a GPU job to teardown
> > GPU mappings. We do have an asynchronous VM_BIND queue though, so I
> > suspect the job here is an async VM_BIND job.
> > 
> > Not asking you to add that to the doc, I'm just trying to figure out
> > how
> > this would map to the mem-reclaim logic in panthor...  
> 
> This would typically be the blit of data from VRAM to system,

Okay, not needed in my case then.

> but async
> jobs here may also include zapping page-tables and TLB flush in case of
> recoverable page-faults.

And that part is synchronous in panthor. Maybe the locking will force
me to make it an asynchronous VM_BIND job. I guess I'll only know when
I get to hook up the shrinker...

> 
> >   
> > > +   add_dma_fence(&obj->resv, job_dma_fence);
> > > +
> > > +   dma_resv_unlock(&obj->resv);
> > > +   put_object(obj);
> > > +
> > > +Note that since the object is local to the gpu_vm, it will share
> > > the gpu_vm's
> > > +dma_resv lock so that ``obj->resv == gpu_vm->resv``.
> > > +The gpu_vm_bos marked for eviction are put on the gpu_vm's evict
> > > list,
> > > +which is protected by ``gpu_vm->resv``, that is always locked
> > > while
> > > +evicting, due to the above equality.
> > > +
> > > +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before
> > > eviction,
> > > +Since the driver must ensure that the eviction blit or copy will
> > > wait
> > > +for GPU idle or depend on all previous GPU activity. Furthermore,
> > > any
> > > +subsequent attempt by the GPU to access freed memory through the
> > > +gpu_vma will be preceded by a new exec function, with a
> > > revalidation
> > > +section which will make sure all gpu_vmas are rebound. The
> > > eviction
> > > +code holding the object's dma_resv while revalidating will ensure
> > > a
> > > +new exec function may not race with the eviction. Note that this
> > > will
> > > +not hold true, however, if only a subsets of vmas are, due to the
> > > +driver implementation, selected for rebinding the next exec
> > > +function.  
> > 
> > This last sentence is hard to follow.
> > 
> > "
> > Note that this will not hold true if only a subset of vmas
> > are selected for rebinding during the next exec call (for instance,
> > due
> > to some driver decision to only partially restore VMAs).
> > "
> >   
> > > Then all vmas *not* selected for rebinding needs to be
> > > +properly unbound before re-enabling GPU access to the VM.  
> > 
> > I think I get the problem, but can we have a use case where partial
> > VMA restoration is useful? I mean, if some VMAs are not needed, we
> > might be able to save MMU page table allocation/setup-time, but given
> > the mess it then is to track those non-live VMAs, I'm wondering if
> > it's
> > leaving the door open for that, unless there's a good reason to do
> > it.  
> 
> This is the use-case Christian has been flagging for for OpenGL and
> Media where he warns that the single-vm-memory-overcommit case would
> otherwise make the app crawl.

IIUC, the partial VMA restore is about not restoring all VMAs attached
to a vm_bo, but as soon as you restore one, this makes the BO resident,
so all you're saving here is the extra page table for non-restored VMAs.
I don't think that would significantly help the overcommit use case,
unless you have so many VMAs attached to a single vm_bo that the
amount of extra page tables becomes non-negligible compared to the BO
size itself.

What would really help the overcommit use case is not restoring all
evicted BOs, if some of them are known to be in a range that's not
accessed by a GPU job. In that situation, you can decide to leave
vm_bos in the evicted list if none of their VMAs overlap any of the VM
ranges used by a job.

> 
> Generalized one might want to think of these as groups of (or perhaps
> virtual ranges of) vmas that need to be resident for a single job
> submission. Currently we assume the residency-group <-> vm mapping is
> 1:1, allowing for the unbind-before-eviction to be ignored, but I
> figure moving forward and addressing performance problems of real-world
> applications that may not always be true.

Maybe I don't understand what unbind-before-eviction means. To me it's
the operation of tearing down all VM mappings (unbind) before returning
BO pages to the system (evict). I don't see a situation where the
eviction of a BO doesn't imply killing all VM mapping pointing to this
BO.

> 
> >   
> > > +
> > > +
> > > +Locking with external (or shared) buffer objects
> > > +================================================
> > > +
> > > +Since shared buffer objects may be shared by multiple gpu_vm's
> > > they
> > > +can't share their reservation object with a single gpu_vm, but
> > > will rather
> > > +have a reservation object of their own. The shared objects bound
> > > to a
> > > +gpu_vm using one or many gpu_vmas are therefore typically put on a
> > > +per-gpu_vm list which is protected by the gpu_vm's dma_resv lock.
> > > Once
> > > +the gpu_vm's reservation object  is locked, it is safe to traverse
> > > the  
> > 
> >                                   ^ extra space
> >   
> > > +external object list and lock the dma_resvs of all external
> > > objects.
> > > +
> > > +At eviction time we now need to put the gpu_vm_bos of *all*
> > > gpu_vms a
> > > +shared object is bound to on the gpu_vm's evict list, but we can
> > > no longer
> > > +be certain that we hold the gpu_vm's dma_resv of all the gpu_vms
> > > the
> > > +object is bound to, since at eviction time we only hold the
> > > object's
> > > +private dma_resv. If we have a ww_acquire context at hand at
> > > eviction
> > > +time we could grab the those dma_resvs but that could cause
> > > +expensive ww_mutex rollbacks. A simple option is to just mark the
> > > +gpu_vm_bos of the evicted gem object with an ``evicted`` bool that
> > > +is inspected the next time the corresponding gpu_vm evicted list
> > > needs
> > > +to be traversed.  
> > 
> > IIUC, the vm_bo is not added to the evicted list immediately in that
> > case, so I suspect you meant the next time the corresponding gpu_vm
> > *extobj* list needs to be traversed.  
> 
> I like to think of the concept as "lock and validate the list before
> traversing it" and you're right in this case, the validation occurs
> slightly before we actually start looking at the evicted list. But I
> could perhaps rephrase to "...bool that is inspected *before* the next
> time..."

My point was, you detect such evicted vm_bos when iterating the
extobj list, not the evicted list, because the vm_bo won't be in the
evicted list until you've walked extobj and inserted evicted vm_bos the
the evicted list. Even with your rephrasing, it sounds like those are
detected by iterating the evicted list. How about

"
A simple option is to just mark the gpu_vm_bos of the evicted gem
object with an ``evicted`` bool. Those gpu_vm_bos will be moved to
the evicted list next time the extobj list is inspected for vm_bo
revalidation.
"
Thomas Hellström Nov. 16, 2023, 1:53 p.m. UTC | #9
On Thu, 2023-11-16 at 14:27 +0100, Boris Brezillon wrote:
> On Thu, 16 Nov 2023 12:48:45 +0100
> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> 
> > Hi, Boris,
> > 
> > Thanks for reviewing. Some comments below:
> > 
> > On Thu, 2023-11-16 at 10:48 +0100, Boris Brezillon wrote:
> > > Hi Thomas,
> > > 
> > > On Wed, 15 Nov 2023 13:49:37 +0100
> > > Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> > >   
> > > > Add the first version of the VM_BIND locking document which is
> > > > intended to be part of the xe driver upstreaming agreement.
> > > > 
> > > > The document describes and discuss the locking used during
> > > > exec-
> > > > functions, evicton and for userptr gpu-vmas. Intention is to be
> > > > using the
> > > > same nomenclature as the drm-vm-bind-async.rst.
> > > > 
> > > > v2:
> > > > - s/gvm/gpu_vm/g (Rodrigo Vivi)
> > > > - Clarify the userptr seqlock with a pointer to
> > > > mm/mmu_notifier.c
> > > >   (Rodrigo Vivi)
> > > > - Adjust commit message accordingly.
> > > > - Add SPDX license header.
> > > > 
> > > > v3:
> > > > - Large update to align with the drm_gpuvm manager locking
> > > > - Add "Efficient userptr gpu_vma exec function iteration"
> > > > section
> > > > - Add "Locking at bind- and unbind time" section.
> > > > 
> > > > v4:
> > > > - Fix tabs vs space errors by untabifying (Rodrigo Vivi)
> > > > - Minor style fixes and typos (Rodrigo Vivi)
> > > > - Clarify situations where stale GPU mappings are occurring and
> > > > how
> > > >   access through these mappings are blocked. (Rodrigo Vivi)
> > > > - Insert into the toctree in implementation_guidelines.rst
> > > > 
> > > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > > Signed-off-by: Thomas Hellström
> > > > <thomas.hellstrom@linux.intel.com>
> > > > ---
> > > >  Documentation/gpu/drm-vm-bind-locking.rst     | 503
> > > > ++++++++++++++++++
> > > >  .../gpu/implementation_guidelines.rst         |   1 +
> > > >  2 files changed, 504 insertions(+)
> > > >  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> > > > 
> > > > diff --git a/Documentation/gpu/drm-vm-bind-locking.rst
> > > > b/Documentation/gpu/drm-vm-bind-locking.rst
> > > > new file mode 100644
> > > > index 000000000000..bc701157cb34
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> > > > @@ -0,0 +1,503 @@
> > > > +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> > > > +
> > > > +===============
> > > > +VM_BIND locking
> > > > +===============
> > > > +
> > > > +This document attempts to describe what's needed to get
> > > > VM_BIND
> > > > locking right,
> > > > +including the userptr mmu_notifier locking and it will also
> > > > discuss some
> > > > +optimizations to get rid of the looping through of all userptr
> > > > mappings and
> > > > +external / shared object mappings that is needed in the
> > > > simplest
> > > > +implementation. It will also discuss some implications for
> > > > faulting gpu_vms.
> > > > +
> > > > +Nomenclature
> > > > +============
> > > > +
> > > > +* ``Context``: GPU execution context.
> > > > +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> > > > +  meta-data. Typically one per client (DRM file-private), or
> > > > one
> > > > per
> > > > +  context.  
> > > 
> > > Should we mention that it's a driver object, likely inheriting
> > > from
> > > drm_gpuvm?  
> > 
> > Yes, I can try to be a bit more drm_gpuvm-centric throughout the
> > document, although I want to avoid being too specific due to the
> > current rapid drm_gpuvm rate of change, which might also affect
> > this
> > document. I guess I will have to commit for now to update the
> > document
> > on each gpuvm series we land...
> 
> Well, I'd suggest that the one doing changes to drm_gpuvm gets to
> update this document along the way, or even better, make this
> documentation part of the drm_gpuvm doc, so there's no excuse to not
> update it when drm_gpuvm is extended.

Sure, Although I think initial merge will be as is, and then merging
with drm_gpuvm will come at a later point.
> 
> > >   
> > > > +* ``exec function``: An exec function is a function that
> > > > revalidates all
> > > > +  affected gpu_vmas, submits a GPU command batch and registers
> > > > the
> > > > +  dma_fence representing the GPU command's activity with all
> > > > affected
> > > > +  dma_resvs. For completeness, although not covered by this
> > > > document,
> > > > +  it's worth mentioning that an exec function may also be the
> > > > +  revalidation worker that is used by some drivers in compute
> > > > /
> > > > +  long-running mode.
> > > > +* ``local object``: A GEM object which is local to a gpu_vm.
> > > > Shared gem
> > > > +  objects also share the gpu_vm's dma_resv.
> > > > +* ``shared object``: a.k.a external object: A GEM object which
> > > > may
> > > > be shared
> > > > +  by multiple gpu_vms and whose backing storage may be shared
> > > > with
> > > > +  other drivers.  
> > > 
> > > Should we name that one external object and mentions it's
> > > sometimes
> > > also called external object. This way it matches the name used in
> > > gpuvm
> > > implementation (extobj).  
> > 
> > You mean "... also called shared object"?
> 
> Yeah, sorry.
> 
> > Shure, can do that.
> > >   
> > > > +
> > > > +
> > > > +Locks used and locking orders
> > > > +=============================
> > > > +
> > > > +One of the benefits of VM_BIND is that local GEM objects share
> > > > the
> > > > gpu_vm's
> > > > +dma_resv object and hence the dma_resv lock. So even with a
> > > > huge
> > > > +number of local GEM objects, only one lock is needed to make
> > > > the
> > > > exec
> > > > +sequence atomic.
> > > > +
> > > > +The following locks and locking orders are used:
> > > > +
> > > > +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the
> > > > gpu_vm is
> > > > +  partitioned into gpu_vmas. It can also protect the gpu_vm's
> > > > list
> > > > of
> > > > +  userptr gpu_vmas. With a CPU mm analogy this would
> > > > correspond to
> > > > the
> > > > +  mmap_lock.  
> > > 
> > > I don't see any drm_gpuvm::lock field in Danilo's latest
> > > patchset,
> > > so,
> > > unless I missed one version, and this lock is actually provided
> > > by
> > > drm_gpuvm, I would mention this is a driver-specific lock. This
> > > comment
> > > applies to all the locks you describe here actually (mention
> > > which
> > > ones
> > > are provided by drm_gpuvm, and which ones are driver-specific).  
> > 
> > These will be needed also by gpuvm when implementing userptr vmas,
> > so I
> > can mention that drm_gpuvm is currently lacking a userptr
> > implementation, so "the locks described below are to be considered
> > driver-specific for now"
> 
> Sounds good.
> 
> > > > +
> > > > +The reason for having a separate gpu_vm rebind list is that
> > > > there
> > > > +might be userptr gpu_vmas that are not mapping a buffer object
> > > > that
> > > > +also need rebinding.
> > > > +
> > > > +Eviction
> > > > +________
> > > > +
> > > > +Eviction of one of these local objects will then look similar
> > > > to
> > > > the
> > > > +following:
> > > > +
> > > > +.. code-block:: C
> > > > +
> > > > +   obj = get_object_from_lru();
> > > > +
> > > > +   dma_resv_lock(obj->resv);
> > > > +   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo);
> > > > +           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm-  
> > > > > evict_list);  
> > > > +
> > > > +   add_dependencies(&eviction_job, &obj->resv);
> > > > +   job_dma_fence = gpu_submit(&eviction_job);  
> > > 
> > > Could you expend a bit on what the eviction job does? On embedded
> > > GPUs,
> > > where there's no VRAM, we typically don't have a GPU job to
> > > teardown
> > > GPU mappings. We do have an asynchronous VM_BIND queue though, so
> > > I
> > > suspect the job here is an async VM_BIND job.
> > > 
> > > Not asking you to add that to the doc, I'm just trying to figure
> > > out
> > > how
> > > this would map to the mem-reclaim logic in panthor...  
> > 
> > This would typically be the blit of data from VRAM to system,
> 
> Okay, not needed in my case then.
> 
> > but async
> > jobs here may also include zapping page-tables and TLB flush in
> > case of
> > recoverable page-faults.
> 
> And that part is synchronous in panthor. Maybe the locking will force
> me to make it an asynchronous VM_BIND job. I guess I'll only know
> when
> I get to hook up the shrinker...
> 
> > 
> > >   
> > > > +   add_dma_fence(&obj->resv, job_dma_fence);
> > > > +
> > > > +   dma_resv_unlock(&obj->resv);
> > > > +   put_object(obj);
> > > > +
> > > > +Note that since the object is local to the gpu_vm, it will
> > > > share
> > > > the gpu_vm's
> > > > +dma_resv lock so that ``obj->resv == gpu_vm->resv``.
> > > > +The gpu_vm_bos marked for eviction are put on the gpu_vm's
> > > > evict
> > > > list,
> > > > +which is protected by ``gpu_vm->resv``, that is always locked
> > > > while
> > > > +evicting, due to the above equality.
> > > > +
> > > > +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before
> > > > eviction,
> > > > +Since the driver must ensure that the eviction blit or copy
> > > > will
> > > > wait
> > > > +for GPU idle or depend on all previous GPU activity.
> > > > Furthermore,
> > > > any
> > > > +subsequent attempt by the GPU to access freed memory through
> > > > the
> > > > +gpu_vma will be preceded by a new exec function, with a
> > > > revalidation
> > > > +section which will make sure all gpu_vmas are rebound. The
> > > > eviction
> > > > +code holding the object's dma_resv while revalidating will
> > > > ensure
> > > > a
> > > > +new exec function may not race with the eviction. Note that
> > > > this
> > > > will
> > > > +not hold true, however, if only a subsets of vmas are, due to
> > > > the
> > > > +driver implementation, selected for rebinding the next exec
> > > > +function.  
> > > 
> > > This last sentence is hard to follow.
> > > 
> > > "
> > > Note that this will not hold true if only a subset of vmas
> > > are selected for rebinding during the next exec call (for
> > > instance,
> > > due
> > > to some driver decision to only partially restore VMAs).
> > > "
> > >   
> > > > Then all vmas *not* selected for rebinding needs to be
> > > > +properly unbound before re-enabling GPU access to the VM.  
> > > 
> > > I think I get the problem, but can we have a use case where
> > > partial
> > > VMA restoration is useful? I mean, if some VMAs are not needed,
> > > we
> > > might be able to save MMU page table allocation/setup-time, but
> > > given
> > > the mess it then is to track those non-live VMAs, I'm wondering
> > > if
> > > it's
> > > leaving the door open for that, unless there's a good reason to
> > > do
> > > it.  
> > 
> > This is the use-case Christian has been flagging for for OpenGL and
> > Media where he warns that the single-vm-memory-overcommit case
> > would
> > otherwise make the app crawl.
> 
> IIUC, the partial VMA restore is about not restoring all VMAs
> attached
> to a vm_bo, but as soon as you restore one, this makes the BO
> resident,
> so all you're saving here is the extra page table for non-restored
> VMAs.
> I don't think that would significantly help the overcommit use case,
> unless you have so many VMAs attached to a single vm_bo that the
> amount of extra page tables becomes non-negligible compared to the BO
> size itself.
> 
> What would really help the overcommit use case is not restoring all
> evicted BOs, if some of them are known to be in a range that's not
> accessed by a GPU job. In that situation, you can decide to leave
> vm_bos in the evicted list if none of their VMAs overlap any of the
> VM
> ranges used by a job.

Yes this section here is the key: The vmas belonging to evicted bos not
restored would be the ones not "selected for rebind".

> 
> > 
> > Generalized one might want to think of these as groups of (or
> > perhaps
> > virtual ranges of) vmas that need to be resident for a single job
> > submission. Currently we assume the residency-group <-> vm mapping
> > is
> > 1:1, allowing for the unbind-before-eviction to be ignored, but I
> > figure moving forward and addressing performance problems of real-
> > world
> > applications that may not always be true.
> 
> Maybe I don't understand what unbind-before-eviction means. To me
> it's
> the operation of tearing down all VM mappings (unbind) before
> returning
> BO pages to the system (evict). I don't see a situation where the
> eviction of a BO doesn't imply killing all VM mapping pointing to
> this
> BO.

It's the point of teardown that matters here. You can return the pages
to system without tearing down the GPU mappings, as long as you tear
them down before there are any GPU activity on that vm again. In xe we
tear down the old mappings as part of the rebind process after we've
validated the evicted bo again, (but before submitting the GPU job). 
The point is the stale mappings can be left as long as there is no gpu
job accessing them.

> 
> > 
> > >   
> > > > +
> > > > +
> > > > +Locking with external (or shared) buffer objects
> > > > +================================================
> > > > +
> > > > +Since shared buffer objects may be shared by multiple gpu_vm's
> > > > they
> > > > +can't share their reservation object with a single gpu_vm, but
> > > > will rather
> > > > +have a reservation object of their own. The shared objects
> > > > bound
> > > > to a
> > > > +gpu_vm using one or many gpu_vmas are therefore typically put
> > > > on a
> > > > +per-gpu_vm list which is protected by the gpu_vm's dma_resv
> > > > lock.
> > > > Once
> > > > +the gpu_vm's reservation object  is locked, it is safe to
> > > > traverse
> > > > the  
> > > 
> > >                                   ^ extra space
> > >   
> > > > +external object list and lock the dma_resvs of all external
> > > > objects.
> > > > +
> > > > +At eviction time we now need to put the gpu_vm_bos of *all*
> > > > gpu_vms a
> > > > +shared object is bound to on the gpu_vm's evict list, but we
> > > > can
> > > > no longer
> > > > +be certain that we hold the gpu_vm's dma_resv of all the
> > > > gpu_vms
> > > > the
> > > > +object is bound to, since at eviction time we only hold the
> > > > object's
> > > > +private dma_resv. If we have a ww_acquire context at hand at
> > > > eviction
> > > > +time we could grab the those dma_resvs but that could cause
> > > > +expensive ww_mutex rollbacks. A simple option is to just mark
> > > > the
> > > > +gpu_vm_bos of the evicted gem object with an ``evicted`` bool
> > > > that
> > > > +is inspected the next time the corresponding gpu_vm evicted
> > > > list
> > > > needs
> > > > +to be traversed.  
> > > 
> > > IIUC, the vm_bo is not added to the evicted list immediately in
> > > that
> > > case, so I suspect you meant the next time the corresponding
> > > gpu_vm
> > > *extobj* list needs to be traversed.  
> > 
> > I like to think of the concept as "lock and validate the list
> > before
> > traversing it" and you're right in this case, the validation occurs
> > slightly before we actually start looking at the evicted list. But
> > I
> > could perhaps rephrase to "...bool that is inspected *before* the
> > next
> > time..."
> 
> My point was, you detect such evicted vm_bos when iterating the
> extobj list, not the evicted list, because the vm_bo won't be in the
> evicted list until you've walked extobj and inserted evicted vm_bos
> the
> the evicted list. Even with your rephrasing, it sounds like those are
> detected by iterating the evicted list. How about
> 
> "
> A simple option is to just mark the gpu_vm_bos of the evicted gem
> object with an ``evicted`` bool. Those gpu_vm_bos will be moved to
> the evicted list next time the extobj list is inspected for vm_bo
> revalidation.
> "

Sure. I'll rephrase it like that.
/Thomas
Danilo Krummrich Nov. 16, 2023, 2:02 p.m. UTC | #10
On 11/16/23 12:48, Thomas Hellström wrote:

<snip>

>>> +Locks used and locking orders
>>> +=============================
>>> +
>>> +One of the benefits of VM_BIND is that local GEM objects share the
>>> gpu_vm's
>>> +dma_resv object and hence the dma_resv lock. So even with a huge
>>> +number of local GEM objects, only one lock is needed to make the
>>> exec
>>> +sequence atomic.
>>> +
>>> +The following locks and locking orders are used:
>>> +
>>> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the
>>> gpu_vm is
>>> +  partitioned into gpu_vmas. It can also protect the gpu_vm's list
>>> of
>>> +  userptr gpu_vmas. With a CPU mm analogy this would correspond to
>>> the
>>> +  mmap_lock.
>>
>> I don't see any drm_gpuvm::lock field in Danilo's latest patchset,
>> so,
>> unless I missed one version, and this lock is actually provided by
>> drm_gpuvm, I would mention this is a driver-specific lock. This
>> comment
>> applies to all the locks you describe here actually (mention which
>> ones
>> are provided by drm_gpuvm, and which ones are driver-specific).
> 
> These will be needed also by gpuvm when implementing userptr vmas, so I
> can mention that drm_gpuvm is currently lacking a userptr
> implementation, so "the locks described below are to be considered
> driver-specific for now"

Since Xe already implements userptr support, are you guys maybe interested
in extending drm_gpuvm accordingly? :-)
Thomas Hellström Nov. 16, 2023, 2:09 p.m. UTC | #11
On Thu, 2023-11-16 at 15:02 +0100, Danilo Krummrich wrote:
> On 11/16/23 12:48, Thomas Hellström wrote:
> 
> <snip>
> 
> > > > +Locks used and locking orders
> > > > +=============================
> > > > +
> > > > +One of the benefits of VM_BIND is that local GEM objects share
> > > > the
> > > > gpu_vm's
> > > > +dma_resv object and hence the dma_resv lock. So even with a
> > > > huge
> > > > +number of local GEM objects, only one lock is needed to make
> > > > the
> > > > exec
> > > > +sequence atomic.
> > > > +
> > > > +The following locks and locking orders are used:
> > > > +
> > > > +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the
> > > > gpu_vm is
> > > > +  partitioned into gpu_vmas. It can also protect the gpu_vm's
> > > > list
> > > > of
> > > > +  userptr gpu_vmas. With a CPU mm analogy this would
> > > > correspond to
> > > > the
> > > > +  mmap_lock.
> > > 
> > > I don't see any drm_gpuvm::lock field in Danilo's latest
> > > patchset,
> > > so,
> > > unless I missed one version, and this lock is actually provided
> > > by
> > > drm_gpuvm, I would mention this is a driver-specific lock. This
> > > comment
> > > applies to all the locks you describe here actually (mention
> > > which
> > > ones
> > > are provided by drm_gpuvm, and which ones are driver-specific).
> > 
> > These will be needed also by gpuvm when implementing userptr vmas,
> > so I
> > can mention that drm_gpuvm is currently lacking a userptr
> > implementation, so "the locks described below are to be considered
> > driver-specific for now"
> 
> Since Xe already implements userptr support, are you guys maybe
> interested
> in extending drm_gpuvm accordingly? :-)
> 

I've been thinking of that but in that case that needs to happen after
the xe merge. Also we ofc need to clear it with the people who do
resource allocation on our side :)

Thanks,
Thomas
Boris Brezillon Nov. 16, 2023, 2:47 p.m. UTC | #12
On Thu, 16 Nov 2023 14:53:50 +0100
Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:

> On Thu, 2023-11-16 at 14:27 +0100, Boris Brezillon wrote:
> > On Thu, 16 Nov 2023 12:48:45 +0100
> > Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> >   
> > > Hi, Boris,
> > > 
> > > Thanks for reviewing. Some comments below:
> > > 
> > > On Thu, 2023-11-16 at 10:48 +0100, Boris Brezillon wrote:  
> > > > Hi Thomas,
> > > > 
> > > > On Wed, 15 Nov 2023 13:49:37 +0100
> > > > Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> > > >     
> > > > > Add the first version of the VM_BIND locking document which is
> > > > > intended to be part of the xe driver upstreaming agreement.
> > > > > 
> > > > > The document describes and discuss the locking used during
> > > > > exec-
> > > > > functions, evicton and for userptr gpu-vmas. Intention is to be
> > > > > using the
> > > > > same nomenclature as the drm-vm-bind-async.rst.
> > > > > 
> > > > > v2:
> > > > > - s/gvm/gpu_vm/g (Rodrigo Vivi)
> > > > > - Clarify the userptr seqlock with a pointer to
> > > > > mm/mmu_notifier.c
> > > > >   (Rodrigo Vivi)
> > > > > - Adjust commit message accordingly.
> > > > > - Add SPDX license header.
> > > > > 
> > > > > v3:
> > > > > - Large update to align with the drm_gpuvm manager locking
> > > > > - Add "Efficient userptr gpu_vma exec function iteration"
> > > > > section
> > > > > - Add "Locking at bind- and unbind time" section.
> > > > > 
> > > > > v4:
> > > > > - Fix tabs vs space errors by untabifying (Rodrigo Vivi)
> > > > > - Minor style fixes and typos (Rodrigo Vivi)
> > > > > - Clarify situations where stale GPU mappings are occurring and
> > > > > how
> > > > >   access through these mappings are blocked. (Rodrigo Vivi)
> > > > > - Insert into the toctree in implementation_guidelines.rst
> > > > > 
> > > > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > > > Signed-off-by: Thomas Hellström
> > > > > <thomas.hellstrom@linux.intel.com>
> > > > > ---
> > > > >  Documentation/gpu/drm-vm-bind-locking.rst     | 503
> > > > > ++++++++++++++++++
> > > > >  .../gpu/implementation_guidelines.rst         |   1 +
> > > > >  2 files changed, 504 insertions(+)
> > > > >  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> > > > > 
> > > > > diff --git a/Documentation/gpu/drm-vm-bind-locking.rst
> > > > > b/Documentation/gpu/drm-vm-bind-locking.rst
> > > > > new file mode 100644
> > > > > index 000000000000..bc701157cb34
> > > > > --- /dev/null
> > > > > +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> > > > > @@ -0,0 +1,503 @@
> > > > > +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> > > > > +
> > > > > +===============
> > > > > +VM_BIND locking
> > > > > +===============
> > > > > +
> > > > > +This document attempts to describe what's needed to get
> > > > > VM_BIND
> > > > > locking right,
> > > > > +including the userptr mmu_notifier locking and it will also
> > > > > discuss some
> > > > > +optimizations to get rid of the looping through of all userptr
> > > > > mappings and
> > > > > +external / shared object mappings that is needed in the
> > > > > simplest
> > > > > +implementation. It will also discuss some implications for
> > > > > faulting gpu_vms.
> > > > > +
> > > > > +Nomenclature
> > > > > +============
> > > > > +
> > > > > +* ``Context``: GPU execution context.
> > > > > +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> > > > > +  meta-data. Typically one per client (DRM file-private), or
> > > > > one
> > > > > per
> > > > > +  context.    
> > > > 
> > > > Should we mention that it's a driver object, likely inheriting
> > > > from
> > > > drm_gpuvm?    
> > > 
> > > Yes, I can try to be a bit more drm_gpuvm-centric throughout the
> > > document, although I want to avoid being too specific due to the
> > > current rapid drm_gpuvm rate of change, which might also affect
> > > this
> > > document. I guess I will have to commit for now to update the
> > > document
> > > on each gpuvm series we land...  
> > 
> > Well, I'd suggest that the one doing changes to drm_gpuvm gets to
> > update this document along the way, or even better, make this
> > documentation part of the drm_gpuvm doc, so there's no excuse to not
> > update it when drm_gpuvm is extended.  
> 
> Sure, Although I think initial merge will be as is, and then merging
> with drm_gpuvm will come at a later point.

Sure, I have no problem with that.

> > >   
> > > >     
> > > > > +   add_dma_fence(&obj->resv, job_dma_fence);
> > > > > +
> > > > > +   dma_resv_unlock(&obj->resv);
> > > > > +   put_object(obj);
> > > > > +
> > > > > +Note that since the object is local to the gpu_vm, it will
> > > > > share
> > > > > the gpu_vm's
> > > > > +dma_resv lock so that ``obj->resv == gpu_vm->resv``.
> > > > > +The gpu_vm_bos marked for eviction are put on the gpu_vm's
> > > > > evict
> > > > > list,
> > > > > +which is protected by ``gpu_vm->resv``, that is always locked
> > > > > while
> > > > > +evicting, due to the above equality.
> > > > > +
> > > > > +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before
> > > > > eviction,
> > > > > +Since the driver must ensure that the eviction blit or copy
> > > > > will
> > > > > wait
> > > > > +for GPU idle or depend on all previous GPU activity.
> > > > > Furthermore,
> > > > > any
> > > > > +subsequent attempt by the GPU to access freed memory through
> > > > > the
> > > > > +gpu_vma will be preceded by a new exec function, with a
> > > > > revalidation
> > > > > +section which will make sure all gpu_vmas are rebound. The
> > > > > eviction
> > > > > +code holding the object's dma_resv while revalidating will
> > > > > ensure
> > > > > a
> > > > > +new exec function may not race with the eviction. Note that
> > > > > this
> > > > > will
> > > > > +not hold true, however, if only a subsets of vmas are, due to
> > > > > the
> > > > > +driver implementation, selected for rebinding the next exec
> > > > > +function.    
> > > > 
> > > > This last sentence is hard to follow.
> > > > 
> > > > "
> > > > Note that this will not hold true if only a subset of vmas
> > > > are selected for rebinding during the next exec call (for
> > > > instance,
> > > > due
> > > > to some driver decision to only partially restore VMAs).
> > > > "
> > > >     
> > > > > Then all vmas *not* selected for rebinding needs to be
> > > > > +properly unbound before re-enabling GPU access to the VM.    
> > > > 
> > > > I think I get the problem, but can we have a use case where
> > > > partial
> > > > VMA restoration is useful? I mean, if some VMAs are not needed,
> > > > we
> > > > might be able to save MMU page table allocation/setup-time, but
> > > > given
> > > > the mess it then is to track those non-live VMAs, I'm wondering
> > > > if
> > > > it's
> > > > leaving the door open for that, unless there's a good reason to
> > > > do
> > > > it.    
> > > 
> > > This is the use-case Christian has been flagging for for OpenGL and
> > > Media where he warns that the single-vm-memory-overcommit case
> > > would
> > > otherwise make the app crawl.  
> > 
> > IIUC, the partial VMA restore is about not restoring all VMAs
> > attached
> > to a vm_bo, but as soon as you restore one, this makes the BO
> > resident,
> > so all you're saving here is the extra page table for non-restored
> > VMAs.
> > I don't think that would significantly help the overcommit use case,
> > unless you have so many VMAs attached to a single vm_bo that the
> > amount of extra page tables becomes non-negligible compared to the BO
> > size itself.
> > 
> > What would really help the overcommit use case is not restoring all
> > evicted BOs, if some of them are known to be in a range that's not
> > accessed by a GPU job. In that situation, you can decide to leave
> > vm_bos in the evicted list if none of their VMAs overlap any of the
> > VM
> > ranges used by a job.  
> 
> Yes this section here is the key: The vmas belonging to evicted bos not
> restored would be the ones not "selected for rebind".

Okay, but then I don't see the problem if you leave such vm_bos in the
evicted list. Those will still be considered for 'rebind' next time the
evicted list is walked (basically on the next exec), right?

> 
> >   
> > > 
> > > Generalized one might want to think of these as groups of (or
> > > perhaps
> > > virtual ranges of) vmas that need to be resident for a single job
> > > submission. Currently we assume the residency-group <-> vm mapping
> > > is
> > > 1:1, allowing for the unbind-before-eviction to be ignored, but I
> > > figure moving forward and addressing performance problems of real-
> > > world
> > > applications that may not always be true.  
> > 
> > Maybe I don't understand what unbind-before-eviction means. To me
> > it's
> > the operation of tearing down all VM mappings (unbind) before
> > returning
> > BO pages to the system (evict). I don't see a situation where the
> > eviction of a BO doesn't imply killing all VM mapping pointing to
> > this
> > BO.  
> 
> It's the point of teardown that matters here. You can return the pages
> to system without tearing down the GPU mappings, as long as you tear
> them down before there are any GPU activity on that vm again. In xe we
> tear down the old mappings as part of the rebind process after we've
> validated the evicted bo again, (but before submitting the GPU job). 
> The point is the stale mappings can be left as long as there is no gpu
> job accessing them.

I see. As long as you're sure the VM is completely inactive, and that
such mappings are destroyed before the VM is used again, that's fine I
guess. Might be worth a detailed explanation about the different
scenarios though.
Thomas Hellström Nov. 16, 2023, 3:01 p.m. UTC | #13
On 11/16/23 15:47, Boris Brezillon wrote:
> On Thu, 16 Nov 2023 14:53:50 +0100
> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>
>> On Thu, 2023-11-16 at 14:27 +0100, Boris Brezillon wrote:
>>> On Thu, 16 Nov 2023 12:48:45 +0100
>>> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>>>    
>>>> Hi, Boris,
>>>>
>>>> Thanks for reviewing. Some comments below:
>>>>
>>>> On Thu, 2023-11-16 at 10:48 +0100, Boris Brezillon wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> On Wed, 15 Nov 2023 13:49:37 +0100
>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>>>>>      
>>>>>> Add the first version of the VM_BIND locking document which is
>>>>>> intended to be part of the xe driver upstreaming agreement.
>>>>>>
>>>>>> The document describes and discuss the locking used during
>>>>>> exec-
>>>>>> functions, evicton and for userptr gpu-vmas. Intention is to be
>>>>>> using the
>>>>>> same nomenclature as the drm-vm-bind-async.rst.
>>>>>>
>>>>>> v2:
>>>>>> - s/gvm/gpu_vm/g (Rodrigo Vivi)
>>>>>> - Clarify the userptr seqlock with a pointer to
>>>>>> mm/mmu_notifier.c
>>>>>>    (Rodrigo Vivi)
>>>>>> - Adjust commit message accordingly.
>>>>>> - Add SPDX license header.
>>>>>>
>>>>>> v3:
>>>>>> - Large update to align with the drm_gpuvm manager locking
>>>>>> - Add "Efficient userptr gpu_vma exec function iteration"
>>>>>> section
>>>>>> - Add "Locking at bind- and unbind time" section.
>>>>>>
>>>>>> v4:
>>>>>> - Fix tabs vs space errors by untabifying (Rodrigo Vivi)
>>>>>> - Minor style fixes and typos (Rodrigo Vivi)
>>>>>> - Clarify situations where stale GPU mappings are occurring and
>>>>>> how
>>>>>>    access through these mappings are blocked. (Rodrigo Vivi)
>>>>>> - Insert into the toctree in implementation_guidelines.rst
>>>>>>
>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>>>>> Signed-off-by: Thomas Hellström
>>>>>> <thomas.hellstrom@linux.intel.com>
>>>>>> ---
>>>>>>   Documentation/gpu/drm-vm-bind-locking.rst     | 503
>>>>>> ++++++++++++++++++
>>>>>>   .../gpu/implementation_guidelines.rst         |   1 +
>>>>>>   2 files changed, 504 insertions(+)
>>>>>>   create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
>>>>>>
>>>>>> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst
>>>>>> b/Documentation/gpu/drm-vm-bind-locking.rst
>>>>>> new file mode 100644
>>>>>> index 000000000000..bc701157cb34
>>>>>> --- /dev/null
>>>>>> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
>>>>>> @@ -0,0 +1,503 @@
>>>>>> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
>>>>>> +
>>>>>> +===============
>>>>>> +VM_BIND locking
>>>>>> +===============
>>>>>> +
>>>>>> +This document attempts to describe what's needed to get
>>>>>> VM_BIND
>>>>>> locking right,
>>>>>> +including the userptr mmu_notifier locking and it will also
>>>>>> discuss some
>>>>>> +optimizations to get rid of the looping through of all userptr
>>>>>> mappings and
>>>>>> +external / shared object mappings that is needed in the
>>>>>> simplest
>>>>>> +implementation. It will also discuss some implications for
>>>>>> faulting gpu_vms.
>>>>>> +
>>>>>> +Nomenclature
>>>>>> +============
>>>>>> +
>>>>>> +* ``Context``: GPU execution context.
>>>>>> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
>>>>>> +  meta-data. Typically one per client (DRM file-private), or
>>>>>> one
>>>>>> per
>>>>>> +  context.
>>>>> Should we mention that it's a driver object, likely inheriting
>>>>> from
>>>>> drm_gpuvm?
>>>> Yes, I can try to be a bit more drm_gpuvm-centric throughout the
>>>> document, although I want to avoid being too specific due to the
>>>> current rapid drm_gpuvm rate of change, which might also affect
>>>> this
>>>> document. I guess I will have to commit for now to update the
>>>> document
>>>> on each gpuvm series we land...
>>> Well, I'd suggest that the one doing changes to drm_gpuvm gets to
>>> update this document along the way, or even better, make this
>>> documentation part of the drm_gpuvm doc, so there's no excuse to not
>>> update it when drm_gpuvm is extended.
>> Sure, Although I think initial merge will be as is, and then merging
>> with drm_gpuvm will come at a later point.
> Sure, I have no problem with that.
>
>>>>    
>>>>>      
>>>>>> +   add_dma_fence(&obj->resv, job_dma_fence);
>>>>>> +
>>>>>> +   dma_resv_unlock(&obj->resv);
>>>>>> +   put_object(obj);
>>>>>> +
>>>>>> +Note that since the object is local to the gpu_vm, it will
>>>>>> share
>>>>>> the gpu_vm's
>>>>>> +dma_resv lock so that ``obj->resv == gpu_vm->resv``.
>>>>>> +The gpu_vm_bos marked for eviction are put on the gpu_vm's
>>>>>> evict
>>>>>> list,
>>>>>> +which is protected by ``gpu_vm->resv``, that is always locked
>>>>>> while
>>>>>> +evicting, due to the above equality.
>>>>>> +
>>>>>> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before
>>>>>> eviction,
>>>>>> +Since the driver must ensure that the eviction blit or copy
>>>>>> will
>>>>>> wait
>>>>>> +for GPU idle or depend on all previous GPU activity.
>>>>>> Furthermore,
>>>>>> any
>>>>>> +subsequent attempt by the GPU to access freed memory through
>>>>>> the
>>>>>> +gpu_vma will be preceded by a new exec function, with a
>>>>>> revalidation
>>>>>> +section which will make sure all gpu_vmas are rebound. The
>>>>>> eviction
>>>>>> +code holding the object's dma_resv while revalidating will
>>>>>> ensure
>>>>>> a
>>>>>> +new exec function may not race with the eviction. Note that
>>>>>> this
>>>>>> will
>>>>>> +not hold true, however, if only a subsets of vmas are, due to
>>>>>> the
>>>>>> +driver implementation, selected for rebinding the next exec
>>>>>> +function.
>>>>> This last sentence is hard to follow.
>>>>>
>>>>> "
>>>>> Note that this will not hold true if only a subset of vmas
>>>>> are selected for rebinding during the next exec call (for
>>>>> instance,
>>>>> due
>>>>> to some driver decision to only partially restore VMAs).
>>>>> "
>>>>>      
>>>>>> Then all vmas *not* selected for rebinding needs to be
>>>>>> +properly unbound before re-enabling GPU access to the VM.
>>>>> I think I get the problem, but can we have a use case where
>>>>> partial
>>>>> VMA restoration is useful? I mean, if some VMAs are not needed,
>>>>> we
>>>>> might be able to save MMU page table allocation/setup-time, but
>>>>> given
>>>>> the mess it then is to track those non-live VMAs, I'm wondering
>>>>> if
>>>>> it's
>>>>> leaving the door open for that, unless there's a good reason to
>>>>> do
>>>>> it.
>>>> This is the use-case Christian has been flagging for for OpenGL and
>>>> Media where he warns that the single-vm-memory-overcommit case
>>>> would
>>>> otherwise make the app crawl.
>>> IIUC, the partial VMA restore is about not restoring all VMAs
>>> attached
>>> to a vm_bo, but as soon as you restore one, this makes the BO
>>> resident,
>>> so all you're saving here is the extra page table for non-restored
>>> VMAs.
>>> I don't think that would significantly help the overcommit use case,
>>> unless you have so many VMAs attached to a single vm_bo that the
>>> amount of extra page tables becomes non-negligible compared to the BO
>>> size itself.
>>>
>>> What would really help the overcommit use case is not restoring all
>>> evicted BOs, if some of them are known to be in a range that's not
>>> accessed by a GPU job. In that situation, you can decide to leave
>>> vm_bos in the evicted list if none of their VMAs overlap any of the
>>> VM
>>> ranges used by a job.
>> Yes this section here is the key: The vmas belonging to evicted bos not
>> restored would be the ones not "selected for rebind".
> Okay, but then I don't see the problem if you leave such vm_bos in the
> evicted list. Those will still be considered for 'rebind' next time the
> evicted list is walked (basically on the next exec), right?

Yes, but given what's written below that "rebind next time" also allows 
gpu activation with stale mappings, so at least we need to invalidate 
the stale mappings at this point.

/Thomas


>
>>>    
>>>> Generalized one might want to think of these as groups of (or
>>>> perhaps
>>>> virtual ranges of) vmas that need to be resident for a single job
>>>> submission. Currently we assume the residency-group <-> vm mapping
>>>> is
>>>> 1:1, allowing for the unbind-before-eviction to be ignored, but I
>>>> figure moving forward and addressing performance problems of real-
>>>> world
>>>> applications that may not always be true.
>>> Maybe I don't understand what unbind-before-eviction means. To me
>>> it's
>>> the operation of tearing down all VM mappings (unbind) before
>>> returning
>>> BO pages to the system (evict). I don't see a situation where the
>>> eviction of a BO doesn't imply killing all VM mapping pointing to
>>> this
>>> BO.
>> It's the point of teardown that matters here. You can return the pages
>> to system without tearing down the GPU mappings, as long as you tear
>> them down before there are any GPU activity on that vm again. In xe we
>> tear down the old mappings as part of the rebind process after we've
>> validated the evicted bo again, (but before submitting the GPU job).
>> The point is the stale mappings can be left as long as there is no gpu
>> job accessing them.
> I see. As long as you're sure the VM is completely inactive, and that
> such mappings are destroyed before the VM is used again, that's fine I
> guess. Might be worth a detailed explanation about the different
> scenarios though.
>
Thomas Hellström Nov. 17, 2023, 8:33 a.m. UTC | #14
Hi,


On 11/16/23 14:13, Bagas Sanjaya wrote:
> On Wed, Nov 15, 2023 at 01:49:37PM +0100, Thomas Hellström wrote:
>> +TODO: Pointer to the gpuvm code implementation if this iteration and
> "... implementation of this iteration ..."
>
>> +Using a MMU notifier for device DMA (and other methods) is described in
>> +`this document
>> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
> You can use internal linking instead:
>
> ---- >8 ----
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index d3c1f6d8c0e0ec..6b5f7e6e7155fb 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -153,6 +153,8 @@ NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
>   because DAX pages do not have a separate page cache, and so "pinning" implies
>   locking down file system blocks, which is not (yet) supported in that way.
>   
> +.. _mmu-notifier-registration-case:
> +
>   CASE 3: MMU notifier registration, with or without page faulting hardware
>   -------------------------------------------------------------------------
>   Device drivers can pin pages via get_user_pages*(), and register for mmu
> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
> index bc701157cb3414..08b6a47a6e592f 100644
> --- a/Documentation/gpu/drm-vm-bind-locking.rst
> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> @@ -366,8 +366,7 @@ need to care about, but so far it has proven difficult to exclude
>   certain notifications.
>   
>   Using a MMU notifier for device DMA (and other methods) is described in
> -`this document
> -<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
> +:ref:`pin_user_pages() documentation <mmu-notifier-registration-case>`.
>   
>   Now the method of obtaining struct page references using
>   get_user_pages() unfortunately can't be used under a dma_resv lock
>
> Thanks.
>
Thanks. I'll take a look at doing this as well.

Thomas
Thomas Hellström Nov. 21, 2023, 8:41 a.m. UTC | #15
Hi, Boris

On 11/16/23 15:47, Boris Brezillon wrote:
> On Thu, 16 Nov 2023 14:53:50 +0100
> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>
>> On Thu, 2023-11-16 at 14:27 +0100, Boris Brezillon wrote:
>>> On Thu, 16 Nov 2023 12:48:45 +0100
>>> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>>>    
>>>> Hi, Boris,
>>>>
>>>> Thanks for reviewing. Some comments below:
>>>>
I'm going to send out an updated version today with I think all of 
Danilo's comments and must of yours addressed. While I added more 
references to GPUVM, mainly as code examples and explanations, I've 
intentionally left out the "This is a driver lock and this is a gpvum 
lock distinction", The reason is twofold. First I think that when we get 
userptr into gpvum, gpuvm needs to be aware of most if not all locks. 
Second, since this document is an implementation guideline and gpuvm is 
an implementation it makes more sense to me to add pointers from the 
GPUVM documentation to the VM_BIND locking guideline, and that could be 
a task to be looked at after merging this together with implementing the 
userptr stuff. The most important thing at this point is that the 
document doesn't conflict in any way with the gpuvm implementation, and 
I've fixed those parts where I missed the separate lock protecting the 
GEM object's vm_bo list that you pointed out.

I strongly think this is the right way to go but if you disagree to the 
point where you're not willing to provide an ack or R-B, let me know and 
we can look at adding what's missing.

Thanks,

Thomas
diff mbox series

Patch

diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
new file mode 100644
index 000000000000..bc701157cb34
--- /dev/null
+++ b/Documentation/gpu/drm-vm-bind-locking.rst
@@ -0,0 +1,503 @@ 
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+===============
+VM_BIND locking
+===============
+
+This document attempts to describe what's needed to get VM_BIND locking right,
+including the userptr mmu_notifier locking and it will also discuss some
+optimizations to get rid of the looping through of all userptr mappings and
+external / shared object mappings that is needed in the simplest
+implementation. It will also discuss some implications for faulting gpu_vms.
+
+Nomenclature
+============
+
+* ``Context``: GPU execution context.
+* ``gpu_vm``: Abstraction of a virtual GPU address space with
+  meta-data. Typically one per client (DRM file-private), or one per
+  context.
+* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
+  associated meta-data. The backing storage of a gpu_vma can either be
+  a GEM object or anonymous pages mapped also into the CPU
+  address space for the process.
+* gpu_vm_bo: Abstracts the association of a GEM object and
+  a VM. Note that if only one gpu_vma per vm and buffer object were
+  allowed, the state stored with a gpu_vm_bo could just as well have
+  been stored with the gpu_vma. For the purpose of this document, each
+  GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo
+  maintains a list of gpu_vmas.
+* ``userptr gpu_vma or just userptr``: A gpu_vma, whose backing store
+  is anonymous pages as described above.
+* ``revalidating``: Revalidating a gpu_vma means making the latest version
+  of the backing store resident and making sure the gpu_vma's
+  page-table entries point to that backing store.
+* ``dma_fence``: A struct dma_fence that is similar to a struct completion
+  and which tracks GPU activity. When the GPU activity is finished,
+  the dma_fence signals.
+* ``dma_resv``: A struct dma_resv (a.k.a reservation object) that is used
+  to track GPU activity in the form of multiple dma_fences on a
+  gpu_vm or a GEM object. The dma_resv contains an array / list
+  of dma_fences and a lock that needs to be held when adding
+  additional dma_fences to the dma_resv. The lock is of a type that
+  allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
+* ``exec function``: An exec function is a function that revalidates all
+  affected gpu_vmas, submits a GPU command batch and registers the
+  dma_fence representing the GPU command's activity with all affected
+  dma_resvs. For completeness, although not covered by this document,
+  it's worth mentioning that an exec function may also be the
+  revalidation worker that is used by some drivers in compute /
+  long-running mode.
+* ``local object``: A GEM object which is local to a gpu_vm. Shared gem
+  objects also share the gpu_vm's dma_resv.
+* ``shared object``: a.k.a external object: A GEM object which may be shared
+  by multiple gpu_vms and whose backing storage may be shared with
+  other drivers.
+
+
+Locks used and locking orders
+=============================
+
+One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
+dma_resv object and hence the dma_resv lock. So even with a huge
+number of local GEM objects, only one lock is needed to make the exec
+sequence atomic.
+
+The following locks and locking orders are used:
+
+* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is
+  partitioned into gpu_vmas. It can also protect the gpu_vm's list of
+  userptr gpu_vmas. With a CPU mm analogy this would correspond to the
+  mmap_lock.
+* The ``userptr_seqlock``. This lock is taken in read mode for each
+  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
+  notifier invalidation. This is not a real seqlock but described in
+  ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side
+  'lock' a lot like a seqcount, however this allows multiple
+  write-sides to hold it at once...". The read side critical section
+  is enclosed by ``mmu_interval_read_begin() /
+  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
+  sleeping if the write side is held.
+  The write side is held by the core mm while calling mmu interval
+  invalidation notifiers.
+* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
+  rebinding, and also the residency of all the gpu_vm's local GEM object.
+  Furthermore it typically protects the gpu_vm's list of evicted GEM
+  objects and external objects.
+* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is
+  taken in read mode during exec and write mode during a mmu notifier
+  invalidation. The userptr notifier lock is per gpu_vm.
+* The gpu_vm list spinlocks. With some implementations they are needed
+  to be able to update the gpu_vm evicted- and external object
+  list. For those implementations, the spinlocks are grabbed when the
+  lists are manipulated. However to avoid locking order violations
+  with the dma_resv locks, a special scheme is needed when iterating
+  over the lists.
+
+.. _gpu_vma lifetime:
+
+Protection and lifetime of gpu_vm_bos and gpu_vmas
+==================================================
+
+The GEM object's list of gpu_vm_bos is typically protected by the
+GEM object's dma_resv. Each gpu_vm_bo holds a reference counted pointer
+to the underlying GEM object, and each gpu_vma holds a reference counted
+pointer to the gpu_vm_bo. When iterating over the GEM object's
+list of gpu_vm_bos the gem object's dma_resv must thus be held,
+but if it needs to be dropped during the iteration, care needs to be
+taken so that any gpu_vm_bo, and the gpu_vm, if dereferenced
+while the lock is dropped, do not disappear. The easiest way to avoid
+this is to take a reference on affected objects while the dma_resv is
+still held. If iterating over the gpu_vm_bo's gpu_vmas, even
+greater care needs to be taken since the gpu_vmas are not
+reference counted. If a driver accesses a gpu_vma obtained from
+the gpu_vm_bo's list of gpu_vmas, and the GEM object's
+dma_resv is dropped, at the very least, it should be thoroughly
+documented how the gpu_vma is kept alive. Otherwise holding the
+GEM object's dma_resv lock also around unlinking a gpu_vma from a
+gpu_vm_bo will ensure that doesn't happen.
+
+
+Revalidation and eviction of local objects
+==========================================
+
+Revalidation
+____________
+With VM_BIND, all local objects need to be resident when the gpu is
+executing using the gpu_vm, and the objects need to have valid
+gpu_vmas set up pointing to them. Typically each gpu command buffer
+submission is therefore preceded with a re-validation section:
+
+.. code-block:: C
+
+   dma_resv_lock(gpu_vm->resv);
+
+   // Validation section starts here.
+   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
+           validate_gem_bo(&gpu_vm_bo->gem_bo);
+
+           // The following list iteration needs the Gem object's
+           // dma_resv to be held (it protects the gpu_vm_bo's list of
+           // gpu_vmas, but since local gem objects share the gpu_vm's
+           // dma_resv, it is already held at this point.
+           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
+                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
+   }
+
+   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) {
+           rebind_gpu_vma(&gpu_vma);
+           remove_gpu_vma_from_rebind_list(&gpu_vma);
+   }
+   // Validation section ends here, and job submission starts.
+
+   add_dependencies(&gpu_job, &gpu_vm->resv);
+   job_dma_fence = gpu_submit(&gpu_job));
+
+   add_dma_fence(job_dma_fence, &gpu_vm->resv);
+   dma_resv_unlock(gpu_vm->resv);
+
+The reason for having a separate gpu_vm rebind list is that there
+might be userptr gpu_vmas that are not mapping a buffer object that
+also need rebinding.
+
+Eviction
+________
+
+Eviction of one of these local objects will then look similar to the
+following:
+
+.. code-block:: C
+
+   obj = get_object_from_lru();
+
+   dma_resv_lock(obj->resv);
+   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo);
+           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
+
+   add_dependencies(&eviction_job, &obj->resv);
+   job_dma_fence = gpu_submit(&eviction_job);
+   add_dma_fence(&obj->resv, job_dma_fence);
+
+   dma_resv_unlock(&obj->resv);
+   put_object(obj);
+
+Note that since the object is local to the gpu_vm, it will share the gpu_vm's
+dma_resv lock so that ``obj->resv == gpu_vm->resv``.
+The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list,
+which is protected by ``gpu_vm->resv``, that is always locked while
+evicting, due to the above equality.
+
+For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction,
+Since the driver must ensure that the eviction blit or copy will wait
+for GPU idle or depend on all previous GPU activity. Furthermore, any
+subsequent attempt by the GPU to access freed memory through the
+gpu_vma will be preceded by a new exec function, with a revalidation
+section which will make sure all gpu_vmas are rebound. The eviction
+code holding the object's dma_resv while revalidating will ensure a
+new exec function may not race with the eviction. Note that this will
+not hold true, however, if only a subsets of vmas are, due to the
+driver implementation, selected for rebinding the next exec
+function. Then all vmas *not* selected for rebinding needs to be
+properly unbound before re-enabling GPU access to the VM.
+
+
+Locking with external (or shared) buffer objects
+================================================
+
+Since shared buffer objects may be shared by multiple gpu_vm's they
+can't share their reservation object with a single gpu_vm, but will rather
+have a reservation object of their own. The shared objects bound to a
+gpu_vm using one or many gpu_vmas are therefore typically put on a
+per-gpu_vm list which is protected by the gpu_vm's dma_resv lock. Once
+the gpu_vm's reservation object  is locked, it is safe to traverse the
+external object list and lock the dma_resvs of all external objects.
+
+At eviction time we now need to put the gpu_vm_bos of *all* gpu_vms a
+shared object is bound to on the gpu_vm's evict list, but we can no longer
+be certain that we hold the gpu_vm's dma_resv of all the gpu_vms the
+object is bound to, since at eviction time we only hold the object's
+private dma_resv. If we have a ww_acquire context at hand at eviction
+time we could grab the those dma_resvs but that could cause
+expensive ww_mutex rollbacks. A simple option is to just mark the
+gpu_vm_bos of the evicted gem object with an ``evicted`` bool that
+is inspected the next time the corresponding gpu_vm evicted list needs
+to be traversed. At that time the gpu_vm's dma_resv and the object's
+dma_resv is held, and the gpu_vm_bo marked evicted, can then be added
+to the gpu_vm's list of evicted gpu_vm_bos. The ``evicted`` bool would
+then be protected by the object's dma_resv.
+
+The exec function would then become
+
+.. code-block:: C
+
+   dma_resv_lock(gpu_vm->resv);
+
+   // External object list is protected by the gpu_vm->resv lock.
+   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
+           dma_resv_lock(gpu_vm_bo.gem_obj->resv);
+           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
+                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
+   }
+
+   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
+           validate_gem_bo(&gpu_vm_bo->gem_bo);
+
+           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
+                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
+   }
+
+   for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) {
+           rebind_gpu_vma(&gpu_vma);
+           remove_gpu_vma_from_rebind_list(&gpu_vma);
+   }
+
+   add_dependencies(&gpu_job, &gpu_vm->resv);
+   job_dma_fence = gpu_submit(&gpu_job));
+
+   add_dma_fence(job_dma_fence, &gpu_vm->resv);
+   for_each_shared_obj(gpu_vm, &obj)
+          add_dma_fence(job_dma_fence, &obj->resv);
+   dma_resv_unlock_all_resv_locks();
+
+And the corresponding shared-object aware eviction would look like:
+
+.. code-block:: C
+
+   obj = get_object_from_lru();
+
+   dma_resv_lock(obj->resv);
+   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
+           if (object_is_vm_local(obj))
+                add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
+           else
+                mark_gpu_vm_bo_evicted(&gpu_vm_bo);
+
+   add_dependencies(&eviction_job, &obj->resv);
+   job_dma_fence = gpu_submit(&eviction_job);
+   add_dma_fence(&obj->resv, job_dma_fence);
+
+   dma_resv_unlock(&obj->resv);
+   put_object(obj);
+
+.. _Spinlock iteration:
+
+Accessing the gpu_vm's lists without the dma_resv lock held
+===========================================================
+
+Although some drivers will hold the gpu_vm's dma_resv lock when
+accessing the gpu_vm's evict list and external objects lists, there
+are drivers that need to access these lists without the dma_resv lock held,
+for example due to asynchronous state updates from within the
+dma_fence signalling critical path. In such case a spinlock can be
+used to protect manipulation of the lists. However, since higher level
+sleeping locks needs to be taken for each list item while iterating
+over the lists, the items already iterated over needs to be
+temporarily moved to a private list and the spinlock released
+while processing each item:
+
+.. code block:: C
+
+    struct list_head still_in_list;
+
+    INIT_LIST_HEAD(&still_in_list);
+
+    spin_lock(&gpu_vm->list_lock);
+    do {
+            struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);
+
+            if (!entry)
+                    break;
+
+            list_move_tail(&entry->head, &still_in_list);
+            list_entry_get_unless_zero(entry);
+            spin_unlock(&gpu_vm->list_lock);
+
+            process(entry);
+
+            spin_lock(&gpu_vm->list_lock);
+            list_entry_put(entry);
+    } while (true);
+
+    list_splice_tail(&still_in_list, &gpu_vm->list);
+    spin_unlock(&gpu_vm->list_lock);
+
+However, due to the additional locking and atomic operations, drivers that *can*
+avoid accessing the gpu_vm's list outside of the dma_resv lock
+might want to avoid this iteration scheme, if the driver anticipates a
+large number of list items. For lists where the anticipated number of
+list items is small, list iteration doesn't happen very often, or
+there is a significant additional cost associated with each iteration,
+the atomic operation overhead associated with this type of iteration
+is, however, probably negligible. Note that if this scheme is
+used, it is necessary to make sure this list iteration is protected by
+an outer level lock or semaphore, since list items are temporarily
+pulled off the list while iterating.
+
+TODO: Pointer to the gpuvm code implementation if this iteration and
+how to choose either iteration scheme.
+
+userptr gpu_vmas
+================
+
+A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
+GPU virtual address range, directly maps a CPU mm range of anonymous-
+or file page-cache pages.
+A very simple approach would be to just pin the pages using
+pin_user_pages() at bind time and unpin them at unbind time, but this
+creates a Denial-Of-Service vector since a single user-space process
+would be able to pin down all of system memory, which is not
+desirable. (For special use-cases and with proper accounting pinning might
+still be a desirable feature, though). What we need to do in the
+general case is to obtain a reference to the desired pages, make sure
+we are notified using a MMU notifier just before the CPU mm unmaps the
+pages, dirty them if they are not mapped read-only to the GPU, and
+then drop the reference.
+When we are notified by the MMU notifier that CPU mm is about to drop the
+pages, we need to stop GPU access to the pages by waiting for VM idle
+in the MMU notifier and make sure that before the next time the GPU
+tries to access whatever is now present in the CPU mm range, we unmap
+the old pages from the GPU page tables and repeat the process of
+obtaining new page references. (See the :ref:`notifier example
+<Invalidation example>` below). Note that when the core mm decides to
+laundry pages, we get such an unmap MMU notification and can mark the
+pages dirty again before the next GPU access. We also get similar MMU
+notifications for NUMA accounting which the GPU driver doesn't really
+need to care about, but so far it has proven difficult to exclude
+certain notifications.
+
+Using a MMU notifier for device DMA (and other methods) is described in
+`this document
+<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
+
+Now the method of obtaining struct page references using
+get_user_pages() unfortunately can't be used under a dma_resv lock
+since that would violate the locking order of the dma_resv lock vs the
+mmap_lock that is grabbed when resolving a CPU pagefault. This means
+the gpu_vm's list of userptr gpu_vmas needs to be protected by an
+outer lock.
+
+The MMU interval seqlock for a userptr gpu_vma is used in the following
+way:
+
+.. code-block:: C
+
+   // Exclusive locking mode here is strictly needed only if there are
+   // invalidated userptr vmas present, to avoid multiple userptr
+   // revalidations.
+   down_write(&gpu_vm->lock);
+   retry:
+
+   // Note: mmu_interval_read_begin() blocks until there is no
+   // invalidation notifier running anymore.
+   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
+   if (seq != gpu_vma->saved_seq) {
+           obtain_new_page_pointers(&gpu_vma);
+           dma_resv_lock(&gpu_vm->resv);
+           add_gpu_vma_top_revalidate_list(&gpu_vma, &gpu_vm);
+           dma_resv_unlock(&gpu_vm->resv);
+           gpu_vma->saved_seq = seq;
+   }
+
+   // The usual revalidation goes here.
+
+   // Final userptr sequence validation may not happen before the
+   // submission dma_fence is added to the gpu_vm's resv, from the POW
+   // of the MMU invalidation notifier. Hence the
+   // userptr_notifier_lock that will make them appear atomic.
+
+   add_dependencies(&gpu_job, &gpu_vm->resv);
+   down_read(&gpu_vm->userptr_notifier_lock);
+   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
+          up_read(&gpu_vm->userptr_notifier_lock);
+          goto retry;
+   }
+
+   job_dma_fence = gpu_submit(&gpu_job));
+
+   add_dma_fence(job_dma_fence, &gpu_vm->resv);
+
+   for_each_shared_obj(gpu_vm, &obj)
+          add_dma_fence(job_dma_fence, &obj->resv);
+
+   dma_resv_unlock_all_resv_locks();
+   up_read(&gpu_vm->userptr_notifier_lock);
+   up_write(&gpu_vm->lock);
+
+The code between ``mmu_interval_read_begin()`` and the
+``mmu_interval_read_retry()`` marks the read side critical section of
+what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
+gpu_vma list is looped through, and the check is done for *all* of its
+userptr gpu_vmas, although we only show a single one here.
+
+The userptr gpu_vma MMU invalidation notifier might be called from
+reclaim context and, again to avoid locking order violations, we can't
+take any dma_resv lock nor the gpu_vm->lock from within it.
+
+.. _Invalidation example:
+.. code-block:: C
+
+  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
+  {
+          // Make sure the exec function either sees the new sequence
+          // and backs off or we wait for the dma-fence:
+
+          down_write(&gpu_vm->userptr_notifier_lock);
+          mmu_interval_set_seq(userptr_interval, cur_seq);
+          up_write(&gpu_vm->userptr_notifier_lock);
+
+          // At this point, the exec function can't succeed in
+          // submitting a new job, because cur_seq is an invalid
+          // sequence number and will always cause a retry. When all
+          // invalidation callbacks, the mmu notifier core will flip
+          // the sequence number to a valid one. However we need to
+          // stop gpu access to the old pages here.
+
+          dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
+                                false, MAX_SCHEDULE_TIMEOUT);
+          return true;
+  }
+
+When this invalidation notifier returns, the GPU can no longer be
+accessing the old pages of the userptr gpu_vma and needs to redo the
+page-binding before a new GPU submission can succeed.
+
+Efficient userptr gpu_vma exec_function iteration
+_________________________________________________
+
+If the gpu_vm's list of userptr gpu_vmas becomes large, it's
+inefficient to iterate through the complete lists of userptrs on each
+exec function to check whether each userptr gpu_vma's saved
+sequence number is invalid or stale. A solution to this is to put all
+*invalidated* userptr gpu_vmas on a separate gpu_vm list and
+only those gpu_vmas on the list are actually checked on each exec
+function. This list will then lend itself very-well to the spinlock
+locking scheme that is
+:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
+in the mmu notifier, where we add the invalidated gpu_vmas to the
+list, it's not possible to take any outer locks like the
+``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
+``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
+complete, as also mentioned in that section.
+
+If using an invalidated userptr list like this, the retry check in the
+exec function trivially becomes a check for invalidated list empty.
+
+Locking at bind and unbind time
+================================
+
+At bind time, assuming a GEM object backed gpu_vma, each
+gpu_vma needs to be associated with a gpu_vm_bo and that
+gpu_vm_bo in turn needs to be added to the GEM object's
+gpu_vm_bo list, and possibly to the gpu_vm's external object
+list. This is referred to as *linking* the gpu_vma, and typically
+requires that the ``gpu_vm->resv`` and the GEM object's dma_resv are
+held. When unlinking a gpu_vma the same locks are typically held,
+and that ensures, as briefly discussed
+:ref:`previously <gpu_vma lifetime>`, that when iterating over
+``gpu_vmas`, either under the ``gpu_vm->resv`` or the GEM
+object's dma_resv, that the gpu_vmas stay alive as long
+as the lock under which we iterate are not is not released. For
+userptr gpu_vmas it's similarly required that during unlink, the
+outer ``gpu_vm->lock`` is held, since otherwise when iterating over
+the invalidated userptr list as described in the previous section,
+there is nothing keeping those userptr gpu_vmas alive.
diff --git a/Documentation/gpu/implementation_guidelines.rst b/Documentation/gpu/implementation_guidelines.rst
index 138e637dcc6b..dbccfa72f1c9 100644
--- a/Documentation/gpu/implementation_guidelines.rst
+++ b/Documentation/gpu/implementation_guidelines.rst
@@ -7,3 +7,4 @@  Misc DRM driver uAPI- and feature implementation guidelines
 .. toctree::
 
    drm-vm-bind-async
+   drm-vm-bind-locking