[v2] Documentation/gpu: VM_BIND locking document

Message ID	20230816091547.2982-1-thomas.hellstrom@linux.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com> To: intel-xe@lists.freedesktop.org Subject: [PATCH v2] Documentation/gpu: VM_BIND locking document Date: Wed, 16 Aug 2023 11:15:47 +0200 Message-ID: <20230816091547.2982-1-thomas.hellstrom@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: list Cc: Matthew Brost <matthew.brost@intel.com>, =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>, Francois Dugast <francois.dugast@intel.com>, linux-kernel@vger.kernel.org, Oak Zeng <oak.zeng@intel.com>, Danilo Krummrich <dakr@redhat.com>, dri-devel@lists.freedesktop.org, Rodrigo Vivi <rodrigo.vivi@intel.com> Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[v2] Documentation/gpu: VM_BIND locking document \| expand [v2] Documentation/gpu: VM_BIND locking document

Thomas Hellström Aug. 16, 2023, 9:15 a.m. UTC

Add the first version of the VM_BIND locking document which is
intended to be part of the xe driver upstreaming agreement.

The document describes and discuss the locking used during exec-
functions, evicton and for userptr gpu-vmas. Intention is to be using the
same nomenclature as the drm-vm-bind-async.rst.

v2:
- s/gvm/gpu_vm/g (Rodrigo Vivi)
- Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
  (Rodrigo Vivi)
- Adjust commit message accordingly.
- Add SPDX license header.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 Documentation/gpu/drm-vm-bind-locking.rst | 351 ++++++++++++++++++++++
 1 file changed, 351 insertions(+)
 create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst

kernel test robot Aug. 17, 2023, 2:05 a.m. UTC | #1

Hi Thomas,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on drm/drm-next drm-tip/drm-tip linus/master v6.5-rc6 next-20230816]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Thomas-Hellstr-m/Documentation-gpu-VM_BIND-locking-document/20230816-171911
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20230816091547.2982-1-thomas.hellstrom%40linux.intel.com
patch subject: [PATCH v2] Documentation/gpu: VM_BIND locking document
reproduce: (https://download.01.org/0day-ci/archive/20230817/202308170916.TGY7kBpM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308170916.TGY7kBpM-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Documentation/gpu/drm-vm-bind-locking.rst: WARNING: document isn't included in any toctree

Rodrigo Vivi Aug. 31, 2023, 7:30 p.m. UTC | #2

On Wed, Aug 16, 2023 at 11:15:47AM +0200, Thomas Hellström wrote:
> Add the first version of the VM_BIND locking document which is
> intended to be part of the xe driver upstreaming agreement.
> 
> The document describes and discuss the locking used during exec-
> functions, evicton and for userptr gpu-vmas. Intention is to be using the
> same nomenclature as the drm-vm-bind-async.rst.
> 
> v2:
> - s/gvm/gpu_vm/g (Rodrigo Vivi)
> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
>   (Rodrigo Vivi)
> - Adjust commit message accordingly.
> - Add SPDX license header.
> 
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>

Cc: Danilo Krummrich <dakr@redhat.com>

> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  Documentation/gpu/drm-vm-bind-locking.rst | 351 ++++++++++++++++++++++
>  1 file changed, 351 insertions(+)
>  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> 
> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
> new file mode 100644
> index 000000000000..b813961a9ec2
> --- /dev/null
> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> @@ -0,0 +1,351 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +===============
> +VM_BIND locking
> +===============
> +
> +This document attempts to describe what's needed to get VM_BIND locking right,
> +including the userptr mmu_notifier locking and it will also discuss some
> +optimizations to get rid of the looping through of all userptr mappings and
> +external / shared object mappings that is needed in the simplest
> +implementation. It will also discuss some implications for faulting gpu_vms.
> +
> +Nomenclature
> +============
> +
> +* ``Context``: GPU execution context.
> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> +  meta-data. Typically one per client (DRM file-private), or one per
> +  context.
> +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
> +  associated meta-data. The backing storage of a gpu_vma can either be
> +  a gem buffer object or anonymous pages mapped also into the CPU
> +  address space for the process.
> +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing store of
> +  which is anonymous pages as described above.
> +* ``revalidating``: Revalidating a gpu_vma means making the latest version
> +  of the backing store resident and making sure the gpu_vma's
> +  page-table entries point to that backing store.
> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
> +  and which tracks GPU activity. When the GPU activity is finished,
> +  the dma_fence signals.
> +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used
> +  to track GPU activity in the form of multiple dma_fences on a
> +  gpu_vm or a gem buffer object. The dma_resv contains an array / list
> +  of dma_fences and a lock that needs to be held when adding
> +  additional dma_fences to the dma_resv. The lock is of a type that
> +  allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
> +* ``exec function``: An exec function is a function that revalidates all
> +  affected gpu_vmas, submits a GPU command batch and registers the
> +  dma_fence representing the GPU command's activity with all affected
> +  dma_resvs. For completeness, although not covered by this document,
> +  it's worth mentioning that an exec function may also be the
> +  revalidation worker that is used by some drivers in compute /
> +  long-running mode.
> +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem
> +  objects also share the gpu_vm's dma_resv.
> +* ``shared object``: AKA external object: A GEM object which may be shared
> +  by multiple gpu_vms and whose backing storage may be shared with
> +  other drivers.
> +
> +
> +Introducing the locks
> +=====================
> +
> +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
> +dma_resv object and hence the dma_resv lock. So even with a huge
> +number of local GEM objects, only one lock is needed to make the exec
> +sequence atomic.
> +
> +The following locks and locking orders are used:
> +
> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is
> +  partitioned into gpu_vmas, protects the gpu_vm's list of external objects,
> +  and can also with some simplification protect the gpu_vm's list of
> +  userptr gpu_vmas. With the CPU mm analogy this would correspond to the
> +  mmap_lock.
> +* The ``userptr_seqlock``. This lock is taken in read mode for each
> +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
> +  notifier invalidation. This is not a real seqlock but described in
> +  ``mm/mmu_notifier.c` as a "Collision-retry read-side/write-side
> +  'lock' a lot like a seqcount, however this allows multiple
> +  write-sides to hold it at once...". The read side critical section
> +  is enclosed by ``mmu_interval_read_begin() /
> +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
> +  sleeping uninterruptibly if the write side is held.
> +  The write side is held by the core mm while calling mmu interval
> +  invalidation notifiers.
> +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
> +  rebinding, and also the residency of all the gpu_vm's local GEM object.
> +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is taken in read
> +  mode during exec and write mode during a mmu notifier invalidation. In
> +  the absence of a separate page-table lock, this lock can serve
> +  together with the gpu_vm's dma_resv lock as a page-table lock. More on
> +  this below. The userptr notifier lock is per gpu_vm.
> +* The ``gpu_vm->page_table_lock``. Protects the gpu_vm's page-table updates. For
> +  simplicity the gpu_vm's dma_resv lock can be reused as page-table lock.
> +
> +There are certain optimizations described below that require
> +additional locks. More on that later.
> +
> +.. code-block:: C
> +
> +   dma_resv_lock(&gpu_vm->resv);
> +
> +   for_each_gpu_vma_on_revalidate_list(gpu_vm, &gpu_vma) {
> +		revalidate_gpu_vma(&gpu_vma);
> +		remove_from_revalidate_list(&gpu_vma);
> +   }
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   dma_resv_unlock(&gpu_vm->resv);
> +
> +Eviction of one of these local objects will then be something like the
> +following:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
> +		put_gpu_vma_on_revalidate_list(&gpu_vma);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);
> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +Note that since the object is local to the gpu_vm, it will share the gpu_vm's
> +``dma_resv`` lock so that ``obj->resv == gpu_vm->resv``. Invalidated gpu_vmas are put
> +on the gpu_vm's revalidation list, which is protected by ``gpu_vm->resv``, which
> +is always locked while evicting, due to the above equality.
> +
> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction,
> +Since the eviction blit or copy will wait for GPU idle, any attempt by
> +the GPU to access freed memory through the gpu_vma will be preceded by
> +a new exec function, which will make sure the gpu_vma is
> +revalidated. The eviction code holding the object's dma_resv while
> +revalidating will ensure a new exec function may not race with the eviction.
> +
> +Introducing external (or shared) buffer objects
> +===============================================
> +
> +Since shared buffer objects may be shared by multiple gpu_vm's they
> +can't share their reservation object with a single gpu_vm, but will rather
> +have a reservation object of their own. The shared objects bound to a
> +gpu_vm using one or many
> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
> +protected by the gpu_vm lock. One could in theory protect it also with
> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is typically
> +built before the ``gpu_vm->resv`` is locked due to a limitation in
> +the current locking helpers, that is typically not done. Also see
> +below for userptr gpu_vmas.
> +
> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
> +object, but we can no longer be certain that we hold the gpu_vm's
> +dma_resv of all the object's gpu_vmas. We can only be certain that we
> +hold the object's private dma_resv. We can trylock the dma_resvs for
> +the affected gpu_vm's but that might be unnecessarily complex. If we
> +have a ww_acquire context at hand at eviction time we can also perform
> +sleeping locks of those dma_resvs but that could cause expensive
> +rollbacks. One option is to just mark the invalidated gpu_vmas with a bool
> +which is inspected on the next exec function, when the gpu_vm's
> +dma_resv and the object's dma_resv is held, and the invalidated
> +gpu_vmas could then be put on the gpu_vm's list of invalidated
> +gpu_vmas. That bool would then, although being per-gpu_vma formally be
> +protected by the object's dma_resv.
> +
> +The exec function would then look something like the following:
> +
> +.. code-block:: C
> +
> +   read_lock(&gpu_vm->lock);
> +
> +   dma_resv_lock(&gpu_vm->resv);
> +
> +   // Shared object list is protected by the gpu_vm->lock.
> +   for_each_shared_obj(gpu_vm, &obj) {
> +		dma_resv_lock(&obj->resv);
> +		move_marked_gpu_vmas_to_revalidate_gpu_vma_list(obj, &gpu_vm);
> +   }
> +
> +   for_each_gpu_vma_to_revalidate(gpu_vm, &gpu_vma) {
> +		revalidate_gpu_vma(&gpu_vma);
> +		remove_from_revalidate_list(&gpu_vma);
> +   }
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +   dma_resv_unlock_all_resv_locks();
> +
> +   read_unlock(&gpu_vm->lock);
> +
> +And the corresponding shared-object aware eviction would look like:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
> +		if (object_is_vm_local(obj))
> +		             put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
> +		else
> +		             mark_gpu_vma_for_revalidation(&gpu_vma);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);
> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +Yet another option is to put the gpu_vmas to be invalidated on a separate
> +gpu_vm list protected by a lower level lock that can be taken both at eviction
> +time and at transfer-to-revalidate list time. The details are not in
> +this document, but this for reference implemented in the Intel xe
> +driver.
> +
> +Introducing userptr gpu_vmas
> +============================
> +
> +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
> +GPU virtual address range, directly maps a CPU mm range of anonymous-
> +or file page-cache pages.
> +A very simple approach would be to just pin the pages using
> +pin_user_pages() at bind time and unpin them at unbind time, but this
> +creates a Denial-Of-Service vector since a single user-space process
> +would be able to pin down all of system memory, which is not
> +desirable. (For special use-cases and with proper accounting pinning might
> +still be a desirable feature, though). What we need to do in the general case is
> +to obtain a reference to the desired pages, make sure we are notified
> +using a MMU notifier just before the CPU mm unmaps the pages, dirty
> +them if they are not mapped read-only to the GPU, and then drop the reference.
> +When we are notified by the MMU notifier that CPU mm is about to drop the
> +pages, we need to stop GPU access to the pages,
> +GPU page-table and make sure that before the next time the GPU tries to access
> +whatever is now present in the CPU mm range, we unmap the old pages
> +from the GPU page tables and repeat the process of obtaining new page
> +references. Note that when the core mm decides to laundry pages, we get such
> +an unmap MMU notification and can mark the pages dirty again before the
> +next GPU access. We also get similar MMU notifications for NUMA accounting
> +which the GPU driver doesn't really need to care about, but so far
> +it's proven difficult to exclude certain notifications.
> +
> +Using a MMU notifier for device DMA (and other methods) is described in
> +`this document
> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
> +
> +Now the method of obtaining struct page references using
> +get_user_pages() unfortunately can't be used under a dma_resv lock
> +since that would violate the locking order of the dma_resv lock vs the
> +mmap_lock that is grabbed when resolving a CPU pagefault. This means the gpu_vm's
> +list of userptr gpu_vmas needs to be protected by an outer lock, and this
> +is the first time we strictly need the gpu_vm->lock. While it was
> +previously used also to protect the list of the gpu_vm's shared objects,
> +we could in theory have used the gpu_vm->resv for that.
> +
> +The MMU interval seqlock for a userptr gpu_vma is used in the following
> +way:
> +
> +.. code-block:: C
> +
> +   down_read(&gpu_vm->lock);
> +
> +   retry:
> +
> +   // Note: mmu_interval_read_begin() blocks until there is no
> +   // invalidation notifier running anymore.
> +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
> +   if (seq != gpu_vma->saved_seq) {
> +           obtain_new_page_pointers(&gpu_vma);
> +	   dma_resv_lock(&gpu_vm->resv);
> +	   put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
> +	   dma_resv_unlock(&gpu_vm->resv);
> +	   gpu_vma->saved_seq = seq;
> +   }
> +
> +   // The usual revalidation goes here.
> +
> +   // Final userptr sequence validation may not happen before the
> +   // submission dma_fence is added to the gpu_vm's resv, from the POW
> +   // of the MMU invalidation notifier. Hence the
> +   // userptr_notifier_lock that will make them appear atomic.
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   down_read(&gpu_vm->userptr_notifier_lock);
> +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
> +          up_read(&gpu_vm->userptr_notifier_lock);
> +	  goto retry;
> +   }
> +
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +
> +   dma_resv_unlock_all_resv_locks();
> +   up_read(&gpu_vm->userptr_notifier_lock);
> +   up_read(&gpu_vm->lock);
> +
> +The code between ``mmu_interval_read_begin()`` and the
> +``mmu_interval_read_retry()`` marks the read side critical section of
> +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
> +gpu_vma list is looped through, and the check is done for *all* of its
> +userptr gpu_vmas, although we only show a single one here.
> +
> +The userptr gpu_vma MMU invalidation notifier might be called from
> +reclaim context and, again to avoid locking order violations, we can't
> +take any dma_resv lock nor the gpu_vm->lock from within it.
> +
> +.. code-block:: C
> +
> +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
> +  {
> +          // Make sure the exec function either sees the new sequence
> +	  // and backs off or we wait for the dma-fence:
> +
> +          down_write(&gpu_vm->userptr_notifier_lock);
> +	  mmu_interval_set_seq(userptr_interval, cur_seq);
> +	  up_write(&gpu_vm->userptr_notifier_lock);
> +
> +	  dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
> +		                false, MAX_SCHEDULE_TIMEOUT);
> +	  return true;
> +  }
> +
> +When this invalidation notifier returns, the GPU can no longer be
> +accessing the old pages of the userptr gpu_vma and needs to redo the page-binding
> +before a new GPU submission can succeed.
> +
> +Optimizing gpu_vma iteration
> +----------------------------
> +
> +Iterating through all of a gpu_vm's userptr gpu_vmas to check the validity
> +on each exec function may be very costly. There is a scheme to avoid
> +this and only iterate through the userptr gpu_vmas that actually saw an
> +invalidation notifier call since the last exec. T

The document so far looks good to me.
I'd like to hear from Danilo if this aligns with nouveau locking
or if he has any further thoughts on this in general.

> +
> +TODO: describe that scheme here. It's implemented in the xe driver.
> +
> +Locking for page-table updates at bind- and unbind time
> +=======================================================
> +
> +TODO.
> +
> +Recoverable page-fault implications
> +===================================
> +
> +TODO.

We should probably add the TODO note somewhere else and keep the doc itself clean?
or the plan is to update before we push this patch?

> -- 
> 2.41.0
>

Danilo Krummrich Sept. 5, 2023, 7:50 p.m. UTC | #3

On Wed, Aug 16, 2023 at 11:15:47AM +0200, Thomas Hellström wrote:
> Add the first version of the VM_BIND locking document which is
> intended to be part of the xe driver upstreaming agreement.
> 
> The document describes and discuss the locking used during exec-
> functions, evicton and for userptr gpu-vmas. Intention is to be using the
> same nomenclature as the drm-vm-bind-async.rst.
> 
> v2:
> - s/gvm/gpu_vm/g (Rodrigo Vivi)
> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
>   (Rodrigo Vivi)
> - Adjust commit message accordingly.
> - Add SPDX license header.
> 
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  Documentation/gpu/drm-vm-bind-locking.rst | 351 ++++++++++++++++++++++
>  1 file changed, 351 insertions(+)
>  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
> 
> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
> new file mode 100644
> index 000000000000..b813961a9ec2
> --- /dev/null
> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> @@ -0,0 +1,351 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +===============
> +VM_BIND locking
> +===============
> +
> +This document attempts to describe what's needed to get VM_BIND locking right,
> +including the userptr mmu_notifier locking and it will also discuss some
> +optimizations to get rid of the looping through of all userptr mappings and
> +external / shared object mappings that is needed in the simplest
> +implementation. It will also discuss some implications for faulting gpu_vms.
> +
> +Nomenclature
> +============
> +
> +* ``Context``: GPU execution context.
> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
> +  meta-data. Typically one per client (DRM file-private), or one per
> +  context.
> +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with

The same nomenclature was used within the VM_BIND async document as well. I
wonder if it would make sense to align the naming with the GPUVA manager, such
that ('drm_gpuva_manager' -> 'drm_gpuvm'). This would also result into better
function names, such as drm_gpuvm_resv_lock() or drm_gpuvm_prepare_objects() and
potentially way better naming for the VM_BO abstraction 'drm_gpuvm_bo'.

However, I'd like to keep 'drm_gpuva' rather than 'drm_gpu_vma', but I think
this is close enough anyway.

> +  associated meta-data. The backing storage of a gpu_vma can either be
> +  a gem buffer object or anonymous pages mapped also into the CPU
> +  address space for the process.
> +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing store of
> +  which is anonymous pages as described above.
> +* ``revalidating``: Revalidating a gpu_vma means making the latest version
> +  of the backing store resident and making sure the gpu_vma's
> +  page-table entries point to that backing store.
> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
> +  and which tracks GPU activity. When the GPU activity is finished,
> +  the dma_fence signals.
> +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used
> +  to track GPU activity in the form of multiple dma_fences on a
> +  gpu_vm or a gem buffer object. The dma_resv contains an array / list
> +  of dma_fences and a lock that needs to be held when adding
> +  additional dma_fences to the dma_resv. The lock is of a type that
> +  allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
> +* ``exec function``: An exec function is a function that revalidates all
> +  affected gpu_vmas, submits a GPU command batch and registers the
> +  dma_fence representing the GPU command's activity with all affected
> +  dma_resvs. For completeness, although not covered by this document,
> +  it's worth mentioning that an exec function may also be the
> +  revalidation worker that is used by some drivers in compute /
> +  long-running mode.
> +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem
> +  objects also share the gpu_vm's dma_resv.
> +* ``shared object``: AKA external object: A GEM object which may be shared
> +  by multiple gpu_vms and whose backing storage may be shared with
> +  other drivers.
> +
> +
> +Introducing the locks
> +=====================
> +
> +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
> +dma_resv object and hence the dma_resv lock. So even with a huge
> +number of local GEM objects, only one lock is needed to make the exec
> +sequence atomic.
> +
> +The following locks and locking orders are used:
> +
> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is
> +  partitioned into gpu_vmas, protects the gpu_vm's list of external objects,
> +  and can also with some simplification protect the gpu_vm's list of
> +  userptr gpu_vmas. With the CPU mm analogy this would correspond to the
> +  mmap_lock.
> +* The ``userptr_seqlock``. This lock is taken in read mode for each
> +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
> +  notifier invalidation. This is not a real seqlock but described in
> +  ``mm/mmu_notifier.c` as a "Collision-retry read-side/write-side
> +  'lock' a lot like a seqcount, however this allows multiple
> +  write-sides to hold it at once...". The read side critical section
> +  is enclosed by ``mmu_interval_read_begin() /
> +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
> +  sleeping uninterruptibly if the write side is held.
> +  The write side is held by the core mm while calling mmu interval
> +  invalidation notifiers.
> +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
> +  rebinding, and also the residency of all the gpu_vm's local GEM object.
> +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is taken in read
> +  mode during exec and write mode during a mmu notifier invalidation. In
> +  the absence of a separate page-table lock, this lock can serve
> +  together with the gpu_vm's dma_resv lock as a page-table lock. More on
> +  this below. The userptr notifier lock is per gpu_vm.
> +* The ``gpu_vm->page_table_lock``. Protects the gpu_vm's page-table updates. For
> +  simplicity the gpu_vm's dma_resv lock can be reused as page-table lock.
> +
> +There are certain optimizations described below that require
> +additional locks. More on that later.
> +
> +.. code-block:: C
> +
> +   dma_resv_lock(&gpu_vm->resv);
> +
> +   for_each_gpu_vma_on_revalidate_list(gpu_vm, &gpu_vma) {
> +		revalidate_gpu_vma(&gpu_vma);
> +		remove_from_revalidate_list(&gpu_vma);
> +   }
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   dma_resv_unlock(&gpu_vm->resv);
> +
> +Eviction of one of these local objects will then be something like the
> +following:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
> +		put_gpu_vma_on_revalidate_list(&gpu_vma);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);
> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +Note that since the object is local to the gpu_vm, it will share the gpu_vm's
> +``dma_resv`` lock so that ``obj->resv == gpu_vm->resv``. Invalidated gpu_vmas are put
> +on the gpu_vm's revalidation list, which is protected by ``gpu_vm->resv``, which
> +is always locked while evicting, due to the above equality.
> +
> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction,
> +Since the eviction blit or copy will wait for GPU idle, any attempt by
> +the GPU to access freed memory through the gpu_vma will be preceded by
> +a new exec function, which will make sure the gpu_vma is
> +revalidated. The eviction code holding the object's dma_resv while
> +revalidating will ensure a new exec function may not race with the eviction.
> +
> +Introducing external (or shared) buffer objects
> +===============================================
> +
> +Since shared buffer objects may be shared by multiple gpu_vm's they
> +can't share their reservation object with a single gpu_vm, but will rather
> +have a reservation object of their own. The shared objects bound to a
> +gpu_vm using one or many
> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
> +protected by the gpu_vm lock. One could in theory protect it also with
> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is typically
> +built before the ``gpu_vm->resv`` is locked due to a limitation in
> +the current locking helpers, that is typically not done. Also see
> +below for userptr gpu_vmas.
> +
> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
> +object, but we can no longer be certain that we hold the gpu_vm's
> +dma_resv of all the object's gpu_vmas. We can only be certain that we

I need to think a bit more about locking of extobj and evicted object tracking
in the case of processing 'drm_gpuva_ops' directly through callbacks within the
fence signalling critical path as mentioend in [1].

In order to support that, we'd need to protect extobjs with a separate lock,
and while iterating extobjs to acquire the dma-resv lock drop the lock within
the loop before we actually acquire the dma-resv lock. Maple tree supports that
already and this can be fully done within the GPUVA manager; no need for the
driver to care about that.

While, as already mentioned, I'd really love to support that, I noticed that we
have a similar issue with tracking evicted objects. There are (similar) ways to
deal with that, however, it drastically increases complexity.

Hence, I'd like to reconsider whether it's worth supporting it in the first
place. Most of the arguments in order to support it are for decreasing
complexity. However, if it increases complexity elsewhere, it's probably not
worth. The only argument left would be for synchronous bind jobs which could
be injected at any point of time without the need to be queued up in the
scheduler to preserve ordering. However, I'm not yet sure how important this
would be. For Xe it doesn't really seem to be a concern I guess?

[1] https://lore.kernel.org/dri-devel/202308221050.kTj8uFMA-lkp@intel.com/T/#m7f3b5a7ff70723332adeea32671578cb95c62f7c

> +hold the object's private dma_resv. We can trylock the dma_resvs for
> +the affected gpu_vm's but that might be unnecessarily complex. If we
> +have a ww_acquire context at hand at eviction time we can also perform
> +sleeping locks of those dma_resvs but that could cause expensive
> +rollbacks. One option is to just mark the invalidated gpu_vmas with a bool
> +which is inspected on the next exec function, when the gpu_vm's
> +dma_resv and the object's dma_resv is held, and the invalidated
> +gpu_vmas could then be put on the gpu_vm's list of invalidated
> +gpu_vmas. That bool would then, although being per-gpu_vma formally be
> +protected by the object's dma_resv.
> +
> +The exec function would then look something like the following:
> +
> +.. code-block:: C
> +
> +   read_lock(&gpu_vm->lock);
> +
> +   dma_resv_lock(&gpu_vm->resv);
> +
> +   // Shared object list is protected by the gpu_vm->lock.
> +   for_each_shared_obj(gpu_vm, &obj) {
> +		dma_resv_lock(&obj->resv);
> +		move_marked_gpu_vmas_to_revalidate_gpu_vma_list(obj, &gpu_vm);
> +   }
> +
> +   for_each_gpu_vma_to_revalidate(gpu_vm, &gpu_vma) {
> +		revalidate_gpu_vma(&gpu_vma);
> +		remove_from_revalidate_list(&gpu_vma);
> +   }
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +   dma_resv_unlock_all_resv_locks();
> +
> +   read_unlock(&gpu_vm->lock);
> +
> +And the corresponding shared-object aware eviction would look like:
> +
> +.. code-block:: C
> +
> +   obj = get_object_from_lru();
> +
> +   dma_resv_lock(obj->resv);
> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
> +		if (object_is_vm_local(obj))
> +		             put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
> +		else
> +		             mark_gpu_vma_for_revalidation(&gpu_vma);
> +
> +   add_dependencies(&eviction_job, &obj->resv);
> +   job_dma_fence = gpu_submit(&eviction_job);
> +   add_dma_fence(&obj->resv, job_dma_fence);
> +
> +   dma_resv_unlock(&obj->resv);
> +   put_object(obj);
> +
> +Yet another option is to put the gpu_vmas to be invalidated on a separate
> +gpu_vm list protected by a lower level lock that can be taken both at eviction
> +time and at transfer-to-revalidate list time. The details are not in
> +this document, but this for reference implemented in the Intel xe
> +driver.
> +
> +Introducing userptr gpu_vmas
> +============================
> +
> +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
> +GPU virtual address range, directly maps a CPU mm range of anonymous-
> +or file page-cache pages.
> +A very simple approach would be to just pin the pages using
> +pin_user_pages() at bind time and unpin them at unbind time, but this
> +creates a Denial-Of-Service vector since a single user-space process
> +would be able to pin down all of system memory, which is not
> +desirable. (For special use-cases and with proper accounting pinning might
> +still be a desirable feature, though). What we need to do in the general case is
> +to obtain a reference to the desired pages, make sure we are notified
> +using a MMU notifier just before the CPU mm unmaps the pages, dirty
> +them if they are not mapped read-only to the GPU, and then drop the reference.
> +When we are notified by the MMU notifier that CPU mm is about to drop the
> +pages, we need to stop GPU access to the pages,
> +GPU page-table and make sure that before the next time the GPU tries to access
> +whatever is now present in the CPU mm range, we unmap the old pages
> +from the GPU page tables and repeat the process of obtaining new page
> +references. Note that when the core mm decides to laundry pages, we get such
> +an unmap MMU notification and can mark the pages dirty again before the
> +next GPU access. We also get similar MMU notifications for NUMA accounting
> +which the GPU driver doesn't really need to care about, but so far
> +it's proven difficult to exclude certain notifications.
> +
> +Using a MMU notifier for device DMA (and other methods) is described in
> +`this document
> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
> +
> +Now the method of obtaining struct page references using
> +get_user_pages() unfortunately can't be used under a dma_resv lock
> +since that would violate the locking order of the dma_resv lock vs the
> +mmap_lock that is grabbed when resolving a CPU pagefault. This means the gpu_vm's
> +list of userptr gpu_vmas needs to be protected by an outer lock, and this
> +is the first time we strictly need the gpu_vm->lock. While it was
> +previously used also to protect the list of the gpu_vm's shared objects,
> +we could in theory have used the gpu_vm->resv for that.
> +
> +The MMU interval seqlock for a userptr gpu_vma is used in the following
> +way:
> +
> +.. code-block:: C
> +
> +   down_read(&gpu_vm->lock);
> +
> +   retry:
> +
> +   // Note: mmu_interval_read_begin() blocks until there is no
> +   // invalidation notifier running anymore.
> +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
> +   if (seq != gpu_vma->saved_seq) {
> +           obtain_new_page_pointers(&gpu_vma);
> +	   dma_resv_lock(&gpu_vm->resv);
> +	   put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
> +	   dma_resv_unlock(&gpu_vm->resv);
> +	   gpu_vma->saved_seq = seq;
> +   }
> +
> +   // The usual revalidation goes here.
> +
> +   // Final userptr sequence validation may not happen before the
> +   // submission dma_fence is added to the gpu_vm's resv, from the POW
> +   // of the MMU invalidation notifier. Hence the
> +   // userptr_notifier_lock that will make them appear atomic.
> +
> +   add_dependencies(&gpu_job, &gpu_vm->resv);
> +   down_read(&gpu_vm->userptr_notifier_lock);
> +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
> +          up_read(&gpu_vm->userptr_notifier_lock);
> +	  goto retry;
> +   }
> +
> +   job_dma_fence = gpu_submit(&gpu_job));
> +
> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
> +
> +   for_each_shared_obj(gpu_vm, &obj)
> +          add_dma_fence(job_dma_fence, &obj->resv);
> +
> +   dma_resv_unlock_all_resv_locks();
> +   up_read(&gpu_vm->userptr_notifier_lock);
> +   up_read(&gpu_vm->lock);
> +
> +The code between ``mmu_interval_read_begin()`` and the
> +``mmu_interval_read_retry()`` marks the read side critical section of
> +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
> +gpu_vma list is looped through, and the check is done for *all* of its
> +userptr gpu_vmas, although we only show a single one here.
> +
> +The userptr gpu_vma MMU invalidation notifier might be called from
> +reclaim context and, again to avoid locking order violations, we can't
> +take any dma_resv lock nor the gpu_vm->lock from within it.
> +
> +.. code-block:: C
> +
> +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
> +  {
> +          // Make sure the exec function either sees the new sequence
> +	  // and backs off or we wait for the dma-fence:
> +
> +          down_write(&gpu_vm->userptr_notifier_lock);
> +	  mmu_interval_set_seq(userptr_interval, cur_seq);
> +	  up_write(&gpu_vm->userptr_notifier_lock);
> +
> +	  dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
> +		                false, MAX_SCHEDULE_TIMEOUT);
> +	  return true;
> +  }
> +
> +When this invalidation notifier returns, the GPU can no longer be
> +accessing the old pages of the userptr gpu_vma and needs to redo the page-binding
> +before a new GPU submission can succeed.
> +
> +Optimizing gpu_vma iteration
> +----------------------------
> +
> +Iterating through all of a gpu_vm's userptr gpu_vmas to check the validity
> +on each exec function may be very costly. There is a scheme to avoid
> +this and only iterate through the userptr gpu_vmas that actually saw an
> +invalidation notifier call since the last exec. T
> +
> +TODO: describe that scheme here. It's implemented in the xe driver.
> +
> +Locking for page-table updates at bind- and unbind time
> +=======================================================
> +
> +TODO.
> +
> +Recoverable page-fault implications
> +===================================
> +
> +TODO.
> -- 
> 2.41.0
>

Thomas Hellström Sept. 6, 2023, 7:06 a.m. UTC | #4

Hi, Danilo,

Thanks for taking a look. Comments inline.

On 9/5/23 21:50, Danilo Krummrich wrote:
> On Wed, Aug 16, 2023 at 11:15:47AM +0200, Thomas Hellström wrote:
>> Add the first version of the VM_BIND locking document which is
>> intended to be part of the xe driver upstreaming agreement.
>>
>> The document describes and discuss the locking used during exec-
>> functions, evicton and for userptr gpu-vmas. Intention is to be using the
>> same nomenclature as the drm-vm-bind-async.rst.
>>
>> v2:
>> - s/gvm/gpu_vm/g (Rodrigo Vivi)
>> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
>>    (Rodrigo Vivi)
>> - Adjust commit message accordingly.
>> - Add SPDX license header.
>>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> ---
>>   Documentation/gpu/drm-vm-bind-locking.rst | 351 ++++++++++++++++++++++
>>   1 file changed, 351 insertions(+)
>>   create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
>>
>> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
>> new file mode 100644
>> index 000000000000..b813961a9ec2
>> --- /dev/null
>> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
>> @@ -0,0 +1,351 @@
>> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
>> +
>> +===============
>> +VM_BIND locking
>> +===============
>> +
>> +This document attempts to describe what's needed to get VM_BIND locking right,
>> +including the userptr mmu_notifier locking and it will also discuss some
>> +optimizations to get rid of the looping through of all userptr mappings and
>> +external / shared object mappings that is needed in the simplest
>> +implementation. It will also discuss some implications for faulting gpu_vms.
>> +
>> +Nomenclature
>> +============
>> +
>> +* ``Context``: GPU execution context.
>> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
>> +  meta-data. Typically one per client (DRM file-private), or one per
>> +  context.
>> +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
> The same nomenclature was used within the VM_BIND async document as well. I
> wonder if it would make sense to align the naming with the GPUVA manager, such
> that ('drm_gpuva_manager' -> 'drm_gpuvm'). This would also result into better
> function names, such as drm_gpuvm_resv_lock() or drm_gpuvm_prepare_objects() and
> potentially way better naming for the VM_BO abstraction 'drm_gpuvm_bo'.
>
> However, I'd like to keep 'drm_gpuva' rather than 'drm_gpu_vma', but I think
> this is close enough anyway.

I don't have a strong opinion about the naming here and aligning with 
the GPUVA manager make sense, although perhaps the "drm_" prefix which 
makes sense for the function- and struct names may not make sense in a 
more generic document like this. What about gpuva and gpuvm?


>
>> +  associated meta-data. The backing storage of a gpu_vma can either be
>> +  a gem buffer object or anonymous pages mapped also into the CPU
>> +  address space for the process.
>> +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing store of
>> +  which is anonymous pages as described above.
>> +* ``revalidating``: Revalidating a gpu_vma means making the latest version
>> +  of the backing store resident and making sure the gpu_vma's
>> +  page-table entries point to that backing store.
>> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
>> +  and which tracks GPU activity. When the GPU activity is finished,
>> +  the dma_fence signals.
>> +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used
>> +  to track GPU activity in the form of multiple dma_fences on a
>> +  gpu_vm or a gem buffer object. The dma_resv contains an array / list
>> +  of dma_fences and a lock that needs to be held when adding
>> +  additional dma_fences to the dma_resv. The lock is of a type that
>> +  allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
>> +* ``exec function``: An exec function is a function that revalidates all
>> +  affected gpu_vmas, submits a GPU command batch and registers the
>> +  dma_fence representing the GPU command's activity with all affected
>> +  dma_resvs. For completeness, although not covered by this document,
>> +  it's worth mentioning that an exec function may also be the
>> +  revalidation worker that is used by some drivers in compute /
>> +  long-running mode.
>> +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem
>> +  objects also share the gpu_vm's dma_resv.
>> +* ``shared object``: AKA external object: A GEM object which may be shared
>> +  by multiple gpu_vms and whose backing storage may be shared with
>> +  other drivers.
>> +
>> +
>> +Introducing the locks
>> +=====================
>> +
>> +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
>> +dma_resv object and hence the dma_resv lock. So even with a huge
>> +number of local GEM objects, only one lock is needed to make the exec
>> +sequence atomic.
>> +
>> +The following locks and locking orders are used:
>> +
>> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is
>> +  partitioned into gpu_vmas, protects the gpu_vm's list of external objects,
>> +  and can also with some simplification protect the gpu_vm's list of
>> +  userptr gpu_vmas. With the CPU mm analogy this would correspond to the
>> +  mmap_lock.
>> +* The ``userptr_seqlock``. This lock is taken in read mode for each
>> +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
>> +  notifier invalidation. This is not a real seqlock but described in
>> +  ``mm/mmu_notifier.c` as a "Collision-retry read-side/write-side
>> +  'lock' a lot like a seqcount, however this allows multiple
>> +  write-sides to hold it at once...". The read side critical section
>> +  is enclosed by ``mmu_interval_read_begin() /
>> +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
>> +  sleeping uninterruptibly if the write side is held.
>> +  The write side is held by the core mm while calling mmu interval
>> +  invalidation notifiers.
>> +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
>> +  rebinding, and also the residency of all the gpu_vm's local GEM object.
>> +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is taken in read
>> +  mode during exec and write mode during a mmu notifier invalidation. In
>> +  the absence of a separate page-table lock, this lock can serve
>> +  together with the gpu_vm's dma_resv lock as a page-table lock. More on
>> +  this below. The userptr notifier lock is per gpu_vm.
>> +* The ``gpu_vm->page_table_lock``. Protects the gpu_vm's page-table updates. For
>> +  simplicity the gpu_vm's dma_resv lock can be reused as page-table lock.
>> +
>> +There are certain optimizations described below that require
>> +additional locks. More on that later.
>> +
>> +.. code-block:: C
>> +
>> +   dma_resv_lock(&gpu_vm->resv);
>> +
>> +   for_each_gpu_vma_on_revalidate_list(gpu_vm, &gpu_vma) {
>> +		revalidate_gpu_vma(&gpu_vma);
>> +		remove_from_revalidate_list(&gpu_vma);
>> +   }
>> +
>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>> +   job_dma_fence = gpu_submit(&gpu_job));
>> +
>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>> +   dma_resv_unlock(&gpu_vm->resv);
>> +
>> +Eviction of one of these local objects will then be something like the
>> +following:
>> +
>> +.. code-block:: C
>> +
>> +   obj = get_object_from_lru();
>> +
>> +   dma_resv_lock(obj->resv);
>> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
>> +		put_gpu_vma_on_revalidate_list(&gpu_vma);
>> +
>> +   add_dependencies(&eviction_job, &obj->resv);
>> +   job_dma_fence = gpu_submit(&eviction_job);
>> +   add_dma_fence(&obj->resv, job_dma_fence);
>> +
>> +   dma_resv_unlock(&obj->resv);
>> +   put_object(obj);
>> +
>> +Note that since the object is local to the gpu_vm, it will share the gpu_vm's
>> +``dma_resv`` lock so that ``obj->resv == gpu_vm->resv``. Invalidated gpu_vmas are put
>> +on the gpu_vm's revalidation list, which is protected by ``gpu_vm->resv``, which
>> +is always locked while evicting, due to the above equality.
>> +
>> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction,
>> +Since the eviction blit or copy will wait for GPU idle, any attempt by
>> +the GPU to access freed memory through the gpu_vma will be preceded by
>> +a new exec function, which will make sure the gpu_vma is
>> +revalidated. The eviction code holding the object's dma_resv while
>> +revalidating will ensure a new exec function may not race with the eviction.
>> +
>> +Introducing external (or shared) buffer objects
>> +===============================================
>> +
>> +Since shared buffer objects may be shared by multiple gpu_vm's they
>> +can't share their reservation object with a single gpu_vm, but will rather
>> +have a reservation object of their own. The shared objects bound to a
>> +gpu_vm using one or many
>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
>> +protected by the gpu_vm lock. One could in theory protect it also with
>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is typically
>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
>> +the current locking helpers, that is typically not done. Also see
>> +below for userptr gpu_vmas.
>> +
>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
>> +object, but we can no longer be certain that we hold the gpu_vm's
>> +dma_resv of all the object's gpu_vmas. We can only be certain that we
> I need to think a bit more about locking of extobj and evicted object tracking
> in the case of processing 'drm_gpuva_ops' directly through callbacks within the
> fence signalling critical path as mentioend in [1].
>
> In order to support that, we'd need to protect extobjs with a separate lock,
> and while iterating extobjs to acquire the dma-resv lock drop the lock within
> the loop before we actually acquire the dma-resv lock. Maple tree supports that
> already and this can be fully done within the GPUVA manager; no need for the
> driver to care about that.

So do I understand correctly that this because you want to update the 
gpuvm state while operations are progressing asynchronously?

If so, I wonder whether that could really be done? For example to 
allocate enough memory for page-tables etc, you need to know the details 
of the operations at IOCTL execution time, and to know the details you 
need to know the state from the previous operation?

>
> While, as already mentioned, I'd really love to support that, I noticed that we
> have a similar issue with tracking evicted objects. There are (similar) ways to
> deal with that, however, it drastically increases complexity.
>
> Hence, I'd like to reconsider whether it's worth supporting it in the first
> place. Most of the arguments in order to support it are for decreasing
> complexity. However, if it increases complexity elsewhere, it's probably not
> worth. The only argument left would be for synchronous bind jobs which could
> be injected at any point of time without the need to be queued up in the
> scheduler to preserve ordering. However, I'm not yet sure how important this
> would be. For Xe it doesn't really seem to be a concern I guess?
Xe supports that functionality via separate bind queues. If you queue 
most of the operations using one queue, you can inject synchronous bind 
jobs using another. Ideally they execute separately, but they are not 
guaranteed to do that.
>
> [1] https://lore.kernel.org/dri-devel/202308221050.kTj8uFMA-lkp@intel.com/T/#m7f3b5a7ff70723332adeea32671578cb95c62f7c
>
>> +hold the object's private dma_resv. We can trylock the dma_resvs for
>> +the affected gpu_vm's but that might be unnecessarily complex. If we
>> +have a ww_acquire context at hand at eviction time we can also perform
>> +sleeping locks of those dma_resvs but that could cause expensive
>> +rollbacks. One option is to just mark the invalidated gpu_vmas with a bool
>> +which is inspected on the next exec function, when the gpu_vm's
>> +dma_resv and the object's dma_resv is held, and the invalidated
>> +gpu_vmas could then be put on the gpu_vm's list of invalidated
>> +gpu_vmas. That bool would then, although being per-gpu_vma formally be
>> +protected by the object's dma_resv.
>> +
>> +The exec function would then look something like the following:
>> +
>> +.. code-block:: C
>> +
>> +   read_lock(&gpu_vm->lock);
>> +
>> +   dma_resv_lock(&gpu_vm->resv);
>> +
>> +   // Shared object list is protected by the gpu_vm->lock.
>> +   for_each_shared_obj(gpu_vm, &obj) {
>> +		dma_resv_lock(&obj->resv);
>> +		move_marked_gpu_vmas_to_revalidate_gpu_vma_list(obj, &gpu_vm);
>> +   }
>> +
>> +   for_each_gpu_vma_to_revalidate(gpu_vm, &gpu_vma) {
>> +		revalidate_gpu_vma(&gpu_vma);
>> +		remove_from_revalidate_list(&gpu_vma);
>> +   }
>> +
>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>> +   job_dma_fence = gpu_submit(&gpu_job));
>> +
>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>> +   for_each_shared_obj(gpu_vm, &obj)
>> +          add_dma_fence(job_dma_fence, &obj->resv);
>> +   dma_resv_unlock_all_resv_locks();
>> +
>> +   read_unlock(&gpu_vm->lock);
>> +
>> +And the corresponding shared-object aware eviction would look like:
>> +
>> +.. code-block:: C
>> +
>> +   obj = get_object_from_lru();
>> +
>> +   dma_resv_lock(obj->resv);
>> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
>> +		if (object_is_vm_local(obj))
>> +		             put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
>> +		else
>> +		             mark_gpu_vma_for_revalidation(&gpu_vma);
>> +
>> +   add_dependencies(&eviction_job, &obj->resv);
>> +   job_dma_fence = gpu_submit(&eviction_job);
>> +   add_dma_fence(&obj->resv, job_dma_fence);
>> +
>> +   dma_resv_unlock(&obj->resv);
>> +   put_object(obj);
>> +
>> +Yet another option is to put the gpu_vmas to be invalidated on a separate
>> +gpu_vm list protected by a lower level lock that can be taken both at eviction
>> +time and at transfer-to-revalidate list time. The details are not in
>> +this document, but this for reference implemented in the Intel xe
>> +driver.
>> +
>> +Introducing userptr gpu_vmas
>> +============================
>> +
>> +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
>> +GPU virtual address range, directly maps a CPU mm range of anonymous-
>> +or file page-cache pages.
>> +A very simple approach would be to just pin the pages using
>> +pin_user_pages() at bind time and unpin them at unbind time, but this
>> +creates a Denial-Of-Service vector since a single user-space process
>> +would be able to pin down all of system memory, which is not
>> +desirable. (For special use-cases and with proper accounting pinning might
>> +still be a desirable feature, though). What we need to do in the general case is
>> +to obtain a reference to the desired pages, make sure we are notified
>> +using a MMU notifier just before the CPU mm unmaps the pages, dirty
>> +them if they are not mapped read-only to the GPU, and then drop the reference.
>> +When we are notified by the MMU notifier that CPU mm is about to drop the
>> +pages, we need to stop GPU access to the pages,
>> +GPU page-table and make sure that before the next time the GPU tries to access
>> +whatever is now present in the CPU mm range, we unmap the old pages
>> +from the GPU page tables and repeat the process of obtaining new page
>> +references. Note that when the core mm decides to laundry pages, we get such
>> +an unmap MMU notification and can mark the pages dirty again before the
>> +next GPU access. We also get similar MMU notifications for NUMA accounting
>> +which the GPU driver doesn't really need to care about, but so far
>> +it's proven difficult to exclude certain notifications.
>> +
>> +Using a MMU notifier for device DMA (and other methods) is described in
>> +`this document
>> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
>> +
>> +Now the method of obtaining struct page references using
>> +get_user_pages() unfortunately can't be used under a dma_resv lock
>> +since that would violate the locking order of the dma_resv lock vs the
>> +mmap_lock that is grabbed when resolving a CPU pagefault. This means the gpu_vm's
>> +list of userptr gpu_vmas needs to be protected by an outer lock, and this
>> +is the first time we strictly need the gpu_vm->lock. While it was
>> +previously used also to protect the list of the gpu_vm's shared objects,
>> +we could in theory have used the gpu_vm->resv for that.
>> +
>> +The MMU interval seqlock for a userptr gpu_vma is used in the following
>> +way:
>> +
>> +.. code-block:: C
>> +
>> +   down_read(&gpu_vm->lock);
>> +
>> +   retry:
>> +
>> +   // Note: mmu_interval_read_begin() blocks until there is no
>> +   // invalidation notifier running anymore.
>> +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
>> +   if (seq != gpu_vma->saved_seq) {
>> +           obtain_new_page_pointers(&gpu_vma);
>> +	   dma_resv_lock(&gpu_vm->resv);
>> +	   put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
>> +	   dma_resv_unlock(&gpu_vm->resv);
>> +	   gpu_vma->saved_seq = seq;
>> +   }
>> +
>> +   // The usual revalidation goes here.
>> +
>> +   // Final userptr sequence validation may not happen before the
>> +   // submission dma_fence is added to the gpu_vm's resv, from the POW
>> +   // of the MMU invalidation notifier. Hence the
>> +   // userptr_notifier_lock that will make them appear atomic.
>> +
>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>> +   down_read(&gpu_vm->userptr_notifier_lock);
>> +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
>> +          up_read(&gpu_vm->userptr_notifier_lock);
>> +	  goto retry;
>> +   }
>> +
>> +   job_dma_fence = gpu_submit(&gpu_job));
>> +
>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>> +
>> +   for_each_shared_obj(gpu_vm, &obj)
>> +          add_dma_fence(job_dma_fence, &obj->resv);
>> +
>> +   dma_resv_unlock_all_resv_locks();
>> +   up_read(&gpu_vm->userptr_notifier_lock);
>> +   up_read(&gpu_vm->lock);
>> +
>> +The code between ``mmu_interval_read_begin()`` and the
>> +``mmu_interval_read_retry()`` marks the read side critical section of
>> +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
>> +gpu_vma list is looped through, and the check is done for *all* of its
>> +userptr gpu_vmas, although we only show a single one here.
>> +
>> +The userptr gpu_vma MMU invalidation notifier might be called from
>> +reclaim context and, again to avoid locking order violations, we can't
>> +take any dma_resv lock nor the gpu_vm->lock from within it.
>> +
>> +.. code-block:: C
>> +
>> +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
>> +  {
>> +          // Make sure the exec function either sees the new sequence
>> +	  // and backs off or we wait for the dma-fence:
>> +
>> +          down_write(&gpu_vm->userptr_notifier_lock);
>> +	  mmu_interval_set_seq(userptr_interval, cur_seq);
>> +	  up_write(&gpu_vm->userptr_notifier_lock);
>> +
>> +	  dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
>> +		                false, MAX_SCHEDULE_TIMEOUT);
>> +	  return true;
>> +  }
>> +
>> +When this invalidation notifier returns, the GPU can no longer be
>> +accessing the old pages of the userptr gpu_vma and needs to redo the page-binding
>> +before a new GPU submission can succeed.
>> +
>> +Optimizing gpu_vma iteration
>> +----------------------------
>> +
>> +Iterating through all of a gpu_vm's userptr gpu_vmas to check the validity
>> +on each exec function may be very costly. There is a scheme to avoid
>> +this and only iterate through the userptr gpu_vmas that actually saw an
>> +invalidation notifier call since the last exec. T
>> +
>> +TODO: describe that scheme here. It's implemented in the xe driver.
>> +
>> +Locking for page-table updates at bind- and unbind time
>> +=======================================================
>> +
>> +TODO.
>> +
>> +Recoverable page-fault implications
>> +===================================
>> +
>> +TODO.
>> -- 
>> 2.41.0
>>

Danilo Krummrich Sept. 6, 2023, 8 a.m. UTC | #5

On 9/6/23 09:06, Thomas Hellström wrote:
> Hi, Danilo,
> 
> Thanks for taking a look. Comments inline.
> 
> On 9/5/23 21:50, Danilo Krummrich wrote:
>> On Wed, Aug 16, 2023 at 11:15:47AM +0200, Thomas Hellström wrote:
>>> Add the first version of the VM_BIND locking document which is
>>> intended to be part of the xe driver upstreaming agreement.
>>>
>>> The document describes and discuss the locking used during exec-
>>> functions, evicton and for userptr gpu-vmas. Intention is to be using the
>>> same nomenclature as the drm-vm-bind-async.rst.
>>>
>>> v2:
>>> - s/gvm/gpu_vm/g (Rodrigo Vivi)
>>> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
>>>    (Rodrigo Vivi)
>>> - Adjust commit message accordingly.
>>> - Add SPDX license header.
>>>
>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>> ---
>>>   Documentation/gpu/drm-vm-bind-locking.rst | 351 ++++++++++++++++++++++
>>>   1 file changed, 351 insertions(+)
>>>   create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
>>>
>>> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
>>> new file mode 100644
>>> index 000000000000..b813961a9ec2
>>> --- /dev/null
>>> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
>>> @@ -0,0 +1,351 @@
>>> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
>>> +
>>> +===============
>>> +VM_BIND locking
>>> +===============
>>> +
>>> +This document attempts to describe what's needed to get VM_BIND locking right,
>>> +including the userptr mmu_notifier locking and it will also discuss some
>>> +optimizations to get rid of the looping through of all userptr mappings and
>>> +external / shared object mappings that is needed in the simplest
>>> +implementation. It will also discuss some implications for faulting gpu_vms.
>>> +
>>> +Nomenclature
>>> +============
>>> +
>>> +* ``Context``: GPU execution context.
>>> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
>>> +  meta-data. Typically one per client (DRM file-private), or one per
>>> +  context.
>>> +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
>> The same nomenclature was used within the VM_BIND async document as well. I
>> wonder if it would make sense to align the naming with the GPUVA manager, such
>> that ('drm_gpuva_manager' -> 'drm_gpuvm'). This would also result into better
>> function names, such as drm_gpuvm_resv_lock() or drm_gpuvm_prepare_objects() and
>> potentially way better naming for the VM_BO abstraction 'drm_gpuvm_bo'.
>>
>> However, I'd like to keep 'drm_gpuva' rather than 'drm_gpu_vma', but I think
>> this is close enough anyway.
> 
> I don't have a strong opinion about the naming here and aligning with the GPUVA manager make sense, although perhaps the "drm_" prefix which makes sense for the function- and struct names may not make sense in a more generic document like this. What about gpuva and gpuvm?

Oh, I think the document is fine as it is. This was more like me thinking loud
about renaming things in the GPUVA manager accordingly.

> 
> 
>>
>>> +  associated meta-data. The backing storage of a gpu_vma can either be
>>> +  a gem buffer object or anonymous pages mapped also into the CPU
>>> +  address space for the process.
>>> +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing store of
>>> +  which is anonymous pages as described above.
>>> +* ``revalidating``: Revalidating a gpu_vma means making the latest version
>>> +  of the backing store resident and making sure the gpu_vma's
>>> +  page-table entries point to that backing store.
>>> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
>>> +  and which tracks GPU activity. When the GPU activity is finished,
>>> +  the dma_fence signals.
>>> +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used
>>> +  to track GPU activity in the form of multiple dma_fences on a
>>> +  gpu_vm or a gem buffer object. The dma_resv contains an array / list
>>> +  of dma_fences and a lock that needs to be held when adding
>>> +  additional dma_fences to the dma_resv. The lock is of a type that
>>> +  allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
>>> +* ``exec function``: An exec function is a function that revalidates all
>>> +  affected gpu_vmas, submits a GPU command batch and registers the
>>> +  dma_fence representing the GPU command's activity with all affected
>>> +  dma_resvs. For completeness, although not covered by this document,
>>> +  it's worth mentioning that an exec function may also be the
>>> +  revalidation worker that is used by some drivers in compute /
>>> +  long-running mode.
>>> +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem
>>> +  objects also share the gpu_vm's dma_resv.
>>> +* ``shared object``: AKA external object: A GEM object which may be shared
>>> +  by multiple gpu_vms and whose backing storage may be shared with
>>> +  other drivers.
>>> +
>>> +
>>> +Introducing the locks
>>> +=====================
>>> +
>>> +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
>>> +dma_resv object and hence the dma_resv lock. So even with a huge
>>> +number of local GEM objects, only one lock is needed to make the exec
>>> +sequence atomic.
>>> +
>>> +The following locks and locking orders are used:
>>> +
>>> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is
>>> +  partitioned into gpu_vmas, protects the gpu_vm's list of external objects,
>>> +  and can also with some simplification protect the gpu_vm's list of
>>> +  userptr gpu_vmas. With the CPU mm analogy this would correspond to the
>>> +  mmap_lock.
>>> +* The ``userptr_seqlock``. This lock is taken in read mode for each
>>> +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
>>> +  notifier invalidation. This is not a real seqlock but described in
>>> +  ``mm/mmu_notifier.c` as a "Collision-retry read-side/write-side
>>> +  'lock' a lot like a seqcount, however this allows multiple
>>> +  write-sides to hold it at once...". The read side critical section
>>> +  is enclosed by ``mmu_interval_read_begin() /
>>> +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
>>> +  sleeping uninterruptibly if the write side is held.
>>> +  The write side is held by the core mm while calling mmu interval
>>> +  invalidation notifiers.
>>> +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
>>> +  rebinding, and also the residency of all the gpu_vm's local GEM object.
>>> +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is taken in read
>>> +  mode during exec and write mode during a mmu notifier invalidation. In
>>> +  the absence of a separate page-table lock, this lock can serve
>>> +  together with the gpu_vm's dma_resv lock as a page-table lock. More on
>>> +  this below. The userptr notifier lock is per gpu_vm.
>>> +* The ``gpu_vm->page_table_lock``. Protects the gpu_vm's page-table updates. For
>>> +  simplicity the gpu_vm's dma_resv lock can be reused as page-table lock.
>>> +
>>> +There are certain optimizations described below that require
>>> +additional locks. More on that later.
>>> +
>>> +.. code-block:: C
>>> +
>>> +   dma_resv_lock(&gpu_vm->resv);
>>> +
>>> +   for_each_gpu_vma_on_revalidate_list(gpu_vm, &gpu_vma) {
>>> +        revalidate_gpu_vma(&gpu_vma);
>>> +        remove_from_revalidate_list(&gpu_vma);
>>> +   }
>>> +
>>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>>> +   job_dma_fence = gpu_submit(&gpu_job));
>>> +
>>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>>> +   dma_resv_unlock(&gpu_vm->resv);
>>> +
>>> +Eviction of one of these local objects will then be something like the
>>> +following:
>>> +
>>> +.. code-block:: C
>>> +
>>> +   obj = get_object_from_lru();
>>> +
>>> +   dma_resv_lock(obj->resv);
>>> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
>>> +        put_gpu_vma_on_revalidate_list(&gpu_vma);
>>> +
>>> +   add_dependencies(&eviction_job, &obj->resv);
>>> +   job_dma_fence = gpu_submit(&eviction_job);
>>> +   add_dma_fence(&obj->resv, job_dma_fence);
>>> +
>>> +   dma_resv_unlock(&obj->resv);
>>> +   put_object(obj);
>>> +
>>> +Note that since the object is local to the gpu_vm, it will share the gpu_vm's
>>> +``dma_resv`` lock so that ``obj->resv == gpu_vm->resv``. Invalidated gpu_vmas are put
>>> +on the gpu_vm's revalidation list, which is protected by ``gpu_vm->resv``, which
>>> +is always locked while evicting, due to the above equality.
>>> +
>>> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction,
>>> +Since the eviction blit or copy will wait for GPU idle, any attempt by
>>> +the GPU to access freed memory through the gpu_vma will be preceded by
>>> +a new exec function, which will make sure the gpu_vma is
>>> +revalidated. The eviction code holding the object's dma_resv while
>>> +revalidating will ensure a new exec function may not race with the eviction.
>>> +
>>> +Introducing external (or shared) buffer objects
>>> +===============================================
>>> +
>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
>>> +can't share their reservation object with a single gpu_vm, but will rather
>>> +have a reservation object of their own. The shared objects bound to a
>>> +gpu_vm using one or many
>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
>>> +protected by the gpu_vm lock. One could in theory protect it also with
>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is typically
>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
>>> +the current locking helpers, that is typically not done. Also see
>>> +below for userptr gpu_vmas.
>>> +
>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
>>> +object, but we can no longer be certain that we hold the gpu_vm's
>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we
>> I need to think a bit more about locking of extobj and evicted object tracking
>> in the case of processing 'drm_gpuva_ops' directly through callbacks within the
>> fence signalling critical path as mentioend in [1].
>>
>> In order to support that, we'd need to protect extobjs with a separate lock,
>> and while iterating extobjs to acquire the dma-resv lock drop the lock within
>> the loop before we actually acquire the dma-resv lock. Maple tree supports that
>> already and this can be fully done within the GPUVA manager; no need for the
>> driver to care about that.
> 
> So do I understand correctly that this because you want to update the gpuvm state while operations are progressing asynchronously?
> 
> If so, I wonder whether that could really be done? For example to allocate enough memory for page-tables etc, you need to know the details of the operations at IOCTL execution time, and to know the details you need to know the state from the previous operation?

Right, sync and async bind can't run fully concurrently, but you could "inject" a
sync one between two async ones such that the sync ones executed from the IOCTL
directly while async execution is stalled meanwhile. This would be possible because
the actual drm_gpuva_ops would be calculated within the async execution path rather
than in the IOCTL. But yes, page-table management must be desinged to support that.

> 
>>
>> While, as already mentioned, I'd really love to support that, I noticed that we
>> have a similar issue with tracking evicted objects. There are (similar) ways to
>> deal with that, however, it drastically increases complexity.
>>
>> Hence, I'd like to reconsider whether it's worth supporting it in the first
>> place. Most of the arguments in order to support it are for decreasing
>> complexity. However, if it increases complexity elsewhere, it's probably not
>> worth. The only argument left would be for synchronous bind jobs which could
>> be injected at any point of time without the need to be queued up in the
>> scheduler to preserve ordering. However, I'm not yet sure how important this
>> would be. For Xe it doesn't really seem to be a concern I guess?
> Xe supports that functionality via separate bind queues. If you queue most of the operations using one queue, you can inject synchronous bind jobs using another. Ideally they execute separately, but they are not guaranteed to do that.

Ok, but the separate bind queue would still work in the same asynchronous way, as
in the job is submitted to some kind of worker and the IOCTL just blocks until
completion, right?

>>
>> [1] https://lore.kernel.org/dri-devel/202308221050.kTj8uFMA-lkp@intel.com/T/#m7f3b5a7ff70723332adeea32671578cb95c62f7c
>>
>>> +hold the object's private dma_resv. We can trylock the dma_resvs for
>>> +the affected gpu_vm's but that might be unnecessarily complex. If we
>>> +have a ww_acquire context at hand at eviction time we can also perform
>>> +sleeping locks of those dma_resvs but that could cause expensive
>>> +rollbacks. One option is to just mark the invalidated gpu_vmas with a bool
>>> +which is inspected on the next exec function, when the gpu_vm's
>>> +dma_resv and the object's dma_resv is held, and the invalidated
>>> +gpu_vmas could then be put on the gpu_vm's list of invalidated
>>> +gpu_vmas. That bool would then, although being per-gpu_vma formally be
>>> +protected by the object's dma_resv.
>>> +
>>> +The exec function would then look something like the following:
>>> +
>>> +.. code-block:: C
>>> +
>>> +   read_lock(&gpu_vm->lock);
>>> +
>>> +   dma_resv_lock(&gpu_vm->resv);
>>> +
>>> +   // Shared object list is protected by the gpu_vm->lock.
>>> +   for_each_shared_obj(gpu_vm, &obj) {
>>> +        dma_resv_lock(&obj->resv);
>>> +        move_marked_gpu_vmas_to_revalidate_gpu_vma_list(obj, &gpu_vm);
>>> +   }
>>> +
>>> +   for_each_gpu_vma_to_revalidate(gpu_vm, &gpu_vma) {
>>> +        revalidate_gpu_vma(&gpu_vma);
>>> +        remove_from_revalidate_list(&gpu_vma);
>>> +   }
>>> +
>>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>>> +   job_dma_fence = gpu_submit(&gpu_job));
>>> +
>>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>>> +   for_each_shared_obj(gpu_vm, &obj)
>>> +          add_dma_fence(job_dma_fence, &obj->resv);
>>> +   dma_resv_unlock_all_resv_locks();
>>> +
>>> +   read_unlock(&gpu_vm->lock);
>>> +
>>> +And the corresponding shared-object aware eviction would look like:
>>> +
>>> +.. code-block:: C
>>> +
>>> +   obj = get_object_from_lru();
>>> +
>>> +   dma_resv_lock(obj->resv);
>>> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
>>> +        if (object_is_vm_local(obj))
>>> +                     put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
>>> +        else
>>> +                     mark_gpu_vma_for_revalidation(&gpu_vma);
>>> +
>>> +   add_dependencies(&eviction_job, &obj->resv);
>>> +   job_dma_fence = gpu_submit(&eviction_job);
>>> +   add_dma_fence(&obj->resv, job_dma_fence);
>>> +
>>> +   dma_resv_unlock(&obj->resv);
>>> +   put_object(obj);
>>> +
>>> +Yet another option is to put the gpu_vmas to be invalidated on a separate
>>> +gpu_vm list protected by a lower level lock that can be taken both at eviction
>>> +time and at transfer-to-revalidate list time. The details are not in
>>> +this document, but this for reference implemented in the Intel xe
>>> +driver.
>>> +
>>> +Introducing userptr gpu_vmas
>>> +============================
>>> +
>>> +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
>>> +GPU virtual address range, directly maps a CPU mm range of anonymous-
>>> +or file page-cache pages.
>>> +A very simple approach would be to just pin the pages using
>>> +pin_user_pages() at bind time and unpin them at unbind time, but this
>>> +creates a Denial-Of-Service vector since a single user-space process
>>> +would be able to pin down all of system memory, which is not
>>> +desirable. (For special use-cases and with proper accounting pinning might
>>> +still be a desirable feature, though). What we need to do in the general case is
>>> +to obtain a reference to the desired pages, make sure we are notified
>>> +using a MMU notifier just before the CPU mm unmaps the pages, dirty
>>> +them if they are not mapped read-only to the GPU, and then drop the reference.
>>> +When we are notified by the MMU notifier that CPU mm is about to drop the
>>> +pages, we need to stop GPU access to the pages,
>>> +GPU page-table and make sure that before the next time the GPU tries to access
>>> +whatever is now present in the CPU mm range, we unmap the old pages
>>> +from the GPU page tables and repeat the process of obtaining new page
>>> +references. Note that when the core mm decides to laundry pages, we get such
>>> +an unmap MMU notification and can mark the pages dirty again before the
>>> +next GPU access. We also get similar MMU notifications for NUMA accounting
>>> +which the GPU driver doesn't really need to care about, but so far
>>> +it's proven difficult to exclude certain notifications.
>>> +
>>> +Using a MMU notifier for device DMA (and other methods) is described in
>>> +`this document
>>> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
>>> +
>>> +Now the method of obtaining struct page references using
>>> +get_user_pages() unfortunately can't be used under a dma_resv lock
>>> +since that would violate the locking order of the dma_resv lock vs the
>>> +mmap_lock that is grabbed when resolving a CPU pagefault. This means the gpu_vm's
>>> +list of userptr gpu_vmas needs to be protected by an outer lock, and this
>>> +is the first time we strictly need the gpu_vm->lock. While it was
>>> +previously used also to protect the list of the gpu_vm's shared objects,
>>> +we could in theory have used the gpu_vm->resv for that.
>>> +
>>> +The MMU interval seqlock for a userptr gpu_vma is used in the following
>>> +way:
>>> +
>>> +.. code-block:: C
>>> +
>>> +   down_read(&gpu_vm->lock);
>>> +
>>> +   retry:
>>> +
>>> +   // Note: mmu_interval_read_begin() blocks until there is no
>>> +   // invalidation notifier running anymore.
>>> +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
>>> +   if (seq != gpu_vma->saved_seq) {
>>> +           obtain_new_page_pointers(&gpu_vma);
>>> +       dma_resv_lock(&gpu_vm->resv);
>>> +       put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
>>> +       dma_resv_unlock(&gpu_vm->resv);
>>> +       gpu_vma->saved_seq = seq;
>>> +   }
>>> +
>>> +   // The usual revalidation goes here.
>>> +
>>> +   // Final userptr sequence validation may not happen before the
>>> +   // submission dma_fence is added to the gpu_vm's resv, from the POW
>>> +   // of the MMU invalidation notifier. Hence the
>>> +   // userptr_notifier_lock that will make them appear atomic.
>>> +
>>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>>> +   down_read(&gpu_vm->userptr_notifier_lock);
>>> +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
>>> +          up_read(&gpu_vm->userptr_notifier_lock);
>>> +      goto retry;
>>> +   }
>>> +
>>> +   job_dma_fence = gpu_submit(&gpu_job));
>>> +
>>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>>> +
>>> +   for_each_shared_obj(gpu_vm, &obj)
>>> +          add_dma_fence(job_dma_fence, &obj->resv);
>>> +
>>> +   dma_resv_unlock_all_resv_locks();
>>> +   up_read(&gpu_vm->userptr_notifier_lock);
>>> +   up_read(&gpu_vm->lock);
>>> +
>>> +The code between ``mmu_interval_read_begin()`` and the
>>> +``mmu_interval_read_retry()`` marks the read side critical section of
>>> +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
>>> +gpu_vma list is looped through, and the check is done for *all* of its
>>> +userptr gpu_vmas, although we only show a single one here.
>>> +
>>> +The userptr gpu_vma MMU invalidation notifier might be called from
>>> +reclaim context and, again to avoid locking order violations, we can't
>>> +take any dma_resv lock nor the gpu_vm->lock from within it.
>>> +
>>> +.. code-block:: C
>>> +
>>> +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
>>> +  {
>>> +          // Make sure the exec function either sees the new sequence
>>> +      // and backs off or we wait for the dma-fence:
>>> +
>>> +          down_write(&gpu_vm->userptr_notifier_lock);
>>> +      mmu_interval_set_seq(userptr_interval, cur_seq);
>>> +      up_write(&gpu_vm->userptr_notifier_lock);
>>> +
>>> +      dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
>>> +                        false, MAX_SCHEDULE_TIMEOUT);
>>> +      return true;
>>> +  }
>>> +
>>> +When this invalidation notifier returns, the GPU can no longer be
>>> +accessing the old pages of the userptr gpu_vma and needs to redo the page-binding
>>> +before a new GPU submission can succeed.
>>> +
>>> +Optimizing gpu_vma iteration
>>> +----------------------------
>>> +
>>> +Iterating through all of a gpu_vm's userptr gpu_vmas to check the validity
>>> +on each exec function may be very costly. There is a scheme to avoid
>>> +this and only iterate through the userptr gpu_vmas that actually saw an
>>> +invalidation notifier call since the last exec. T
>>> +
>>> +TODO: describe that scheme here. It's implemented in the xe driver.
>>> +
>>> +Locking for page-table updates at bind- and unbind time
>>> +=======================================================
>>> +
>>> +TODO.
>>> +
>>> +Recoverable page-fault implications
>>> +===================================
>>> +
>>> +TODO.
>>> -- 
>>> 2.41.0
>>>
>

Thomas Hellström Sept. 6, 2023, 8:32 a.m. UTC | #6

On 9/6/23 10:00, Danilo Krummrich wrote:
> On 9/6/23 09:06, Thomas Hellström wrote:
>> Hi, Danilo,
>>
>> Thanks for taking a look. Comments inline.
>>
>> On 9/5/23 21:50, Danilo Krummrich wrote:
>>> On Wed, Aug 16, 2023 at 11:15:47AM +0200, Thomas Hellström wrote:
>>>> Add the first version of the VM_BIND locking document which is
>>>> intended to be part of the xe driver upstreaming agreement.
>>>>
>>>> The document describes and discuss the locking used during exec-
>>>> functions, evicton and for userptr gpu-vmas. Intention is to be 
>>>> using the
>>>> same nomenclature as the drm-vm-bind-async.rst.
>>>>
>>>> v2:
>>>>
>>>>
>>>> - s/gvm/gpu_vm/g (Rodrigo Vivi)
>>>> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
>>>>    (Rodrigo Vivi)
>>>> - Adjust commit message accordingly.
>>>> - Add SPDX license header.
>>>>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> ---
>>>>   Documentation/gpu/drm-vm-bind-locking.rst | 351 
>>>> ++++++++++++++++++++++
>>>>   1 file changed, 351 insertions(+)
>>>>   create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
>>>>
>>>> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst 
>>>> b/Documentation/gpu/drm-vm-bind-locking.rst
>>>> new file mode 100644
>>>> index 000000000000..b813961a9ec2
>>>> --- /dev/null
>>>> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
>>>> @@ -0,0 +1,351 @@
>>>> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
>>>> +
>>>> +===============
>>>> +VM_BIND locking
>>>> +===============
>>>> +
>>>> +This document attempts to describe what's needed to get VM_BIND 
>>>> locking right,
>>>> +including the userptr mmu_notifier locking and it will also 
>>>> discuss some
>>>> +optimizations to get rid of the looping through of all userptr 
>>>> mappings and
>>>> +external / shared object mappings that is needed in the simplest
>>>> +implementation. It will also discuss some implications for 
>>>> faulting gpu_vms.
>>>> +
>>>> +Nomenclature
>>>> +============
>>>> +
>>>> +* ``Context``: GPU execution context.
>>>> +* ``gpu_vm``: Abstraction of a virtual GPU address space with
>>>>
>>>>
>>>> +  meta-data. Typically one per client (DRM file-private), or one per
>>>> +  context.
>>>> +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm 
>>>> with
>>> The same nomenclature was used within the VM_BIND async document as 
>>> well. I
>>> wonder if it would make sense to align the naming with the GPUVA 
>>> manager, such
>>> that ('drm_gpuva_manager' -> 'drm_gpuvm'). This would also result 
>>> into better
>>> function names, such as drm_gpuvm_resv_lock() or 
>>> drm_gpuvm_prepare_objects() and
>>> potentially way better naming for the VM_BO abstraction 'drm_gpuvm_bo'.
>>>
>>> However, I'd like to keep 'drm_gpuva' rather than 'drm_gpu_vma', but 
>>> I think
>>> this is close enough anyway.
>>
>> I don't have a strong opinion about the naming here and aligning with 
>> the GPUVA manager make sense, although perhaps the "drm_" prefix 
>> which makes sense for the function- and struct names may not make 
>> sense in a more generic document like this. What about gpuva and gpuvm?
>
> Oh, I think the document is fine as it is. This was more like me 
> thinking loud
> about renaming things in the GPUVA manager accordingly.
>
>>
>>
>>>
>>>> +  associated meta-data. The backing storage of a gpu_vma can 
>>>> either be
>>>> +  a gem buffer object or anonymous pages mapped also into the CPU
>>>> +  address space for the process.
>>>> +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing 
>>>> store of
>>>> +  which is anonymous pages as described above.
>>>> +* ``revalidating``: Revalidating a gpu_vma means making the latest 
>>>> version
>>>> +  of the backing store resident and making sure the gpu_vma's
>>>> +  page-table entries point to that backing store.
>>>> +* ``dma_fence``: A struct dma_fence that is similar to a struct 
>>>> completion
>>>> +  and which tracks GPU activity. When the GPU activity is finished,
>>>> +  the dma_fence signals.
>>>> +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is 
>>>> used
>>>> +  to track GPU activity in the form of multiple dma_fences on a
>>>> +  gpu_vm or a gem buffer object. The dma_resv contains an array / 
>>>> list
>>>> +  of dma_fences and a lock that needs to be held when adding
>>>>
>>>>
>>>> +  additional dma_fences to the dma_resv. The lock is of a type that
>>>> +  allows deadlock-safe locking of multiple dma_resvs in arbitrary 
>>>> order.
>>>> +* ``exec function``: An exec function is a function that 
>>>> revalidates all
>>>> +  affected gpu_vmas, submits a GPU command batch and registers the
>>>> +  dma_fence representing the GPU command's activity with all affected
>>>> +  dma_resvs. For completeness, although not covered by this document,
>>>> +  it's worth mentioning that an exec function may also be the
>>>> +  revalidation worker that is used by some drivers in compute /
>>>> +  long-running mode.
>>>> +* ``local object``: A GEM object which is local to a gpu_vm. 
>>>> Shared gem
>>>> +  objects also share the gpu_vm's dma_resv.
>>>> +* ``shared object``: AKA external object: A GEM object which may 
>>>> be shared
>>>> +  by multiple gpu_vms and whose backing storage may be shared with
>>>> +  other drivers.
>>>> +
>>>> +
>>>> +Introducing the locks
>>>> +=====================
>>>> +
>>>> +One of the benefits of VM_BIND is that local GEM objects share the 
>>>> gpu_vm's
>>>> +dma_resv object and hence the dma_resv lock. So even with a huge
>>>> +number of local GEM objects, only one lock is needed to make the exec
>>>> +sequence atomic.
>>>> +
>>>> +The following locks and locking orders are used:
>>>> +
>>>> +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the 
>>>> gpu_vm is
>>>> +  partitioned into gpu_vmas, protects the gpu_vm's list of 
>>>> external objects,
>>>> +  and can also with some simplification protect the gpu_vm's list of
>>>> +  userptr gpu_vmas. With the CPU mm analogy this would correspond 
>>>> to the
>>>> +  mmap_lock.
>>>> +* The ``userptr_seqlock``. This lock is taken in read mode for each
>>>> +  userptr gpu_vma on the gpu_vm's userptr list, and in write mode 
>>>> during mmu
>>>> +  notifier invalidation. This is not a real seqlock but described in
>>>> +  ``mm/mmu_notifier.c` as a "Collision-retry read-side/write-side
>>>> +  'lock' a lot like a seqcount, however this allows multiple
>>>> +  write-sides to hold it at once...". The read side critical section
>>>> +  is enclosed by ``mmu_interval_read_begin() /
>>>> +  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
>>>> +  sleeping uninterruptibly if the write side is held.
>>>> +  The write side is held by the core mm while calling mmu interval
>>>> +  invalidation notifiers.
>>>> +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of 
>>>> gpu_vmas needing
>>>> +  rebinding, and also the residency of all the gpu_vm's local GEM 
>>>> object.
>>>> +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is 
>>>> taken in read
>>>> +  mode during exec and write mode during a mmu notifier 
>>>> invalidation. In
>>>> +  the absence of a separate page-table lock, this lock can serve
>>>> +  together with the gpu_vm's dma_resv lock as a page-table lock. 
>>>> More on
>>>> +  this below. The userptr notifier lock is per gpu_vm.
>>>> +* The ``gpu_vm->page_table_lock``. Protects the gpu_vm's 
>>>> page-table updates. For
>>>> +  simplicity the gpu_vm's dma_resv lock can be reused as 
>>>> page-table lock.
>>>> +
>>>> +There are certain optimizations described below that require
>>>> +additional locks. More on that later.
>>>> +
>>>> +.. code-block:: C
>>>> +
>>>> +   dma_resv_lock(&gpu_vm->resv);
>>>> +
>>>> +   for_each_gpu_vma_on_revalidate_list(gpu_vm, &gpu_vma) {
>>>> +        revalidate_gpu_vma(&gpu_vma);
>>>> +        remove_from_revalidate_list(&gpu_vma);
>>>> +   }
>>>> +
>>>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>>>> +   job_dma_fence = gpu_submit(&gpu_job));
>>>> +
>>>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>>>> +   dma_resv_unlock(&gpu_vm->resv);
>>>> +
>>>> +Eviction of one of these local objects will then be something like 
>>>> the
>>>> +following:
>>>> +
>>>> +.. code-block:: C
>>>> +
>>>> +   obj = get_object_from_lru();
>>>> +
>>>> +   dma_resv_lock(obj->resv);
>>>> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
>>>> +        put_gpu_vma_on_revalidate_list(&gpu_vma);
>>>> +
>>>> +   add_dependencies(&eviction_job, &obj->resv);
>>>> +   job_dma_fence = gpu_submit(&eviction_job);
>>>> +   add_dma_fence(&obj->resv, job_dma_fence);
>>>> +
>>>> +   dma_resv_unlock(&obj->resv);
>>>> +   put_object(obj);
>>>> +
>>>> +Note that since the object is local to the gpu_vm, it will share 
>>>> the gpu_vm's
>>>> +``dma_resv`` lock so that ``obj->resv == gpu_vm->resv``. 
>>>> Invalidated gpu_vmas are put
>>>> +on the gpu_vm's revalidation list, which is protected by 
>>>> ``gpu_vm->resv``, which
>>>> +is always locked while evicting, due to the above equality.
>>>> +
>>>> +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before 
>>>> eviction,
>>>> +Since the eviction blit or copy will wait for GPU idle, any 
>>>> attempt by
>>>> +the GPU to access freed memory through the gpu_vma will be 
>>>> preceded by
>>>> +a new exec function, which will make sure the gpu_vma is
>>>> +revalidated. The eviction code holding the object's dma_resv while
>>>> +revalidating will ensure a new exec function may not race with the 
>>>> eviction.
>>>> +
>>>> +Introducing external (or shared) buffer objects
>>>> +===============================================
>>>> +
>>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
>>>> +can't share their reservation object with a single gpu_vm, but 
>>>> will rather
>>>> +have a reservation object of their own. The shared objects bound to a
>>>> +gpu_vm using one or many
>>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
>>>> +protected by the gpu_vm lock. One could in theory protect it also 
>>>> with
>>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is 
>>>> typically
>>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
>>>> +the current locking helpers, that is typically not done. Also see
>>>> +below for userptr gpu_vmas.
>>>> +
>>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
>>>> +object, but we can no longer be certain that we hold the gpu_vm's
>>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we
>>> I need to think a bit more about locking of extobj and evicted 
>>> object tracking
>>> in the case of processing 'drm_gpuva_ops' directly through callbacks 
>>> within the
>>> fence signalling critical path as mentioend in [1].
>>>
>>> In order to support that, we'd need to protect extobjs with a 
>>> separate lock,
>>> and while iterating extobjs to acquire the dma-resv lock drop the 
>>> lock within
>>> the loop before we actually acquire the dma-resv lock. Maple tree 
>>> supports that
>>> already and this can be fully done within the GPUVA manager; no need 
>>> for the
>>> driver to care about that.
>>
>> So do I understand correctly that this because you want to update the 
>> gpuvm state while operations are progressing asynchronously?
>>
>> If so, I wonder whether that could really be done? For example to 
>> allocate enough memory for page-tables etc, you need to know the 
>> details of the operations at IOCTL execution time, and to know the 
>> details you need to know the state from the previous operation?
>
>
> Right, sync and async bind can't run fully concurrently, but you could 
> "inject" a
> sync one between two async ones such that the sync ones executed from 
> the IOCTL
> directly while async execution is stalled meanwhile. This would be 
> possible because
> the actual drm_gpuva_ops would be calculated within the async 
> execution path rather
> than in the IOCTL. But yes, page-table management must be desinged to 
> support that.

OK, well one of the main motivations for Xe is to be able to pipeline 
interleaving binds and execs if needed, like so:

- Bind vmas for scene 1.
- Submit scene 1.
- Unbind vmas for scene 1.
- Bind vmas for scene 2.
- Submit scene 2.
- Unbind vmas for scene 2.

And being able to *submit* all of the above while the async binding of 
vmas for scene (step 1) has not yet completed.
I can't really see how this could be done, while obeying dma-fence 
rules, unless state is updated synchronously while submitting?

So unless I'm misunderstanding what you are trying to do, I don't see Xe 
wanting to side-step the current approach, but OTOH protecting part of 
the state with additional locks probably won't be a problem as long as 
that is optional.

>
>>
>>>
>>> While, as already mentioned, I'd really love to support that, I 
>>> noticed that we
>>> have a similar issue with tracking evicted objects. There are 
>>> (similar) ways to
>>> deal with that, however, it drastically increases complexity.
>>>
>>> Hence, I'd like to reconsider whether it's worth supporting it in 
>>> the first
>>> place. Most of the arguments in order to support it are for decreasing
>>> complexity. However, if it increases complexity elsewhere, it's 
>>> probably not
>>> worth. The only argument left would be for synchronous bind jobs 
>>> which could
>>> be injected at any point of time without the need to be queued up in 
>>> the
>>> scheduler to preserve ordering. However, I'm not yet sure how 
>>> important this
>>> would be. For Xe it doesn't really seem to be a concern I guess?
>> Xe supports that functionality via separate bind queues. If you queue 
>> most of the operations using one queue, you can inject synchronous 
>> bind jobs using another. Ideally they execute separately, but they 
>> are not guaranteed to do that.
>
> Ok, but the separate bind queue would still work in the same 
> asynchronous way, as
> in the job is submitted to some kind of worker and the IOCTL just 
> blocks until
> completion, right?

The job is only submitted to a worker if there are unsatisfied 
dependencies, like that bind queue is busy with something else, or a GPU 
job is wiping the BO content for security reasons, or an in-fence, or 
somebody else having queued a job to the same page-table range *). 
Otherwise the page-table is updated immediately using CPU writes.

But yes, the IOCTL blocks until completion if the job is synchronous.

/Thomas


>
>
>
>>>
>>> [1] 
>>> https://lore.kernel.org/dri-devel/202308221050.kTj8uFMA-lkp@intel.com/T/#m7f3b5a7ff70723332adeea32671578cb95c62f7c
>>>
>>>> +hold the object's private dma_resv. We can trylock the dma_resvs for
>>>> +the affected gpu_vm's but that might be unnecessarily complex. If we
>>>> +have a ww_acquire context at hand at eviction time we can also 
>>>> perform
>>>> +sleeping locks of those dma_resvs but that could cause expensive
>>>> +rollbacks. One option is to just mark the invalidated gpu_vmas 
>>>> with a bool
>>>> +which is inspected on the next exec function, when the gpu_vm's
>>>> +dma_resv and the object's dma_resv is held, and the invalidated
>>>> +gpu_vmas could then be put on the gpu_vm's list of invalidated
>>>> +gpu_vmas. That bool would then, although being per-gpu_vma 
>>>> formally be
>>>> +protected by the object's dma_resv.
>>>> +
>>>> +The exec function would then look something like the following:
>>>> +
>>>> +.. code-block:: C
>>>> +
>>>> +   read_lock(&gpu_vm->lock);
>>>> +
>>>> +   dma_resv_lock(&gpu_vm->resv);
>>>> +
>>>> +   // Shared object list is protected by the gpu_vm->lock.
>>>> +   for_each_shared_obj(gpu_vm, &obj) {
>>>> +        dma_resv_lock(&obj->resv);
>>>> + move_marked_gpu_vmas_to_revalidate_gpu_vma_list(obj, &gpu_vm);
>>>> +   }
>>>> +
>>>> +   for_each_gpu_vma_to_revalidate(gpu_vm, &gpu_vma) {
>>>> +        revalidate_gpu_vma(&gpu_vma);
>>>> +        remove_from_revalidate_list(&gpu_vma);
>>>> +   }
>>>> +
>>>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>>>> +   job_dma_fence = gpu_submit(&gpu_job));
>>>> +
>>>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>>>> +   for_each_shared_obj(gpu_vm, &obj)
>>>> +          add_dma_fence(job_dma_fence, &obj->resv);
>>>> +   dma_resv_unlock_all_resv_locks();
>>>> +
>>>> +   read_unlock(&gpu_vm->lock);
>>>> +
>>>> +And the corresponding shared-object aware eviction would look like:
>>>> +
>>>> +.. code-block:: C
>>>> +
>>>> +   obj = get_object_from_lru();
>>>> +
>>>> +   dma_resv_lock(obj->resv);
>>>> +   for_each_gpu_vma_of_obj(obj, &gpu_vma);
>>>> +        if (object_is_vm_local(obj))
>>>> + put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
>>>> +        else
>>>> + mark_gpu_vma_for_revalidation(&gpu_vma);
>>>> +
>>>> +   add_dependencies(&eviction_job, &obj->resv);
>>>> +   job_dma_fence = gpu_submit(&eviction_job);
>>>> +   add_dma_fence(&obj->resv, job_dma_fence);
>>>> +
>>>> +   dma_resv_unlock(&obj->resv);
>>>> +   put_object(obj);
>>>> +
>>>> +Yet another option is to put the gpu_vmas to be invalidated on a 
>>>> separate
>>>> +gpu_vm list protected by a lower level lock that can be taken both 
>>>> at eviction
>>>> +time and at transfer-to-revalidate list time. The details are not in
>>>> +this document, but this for reference implemented in the Intel xe
>>>> +driver.
>>>> +
>>>> +Introducing userptr gpu_vmas
>>>> +============================
>>>> +
>>>> +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer 
>>>> object to a
>>>> +GPU virtual address range, directly maps a CPU mm range of anonymous-
>>>> +or file page-cache pages.
>>>> +A very simple approach would be to just pin the pages using
>>>> +pin_user_pages() at bind time and unpin them at unbind time, but this
>>>> +creates a Denial-Of-Service vector since a single user-space process
>>>> +would be able to pin down all of system memory, which is not
>>>> +desirable. (For special use-cases and with proper accounting 
>>>> pinning might
>>>> +still be a desirable feature, though). What we need to do in the 
>>>> general case is
>>>> +to obtain a reference to the desired pages, make sure we are notified
>>>> +using a MMU notifier just before the CPU mm unmaps the pages, dirty
>>>> +them if they are not mapped read-only to the GPU, and then drop 
>>>> the reference.
>>>> +When we are notified by the MMU notifier that CPU mm is about to 
>>>> drop the
>>>> +pages, we need to stop GPU access to the pages,
>>>> +GPU page-table and make sure that before the next time the GPU 
>>>> tries to access
>>>> +whatever is now present in the CPU mm range, we unmap the old pages
>>>> +from the GPU page tables and repeat the process of obtaining new page
>>>> +references. Note that when the core mm decides to laundry pages, 
>>>> we get such
>>>> +an unmap MMU notification and can mark the pages dirty again 
>>>> before the
>>>> +next GPU access. We also get similar MMU notifications for NUMA 
>>>> accounting
>>>> +which the GPU driver doesn't really need to care about, but so far
>>>> +it's proven difficult to exclude certain notifications.
>>>> +
>>>> +Using a MMU notifier for device DMA (and other methods) is 
>>>> described in
>>>> +`this document
>>>> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_. 
>>>>
>>>> +
>>>> +Now the method of obtaining struct page references using
>>>> +get_user_pages() unfortunately can't be used under a dma_resv lock
>>>> +since that would violate the locking order of the dma_resv lock vs 
>>>> the
>>>> +mmap_lock that is grabbed when resolving a CPU pagefault. This 
>>>> means the gpu_vm's
>>>> +list of userptr gpu_vmas needs to be protected by an outer lock, 
>>>> and this
>>>> +is the first time we strictly need the gpu_vm->lock. While it was
>>>> +previously used also to protect the list of the gpu_vm's shared 
>>>> objects,
>>>> +we could in theory have used the gpu_vm->resv for that.
>>>> +
>>>> +The MMU interval seqlock for a userptr gpu_vma is used in the 
>>>> following
>>>> +way:
>>>> +
>>>> +.. code-block:: C
>>>> +
>>>> +   down_read(&gpu_vm->lock);
>>>> +
>>>> +   retry:
>>>> +
>>>> +   // Note: mmu_interval_read_begin() blocks until there is no
>>>> +   // invalidation notifier running anymore.
>>>> +   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
>>>> +   if (seq != gpu_vma->saved_seq) {
>>>> +           obtain_new_page_pointers(&gpu_vma);
>>>> +       dma_resv_lock(&gpu_vm->resv);
>>>> +       put_gpu_vma_on_revalidate_list(&gpu_vma, &gpu_vm);
>>>> +       dma_resv_unlock(&gpu_vm->resv);
>>>> +       gpu_vma->saved_seq = seq;
>>>> +   }
>>>> +
>>>> +   // The usual revalidation goes here.
>>>> +
>>>> +   // Final userptr sequence validation may not happen before the
>>>> +   // submission dma_fence is added to the gpu_vm's resv, from the 
>>>> POW
>>>> +   // of the MMU invalidation notifier. Hence the
>>>> +   // userptr_notifier_lock that will make them appear atomic.
>>>> +
>>>> +   add_dependencies(&gpu_job, &gpu_vm->resv);
>>>> +   down_read(&gpu_vm->userptr_notifier_lock);
>>>> +   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, 
>>>> gpu_vma->saved_seq)) {
>>>> +          up_read(&gpu_vm->userptr_notifier_lock);
>>>> +      goto retry;
>>>> +   }
>>>> +
>>>> +   job_dma_fence = gpu_submit(&gpu_job));
>>>> +
>>>> +   add_dma_fence(job_dma_fence, &gpu_vm->resv);
>>>> +
>>>> +   for_each_shared_obj(gpu_vm, &obj)
>>>> +          add_dma_fence(job_dma_fence, &obj->resv);
>>>> +
>>>> +   dma_resv_unlock_all_resv_locks();
>>>> +   up_read(&gpu_vm->userptr_notifier_lock);
>>>> +   up_read(&gpu_vm->lock);
>>>> +
>>>> +The code between ``mmu_interval_read_begin()`` and the
>>>> +``mmu_interval_read_retry()`` marks the read side critical section of
>>>> +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr
>>>> +gpu_vma list is looped through, and the check is done for *all* of 
>>>> its
>>>> +userptr gpu_vmas, although we only show a single one here.
>>>> +
>>>> +The userptr gpu_vma MMU invalidation notifier might be called fr
>>>>
>>>>
>>>> om
>>>> +reclaim context and, again to avoid locking order violations, we 
>>>> can't
>>>> +take any dma_resv lock nor the gpu_vm->lock from within it.
>>>> +
>>>> +.. code-block:: C
>>>> +
>>>> +  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
>>>> +  {
>>>> +          // Make sure the exec function either sees the new sequence
>>>> +      // and backs off or we wait for the dma-fence:
>>>> +
>>>> + down_write(&gpu_vm->userptr_notifier_lock);
>>>> +      mmu_interval_set_seq(userptr_interval, cur_seq);
>>>> +      up_write(&gpu_vm->userptr_notifier_lock);
>>>> +
>>>> +      dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
>>>> +                        false, MAX_SCHEDULE_TIMEOUT);
>>>> +      return true;
>>>> +  }
>>>> +
>>>> +When this invalidation notifier returns, the GPU can no longer be
>>>> +accessing the old pages of the userptr gpu_vma and needs to redo 
>>>> the page-binding
>>>> +before a new GPU submission can succeed.
>>>> +
>>>> +Optimizing gpu_vma iteration
>>>> +----------------------------
>>>> +
>>>> +Iterating through all of a gpu_vm's userptr gpu_vmas to check the 
>>>> validity
>>>> +on each exec function may be very costly. There is a scheme to avoid
>>>> +this and only iterate through the userptr gpu_vmas that actually 
>>>> saw an
>>>> +invalidation notifier call since the last exec. T
>>>> +
>>>> +TODO: describe that scheme here. It's implemented in the xe driver.
>>>> +
>>>> +Locking for page-table updates at bind- and unbind time
>>>> +=======================================================
>>>> +
>>>> +TODO.
>>>> +
>>>> +Recoverable page-fault implications
>>>> +===================================
>>>> +
>>>> +TODO.
>>>> -- 
>>>> 2.41.0
>>>>
>>
>

Boris Brezillon Sept. 6, 2023, 11:09 a.m. UTC | #7

On Wed, 6 Sep 2023 10:32:24 +0200
Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:


> >>>> +Introducing external (or shared) buffer objects
> >>>> +===============================================
> >>>> +
> >>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
> >>>> +can't share their reservation object with a single gpu_vm, but 
> >>>> will rather
> >>>> +have a reservation object of their own. The shared objects bound to a
> >>>> +gpu_vm using one or many
> >>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
> >>>> +protected by the gpu_vm lock. One could in theory protect it also 
> >>>> with
> >>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is 
> >>>> typically
> >>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
> >>>> +the current locking helpers, that is typically not done. Also see
> >>>> +below for userptr gpu_vmas.
> >>>> +
> >>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
> >>>> +object, but we can no longer be certain that we hold the gpu_vm's
> >>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we  
> >>> I need to think a bit more about locking of extobj and evicted 
> >>> object tracking
> >>> in the case of processing 'drm_gpuva_ops' directly through callbacks 
> >>> within the
> >>> fence signalling critical path as mentioend in [1].
> >>>
> >>> In order to support that, we'd need to protect extobjs with a 
> >>> separate lock,
> >>> and while iterating extobjs to acquire the dma-resv lock drop the 
> >>> lock within
> >>> the loop before we actually acquire the dma-resv lock. Maple tree 
> >>> supports that
> >>> already and this can be fully done within the GPUVA manager; no need 
> >>> for the
> >>> driver to care about that.  
> >>
> >> So do I understand correctly that this because you want to update the 
> >> gpuvm state while operations are progressing asynchronously?
> >>
> >> If so, I wonder whether that could really be done? For example to 
> >> allocate enough memory for page-tables etc, you need to know the 
> >> details of the operations at IOCTL execution time, and to know the 
> >> details you need to know the state from the previous operation?  
> >
> >
> > Right, sync and async bind can't run fully concurrently, but you could 
> > "inject" a
> > sync one between two async ones such that the sync ones executed from 
> > the IOCTL
> > directly while async execution is stalled meanwhile. This would be 
> > possible because
> > the actual drm_gpuva_ops would be calculated within the async 
> > execution path rather
> > than in the IOCTL. But yes, page-table management must be desinged to 
> > support that.

FWIW, the panthor driver is designed this way (note that I'm not
supporting GEM eviction yet, so there might be subtleties I missed).

> 
> OK, well one of the main motivations for Xe is to be able to pipeline 
> interleaving binds and execs if needed, like so:
> 
> - Bind vmas for scene 1.
> - Submit scene 1.
> - Unbind vmas for scene 1.
> - Bind vmas for scene 2.
> - Submit scene 2.
> - Unbind vmas for scene 2.
> 
> And being able to *submit* all of the above while the async binding of 
> vmas for scene (step 1) has not yet completed.
> I can't really see how this could be done, while obeying dma-fence 
> rules, unless state is updated synchronously while submitting?

The idea in this case is to detect when a GPU job dependency is a
VM_BIND out-fence, turn drm_sched_fence->parent into an
xxx_vm_bind_job_fence object that's holding the GEM that's about to be
mapped (AFAICT, we don't need to do anything for unmap operations), and
then add our GPU job fence to this BO. This should not only guarantee
that the GEMs we depend on are mapped before the GPU job is executed
(the fence wait does that), but also that such yet-to-be-mapped GEMs
won't be evicted just after they've been mapped and before the GPU had
a chance to execute (unless I'm missing something, adding our GPU job
fence to the BO being targeted by a pending VM_BIND(async,map) operation
solves this problem).

Thomas Hellström Sept. 6, 2023, 11:57 a.m. UTC | #8

Hi, Boris

On 9/6/23 13:09, Boris Brezillon wrote:
> On Wed, 6 Sep 2023 10:32:24 +0200
> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>
>
>>>>>> +Introducing external (or shared) buffer objects
>>>>>> +===============================================
>>>>>> +
>>>>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
>>>>>> +can't share their reservation object with a single gpu_vm, but
>>>>>> will rather
>>>>>> +have a reservation object of their own. The shared objects bound to a
>>>>>> +gpu_vm using one or many
>>>>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
>>>>>> +protected by the gpu_vm lock. One could in theory protect it also
>>>>>> with
>>>>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is
>>>>>> typically
>>>>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
>>>>>> +the current locking helpers, that is typically not done. Also see
>>>>>> +below for userptr gpu_vmas.
>>>>>> +
>>>>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
>>>>>> +object, but we can no longer be certain that we hold the gpu_vm's
>>>>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we
>>>>> I need to think a bit more about locking of extobj and evicted
>>>>> object tracking
>>>>> in the case of processing 'drm_gpuva_ops' directly through callbacks
>>>>> within the
>>>>> fence signalling critical path as mentioend in [1].
>>>>>
>>>>> In order to support that, we'd need to protect extobjs with a
>>>>> separate lock,
>>>>> and while iterating extobjs to acquire the dma-resv lock drop the
>>>>> lock within
>>>>> the loop before we actually acquire the dma-resv lock. Maple tree
>>>>> supports that
>>>>> already and this can be fully done within the GPUVA manager; no need
>>>>> for the
>>>>> driver to care about that.
>>>> So do I understand correctly that this because you want to update the
>>>> gpuvm state while operations are progressing asynchronously?
>>>>
>>>> If so, I wonder whether that could really be done? For example to
>>>> allocate enough memory for page-tables etc, you need to know the
>>>> details of the operations at IOCTL execution time, and to know the
>>>> details you need to know the state from the previous operation?
>>>
>>> Right, sync and async bind can't run fully concurrently, but you could
>>> "inject" a
>>> sync one between two async ones such that the sync ones executed from
>>> the IOCTL
>>> directly while async execution is stalled meanwhile. This would be
>>> possible because
>>> the actual drm_gpuva_ops would be calculated within the async
>>> execution path rather
>>> than in the IOCTL. But yes, page-table management must be desinged to
>>> support that.
> FWIW, the panthor driver is designed this way (note that I'm not
> supporting GEM eviction yet, so there might be subtleties I missed).

The problem is that once you've published your VM_BIND out-fence, any 
code path required to signal that fence may notallocate memory nor or 
grab any locks that allows allocating memory while held including 
dma_resv locks, and that means all required page-table memory needs to 
be allocated synchronously in the IOCTL, and all evicted bos need to be 
made resident in the IOCTL, and at least in the xe driver the amount of 
memory we need to allocate depends on the vm state, so we can't really 
update the vm state asynchronously either.

But as long as any async binding work required for signalling the 
VM_BIND out-fence is properly annotated with 
dma_fence_begin_signalling() and dma_fence_end_signalling() and there 
aren't any lockdep splats, things should be good. It would trigger on 
both memory allocation and attempts to grab a dma_resv lock.


>
>> OK, well one of the main motivations for Xe is to be able to pipeline
>> interleaving binds and execs if needed, like so:
>>
>> - Bind vmas for scene 1.
>> - Submit scene 1.
>> - Unbind vmas for scene 1.
>> - Bind vmas for scene 2.
>> - Submit scene 2.
>> - Unbind vmas for scene 2.
>>
>> And being able to *submit* all of the above while the async binding of
>> vmas for scene (step 1) has not yet completed.
>> I can't really see how this could be done, while obeying dma-fence
>> rules, unless state is updated synchronously while submitting?
> The idea in this case is to detect when a GPU job dependency is a
> VM_BIND out-fence, turn drm_sched_fence->parent into an
> xxx_vm_bind_job_fence object that's holding the GEM that's about to be
> mapped (AFAICT, we don't need to do anything for unmap operations), and
> then add our GPU job fence to this BO. This should not only guarantee
> that the GEMs we depend on are mapped before the GPU job is executed
> (the fence wait does that), but also that such yet-to-be-mapped GEMs
> won't be evicted just after they've been mapped and before the GPU had
> a chance to execute (unless I'm missing something, adding our GPU job
> fence to the BO being targeted by a pending VM_BIND(async,map) operation
> solves this problem).

Yes, we're essentially doing the same. The issue here is that when we, 
for example *submit* Bind vmas for scene 2,
we need to know how much page-table memory to allocate, and what BOs to 
make resident to be able to publish the out-fence. That means we need to 
know what the VM state would look like at the end of "Unbind vmas for 
scene 1". If the VM state is updated at submission time, that's all ok 
but if it's updated at execution time, we'd have to guess what resources 
to pre-allocate.

/Thomas

Boris Brezillon Sept. 6, 2023, 1 p.m. UTC | #9

On Wed, 6 Sep 2023 13:57:03 +0200
Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:

> Hi, Boris
> 
> On 9/6/23 13:09, Boris Brezillon wrote:
> > On Wed, 6 Sep 2023 10:32:24 +0200
> > Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> >
> >  
> >>>>>> +Introducing external (or shared) buffer objects
> >>>>>> +===============================================
> >>>>>> +
> >>>>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
> >>>>>> +can't share their reservation object with a single gpu_vm, but
> >>>>>> will rather
> >>>>>> +have a reservation object of their own. The shared objects bound to a
> >>>>>> +gpu_vm using one or many
> >>>>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
> >>>>>> +protected by the gpu_vm lock. One could in theory protect it also
> >>>>>> with
> >>>>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is
> >>>>>> typically
> >>>>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
> >>>>>> +the current locking helpers, that is typically not done. Also see
> >>>>>> +below for userptr gpu_vmas.
> >>>>>> +
> >>>>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
> >>>>>> +object, but we can no longer be certain that we hold the gpu_vm's
> >>>>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we  
> >>>>> I need to think a bit more about locking of extobj and evicted
> >>>>> object tracking
> >>>>> in the case of processing 'drm_gpuva_ops' directly through callbacks
> >>>>> within the
> >>>>> fence signalling critical path as mentioend in [1].
> >>>>>
> >>>>> In order to support that, we'd need to protect extobjs with a
> >>>>> separate lock,
> >>>>> and while iterating extobjs to acquire the dma-resv lock drop the
> >>>>> lock within
> >>>>> the loop before we actually acquire the dma-resv lock. Maple tree
> >>>>> supports that
> >>>>> already and this can be fully done within the GPUVA manager; no need
> >>>>> for the
> >>>>> driver to care about that.  
> >>>> So do I understand correctly that this because you want to update the
> >>>> gpuvm state while operations are progressing asynchronously?
> >>>>
> >>>> If so, I wonder whether that could really be done? For example to
> >>>> allocate enough memory for page-tables etc, you need to know the
> >>>> details of the operations at IOCTL execution time, and to know the
> >>>> details you need to know the state from the previous operation?  
> >>>
> >>> Right, sync and async bind can't run fully concurrently, but you could
> >>> "inject" a
> >>> sync one between two async ones such that the sync ones executed from
> >>> the IOCTL
> >>> directly while async execution is stalled meanwhile. This would be
> >>> possible because
> >>> the actual drm_gpuva_ops would be calculated within the async
> >>> execution path rather
> >>> than in the IOCTL. But yes, page-table management must be desinged to
> >>> support that.  
> > FWIW, the panthor driver is designed this way (note that I'm not
> > supporting GEM eviction yet, so there might be subtleties I missed).  
> 
> The problem is that once you've published your VM_BIND out-fence, any 
> code path required to signal that fence may notallocate memory nor or 
> grab any locks that allows allocating memory while held including 
> dma_resv locks, and that means all required page-table memory needs to 
> be allocated synchronously in the IOCTL,

Yep, that's already what I do, by over-provisioning for the worst case
scenario (page table tree is empty), and returning unused pages after
the operation is done.

> and all evicted bos need to be 
> made resident in the IOCTL,

Yep, I'm pinning memory to BOs in that path too.

> and at least in the xe driver the amount of 
> memory we need to allocate depends on the vm state, so we can't really 
> update the vm state asynchronously either.

For Mali, we can calculate the maximum amount of pages we'll need for a
MAP operation, by assuming the page table is empty. Then it's just a
matter of returning unused pages to a fast-alloc pool so we can
speed-up further page table allocations (we're using a kmem_cache here,
since the page table update is done by the CPU and memory is shared on
Arm, but there's no reason you can't have your own cache
implementation).

> 
> But as long as any async binding work required for signalling the 
> VM_BIND out-fence is properly annotated with 
> dma_fence_begin_signalling() and dma_fence_end_signalling() and there 
> aren't any lockdep splats, things should be good. It would trigger on 
> both memory allocation and attempts to grab a dma_resv lock.

I have dma_fence_{begin,end}_signalling() annotations in the
::run_job() path, and no lockdep complaint spotted so far.

> 
> 
> >  
> >> OK, well one of the main motivations for Xe is to be able to pipeline
> >> interleaving binds and execs if needed, like so:
> >>
> >> - Bind vmas for scene 1.
> >> - Submit scene 1.
> >> - Unbind vmas for scene 1.
> >> - Bind vmas for scene 2.
> >> - Submit scene 2.
> >> - Unbind vmas for scene 2.
> >>
> >> And being able to *submit* all of the above while the async binding of
> >> vmas for scene (step 1) has not yet completed.
> >> I can't really see how this could be done, while obeying dma-fence
> >> rules, unless state is updated synchronously while submitting?  
> > The idea in this case is to detect when a GPU job dependency is a
> > VM_BIND out-fence, turn drm_sched_fence->parent into an
> > xxx_vm_bind_job_fence object that's holding the GEM that's about to be
> > mapped (AFAICT, we don't need to do anything for unmap operations), and
> > then add our GPU job fence to this BO. This should not only guarantee
> > that the GEMs we depend on are mapped before the GPU job is executed
> > (the fence wait does that), but also that such yet-to-be-mapped GEMs
> > won't be evicted just after they've been mapped and before the GPU had
> > a chance to execute (unless I'm missing something, adding our GPU job
> > fence to the BO being targeted by a pending VM_BIND(async,map) operation
> > solves this problem).

It's not exactly that, because we'd need to add a GEMs of all the
pending VM_BIND(map) jobs that come before the expressed dependency, not
just the one attached to the dependency itself. But after chatting with
Danilo, I realized we might not even need to track the GEMs being
mapped at the fence level if we call drm_gpuva_extobj_insert() in the
ioctl(VM_BIND) path:

- drm_gpuva_extobj_insert() will make sure the GEM is added to
  the ext-object map even before it's actually mapped to the VM (for
  private GEMs, it doesn't matter, because they are using the VM resv,
  so any private GEM mapped will automatically receive the VM resv
  updates).

Now, when a GPU job is queued, we do all the VM GEM preparation, which
includes the following steps:

- drm_gpuva_manager_validate() will make already-bound-but-evicted GEMs
  resident
- Iterate over all ext-objs to add our fence (I'm skipping the slot
  reservation step that's implied). Because drm_gpuva_extobj_insert()
  was called early, we also get all the GEMs that are not yet mapped,
  but are about to be mapped. This means they won't be evicted until
  after our job is done
- add our fence to the VM resv

Unless I'm missing something, this should guarantee that all GEMs are
resident and mapped when the job is executed.

> 
> Yes, we're essentially doing the same. The issue here is that when we, 
> for example *submit* Bind vmas for scene 2,
> we need to know how much page-table memory to allocate,

This is solved with over-provisioning in our case.

> and what BOs to 
> make resident to be able to publish the out-fence.

That's basically what Danilo's latest gpuva_mgr patchset tries to
provide generic helpers for, by exposing functions to iterate over all
evicted GEMs (so we can make them resident) and adding a way to add
fences to all GEMs currently bound to the VM. That leaves external GEMs
that are about to be mapped, which, I think, is addressed by the
solution detailed above.

> That means we need to 
> know what the VM state would look like at the end of "Unbind vmas for 
> scene 1".

Not necessarily, as long as you know all the GEMs that are currently
mapped and those that are about to be mapped. The extobj set provides
exactly that for external GEMs.

> If the VM state is updated at submission time, that's all ok 
> but if it's updated at execution time, we'd have to guess what resources 
> to pre-allocate.

As long as you have enough resources pre-allocated to do the VM update
(not saying this is easy to guess on Intel, but it's doable on Mali,
and the page table caching makes over-provisioning not too bad, as long
as we limit the number of in-flight VM_BIND jobs).

Thomas Hellström Sept. 6, 2023, 2:08 p.m. UTC | #10

Hi, Boris,

On 9/6/23 15:00, Boris Brezillon wrote:
> On Wed, 6 Sep 2023 13:57:03 +0200
> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>
>> Hi, Boris
>>
>> On 9/6/23 13:09, Boris Brezillon wrote:
>>> On Wed, 6 Sep 2023 10:32:24 +0200
>>> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>>>
>>>   
>>>>>>>> +Introducing external (or shared) buffer objects
>>>>>>>> +===============================================
>>>>>>>> +
>>>>>>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
>>>>>>>> +can't share their reservation object with a single gpu_vm, but
>>>>>>>> will rather
>>>>>>>> +have a reservation object of their own. The shared objects bound to a
>>>>>>>> +gpu_vm using one or many
>>>>>>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
>>>>>>>> +protected by the gpu_vm lock. One could in theory protect it also
>>>>>>>> with
>>>>>>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is
>>>>>>>> typically
>>>>>>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
>>>>>>>> +the current locking helpers, that is typically not done. Also see
>>>>>>>> +below for userptr gpu_vmas.
>>>>>>>> +
>>>>>>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
>>>>>>>> +object, but we can no longer be certain that we hold the gpu_vm's
>>>>>>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we
>>>>>>> I need to think a bit more about locking of extobj and evicted
>>>>>>> object tracking
>>>>>>> in the case of processing 'drm_gpuva_ops' directly through callbacks
>>>>>>> within the
>>>>>>> fence signalling critical path as mentioend in [1].
>>>>>>>
>>>>>>> In order to support that, we'd need to protect extobjs with a
>>>>>>> separate lock,
>>>>>>> and while iterating extobjs to acquire the dma-resv lock drop the
>>>>>>> lock within
>>>>>>> the loop before we actually acquire the dma-resv lock. Maple tree
>>>>>>> supports that
>>>>>>> already and this can be fully done within the GPUVA manager; no need
>>>>>>> for the
>>>>>>> driver to care about that.
>>>>>> So do I understand correctly that this because you want to update the
>>>>>> gpuvm state while operations are progressing asynchronously?
>>>>>>
>>>>>> If so, I wonder whether that could really be done? For example to
>>>>>> allocate enough memory for page-tables etc, you need to know the
>>>>>> details of the operations at IOCTL execution time, and to know the
>>>>>> details you need to know the state from the previous operation?
>>>>> Right, sync and async bind can't run fully concurrently, but you could
>>>>> "inject" a
>>>>> sync one between two async ones such that the sync ones executed from
>>>>> the IOCTL
>>>>> directly while async execution is stalled meanwhile. This would be
>>>>> possible because
>>>>> the actual drm_gpuva_ops would be calculated within the async
>>>>> execution path rather
>>>>> than in the IOCTL. But yes, page-table management must be desinged to
>>>>> support that.
>>> FWIW, the panthor driver is designed this way (note that I'm not
>>> supporting GEM eviction yet, so there might be subtleties I missed).
>> The problem is that once you've published your VM_BIND out-fence, any
>> code path required to signal that fence may notallocate memory nor or
>> grab any locks that allows allocating memory while held including
>> dma_resv locks, and that means all required page-table memory needs to
>> be allocated synchronously in the IOCTL,
> Yep, that's already what I do, by over-provisioning for the worst case
> scenario (page table tree is empty), and returning unused pages after
> the operation is done.
>
>> and all evicted bos need to be
>> made resident in the IOCTL,
> Yep, I'm pinning memory to BOs in that path too.
>
>> and at least in the xe driver the amount of
>> memory we need to allocate depends on the vm state, so we can't really
>> update the vm state asynchronously either.
> For Mali, we can calculate the maximum amount of pages we'll need for a
> MAP operation, by assuming the page table is empty. Then it's just a
> matter of returning unused pages to a fast-alloc pool so we can
> speed-up further page table allocations (we're using a kmem_cache here,
> since the page table update is done by the CPU and memory is shared on
> Arm, but there's no reason you can't have your own cache
> implementation).
>
>> But as long as any async binding work required for signalling the
>> VM_BIND out-fence is properly annotated with
>> dma_fence_begin_signalling() and dma_fence_end_signalling() and there
>> aren't any lockdep splats, things should be good. It would trigger on
>> both memory allocation and attempts to grab a dma_resv lock.
> I have dma_fence_{begin,end}_signalling() annotations in the
> ::run_job() path, and no lockdep complaint spotted so far.
>
>>
>>>   
>>>> OK, well one of the main motivations for Xe is to be able to pipeline
>>>> interleaving binds and execs if needed, like so:
>>>>
>>>> - Bind vmas for scene 1.
>>>> - Submit scene 1.
>>>> - Unbind vmas for scene 1.
>>>> - Bind vmas for scene 2.
>>>> - Submit scene 2.
>>>> - Unbind vmas for scene 2.
>>>>
>>>> And being able to *submit* all of the above while the async binding of
>>>> vmas for scene (step 1) has not yet completed.
>>>> I can't really see how this could be done, while obeying dma-fence
>>>> rules, unless state is updated synchronously while submitting?
>>> The idea in this case is to detect when a GPU job dependency is a
>>> VM_BIND out-fence, turn drm_sched_fence->parent into an
>>> xxx_vm_bind_job_fence object that's holding the GEM that's about to be
>>> mapped (AFAICT, we don't need to do anything for unmap operations), and
>>> then add our GPU job fence to this BO. This should not only guarantee
>>> that the GEMs we depend on are mapped before the GPU job is executed
>>> (the fence wait does that), but also that such yet-to-be-mapped GEMs
>>> won't be evicted just after they've been mapped and before the GPU had
>>> a chance to execute (unless I'm missing something, adding our GPU job
>>> fence to the BO being targeted by a pending VM_BIND(async,map) operation
>>> solves this problem).
> It's not exactly that, because we'd need to add a GEMs of all the
> pending VM_BIND(map) jobs that come before the expressed dependency, not
> just the one attached to the dependency itself. But after chatting with
> Danilo, I realized we might not even need to track the GEMs being
> mapped at the fence level if we call drm_gpuva_extobj_insert() in the
> ioctl(VM_BIND) path:
>
> - drm_gpuva_extobj_insert() will make sure the GEM is added to
>    the ext-object map even before it's actually mapped to the VM (for
>    private GEMs, it doesn't matter, because they are using the VM resv,
>    so any private GEM mapped will automatically receive the VM resv
>    updates).
>
> Now, when a GPU job is queued, we do all the VM GEM preparation, which
> includes the following steps:
>
> - drm_gpuva_manager_validate() will make already-bound-but-evicted GEMs
>    resident
> - Iterate over all ext-objs to add our fence (I'm skipping the slot
>    reservation step that's implied). Because drm_gpuva_extobj_insert()
>    was called early, we also get all the GEMs that are not yet mapped,
>    but are about to be mapped. This means they won't be evicted until
>    after our job is done
> - add our fence to the VM resv
>
> Unless I'm missing something, this should guarantee that all GEMs are
> resident and mapped when the job is executed.
>
>> Yes, we're essentially doing the same. The issue here is that when we,
>> for example *submit* Bind vmas for scene 2,
>> we need to know how much page-table memory to allocate,
> This is solved with over-provisioning in our case.
>
>> and what BOs to
>> make resident to be able to publish the out-fence.
> That's basically what Danilo's latest gpuva_mgr patchset tries to
> provide generic helpers for, by exposing functions to iterate over all
> evicted GEMs (so we can make them resident) and adding a way to add
> fences to all GEMs currently bound to the VM. That leaves external GEMs
> that are about to be mapped, which, I think, is addressed by the
> solution detailed above.
>
>> That means we need to
>> know what the VM state would look like at the end of "Unbind vmas for
>> scene 1".
> Not necessarily, as long as you know all the GEMs that are currently
> mapped and those that are about to be mapped. The extobj set provides
> exactly that for external GEMs.
>
>> If the VM state is updated at submission time, that's all ok
>> but if it's updated at execution time, we'd have to guess what resources
>> to pre-allocate.
> As long as you have enough resources pre-allocated to do the VM update
> (not saying this is easy to guess on Intel, but it's doable on Mali,
> and the page table caching makes over-provisioning not too bad, as long
> as we limit the number of in-flight VM_BIND jobs).

OK, then it sounds we're on the same page. I guess it would i theory be 
possible to pre-allocate all needed resources on xe as well, but if the 
vm state lock is made an inner lock in order for us to be able to grab 
it within the dma-fence critical section, then it comes with a number of 
drawbacks as well:
* Over-allocation of resources.
* Need to spawn a cpu-thread for the async part (currently we utilize 
the GPU for that).
* Probably looking at locking inversions wrt userptr?
* Probably looking at locking inversions wrt recoverable pagefaults?
* Mismatch with the cpu mmap() / munmap() interface where the mmap_sem 
is the outermost lock.

So for us currently it currently looks like the sync state update is the 
preferred one... But OTOH we haven't fully implemented the unwinding yet...

/Thomas

Boris Brezillon Sept. 6, 2023, 2:54 p.m. UTC | #11

Hi Thomas,

On Wed, 6 Sep 2023 16:08:07 +0200
Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:

> Hi, Boris,
> 
> On 9/6/23 15:00, Boris Brezillon wrote:
> > On Wed, 6 Sep 2023 13:57:03 +0200
> > Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> >  
> >> Hi, Boris
> >>
> >> On 9/6/23 13:09, Boris Brezillon wrote:  
> >>> On Wed, 6 Sep 2023 10:32:24 +0200
> >>> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
> >>>
> >>>     
> >>>>>>>> +Introducing external (or shared) buffer objects
> >>>>>>>> +===============================================
> >>>>>>>> +
> >>>>>>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
> >>>>>>>> +can't share their reservation object with a single gpu_vm, but
> >>>>>>>> will rather
> >>>>>>>> +have a reservation object of their own. The shared objects bound to a
> >>>>>>>> +gpu_vm using one or many
> >>>>>>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
> >>>>>>>> +protected by the gpu_vm lock. One could in theory protect it also
> >>>>>>>> with
> >>>>>>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is
> >>>>>>>> typically
> >>>>>>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
> >>>>>>>> +the current locking helpers, that is typically not done. Also see
> >>>>>>>> +below for userptr gpu_vmas.
> >>>>>>>> +
> >>>>>>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
> >>>>>>>> +object, but we can no longer be certain that we hold the gpu_vm's
> >>>>>>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we  
> >>>>>>> I need to think a bit more about locking of extobj and evicted
> >>>>>>> object tracking
> >>>>>>> in the case of processing 'drm_gpuva_ops' directly through callbacks
> >>>>>>> within the
> >>>>>>> fence signalling critical path as mentioend in [1].
> >>>>>>>
> >>>>>>> In order to support that, we'd need to protect extobjs with a
> >>>>>>> separate lock,
> >>>>>>> and while iterating extobjs to acquire the dma-resv lock drop the
> >>>>>>> lock within
> >>>>>>> the loop before we actually acquire the dma-resv lock. Maple tree
> >>>>>>> supports that
> >>>>>>> already and this can be fully done within the GPUVA manager; no need
> >>>>>>> for the
> >>>>>>> driver to care about that.  
> >>>>>> So do I understand correctly that this because you want to update the
> >>>>>> gpuvm state while operations are progressing asynchronously?
> >>>>>>
> >>>>>> If so, I wonder whether that could really be done? For example to
> >>>>>> allocate enough memory for page-tables etc, you need to know the
> >>>>>> details of the operations at IOCTL execution time, and to know the
> >>>>>> details you need to know the state from the previous operation?  
> >>>>> Right, sync and async bind can't run fully concurrently, but you could
> >>>>> "inject" a
> >>>>> sync one between two async ones such that the sync ones executed from
> >>>>> the IOCTL
> >>>>> directly while async execution is stalled meanwhile. This would be
> >>>>> possible because
> >>>>> the actual drm_gpuva_ops would be calculated within the async
> >>>>> execution path rather
> >>>>> than in the IOCTL. But yes, page-table management must be desinged to
> >>>>> support that.  
> >>> FWIW, the panthor driver is designed this way (note that I'm not
> >>> supporting GEM eviction yet, so there might be subtleties I missed).  
> >> The problem is that once you've published your VM_BIND out-fence, any
> >> code path required to signal that fence may notallocate memory nor or
> >> grab any locks that allows allocating memory while held including
> >> dma_resv locks, and that means all required page-table memory needs to
> >> be allocated synchronously in the IOCTL,  
> > Yep, that's already what I do, by over-provisioning for the worst case
> > scenario (page table tree is empty), and returning unused pages after
> > the operation is done.
> >  
> >> and all evicted bos need to be
> >> made resident in the IOCTL,  
> > Yep, I'm pinning memory to BOs in that path too.
> >  
> >> and at least in the xe driver the amount of
> >> memory we need to allocate depends on the vm state, so we can't really
> >> update the vm state asynchronously either.  
> > For Mali, we can calculate the maximum amount of pages we'll need for a
> > MAP operation, by assuming the page table is empty. Then it's just a
> > matter of returning unused pages to a fast-alloc pool so we can
> > speed-up further page table allocations (we're using a kmem_cache here,
> > since the page table update is done by the CPU and memory is shared on
> > Arm, but there's no reason you can't have your own cache
> > implementation).
> >  
> >> But as long as any async binding work required for signalling the
> >> VM_BIND out-fence is properly annotated with
> >> dma_fence_begin_signalling() and dma_fence_end_signalling() and there
> >> aren't any lockdep splats, things should be good. It would trigger on
> >> both memory allocation and attempts to grab a dma_resv lock.  
> > I have dma_fence_{begin,end}_signalling() annotations in the
> > ::run_job() path, and no lockdep complaint spotted so far.
> >  
> >>  
> >>>     
> >>>> OK, well one of the main motivations for Xe is to be able to pipeline
> >>>> interleaving binds and execs if needed, like so:
> >>>>
> >>>> - Bind vmas for scene 1.
> >>>> - Submit scene 1.
> >>>> - Unbind vmas for scene 1.
> >>>> - Bind vmas for scene 2.
> >>>> - Submit scene 2.
> >>>> - Unbind vmas for scene 2.
> >>>>
> >>>> And being able to *submit* all of the above while the async binding of
> >>>> vmas for scene (step 1) has not yet completed.
> >>>> I can't really see how this could be done, while obeying dma-fence
> >>>> rules, unless state is updated synchronously while submitting?  
> >>> The idea in this case is to detect when a GPU job dependency is a
> >>> VM_BIND out-fence, turn drm_sched_fence->parent into an
> >>> xxx_vm_bind_job_fence object that's holding the GEM that's about to be
> >>> mapped (AFAICT, we don't need to do anything for unmap operations), and
> >>> then add our GPU job fence to this BO. This should not only guarantee
> >>> that the GEMs we depend on are mapped before the GPU job is executed
> >>> (the fence wait does that), but also that such yet-to-be-mapped GEMs
> >>> won't be evicted just after they've been mapped and before the GPU had
> >>> a chance to execute (unless I'm missing something, adding our GPU job
> >>> fence to the BO being targeted by a pending VM_BIND(async,map) operation
> >>> solves this problem).  
> > It's not exactly that, because we'd need to add a GEMs of all the
> > pending VM_BIND(map) jobs that come before the expressed dependency, not
> > just the one attached to the dependency itself. But after chatting with
> > Danilo, I realized we might not even need to track the GEMs being
> > mapped at the fence level if we call drm_gpuva_extobj_insert() in the
> > ioctl(VM_BIND) path:
> >
> > - drm_gpuva_extobj_insert() will make sure the GEM is added to
> >    the ext-object map even before it's actually mapped to the VM (for
> >    private GEMs, it doesn't matter, because they are using the VM resv,
> >    so any private GEM mapped will automatically receive the VM resv
> >    updates).
> >
> > Now, when a GPU job is queued, we do all the VM GEM preparation, which
> > includes the following steps:
> >
> > - drm_gpuva_manager_validate() will make already-bound-but-evicted GEMs
> >    resident
> > - Iterate over all ext-objs to add our fence (I'm skipping the slot
> >    reservation step that's implied). Because drm_gpuva_extobj_insert()
> >    was called early, we also get all the GEMs that are not yet mapped,
> >    but are about to be mapped. This means they won't be evicted until
> >    after our job is done
> > - add our fence to the VM resv
> >
> > Unless I'm missing something, this should guarantee that all GEMs are
> > resident and mapped when the job is executed.
> >  
> >> Yes, we're essentially doing the same. The issue here is that when we,
> >> for example *submit* Bind vmas for scene 2,
> >> we need to know how much page-table memory to allocate,  
> > This is solved with over-provisioning in our case.
> >  
> >> and what BOs to
> >> make resident to be able to publish the out-fence.  
> > That's basically what Danilo's latest gpuva_mgr patchset tries to
> > provide generic helpers for, by exposing functions to iterate over all
> > evicted GEMs (so we can make them resident) and adding a way to add
> > fences to all GEMs currently bound to the VM. That leaves external GEMs
> > that are about to be mapped, which, I think, is addressed by the
> > solution detailed above.
> >  
> >> That means we need to
> >> know what the VM state would look like at the end of "Unbind vmas for
> >> scene 1".  
> > Not necessarily, as long as you know all the GEMs that are currently
> > mapped and those that are about to be mapped. The extobj set provides
> > exactly that for external GEMs.
> >  
> >> If the VM state is updated at submission time, that's all ok
> >> but if it's updated at execution time, we'd have to guess what resources
> >> to pre-allocate.  
> > As long as you have enough resources pre-allocated to do the VM update
> > (not saying this is easy to guess on Intel, but it's doable on Mali,
> > and the page table caching makes over-provisioning not too bad, as long
> > as we limit the number of in-flight VM_BIND jobs).  
> 
> OK, then it sounds we're on the same page. I guess it would i theory be 
> possible to pre-allocate all needed resources on xe as well, but if the 
> vm state lock is made an inner lock in order for us to be able to grab 
> it within the dma-fence critical section, then it comes with a number of 
> drawbacks as well:
> * Over-allocation of resources.
> * Need to spawn a cpu-thread for the async part (currently we utilize 
> the GPU for that).

I guess the async CPU part is the logic returning unused resources to
the cache. You can use a work item/wq for that instead of a thread, but
yes, there's some work to be done on the CPU, indeed.

> * Probably looking at locking inversions wrt userptr?
> * Probably looking at locking inversions wrt recoverable pagefaults?

Okay, I clearly didn't look at userptr, and I briefly looked at
alloc-on-fault but didn't finish/test my implementation since I didn't
have a use case for it.

For the use cases we have, we only need to take the VM lock (the lock
protecting the VM state) when executing a VM operation (map/unmap), and
that's in the dma-signalling path where we do no allocation (thanks
to the pre-allocation logic) and no attempt to acquire a resv lock.

Tbh, I'm not even sure we'd need a lock if that wasn't for the debugfs
gpuva dumper, because drm_sched makes it so VM operations are
serialized (VM ops happen on the CPU, and there's one thread dequeuing
drm_sched_jobs).

The extobj set is protected using another lock in Danilo's
implementation (and my open-coded implementation did something similar,
though slightly broken apparently), so maybe that's the one you're
worried about.

> * Mismatch with the cpu mmap() / munmap() interface where the mmap_sem 
> is the outermost lock.
> 
> So for us currently it currently looks like the sync state update is the 
> preferred one...

Just to clarify things, I'm not trying to convince you to use the async
model, just saying that's what we went for, and, at first glance, it
didn't seem completely insane to me. But if there's something
fundamentally broken in this approach, I think I'd like to figure it
out early, so thanks for all your inputs :-).

Regards,

Boris

Thomas Hellström Sept. 6, 2023, 3:07 p.m. UTC | #12

Hi, Boris

On 9/6/23 16:54, Boris Brezillon wrote:
> Hi Thomas,
>
> On Wed, 6 Sep 2023 16:08:07 +0200
> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>
>> Hi, Boris,
>>
>> On 9/6/23 15:00, Boris Brezillon wrote:
>>> On Wed, 6 Sep 2023 13:57:03 +0200
>>> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>>>   
>>>> Hi, Boris
>>>>
>>>> On 9/6/23 13:09, Boris Brezillon wrote:
>>>>> On Wed, 6 Sep 2023 10:32:24 +0200
>>>>> Thomas Hellström <thomas.hellstrom@linux.intel.com> wrote:
>>>>>
>>>>>      
>>>>>>>>>> +Introducing external (or shared) buffer objects
>>>>>>>>>> +===============================================
>>>>>>>>>> +
>>>>>>>>>> +Since shared buffer objects may be shared by multiple gpu_vm's they
>>>>>>>>>> +can't share their reservation object with a single gpu_vm, but
>>>>>>>>>> will rather
>>>>>>>>>> +have a reservation object of their own. The shared objects bound to a
>>>>>>>>>> +gpu_vm using one or many
>>>>>>>>>> +gpu_vmas are therefore typically put on a per-gpu_vm list which is
>>>>>>>>>> +protected by the gpu_vm lock. One could in theory protect it also
>>>>>>>>>> with
>>>>>>>>>> +the ``gpu_vm->resv``, but since the list of dma_resvs to take is
>>>>>>>>>> typically
>>>>>>>>>> +built before the ``gpu_vm->resv`` is locked due to a limitation in
>>>>>>>>>> +the current locking helpers, that is typically not done. Also see
>>>>>>>>>> +below for userptr gpu_vmas.
>>>>>>>>>> +
>>>>>>>>>> +At eviction time we now need to invalidate *all* gpu_vmas of a shared
>>>>>>>>>> +object, but we can no longer be certain that we hold the gpu_vm's
>>>>>>>>>> +dma_resv of all the object's gpu_vmas. We can only be certain that we
>>>>>>>>> I need to think a bit more about locking of extobj and evicted
>>>>>>>>> object tracking
>>>>>>>>> in the case of processing 'drm_gpuva_ops' directly through callbacks
>>>>>>>>> within the
>>>>>>>>> fence signalling critical path as mentioend in [1].
>>>>>>>>>
>>>>>>>>> In order to support that, we'd need to protect extobjs with a
>>>>>>>>> separate lock,
>>>>>>>>> and while iterating extobjs to acquire the dma-resv lock drop the
>>>>>>>>> lock within
>>>>>>>>> the loop before we actually acquire the dma-resv lock. Maple tree
>>>>>>>>> supports that
>>>>>>>>> already and this can be fully done within the GPUVA manager; no need
>>>>>>>>> for the
>>>>>>>>> driver to care about that.
>>>>>>>> So do I understand correctly that this because you want to update the
>>>>>>>> gpuvm state while operations are progressing asynchronously?
>>>>>>>>
>>>>>>>> If so, I wonder whether that could really be done? For example to
>>>>>>>> allocate enough memory for page-tables etc, you need to know the
>>>>>>>> details of the operations at IOCTL execution time, and to know the
>>>>>>>> details you need to know the state from the previous operation?
>>>>>>> Right, sync and async bind can't run fully concurrently, but you could
>>>>>>> "inject" a
>>>>>>> sync one between two async ones such that the sync ones executed from
>>>>>>> the IOCTL
>>>>>>> directly while async execution is stalled meanwhile. This would be
>>>>>>> possible because
>>>>>>> the actual drm_gpuva_ops would be calculated within the async
>>>>>>> execution path rather
>>>>>>> than in the IOCTL. But yes, page-table management must be desinged to
>>>>>>> support that.
>>>>> FWIW, the panthor driver is designed this way (note that I'm not
>>>>> supporting GEM eviction yet, so there might be subtleties I missed).
>>>> The problem is that once you've published your VM_BIND out-fence, any
>>>> code path required to signal that fence may notallocate memory nor or
>>>> grab any locks that allows allocating memory while held including
>>>> dma_resv locks, and that means all required page-table memory needs to
>>>> be allocated synchronously in the IOCTL,
>>> Yep, that's already what I do, by over-provisioning for the worst case
>>> scenario (page table tree is empty), and returning unused pages after
>>> the operation is done.
>>>   
>>>> and all evicted bos need to be
>>>> made resident in the IOCTL,
>>> Yep, I'm pinning memory to BOs in that path too.
>>>   
>>>> and at least in the xe driver the amount of
>>>> memory we need to allocate depends on the vm state, so we can't really
>>>> update the vm state asynchronously either.
>>> For Mali, we can calculate the maximum amount of pages we'll need for a
>>> MAP operation, by assuming the page table is empty. Then it's just a
>>> matter of returning unused pages to a fast-alloc pool so we can
>>> speed-up further page table allocations (we're using a kmem_cache here,
>>> since the page table update is done by the CPU and memory is shared on
>>> Arm, but there's no reason you can't have your own cache
>>> implementation).
>>>   
>>>> But as long as any async binding work required for signalling the
>>>> VM_BIND out-fence is properly annotated with
>>>> dma_fence_begin_signalling() and dma_fence_end_signalling() and there
>>>> aren't any lockdep splats, things should be good. It would trigger on
>>>> both memory allocation and attempts to grab a dma_resv lock.
>>> I have dma_fence_{begin,end}_signalling() annotations in the
>>> ::run_job() path, and no lockdep complaint spotted so far.
>>>   
>>>>   
>>>>>      
>>>>>> OK, well one of the main motivations for Xe is to be able to pipeline
>>>>>> interleaving binds and execs if needed, like so:
>>>>>>
>>>>>> - Bind vmas for scene 1.
>>>>>> - Submit scene 1.
>>>>>> - Unbind vmas for scene 1.
>>>>>> - Bind vmas for scene 2.
>>>>>> - Submit scene 2.
>>>>>> - Unbind vmas for scene 2.
>>>>>>
>>>>>> And being able to *submit* all of the above while the async binding of
>>>>>> vmas for scene (step 1) has not yet completed.
>>>>>> I can't really see how this could be done, while obeying dma-fence
>>>>>> rules, unless state is updated synchronously while submitting?
>>>>> The idea in this case is to detect when a GPU job dependency is a
>>>>> VM_BIND out-fence, turn drm_sched_fence->parent into an
>>>>> xxx_vm_bind_job_fence object that's holding the GEM that's about to be
>>>>> mapped (AFAICT, we don't need to do anything for unmap operations), and
>>>>> then add our GPU job fence to this BO. This should not only guarantee
>>>>> that the GEMs we depend on are mapped before the GPU job is executed
>>>>> (the fence wait does that), but also that such yet-to-be-mapped GEMs
>>>>> won't be evicted just after they've been mapped and before the GPU had
>>>>> a chance to execute (unless I'm missing something, adding our GPU job
>>>>> fence to the BO being targeted by a pending VM_BIND(async,map) operation
>>>>> solves this problem).
>>> It's not exactly that, because we'd need to add a GEMs of all the
>>> pending VM_BIND(map) jobs that come before the expressed dependency, not
>>> just the one attached to the dependency itself. But after chatting with
>>> Danilo, I realized we might not even need to track the GEMs being
>>> mapped at the fence level if we call drm_gpuva_extobj_insert() in the
>>> ioctl(VM_BIND) path:
>>>
>>> - drm_gpuva_extobj_insert() will make sure the GEM is added to
>>>     the ext-object map even before it's actually mapped to the VM (for
>>>     private GEMs, it doesn't matter, because they are using the VM resv,
>>>     so any private GEM mapped will automatically receive the VM resv
>>>     updates).
>>>
>>> Now, when a GPU job is queued, we do all the VM GEM preparation, which
>>> includes the following steps:
>>>
>>> - drm_gpuva_manager_validate() will make already-bound-but-evicted GEMs
>>>     resident
>>> - Iterate over all ext-objs to add our fence (I'm skipping the slot
>>>     reservation step that's implied). Because drm_gpuva_extobj_insert()
>>>     was called early, we also get all the GEMs that are not yet mapped,
>>>     but are about to be mapped. This means they won't be evicted until
>>>     after our job is done
>>> - add our fence to the VM resv
>>>
>>> Unless I'm missing something, this should guarantee that all GEMs are
>>> resident and mapped when the job is executed.
>>>   
>>>> Yes, we're essentially doing the same. The issue here is that when we,
>>>> for example *submit* Bind vmas for scene 2,
>>>> we need to know how much page-table memory to allocate,
>>> This is solved with over-provisioning in our case.
>>>   
>>>> and what BOs to
>>>> make resident to be able to publish the out-fence.
>>> That's basically what Danilo's latest gpuva_mgr patchset tries to
>>> provide generic helpers for, by exposing functions to iterate over all
>>> evicted GEMs (so we can make them resident) and adding a way to add
>>> fences to all GEMs currently bound to the VM. That leaves external GEMs
>>> that are about to be mapped, which, I think, is addressed by the
>>> solution detailed above.
>>>   
>>>> That means we need to
>>>> know what the VM state would look like at the end of "Unbind vmas for
>>>> scene 1".
>>> Not necessarily, as long as you know all the GEMs that are currently
>>> mapped and those that are about to be mapped. The extobj set provides
>>> exactly that for external GEMs.
>>>   
>>>> If the VM state is updated at submission time, that's all ok
>>>> but if it's updated at execution time, we'd have to guess what resources
>>>> to pre-allocate.
>>> As long as you have enough resources pre-allocated to do the VM update
>>> (not saying this is easy to guess on Intel, but it's doable on Mali,
>>> and the page table caching makes over-provisioning not too bad, as long
>>> as we limit the number of in-flight VM_BIND jobs).
>> OK, then it sounds we're on the same page. I guess it would i theory be
>> possible to pre-allocate all needed resources on xe as well, but if the
>> vm state lock is made an inner lock in order for us to be able to grab
>> it within the dma-fence critical section, then it comes with a number of
>> drawbacks as well:
>> * Over-allocation of resources.
>> * Need to spawn a cpu-thread for the async part (currently we utilize
>> the GPU for that).
> I guess the async CPU part is the logic returning unused resources to
> the cache. You can use a work item/wq for that instead of a thread, but
> yes, there's some work to be done on the CPU, indeed.
>
>> * Probably looking at locking inversions wrt userptr?
>> * Probably looking at locking inversions wrt recoverable pagefaults?
> Okay, I clearly didn't look at userptr, and I briefly looked at
> alloc-on-fault but didn't finish/test my implementation since I didn't
> have a use case for it.
>
> For the use cases we have, we only need to take the VM lock (the lock
> protecting the VM state) when executing a VM operation (map/unmap), and
> that's in the dma-signalling path where we do no allocation (thanks
> to the pre-allocation logic) and no attempt to acquire a resv lock.
>
> Tbh, I'm not even sure we'd need a lock if that wasn't for the debugfs
> gpuva dumper, because drm_sched makes it so VM operations are
> serialized (VM ops happen on the CPU, and there's one thread dequeuing
> drm_sched_jobs).
>
> The extobj set is protected using another lock in Danilo's
> implementation (and my open-coded implementation did something similar,
> though slightly broken apparently), so maybe that's the one you're
> worried about.
>
>> * Mismatch with the cpu mmap() / munmap() interface where the mmap_sem
>> is the outermost lock.
>>
>> So for us currently it currently looks like the sync state update is the
>> preferred one...
> Just to clarify things, I'm not trying to convince you to use the async
> model, just saying that's what we went for, and, at first glance, it
> didn't seem completely insane to me. But if there's something
> fundamentally broken in this approach, I think I'd like to figure it
> out early, so thanks for all your inputs :-).

Yeah I think these discussions are really beneficial and helps at least 
me understand other choices as well, and if it can help providing good 
input to the design of the GPUVA manager, thats  a great benefit as well.

In the end, we have a task to document the VM_BIND locking as part of 
merging Xe, and if there are different flavours, I'll look at 
documenting these to, and in what manner they differ.

Thanks,

Thomas


>
> Regards,
>
> Boris

[v2] Documentation/gpu: VM_BIND locking document

Commit Message

Comments

Patch