Message ID | 20250213021112.1228481-28-matthew.brost@intel.com
---|---
State | New
Series | Introduce GPU SVM and Xe SVM implementation
On 13-02-2025 07:41, Matthew Brost wrote: > Migration is implemented with range granularity, with VRAM backing being > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM > SVM range is migrated to SRAM, the TTM BO is destroyed). > > The design choice for using TTM BO for VRAM backing store, as opposed to > direct buddy allocation, is as follows: > > - DRM buddy allocations are not at page granularity, offering no > advantage over a BO. > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to > evict each other). > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly > require a dma-resv. > - Likely allocation size is 2M which makes of size of BO (872) > acceptable per allocation (872 / 2M == .0004158). > > With this, using TTM BO for VRAM backing store seems to be an obvious > choice as it allows leveraging of the TTM eviction code. > > Current migration policy is migrate any SVM range greater than or equal > to 64k once. > > [1] https://patchwork.freedesktop.org/series/133643/ > > v2: > - Rebase on latest GPU SVM > - Retry page fault on get pages returning mixed allocation > - Use drm_gpusvm_devmem > v3: > - Use new BO flags > - New range structure (Thomas) > - Hide migration behind Kconfig > - Kernel doc (Thomas) > - Use check_pages_threshold > v4: > - Don't evict partial unmaps in garbage collector (Thomas) > - Use %pe to print errors (Thomas) > - Use %p to print pointers (Thomas) > v5: > - Use range size helper (Thomas) > - Make BO external (Thomas) > - Set tile to NULL for BO creation (Thomas) > - Drop BO mirror flag (Thomas) > - Hold BO dma-resv lock across migration (Auld, Thomas) > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/xe/xe_svm.c | 111 ++++++++++++++++++++++++++++++++++-- > drivers/gpu/drm/xe/xe_svm.h | 5 ++ > 2 files changed, 112 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c > index 0a78a838508c..2e1e0f31c1a8 100644 > --- a/drivers/gpu/drm/xe/xe_svm.c > +++ b/drivers/gpu/drm/xe/xe_svm.c > @@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct xe_svm_range *range) > return drm_gpusvm_range_end(&range->base); > } > > +static unsigned long xe_svm_range_size(struct xe_svm_range *range) > +{ > + return drm_gpusvm_range_size(&range->base); > +} > + > static void *xe_svm_devm_owner(struct xe_device *xe) > { > return xe; > @@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio > return 0; > } > > -__maybe_unused > static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = { > .devmem_release = xe_svm_devmem_release, > .populate_devmem_pfn = xe_svm_populate_devmem_pfn, > @@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, > return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id); > } > > +static struct xe_vram_region *tile_to_vr(struct xe_tile *tile) > +{ > + return &tile->mem.vram; > +} > + > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile, > + struct xe_svm_range *range, > + const struct drm_gpusvm_ctx *ctx) > +{ > + struct mm_struct *mm = vm->svm.gpusvm.mm; > + struct xe_vram_region *vr = tile_to_vr(tile); > + struct drm_buddy_block *block; > + struct list_head *blocks; > + struct xe_bo *bo; > + ktime_t end = 0; > + int err; > + > + if (!mmget_not_zero(mm)) > + return ERR_PTR(-EFAULT); > + mmap_read_lock(mm); > + > +retry: 
> + bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL, > + xe_svm_range_size(range), > + ttm_bo_type_device, > + XE_BO_FLAG_VRAM_IF_DGFX(tile)); > + if (IS_ERR(bo)) { > + err = PTR_ERR(bo); > + if (xe_vm_validate_should_retry(NULL, err, &end)) > + goto retry; > + goto unlock; > + } > + > + drm_gpusvm_devmem_init(&bo->devmem_allocation, > + vm->xe->drm.dev, mm, > + &gpusvm_devmem_ops, > + &tile->mem.vram.dpagemap, > + xe_svm_range_size(range)); > + > + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks; > + list_for_each_entry(block, blocks, link) > + block->private = vr; > + > + /* > + * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the > + * creation ref can be dropped upon CPU fault or unmap. > + */ > + xe_bo_get(bo); > + > + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base, > + &bo->devmem_allocation, ctx); > + xe_bo_unlock(bo); > + if (err) { > + xe_bo_put(bo); /* Local ref */ > + xe_bo_put(bo); /* Creation ref */ > + bo = ERR_PTR(err); > + } > + > +unlock: > + mmap_read_unlock(mm); > + mmput(mm); > + > + return bo; > +} > + > /** > * xe_svm_handle_pagefault() - SVM handle page fault > * @vm: The VM. > @@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, > * @fault_addr: The GPU fault address. > * @atomic: The fault atomic access bit. > * > - * Create GPU bindings for a SVM page fault. > + * Create GPU bindings for a SVM page fault. Optionally migrate to device > + * memory. > * > * Return: 0 on success, negative error code on error. > */ > @@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > struct xe_tile *tile, u64 fault_addr, > bool atomic) > { > - struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), }; > + struct drm_gpusvm_ctx ctx = { > + .read_only = xe_vma_read_only(vma), > + .devmem_possible = IS_DGFX(vm->xe) && > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR), > + .check_pages_threshold = IS_DGFX(vm->xe) && > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? SZ_64K : 0, > + }; > struct xe_svm_range *range; > struct drm_gpusvm_range *r; > struct drm_exec exec; > struct dma_fence *fence; > + struct xe_bo *bo = NULL; > ktime_t end = 0; > int err; > > @@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma)); > > retry: > + xe_bo_put(bo); > + bo = NULL; > + > /* Always process UNMAPs first so view SVM ranges is current */ > err = xe_svm_garbage_collector(vm); > if (err) > @@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > if (xe_svm_range_is_valid(range, tile)) > return 0; > > + /* XXX: Add migration policy, for now migrate range once */ > + if (!range->migrated && range->base.flags.migrate_devmem && > + xe_svm_range_size(range) >= SZ_64K) { > + range->migrated = true; shouldn't this be set to true, only once xe_svm_alloc_vram is successfull ? In case of bo alloc failure retry wont enter here. 
> + > + bo = xe_svm_alloc_vram(vm, tile, range, &ctx); > + if (IS_ERR(bo)) { > + drm_info(&vm->xe->drm, > + "VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n", > + vm->usm.asid, bo); > + bo = NULL; > + goto retry; > + } > + } > + > err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx); > - if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have changed */ > + /* Corner where CPU mappings have changed */ > + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { > + if (err == -EOPNOTSUPP) > + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base); > + drm_info(&vm->xe->drm, > + "Get pages failed, falling back to retrying, asid=%u, gpusvm=%p, errno %pe\n", > + vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err)); > goto retry; > + } > if (err) > goto err_out; > > @@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > dma_fence_put(fence); > > err_out: > + xe_bo_put(bo); > > return err; > } > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h > index 0fa525d34987..984a61651d9e 100644 > --- a/drivers/gpu/drm/xe/xe_svm.h > +++ b/drivers/gpu/drm/xe/xe_svm.h > @@ -35,6 +35,11 @@ struct xe_svm_range { > * range. Protected by GPU SVM notifier lock. > */ > u8 tile_invalidated; > + /** > + * @migrated: Range has been migrated to device memory, protected by > + * GPU fault handler locking. > + */ > + u8 migrated :1; > }; > > int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr);
On Thu, Feb 13, 2025 at 11:58:27PM +0530, Ghimiray, Himal Prasad wrote: > > > On 13-02-2025 07:41, Matthew Brost wrote: > > Migration is implemented with range granularity, with VRAM backing being > > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the > > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM > > SVM range is migrated to SRAM, the TTM BO is destroyed). > > > > The design choice for using TTM BO for VRAM backing store, as opposed to > > direct buddy allocation, is as follows: > > > > - DRM buddy allocations are not at page granularity, offering no > > advantage over a BO. > > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to > > evict each other). > > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly > > require a dma-resv. > > - Likely allocation size is 2M which makes of size of BO (872) > > acceptable per allocation (872 / 2M == .0004158). > > > > With this, using TTM BO for VRAM backing store seems to be an obvious > > choice as it allows leveraging of the TTM eviction code. > > > > Current migration policy is migrate any SVM range greater than or equal > > to 64k once. > > > > [1] https://patchwork.freedesktop.org/series/133643/ > > > > v2: > > - Rebase on latest GPU SVM > > - Retry page fault on get pages returning mixed allocation > > - Use drm_gpusvm_devmem > > v3: > > - Use new BO flags > > - New range structure (Thomas) > > - Hide migration behind Kconfig > > - Kernel doc (Thomas) > > - Use check_pages_threshold > > v4: > > - Don't evict partial unmaps in garbage collector (Thomas) > > - Use %pe to print errors (Thomas) > > - Use %p to print pointers (Thomas) > > v5: > > - Use range size helper (Thomas) > > - Make BO external (Thomas) > > - Set tile to NULL for BO creation (Thomas) > > - Drop BO mirror flag (Thomas) > > - Hold BO dma-resv lock across migration (Auld, Thomas) > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > --- > > drivers/gpu/drm/xe/xe_svm.c | 111 ++++++++++++++++++++++++++++++++++-- > > drivers/gpu/drm/xe/xe_svm.h | 5 ++ > > 2 files changed, 112 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c > > index 0a78a838508c..2e1e0f31c1a8 100644 > > --- a/drivers/gpu/drm/xe/xe_svm.c > > +++ b/drivers/gpu/drm/xe/xe_svm.c > > @@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct xe_svm_range *range) > > return drm_gpusvm_range_end(&range->base); > > } > > +static unsigned long xe_svm_range_size(struct xe_svm_range *range) > > +{ > > + return drm_gpusvm_range_size(&range->base); > > +} > > + > > static void *xe_svm_devm_owner(struct xe_device *xe) > > { > > return xe; > > @@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio > > return 0; > > } > > -__maybe_unused > > static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = { > > .devmem_release = xe_svm_devmem_release, > > .populate_devmem_pfn = xe_svm_populate_devmem_pfn, > > @@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, > > return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id); > > } > > +static struct xe_vram_region *tile_to_vr(struct xe_tile *tile) > > +{ > > + return &tile->mem.vram; > > +} > > + > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile, > > + struct xe_svm_range *range, > > + const struct drm_gpusvm_ctx *ctx) > > +{ > > + struct mm_struct *mm = vm->svm.gpusvm.mm; > > + struct 
xe_vram_region *vr = tile_to_vr(tile); > > + struct drm_buddy_block *block; > > + struct list_head *blocks; > > + struct xe_bo *bo; > > + ktime_t end = 0; > > + int err; > > + > > + if (!mmget_not_zero(mm)) > > + return ERR_PTR(-EFAULT); > > + mmap_read_lock(mm); > > + > > +retry: > > + bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL, > > + xe_svm_range_size(range), > > + ttm_bo_type_device, > > + XE_BO_FLAG_VRAM_IF_DGFX(tile)); > > + if (IS_ERR(bo)) { > > + err = PTR_ERR(bo); > > + if (xe_vm_validate_should_retry(NULL, err, &end)) > > + goto retry; > > + goto unlock; > > + } > > + > > + drm_gpusvm_devmem_init(&bo->devmem_allocation, > > + vm->xe->drm.dev, mm, > > + &gpusvm_devmem_ops, > > + &tile->mem.vram.dpagemap, > > + xe_svm_range_size(range)); > > + > > + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks; > > + list_for_each_entry(block, blocks, link) > > + block->private = vr; > > + > > + /* > > + * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the > > + * creation ref can be dropped upon CPU fault or unmap. > > + */ > > + xe_bo_get(bo); > > + > > + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base, > > + &bo->devmem_allocation, ctx); > > + xe_bo_unlock(bo); > > + if (err) { > > + xe_bo_put(bo); /* Local ref */ > > + xe_bo_put(bo); /* Creation ref */ > > + bo = ERR_PTR(err); > > + } > > + > > +unlock: > > + mmap_read_unlock(mm); > > + mmput(mm); > > + > > + return bo; > > +} > > + > > /** > > * xe_svm_handle_pagefault() - SVM handle page fault > > * @vm: The VM. > > @@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, > > * @fault_addr: The GPU fault address. > > * @atomic: The fault atomic access bit. > > * > > - * Create GPU bindings for a SVM page fault. > > + * Create GPU bindings for a SVM page fault. Optionally migrate to device > > + * memory. > > * > > * Return: 0 on success, negative error code on error. > > */ > > @@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > struct xe_tile *tile, u64 fault_addr, > > bool atomic) > > { > > - struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), }; > > + struct drm_gpusvm_ctx ctx = { > > + .read_only = xe_vma_read_only(vma), > > + .devmem_possible = IS_DGFX(vm->xe) && > > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR), > > + .check_pages_threshold = IS_DGFX(vm->xe) && > > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? SZ_64K : 0, > > + }; > > struct xe_svm_range *range; > > struct drm_gpusvm_range *r; > > struct drm_exec exec; > > struct dma_fence *fence; > > + struct xe_bo *bo = NULL; > > ktime_t end = 0; > > int err; > > @@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma)); > > retry: > > + xe_bo_put(bo); > > + bo = NULL; > > + > > /* Always process UNMAPs first so view SVM ranges is current */ > > err = xe_svm_garbage_collector(vm); > > if (err) > > @@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > if (xe_svm_range_is_valid(range, tile)) > > return 0; > > + /* XXX: Add migration policy, for now migrate range once */ > > + if (!range->migrated && range->base.flags.migrate_devmem && > > + xe_svm_range_size(range) >= SZ_64K) { > > + range->migrated = true; > > > shouldn't this be set to true, only once xe_svm_alloc_vram is successfull ? > In case of bo alloc failure retry wont enter here. > No. 
The point of this is try to migrate once to avoid a live lock case of concurrent CPU and GPU access. Once we have migration policy, I'd suspect memory marked as prefered VRAM only try once too. If memory is marked as VRAM only we'd retry to a limit and if reached kill the app. We can work of those details when that code lands. Until then, trying once seems reasonable. Matt > > + > > + bo = xe_svm_alloc_vram(vm, tile, range, &ctx); > > + if (IS_ERR(bo)) { > > + drm_info(&vm->xe->drm, > > + "VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n", > > + vm->usm.asid, bo); > > + bo = NULL; > > + goto retry; > > + } > > + } > > + > > err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx); > > - if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have changed */ > > + /* Corner where CPU mappings have changed */ > > + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { > > + if (err == -EOPNOTSUPP) > > + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base); > > + drm_info(&vm->xe->drm, > > + "Get pages failed, falling back to retrying, asid=%u, gpusvm=%p, errno %pe\n", > > + vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err)); > > goto retry; > > + } > > if (err) > > goto err_out; > > @@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > dma_fence_put(fence); > > err_out: > > + xe_bo_put(bo); > > return err; > > } > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h > > index 0fa525d34987..984a61651d9e 100644 > > --- a/drivers/gpu/drm/xe/xe_svm.h > > +++ b/drivers/gpu/drm/xe/xe_svm.h > > @@ -35,6 +35,11 @@ struct xe_svm_range { > > * range. Protected by GPU SVM notifier lock. > > */ > > u8 tile_invalidated; > > + /** > > + * @migrated: Range has been migrated to device memory, protected by > > + * GPU fault handler locking. > > + */ > > + u8 migrated :1; > > }; > > int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr); >
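As a rough illustration of the future migration policy sketched in the reply above, the try-once check could grow into a small helper along the following lines. This is purely hypothetical: vram_only, migrate_attempts and the retry limit are illustrative names that do not exist in this series.

    /*
     * Hypothetical sketch of a future policy: "vram_only", "migrate_attempts"
     * and the retry limit are illustrative only, not part of this series.
     */
    #define XE_SVM_VRAM_ONLY_RETRY_LIMIT	3

    static bool xe_svm_range_should_migrate(struct xe_svm_range *range)
    {
    	if (!range->base.flags.migrate_devmem ||
    	    xe_svm_range_size(range) < SZ_64K)
    		return false;

    	/* VRAM-only placement: retry up to a limit, then the fault is fatal */
    	if (range->vram_only)
    		return range->migrate_attempts++ < XE_SVM_VRAM_ONLY_RETRY_LIMIT;

    	/* Preferred-VRAM placement (and today's default): try exactly once */
    	return !range->migrated;
    }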
On 19-02-2025 03:24, Matthew Brost wrote: > On Thu, Feb 13, 2025 at 11:58:27PM +0530, Ghimiray, Himal Prasad wrote: >> >> >> On 13-02-2025 07:41, Matthew Brost wrote: >>> Migration is implemented with range granularity, with VRAM backing being >>> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the >>> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM >>> SVM range is migrated to SRAM, the TTM BO is destroyed). >>> >>> The design choice for using TTM BO for VRAM backing store, as opposed to >>> direct buddy allocation, is as follows: >>> >>> - DRM buddy allocations are not at page granularity, offering no >>> advantage over a BO. >>> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to >>> evict each other). >>> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly >>> require a dma-resv. >>> - Likely allocation size is 2M which makes of size of BO (872) >>> acceptable per allocation (872 / 2M == .0004158). >>> >>> With this, using TTM BO for VRAM backing store seems to be an obvious >>> choice as it allows leveraging of the TTM eviction code. >>> >>> Current migration policy is migrate any SVM range greater than or equal >>> to 64k once. >>> >>> [1] https://patchwork.freedesktop.org/series/133643/ >>> >>> v2: >>> - Rebase on latest GPU SVM >>> - Retry page fault on get pages returning mixed allocation >>> - Use drm_gpusvm_devmem >>> v3: >>> - Use new BO flags >>> - New range structure (Thomas) >>> - Hide migration behind Kconfig >>> - Kernel doc (Thomas) >>> - Use check_pages_threshold >>> v4: >>> - Don't evict partial unmaps in garbage collector (Thomas) >>> - Use %pe to print errors (Thomas) >>> - Use %p to print pointers (Thomas) >>> v5: >>> - Use range size helper (Thomas) >>> - Make BO external (Thomas) >>> - Set tile to NULL for BO creation (Thomas) >>> - Drop BO mirror flag (Thomas) >>> - Hold BO dma-resv lock across migration (Auld, Thomas) >>> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>> --- >>> drivers/gpu/drm/xe/xe_svm.c | 111 ++++++++++++++++++++++++++++++++++-- >>> drivers/gpu/drm/xe/xe_svm.h | 5 ++ >>> 2 files changed, 112 insertions(+), 4 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c >>> index 0a78a838508c..2e1e0f31c1a8 100644 >>> --- a/drivers/gpu/drm/xe/xe_svm.c >>> +++ b/drivers/gpu/drm/xe/xe_svm.c >>> @@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct xe_svm_range *range) >>> return drm_gpusvm_range_end(&range->base); >>> } >>> +static unsigned long xe_svm_range_size(struct xe_svm_range *range) >>> +{ >>> + return drm_gpusvm_range_size(&range->base); >>> +} >>> + >>> static void *xe_svm_devm_owner(struct xe_device *xe) >>> { >>> return xe; >>> @@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio >>> return 0; >>> } >>> -__maybe_unused >>> static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = { >>> .devmem_release = xe_svm_devmem_release, >>> .populate_devmem_pfn = xe_svm_populate_devmem_pfn, >>> @@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, >>> return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id); >>> } >>> +static struct xe_vram_region *tile_to_vr(struct xe_tile *tile) >>> +{ >>> + return &tile->mem.vram; >>> +} >>> + >>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile, >>> + struct xe_svm_range *range, >>> + const struct drm_gpusvm_ctx *ctx) >>> +{ >>> + struct mm_struct 
*mm = vm->svm.gpusvm.mm; >>> + struct xe_vram_region *vr = tile_to_vr(tile); >>> + struct drm_buddy_block *block; >>> + struct list_head *blocks; >>> + struct xe_bo *bo; >>> + ktime_t end = 0; >>> + int err; >>> + >>> + if (!mmget_not_zero(mm)) >>> + return ERR_PTR(-EFAULT); >>> + mmap_read_lock(mm); >>> + >>> +retry: >>> + bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL, >>> + xe_svm_range_size(range), >>> + ttm_bo_type_device, >>> + XE_BO_FLAG_VRAM_IF_DGFX(tile)); >>> + if (IS_ERR(bo)) { >>> + err = PTR_ERR(bo); >>> + if (xe_vm_validate_should_retry(NULL, err, &end)) >>> + goto retry; >>> + goto unlock; >>> + } >>> + >>> + drm_gpusvm_devmem_init(&bo->devmem_allocation, >>> + vm->xe->drm.dev, mm, >>> + &gpusvm_devmem_ops, >>> + &tile->mem.vram.dpagemap, >>> + xe_svm_range_size(range)); >>> + >>> + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks; >>> + list_for_each_entry(block, blocks, link) >>> + block->private = vr; >>> + >>> + /* >>> + * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the >>> + * creation ref can be dropped upon CPU fault or unmap. >>> + */ >>> + xe_bo_get(bo); >>> + >>> + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base, >>> + &bo->devmem_allocation, ctx); >>> + xe_bo_unlock(bo); >>> + if (err) { >>> + xe_bo_put(bo); /* Local ref */ >>> + xe_bo_put(bo); /* Creation ref */ >>> + bo = ERR_PTR(err); >>> + } >>> + >>> +unlock: >>> + mmap_read_unlock(mm); >>> + mmput(mm); >>> + >>> + return bo; >>> +} >>> + >>> /** >>> * xe_svm_handle_pagefault() - SVM handle page fault >>> * @vm: The VM. >>> @@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, >>> * @fault_addr: The GPU fault address. >>> * @atomic: The fault atomic access bit. >>> * >>> - * Create GPU bindings for a SVM page fault. >>> + * Create GPU bindings for a SVM page fault. Optionally migrate to device >>> + * memory. >>> * >>> * Return: 0 on success, negative error code on error. >>> */ >>> @@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>> struct xe_tile *tile, u64 fault_addr, >>> bool atomic) >>> { >>> - struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), }; >>> + struct drm_gpusvm_ctx ctx = { >>> + .read_only = xe_vma_read_only(vma), >>> + .devmem_possible = IS_DGFX(vm->xe) && >>> + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR), >>> + .check_pages_threshold = IS_DGFX(vm->xe) && >>> + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? SZ_64K : 0, >>> + }; >>> struct xe_svm_range *range; >>> struct drm_gpusvm_range *r; >>> struct drm_exec exec; >>> struct dma_fence *fence; >>> + struct xe_bo *bo = NULL; >>> ktime_t end = 0; >>> int err; >>> @@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>> xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma)); >>> retry: >>> + xe_bo_put(bo); >>> + bo = NULL; >>> + >>> /* Always process UNMAPs first so view SVM ranges is current */ >>> err = xe_svm_garbage_collector(vm); >>> if (err) >>> @@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>> if (xe_svm_range_is_valid(range, tile)) >>> return 0; >>> + /* XXX: Add migration policy, for now migrate range once */ >>> + if (!range->migrated && range->base.flags.migrate_devmem && >>> + xe_svm_range_size(range) >= SZ_64K) { >>> + range->migrated = true; >> >> >> shouldn't this be set to true, only once xe_svm_alloc_vram is successfull ? >> In case of bo alloc failure retry wont enter here. >> > > No. 
The point of this is try to migrate once to avoid a live lock case > of concurrent CPU and GPU access. Once we have migration policy, I'd > suspect memory marked as prefered VRAM only try once too. If memory is > marked as VRAM only we'd retry to a limit and if reached kill the app. > We can work of those details when that code lands. Until then, trying > once seems reasonable. I understand what we are trying to achieve here, and it functions well. However, my only concern is member name "migrated" gives the impression that it will be set to true only if the range has been successfully migrated to VRAM. > > Matt > >>> + >>> + bo = xe_svm_alloc_vram(vm, tile, range, &ctx); >>> + if (IS_ERR(bo)) { >>> + drm_info(&vm->xe->drm, >>> + "VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n", >>> + vm->usm.asid, bo); This log is also misleading, we dont retry vram allocation in case of first failure. >>> + bo = NULL; >>> + goto retry; >>> + } >>> + } >>> + >>> err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx); >>> - if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have changed */ >>> + /* Corner where CPU mappings have changed */ >>> + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { >>> + if (err == -EOPNOTSUPP) >>> + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base); >>> + drm_info(&vm->xe->drm, >>> + "Get pages failed, falling back to retrying, asid=%u, gpusvm=%p, errno %pe\n", >>> + vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err)); >>> goto retry; >>> + } >>> if (err) >>> goto err_out; >>> @@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>> dma_fence_put(fence); >>> err_out: >>> + xe_bo_put(bo); >>> return err; >>> } >>> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h >>> index 0fa525d34987..984a61651d9e 100644 >>> --- a/drivers/gpu/drm/xe/xe_svm.h >>> +++ b/drivers/gpu/drm/xe/xe_svm.h >>> @@ -35,6 +35,11 @@ struct xe_svm_range { >>> * range. Protected by GPU SVM notifier lock. >>> */ >>> u8 tile_invalidated; >>> + /** >>> + * @migrated: Range has been migrated to device memory, protected by >>> + * GPU fault handler locking. The range is not necessarily migrated to device memory if migrated is true. >>> + */ >>> + u8 migrated :1; >>> }; >>> int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr); >>
On Wed, Feb 19, 2025 at 08:29:53AM +0530, Ghimiray, Himal Prasad wrote: > > > On 19-02-2025 03:24, Matthew Brost wrote: > > On Thu, Feb 13, 2025 at 11:58:27PM +0530, Ghimiray, Himal Prasad wrote: > > > > > > > > > On 13-02-2025 07:41, Matthew Brost wrote: > > > > Migration is implemented with range granularity, with VRAM backing being > > > > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the > > > > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM > > > > SVM range is migrated to SRAM, the TTM BO is destroyed). > > > > > > > > The design choice for using TTM BO for VRAM backing store, as opposed to > > > > direct buddy allocation, is as follows: > > > > > > > > - DRM buddy allocations are not at page granularity, offering no > > > > advantage over a BO. > > > > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to > > > > evict each other). > > > > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly > > > > require a dma-resv. > > > > - Likely allocation size is 2M which makes of size of BO (872) > > > > acceptable per allocation (872 / 2M == .0004158). > > > > > > > > With this, using TTM BO for VRAM backing store seems to be an obvious > > > > choice as it allows leveraging of the TTM eviction code. > > > > > > > > Current migration policy is migrate any SVM range greater than or equal > > > > to 64k once. > > > > > > > > [1] https://patchwork.freedesktop.org/series/133643/ > > > > > > > > v2: > > > > - Rebase on latest GPU SVM > > > > - Retry page fault on get pages returning mixed allocation > > > > - Use drm_gpusvm_devmem > > > > v3: > > > > - Use new BO flags > > > > - New range structure (Thomas) > > > > - Hide migration behind Kconfig > > > > - Kernel doc (Thomas) > > > > - Use check_pages_threshold > > > > v4: > > > > - Don't evict partial unmaps in garbage collector (Thomas) > > > > - Use %pe to print errors (Thomas) > > > > - Use %p to print pointers (Thomas) > > > > v5: > > > > - Use range size helper (Thomas) > > > > - Make BO external (Thomas) > > > > - Set tile to NULL for BO creation (Thomas) > > > > - Drop BO mirror flag (Thomas) > > > > - Hold BO dma-resv lock across migration (Auld, Thomas) > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > --- > > > > drivers/gpu/drm/xe/xe_svm.c | 111 ++++++++++++++++++++++++++++++++++-- > > > > drivers/gpu/drm/xe/xe_svm.h | 5 ++ > > > > 2 files changed, 112 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c > > > > index 0a78a838508c..2e1e0f31c1a8 100644 > > > > --- a/drivers/gpu/drm/xe/xe_svm.c > > > > +++ b/drivers/gpu/drm/xe/xe_svm.c > > > > @@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct xe_svm_range *range) > > > > return drm_gpusvm_range_end(&range->base); > > > > } > > > > +static unsigned long xe_svm_range_size(struct xe_svm_range *range) > > > > +{ > > > > + return drm_gpusvm_range_size(&range->base); > > > > +} > > > > + > > > > static void *xe_svm_devm_owner(struct xe_device *xe) > > > > { > > > > return xe; > > > > @@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio > > > > return 0; > > > > } > > > > -__maybe_unused > > > > static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = { > > > > .devmem_release = xe_svm_devmem_release, > > > > .populate_devmem_pfn = xe_svm_populate_devmem_pfn, > > > > @@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, > 
> > > return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id); > > > > } > > > > +static struct xe_vram_region *tile_to_vr(struct xe_tile *tile) > > > > +{ > > > > + return &tile->mem.vram; > > > > +} > > > > + > > > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile, > > > > + struct xe_svm_range *range, > > > > + const struct drm_gpusvm_ctx *ctx) > > > > +{ > > > > + struct mm_struct *mm = vm->svm.gpusvm.mm; > > > > + struct xe_vram_region *vr = tile_to_vr(tile); > > > > + struct drm_buddy_block *block; > > > > + struct list_head *blocks; > > > > + struct xe_bo *bo; > > > > + ktime_t end = 0; > > > > + int err; > > > > + > > > > + if (!mmget_not_zero(mm)) > > > > + return ERR_PTR(-EFAULT); > > > > + mmap_read_lock(mm); > > > > + > > > > +retry: > > > > + bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL, > > > > + xe_svm_range_size(range), > > > > + ttm_bo_type_device, > > > > + XE_BO_FLAG_VRAM_IF_DGFX(tile)); > > > > + if (IS_ERR(bo)) { > > > > + err = PTR_ERR(bo); > > > > + if (xe_vm_validate_should_retry(NULL, err, &end)) > > > > + goto retry; > > > > + goto unlock; > > > > + } > > > > + > > > > + drm_gpusvm_devmem_init(&bo->devmem_allocation, > > > > + vm->xe->drm.dev, mm, > > > > + &gpusvm_devmem_ops, > > > > + &tile->mem.vram.dpagemap, > > > > + xe_svm_range_size(range)); > > > > + > > > > + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks; > > > > + list_for_each_entry(block, blocks, link) > > > > + block->private = vr; > > > > + > > > > + /* > > > > + * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the > > > > + * creation ref can be dropped upon CPU fault or unmap. > > > > + */ > > > > + xe_bo_get(bo); > > > > + > > > > + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base, > > > > + &bo->devmem_allocation, ctx); > > > > + xe_bo_unlock(bo); > > > > + if (err) { > > > > + xe_bo_put(bo); /* Local ref */ > > > > + xe_bo_put(bo); /* Creation ref */ > > > > + bo = ERR_PTR(err); > > > > + } > > > > + > > > > +unlock: > > > > + mmap_read_unlock(mm); > > > > + mmput(mm); > > > > + > > > > + return bo; > > > > +} > > > > + > > > > /** > > > > * xe_svm_handle_pagefault() - SVM handle page fault > > > > * @vm: The VM. > > > > @@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, > > > > * @fault_addr: The GPU fault address. > > > > * @atomic: The fault atomic access bit. > > > > * > > > > - * Create GPU bindings for a SVM page fault. > > > > + * Create GPU bindings for a SVM page fault. Optionally migrate to device > > > > + * memory. > > > > * > > > > * Return: 0 on success, negative error code on error. > > > > */ > > > > @@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > > > struct xe_tile *tile, u64 fault_addr, > > > > bool atomic) > > > > { > > > > - struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), }; > > > > + struct drm_gpusvm_ctx ctx = { > > > > + .read_only = xe_vma_read_only(vma), > > > > + .devmem_possible = IS_DGFX(vm->xe) && > > > > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR), > > > > + .check_pages_threshold = IS_DGFX(vm->xe) && > > > > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? 
SZ_64K : 0, > > > > + }; > > > > struct xe_svm_range *range; > > > > struct drm_gpusvm_range *r; > > > > struct drm_exec exec; > > > > struct dma_fence *fence; > > > > + struct xe_bo *bo = NULL; > > > > ktime_t end = 0; > > > > int err; > > > > @@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > > > xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma)); > > > > retry: > > > > + xe_bo_put(bo); > > > > + bo = NULL; > > > > + > > > > /* Always process UNMAPs first so view SVM ranges is current */ > > > > err = xe_svm_garbage_collector(vm); > > > > if (err) > > > > @@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > > > if (xe_svm_range_is_valid(range, tile)) > > > > return 0; > > > > + /* XXX: Add migration policy, for now migrate range once */ > > > > + if (!range->migrated && range->base.flags.migrate_devmem && > > > > + xe_svm_range_size(range) >= SZ_64K) { > > > > + range->migrated = true; > > > > > > > > > shouldn't this be set to true, only once xe_svm_alloc_vram is successfull ? > > > In case of bo alloc failure retry wont enter here. > > > > > > > No. The point of this is try to migrate once to avoid a live lock case > > of concurrent CPU and GPU access. Once we have migration policy, I'd > > suspect memory marked as prefered VRAM only try once too. If memory is > > marked as VRAM only we'd retry to a limit and if reached kill the app. > > We can work of those details when that code lands. Until then, trying > > once seems reasonable. > > I understand what we are trying to achieve here, and it functions well. > However, my only concern is member name "migrated" gives the impression that > it will be set to true only if the range has been successfully migrated to > VRAM. > s/migrated/skip_migrate ? > > > > Matt > > > > > > + > > > > + bo = xe_svm_alloc_vram(vm, tile, range, &ctx); > > > > + if (IS_ERR(bo)) { > > > > + drm_info(&vm->xe->drm, > > > > + "VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n", > > > > + vm->usm.asid, bo); > > This log is also misleading, we dont retry vram allocation in case of first > failure. > How about... "VRAM allocation failed, falling back to retrying fault," Matt > > > > + bo = NULL; > > > > + goto retry; > > > > + } > > > > + } > > > > + > > > > err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx); > > > > - if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have changed */ > > > > + /* Corner where CPU mappings have changed */ > > > > + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { > > > > + if (err == -EOPNOTSUPP) > > > > + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base); > > > > + drm_info(&vm->xe->drm, > > > > + "Get pages failed, falling back to retrying, asid=%u, gpusvm=%p, errno %pe\n", > > > > + vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err)); > > > > goto retry; > > > > + } > > > > if (err) > > > > goto err_out; > > > > @@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, > > > > dma_fence_put(fence); > > > > err_out: > > > > + xe_bo_put(bo); > > > > return err; > > > > } > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h > > > > index 0fa525d34987..984a61651d9e 100644 > > > > --- a/drivers/gpu/drm/xe/xe_svm.h > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h > > > > @@ -35,6 +35,11 @@ struct xe_svm_range { > > > > * range. Protected by GPU SVM notifier lock. 
> > > > */ > > > > u8 tile_invalidated; > > > > + /** > > > > + * @migrated: Range has been migrated to device memory, protected by > > > > + * GPU fault handler locking. > > The range is not necessarily migrated to device memory if migrated is true. > > > > > + */ > > > > + u8 migrated :1; > > > > }; > > > > int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr); > > > >
On 19-02-2025 08:35, Matthew Brost wrote: > On Wed, Feb 19, 2025 at 08:29:53AM +0530, Ghimiray, Himal Prasad wrote: >> >> >> On 19-02-2025 03:24, Matthew Brost wrote: >>> On Thu, Feb 13, 2025 at 11:58:27PM +0530, Ghimiray, Himal Prasad wrote: >>>> >>>> >>>> On 13-02-2025 07:41, Matthew Brost wrote: >>>>> Migration is implemented with range granularity, with VRAM backing being >>>>> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the >>>>> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM >>>>> SVM range is migrated to SRAM, the TTM BO is destroyed). >>>>> >>>>> The design choice for using TTM BO for VRAM backing store, as opposed to >>>>> direct buddy allocation, is as follows: >>>>> >>>>> - DRM buddy allocations are not at page granularity, offering no >>>>> advantage over a BO. >>>>> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to >>>>> evict each other). >>>>> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly >>>>> require a dma-resv. >>>>> - Likely allocation size is 2M which makes of size of BO (872) >>>>> acceptable per allocation (872 / 2M == .0004158). >>>>> >>>>> With this, using TTM BO for VRAM backing store seems to be an obvious >>>>> choice as it allows leveraging of the TTM eviction code. >>>>> >>>>> Current migration policy is migrate any SVM range greater than or equal >>>>> to 64k once. >>>>> >>>>> [1] https://patchwork.freedesktop.org/series/133643/ >>>>> >>>>> v2: >>>>> - Rebase on latest GPU SVM >>>>> - Retry page fault on get pages returning mixed allocation >>>>> - Use drm_gpusvm_devmem >>>>> v3: >>>>> - Use new BO flags >>>>> - New range structure (Thomas) >>>>> - Hide migration behind Kconfig >>>>> - Kernel doc (Thomas) >>>>> - Use check_pages_threshold >>>>> v4: >>>>> - Don't evict partial unmaps in garbage collector (Thomas) >>>>> - Use %pe to print errors (Thomas) >>>>> - Use %p to print pointers (Thomas) >>>>> v5: >>>>> - Use range size helper (Thomas) >>>>> - Make BO external (Thomas) >>>>> - Set tile to NULL for BO creation (Thomas) >>>>> - Drop BO mirror flag (Thomas) >>>>> - Hold BO dma-resv lock across migration (Auld, Thomas) >>>>> >>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>> --- >>>>> drivers/gpu/drm/xe/xe_svm.c | 111 ++++++++++++++++++++++++++++++++++-- >>>>> drivers/gpu/drm/xe/xe_svm.h | 5 ++ >>>>> 2 files changed, 112 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c >>>>> index 0a78a838508c..2e1e0f31c1a8 100644 >>>>> --- a/drivers/gpu/drm/xe/xe_svm.c >>>>> +++ b/drivers/gpu/drm/xe/xe_svm.c >>>>> @@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct xe_svm_range *range) >>>>> return drm_gpusvm_range_end(&range->base); >>>>> } >>>>> +static unsigned long xe_svm_range_size(struct xe_svm_range *range) >>>>> +{ >>>>> + return drm_gpusvm_range_size(&range->base); >>>>> +} >>>>> + >>>>> static void *xe_svm_devm_owner(struct xe_device *xe) >>>>> { >>>>> return xe; >>>>> @@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio >>>>> return 0; >>>>> } >>>>> -__maybe_unused >>>>> static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = { >>>>> .devmem_release = xe_svm_devmem_release, >>>>> .populate_devmem_pfn = xe_svm_populate_devmem_pfn, >>>>> @@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, >>>>> return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id); >>>>> } >>>>> +static struct 
xe_vram_region *tile_to_vr(struct xe_tile *tile) >>>>> +{ >>>>> + return &tile->mem.vram; >>>>> +} >>>>> + >>>>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile, >>>>> + struct xe_svm_range *range, >>>>> + const struct drm_gpusvm_ctx *ctx) >>>>> +{ >>>>> + struct mm_struct *mm = vm->svm.gpusvm.mm; >>>>> + struct xe_vram_region *vr = tile_to_vr(tile); >>>>> + struct drm_buddy_block *block; >>>>> + struct list_head *blocks; >>>>> + struct xe_bo *bo; >>>>> + ktime_t end = 0; >>>>> + int err; >>>>> + >>>>> + if (!mmget_not_zero(mm)) >>>>> + return ERR_PTR(-EFAULT); >>>>> + mmap_read_lock(mm); >>>>> + >>>>> +retry: >>>>> + bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL, >>>>> + xe_svm_range_size(range), >>>>> + ttm_bo_type_device, >>>>> + XE_BO_FLAG_VRAM_IF_DGFX(tile)); >>>>> + if (IS_ERR(bo)) { >>>>> + err = PTR_ERR(bo); >>>>> + if (xe_vm_validate_should_retry(NULL, err, &end)) >>>>> + goto retry; >>>>> + goto unlock; >>>>> + } >>>>> + >>>>> + drm_gpusvm_devmem_init(&bo->devmem_allocation, >>>>> + vm->xe->drm.dev, mm, >>>>> + &gpusvm_devmem_ops, >>>>> + &tile->mem.vram.dpagemap, >>>>> + xe_svm_range_size(range)); >>>>> + >>>>> + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks; >>>>> + list_for_each_entry(block, blocks, link) >>>>> + block->private = vr; >>>>> + >>>>> + /* >>>>> + * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the >>>>> + * creation ref can be dropped upon CPU fault or unmap. >>>>> + */ >>>>> + xe_bo_get(bo); >>>>> + >>>>> + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base, >>>>> + &bo->devmem_allocation, ctx); >>>>> + xe_bo_unlock(bo); >>>>> + if (err) { >>>>> + xe_bo_put(bo); /* Local ref */ >>>>> + xe_bo_put(bo); /* Creation ref */ >>>>> + bo = ERR_PTR(err); >>>>> + } >>>>> + >>>>> +unlock: >>>>> + mmap_read_unlock(mm); >>>>> + mmput(mm); >>>>> + >>>>> + return bo; >>>>> +} >>>>> + >>>>> /** >>>>> * xe_svm_handle_pagefault() - SVM handle page fault >>>>> * @vm: The VM. >>>>> @@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range, >>>>> * @fault_addr: The GPU fault address. >>>>> * @atomic: The fault atomic access bit. >>>>> * >>>>> - * Create GPU bindings for a SVM page fault. >>>>> + * Create GPU bindings for a SVM page fault. Optionally migrate to device >>>>> + * memory. >>>>> * >>>>> * Return: 0 on success, negative error code on error. >>>>> */ >>>>> @@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>>>> struct xe_tile *tile, u64 fault_addr, >>>>> bool atomic) >>>>> { >>>>> - struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), }; >>>>> + struct drm_gpusvm_ctx ctx = { >>>>> + .read_only = xe_vma_read_only(vma), >>>>> + .devmem_possible = IS_DGFX(vm->xe) && >>>>> + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR), >>>>> + .check_pages_threshold = IS_DGFX(vm->xe) && >>>>> + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? 
SZ_64K : 0, >>>>> + }; >>>>> struct xe_svm_range *range; >>>>> struct drm_gpusvm_range *r; >>>>> struct drm_exec exec; >>>>> struct dma_fence *fence; >>>>> + struct xe_bo *bo = NULL; >>>>> ktime_t end = 0; >>>>> int err; >>>>> @@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>>>> xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma)); >>>>> retry: >>>>> + xe_bo_put(bo); >>>>> + bo = NULL; >>>>> + >>>>> /* Always process UNMAPs first so view SVM ranges is current */ >>>>> err = xe_svm_garbage_collector(vm); >>>>> if (err) >>>>> @@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>>>> if (xe_svm_range_is_valid(range, tile)) >>>>> return 0; >>>>> + /* XXX: Add migration policy, for now migrate range once */ >>>>> + if (!range->migrated && range->base.flags.migrate_devmem && >>>>> + xe_svm_range_size(range) >= SZ_64K) { >>>>> + range->migrated = true; >>>> >>>> >>>> shouldn't this be set to true, only once xe_svm_alloc_vram is successfull ? >>>> In case of bo alloc failure retry wont enter here. >>>> >>> >>> No. The point of this is try to migrate once to avoid a live lock case >>> of concurrent CPU and GPU access. Once we have migration policy, I'd >>> suspect memory marked as prefered VRAM only try once too. If memory is >>> marked as VRAM only we'd retry to a limit and if reached kill the app. >>> We can work of those details when that code lands. Until then, trying >>> once seems reasonable. >> >> I understand what we are trying to achieve here, and it functions well. >> However, my only concern is member name "migrated" gives the impression that >> it will be set to true only if the range has been successfully migrated to >> VRAM. >> > > s/migrated/skip_migrate ? sounds good to me. > >>> >>> Matt >>> >>>>> + >>>>> + bo = xe_svm_alloc_vram(vm, tile, range, &ctx); >>>>> + if (IS_ERR(bo)) { >>>>> + drm_info(&vm->xe->drm, >>>>> + "VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n", >>>>> + vm->usm.asid, bo); >> >> This log is also misleading, we dont retry vram allocation in case of first >> failure. >> > > How about... > > "VRAM allocation failed, falling back to retrying fault," Yup. > > Matt > >>>>> + bo = NULL; >>>>> + goto retry; >>>>> + } >>>>> + } >>>>> + >>>>> err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx); >>>>> - if (err == -EFAULT || err == -EPERM) /* Corner where CPU mappings have changed */ >>>>> + /* Corner where CPU mappings have changed */ >>>>> + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { >>>>> + if (err == -EOPNOTSUPP) >>>>> + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base); >>>>> + drm_info(&vm->xe->drm, >>>>> + "Get pages failed, falling back to retrying, asid=%u, gpusvm=%p, errno %pe\n", >>>>> + vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err)); >>>>> goto retry; >>>>> + } >>>>> if (err) >>>>> goto err_out; >>>>> @@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma, >>>>> dma_fence_put(fence); >>>>> err_out: >>>>> + xe_bo_put(bo); >>>>> return err; >>>>> } >>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h >>>>> index 0fa525d34987..984a61651d9e 100644 >>>>> --- a/drivers/gpu/drm/xe/xe_svm.h >>>>> +++ b/drivers/gpu/drm/xe/xe_svm.h >>>>> @@ -35,6 +35,11 @@ struct xe_svm_range { >>>>> * range. Protected by GPU SVM notifier lock. >>>>> */ >>>>> u8 tile_invalidated; >>>>> + /** >>>>> + * @migrated: Range has been migrated to device memory, protected by >>>>> + * GPU fault handler locking. 
>> >> The range is not necessarily migrated to device memory if migrated is true. >> >>>>> + */ >>>>> + u8 migrated :1; >>>>> }; >>>>> int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr); >>>> >>
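Taken together, the two follow-ups agreed in this sub-thread, the s/migrated/skip_migrate rename and the clarified fallback message, would presumably land roughly as sketched below. The kernel-doc wording is a guess at the respin, not the posted patch.

    	/* Sketch of the agreed rename; kernel-doc wording is a guess. */
    	/**
    	 * @skip_migrate: Skip migration to device memory on the next GPU
    	 * fault for this range; set once migration has been attempted,
    	 * whether or not it succeeded. Protected by GPU fault handler
    	 * locking.
    	 */
    	u8 skip_migrate :1;

    	/* Fault handler guard with the clarified fallback message */
    	if (!range->skip_migrate && range->base.flags.migrate_devmem &&
    	    xe_svm_range_size(range) >= SZ_64K) {
    		range->skip_migrate = true;

    		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
    		if (IS_ERR(bo)) {
    			drm_info(&vm->xe->drm,
    				 "VRAM allocation failed, falling back to retrying fault, asid=%u, errno %pe\n",
    				 vm->usm.asid, bo);
    			bo = NULL;
    			goto retry;
    		}
    	}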
On Wed, 2025-02-12 at 18:11 -0800, Matthew Brost wrote: > Migration is implemented with range granularity, with VRAM backing > being > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of > the > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM > SVM range is migrated to SRAM, the TTM BO is destroyed). > > The design choice for using TTM BO for VRAM backing store, as opposed > to > direct buddy allocation, is as follows: > > - DRM buddy allocations are not at page granularity, offering no > advantage over a BO. > - Unified eviction is required (SVM VRAM and TTM BOs need to be able > to > evict each other). > - For exhaustive eviction [1], SVM VRAM allocations will almost > certainly > require a dma-resv. > - Likely allocation size is 2M which makes of size of BO (872) > acceptable per allocation (872 / 2M == .0004158). > > With this, using TTM BO for VRAM backing store seems to be an obvious > choice as it allows leveraging of the TTM eviction code. > > Current migration policy is migrate any SVM range greater than or > equal > to 64k once. > > [1] https://patchwork.freedesktop.org/series/133643/ > > v2: > - Rebase on latest GPU SVM > - Retry page fault on get pages returning mixed allocation > - Use drm_gpusvm_devmem > v3: > - Use new BO flags > - New range structure (Thomas) > - Hide migration behind Kconfig > - Kernel doc (Thomas) > - Use check_pages_threshold > v4: > - Don't evict partial unmaps in garbage collector (Thomas) > - Use %pe to print errors (Thomas) > - Use %p to print pointers (Thomas) > v5: > - Use range size helper (Thomas) > - Make BO external (Thomas) > - Set tile to NULL for BO creation (Thomas) > - Drop BO mirror flag (Thomas) > - Hold BO dma-resv lock across migration (Auld, Thomas) > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/xe/xe_svm.c | 111 > ++++++++++++++++++++++++++++++++++-- > drivers/gpu/drm/xe/xe_svm.h | 5 ++ > 2 files changed, 112 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/xe/xe_svm.c > b/drivers/gpu/drm/xe/xe_svm.c > index 0a78a838508c..2e1e0f31c1a8 100644 > --- a/drivers/gpu/drm/xe/xe_svm.c > +++ b/drivers/gpu/drm/xe/xe_svm.c > @@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct > xe_svm_range *range) > return drm_gpusvm_range_end(&range->base); > } > > +static unsigned long xe_svm_range_size(struct xe_svm_range *range) > +{ > + return drm_gpusvm_range_size(&range->base); > +} > + > static void *xe_svm_devm_owner(struct xe_device *xe) > { > return xe; > @@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct > drm_gpusvm_devmem *devmem_allocatio > return 0; > } > > -__maybe_unused > static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = { > .devmem_release = xe_svm_devmem_release, > .populate_devmem_pfn = xe_svm_populate_devmem_pfn, > @@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct > xe_svm_range *range, > return (range->tile_present & ~range->tile_invalidated) & > BIT(tile->id); > } > > +static struct xe_vram_region *tile_to_vr(struct xe_tile *tile) > +{ > + return &tile->mem.vram; > +} > + > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct > xe_tile *tile, > + struct xe_svm_range *range, > + const struct drm_gpusvm_ctx > *ctx) > +{ > + struct mm_struct *mm = vm->svm.gpusvm.mm; > + struct xe_vram_region *vr = tile_to_vr(tile); > + struct drm_buddy_block *block; > + struct list_head *blocks; > + struct xe_bo *bo; > + ktime_t end = 0; > + int err; > + > + if (!mmget_not_zero(mm)) > + return 
ERR_PTR(-EFAULT); > + mmap_read_lock(mm); > + > +retry: > + bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL, > + xe_svm_range_size(range), > + ttm_bo_type_device, > + XE_BO_FLAG_VRAM_IF_DGFX(tile)); > + if (IS_ERR(bo)) { > + err = PTR_ERR(bo); > + if (xe_vm_validate_should_retry(NULL, err, &end)) > + goto retry; > + goto unlock; > + } > + > + drm_gpusvm_devmem_init(&bo->devmem_allocation, > + vm->xe->drm.dev, mm, > + &gpusvm_devmem_ops, > + &tile->mem.vram.dpagemap, > + xe_svm_range_size(range)); The way this is done now looks good, although the corresponding multi- device commit https://gitlab.freedesktop.org/thomash/kernel/-/commit/1f204b5d05896e5cf63ab54a679ee343e76600ff Goes one step further and removes the bo return of this function to favor an int error code. The caller doesn't really use the bo, right? > + > + blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)- > >blocks; > + list_for_each_entry(block, blocks, link) > + block->private = vr; > + > + /* > + * Take ref because as soon as drm_gpusvm_migrate_to_devmem > succeeds the > + * creation ref can be dropped upon CPU fault or unmap. > + */ > + xe_bo_get(bo); > + > + err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range- > >base, > + &bo->devmem_allocation, > ctx); > + xe_bo_unlock(bo); > + if (err) { > + xe_bo_put(bo); /* Local ref */ > + xe_bo_put(bo); /* Creation ref */ Strictly, with the drm_gpusvm_devmem_init() succeeding, this bo put should be an xe_svm_devmem_release(), right? While equivalent here, if devmem_init moving forward sets up other resources that are expected to be released at devmem_release time, they'd be leaked. > + bo = ERR_PTR(err); > + } > + > +unlock: > + mmap_read_unlock(mm); > + mmput(mm); > + > + return bo; > +} > + > /** > * xe_svm_handle_pagefault() - SVM handle page fault > * @vm: The VM. > @@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct > xe_svm_range *range, > * @fault_addr: The GPU fault address. > * @atomic: The fault atomic access bit. > * > - * Create GPU bindings for a SVM page fault. > + * Create GPU bindings for a SVM page fault. Optionally migrate to > device > + * memory. > * > * Return: 0 on success, negative error code on error. > */ > @@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, > struct xe_vma *vma, > struct xe_tile *tile, u64 fault_addr, > bool atomic) > { > - struct drm_gpusvm_ctx ctx = { .read_only = > xe_vma_read_only(vma), }; > + struct drm_gpusvm_ctx ctx = { > + .read_only = xe_vma_read_only(vma), > + .devmem_possible = IS_DGFX(vm->xe) && > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR), > + .check_pages_threshold = IS_DGFX(vm->xe) && > + IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? 
> SZ_64K : 0, > + }; > struct xe_svm_range *range; > struct drm_gpusvm_range *r; > struct drm_exec exec; > struct dma_fence *fence; > + struct xe_bo *bo = NULL; > ktime_t end = 0; > int err; > > @@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, > struct xe_vma *vma, > xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma)); > > retry: > + xe_bo_put(bo); > + bo = NULL; > + > /* Always process UNMAPs first so view SVM ranges is current > */ > err = xe_svm_garbage_collector(vm); > if (err) > @@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, > struct xe_vma *vma, > if (xe_svm_range_is_valid(range, tile)) > return 0; > > + /* XXX: Add migration policy, for now migrate range once */ > + if (!range->migrated && range->base.flags.migrate_devmem && > + xe_svm_range_size(range) >= SZ_64K) { > + range->migrated = true; > + > + bo = xe_svm_alloc_vram(vm, tile, range, &ctx); > + if (IS_ERR(bo)) { > + drm_info(&vm->xe->drm, > + "VRAM allocation failed, falling > back to retrying, asid=%u, errno %pe\n", > + vm->usm.asid, bo); Typically we use drm_dbg() to print error messages to avoid spamming the logs and facilitate dynamic debug. Should we change this. /Thomas > + bo = NULL; > + goto retry; > + } > + } > + > err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx); > - if (err == -EFAULT || err == -EPERM) /* Corner where CPU > mappings have changed */ > + /* Corner where CPU mappings have changed */ > + if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) { > + if (err == -EOPNOTSUPP) > + drm_gpusvm_range_evict(&vm->svm.gpusvm, > &range->base); > + drm_info(&vm->xe->drm, > + "Get pages failed, falling back to > retrying, asid=%u, gpusvm=%p, errno %pe\n", > + vm->usm.asid, &vm->svm.gpusvm, > ERR_PTR(err)); > goto retry; > + } > if (err) > goto err_out; > > @@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, > struct xe_vma *vma, > dma_fence_put(fence); > > err_out: > + xe_bo_put(bo); > > return err; > } > diff --git a/drivers/gpu/drm/xe/xe_svm.h > b/drivers/gpu/drm/xe/xe_svm.h > index 0fa525d34987..984a61651d9e 100644 > --- a/drivers/gpu/drm/xe/xe_svm.h > +++ b/drivers/gpu/drm/xe/xe_svm.h > @@ -35,6 +35,11 @@ struct xe_svm_range { > * range. Protected by GPU SVM notifier lock. > */ > u8 tile_invalidated; > + /** > + * @migrated: Range has been migrated to device memory, > protected by > + * GPU fault handler locking. > + */ > + u8 migrated :1; > }; > > int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr);
On Wed, Feb 19, 2025 at 11:30:31AM +0100, Thomas Hellström wrote: > On Wed, 2025-02-12 at 18:11 -0800, Matthew Brost wrote: > > Migration is implemented with range granularity, with VRAM backing > > being > > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of > > the > > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM > > SVM range is migrated to SRAM, the TTM BO is destroyed). > > > > The design choice for using TTM BO for VRAM backing store, as opposed > > to > > direct buddy allocation, is as follows: > > > > - DRM buddy allocations are not at page granularity, offering no > > advantage over a BO. > > - Unified eviction is required (SVM VRAM and TTM BOs need to be able > > to > > evict each other). > > - For exhaustive eviction [1], SVM VRAM allocations will almost > > certainly > > require a dma-resv. > > - Likely allocation size is 2M which makes of size of BO (872) > > acceptable per allocation (872 / 2M == .0004158). > > > > With this, using TTM BO for VRAM backing store seems to be an obvious > > choice as it allows leveraging of the TTM eviction code. > > > > Current migration policy is migrate any SVM range greater than or > > equal > > to 64k once. > > > > [1] https://patchwork.freedesktop.org/series/133643/ > > > > v2: > > - Rebase on latest GPU SVM > > - Retry page fault on get pages returning mixed allocation > > - Use drm_gpusvm_devmem > > v3: > > - Use new BO flags > > - New range structure (Thomas) > > - Hide migration behind Kconfig > > - Kernel doc (Thomas) > > - Use check_pages_threshold > > v4: > > - Don't evict partial unmaps in garbage collector (Thomas) > > - Use %pe to print errors (Thomas) > > - Use %p to print pointers (Thomas) > > v5: > > - Use range size helper (Thomas) > > - Make BO external (Thomas) > > - Set tile to NULL for BO creation (Thomas) > > - Drop BO mirror flag (Thomas) > > - Hold BO dma-resv lock across migration (Auld, Thomas) > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > --- > > drivers/gpu/drm/xe/xe_svm.c | 111 > > ++++++++++++++++++++++++++++++++++-- > > drivers/gpu/drm/xe/xe_svm.h | 5 ++ > > 2 files changed, 112 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c > > b/drivers/gpu/drm/xe/xe_svm.c > > index 0a78a838508c..2e1e0f31c1a8 100644 > > --- a/drivers/gpu/drm/xe/xe_svm.c > > +++ b/drivers/gpu/drm/xe/xe_svm.c > > @@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct > > xe_svm_range *range) > > return drm_gpusvm_range_end(&range->base); > > } > > > > +static unsigned long xe_svm_range_size(struct xe_svm_range *range) > > +{ > > + return drm_gpusvm_range_size(&range->base); > > +} > > + > > static void *xe_svm_devm_owner(struct xe_device *xe) > > { > > return xe; > > @@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct > > drm_gpusvm_devmem *devmem_allocatio > > return 0; > > } > > > > -__maybe_unused > > static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = { > > .devmem_release = xe_svm_devmem_release, > > .populate_devmem_pfn = xe_svm_populate_devmem_pfn, > > @@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct > > xe_svm_range *range, > > return (range->tile_present & ~range->tile_invalidated) & > > BIT(tile->id); > > } > > > > +static struct xe_vram_region *tile_to_vr(struct xe_tile *tile) > > +{ > > + return &tile->mem.vram; > > +} > > + > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct > > xe_tile *tile, > > + struct xe_svm_range *range, > > + const struct drm_gpusvm_ctx > > *ctx) > > +{ 
> > +	struct mm_struct *mm = vm->svm.gpusvm.mm;
> > +	struct xe_vram_region *vr = tile_to_vr(tile);
> > +	struct drm_buddy_block *block;
> > +	struct list_head *blocks;
> > +	struct xe_bo *bo;
> > +	ktime_t end = 0;
> > +	int err;
> > +
> > +	if (!mmget_not_zero(mm))
> > +		return ERR_PTR(-EFAULT);
> > +	mmap_read_lock(mm);
> > +
> > +retry:
> > +	bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL,
> > +				 xe_svm_range_size(range),
> > +				 ttm_bo_type_device,
> > +				 XE_BO_FLAG_VRAM_IF_DGFX(tile));
> > +	if (IS_ERR(bo)) {
> > +		err = PTR_ERR(bo);
> > +		if (xe_vm_validate_should_retry(NULL, err, &end))
> > +			goto retry;
> > +		goto unlock;
> > +	}
> > +
> > +	drm_gpusvm_devmem_init(&bo->devmem_allocation,
> > +			       vm->xe->drm.dev, mm,
> > +			       &gpusvm_devmem_ops,
> > +			       &tile->mem.vram.dpagemap,
> > +			       xe_svm_range_size(range));
>
> The way this is done now looks good, although the corresponding
> multi-device commit
>
> https://gitlab.freedesktop.org/thomash/kernel/-/commit/1f204b5d05896e5cf63ab54a679ee343e76600ff
>
> goes one step further and removes the bo return of this function in
> favor of an int error code. The caller doesn't really use the bo, right?
>

No, it does not. I was ref counting the BO in prior revs, where the bind
code used a hack to look at the BO to get the physical VRAM address. That
hack is gone, so dropping the ref counting and returning an error code
should work.

> > +
> > +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> > +	list_for_each_entry(block, blocks, link)
> > +		block->private = vr;
> > +
> > +	/*
> > +	 * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the
> > +	 * creation ref can be dropped upon CPU fault or unmap.
> > +	 */
> > +	xe_bo_get(bo);
> > +
> > +	err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base,
> > +					   &bo->devmem_allocation, ctx);
> > +	xe_bo_unlock(bo);
> > +	if (err) {
> > +		xe_bo_put(bo);	/* Local ref */
> > +		xe_bo_put(bo);	/* Creation ref */
>
> Strictly, with the drm_gpusvm_devmem_init() succeeding, this bo put
> should be an xe_svm_devmem_release(), right? While equivalent here, if
> devmem_init moving forward sets up other resources that are expected to
> be released at devmem_release time, they'd be leaked.
>

We only call 'zdd->devmem_allocation = devmem_allocation' upon success of
drm_gpusvm_migrate_to_devmem. So if drm_gpusvm_migrate_to_devmem fails, we
need to drop the creation ref here. As noted above, the local ref can be
dropped when converting this function to return an error code.

> > +		bo = ERR_PTR(err);
> > +	}
> > +
> > +unlock:
> > +	mmap_read_unlock(mm);
> > +	mmput(mm);
> > +
> > +	return bo;
> > +}
> > +
> >  /**
> >   * xe_svm_handle_pagefault() - SVM handle page fault
> >   * @vm: The VM.
> > @@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
> >   * @fault_addr: The GPU fault address.
> >   * @atomic: The fault atomic access bit.
> >   *
> > - * Create GPU bindings for a SVM page fault.
> > + * Create GPU bindings for a SVM page fault. Optionally migrate to device
> > + * memory.
> >   *
> >   * Return: 0 on success, negative error code on error.
> >   */
> > @@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> >  			    struct xe_tile *tile, u64 fault_addr,
> >  			    bool atomic)
> >  {
> > -	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
> > +	struct drm_gpusvm_ctx ctx = {
> > +		.read_only = xe_vma_read_only(vma),
> > +		.devmem_possible = IS_DGFX(vm->xe) &&
> > +			IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR),
> > +		.check_pages_threshold = IS_DGFX(vm->xe) &&
> > +			IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? SZ_64K : 0,
> > +	};
> >  	struct xe_svm_range *range;
> >  	struct drm_gpusvm_range *r;
> >  	struct drm_exec exec;
> >  	struct dma_fence *fence;
> > +	struct xe_bo *bo = NULL;
> >  	ktime_t end = 0;
> >  	int err;
> >
> > @@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> >  	xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma));
> >
> >  retry:
> > +	xe_bo_put(bo);
> > +	bo = NULL;
> > +
> >  	/* Always process UNMAPs first so view SVM ranges is current */
> >  	err = xe_svm_garbage_collector(vm);
> >  	if (err)
> > @@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> >  	if (xe_svm_range_is_valid(range, tile))
> >  		return 0;
> >
> > +	/* XXX: Add migration policy, for now migrate range once */
> > +	if (!range->migrated && range->base.flags.migrate_devmem &&
> > +	    xe_svm_range_size(range) >= SZ_64K) {
> > +		range->migrated = true;
> > +
> > +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > +		if (IS_ERR(bo)) {
> > +			drm_info(&vm->xe->drm,
> > +				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n",
> > +				 vm->usm.asid, bo);
>
> Typically we use drm_dbg() to print error messages to avoid spamming
> the logs and to facilitate dynamic debug. Should we change this?
>

Noticed this too. I will change the debug statements in the series to
drm_dbg.

Matt

> /Thomas
>
>
> > +			bo = NULL;
> > +			goto retry;
> > +		}
> > +	}
> > +
> >  	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
> > -	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have changed */
> > +	/* Corner where CPU mappings have changed */
> > +	if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) {
> > +		if (err == -EOPNOTSUPP)
> > +			drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base);
> > +		drm_info(&vm->xe->drm,
> > +			 "Get pages failed, falling back to retrying, asid=%u, gpusvm=%p, errno %pe\n",
> > +			 vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
> >  		goto retry;
> > +	}
> >  	if (err)
> >  		goto err_out;
> >
> > @@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> >  	dma_fence_put(fence);
> >
> >  err_out:
> > +	xe_bo_put(bo);
> >
> >  	return err;
> >  }
> > diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> > index 0fa525d34987..984a61651d9e 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.h
> > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > @@ -35,6 +35,11 @@ struct xe_svm_range {
> >  	 * range. Protected by GPU SVM notifier lock.
> >  	 */
> >  	u8 tile_invalidated;
> > +	/**
> > +	 * @migrated: Range has been migrated to device memory, protected by
> > +	 * GPU fault handler locking.
> > +	 */
> > +	u8 migrated :1;
> >  };
> >
> >  int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr);
>
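To make the direction agreed on above concrete, here is a minimal sketch of
xe_svm_alloc_vram() returning an int error code instead of the BO, with only
the creation reference taken. This is an illustration, not the next revision
of the patch: the mm locking, buddy-block setup, eviction retry, and part of
the dma-resv handling from the posted code are elided, and the final
implementation may differ.

/* Sketch only: error-code return, no BO handed back to the caller. */
static int xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
			     struct xe_svm_range *range,
			     const struct drm_gpusvm_ctx *ctx)
{
	struct xe_bo *bo;
	int err;

	/* Same BO creation as the posted patch, locked for migration. */
	bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL,
				 xe_svm_range_size(range),
				 ttm_bo_type_device,
				 XE_BO_FLAG_VRAM_IF_DGFX(tile));
	if (IS_ERR(bo))
		return PTR_ERR(bo);

	drm_gpusvm_devmem_init(&bo->devmem_allocation, vm->xe->drm.dev,
			       vm->svm.gpusvm.mm, &gpusvm_devmem_ops,
			       &tile->mem.vram.dpagemap,
			       xe_svm_range_size(range));

	err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base,
					   &bo->devmem_allocation, ctx);
	xe_bo_unlock(bo);
	if (err)
		xe_bo_put(bo);	/* Creation ref; on success it is instead
				 * dropped via devmem_release at CPU fault
				 * or unmap time. */

	return err;
}

With no BO handed back, the local reference disappears entirely and the
failure path only drops the creation reference, matching the discussion
above.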
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 0a78a838508c..2e1e0f31c1a8 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -32,6 +32,11 @@ static unsigned long xe_svm_range_end(struct xe_svm_range *range)
 	return drm_gpusvm_range_end(&range->base);
 }
 
+static unsigned long xe_svm_range_size(struct xe_svm_range *range)
+{
+	return drm_gpusvm_range_size(&range->base);
+}
+
 static void *xe_svm_devm_owner(struct xe_device *xe)
 {
 	return xe;
@@ -512,7 +517,6 @@ static int xe_svm_populate_devmem_pfn(struct drm_gpusvm_devmem *devmem_allocatio
 	return 0;
 }
 
-__maybe_unused
 static const struct drm_gpusvm_devmem_ops gpusvm_devmem_ops = {
 	.devmem_release = xe_svm_devmem_release,
 	.populate_devmem_pfn = xe_svm_populate_devmem_pfn,
@@ -592,6 +596,71 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
 	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
 }
 
+static struct xe_vram_region *tile_to_vr(struct xe_tile *tile)
+{
+	return &tile->mem.vram;
+}
+
+static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
+				       struct xe_svm_range *range,
+				       const struct drm_gpusvm_ctx *ctx)
+{
+	struct mm_struct *mm = vm->svm.gpusvm.mm;
+	struct xe_vram_region *vr = tile_to_vr(tile);
+	struct drm_buddy_block *block;
+	struct list_head *blocks;
+	struct xe_bo *bo;
+	ktime_t end = 0;
+	int err;
+
+	if (!mmget_not_zero(mm))
+		return ERR_PTR(-EFAULT);
+	mmap_read_lock(mm);
+
+retry:
+	bo = xe_bo_create_locked(tile_to_xe(tile), NULL, NULL,
+				 xe_svm_range_size(range),
+				 ttm_bo_type_device,
+				 XE_BO_FLAG_VRAM_IF_DGFX(tile));
+	if (IS_ERR(bo)) {
+		err = PTR_ERR(bo);
+		if (xe_vm_validate_should_retry(NULL, err, &end))
+			goto retry;
+		goto unlock;
+	}
+
+	drm_gpusvm_devmem_init(&bo->devmem_allocation,
+			       vm->xe->drm.dev, mm,
+			       &gpusvm_devmem_ops,
+			       &tile->mem.vram.dpagemap,
+			       xe_svm_range_size(range));
+
+	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
+	list_for_each_entry(block, blocks, link)
+		block->private = vr;
+
+	/*
+	 * Take ref because as soon as drm_gpusvm_migrate_to_devmem succeeds the
+	 * creation ref can be dropped upon CPU fault or unmap.
+	 */
+	xe_bo_get(bo);
+
+	err = drm_gpusvm_migrate_to_devmem(&vm->svm.gpusvm, &range->base,
+					   &bo->devmem_allocation, ctx);
+	xe_bo_unlock(bo);
+	if (err) {
+		xe_bo_put(bo);	/* Local ref */
+		xe_bo_put(bo);	/* Creation ref */
+		bo = ERR_PTR(err);
+	}
+
+unlock:
+	mmap_read_unlock(mm);
+	mmput(mm);
+
+	return bo;
+}
+
 /**
  * xe_svm_handle_pagefault() - SVM handle page fault
  * @vm: The VM.
@@ -600,7 +669,8 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
  * @fault_addr: The GPU fault address.
  * @atomic: The fault atomic access bit.
  *
- * Create GPU bindings for a SVM page fault.
+ * Create GPU bindings for a SVM page fault. Optionally migrate to device
+ * memory.
  *
  * Return: 0 on success, negative error code on error.
  */
@@ -608,11 +678,18 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			    struct xe_tile *tile, u64 fault_addr,
 			    bool atomic)
 {
-	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+	struct drm_gpusvm_ctx ctx = {
+		.read_only = xe_vma_read_only(vma),
+		.devmem_possible = IS_DGFX(vm->xe) &&
+			IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR),
+		.check_pages_threshold = IS_DGFX(vm->xe) &&
+			IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR) ? SZ_64K : 0,
+	};
 	struct xe_svm_range *range;
 	struct drm_gpusvm_range *r;
 	struct drm_exec exec;
 	struct dma_fence *fence;
+	struct xe_bo *bo = NULL;
 	ktime_t end = 0;
 	int err;
 
@@ -620,6 +697,9 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma));
 
 retry:
+	xe_bo_put(bo);
+	bo = NULL;
+
 	/* Always process UNMAPs first so view SVM ranges is current */
 	err = xe_svm_garbage_collector(vm);
 	if (err)
@@ -635,9 +715,31 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	if (xe_svm_range_is_valid(range, tile))
 		return 0;
 
+	/* XXX: Add migration policy, for now migrate range once */
+	if (!range->migrated && range->base.flags.migrate_devmem &&
+	    xe_svm_range_size(range) >= SZ_64K) {
+		range->migrated = true;
+
+		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
+		if (IS_ERR(bo)) {
+			drm_info(&vm->xe->drm,
+				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n",
+				 vm->usm.asid, bo);
+			bo = NULL;
+			goto retry;
+		}
+	}
+
 	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
-	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have changed */
+	/* Corner where CPU mappings have changed */
+	if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM) {
+		if (err == -EOPNOTSUPP)
+			drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base);
+		drm_info(&vm->xe->drm,
+			 "Get pages failed, falling back to retrying, asid=%u, gpusvm=%p, errno %pe\n",
+			 vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
 		goto retry;
+	}
 	if (err)
 		goto err_out;
 
@@ -668,6 +770,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	dma_fence_put(fence);
 
 err_out:
+	xe_bo_put(bo);
 
 	return err;
 }
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 0fa525d34987..984a61651d9e 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -35,6 +35,11 @@ struct xe_svm_range {
 	 * range. Protected by GPU SVM notifier lock.
 	 */
 	u8 tile_invalidated;
+	/**
+	 * @migrated: Range has been migrated to device memory, protected by
+	 * GPU fault handler locking.
+	 */
+	u8 migrated :1;
 };
 
 int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr);
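Applying the drm_dbg() feedback from the review to the fault-handler hunk
above, the VRAM-allocation fallback would keep the same "migrate once,
>= 64k" policy and only change the log level. This is a sketch of the agreed
direction, not the posted code:

	/* XXX: Add migration policy, for now migrate range once */
	if (!range->migrated && range->base.flags.migrate_devmem &&
	    xe_svm_range_size(range) >= SZ_64K) {
		range->migrated = true;

		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
		if (IS_ERR(bo)) {
			/* drm_dbg() keeps VRAM-exhaustion fallbacks out of the default log. */
			drm_dbg(&vm->xe->drm,
				"VRAM allocation failed, falling back to retrying, asid=%u, errno %pe\n",
				vm->usm.asid, bo);
			bo = NULL;
			goto retry;
		}
	}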
Migration is implemented with range granularity, with VRAM backing being
a VM private TTM BO (i.e., it shares the dma-resv with the VM). The lifetime
of the TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
SVM range is migrated to SRAM, the TTM BO is destroyed).

The design choice of using a TTM BO for the VRAM backing store, as opposed
to direct buddy allocation, is as follows:

- DRM buddy allocations are not at page granularity, offering no
  advantage over a BO.
- Unified eviction is required (SVM VRAM and TTM BOs need to be able to
  evict each other).
- For exhaustive eviction [1], SVM VRAM allocations will almost certainly
  require a dma-resv.
- The likely allocation size is 2M, which makes the size of a BO (872)
  acceptable per allocation (872 / 2M == .0004158).

With this, using a TTM BO for the VRAM backing store seems to be an obvious
choice as it allows leveraging the TTM eviction code.

The current migration policy is to migrate any SVM range greater than or
equal to 64k once.

[1] https://patchwork.freedesktop.org/series/133643/

v2:
 - Rebase on latest GPU SVM
 - Retry page fault on get pages returning mixed allocation
 - Use drm_gpusvm_devmem
v3:
 - Use new BO flags
 - New range structure (Thomas)
 - Hide migration behind Kconfig
 - Kernel doc (Thomas)
 - Use check_pages_threshold
v4:
 - Don't evict partial unmaps in garbage collector (Thomas)
 - Use %pe to print errors (Thomas)
 - Use %p to print pointers (Thomas)
v5:
 - Use range size helper (Thomas)
 - Make BO external (Thomas)
 - Set tile to NULL for BO creation (Thomas)
 - Drop BO mirror flag (Thomas)
 - Hold BO dma-resv lock across migration (Auld, Thomas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 111 ++++++++++++++++++++++++++++++++++--
 drivers/gpu/drm/xe/xe_svm.h |   5 ++
 2 files changed, 112 insertions(+), 4 deletions(-)
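For reference, the per-allocation overhead quoted above works out as follows,
assuming the 872 figure is the size of the BO object in bytes:

	872 / 2M = 872 / 2,097,152 ≈ 0.0004158

That is roughly 0.04% of each 2M VRAM allocation spent on BO metadata, which
is the basis for the claim that a TTM BO backing store is acceptable compared
to a direct buddy allocation.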