diff mbox series

[RFC,23/28] drm/xe: Add SVM VRAM migration

Message ID 20240828024901.2582335-24-matthew.brost@intel.com (mailing list archive)
State New, archived
Headers show
Series Introduce GPU SVM and Xe SVM implementation | expand

Commit Message

Matthew Brost Aug. 28, 2024, 2:48 a.m. UTC
Migration is implemented with range granularity, with the VRAM backing being
a VM-private TTM BO (i.e., it shares the dma-resv with the VM). The lifetime
of the TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
SVM range is migrated to SRAM, the TTM BO is destroyed).

The design choice of using a TTM BO for the VRAM backing store, as opposed
to direct buddy allocation, is as follows:

- DRM buddy allocations are not at page granularity, offering no
  advantage over a BO.
- DRM buddy allocations do not solve locking inversion problems between
  mmap lock and dma-resv locks.
- Unified eviction is required (SVM VRAM and TTM BOs need to be able to
  evict each other).
- For exhaustive eviction [1], SVM VRAM allocations will almost certainly
  require a dma-resv.
- Likely allocation size is 2M, which makes the size of a BO (872 bytes)
  acceptable overhead per allocation (872 / 2M == .0004158).

With this, using a TTM BO for the VRAM backing store seems to be an obvious
choice, as it allows leveraging the TTM eviction code.

The current migration policy is to migrate any SVM range greater than or
equal to 64k once.

[1] https://patchwork.freedesktop.org/series/133643/

Signed-off-by: Matthew Brost matthew.brost@intel.com
---
 drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_svm.h |  1 +
 2 files changed, 81 insertions(+), 1 deletion(-)

Comments

Daniel Vetter Aug. 28, 2024, 4:06 p.m. UTC | #1
On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> Migration is implemented with range granularity, with VRAM backing being
> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> SVM range is migrated to SRAM, the TTM BO is destroyed).
> 
> The design choice for using TTM BO for VRAM backing store, as opposed to
> direct buddy allocation, is as follows:
> 
> - DRM buddy allocations are not at page granularity, offering no
>   advantage over a BO.

This one I'm not understanding.

> - DRM buddy allocations do not solve locking inversion problems between
>   mmap lock and dma-resv locks.

Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
matters hugely which migration path we're in, i.e. opportunistic
migration, cpu fault where we have to migrate or die, or when we run out
of vram and need to evict stuff to make space.

> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>   evict each other).

So core mm handles this by just roughly equally shrinking everything.
Seems to work, and it has a pile of object shrinkers, and the page lru is
also split into page cache and anon memory.

I think you need to put in more justification for why unified eviction is
required than just stating it, because a look at mm/ gives a very well
established counterexample.

> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
>   require a dma-resv.

So from the TTM side we need exhaustive eviction, or at least something a
bit more exhaustive than what ttm currently has. Note that i915-gem also
never really got to perfect exhaustive eviction; it's just a pile better
than ttm right now.

Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
eviction is going to win because the shrinkers can only trylock dma_resv.
So this part works. It actually works so well on the system memory side
that if we're not careful we can trigger oom, because we're too good at
getting at all the memory.

SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
don't see why, because the idea is that thanks to gpu and cpu page faults,
you can always get out of a pinch by just thrashing everything for a while
and migrating the handful of available pages a lot.

> - Likely allocation size is 2M which makes of size of BO (872)
>   acceptable per allocation (872 / 2M == .0004158).
> 
> With this, using TTM BO for VRAM backing store seems to be an obvious
> choice as it allows leveraging of the TTM eviction code.

Except it requires that you hold dma_resv, which brings in all kinds of
pain. And for eviction we really don't need a lot of synchronization, so a
lot of that locking is not needed, unlike the case where we have a cpu
fault, where we absolutely need mmap_lock and all that to make sure we
fault in the right page.

But for eviction we only need to throw out some pages; if we're not
entirely precise with picking the right ones (or have no idea into which
vma they're all currently mapped) it doesn't matter. That's why
migrate_device_pages doesn't care about any of that at all, it doesn't
need to by design. But with bo-backed memory you drag in all that stuff
that's causing headaches for eviction.
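
For reference, such a VA-agnostic eviction path looks roughly like the
sketch below, using the upstream migrate_device_* helpers. This is a sketch
only: alloc_sram_page() and the copy step are hypothetical placeholders,
not Xe or gpusvm code, and error handling for the placeholder allocator is
elided.

static int evict_vram_range(unsigned long dev_pfn_first, unsigned long npages)
{
	unsigned long *src, *dst;
	unsigned long i;
	int err = -ENOMEM;

	src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst)
		goto out;

	/* Trylocks and isolates the device pages; no VA or vma lookup. */
	err = migrate_device_range(src, dev_pfn_first, npages);
	if (err)
		goto out;

	for (i = 0; i < npages; i++) {
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;	/* busy (referenced/locked), just skip it */

		/* alloc_sram_page() stands in for the real allocator */
		dst[i] = migrate_pfn(page_to_pfn(alloc_sram_page()));
		/* copy migrate_pfn_to_page(src[i]) -> dst page here */
	}

	/* Swap in the new pages and release the old device pages. */
	migrate_device_pages(src, dst, npages);
	migrate_device_finalize(src, dst, npages);
	err = 0;
out:
	kvfree(src);
	kvfree(dst);
	return err;
}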

The only thing migration tries to do is remove all ptes, and if that
succeeds, move the page. Specialized for the gpusvm case, looking at mm/
code as a cheat sheet, we need roughly:

- reverse mapping structure like anon_vma. Except gpusvm can assume that
  there's currently only one gpu side mapping, so we can just stuff the
  gpusvm and a va_address into the page, and protect it with the page lock
  (see the sketch after this list).

- we need pagetable locks, so that we can manipulate pagetables (well
  specifically make ptes invalid) without taking any other locks.

- everyone else inserting or removing ptes for svm mappings also needs to
  lock the page, or we have races. This might be the hmm_range_fault races
  you're seeing when allowing vram pages, since I don't think there's
  anything else stopping the page lookup otherwise from succeeding.

- we might also need to stuff migrate ptes into the gpu side, like the cpu
  does, to hold up refaults before the migration has finished. But I think
  those are only needed for anon memory in sram because there's no other
  way to find the right page than swap pte entries, of which migration
  entries are a special case.

- core code also expects us to handle the page refcount correctly for svm
  device memory, so we can't free the pages like normal bo pages either
  directly to drm_buddy.
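
To make the first two items a bit more concrete, a minimal sketch: every
struct and helper name below is made up for illustration; only
page->zone_device_data, the page lock, and trylock_page() are real kernel
interfaces.

/* Hypothetical reverse mapping: exactly one gpu-side mapping per page. */
struct gpusvm_rmap {
	struct drm_gpusvm *gpusvm;	/* owning gpusvm instance */
	unsigned long gpu_va;		/* the single gpu VA backed by this page */
};

/* Fault path: install the rmap while holding the page lock. */
static void gpusvm_page_set_rmap(struct page *page, struct gpusvm_rmap *rmap)
{
	VM_BUG_ON_PAGE(!PageLocked(page), page);
	page->zone_device_data = rmap;
}

/*
 * Migration/eviction path: trylock only, mirroring the mm/ convention, so
 * the page lock nests inside the pagetable lock without inversions.
 */
static struct gpusvm_rmap *gpusvm_page_trylock_rmap(struct page *page)
{
	if (!trylock_page(page))
		return NULL;	/* caller skips or retries this page */
	return page->zone_device_data;
}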

Now typing this all up will look an awful lot like what you have, with the
dma_resv lock serving as the page lock and the pagetable lock. The only
reason to do so is that these locks are much smaller, nest within all the
other stuff going on, and so avoid the inversion issues.

So one annoying part is that this is a lot of pointless-looking typing.
The other is that it's full of races, because core mm really is yolo all
the way down. So there are lots of ways to lock the wrong page and fun
stuff like that, but the few cases that matter work:

- svm fault handling with hmm_range_fault retries with mmu notifiers. Note
  that we need to have vram pages locked and the notifier retry needs to
  be under the pagetable lock, or there's room to escape (see the sketch
  after this list). At least that's what I came up with last time I thought
  it all through.

- migrate_to_ram: it will hold a page reference which we know was the
  valid vram page when the cpu pte was locked, but it might not be it
  anymore. So we have to lock the page and check whether it's still gpu
  mapped, and if not retry the entire fault since most likely another
  migrate_to_ram has succeeded meanwhile in parallel.

- for eviction we don't care, we might actually be migrating a page no one
  even wants anymore.
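
For the first item, the retry pattern with the upstream
mmu_interval_notifier / hmm_range_fault() API looks roughly like the sketch
below; passing the driver-side "pagetable lock" (gpusvm->notifier_lock,
amdkfd's svm_range_lock) in as a plain mutex is an assumption for
illustration.

/* Sketch: only commit gpu ptes if no notifier invalidation ran since. */
static int svm_fault_and_bind(struct mm_struct *mm, struct mutex *pt_lock,
			      struct mmu_interval_notifier *notifier,
			      struct hmm_range *range)
{
	int err;

again:
	range->notifier_seq = mmu_interval_read_begin(notifier);

	mmap_read_lock(mm);
	err = hmm_range_fault(range);
	mmap_read_unlock(mm);
	if (err) {
		if (err == -EBUSY)
			goto again;
		return err;
	}

	mutex_lock(pt_lock);
	if (mmu_interval_read_retry(notifier, range->notifier_seq)) {
		/* An invalidation raced us; drop the lock and start over. */
		mutex_unlock(pt_lock);
		goto again;
	}

	/* Safe to program gpu ptes from range->hmm_pfns[] here. */

	mutex_unlock(pt_lock);
	return 0;
}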

Now I think you can get all this done with the dma_resv lock and maybe the
bo refcount. But it does involve a tremendous amount of headaches and
impedance mismatch, because that's not how page faults and migrations
work in core mm.

Cheers, Sima

> Current migration policy is migrate any SVM range greater than or equal
> to 64k once.
> 
> [1] https://patchwork.freedesktop.org/series/133643/
> 
> Signed-off-by: Matthew Brost matthew.brost@intel.com
> ---
>  drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_svm.h |  1 +
>  2 files changed, 81 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 4372c02a341f..fd8987e0a506 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>  static int __xe_svm_garbage_collector(struct xe_vm *vm,
>  				      struct xe_svm_range *range)
>  {
> +	struct drm_gpusvm_ctx ctx = {};
>  	struct dma_fence *fence;
>  
> +	/* Evict any pages holding references to vram allocation */
> +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
> +
>  	xe_vm_lock(vm, false);
>  	fence = xe_vm_range_unbind(vm, range);
>  	xe_vm_unlock(vm);
> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
>  	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>  }
>  
> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> +{
> +	return &tile->mem.vram;
> +}
> +
> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
> +				       struct xe_svm_range *range,
> +				       const struct drm_gpusvm_ctx *ctx)
> +{
> +	struct xe_mem_region *mr = tile_to_mr(tile);
> +	struct drm_buddy_block *block;
> +	struct list_head *blocks;
> +	struct xe_bo *bo;
> +	ktime_t end = 0;
> +	int err;
> +
> +retry:
> +	xe_vm_lock(vm, false);
> +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
> +			  range->base.va.start, ttm_bo_type_device,
> +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> +			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
> +	xe_vm_unlock(vm);
> +	if (IS_ERR(bo)) {
> +		err = PTR_ERR(bo);
> +		if (xe_vm_validate_should_retry(NULL, err, &end))
> +			goto retry;
> +		return bo;
> +	}
> +
> +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> +	list_for_each_entry(block, blocks, link)
> +		block->private = mr;
> +
> +	/*
> +	 * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
> +	 * creation ref can be dropped upon CPU fault or unmap.
> +	 */
> +	xe_bo_get(bo);
> +
> +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
> +					 bo, ctx);
> +	if (err) {
> +		xe_bo_put(bo);	/* Local ref */
> +		xe_bo_put(bo);	/* Creation ref */
> +		return ERR_PTR(err);
> +	}
> +
> +	return bo;
> +}
> +
>  int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>  			    struct xe_tile *tile, u64 fault_addr,
>  			    bool atomic)
>  {
> -	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
> +	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
> +		.vram_possible = IS_DGFX(vm->xe), };
>  	struct xe_svm_range *range;
>  	struct drm_gpusvm_range *r;
>  	struct drm_exec exec;
>  	struct dma_fence *fence;
> +	struct xe_bo *bo = NULL;
>  	ktime_t end = 0;
>  	int err;
>  
>  	lockdep_assert_held_write(&vm->lock);
>  
>  retry:
> +	xe_bo_put(bo);
> +	bo = NULL;
> +
>  	/* Always process UNMAPs first so view SVM ranges is current */
>  	err = xe_svm_garbage_collector(vm);
>  	if (err)
> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>  	if (xe_svm_range_is_valid(range, tile))
>  		return 0;
>  
> +	/* XXX: Add migration policy, for now migrate range once */
> +	if (IS_DGFX(vm->xe) && !range->migrated &&
> +	    range->base.flags.migrate_vram &&
> +	    (range->base.va.end - range->base.va.start) >= SZ_64K) {
> +		range->migrated = true;
> +
> +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> +		if (IS_ERR(bo)) {
> +			drm_info(&vm->xe->drm,
> +				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
> +				 vm->usm.asid, PTR_ERR(bo));
> +			bo = NULL;
> +			goto retry;
> +		}
> +	}
> +
>  	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>  	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have change */
>  	       goto retry;
> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>  	dma_fence_put(fence);
>  
>  err_out:
> +	xe_bo_put(bo);
> +
>  	return err;
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 8b72e91cc37d..3f432483a230 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -18,6 +18,7 @@ struct xe_svm_range {
>  	struct list_head garbage_collector_link;
>  	u8 tile_present;
>  	u8 tile_invalidated;
> +	u8 migrated	:1;
>  };
>  
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> -- 
> 2.34.1
>
Daniel Vetter Aug. 28, 2024, 6:22 p.m. UTC | #2
On Wed, Aug 28, 2024 at 06:06:47PM +0200, Daniel Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> > Migration is implemented with range granularity, with VRAM backing being
> > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> > SVM range is migrated to SRAM, the TTM BO is destroyed).
> > 
> > The design choice for using TTM BO for VRAM backing store, as opposed to
> > direct buddy allocation, is as follows:
> > 
> > - DRM buddy allocations are not at page granularity, offering no
> >   advantage over a BO.
> 
> This one I'm not understanding.
> 
> > - DRM buddy allocations do not solve locking inversion problems between
> >   mmap lock and dma-resv locks.
> 
> Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> matters hugely which migration path we're in, i.e. opportunistic
> migration, cpu fault where we have to migrate or die, or when we run out
> of vram and need to evict stuff to make space.
> 
> > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
> >   evict each other).
> 
> So core mm handles this by just roughly equally shrinking everything.
> Seems to work, and it has a pile of object shrinkers, and the page lru is
> also split into page cache and anon memory.
> 
> I think you need to put in more justification that unified eviction is
> required than just stating it, because a look at mm/ gives a very well
> established counterexample.
> 
> > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
> >   require a dma-resv.
> 
> So from the TTM side we need exhaustive eviction, or at least something a
> bit more exhaustive than what ttm currently has. Note that i915-gem also
> never really got to perfect exhaustive eviction, it's just a pile better
> than ttm right now.
> 
> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> eviction is going to win because the shrinkers can only trylock dma_resv.
> So this part works. It actually works so well on the system memory side
> that if we're not careful we can trigger oom, because we're too good at
> getting at all the memory.
> 
> SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> don't see why, because the idea is that thanks to gpu and cpu page faults,
> you can always get out of a pinch by just trashing everything for a while
> and migrating the handfull of available pages a lot.
> 
> > - Likely allocation size is 2M which makes of size of BO (872)
> >   acceptable per allocation (872 / 2M == .0004158).
> > 
> > With this, using TTM BO for VRAM backing store seems to be an obvious
> > choice as it allows leveraging of the TTM eviction code.
> 
> Except it requires that you hold dma_resv, which brings in all kinds of
> pain. And for eviction we really don't need a lot of synchronization, so a
> lot of that locking is not needed, unlike the case where we have a cpu
> fault, where we absolutely need mmap_lock and all that to make sure we
> fault in the right page.
> 
> But for eviction we only need to throw out some pages, if we're not
> entirely precise with picking the right ones (or have no idea into which
> vma they're all currently mapped into) it doesn't matter. That's why
> migrate_device_pages doesn't care about any of that at all, it doesn't
> need to by design. But by bo backing memory you drag in all that stuff
> that's causing headacheds for eviction.
> 
> The only thing migration tries to do is remove all pte, and if that
> succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> code as cheat sheet, we need roughly:
> 
> - reverse mapping structure like anon_vma. Except gpusvm can assume that
>   there's currently only one gpu side mapping, so we can just stuff the
>   gpusvm an va_address into the page, and protect it with the page lock.
> 
> - we need pagetable locks, so that we can manipulate pagetables (well
>   specifically make ptes invalid) without taking any other locks.
> 
> - everyone else inserting or removing ptes for svm mappings also needs to
>   lock the page, or we have races. This might be the hmm_range_fault races
>   you're seeing when allowing vram pages, since I don't think there's
>   anything else stopping the page lookup otherwise from succeeding.
> 
> - we might also need to stuff migrate ptes into the gpu side, like the cpu
>   does, to hold up refaults before the migration has finished. But I think
>   those are only needed for anon memory in sram because there's no other
>   way to find the right page than swap pte entries, of which migration
>   entries are a special case.
> 
> - core code also expects us to handle the page refcount correctly for svm
>   device memory, so we can't free the pages like normal bo pages either
>   directly to drm_buddy.
> 
> Now typing this all up will look an awful lot like what you have, with the
> dma_resv lock serving as the page lock and the pagetable lock. The only
> reason is that these locks are much smaller and nest within all the other
> stuff going on and so avoid the inversion issues.
> 
> So one annoying part is that this is a lot of pointlessly looking typing.
> The other is that it's full of races, because core mm really is yolo all
> the way down. So lots of ways you lock the wrong page and fun stuff like
> that, but the few cases that matter work:
> 
> - svm fault handling with hmm_range fault retries with mmu notifiers. Note
>   that we need to have vram pages locked and the notifier retrie needs to
>   be under the pagetable lock, or there's room to escape. At least that's
>   what I came up with last time I thought it all through.
> 
> - migrate_to_ram: it will hold a page reference which we know was the
>   valid vram page when the cpu pte was locked, but it might not be it
>   anymore. So we have to lock the page and check whether it's still gpu
>   mapped, and if not retry the entire fault since most likey another
>   migrate_to_ram has succeed meanwhile in parallel.
> 
> - for eviction we don't care, we might actually be migrating a page no one
>   even wants anymore.
> 
> Now I think you can get all this done with the dma_resv lock and maybe the
> bo refcount. But it does involve a tremendous amount of headaches and
> impendence mismatch, because that's not how page faults and migrations
> work in core mm.

A bit of a different take, and this might be completely wrong because I've
misread your gpusvm code or the amdkfd code. But I figured it might be
good for understanding if I explain where the current page lock and
pagetable locks are.

For the pagetable lock:

In amdkfd that's svm_range_lock/unlock. This is also held while
re-checking mmu_notifiers. In gpusvm the equivalent is the
gpusvm->notifier_lock, which is global instead of per notifier range, but
the same idea.

For the page lock:

In amdkfd this is essentially svm_range->migrate_mutex; it's the thing that
ensures we're consistent with any other concurrent migrations in the same
range. Note that this protects a virtual address range, but because that
has a 1:1 mapping to both the cpu mm va ranges and to any vram allocations,
it de facto serves as the page lock, since there can never be more than one
svm_range for a svm vram allocation. That avoids the revalidation dance
core mm/ needs to do once it locks a page, since the va->page mappings
might have changed meanwhile. That's why pretty much everywhere it just
trylocks while holding the pgtable locks and bails out if that fails (see
the sketch below).
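
A sketch of that convention, as a made-up helper around amdkfd-style names
(not actual amdkfd code; the pagetable lock is passed in explicitly):

/* On success the caller holds migrate_mutex for the whole migration. */
static bool svm_try_begin_migrate(struct svm_range *prange,
				  struct mutex *pagetable_lock)
{
	bool locked;

	mutex_lock(pagetable_lock);
	/* Never sleep on the "page lock" while the pagetable lock is held. */
	locked = mutex_trylock(&prange->migrate_mutex);
	mutex_unlock(pagetable_lock);

	return locked;	/* on failure the caller bails out and retries */
}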

In your gpusvm design I guess it should be the bo's dma_resv lock that
backs a gpusvm_range. But it's not consistently enough used, or not with
big enough locking scope to protect against concurrent migration races.
The trouble with using the dma_resv lock like this is that it's not very
compatible with classic dma_resv usage (at least I think, might be wrong).

For core kernel:

Pagetable lock is in each pagetable at each level, so it parallelizes
ridiculously well. We could adopt that scheme by also storing the mmu
notifier seqno into each pgtable page. That would give us mmu notifier
ranges that pretty much perfectly scale as we add/remove pagetables and
map/unmap stuff, instead of having to maintain a separate rbtree of ad-hoc
notifiers like gpusvm and amdkfd do. And because it's hierarchical and we
could store the mmu notifier seqno at each level, you can even optimize for
both small and huge ranges and it all checks out, since if the entire huge
range hasn't been invalidated, we can skip checking the ranges below.

page lock is the per-page (or, well, nowadays often the per-folio) lock. It
makes sure no one can rip the actual data stored in your page away from
underneath you (with migration or something else funny), since the
reference count only makes sure the allocation stays. The refcount itself
does not guarantee that the page you've looked up is actually still the
one that's mapped; it's like the bo refcount in that regard in gpusvm and
amdkfd for svm allocations in vram. Also, with folios the locks are now per
chunk size, like gpusvm does.
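
A sketch of what that means for the migrate_to_ram path mentioned in the
previous mail; gpusvm_page_still_mapped() is a hypothetical helper and the
actual copy-back is elided.

static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
{
	struct page *page = vmf->page;	/* reference held by the fault path */
	bool still_mapped;

	/* The refcount keeps the allocation alive; only the page lock plus
	 * a recheck proves this is still the page that is gpu mapped. */
	lock_page(page);
	still_mapped = gpusvm_page_still_mapped(page, vmf->address);
	unlock_page(page);

	if (!still_mapped)
		return 0;	/* lost the race; the cpu re-faults and finds the new pte */

	/* ... migrate_vma_setup()/pages()/finalize() copy back to sram ... */
	return 0;
}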

Cheers, Sima
Christian König Aug. 29, 2024, 9:24 a.m. UTC | #3
Am 28.08.24 um 18:06 schrieb Daniel Vetter:
> On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
>> Migration is implemented with range granularity, with VRAM backing being
>> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
>> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
>> SVM range is migrated to SRAM, the TTM BO is destroyed).
>>
>> The design choice for using TTM BO for VRAM backing store, as opposed to
>> direct buddy allocation, is as follows:
>>
>> - DRM buddy allocations are not at page granularity, offering no
>>    advantage over a BO.
> This one I'm not understanding.

Adding Arun as well. I couldn't understand it fully either, but maybe 
it's because the buddy allocator is more optimized for higher orders of 
allocations?

>
>> - DRM buddy allocations do not solve locking inversion problems between
>>    mmap lock and dma-resv locks.
> Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> matters hugely which migration path we're in, i.e. opportunistic
> migration, cpu fault where we have to migrate or die, or when we run out
> of vram and need to evict stuff to make space.

Mhm, I think the locking order between the mmap lock and the dma-resv lock
has been well defined since dma_resv_lockdep() was added.

>> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>>    evict each other).
> So core mm handles this by just roughly equally shrinking everything.
> Seems to work, and it has a pile of object shrinkers, and the page lru is
> also split into page cache and anon memory.
>
> I think you need to put in more justification that unified eviction is
> required than just stating it, because a look at mm/ gives a very well
> established counterexample.
>
>> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
>>    require a dma-resv.
> So from the TTM side we need exhaustive eviction, or at least something a
> bit more exhaustive than what ttm currently has. Note that i915-gem also
> never really got to perfect exhaustive eviction, it's just a pile better
> than ttm right now.

Please define what exhaustive eviction should mean. I think I know what 
it is, and I have been pushing TTM in the direction of solving this for 
years.

The last missing puzzle piece is to use drm_exec for TTM evictions, but 
apart from that everything should work now.

Regards,
Christian.

> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> eviction is going to win because the shrinkers can only trylock dma_resv.
> So this part works. It actually works so well on the system memory side
> that if we're not careful we can trigger oom, because we're too good at
> getting at all the memory.
>
> SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> don't see why, because the idea is that thanks to gpu and cpu page faults,
> you can always get out of a pinch by just trashing everything for a while
> and migrating the handfull of available pages a lot.
>
>> - Likely allocation size is 2M which makes of size of BO (872)
>>    acceptable per allocation (872 / 2M == .0004158).
>>
>> With this, using TTM BO for VRAM backing store seems to be an obvious
>> choice as it allows leveraging of the TTM eviction code.
> Except it requires that you hold dma_resv, which brings in all kinds of
> pain. And for eviction we really don't need a lot of synchronization, so a
> lot of that locking is not needed, unlike the case where we have a cpu
> fault, where we absolutely need mmap_lock and all that to make sure we
> fault in the right page.
>
> But for eviction we only need to throw out some pages, if we're not
> entirely precise with picking the right ones (or have no idea into which
> vma they're all currently mapped into) it doesn't matter. That's why
> migrate_device_pages doesn't care about any of that at all, it doesn't
> need to by design. But by bo backing memory you drag in all that stuff
> that's causing headacheds for eviction.
>
> The only thing migration tries to do is remove all pte, and if that
> succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> code as cheat sheet, we need roughly:
>
> - reverse mapping structure like anon_vma. Except gpusvm can assume that
>    there's currently only one gpu side mapping, so we can just stuff the
>    gpusvm an va_address into the page, and protect it with the page lock.
>
> - we need pagetable locks, so that we can manipulate pagetables (well
>    specifically make ptes invalid) without taking any other locks.
>
> - everyone else inserting or removing ptes for svm mappings also needs to
>    lock the page, or we have races. This might be the hmm_range_fault races
>    you're seeing when allowing vram pages, since I don't think there's
>    anything else stopping the page lookup otherwise from succeeding.
>
> - we might also need to stuff migrate ptes into the gpu side, like the cpu
>    does, to hold up refaults before the migration has finished. But I think
>    those are only needed for anon memory in sram because there's no other
>    way to find the right page than swap pte entries, of which migration
>    entries are a special case.
>
> - core code also expects us to handle the page refcount correctly for svm
>    device memory, so we can't free the pages like normal bo pages either
>    directly to drm_buddy.
>
> Now typing this all up will look an awful lot like what you have, with the
> dma_resv lock serving as the page lock and the pagetable lock. The only
> reason is that these locks are much smaller and nest within all the other
> stuff going on and so avoid the inversion issues.
>
> So one annoying part is that this is a lot of pointlessly looking typing.
> The other is that it's full of races, because core mm really is yolo all
> the way down. So lots of ways you lock the wrong page and fun stuff like
> that, but the few cases that matter work:
>
> - svm fault handling with hmm_range fault retries with mmu notifiers. Note
>    that we need to have vram pages locked and the notifier retrie needs to
>    be under the pagetable lock, or there's room to escape. At least that's
>    what I came up with last time I thought it all through.
>
> - migrate_to_ram: it will hold a page reference which we know was the
>    valid vram page when the cpu pte was locked, but it might not be it
>    anymore. So we have to lock the page and check whether it's still gpu
>    mapped, and if not retry the entire fault since most likey another
>    migrate_to_ram has succeed meanwhile in parallel.
>
> - for eviction we don't care, we might actually be migrating a page no one
>    even wants anymore.
>
> Now I think you can get all this done with the dma_resv lock and maybe the
> bo refcount. But it does involve a tremendous amount of headaches and
> impendence mismatch, because that's not how page faults and migrations
> work in core mm.
>
> Cheers, Sima
>
>> Current migration policy is migrate any SVM range greater than or equal
>> to 64k once.
>>
>> [1]https://patchwork.freedesktop.org/series/133643/
>>
>> Signed-off-by: Matthew Brostmatthew.brost@intel.com
>> ---
>>   drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/xe/xe_svm.h |  1 +
>>   2 files changed, 81 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
>> index 4372c02a341f..fd8987e0a506 100644
>> --- a/drivers/gpu/drm/xe/xe_svm.c
>> +++ b/drivers/gpu/drm/xe/xe_svm.c
>> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>>   static int __xe_svm_garbage_collector(struct xe_vm *vm,
>>   				      struct xe_svm_range *range)
>>   {
>> +	struct drm_gpusvm_ctx ctx = {};
>>   	struct dma_fence *fence;
>>   
>> +	/* Evict any pages holding references to vram allocation */
>> +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
>> +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
>> +
>>   	xe_vm_lock(vm, false);
>>   	fence = xe_vm_range_unbind(vm, range);
>>   	xe_vm_unlock(vm);
>> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *range,
>>   	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>>   }
>>   
>> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
>> +{
>> +	return &tile->mem.vram;
>> +}
>> +
>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
>> +				       struct xe_svm_range *range,
>> +				       const struct drm_gpusvm_ctx *ctx)
>> +{
>> +	struct xe_mem_region *mr = tile_to_mr(tile);
>> +	struct drm_buddy_block *block;
>> +	struct list_head *blocks;
>> +	struct xe_bo *bo;
>> +	ktime_t end = 0;
>> +	int err;
>> +
>> +retry:
>> +	xe_vm_lock(vm, false);
>> +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
>> +			  range->base.va.start, ttm_bo_type_device,
>> +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>> +			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
>> +	xe_vm_unlock(vm);
>> +	if (IS_ERR(bo)) {
>> +		err = PTR_ERR(bo);
>> +		if (xe_vm_validate_should_retry(NULL, err, &end))
>> +			goto retry;
>> +		return bo;
>> +	}
>> +
>> +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
>> +	list_for_each_entry(block, blocks, link)
>> +		block->private = mr;
>> +
>> +	/*
>> +	 * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
>> +	 * creation ref can be dropped upon CPU fault or unmap.
>> +	 */
>> +	xe_bo_get(bo);
>> +
>> +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
>> +					 bo, ctx);
>> +	if (err) {
>> +		xe_bo_put(bo);	/* Local ref */
>> +		xe_bo_put(bo);	/* Creation ref */
>> +		return ERR_PTR(err);
>> +	}
>> +
>> +	return bo;
>> +}
>> +
>>   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>   			    struct xe_tile *tile, u64 fault_addr,
>>   			    bool atomic)
>>   {
>> -	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
>> +	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
>> +		.vram_possible = IS_DGFX(vm->xe), };
>>   	struct xe_svm_range *range;
>>   	struct drm_gpusvm_range *r;
>>   	struct drm_exec exec;
>>   	struct dma_fence *fence;
>> +	struct xe_bo *bo = NULL;
>>   	ktime_t end = 0;
>>   	int err;
>>   
>>   	lockdep_assert_held_write(&vm->lock);
>>   
>>   retry:
>> +	xe_bo_put(bo);
>> +	bo = NULL;
>> +
>>   	/* Always process UNMAPs first so view SVM ranges is current */
>>   	err = xe_svm_garbage_collector(vm);
>>   	if (err)
>> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>   	if (xe_svm_range_is_valid(range, tile))
>>   		return 0;
>>   
>> +	/* XXX: Add migration policy, for now migrate range once */
>> +	if (IS_DGFX(vm->xe) && !range->migrated &&
>> +	    range->base.flags.migrate_vram &&
>> +	    (range->base.va.end - range->base.va.start) >= SZ_64K) {
>> +		range->migrated = true;
>> +
>> +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
>> +		if (IS_ERR(bo)) {
>> +			drm_info(&vm->xe->drm,
>> +				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
>> +				 vm->usm.asid, PTR_ERR(bo));
>> +			bo = NULL;
>> +			goto retry;
>> +		}
>> +	}
>> +
>>   	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>>   	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have change */
>>   	       goto retry;
>> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>>   	dma_fence_put(fence);
>>   
>>   err_out:
>> +	xe_bo_put(bo);
>> +
>>   	return err;
>>   }
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
>> index 8b72e91cc37d..3f432483a230 100644
>> --- a/drivers/gpu/drm/xe/xe_svm.h
>> +++ b/drivers/gpu/drm/xe/xe_svm.h
>> @@ -18,6 +18,7 @@ struct xe_svm_range {
>>   	struct list_head garbage_collector_link;
>>   	u8 tile_present;
>>   	u8 tile_invalidated;
>> +	u8 migrated	:1;
>>   };
>>   
>>   int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
>> -- 
>> 2.34.1
>>
Thomas Hellström Aug. 29, 2024, 9:53 a.m. UTC | #4
Hi, Christian,

On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> 

...

> > > - Unified eviction is required (SVM VRAM and TTM BOs need to be
> > > able to
> > >    evict each other).
> > So core mm handles this by just roughly equally shrinking
> > everything.
> > Seems to work, and it has a pile of object shrinkers, and the page
> > lru is
> > also split into page cache and anon memory.
> > 
> > I think you need to put in more justification that unified eviction
> > is
> > required than just stating it, because a look at mm/ gives a very
> > well
> > established counterexample.
> > 
> > > - For exhaustive eviction [1], SVM VRAM allocations will almost
> > > certainly
> > >    require a dma-resv.
> > So from the TTM side we need exhaustive eviction, or at least
> > something a
> > bit more exhaustive than what ttm currently has. Note that i915-gem
> > also
> > never really got to perfect exhaustive eviction, it's just a pile
> > better
> > than ttm right now.
> 
> Please define what exhaustive eviction should mean? I think I know
> what 
> it is and I have been pushing TTM into the direction of solving this
> for 
> years.

We internally take exhaustive eviction to mean that a client is always
guaranteed to eventually make progress in obtaining non-pinned vram,
typically by incrementally locking and keeping dma-resvs across a
single validation, including validations during buffer object
allocations.
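
For context, that incremental-locking shape with drm_exec looks roughly
like the sketch below; the validate helper is a placeholder, and keeping
the evicted BOs locked in the same drm_exec is the proposed extension, not
current TTM behaviour.

static int exhaustive_validate(struct drm_gem_object *obj)
{
	struct drm_exec exec;
	int err = 0;

	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
	drm_exec_until_all_locked(&exec) {
		err = drm_exec_prepare_obj(&exec, obj, 1);
		drm_exec_retry_on_contention(&exec);
		if (err)
			break;

		/* Validate; BOs evicted to make room would stay locked in
		 * 'exec' too, so the freed space cannot be stolen back. */
		err = validate_and_evict(&exec, obj);	/* placeholder */
		drm_exec_retry_on_contention(&exec);
		if (err)
			break;
	}
	drm_exec_fini(&exec);

	return err;
}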

> 
> The last missing puzzle piece is to use drm_exec for TTM evictions,

and IMO keeping the dma-resv locks grabbed during eviction until at
least one unit of progress (one validation) has succeeded.

> but 
> apart from that everything should work now.
> 
> 
> Regards,
> Christian.

But as Sima pointed out in private communication, exhaustive eviction
is not really needed for faulting to make (crawling) progress.
Watermarks and VRAM trylock shrinking should suffice, since we're
strictly only required to service a single gpu page granule at a time.
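
Purely as illustration of the watermark + trylock idea (every name below is
hypothetical):

static struct page *svm_get_one_vram_page(struct vram_mgr *mgr)
{
	struct page *page;

	for (;;) {
		page = vram_try_alloc_page(mgr);
		if (page)
			return page;

		/* Below the watermark: evict whatever can be trylocked.
		 * No guarantee per pass, but a single granule eventually
		 * frees up, which is all a gpu fault needs to crawl on. */
		vram_trylock_shrink(mgr, SZ_2M);
		cond_resched();
	}
}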

However, ordinary bo-based jobs would still like to be able to
completely evict SVM vram. Whether that is important enough to strive
for is ofc up for discussion.

/Thomas



> 
> > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> > eviction is going to win because the shrinkers can only trylock
> > dma_resv.
> > So this part works. It actually works so well on the system memory
> > side
> > that if we're not careful we can trigger oom, because we're too
> > good at
> > getting at all the memory.
> > 
> > SVM VRAM allocations otoh do not need exhaustive evictions. Or at
> > least I
> > don't see why, because the idea is that thanks to gpu and cpu page
> > faults,
> > you can always get out of a pinch by just trashing everything for a
> > while
> > and migrating the handfull of available pages a lot.
> > 
> > > - Likely allocation size is 2M which makes of size of BO (872)
> > >    acceptable per allocation (872 / 2M == .0004158).
> > > 
> > > With this, using TTM BO for VRAM backing store seems to be an
> > > obvious
> > > choice as it allows leveraging of the TTM eviction code.
> > Except it requires that you hold dma_resv, which brings in all
> > kinds of
> > pain. And for eviction we really don't need a lot of
> > synchronization, so a
> > lot of that locking is not needed, unlike the case where we have a
> > cpu
> > fault, where we absolutely need mmap_lock and all that to make sure
> > we
> > fault in the right page.
> > 
> > But for eviction we only need to throw out some pages, if we're not
> > entirely precise with picking the right ones (or have no idea into
> > which
> > vma they're all currently mapped into) it doesn't matter. That's
> > why
> > migrate_device_pages doesn't care about any of that at all, it
> > doesn't
> > need to by design. But by bo backing memory you drag in all that
> > stuff
> > that's causing headacheds for eviction.
> > 
> > The only thing migration tries to do is remove all pte, and if that
> > succeeds, move the page. Specialized for the gpusvm case, looking
> > at mm/
> > code as cheat sheet, we need roughly:
> > 
> > - reverse mapping structure like anon_vma. Except gpusvm can assume
> > that
> >    there's currently only one gpu side mapping, so we can just
> > stuff the
> >    gpusvm an va_address into the page, and protect it with the page
> > lock.
> > 
> > - we need pagetable locks, so that we can manipulate pagetables
> > (well
> >    specifically make ptes invalid) without taking any other locks.
> > 
> > - everyone else inserting or removing ptes for svm mappings also
> > needs to
> >    lock the page, or we have races. This might be the
> > hmm_range_fault races
> >    you're seeing when allowing vram pages, since I don't think
> > there's
> >    anything else stopping the page lookup otherwise from
> > succeeding.
> > 
> > - we might also need to stuff migrate ptes into the gpu side, like
> > the cpu
> >    does, to hold up refaults before the migration has finished. But
> > I think
> >    those are only needed for anon memory in sram because there's no
> > other
> >    way to find the right page than swap pte entries, of which
> > migration
> >    entries are a special case.
> > 
> > - core code also expects us to handle the page refcount correctly
> > for svm
> >    device memory, so we can't free the pages like normal bo pages
> > either
> >    directly to drm_buddy.
> > 
> > Now typing this all up will look an awful lot like what you have,
> > with the
> > dma_resv lock serving as the page lock and the pagetable lock. The
> > only
> > reason is that these locks are much smaller and nest within all the
> > other
> > stuff going on and so avoid the inversion issues.
> > 
> > So one annoying part is that this is a lot of pointlessly looking
> > typing.
> > The other is that it's full of races, because core mm really is
> > yolo all
> > the way down. So lots of ways you lock the wrong page and fun stuff
> > like
> > that, but the few cases that matter work:
> > 
> > - svm fault handling with hmm_range fault retries with mmu
> > notifiers. Note
> >    that we need to have vram pages locked and the notifier retrie
> > needs to
> >    be under the pagetable lock, or there's room to escape. At least
> > that's
> >    what I came up with last time I thought it all through.
> > 
> > - migrate_to_ram: it will hold a page reference which we know was
> > the
> >    valid vram page when the cpu pte was locked, but it might not be
> > it
> >    anymore. So we have to lock the page and check whether it's
> > still gpu
> >    mapped, and if not retry the entire fault since most likey
> > another
> >    migrate_to_ram has succeed meanwhile in parallel.
> > 
> > - for eviction we don't care, we might actually be migrating a page
> > no one
> >    even wants anymore.
> > 
> > Now I think you can get all this done with the dma_resv lock and
> > maybe the
> > bo refcount. But it does involve a tremendous amount of headaches
> > and
> > impendence mismatch, because that's not how page faults and
> > migrations
> > work in core mm.
> > 
> > Cheers, Sima
> > 
> > > Current migration policy is migrate any SVM range greater than or
> > > equal
> > > to 64k once.
> > > 
> > > [1]https://patchwork.freedesktop.org/series/133643/
> > > 
> > > Signed-off-by: Matthew Brostmatthew.brost@intel.com
> > > ---
> > >   drivers/gpu/drm/xe/xe_svm.c | 81
> > > ++++++++++++++++++++++++++++++++++++-
> > >   drivers/gpu/drm/xe/xe_svm.h |  1 +
> > >   2 files changed, 81 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > b/drivers/gpu/drm/xe/xe_svm.c
> > > index 4372c02a341f..fd8987e0a506 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct
> > > drm_gpusvm *gpusvm,
> > >   static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > >   				      struct xe_svm_range
> > > *range)
> > >   {
> > > +	struct drm_gpusvm_ctx ctx = {};
> > >   	struct dma_fence *fence;
> > >   
> > > +	/* Evict any pages holding references to vram allocation
> > > */
> > > +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> > > +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm,
> > > &range->base, &ctx);
> > > +
> > >   	xe_vm_lock(vm, false);
> > >   	fence = xe_vm_range_unbind(vm, range);
> > >   	xe_vm_unlock(vm);
> > > @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct
> > > xe_svm_range *range,
> > >   	return (range->tile_present & ~range->tile_invalidated)
> > > & BIT(tile->id);
> > >   }
> > >   
> > > +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> > > +{
> > > +	return &tile->mem.vram;
> > > +}
> > > +
> > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
> > > xe_tile *tile,
> > > +				       struct xe_svm_range
> > > *range,
> > > +				       const struct
> > > drm_gpusvm_ctx *ctx)
> > > +{
> > > +	struct xe_mem_region *mr = tile_to_mr(tile);
> > > +	struct drm_buddy_block *block;
> > > +	struct list_head *blocks;
> > > +	struct xe_bo *bo;
> > > +	ktime_t end = 0;
> > > +	int err;
> > > +
> > > +retry:
> > > +	xe_vm_lock(vm, false);
> > > +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> > > >base.va.end -
> > > +			  range->base.va.start,
> > > ttm_bo_type_device,
> > > +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> > > +			  XE_BO_FLAG_SYSTEM_ALLOC |
> > > XE_BO_FLAG_SKIP_CLEAR);
> > > +	xe_vm_unlock(vm);
> > > +	if (IS_ERR(bo)) {
> > > +		err = PTR_ERR(bo);
> > > +		if (xe_vm_validate_should_retry(NULL, err,
> > > &end))
> > > +			goto retry;
> > > +		return bo;
> > > +	}
> > > +
> > > +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> > > >blocks;
> > > +	list_for_each_entry(block, blocks, link)
> > > +		block->private = mr;
> > > +
> > > +	/*
> > > +	 * Take ref because as soon as
> > > drm_gpusvm_migrate_to_vram succeeds the
> > > +	 * creation ref can be dropped upon CPU fault or unmap.
> > > +	 */
> > > +	xe_bo_get(bo);
> > > +
> > > +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm,
> > > &range->base,
> > > +					 bo, ctx);
> > > +	if (err) {
> > > +		xe_bo_put(bo);	/* Local ref */
> > > +		xe_bo_put(bo);	/* Creation ref */
> > > +		return ERR_PTR(err);
> > > +	}
> > > +
> > > +	return bo;
> > > +}
> > > +
> > >   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > *vma,
> > >   			    struct xe_tile *tile, u64
> > > fault_addr,
> > >   			    bool atomic)
> > >   {
> > > -	struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma), };
> > > +	struct drm_gpusvm_ctx ctx = { .read_only =
> > > xe_vma_read_only(vma),
> > > +		.vram_possible = IS_DGFX(vm->xe), };
> > >   	struct xe_svm_range *range;
> > >   	struct drm_gpusvm_range *r;
> > >   	struct drm_exec exec;
> > >   	struct dma_fence *fence;
> > > +	struct xe_bo *bo = NULL;
> > >   	ktime_t end = 0;
> > >   	int err;
> > >   
> > >   	lockdep_assert_held_write(&vm->lock);
> > >   
> > >   retry:
> > > +	xe_bo_put(bo);
> > > +	bo = NULL;
> > > +
> > >   	/* Always process UNMAPs first so view SVM ranges is
> > > current */
> > >   	err = xe_svm_garbage_collector(vm);
> > >   	if (err)
> > > @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > *vm, struct xe_vma *vma,
> > >   	if (xe_svm_range_is_valid(range, tile))
> > >   		return 0;
> > >   
> > > +	/* XXX: Add migration policy, for now migrate range once
> > > */
> > > +	if (IS_DGFX(vm->xe) && !range->migrated &&
> > > +	    range->base.flags.migrate_vram &&
> > > +	    (range->base.va.end - range->base.va.start) >=
> > > SZ_64K) {
> > > +		range->migrated = true;
> > > +
> > > +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > > +		if (IS_ERR(bo)) {
> > > +			drm_info(&vm->xe->drm,
> > > +				 "VRAM allocation failed,
> > > falling back to retrying, asid=%u, errno %ld\n",
> > > +				 vm->usm.asid, PTR_ERR(bo));
> > > +			bo = NULL;
> > > +			goto retry;
> > > +		}
> > > +	}
> > > +
> > >   	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > &ctx);
> > >   	if (err == -EFAULT || err == -EPERM)	/* Corner where
> > > CPU mappings have change */
> > >   	       goto retry;
> > > @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > > struct xe_vma *vma,
> > >   	dma_fence_put(fence);
> > >   
> > >   err_out:
> > > +	xe_bo_put(bo);
> > > +
> > >   	return err;
> > >   }
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > b/drivers/gpu/drm/xe/xe_svm.h
> > > index 8b72e91cc37d..3f432483a230 100644
> > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > @@ -18,6 +18,7 @@ struct xe_svm_range {
> > >   	struct list_head garbage_collector_link;
> > >   	u8 tile_present;
> > >   	u8 tile_invalidated;
> > > +	u8 migrated	:1;
> > >   };
> > >   
> > >   int xe_devm_add(struct xe_tile *tile, struct xe_mem_region
> > > *mr);
> > > -- 
> > > 2.34.1
> > >
Daniel Vetter Aug. 29, 2024, 11:02 a.m. UTC | #5
On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> But as Sima pointed out in private communication, exhaustive eviction
> is not really needed for faulting to make (crawling) progress.
> Watermarks and VRAM trylock shrinking should suffice, since we're
> strictly only required to service a single gpu page granule at a time.
> 
> However, ordinary bo-based jobs would still like to be able to
> completely evict SVM vram. Whether that is important enough to strive
> for is ofc up for discussion.

My take is that you don't win anything for exhaustive eviction by having
the dma_resv somewhere in there for svm allocations. Roughly, for the
split-lru world, where svm ignores bo/dma_resv:

When evicting vram from the ttm side we'll fairly switch between selecting
bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
will eventually succeed in vacuuming up everything (with a few retries
perhaps, if we're not yet at the head of the ww ticket queue).

svm pages we need to try to evict anyway - there's no guarantee, because
the core mm might be holding temporary page references (which block
migration) or have the page locked (which also blocks the migration). But
as long as those two steps succeed, we'll win and get the pages. There
might be some thrashing against concurrent svm faults stealing them again,
but they have a disadvantage since they can't steal a dma_resv-locked bo.
And if it's still too much we can stall them in the page allocator.

So it's not entirely reliable, but should be close enough.

Now for bo-based svm the picture isn't any different, because holding
dma_resv is not actually enough to migrate svm mappings. We still need to
hope there are no temporary page references around, and we still need to
succeed at locking the page. And the migration code only does trylocks,
because that's its deadlock-prevention algorithm for different migrations
needing the same set of pages but acquiring them in a different order. So
we win nothing.

Worse, if dma_resv does actually hold up svm migration and reclaim, then
we potentially deadlock because that lock is for a bigger range than
individual pages (or folios). And the core mm assumes that it can get out
of a deadlock bind by (at least stochastically) eventually succeeding in
acquiring/locking down a single page.

This means we cannot use dma_resv tricks to give the ttm world an
advantage in exhaustive eviction against concurrent svm faults. Or at
least not more than we can do without by just stalling svm faults that
need to allocate gpu memory (but that must happen without holding locks or
we're busted).

So the only benefit I'm seeing is the unified lru, which I'm not sure is
worth it. There's also a bit of an lru design tension here, because for the
bo world we want objects that are locked to stay on the lru, so that the
competing processes can figure out who has the winning ww ticket. The core
mm design otoh does isolate pages and remove them from the lru when
they're acquired, so that they don't gunk up other processes trying to
make forward progress and are better hidden. Which reduces temporary
page references (from lru walks) preventing migration and stuff like that.

Cheers, Sima
Christian König Aug. 29, 2024, 2:30 p.m. UTC | #6
Am 29.08.24 um 11:53 schrieb Thomas Hellström:
> Hi, Christian,
>
> On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> ...
>
>>>> - Unified eviction is required (SVM VRAM and TTM BOs need to be
>>>> able to
>>>>     evict each other).
>>> So core mm handles this by just roughly equally shrinking
>>> everything.
>>> Seems to work, and it has a pile of object shrinkers, and the page
>>> lru is
>>> also split into page cache and anon memory.
>>>
>>> I think you need to put in more justification that unified eviction
>>> is
>>> required than just stating it, because a look at mm/ gives a very
>>> well
>>> established counterexample.
>>>
>>>> - For exhaustive eviction [1], SVM VRAM allocations will almost
>>>> certainly
>>>>     require a dma-resv.
>>> So from the TTM side we need exhaustive eviction, or at least
>>> something a
>>> bit more exhaustive than what ttm currently has. Note that i915-gem
>>> also
>>> never really got to perfect exhaustive eviction, it's just a pile
>>> better
>>> than ttm right now.
>> Please define what exhaustive eviction should mean? I think I know
>> what
>> it is and I have been pushing TTM into the direction of solving this
>> for
>> years.
> We internally refer to exhaustive eviction being a client is always
> guaranteed to eventually make progress in obtaining non-pinned vram,
> typically by incrementally locking and keeping dma-resvs across a
> single validation including validations during buffer object
> allocations.
>
>> The last missing puzzle piece is to use drm_exec for TTM evictions,
> and IMO keeping the dma-resv locks grabbed during eviction until at
> least one unit of progress (one validation) has succeeded.

Yes, exactly that. My guess at what it means was correct.

>
>> but
>> apart from that everything should work now.
>>
>>
>> Regards,
>> Christian.
> But as Sima pointed out in private communication, exhaustive eviction
> is not really needed for faulting to make (crawling) progress.
> Watermarks and VRAM trylock shrinking should suffice, since we're
> strictly only required to service a single gpu page granule at a time.

Yeah, fault-based memory management should be able to keep working as 
long as the page isn't re-migrated before you make any progress.

Since the number of VRAM or system memory pages is very high, that should 
basically never happen.

> However, ordinary bo-based jobs would still like to be able to
> completely evict SVM vram. Whether that is important enough to strive
> for is ofc up for discussion.

Yes, exactly that. Felix, Alex, a bunch of other AMD folks and I came to 
the same conclusion internally at AMD as well.

Regards,
Christian.

>
> /Thomas
>
>
>
>>> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
>>> eviction is going to win because the shrinkers can only trylock
>>> dma_resv.
>>> So this part works. It actually works so well on the system memory
>>> side
>>> that if we're not careful we can trigger oom, because we're too
>>> good at
>>> getting at all the memory.
>>>
>>> SVM VRAM allocations otoh do not need exhaustive evictions. Or at
>>> least I
>>> don't see why, because the idea is that thanks to gpu and cpu page
>>> faults,
>>> you can always get out of a pinch by just trashing everything for a
>>> while
>>> and migrating the handfull of available pages a lot.
>>>
>>>> - Likely allocation size is 2M which makes of size of BO (872)
>>>>     acceptable per allocation (872 / 2M == .0004158).
>>>>
>>>> With this, using TTM BO for VRAM backing store seems to be an
>>>> obvious
>>>> choice as it allows leveraging of the TTM eviction code.
>>> Except it requires that you hold dma_resv, which brings in all
>>> kinds of
>>> pain. And for eviction we really don't need a lot of
>>> synchronization, so a
>>> lot of that locking is not needed, unlike the case where we have a
>>> cpu
>>> fault, where we absolutely need mmap_lock and all that to make sure
>>> we
>>> fault in the right page.
>>>
>>> But for eviction we only need to throw out some pages, if we're not
>>> entirely precise with picking the right ones (or have no idea into
>>> which
>>> vma they're all currently mapped into) it doesn't matter. That's
>>> why
>>> migrate_device_pages doesn't care about any of that at all, it
>>> doesn't
>>> need to by design. But by bo backing memory you drag in all that
>>> stuff
>>> that's causing headacheds for eviction.
>>>
>>> The only thing migration tries to do is remove all pte, and if that
>>> succeeds, move the page. Specialized for the gpusvm case, looking
>>> at mm/
>>> code as cheat sheet, we need roughly:
>>>
>>> - reverse mapping structure like anon_vma. Except gpusvm can assume
>>> that
>>>     there's currently only one gpu side mapping, so we can just
>>> stuff the
>>>     gpusvm an va_address into the page, and protect it with the page
>>> lock.
>>>
>>> - we need pagetable locks, so that we can manipulate pagetables
>>> (well
>>>     specifically make ptes invalid) without taking any other locks.
>>>
>>> - everyone else inserting or removing ptes for svm mappings also
>>> needs to
>>>     lock the page, or we have races. This might be the
>>> hmm_range_fault races
>>>     you're seeing when allowing vram pages, since I don't think
>>> there's
>>>     anything else stopping the page lookup otherwise from
>>> succeeding.
>>>
>>> - we might also need to stuff migrate ptes into the gpu side, like
>>> the cpu
>>>     does, to hold up refaults before the migration has finished. But
>>> I think
>>>     those are only needed for anon memory in sram because there's no
>>> other
>>>     way to find the right page than swap pte entries, of which
>>> migration
>>>     entries are a special case.
>>>
>>> - core code also expects us to handle the page refcount correctly
>>> for svm
>>>     device memory, so we can't free the pages like normal bo pages
>>> either
>>>     directly to drm_buddy.
>>>
>>> Now typing this all up will look an awful lot like what you have,
>>> with the
>>> dma_resv lock serving as the page lock and the pagetable lock. The
>>> only
>>> reason is that these locks are much smaller and nest within all the
>>> other
>>> stuff going on and so avoid the inversion issues.
>>>
>>> So one annoying part is that this is a lot of pointlessly looking
>>> typing.
>>> The other is that it's full of races, because core mm really is
>>> yolo all
>>> the way down. So lots of ways you lock the wrong page and fun stuff
>>> like
>>> that, but the few cases that matter work:
>>>
>>> - svm fault handling with hmm_range fault retries with mmu
>>> notifiers. Note
>>>     that we need to have vram pages locked and the notifier retrie
>>> needs to
>>>     be under the pagetable lock, or there's room to escape. At least
>>> that's
>>>     what I came up with last time I thought it all through.
>>>
>>> - migrate_to_ram: it will hold a page reference which we know was
>>> the
>>>     valid vram page when the cpu pte was locked, but it might not be
>>> it
>>>     anymore. So we have to lock the page and check whether it's
>>> still gpu
>>>     mapped, and if not retry the entire fault since most likey
>>> another
>>>     migrate_to_ram has succeed meanwhile in parallel.
>>>
>>> - for eviction we don't care, we might actually be migrating a page
>>> no one
>>>     even wants anymore.
>>>
>>> Now I think you can get all this done with the dma_resv lock and
>>> maybe the
>>> bo refcount. But it does involve a tremendous amount of headaches
>>> and
>>> impendence mismatch, because that's not how page faults and
>>> migrations
>>> work in core mm.
>>>
>>> Cheers, Sima
>>>
>>>> Current migration policy is migrate any SVM range greater than or
>>>> equal
>>>> to 64k once.
>>>>
>>>> [1]https://patchwork.freedesktop.org/series/133643/
>>>>
>>>> Signed-off-by: Matthew Brostmatthew.brost@intel.com
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_svm.c | 81
>>>> ++++++++++++++++++++++++++++++++++++-
>>>>    drivers/gpu/drm/xe/xe_svm.h |  1 +
>>>>    2 files changed, 81 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.c
>>>> b/drivers/gpu/drm/xe/xe_svm.c
>>>> index 4372c02a341f..fd8987e0a506 100644
>>>> --- a/drivers/gpu/drm/xe/xe_svm.c
>>>> +++ b/drivers/gpu/drm/xe/xe_svm.c
>>>> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct
>>>> drm_gpusvm *gpusvm,
>>>>    static int __xe_svm_garbage_collector(struct xe_vm *vm,
>>>>    				      struct xe_svm_range
>>>> *range)
>>>>    {
>>>> +	struct drm_gpusvm_ctx ctx = {};
>>>>    	struct dma_fence *fence;
>>>>    
>>>> +	/* Evict any pages holding references to vram allocation
>>>> */
>>>> +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
>>>> +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm,
>>>> &range->base, &ctx);
>>>> +
>>>>    	xe_vm_lock(vm, false);
>>>>    	fence = xe_vm_range_unbind(vm, range);
>>>>    	xe_vm_unlock(vm);
>>>> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct
>>>> xe_svm_range *range,
>>>>    	return (range->tile_present & ~range->tile_invalidated)
>>>> & BIT(tile->id);
>>>>    }
>>>>    
>>>> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
>>>> +{
>>>> +	return &tile->mem.vram;
>>>> +}
>>>> +
>>>> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
>>>> xe_tile *tile,
>>>> +				       struct xe_svm_range
>>>> *range,
>>>> +				       const struct
>>>> drm_gpusvm_ctx *ctx)
>>>> +{
>>>> +	struct xe_mem_region *mr = tile_to_mr(tile);
>>>> +	struct drm_buddy_block *block;
>>>> +	struct list_head *blocks;
>>>> +	struct xe_bo *bo;
>>>> +	ktime_t end = 0;
>>>> +	int err;
>>>> +
>>>> +retry:
>>>> +	xe_vm_lock(vm, false);
>>>> +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
>>>>> base.va.end -
>>>> +			  range->base.va.start,
>>>> ttm_bo_type_device,
>>>> +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
>>>> +			  XE_BO_FLAG_SYSTEM_ALLOC |
>>>> XE_BO_FLAG_SKIP_CLEAR);
>>>> +	xe_vm_unlock(vm);
>>>> +	if (IS_ERR(bo)) {
>>>> +		err = PTR_ERR(bo);
>>>> +		if (xe_vm_validate_should_retry(NULL, err,
>>>> &end))
>>>> +			goto retry;
>>>> +		return bo;
>>>> +	}
>>>> +
>>>> +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
>>>>> blocks;
>>>> +	list_for_each_entry(block, blocks, link)
>>>> +		block->private = mr;
>>>> +
>>>> +	/*
>>>> +	 * Take ref because as soon as
>>>> drm_gpusvm_migrate_to_vram succeeds the
>>>> +	 * creation ref can be dropped upon CPU fault or unmap.
>>>> +	 */
>>>> +	xe_bo_get(bo);
>>>> +
>>>> +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm,
>>>> &range->base,
>>>> +					 bo, ctx);
>>>> +	if (err) {
>>>> +		xe_bo_put(bo);	/* Local ref */
>>>> +		xe_bo_put(bo);	/* Creation ref */
>>>> +		return ERR_PTR(err);
>>>> +	}
>>>> +
>>>> +	return bo;
>>>> +}
>>>> +
>>>>    int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
>>>> *vma,
>>>>    			    struct xe_tile *tile, u64
>>>> fault_addr,
>>>>    			    bool atomic)
>>>>    {
>>>> -	struct drm_gpusvm_ctx ctx = { .read_only =
>>>> xe_vma_read_only(vma), };
>>>> +	struct drm_gpusvm_ctx ctx = { .read_only =
>>>> xe_vma_read_only(vma),
>>>> +		.vram_possible = IS_DGFX(vm->xe), };
>>>>    	struct xe_svm_range *range;
>>>>    	struct drm_gpusvm_range *r;
>>>>    	struct drm_exec exec;
>>>>    	struct dma_fence *fence;
>>>> +	struct xe_bo *bo = NULL;
>>>>    	ktime_t end = 0;
>>>>    	int err;
>>>>    
>>>>    	lockdep_assert_held_write(&vm->lock);
>>>>    
>>>>    retry:
>>>> +	xe_bo_put(bo);
>>>> +	bo = NULL;
>>>> +
>>>>    	/* Always process UNMAPs first so view SVM ranges is
>>>> current */
>>>>    	err = xe_svm_garbage_collector(vm);
>>>>    	if (err)
>>>> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm
>>>> *vm, struct xe_vma *vma,
>>>>    	if (xe_svm_range_is_valid(range, tile))
>>>>    		return 0;
>>>>    
>>>> +	/* XXX: Add migration policy, for now migrate range once
>>>> */
>>>> +	if (IS_DGFX(vm->xe) && !range->migrated &&
>>>> +	    range->base.flags.migrate_vram &&
>>>> +	    (range->base.va.end - range->base.va.start) >=
>>>> SZ_64K) {
>>>> +		range->migrated = true;
>>>> +
>>>> +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
>>>> +		if (IS_ERR(bo)) {
>>>> +			drm_info(&vm->xe->drm,
>>>> +				 "VRAM allocation failed,
>>>> falling back to retrying, asid=%u, errno %ld\n",
>>>> +				 vm->usm.asid, PTR_ERR(bo));
>>>> +			bo = NULL;
>>>> +			goto retry;
>>>> +		}
>>>> +	}
>>>> +
>>>>    	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
>>>> &ctx);
>>>>    	if (err == -EFAULT || err == -EPERM)	/* Corner where
>>>> CPU mappings have change */
>>>>    	       goto retry;
>>>> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
>>>> struct xe_vma *vma,
>>>>    	dma_fence_put(fence);
>>>>    
>>>>    err_out:
>>>> +	xe_bo_put(bo);
>>>> +
>>>>    	return err;
>>>>    }
>>>>    
>>>> diff --git a/drivers/gpu/drm/xe/xe_svm.h
>>>> b/drivers/gpu/drm/xe/xe_svm.h
>>>> index 8b72e91cc37d..3f432483a230 100644
>>>> --- a/drivers/gpu/drm/xe/xe_svm.h
>>>> +++ b/drivers/gpu/drm/xe/xe_svm.h
>>>> @@ -18,6 +18,7 @@ struct xe_svm_range {
>>>>    	struct list_head garbage_collector_link;
>>>>    	u8 tile_present;
>>>>    	u8 tile_invalidated;
>>>> +	u8 migrated	:1;
>>>>    };
>>>>    
>>>>    int xe_devm_add(struct xe_tile *tile, struct xe_mem_region
>>>> *mr);
>>>> -- 
>>>> 2.34.1
>>>>
Matthew Brost Aug. 29, 2024, 9:48 p.m. UTC | #7
On Thu, Aug 29, 2024 at 11:24:26AM +0200, Christian König wrote:
>    Am 28.08.24 um 18:06 schrieb Daniel Vetter:
> 

A lot to unpack here. Will try to address as much as I can in this
single reply to both of you (Daniel, Christian).

> On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> 
> Migration is implemented with range granularity, with VRAM backing being
> a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> SVM range is migrated to SRAM, the TTM BO is destroyed).
> 
> The design choice for using TTM BO for VRAM backing store, as opposed to
> direct buddy allocation, is as follows:
> 
> - DRM buddy allocations are not at page granularity, offering no
>   advantage over a BO.
> 
> This one I'm not understanding.
>
>    Adding Arun as well. I couldn't understand it fully either, but maybe
>    it's because the buddy allocator is more optimized for higher orders of
>    allocations?
> 

As currently written, a BO VRAM allocation resolves to a DRM buddy
allocation. Unless there is memory pressure this is likely 1 buddy block
if the allocation is aligned (SVM should basically also be doing aligned
allocations, with the common case being 2M at a time).

It was suggested in earlier revs by a colleague of mine that allocating
directly from the DRM buddy pool provided a benefit wrt freeing a page
at a time. It doesn't, given that even if you bypass a BO you are most
likely going to get 1 buddy block which is larger than a page. In either
case you need to refcount the allocation or do some wild splitting
algorithm (I don't want to do that). Or alternatively write a new buddy
allocator which can easily cope with freeing a page at a time (I don't
want to do that either).

Lastly, the common case for dev_pagemap_ops.page_free calls is going to
be consecutive calls spanning the entire allocation (e.g. eviction or a
CPU fault which triggers migration).
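
To make the refcounting point concrete, below is a minimal sketch
(structure and function names are placeholders, not code from this
series) of what freeing a page at a time against a buddy block ends up
needing:

/*
 * Sketch only: a per-block refcount so the buddy block is returned to
 * drm_buddy only once every page backed by it has been freed via
 * dev_pagemap_ops.page_free().
 */
struct svm_vram_block {
	struct drm_buddy *mm;			/* allocator the block came from */
	struct drm_buddy_block *block;		/* backing buddy allocation */
	atomic_t pages_in_use;			/* pages still referencing the block */
};

static void svm_vram_page_free(struct page *page)
{
	/* zone_device_data would be set up when the pages are handed out */
	struct svm_vram_block *vblock = page->zone_device_data;

	if (atomic_dec_and_test(&vblock->pages_in_use)) {
		/* last page gone, the whole block can go back to the buddy */
		drm_buddy_free_block(vblock->mm, vblock->block);
		kfree(vblock);
	}
}

So either way there is extra state and bookkeeping per block, which is
the point above: bypassing the BO does not buy page-granular freeing for
free.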

> 
> 
> - DRM buddy allocations do not solve locking inversion problems between
>   mmap lock and dma-resv locks.
> 
> Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> matters hugely which migration path we're in, i.e. opportunistic
> migration, cpu fault where we have to migrate or die, or when we run out
> of vram and need to evict stuff to make space.
> 
>    Mhm I think the locking order between mmap lock and dma-resv lock is
>    well defined since dma_resv_lockdep() was added.
>

Yes. Also solved the inversion issue by using migrate_device_*. At one
point I had trylocking of the mmap lock (still kinda there) but have
agreed, based on Daniel's feedback, to rip that out.
 
> - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
>   evict each other).
> 
> So core mm handles this by just roughly equally shrinking everything.
> Seems to work, and it has a pile of object shrinkers, and the page lru is
> also split into page cache and anon memory.
> 
> I think you need to put in more justification that unified eviction is
> required than just stating it, because a look at mm/ gives a very well
> established counterexample.
> 
> 
> - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
>   require a dma-resv.
> 
> So from the TTM side we need exhaustive eviction, or at least something a
> bit more exhaustive than what ttm currently has. Note that i915-gem also
> never really got to perfect exhaustive eviction, it's just a pile better
> than ttm right now.
> 
>    Please define what exhaustive eviction should mean? I think I know what
>    it is and I have been pushing TTM into the direction of solving this
>    for years.
>    The last missing puzzle piece is to use drm_exec for TTM evictions, but
>    apart from that everything should work now.
>    Regards,
>    Christian.

I think Thomas has defined this in his replies. He also touches on how
our SVM design allows mixing user BO mappings and SVM mappings within
the same VM. These need to be able to fairly evict each other. A
dma-resv lock provides a level of fairness and ensures forward progress
once a flavor of his series lands.

Also worth noting, in addition to user BOs, we have kernel BOs (for page
tables, user exec queues, etc.) in Xe which absolutely need to be able
to evict something or the application dies.
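
For context, the fairness referred to above comes from the drm_exec /
ww-mutex pattern; a minimal sketch (xe_validate_and_bind() is a
placeholder, not a real function) of a validation loop that eventually
wins via its ww ticket:

static int validate_with_fairness(struct xe_vm *vm, struct xe_bo *bo)
{
	struct drm_exec exec;
	int err;

	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
	drm_exec_until_all_locked(&exec) {
		/* Lock the BO's dma-resv under the shared ww acquire context */
		err = drm_exec_lock_obj(&exec, &bo->ttm.base);
		drm_exec_retry_on_contention(&exec);
		if (err)
			break;

		/* Placeholder for validation/binding which may evict others */
		err = xe_validate_and_bind(vm, bo);
		drm_exec_retry_on_contention(&exec);
		if (err)
			break;
	}
	drm_exec_fini(&exec);

	return err;
}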

> 
> Now if there's also SVM VRAM managed on a page lru, TTM exhaustive

Page LRU isn't used for device pages from my understanding.

> eviction is going to win because the shrinkers can only trylock dma_resv.
> So this part works. It actually works so well on the system memory side
> that if we're not careful we can trigger oom, because we're too good at
> getting at all the memory.
> 
> SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> don't see why, because the idea is that thanks to gpu and cpu page faults,
> you can always get out of a pinch by just trashing everything for a while
> and migrating the handfull of available pages a lot.
> 
> 
> - Likely allocation size is 2M which makes of size of BO (872)
>   acceptable per allocation (872 / 2M == .0004158).
> 
> With this, using TTM BO for VRAM backing store seems to be an obvious
> choice as it allows leveraging of the TTM eviction code.
> 
> Except it requires that you hold dma_resv, which brings in all kinds of
> pain. And for eviction we really don't need a lot of synchronization, so a

Yes, but I think I have solved all those issues wrt dma-resv.

What is really the alternative here? Teaching TTM to evict non-BO SVM
allocations? Writing an SVM VRAM allocator which also ends up looking
exactly like TTM and teaching it to evict TTM BOs? In the latter case
we'd still need to grab the dma-resv lock...

Do we write a new page-based buddy allocator and wire that up to TTM if
SVM could possibly be used?

This would be tons of code and I'm not really sure what the ROI is here.

> lot of that locking is not needed, unlike the case where we have a cpu
> fault, where we absolutely need mmap_lock and all that to make sure we
> fault in the right page.
> 
> But for eviction we only need to throw out some pages, if we're not
> entirely precise with picking the right ones (or have no idea into which
> vma they're all currently mapped into) it doesn't matter. That's why
> migrate_device_pages doesn't care about any of that at all, it doesn't
> need to by design. But by bo backing memory you drag in all that stuff
> that's causing headacheds for eviction.
> 
> The only thing migration tries to do is remove all pte, and if that
> succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> code as cheat sheet, we need roughly:
> 
> - reverse mapping structure like anon_vma. Except gpusvm can assume that
>   there's currently only one gpu side mapping, so we can just stuff the
>   gpusvm an va_address into the page, and protect it with the page lock.
> 
> - we need pagetable locks, so that we can manipulate pagetables (well
>   specifically make ptes invalid) without taking any other locks.
> 
> - everyone else inserting or removing ptes for svm mappings also needs to
>   lock the page, or we have races. This might be the hmm_range_fault races
>   you're seeing when allowing vram pages, since I don't think there's
>   anything else stopping the page lookup otherwise from succeeding.

AMD looks to take the range->migration_mutex to prevent races.

> 
> - we might also need to stuff migrate ptes into the gpu side, like the cpu
>   does, to hold up refaults before the migration has finished. But I think
>   those are only needed for anon memory in sram because there's no other
>   way to find the right page than swap pte entries, of which migration
>   entries are a special case.
> 
> - core code also expects us to handle the page refcount correctly for svm
>   device memory, so we can't free the pages like normal bo pages either
>   directly to drm_buddy.
> 
> Now typing this all up will look an awful lot like what you have, with the
> dma_resv lock serving as the page lock and the pagetable lock. The only

dma_resv is indeed one of the locks we need for page table updates
(binds), as we allocate TTM BOs for page tables and we install fences
for binds in dma-resv slots (certainly for non-SVM; might be able to
drop that for SVM).

> reason is that these locks are much smaller and nest within all the other
> stuff going on and so avoid the inversion issues.
> 
> So one annoying part is that this is a lot of pointlessly looking typing.
> The other is that it's full of races, because core mm really is yolo all
> the way down. So lots of ways you lock the wrong page and fun stuff like
> that, but the few cases that matter work:
> 
> - svm fault handling with hmm_range fault retries with mmu notifiers. Note
>   that we need to have vram pages locked and the notifier retrie needs to
>   be under the pagetable lock, or there's room to escape. At least that's
>   what I came up with last time I thought it all through.
>

We grab the gpusvm->notifier lock just before committing the bind and
check for retry. If we need to retry, we completely unwind all locks and
restart the GPU fault.
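
For reference, a condensed sketch of that retry pattern, following the
mmu_interval_notifier API (the notifier_lock, the vm structure, and
program_gpu_pagetables() are placeholders rather than the exact
gpusvm/Xe code):

static int fault_and_bind(struct gpu_vm *vm, struct hmm_range *range)
{
	int err;

again:
	range->notifier_seq = mmu_interval_read_begin(range->notifier);

	mmap_read_lock(vm->mm);
	err = hmm_range_fault(range);
	mmap_read_unlock(vm->mm);
	if (err == -EBUSY)
		goto again;
	if (err)
		return err;

	/* Commit the bind under the notifier lock so invalidations can't race */
	mutex_lock(&vm->notifier_lock);
	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		/* Range was invalidated since the fault, unwind and redo it */
		mutex_unlock(&vm->notifier_lock);
		goto again;
	}
	err = program_gpu_pagetables(vm, range);	/* placeholder */
	mutex_unlock(&vm->notifier_lock);

	return err;
}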
 
> - migrate_to_ram: it will hold a page reference which we know was the
>   valid vram page when the cpu pte was locked, but it might not be it
>   anymore. So we have to lock the page and check whether it's still gpu
>   mapped, and if not retry the entire fault since most likey another
>   migrate_to_ram has succeed meanwhile in parallel.
> 
> - for eviction we don't care, we might actually be migrating a page no one
>   even wants anymore.
> 
> Now I think you can get all this done with the dma_resv lock and maybe the
> bo refcount. But it does involve a tremendous amount of headaches and

I don't think the headaches are too bad...

> impendence mismatch, because that's not how page faults and migrations
> work in core mm.

Agree there is a bit of an impedance mismatch but, see above, I can't
really think of a better solution without thousands of lines of new code
and invasive changes across the subsystem.

What I have in place appears to work with very few code changes to Xe
and none to TTM. AMD also landed on a BO, likely for similar reasons to
the ones I have laid out.

Matt

> 
> Cheers, Sima
> 
> 
> Current migration policy is migrate any SVM range greater than or equal
> to 64k once.
> 
> [1] [1]https://patchwork.freedesktop.org/series/133643/
> 
> Signed-off-by: Matthew Brost [2]matthew.brost@intel.com
> ---
>  drivers/gpu/drm/xe/xe_svm.c | 81 ++++++++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_svm.h |  1 +
>  2 files changed, 81 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 4372c02a341f..fd8987e0a506 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
>  static int __xe_svm_garbage_collector(struct xe_vm *vm,
>                                       struct xe_svm_range *range)
>  {
> +       struct drm_gpusvm_ctx ctx = {};
>         struct dma_fence *fence;
> 
> +       /* Evict any pages holding references to vram allocation */
> +       if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> +               drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
> +
>         xe_vm_lock(vm, false);
>         fence = xe_vm_range_unbind(vm, range);
>         xe_vm_unlock(vm);
> @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct xe_svm_range *ran
> ge,
>         return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
>  }
> 
> +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> +{
> +       return &tile->mem.vram;
> +}
> +
> +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
> +                                      struct xe_svm_range *range,
> +                                      const struct drm_gpusvm_ctx *ctx)
> +{
> +       struct xe_mem_region *mr = tile_to_mr(tile);
> +       struct drm_buddy_block *block;
> +       struct list_head *blocks;
> +       struct xe_bo *bo;
> +       ktime_t end = 0;
> +       int err;
> +
> +retry:
> +       xe_vm_lock(vm, false);
> +       bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
> +                         range->base.va.start, ttm_bo_type_device,
> +                         XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> +                         XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
> +       xe_vm_unlock(vm);
> +       if (IS_ERR(bo)) {
> +               err = PTR_ERR(bo);
> +               if (xe_vm_validate_should_retry(NULL, err, &end))
> +                       goto retry;
> +               return bo;
> +       }
> +
> +       blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> +       list_for_each_entry(block, blocks, link)
> +               block->private = mr;
> +
> +       /*
> +        * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
> +        * creation ref can be dropped upon CPU fault or unmap.
> +        */
> +       xe_bo_get(bo);
> +
> +       err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
> +                                        bo, ctx);
> +       if (err) {
> +               xe_bo_put(bo);  /* Local ref */
> +               xe_bo_put(bo);  /* Creation ref */
> +               return ERR_PTR(err);
> +       }
> +
> +       return bo;
> +}
> +
>  int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>                             struct xe_tile *tile, u64 fault_addr,
>                             bool atomic)
>  {
> -       struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
> +       struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
> +               .vram_possible = IS_DGFX(vm->xe), };
>         struct xe_svm_range *range;
>         struct drm_gpusvm_range *r;
>         struct drm_exec exec;
>         struct dma_fence *fence;
> +       struct xe_bo *bo = NULL;
>         ktime_t end = 0;
>         int err;
> 
>         lockdep_assert_held_write(&vm->lock);
> 
>  retry:
> +       xe_bo_put(bo);
> +       bo = NULL;
> +
>         /* Always process UNMAPs first so view SVM ranges is current */
>         err = xe_svm_garbage_collector(vm);
>         if (err)
> @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
>  *vma,
>         if (xe_svm_range_is_valid(range, tile))
>                 return 0;
> 
> +       /* XXX: Add migration policy, for now migrate range once */
> +       if (IS_DGFX(vm->xe) && !range->migrated &&
> +           range->base.flags.migrate_vram &&
> +           (range->base.va.end - range->base.va.start) >= SZ_64K) {
> +               range->migrated = true;
> +
> +               bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> +               if (IS_ERR(bo)) {
> +                       drm_info(&vm->xe->drm,
> +                                "VRAM allocation failed, falling back to retryi
> ng, asid=%u, errno %ld\n",
> +                                vm->usm.asid, PTR_ERR(bo));
> +                       bo = NULL;
> +                       goto retry;
> +               }
> +       }
> +
>         err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
>         if (err == -EFAULT || err == -EPERM)    /* Corner where CPU mappings hav
> e change */
>                goto retry;
> @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> *vma,
>         dma_fence_put(fence);
> 
>  err_out:
> +       xe_bo_put(bo);
> +
>         return err;
>  }
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index 8b72e91cc37d..3f432483a230 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -18,6 +18,7 @@ struct xe_svm_range {
>         struct list_head garbage_collector_link;
>         u8 tile_present;
>         u8 tile_invalidated;
> +       u8 migrated     :1;
>  };
> 
>  int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);
> --
> 2.34.1
> 
> References
> 
>    1. https://patchwork.freedesktop.org/series/133643/
>    2. mailto:matthew.brost@intel.com
Matthew Brost Aug. 29, 2024, 9:53 p.m. UTC | #8
On Thu, Aug 29, 2024 at 04:30:11PM +0200, Christian König wrote:
> 
> 
> Am 29.08.24 um 11:53 schrieb Thomas Hellström:
> > Hi, Christian,
> > 
> > On Thu, 2024-08-29 at 11:24 +0200, Christian König wrote:
> > ...
> > 
> > > > > - Unified eviction is required (SVM VRAM and TTM BOs need to be
> > > > > able to
> > > > >     evict each other).
> > > > So core mm handles this by just roughly equally shrinking
> > > > everything.
> > > > Seems to work, and it has a pile of object shrinkers, and the page
> > > > lru is
> > > > also split into page cache and anon memory.
> > > > 
> > > > I think you need to put in more justification that unified eviction
> > > > is
> > > > required than just stating it, because a look at mm/ gives a very
> > > > well
> > > > established counterexample.
> > > > 
> > > > > - For exhaustive eviction [1], SVM VRAM allocations will almost
> > > > > certainly
> > > > >     require a dma-resv.
> > > > So from the TTM side we need exhaustive eviction, or at least
> > > > something a
> > > > bit more exhaustive than what ttm currently has. Note that i915-gem
> > > > also
> > > > never really got to perfect exhaustive eviction, it's just a pile
> > > > better
> > > > than ttm right now.
> > > Please define what exhaustive eviction should mean? I think I know
> > > what
> > > it is and I have been pushing TTM into the direction of solving this
> > > for
> > > years.
> > We internally refer to exhaustive eviction being a client is always
> > guaranteed to eventually make progress in obtaining non-pinned vram,
> > typically by incrementally locking and keeping dma-resvs across a
> > single validation including validations during buffer object
> > allocations.
> > 
> > > The last missing puzzle piece is to use drm_exec for TTM evictions,
> > and IMO keeping the dma-resv locks grabbed during eviction until at
> > least one unit of progress (one validation) has succeeded.
> 
> Yes, exactly that. My guessed understanding was actually correct.
> 
> > 
> > > but
> > > apart from that everything should work now.
> > > 
> > > 
> > > Regards,
> > > Christian.
> > But as Sima pointed out in private communication, exhaustive eviction
> > is not really needed for faulting to make (crawling) progress.
> > Watermarks and VRAM trylock shrinking should suffice, since we're
> > strictly only required to service a single gpu page granule at a time.
> 
> Yeah fault based memory management should be able to keep working as long as
> the page isn't re-migrated before you make any progress.

I prevent that via the eviction_valuable vfunc. See here [1].

[1] https://patchwork.freedesktop.org/patch/610982/?series=137870&rev=1
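
For illustration, a minimal sketch of how an eviction_valuable hook can
refuse to hand a freshly migrated range straight back to eviction (the
skip_eviction_until_bound flag is a made-up placeholder; the patch
linked above is the actual implementation):

static bool xe_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
					const struct ttm_place *place)
{
	struct xe_bo *xe_bo = container_of(bo, struct xe_bo, ttm);

	/* Don't let eviction steal the VRAM back before the fault progresses */
	if (xe_bo->skip_eviction_until_bound)
		return false;

	return ttm_bo_eviction_valuable(bo, place);
}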

> 
> Since the number of VRAM or system memory pages is very high that should
> basically never happen.
> 
> > However, ordinary bo-based jobs would still like to be able to
> > completely evict SVM vram. Whether that is important enough to strive
> > for is ofc up for discussion.
> 
> Yes, exactly that. Felix, Alex, a bunch of other AMD folks and I came up
> with the same conclusion at AMD internally as well.
>

Agree with both of you. Landed on a BO rather than rewriting the world,
as TTM appears to have everything needed for SVM aside from the
impedance mismatch Daniel has pointed out, which is resolved by
refcounting.

Matt

> Regards,
> Christian.
> 
> > 
> > /Thomas
> > 
> > 
> > 
> > > > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> > > > eviction is going to win because the shrinkers can only trylock
> > > > dma_resv.
> > > > So this part works. It actually works so well on the system memory
> > > > side
> > > > that if we're not careful we can trigger oom, because we're too
> > > > good at
> > > > getting at all the memory.
> > > > 
> > > > SVM VRAM allocations otoh do not need exhaustive evictions. Or at
> > > > least I
> > > > don't see why, because the idea is that thanks to gpu and cpu page
> > > > faults,
> > > > you can always get out of a pinch by just trashing everything for a
> > > > while
> > > > and migrating the handfull of available pages a lot.
> > > > 
> > > > > - Likely allocation size is 2M which makes of size of BO (872)
> > > > >     acceptable per allocation (872 / 2M == .0004158).
> > > > > 
> > > > > With this, using TTM BO for VRAM backing store seems to be an
> > > > > obvious
> > > > > choice as it allows leveraging of the TTM eviction code.
> > > > Except it requires that you hold dma_resv, which brings in all
> > > > kinds of
> > > > pain. And for eviction we really don't need a lot of
> > > > synchronization, so a
> > > > lot of that locking is not needed, unlike the case where we have a
> > > > cpu
> > > > fault, where we absolutely need mmap_lock and all that to make sure
> > > > we
> > > > fault in the right page.
> > > > 
> > > > But for eviction we only need to throw out some pages, if we're not
> > > > entirely precise with picking the right ones (or have no idea into
> > > > which
> > > > vma they're all currently mapped into) it doesn't matter. That's
> > > > why
> > > > migrate_device_pages doesn't care about any of that at all, it
> > > > doesn't
> > > > need to by design. But by bo backing memory you drag in all that
> > > > stuff
> > > > that's causing headacheds for eviction.
> > > > 
> > > > The only thing migration tries to do is remove all pte, and if that
> > > > succeeds, move the page. Specialized for the gpusvm case, looking
> > > > at mm/
> > > > code as cheat sheet, we need roughly:
> > > > 
> > > > - reverse mapping structure like anon_vma. Except gpusvm can assume
> > > > that
> > > >     there's currently only one gpu side mapping, so we can just
> > > > stuff the
> > > >     gpusvm an va_address into the page, and protect it with the page
> > > > lock.
> > > > 
> > > > - we need pagetable locks, so that we can manipulate pagetables
> > > > (well
> > > >     specifically make ptes invalid) without taking any other locks.
> > > > 
> > > > - everyone else inserting or removing ptes for svm mappings also
> > > > needs to
> > > >     lock the page, or we have races. This might be the
> > > > hmm_range_fault races
> > > >     you're seeing when allowing vram pages, since I don't think
> > > > there's
> > > >     anything else stopping the page lookup otherwise from
> > > > succeeding.
> > > > 
> > > > - we might also need to stuff migrate ptes into the gpu side, like
> > > > the cpu
> > > >     does, to hold up refaults before the migration has finished. But
> > > > I think
> > > >     those are only needed for anon memory in sram because there's no
> > > > other
> > > >     way to find the right page than swap pte entries, of which
> > > > migration
> > > >     entries are a special case.
> > > > 
> > > > - core code also expects us to handle the page refcount correctly
> > > > for svm
> > > >     device memory, so we can't free the pages like normal bo pages
> > > > either
> > > >     directly to drm_buddy.
> > > > 
> > > > Now typing this all up will look an awful lot like what you have,
> > > > with the
> > > > dma_resv lock serving as the page lock and the pagetable lock. The
> > > > only
> > > > reason is that these locks are much smaller and nest within all the
> > > > other
> > > > stuff going on and so avoid the inversion issues.
> > > > 
> > > > So one annoying part is that this is a lot of pointlessly looking
> > > > typing.
> > > > The other is that it's full of races, because core mm really is
> > > > yolo all
> > > > the way down. So lots of ways you lock the wrong page and fun stuff
> > > > like
> > > > that, but the few cases that matter work:
> > > > 
> > > > - svm fault handling with hmm_range fault retries with mmu
> > > > notifiers. Note
> > > >     that we need to have vram pages locked and the notifier retrie
> > > > needs to
> > > >     be under the pagetable lock, or there's room to escape. At least
> > > > that's
> > > >     what I came up with last time I thought it all through.
> > > > 
> > > > - migrate_to_ram: it will hold a page reference which we know was
> > > > the
> > > >     valid vram page when the cpu pte was locked, but it might not be
> > > > it
> > > >     anymore. So we have to lock the page and check whether it's
> > > > still gpu
> > > >     mapped, and if not retry the entire fault since most likey
> > > > another
> > > >     migrate_to_ram has succeed meanwhile in parallel.
> > > > 
> > > > - for eviction we don't care, we might actually be migrating a page
> > > > no one
> > > >     even wants anymore.
> > > > 
> > > > Now I think you can get all this done with the dma_resv lock and
> > > > maybe the
> > > > bo refcount. But it does involve a tremendous amount of headaches
> > > > and
> > > > impendence mismatch, because that's not how page faults and
> > > > migrations
> > > > work in core mm.
> > > > 
> > > > Cheers, Sima
> > > > 
> > > > > Current migration policy is migrate any SVM range greater than or
> > > > > equal
> > > > > to 64k once.
> > > > > 
> > > > > [1]https://patchwork.freedesktop.org/series/133643/
> > > > > 
> > > > > Signed-off-by: Matthew Brostmatthew.brost@intel.com
> > > > > ---
> > > > >    drivers/gpu/drm/xe/xe_svm.c | 81
> > > > > ++++++++++++++++++++++++++++++++++++-
> > > > >    drivers/gpu/drm/xe/xe_svm.h |  1 +
> > > > >    2 files changed, 81 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > > > > b/drivers/gpu/drm/xe/xe_svm.c
> > > > > index 4372c02a341f..fd8987e0a506 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > > > > @@ -217,8 +217,13 @@ static void xe_svm_invalidate(struct
> > > > > drm_gpusvm *gpusvm,
> > > > >    static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > > > >    				      struct xe_svm_range
> > > > > *range)
> > > > >    {
> > > > > +	struct drm_gpusvm_ctx ctx = {};
> > > > >    	struct dma_fence *fence;
> > > > > +	/* Evict any pages holding references to vram allocation
> > > > > */
> > > > > +	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
> > > > > +		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm,
> > > > > &range->base, &ctx);
> > > > > +
> > > > >    	xe_vm_lock(vm, false);
> > > > >    	fence = xe_vm_range_unbind(vm, range);
> > > > >    	xe_vm_unlock(vm);
> > > > > @@ -504,21 +509,77 @@ static bool xe_svm_range_is_valid(struct
> > > > > xe_svm_range *range,
> > > > >    	return (range->tile_present & ~range->tile_invalidated)
> > > > > & BIT(tile->id);
> > > > >    }
> > > > > +static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
> > > > > +{
> > > > > +	return &tile->mem.vram;
> > > > > +}
> > > > > +
> > > > > +static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct
> > > > > xe_tile *tile,
> > > > > +				       struct xe_svm_range
> > > > > *range,
> > > > > +				       const struct
> > > > > drm_gpusvm_ctx *ctx)
> > > > > +{
> > > > > +	struct xe_mem_region *mr = tile_to_mr(tile);
> > > > > +	struct drm_buddy_block *block;
> > > > > +	struct list_head *blocks;
> > > > > +	struct xe_bo *bo;
> > > > > +	ktime_t end = 0;
> > > > > +	int err;
> > > > > +
> > > > > +retry:
> > > > > +	xe_vm_lock(vm, false);
> > > > > +	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range-
> > > > > > base.va.end -
> > > > > +			  range->base.va.start,
> > > > > ttm_bo_type_device,
> > > > > +			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> > > > > +			  XE_BO_FLAG_SYSTEM_ALLOC |
> > > > > XE_BO_FLAG_SKIP_CLEAR);
> > > > > +	xe_vm_unlock(vm);
> > > > > +	if (IS_ERR(bo)) {
> > > > > +		err = PTR_ERR(bo);
> > > > > +		if (xe_vm_validate_should_retry(NULL, err,
> > > > > &end))
> > > > > +			goto retry;
> > > > > +		return bo;
> > > > > +	}
> > > > > +
> > > > > +	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)-
> > > > > > blocks;
> > > > > +	list_for_each_entry(block, blocks, link)
> > > > > +		block->private = mr;
> > > > > +
> > > > > +	/*
> > > > > +	 * Take ref because as soon as
> > > > > drm_gpusvm_migrate_to_vram succeeds the
> > > > > +	 * creation ref can be dropped upon CPU fault or unmap.
> > > > > +	 */
> > > > > +	xe_bo_get(bo);
> > > > > +
> > > > > +	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm,
> > > > > &range->base,
> > > > > +					 bo, ctx);
> > > > > +	if (err) {
> > > > > +		xe_bo_put(bo);	/* Local ref */
> > > > > +		xe_bo_put(bo);	/* Creation ref */
> > > > > +		return ERR_PTR(err);
> > > > > +	}
> > > > > +
> > > > > +	return bo;
> > > > > +}
> > > > > +
> > > > >    int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma
> > > > > *vma,
> > > > >    			    struct xe_tile *tile, u64
> > > > > fault_addr,
> > > > >    			    bool atomic)
> > > > >    {
> > > > > -	struct drm_gpusvm_ctx ctx = { .read_only =
> > > > > xe_vma_read_only(vma), };
> > > > > +	struct drm_gpusvm_ctx ctx = { .read_only =
> > > > > xe_vma_read_only(vma),
> > > > > +		.vram_possible = IS_DGFX(vm->xe), };
> > > > >    	struct xe_svm_range *range;
> > > > >    	struct drm_gpusvm_range *r;
> > > > >    	struct drm_exec exec;
> > > > >    	struct dma_fence *fence;
> > > > > +	struct xe_bo *bo = NULL;
> > > > >    	ktime_t end = 0;
> > > > >    	int err;
> > > > >    	lockdep_assert_held_write(&vm->lock);
> > > > >    retry:
> > > > > +	xe_bo_put(bo);
> > > > > +	bo = NULL;
> > > > > +
> > > > >    	/* Always process UNMAPs first so view SVM ranges is
> > > > > current */
> > > > >    	err = xe_svm_garbage_collector(vm);
> > > > >    	if (err)
> > > > > @@ -534,6 +595,22 @@ int xe_svm_handle_pagefault(struct xe_vm
> > > > > *vm, struct xe_vma *vma,
> > > > >    	if (xe_svm_range_is_valid(range, tile))
> > > > >    		return 0;
> > > > > +	/* XXX: Add migration policy, for now migrate range once
> > > > > */
> > > > > +	if (IS_DGFX(vm->xe) && !range->migrated &&
> > > > > +	    range->base.flags.migrate_vram &&
> > > > > +	    (range->base.va.end - range->base.va.start) >=
> > > > > SZ_64K) {
> > > > > +		range->migrated = true;
> > > > > +
> > > > > +		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
> > > > > +		if (IS_ERR(bo)) {
> > > > > +			drm_info(&vm->xe->drm,
> > > > > +				 "VRAM allocation failed,
> > > > > falling back to retrying, asid=%u, errno %ld\n",
> > > > > +				 vm->usm.asid, PTR_ERR(bo));
> > > > > +			bo = NULL;
> > > > > +			goto retry;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > >    	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r,
> > > > > &ctx);
> > > > >    	if (err == -EFAULT || err == -EPERM)	/* Corner where
> > > > > CPU mappings have change */
> > > > >    	       goto retry;
> > > > > @@ -567,6 +644,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm,
> > > > > struct xe_vma *vma,
> > > > >    	dma_fence_put(fence);
> > > > >    err_out:
> > > > > +	xe_bo_put(bo);
> > > > > +
> > > > >    	return err;
> > > > >    }
> > > > > diff --git a/drivers/gpu/drm/xe/xe_svm.h
> > > > > b/drivers/gpu/drm/xe/xe_svm.h
> > > > > index 8b72e91cc37d..3f432483a230 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_svm.h
> > > > > +++ b/drivers/gpu/drm/xe/xe_svm.h
> > > > > @@ -18,6 +18,7 @@ struct xe_svm_range {
> > > > >    	struct list_head garbage_collector_link;
> > > > >    	u8 tile_present;
> > > > >    	u8 tile_invalidated;
> > > > > +	u8 migrated	:1;
> > > > >    };
> > > > >    int xe_devm_add(struct xe_tile *tile, struct xe_mem_region
> > > > > *mr);
> > > > > -- 
> > > > > 2.34.1
> > > > > 
>
Matthew Brost Aug. 29, 2024, 10:12 p.m. UTC | #9
On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > But as Sima pointed out in private communication, exhaustive eviction
> > is not really needed for faulting to make (crawling) progress.
> > Watermarks and VRAM trylock shrinking should suffice, since we're
> > strictly only required to service a single gpu page granule at a time.
> > 
> > However, ordinary bo-based jobs would still like to be able to
> > completely evict SVM vram. Whether that is important enough to strive
> > for is ofc up for discussion.
> 
> My take is that you don't win anything for exhaustive eviction by having
> the dma_resv somewhere in there for svm allocations. Roughly for split lru
> world, where svm ignores bo/dma_resv:
> 
> When evicting vram from the ttm side we'll fairly switch between selecting
> bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> will eventually succeed in vacuuming up everything (with a few retries
> perhaps, if we're not yet at the head of the ww ticket queue).
> 
> svm pages we need to try to evict anyway - there's no guarantee, becaue
> the core mm might be holding temporary page references (which block

Yea, but I think you could kill the app then - not suggesting we should,
but we could. To me this is akin to a CPU fault and not being able to
migrate the device pages - the migration layer doc says when this
happens, kick it to user space and segfault the app.
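
The convention being referred to is just the migrate_to_ram error path;
a minimal sketch (the handler name and do_device_migration() are
placeholders):

static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
{
	/* Placeholder for the driver's actual device-to-system migration */
	if (do_device_migration(vmf->page, vmf->vma, vmf->address))
		/* Couldn't migrate: core mm will SIGBUS the faulting task */
		return VM_FAULT_SIGBUS;

	return 0;
}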

My last patch in the series adds some asserts to see if this ever
happens; thus far it never has. If it occurs we could gracefully handle
it by aborting the migration, I guess... I think the user really needs
to do something a bit crazy to trigger this condition - I don't think
the core randomly grabs refs to device pages, but I could be wrong.

> migration) or have the page locked (which also block the migration). But
> as long as those two steps succeed, we'll win and get the pages. There
> might be some thrashing against concurrent svm faults stealing them again,
> but they have a disadvantage since they can't steal dma_resv_locked bo.
> And if it's still too much we can stall them in the page allocator.
> 
> So it's not entirely reliable, but should be close enough.
> 
> Now for bo based svm the picture isn't any different, because holding
> dma_resv is not actually enough to migrate svm mappings. We still need to
> hope there's no temporary page references around, and we still need to
> succeed at locking the page. And the migration code only does trylocks,
> because that's it's deadlock prevent algorithm if different migrations
> needing the same set of pages, but acquiring them in a different order. So
> we win nothing.

Ok, maybe my statement above is false...

Wouldn't the only time this fails be if another migration is in flight
(e.g. a CPU fault) and they race? Then the eviction will naturally
happen via the refcount being dropped by the other migration. I guess I
likely need to update my eviction code to not free the TTM resource if
not all pages are migrated.
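
Roughly, that adjustment would look like the following sketch (the
helper names are placeholders, not the actual eviction code): only drop
the VRAM-backing TTM resource once every page reports as migrated.

static bool svm_all_pages_migrated(const unsigned long *src_pfns,
				   unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE))
			return false;	/* this page still pins the VRAM */

	return true;
}

static int svm_evict_range(struct xe_bo *bo, const unsigned long *src_pfns,
			   unsigned long npages)
{
	if (!svm_all_pages_migrated(src_pfns, npages))
		return -EBUSY;	/* racing fault holds pages, retry later */

	release_vram_resource(bo);	/* placeholder for the actual free */
	return 0;
}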

> 
> Worse, if dma_resv does actually hold up svm migration and reclaim, then
> we potentially deadlock because that lock is for a bigger range than
> individual pages (or folios). And the core mm assumes that it can get out
> of a deadlock bind by (at least stochastically) eventually succeeding in
> acquiring/locking down a single page.
> 
> This means we cannot use dma_resv tricks to give the ttm world an
> advantage in exhaustive eviction against concurrent svm faults. Or at
> least not more than we can do without by just stalling svm faults that
> need to allocate gpu memory (but that must happen without holding locks or
> we're busted).
> 

I'm a little lost here on the deadlock case. Do you mean that when we
try to evict an SVM BO we trigger reclaim by allocating system pages and
can deadlock? Doesn't TTM already have this dependency when evicting
non-SVM BOs?

> So the only benefit I'm seeing is the unified lru, which I'm not sure is
> worth it. There's also a bit a lru design tension here, because for the bo

Well also not rewriting the world...

Matt

> world we want objects that are locked to stay on the lru, so that the
> competing processes can figure out who has the winning ww ticket. The core
> mm design otoh does isolate pages and remove them from the lru when
> they're acquired, so that they don't gunk up other processes from trying
> to make forward progress and are better hidden. Which reduces temporary
> page references (from lru walk) preventing migration and stuff like that.
> 
> Cheers, Sima
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Matthew Brost Aug. 29, 2024, 10:23 p.m. UTC | #10
On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > But as Sima pointed out in private communication, exhaustive eviction
> > > is not really needed for faulting to make (crawling) progress.
> > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > strictly only required to service a single gpu page granule at a time.
> > > 
> > > However, ordinary bo-based jobs would still like to be able to
> > > completely evict SVM vram. Whether that is important enough to strive
> > > for is ofc up for discussion.
> > 
> > My take is that you don't win anything for exhaustive eviction by having
> > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > world, where svm ignores bo/dma_resv:
> > 
> > When evicting vram from the ttm side we'll fairly switch between selecting
> > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > will eventually succeed in vacuuming up everything (with a few retries
> > perhaps, if we're not yet at the head of the ww ticket queue).
> > 
> > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > the core mm might be holding temporary page references (which block
> 
> Yea, but think you can could kill the app then - not suggesting we
> should but could. To me this is akin to a CPU fault and not being able
> to migrate the device pages - the migration layer doc says when this
> happens kick this to user space and segfault the app.
> 
> My last patch in the series adds some asserts to see if this ever
> happens, thus far never. If it occurs we could gracefully handle it by
> aborting the migration I guess... I think the user really needs to
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages but could be wrong.
> 
> > migration) or have the page locked (which also block the migration). But
> > as long as those two steps succeed, we'll win and get the pages. There
> > might be some thrashing against concurrent svm faults stealing them again,
> > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > And if it's still too much we can stall them in the page allocator.
> > 
> > So it's not entirely reliable, but should be close enough.
> > 
> > Now for bo based svm the picture isn't any different, because holding
> > dma_resv is not actually enough to migrate svm mappings. We still need to
> > hope there's no temporary page references around, and we still need to
> > succeed at locking the page. And the migration code only does trylocks,
> > because that's it's deadlock prevent algorithm if different migrations
> > needing the same set of pages, but acquiring them in a different order. So
> > we win nothing.
> 
> Ok, maybe my statement above is false...
> 
> Wouldn't be the only time this falls is if another migration is in
> flight (e.g. CPU fault) and they race? Then the eviction will naturally
> happen via refcount being dropped from the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> all pages are not migrated.
> 

Also, if we add something like AMD's range->migration_mutex, I'm pretty
sure this race goes away and we're left with 'the user has done
something not smart, and we could convincingly kill the app'.

Matt

> > 
> > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > we potentially deadlock because that lock is for a bigger range than
> > individual pages (or folios). And the core mm assumes that it can get out
> > of a deadlock bind by (at least stochastically) eventually succeeding in
> > acquiring/locking down a single page.
> > 
> > This means we cannot use dma_resv tricks to give the ttm world an
> > advantage in exhaustive eviction against concurrent svm faults. Or at
> > least not more than we can do without by just stalling svm faults that
> > need to allocate gpu memory (but that must happen without holding locks or
> > we're busted).
> > 
> 
> I'm a little lost here on the deadlock case. Do you mean when we try to
> evict SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?
> 
> > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > worth it. There's also a bit a lru design tension here, because for the bo
> 
> Well also not rewriting the world...
> 
> Matt
> 
> > world we want objects that are locked to stay on the lru, so that the
> > competing processes can figure out who has the winning ww ticket. The core
> > mm design otoh does isolate pages and remove them from the lru when
> > they're acquired, so that they don't gunk up other processes from trying
> > to make forward progress and are better hidden. Which reduces temporary
> > page references (from lru walk) preventing migration and stuff like that.
> > 
> > Cheers, Sima
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
Christian König Sept. 2, 2024, 11:01 a.m. UTC | #11
On 30.08.24 at 00:12, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
>> On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
>>> But as Sima pointed out in private communication, exhaustive eviction
>>> is not really needed for faulting to make (crawling) progress.
>>> Watermarks and VRAM trylock shrinking should suffice, since we're
>>> strictly only required to service a single gpu page granule at a time.
>>>
>>> However, ordinary bo-based jobs would still like to be able to
>>> completely evict SVM vram. Whether that is important enough to strive
>>> for is ofc up for discussion.
>> My take is that you don't win anything for exhaustive eviction by having
>> the dma_resv somewhere in there for svm allocations. Roughly for split lru
>> world, where svm ignores bo/dma_resv:
>>
>> When evicting vram from the ttm side we'll fairly switch between selecting
>> bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
>> will eventually succeed in vacuuming up everything (with a few retries
>> perhaps, if we're not yet at the head of the ww ticket queue).
>>
>> svm pages we need to try to evict anyway - there's no guarantee, becaue
>> the core mm might be holding temporary page references (which block
> Yea, but think you can could kill the app then - not suggesting we
> should but could. To me this is akin to a CPU fault and not being able
> to migrate the device pages - the migration layer doc says when this
> happens kick this to user space and segfault the app.

That's most likely a bad idea. That the core holds a temporary page
reference can happen at any time without any wrongdoing from the
application, e.g. for direct I/O, swapping, etc...

So you can't punish the application with a segfault if you happen to not
be able to migrate a page because it has a reference.

Regards,
Christian.

>
> My last patch in the series adds some asserts to see if this ever
> happens, thus far never. If it occurs we could gracefully handle it by
> aborting the migration I guess... I think the user really needs to
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages but could be wrong.
>
>> migration) or have the page locked (which also block the migration). But
>> as long as those two steps succeed, we'll win and get the pages. There
>> might be some thrashing against concurrent svm faults stealing them again,
>> but they have a disadvantage since they can't steal dma_resv_locked bo.
>> And if it's still too much we can stall them in the page allocator.
>>
>> So it's not entirely reliable, but should be close enough.
>>
>> Now for bo based svm the picture isn't any different, because holding
>> dma_resv is not actually enough to migrate svm mappings. We still need to
>> hope there's no temporary page references around, and we still need to
>> succeed at locking the page. And the migration code only does trylocks,
>> because that's it's deadlock prevent algorithm if different migrations
>> needing the same set of pages, but acquiring them in a different order. So
>> we win nothing.
> Ok, maybe my statement above is false...
>
> Wouldn't be the only time this falls is if another migration is in
> flight (e.g. CPU fault) and they race? Then the eviction will naturally
> happen via refcount being dropped from the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> all pages are not migrated.
>
>> Worse, if dma_resv does actually hold up svm migration and reclaim, then
>> we potentially deadlock because that lock is for a bigger range than
>> individual pages (or folios). And the core mm assumes that it can get out
>> of a deadlock bind by (at least stochastically) eventually succeeding in
>> acquiring/locking down a single page.
>>
>> This means we cannot use dma_resv tricks to give the ttm world an
>> advantage in exhaustive eviction against concurrent svm faults. Or at
>> least not more than we can do without by just stalling svm faults that
>> need to allocate gpu memory (but that must happen without holding locks or
>> we're busted).
>>
> I'm a little lost here on the deadlock case. Do you mean when we try to
> evict SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?
>
>> So the only benefit I'm seeing is the unified lru, which I'm not sure is
>> worth it. There's also a bit a lru design tension here, because for the bo
> Well also not rewriting the world...
>
> Matt
>
>> world we want objects that are locked to stay on the lru, so that the
>> competing processes can figure out who has the winning ww ticket. The core
>> mm design otoh does isolate pages and remove them from the lru when
>> they're acquired, so that they don't gunk up other processes from trying
>> to make forward progress and are better hidden. Which reduces temporary
>> page references (from lru walk) preventing migration and stuff like that.
>>
>> Cheers, Sima
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
Daniel Vetter Sept. 2, 2024, 12:48 p.m. UTC | #12
On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > But as Sima pointed out in private communication, exhaustive eviction
> > > is not really needed for faulting to make (crawling) progress.
> > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > strictly only required to service a single gpu page granule at a time.
> > > 
> > > However, ordinary bo-based jobs would still like to be able to
> > > completely evict SVM vram. Whether that is important enough to strive
> > > for is ofc up for discussion.
> > 
> > My take is that you don't win anything for exhaustive eviction by having
> > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > world, where svm ignores bo/dma_resv:
> > 
> > When evicting vram from the ttm side we'll fairly switch between selecting
> > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > will eventually succeed in vacuuming up everything (with a few retries
> > perhaps, if we're not yet at the head of the ww ticket queue).
> > 
> > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > the core mm might be holding temporary page references (which block
> 
> Yea, but think you can could kill the app then - not suggesting we
> should but could. To me this is akin to a CPU fault and not being able
> to migrate the device pages - the migration layer doc says when this
> happens kick this to user space and segfault the app.
> 
> My last patch in the series adds some asserts to see if this ever
> happens, thus far never. If it occurs we could gracefully handle it by
> aborting the migration I guess... I think the user really needs to
> something a bit crazy to trigger this condition - I don't think the core
> randomly grabs refs to device pages but could be wrong.

I think it does :-/

If you read do_swap_page around ->migrate_to_ram:


	get_page(vmf->page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	put_page(vmf->page);

Also the migrate code itself does lock pages. So unless we toss in
additional locking on top of what core mm does (which I think should be
enough to cover migration), migrations will temporarily fail. And this is
just for multiple threads trying to get the same page back to sram, which
I think is a case we should support because the application did nothing
wrong.

> > migration) or have the page locked (which also block the migration). But
> > as long as those two steps succeed, we'll win and get the pages. There
> > might be some thrashing against concurrent svm faults stealing them again,
> > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > And if it's still too much we can stall them in the page allocator.
> > 
> > So it's not entirely reliable, but should be close enough.
> > 
> > Now for bo based svm the picture isn't any different, because holding
> > dma_resv is not actually enough to migrate svm mappings. We still need to
> > hope there's no temporary page references around, and we still need to
> > succeed at locking the page. And the migration code only does trylocks,
> > because that's it's deadlock prevent algorithm if different migrations
> > needing the same set of pages, but acquiring them in a different order. So
> > we win nothing.
> 
> Ok, maybe my statement above is false...
> 
> Wouldn't be the only time this falls is if another migration is in
> flight (e.g. CPU fault) and they race? Then the eviction will naturally
> happen via refcount being dropped from the other migration. I guess I
> likely need to update my eviction code to not free the TTM resource if
> all pages are not migrated.

Yeah. And additionally core mm relies on some amount of Good Luck here,
plus the assumption that at least falling back to a single page/folio will
work out. At least eventually ...

The trouble is if your design assumes you can migrate an entire block,
because then if threads hammer that range in different orders you'll never
make forward progress. Because the core mm code doesn't have a fancy ww
locking scheme to get out of this, but only uses trylock, plus the
assumption that falling back to a single page will work out eventually.
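
I.e. roughly this, where try_migrate_range() is a made-up helper standing
in for the whole collect/copy/finalize dance:

	/* try the whole block first, fall back to just the faulting page */
	unsigned long start = ALIGN_DOWN(fault_addr, SZ_2M);
	unsigned long end = start + SZ_2M;

	while (try_migrate_range(start, end)) {		/* made-up helper */
		if (end - start == PAGE_SIZE)
			return -EBUSY;	/* caller refaults and retries later */

		start = ALIGN_DOWN(fault_addr, PAGE_SIZE);
		end = start + PAGE_SIZE;
	}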

Wrt TTM resource refcounting, I think that all looks ok. But maybe I
checked the wrong things.

> > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > we potentially deadlock because that lock is for a bigger range than
> > individual pages (or folios). And the core mm assumes that it can get out
> > of a deadlock bind by (at least stochastically) eventually succeeding in
> > acquiring/locking down a single page.
> > 
> > This means we cannot use dma_resv tricks to give the ttm world an
> > advantage in exhaustive eviction against concurrent svm faults. Or at
> > least not more than we can do without by just stalling svm faults that
> > need to allocate gpu memory (but that must happen without holding locks or
> > we're busted).
> > 
> 
> I'm a little lost here on the deadlock case. Do you mean when we try to
> evict SVM BO we trigger reclaim by allocating system pages and can
> deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> BOs?

So you can have multiple cpu threads hammering a given svm range. And
thanks to the lols of mremap and fork each of them can have a different
view of that range (they are all obviously different processes from the
one that has created the gpusvm binding). And if you try to migrate, they
might all grab the pages in different orders, which can deadlock.

That's why there's so much retrying and also why core mm only does trylock
on pages if it grabs an entire pile.

Now if you have a lock that nests within the page lock you need to trylock
it, or it deadlocks. Which kinda defeats the point of having a bigger lock
and moving the entire bo as a unit.

But if that is outside of the page lock (like amdgpu), you still have the
issue of the elevated page reference from do_swap_page. Which also blocks
migration.

Note that neither is a hard deadlock, as in lockdep complaints, because
they're all retrying anyway. They're more like livelocks, and the bigger
your pile of pages the more likely it is that you'll always have a failed
page and need to abort and retry. Which results in threads spinning forever.

> > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > worth it. There's also a bit a lru design tension here, because for the bo
> 
> Well also not rewriting the world...

Yeah it's tough. I'm still at the "understanding all the tradeoffs" stage,
just to make that clear.
-Sima

> Matt
> 
> > world we want objects that are locked to stay on the lru, so that the
> > competing processes can figure out who has the winning ww ticket. The core
> > mm design otoh does isolate pages and remove them from the lru when
> > they're acquired, so that they don't gunk up other processes from trying
> > to make forward progress and are better hidden. Which reduces temporary
> > page references (from lru walk) preventing migration and stuff like that.
> > 
> > Cheers, Sima
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
Daniel Vetter Sept. 2, 2024, 12:50 p.m. UTC | #13
On Mon, Sep 02, 2024 at 01:01:45PM +0200, Christian König wrote:
> Am 30.08.24 um 00:12 schrieb Matthew Brost:
> > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > is not really needed for faulting to make (crawling) progress.
> > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > strictly only required to service a single gpu page granule at a time.
> > > > 
> > > > However, ordinary bo-based jobs would still like to be able to
> > > > completely evict SVM vram. Whether that is important enough to strive
> > > > for is ofc up for discussion.
> > > My take is that you don't win anything for exhaustive eviction by having
> > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > world, where svm ignores bo/dma_resv:
> > > 
> > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > will eventually succeed in vacuuming up everything (with a few retries
> > > perhaps, if we're not yet at the head of the ww ticket queue).
> > > 
> > > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > > the core mm might be holding temporary page references (which block
> > Yea, but think you can could kill the app then - not suggesting we
> > should but could. To me this is akin to a CPU fault and not being able
> > to migrate the device pages - the migration layer doc says when this
> > happens kick this to user space and segfault the app.
> 
> That's most likely a bad idea. That the core holds a temporary page
> reference can happen any time without any wrongdoing by the application,
> e.g. for direct I/O, swapping, etc...
> 
> So you can't punish the application with a segfault if you happen to not be
> able to migrate a page because it has a reference.

See my other reply, it even happens as a direct consequence of a 2nd
thread trying to migrate the exact same page from vram to sram. And that
really is a core use case.

So yeah, we really can't SIGBUS in this case.
-Sima
Daniel Vetter Sept. 2, 2024, 1:02 p.m. UTC | #14
On Thu, Aug 29, 2024 at 09:48:11PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 11:24:26AM +0200, Christian König wrote:
> >    Am 28.08.24 um 18:06 schrieb Daniel Vetter:
> > 
> 
> A lot to unpack here. Will try to address as much as I can in this
> single reply to both of you (Daniel, Christian).
> 
> > On Tue, Aug 27, 2024 at 07:48:56PM -0700, Matthew Brost wrote:
> > 
> > Migration is implemented with range granularity, with VRAM backing being
> > a VM private TTM BO (i.e., shares dma-resv with VM). The lifetime of the
> > TTM BO is limited to when the SVM range is in VRAM (i.e., when a VRAM
> > SVM range is migrated to SRAM, the TTM BO is destroyed).
> > 
> > The design choice for using TTM BO for VRAM backing store, as opposed to
> > direct buddy allocation, is as follows:
> > 
> > - DRM buddy allocations are not at page granularity, offering no
> >   advantage over a BO.
> > 
> > This one I'm not understanding.
> >
> >    Adding Arun as well. I couldn't understand it fully either, but maybe
> >    it's because the buddy allocator is more optimized for higher orders of
> >    allocations?
> > 
> 
> As currently written, BO VRAM allocation resolves to a DRM buddy
> allocation. Unless there is memory pressure this is likely 1 buddy block
> if the allocation is aligned (SVM should basically also be doing aligned
> allocations, with the common case being 2M at a time).
> 
> It was suggested in earlier revs by a colleague of mine that allocating
> directly from the DRM buddy pool provided a benefit wrt freeing a page at
> a time. It doesn't, given that even if you bypass a BO you are most likely
> going to get 1 buddy block which is larger than a page. In either case
> you need to refcount the allocation or do some wild splitting algorithm
> (I don't want to do that). Or alternatively write a new buddy allocator
> which could easily cope with freeing a page at a time (I don't want to do
> that either).
> 
> Lastly, the common case for getting dev_pagemap_ops.page_free is going
> to be consecutive calls spanning the entire allocation (e.g. eviction or
> a CPU fault which triggers migration).
> 
> > 
> > 
> > - DRM buddy allocations do not solve locking inversion problems between
> >   mmap lock and dma-resv locks.
> > 
> > Which mmap -> dma_resv inversion? I've seen a lot ... I guess it also
> > matters hugely which migration path we're in, i.e. opportunistic
> > migration, cpu fault where we have to migrate or die, or when we run out
> > of vram and need to evict stuff to make space.
> > 
> >    Mhm I think the locking order between mmap lock and dma-resv lock is
> >    well defined since dma_resv_lockdep() was added.
> >
> 
> Yes. Also solved the inversion issue by using migrate_device_*. At one
> point I had trylocking of the mmap lock (still kinda there) but have agreed
> based on Daniel's feedback to rip that out.
>  
> > - Unified eviction is required (SVM VRAM and TTM BOs need to be able to
> >   evict each other).
> > 
> > So core mm handles this by just roughly equally shrinking everything.
> > Seems to work, and it has a pile of object shrinkers, and the page lru is
> > also split into page cache and anon memory.
> > 
> > I think you need to put in more justification that unified eviction is
> > required than just stating it, because a look at mm/ gives a very well
> > established counterexample.
> > 
> > 
> > - For exhaustive eviction [1], SVM VRAM allocations will almost certainly
> >   require a dma-resv.
> > 
> > So from the TTM side we need exhaustive eviction, or at least something a
> > bit more exhaustive than what ttm currently has. Note that i915-gem also
> > never really got to perfect exhaustive eviction, it's just a pile better
> > than ttm right now.
> > 
> >    Please define what exhaustive eviction should mean? I think I know what
> >    it is and I have been pushing TTM into the direction of solving this
> >    for years.
> >    The last missing puzzle piece is to use drm_exec for TTM evictions, but
> >    apart from that everything should work now.
> >    Regards,
> >    Christian.
> 
> I think Thomas has defined this in his replies. He also touches on how our
> SVM design allows mixing user BO mappings and SVM mappings within the
> same VM. These need to be able to fairly evict each other. A dma-resv
> lock provides a level of fairness and ensures forward progress once a
> flavor of his series lands.
> 
> Also worth noting, in addition to user BOs we have kernel BOs (for page
> tables, user exec queues, etc...) in Xe which absolutely need to be able
> to evict something or the application dies.
> 
> > 
> > Now if there's also SVM VRAM managed on a page lru, TTM exhaustive
> 
> Page LRU isn't used for device pages from my understanding.

Yeah, we'd need to manage that ourselves. We could use exactly what the
core mm is doing, I haven't found anything that prohibits that. I think
core mm simply doesn't maintain zone device lru because it's not involved
in any device side access.
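
A minimal driver-side lru wouldn't need to be much more than this
(entirely hypothetical sketch, just to illustrate that nothing in core mm
stops us from doing it ourselves):

	/* entirely hypothetical, struct svm_vram_chunk & friends don't exist */
	struct svm_vram_lru {
		spinlock_t lock;
		struct list_head list;		/* of struct svm_vram_chunk */
	};

	static void svm_lru_add(struct svm_vram_lru *lru,
				struct svm_vram_chunk *chunk)
	{
		spin_lock(&lru->lock);
		list_add_tail(&chunk->lru_link, &lru->list);
		spin_unlock(&lru->lock);
	}

	/* eviction is best effort, just like a shrinker walking its lru */
	static void svm_lru_evict_some(struct svm_vram_lru *lru,
				       unsigned long nr_pages)
	{
		while (nr_pages) {
			struct svm_vram_chunk *chunk;

			spin_lock(&lru->lock);
			chunk = list_first_entry_or_null(&lru->list,
							 struct svm_vram_chunk,
							 lru_link);
			if (chunk)
				list_del_init(&chunk->lru_link);
			spin_unlock(&lru->lock);

			if (!chunk)
				break;

			/* may move fewer pages than asked for, that's fine */
			nr_pages -= min(nr_pages,
					svm_chunk_evict_to_sram(chunk));
		}
	}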

> > eviction is going to win because the shrinkers can only trylock dma_resv.
> > So this part works. It actually works so well on the system memory side
> > that if we're not careful we can trigger oom, because we're too good at
> > getting at all the memory.
> > 
> > SVM VRAM allocations otoh do not need exhaustive evictions. Or at least I
> > don't see why, because the idea is that thanks to gpu and cpu page faults,
> > you can always get out of a pinch by just trashing everything for a while
> > and migrating the handfull of available pages a lot.
> > 
> > 
> > - Likely allocation size is 2M which makes of size of BO (872)
> >   acceptable per allocation (872 / 2M == .0004158).
> > 
> > With this, using TTM BO for VRAM backing store seems to be an obvious
> > choice as it allows leveraging of the TTM eviction code.
> > 
> > Except it requires that you hold dma_resv, which brings in all kinds of
> > pain. And for eviction we really don't need a lot of synchronization, so a
> 
> Yes, but I think I have solved all those issues wrt dma-resv.
> 
> What is really the alternative here? Teaching TTM to evict non-BO SVM
> allocations? Writing an SVM VRAM allocator which also ends up looking
> exactly like TTM and teaching it to evict TTM BOs? In the latter case we'd
> still need to grab the dma-resv lock...

Yup.

> Do we write a new page based buddy allocator and wire that to TTM if SVM
> could possibly be used?

Well, rebasing it on top of a page array, like the buddy allocator in core
mm does, would be my first idea.

> This would be tons of code and I'm not really sure what the ROI is here.
> 
> > lot of that locking is not needed, unlike the case where we have a cpu
> > fault, where we absolutely need mmap_lock and all that to make sure we
> > fault in the right page.
> > 
> > But for eviction we only need to throw out some pages, if we're not
> > entirely precise with picking the right ones (or have no idea into which
> > vma they're all currently mapped into) it doesn't matter. That's why
> > migrate_device_pages doesn't care about any of that at all, it doesn't
> > need to by design. But by bo backing memory you drag in all that stuff
> > that's causing headacheds for eviction.
> > 
> > The only thing migration tries to do is remove all pte, and if that
> > succeeds, move the page. Specialized for the gpusvm case, looking at mm/
> > code as cheat sheet, we need roughly:
> > 
> > - reverse mapping structure like anon_vma. Except gpusvm can assume that
> >   there's currently only one gpu side mapping, so we can just stuff the
> >   gpusvm an va_address into the page, and protect it with the page lock.
> > 
> > - we need pagetable locks, so that we can manipulate pagetables (well
> >   specifically make ptes invalid) without taking any other locks.
> > 
> > - everyone else inserting or removing ptes for svm mappings also needs to
> >   lock the page, or we have races. This might be the hmm_range_fault races
> >   you're seeing when allowing vram pages, since I don't think there's
> >   anything else stopping the page lookup otherwise from succeeding.
> 
> AMD looks to take the range->migration_mutex to prevent races.
> 
> > 
> > - we might also need to stuff migrate ptes into the gpu side, like the cpu
> >   does, to hold up refaults before the migration has finished. But I think
> >   those are only needed for anon memory in sram because there's no other
> >   way to find the right page than swap pte entries, of which migration
> >   entries are a special case.
> > 
> > - core code also expects us to handle the page refcount correctly for svm
> >   device memory, so we can't free the pages like normal bo pages either
> >   directly to drm_buddy.
> > 
> > Now typing this all up will look an awful lot like what you have, with the
> > dma_resv lock serving as the page lock and the pagetable lock. The only
> 
> dma_resv is indeed one of locks we need for page table updates (binds)
> as we allocate TTM BOs for page tables and we install fences for binds
> in dma-resv slots (certainly for non-SVM, might be able to drop that for
> SVM).

So the way this is solved on the core mm side is with two tricks:

- Page faults race entirely, and races are only resolved at pte insertion
  time when you acquire the pagetable lock. If there's anything other than
  a page-not-present pte, you've raced and you bail out.

- Pagetables are allocated upfront, with the same trick: If someone else
  was faster, you bail out. Pagetables are never reclaimed for core mm
  code, so that avoids someone else nuking it meanwhile. At least while
  the vma mapping stays valid.

  I'm not sure we can entirely emulate that design with gpusvm.

And yeah there would be a substantial difference in code between the bo
and the svm world.

> > reason is that these locks are much smaller and nest within all the other
> > stuff going on and so avoid the inversion issues.
> > 
> > So one annoying part is that this is a lot of pointlessly looking typing.
> > The other is that it's full of races, because core mm really is yolo all
> > the way down. So lots of ways you lock the wrong page and fun stuff like
> > that, but the few cases that matter work:
> > 
> > - svm fault handling with hmm_range fault retries with mmu notifiers. Note
> >   that we need to have vram pages locked and the notifier retrie needs to
> >   be under the pagetable lock, or there's room to escape. At least that's
> >   what I came up with last time I thought it all through.

Correction: at fault time core mm does not lock pages. Just an elevated
refcount is enough.

> We grab the gpusvm->notifier lock just before committing the bind and check
> for retry. If we need to retry we completely unwind all locks and
> restart the GPU fault.

Yeah core mm does the same.
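
For reference, that retry loop ends up with the same shape as the core mm
one (rough sketch using the mmu_interval_notifier/hmm_range_fault helpers,
with the driver_* bits and types made up):

	/* hand-wavy sketch, not the actual xe code */
	static int bind_range(struct gpusvm *svm, struct gpusvm_range *range,
			      struct hmm_range *hmm_range, struct mm_struct *mm)
	{
		unsigned long seq;
		int err;

	again:
		seq = mmu_interval_read_begin(&range->notifier);
		hmm_range->notifier_seq = seq;

		mmap_read_lock(mm);
		err = hmm_range_fault(hmm_range);
		mmap_read_unlock(mm);
		if (err == -EBUSY)
			goto again;
		if (err)
			return err;

		/*
		 * Commit under the driver notifier lock and recheck the seqno,
		 * the same way core mm commits a pte under the pagetable lock.
		 */
		driver_notifier_lock(svm);
		if (mmu_interval_read_retry(&range->notifier, seq)) {
			driver_notifier_unlock(svm);
			goto again;	/* raced with an invalidation, redo */
		}
		err = driver_bind_pages(svm, range);
		driver_notifier_unlock(svm);

		return err;
	}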
  
> > - migrate_to_ram: it will hold a page reference which we know was the
> >   valid vram page when the cpu pte was locked, but it might not be it
> >   anymore. So we have to lock the page and check whether it's still gpu
> >   mapped, and if not retry the entire fault since most likey another
> >   migrate_to_ram has succeed meanwhile in parallel.
> > 
> > - for eviction we don't care, we might actually be migrating a page no one
> >   even wants anymore.
> > 
> > Now I think you can get all this done with the dma_resv lock and maybe the
> > bo refcount. But it does involve a tremendous amount of headaches and
> 
> I don't think the headaches are too bad...
> 
> > impedance mismatch, because that's not how page faults and migrations
> > work in core mm.
> 
> Agree there is a bit of impedance mismatch but, see above, I can't really
> think of a better solution without thousands of lines of new code and
> invasive changes across the subsystem.
> 
> What I have in place appears to work with very few code changes to Xe
> and none to TTM. AMD also landed on a BO, likely for similar reasons to
> the ones I have laid out.

My understanding of the history is that large chunks of the gpusvm
features were retrofitted without updating the design. So I'm not
putting that much weight on amdkfd as a good solution, it's just the
obvious incremental solution.

Cheers, Sima
Matthew Brost Sept. 2, 2024, 10:20 p.m. UTC | #15
On Mon, Sep 02, 2024 at 02:48:55PM +0200, Daniel Vetter wrote:
> On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > is not really needed for faulting to make (crawling) progress.
> > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > strictly only required to service a single gpu page granule at a time.
> > > > 
> > > > However, ordinary bo-based jobs would still like to be able to
> > > > completely evict SVM vram. Whether that is important enough to strive
> > > > for is ofc up for discussion.
> > > 
> > > My take is that you don't win anything for exhaustive eviction by having
> > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > world, where svm ignores bo/dma_resv:
> > > 
> > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > will eventually succeed in vacuuming up everything (with a few retries
> > > perhaps, if we're not yet at the head of the ww ticket queue).
> > > 
> > > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > > the core mm might be holding temporary page references (which block
> > 
> > Yea, but think you can could kill the app then - not suggesting we
> > should but could. To me this is akin to a CPU fault and not being able
> > to migrate the device pages - the migration layer doc says when this
> > happens kick this to user space and segfault the app.
> > 
> > My last patch in the series adds some asserts to see if this ever
> > happens, thus far never. If it occurs we could gracefully handle it by
> > aborting the migration I guess... I think the user really needs to
> > something a bit crazy to trigger this condition - I don't think the core
> > randomly grabs refs to device pages but could be wrong.
> 
> I think it does :-/
> 
> If you read do_swap_page around ->migrate_to_ram:
> 
> 
> 	get_page(vmf->page);
> 	pte_unmap_unlock(vmf->pte, vmf->ptl);
> 	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> 	put_page(vmf->page);
> 

Yep, I've seen this show up in some of my local rework getting MREMAP
working. The common case is I have a test with a 2M mapping which I call
MREMAP on and then immediately read from the new mapping (the one from
MREMAP). Since an MREMAP results in an UNMAP notifier event, one of the
possible solutions I have is to just evict the VRAM via
drm_gpusvm_evict_to_sram upon the UNMAP event. This case races with a
fault from user space, so the evict only moves 511 of the pages while the
CPU fault moves 1 page. This actually seems to be fine though, as the
entire VRAM allocation is migrated.

> Also the migrate code itself does lock pages. So unless we toss in
> additional locking on top of what core mm does (which I think should be
> enough to cover migration), migrations will temporarily fail. And this is
> just for multiple threads trying to get the same page back to sram, which
> I think is a case we should support because the application did nothing
> wrong.

Yes, I think I've mentioned this already. Multiple CPU faults from
different threads targeting the same range / allocation can race, but
again this actually seems fine too. Each thread gets a semi-random set
of VRAM pages which it migrates, but again the end result is that the
entire VRAM allocation is migrated after all racing faults are serviced. I
think the only guarantee when CPU faults race is that each thread's
faulting page is migrated in that thread.

I have a threaded test which hammers reads on a single 2M migration and
checks every 4k page's data integrity; it passes reliably. Working on
updating this to a fork version now too.
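
For reference, the idea of that test is roughly this (simplified
userspace sketch, not the actual test, error handling skipped; the gpu
job submission is just a placeholder comment):

	#include <assert.h>
	#include <pthread.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <sys/mman.h>

	#define RANGE_SIZE	(2u << 20)	/* one 2M SVM range */
	#define NR_THREADS	8

	static uint8_t *range;

	static void *reader(void *arg)
	{
		(void)arg;
		/* every thread touches every 4k page, racing migrate_to_ram */
		for (size_t off = 0; off < RANGE_SIZE; off += 4096)
			assert(range[off] == (uint8_t)(off >> 12));
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[NR_THREADS];

		range = mmap(NULL, RANGE_SIZE, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		for (size_t off = 0; off < RANGE_SIZE; off += 4096)
			range[off] = (uint8_t)(off >> 12);

		/* <submit a gpu job touching the range so it migrates to vram> */

		for (int i = 0; i < NR_THREADS; i++)
			pthread_create(&threads[i], NULL, reader, NULL);
		for (int i = 0; i < NR_THREADS; i++)
			pthread_join(threads[i], NULL);
		return 0;
	}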

> 
> > > migration) or have the page locked (which also block the migration). But
> > > as long as those two steps succeed, we'll win and get the pages. There
> > > might be some thrashing against concurrent svm faults stealing them again,
> > > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > > And if it's still too much we can stall them in the page allocator.
> > > 
> > > So it's not entirely reliable, but should be close enough.
> > > 
> > > Now for bo based svm the picture isn't any different, because holding
> > > dma_resv is not actually enough to migrate svm mappings. We still need to
> > > hope there's no temporary page references around, and we still need to
> > > succeed at locking the page. And the migration code only does trylocks,
> > > because that's it's deadlock prevent algorithm if different migrations
> > > needing the same set of pages, but acquiring them in a different order. So
> > > we win nothing.
> > 
> > Ok, maybe my statement above is false...
> > 
> > Wouldn't be the only time this falls is if another migration is in
> > flight (e.g. CPU fault) and they race? Then the eviction will naturally
> > happen via refcount being dropped from the other migration. I guess I
> > likely need to update my eviction code to not free the TTM resource if
> > all pages are not migrated.
> 
> Yeah. And additionally core mm relies on some amount of Good Luck here,
> plus the assumption that at least falling back to a single page/folio will
> work out. At least eventually ...
> 
> The trouble is if your design assumes you can migrate an entire block,
> because then if threads hammer that range in different orders you'll never
> make forward progress. Because the core mm code doesn't have a fancy ww
> locking scheme to get out of this, but only uses trylock, plus the
> assumption that falling back to a single page will work out eventually.
> 

Hmm, see above. I think forward progress is made unless I'm completely
missing something.

> Wrt TTM resource refcounting, I think that all looks ok. But maybe I
> checked the wrong things.
> 
> > > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > > we potentially deadlock because that lock is for a bigger range than
> > > individual pages (or folios). And the core mm assumes that it can get out
> > > of a deadlock bind by (at least stochastically) eventually succeeding in
> > > acquiring/locking down a single page.
> > > 
> > > This means we cannot use dma_resv tricks to give the ttm world an
> > > advantage in exhaustive eviction against concurrent svm faults. Or at
> > > least not more than we can do without by just stalling svm faults that
> > > need to allocate gpu memory (but that must happen without holding locks or
> > > we're busted).
> > > 
> > 
> > I'm a little lost here on the deadlock case. Do you mean when we try to
> > evict SVM BO we trigger reclaim by allocating system pages and can
> > deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> > BOs?
> 
> So you can have multiple cpu threads hammering a given svm range. And
> thanks to the lols of mremap and fork each of them can have a different
> view of that range (they are all obviously different processes from the
> one that has created the gpusvm binding). And if you try to migrate, they
> might all grab the pages in different orders, which can deadlock.
> 

Yes, grabbing locks in different orders would be bad and that could
deadlock. But I don't think that will happen; even with a range lock I
believe I have this working, as the range in the zdd points to the
originally allocated range. The MM and start / end can be wrong (with
fork / mremap) so that has to be worked around, which isn't ideal. If the
zdd or vram allocation has the lock and we remove the range from the
migration view, this conceptually makes more sense, which is kinda where
I'm trending if we agree to roughly keep what I have in place at least
initially.

I touch on this here too [1].

[1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1112527 

> That's why there's so much retrying and also why core mm only does trylock
> on pages if it grabs an entire pile.
> 
> Now if you have a lock that nests within the page lock you need to trylock
> it, or it deadlocks. Which kinda defeats the point of having a bigger lock
> and moving the entire bo as a unit.
> 
> But if that is outside of the page lock (like amdgpu), you still have the
> issue of the elevated page reference from do_swap_page. Which also blocks
> migration.
> 

See above, it doesn't actually block migration as each thread still makes
forward progress and collectively they complete the migration, at least
that is what I'm observing.

> Note that neither is a hard deadlock, as in lockdep complaints, because
> they're all retrying anyway. They're more like lifelocks, and the bigger
> your pile of pages the more likely that you'll always have a failed page
> and need to abort and retry. Which results in threads spinning forever.
> 
> > > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > > worth it. There's also a bit a lru design tension here, because for the bo
> > 
> > Well also not rewriting the world...
> 
> Yeah it's tough. I'm still at the "understanding all the tradeoffs" stage,
> just to make that clear.

That's basically where I'm at too. Trying to balance between as simple as
possible vs. the dream solution. I wrote this series fairly quickly with
what I could get working, to help me understand how all of this works.

I've also said this a few times throughout my replies: I really want
UMD / application data to help understand how SVM will be used too. I feel
like that information will also help determine some design choices (e.g.
what to optimize for).

Matt

> -Sima
> 
> > Matt
> > 
> > > world we want objects that are locked to stay on the lru, so that the
> > > competing processes can figure out who has the winning ww ticket. The core
> > > mm design otoh does isolate pages and remove them from the lru when
> > > they're acquired, so that they don't gunk up other processes from trying
> > > to make forward progress and are better hidden. Which reduces temporary
> > > page references (from lru walk) preventing migration and stuff like that.
> > > 
> > > Cheers, Sima
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Simona Vetter Sept. 3, 2024, 8:07 a.m. UTC | #16
On Mon, Sep 02, 2024 at 10:20:07PM +0000, Matthew Brost wrote:
> On Mon, Sep 02, 2024 at 02:48:55PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 29, 2024 at 10:12:53PM +0000, Matthew Brost wrote:
> > > On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > > > > But as Sima pointed out in private communication, exhaustive eviction
> > > > > is not really needed for faulting to make (crawling) progress.
> > > > > Watermarks and VRAM trylock shrinking should suffice, since we're
> > > > > strictly only required to service a single gpu page granule at a time.
> > > > > 
> > > > > However, ordinary bo-based jobs would still like to be able to
> > > > > completely evict SVM vram. Whether that is important enough to strive
> > > > > for is ofc up for discussion.
> > > > 
> > > > My take is that you don't win anything for exhaustive eviction by having
> > > > the dma_resv somewhere in there for svm allocations. Roughly for split lru
> > > > world, where svm ignores bo/dma_resv:
> > > > 
> > > > When evicting vram from the ttm side we'll fairly switch between selecting
> > > > bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> > > > will eventually succeed in vacuuming up everything (with a few retries
> > > > perhaps, if we're not yet at the head of the ww ticket queue).
> > > > 
> > > > svm pages we need to try to evict anyway - there's no guarantee, becaue
> > > > the core mm might be holding temporary page references (which block
> > > 
> > > Yea, but think you can could kill the app then - not suggesting we
> > > should but could. To me this is akin to a CPU fault and not being able
> > > to migrate the device pages - the migration layer doc says when this
> > > happens kick this to user space and segfault the app.
> > > 
> > > My last patch in the series adds some asserts to see if this ever
> > > happens, thus far never. If it occurs we could gracefully handle it by
> > > aborting the migration I guess... I think the user really needs to
> > > something a bit crazy to trigger this condition - I don't think the core
> > > randomly grabs refs to device pages but could be wrong.
> > 
> > I think it does :-/
> > 
> > If you read do_swap_page around ->migrate_to_ram:
> > 
> > 
> > 	get_page(vmf->page);
> > 	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > 	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > 	put_page(vmf->page);
> > 
> 
> Yep, I've seen this show in some of local rework getting MREMAP working.
> The common case is I have test with 2M mapping which I call MREMAP on
> and then immediately read from the new mapping (the one from MREMAP).
> Since a MREMAP results in a UNMAP notifier event one of the possible
> solutions I have just evict the VRAM via drm_gpusvm_evict_to_sram upon
> the UNMAP event. This case race with fault from the user space so the
> evict only moves 511 of the pages while the CPU fault moves 1 page. This
> actually seems to be fine though as the entire VRAM allocation is
> migrated.

There's also MREMAP_DONTUNMAP for added fun. So you cannot rely on the
unmap happening I think.

> > Also the migrate code itself does lock pages. So unless we toss in
> > additional locking on top of what core mm does (which I think should be
> > enough to cover migration), migrations will temporarily fail. And this is
> > just for multiple threads trying to get the same page back to sram, which
> > I think is a case we should support because the application did nothing
> > wrong.
> 
> Yes, I think I've mentioned this already. Multiple CPU faults from
> different threads targeting the same range / allocation can race but
> again this actually seems fine too. Each thread gets a semi-random set
> of VRAM pages which it migrates but again the end result the entire
> VRAM allocation is migrated after all racing faults are serviced. I
> think the only guarantee when CPU faults race is the faulting page per
> each thread is migrated in that thread.
> 
> I have threaded test which hammers reads on a single 2M migration which
> checks every 4k page's data integrity that passes reliably. Working on
> updated this to fork version now too.
> 
> > 
> > > > migration) or have the page locked (which also block the migration). But
> > > > as long as those two steps succeed, we'll win and get the pages. There
> > > > might be some thrashing against concurrent svm faults stealing them again,
> > > > but they have a disadvantage since they can't steal dma_resv_locked bo.
> > > > And if it's still too much we can stall them in the page allocator.
> > > > 
> > > > So it's not entirely reliable, but should be close enough.
> > > > 
> > > > Now for bo based svm the picture isn't any different, because holding
> > > > dma_resv is not actually enough to migrate svm mappings. We still need to
> > > > hope there's no temporary page references around, and we still need to
> > > > succeed at locking the page. And the migration code only does trylocks,
> > > > because that's it's deadlock prevent algorithm if different migrations
> > > > needing the same set of pages, but acquiring them in a different order. So
> > > > we win nothing.
> > > 
> > > Ok, maybe my statement above is false...
> > > 
> > > Wouldn't be the only time this falls is if another migration is in
> > > flight (e.g. CPU fault) and they race? Then the eviction will naturally
> > > happen via refcount being dropped from the other migration. I guess I
> > > likely need to update my eviction code to not free the TTM resource if
> > > all pages are not migrated.
> > 
> > Yeah. And additionally core mm relies on some amount of Good Luck here,
> > plus the assumption that at least falling back to a single page/folio will
> > work out. At least eventually ...
> > 
> > The trouble is if your design assumes you can migrate an entire block,
> > because then if threads hammer that range in different orders you'll never
> > make forward progress. Because the core mm code doesn't have a fancy ww
> > locking scheme to get out of this, but only uses trylock, plus the
> > assumption that falling back to a single page will work out eventually.
> > 
> 
> Hmm, see above. I think forward progess is made unless I'm completely
> missing something. 
> 
> > Wrt TTM resource refcounting, I think that all looks ok. But maybe I
> > checked the wrong things.
> > 
> > > > Worse, if dma_resv does actually hold up svm migration and reclaim, then
> > > > we potentially deadlock because that lock is for a bigger range than
> > > > individual pages (or folios). And the core mm assumes that it can get out
> > > > of a deadlock bind by (at least stochastically) eventually succeeding in
> > > > acquiring/locking down a single page.
> > > > 
> > > > This means we cannot use dma_resv tricks to give the ttm world an
> > > > advantage in exhaustive eviction against concurrent svm faults. Or at
> > > > least not more than we can do without by just stalling svm faults that
> > > > need to allocate gpu memory (but that must happen without holding locks or
> > > > we're busted).
> > > > 
> > > 
> > > I'm a little lost here on the deadlock case. Do you mean when we try to
> > > evict SVM BO we trigger reclaim by allocating system pages and can
> > > deadlock? Doesn't TTM already have this dependency when evicting non-SVM
> > > BOs?
> > 
> > So you can have multiple cpu threads hammering a given svm range. And
> > thanks to the lols of mremap and fork each of them can have a different
> > view of that range (they are all obviously different processes from the
> > one that has created the gpusvm binding). And if you try to migrate, they
> > might all grab the pages in different orders, which can deadlock.
> > 
> 
> Yes, grabbing locks in different orders would be bad and that could
> deadlock. But I don't that that will happen, even with a range lock I
> believe I have this working as the range in zdd is pointing the
> originally allocated range. The MM and start / end can be wrong (with
> fork / mremap) so that has to worked around which isn't ideal. If zdd or
> vram allocation has the lock and we remove the range from migration view
> this conceptually makes more sense which kinda where I'm trending if we
> agree to roughly keep what I have in place at least initially.
> 
> Touch on this here too [1].
> 
> [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1112527 

So I think there's multiple scenarios:

- You use your mmap_write hack, and don't have forks. There's only one
  migration ever, the write mmap lock will block any concurrent faults. No
  troubles.

- With the migration_mutex, at least how I understand the amdgpu design I
  think there can be trouble, if you throw enough threads at the issue.

  1. Lots of threads fault on different pages in the same range. They all
  grab a page reference from do_swap_page around calling ->migrate_to_ram.

  2. First thing they do is mutex_lock(range->migration_mutex). That
  serializes everything. It also means that the other threads will not
  wait on the migration pte, because that's not visible outside of holding
  the migration_mutex, instead they pile up against that lock. So you have
  pages in that range you cannot migrate, and you need partial migration
  support or you're stuck.

  Unless you force a full eviction for anything partially migrated before
  you try to handle gpu faults, you also need to handle partial migrations
  on the fault side or things won't make forward progress. As soon as you
  allow faults to fully race with migrate_to_ram (with or without
  migration_mutex) you need to support partial migration state because the
  gpu fault side is not in any way better at winning races than the cpu
  fault migration.

  Or you fall back to sram, but for "must be in vram" memory that's not
  going to work too well.

- Next up is migration_mutex with partial migrations, but instead of
  threads hammering different pages in parallel, they hammer the same page
  in parallel. They will all pile up in the mutex_lock(migration_mutex),
  because the migration pte is never visible outside of that lock. And the
  migration won't ever work, because there's other threads that have an
  elevated page reference.

  You can fix that by making the migration_mutex a trylock. At that point
  you've pretty much exactly reinvented the page lock semantics, except
  the lock is for a pile of pages instead of each individually.

  If you're not yet seeing this I think there's not yet enough concurrent
  faulting going on in your tests. And this is a legit use case:
  1. allocate range, move it to gpu
  2. do stuff on gpu
  3. when gpu finishes a thundering herd of cpu threads want to look at
  the result

  Note that even the trylock isn't good enough, because there's a window
  between when do_swap_page elevates the page refcount and when we
  trylock. But if we _only_ use the page lock we could move that trylock
  into do_swap_page, while we hold the cpu pagetable lock, which would
  close the race. Cpu threads that lost the race will simply busy-loop for
  a bit until the migration pte is installed, at which point they'll block
  and wait for the migration to finish.
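
Concretely, something like this (hand-wavy variant of the do_swap_page
snippet quoted further up; the driver's migrate_to_ram / migrate_vma_setup
would then need to know the fault page is already locked):

	/* hypothetical change to do_swap_page, not what's there today */
	if (!trylock_page(vmf->page)) {
		/* a migration already owns the page, back off and refault */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return 0;
	}
	get_page(vmf->page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	unlock_page(vmf->page);
	put_page(vmf->page);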

So I think even if we limit ourselves to legit use-cases, the migration_mutex
(or any other physical storage lock that's not the page lock) will just get
in the way eventually. And the core mm page lock really should be enough to
make sure migration doesn't race.

> > That's why there's so much retrying and also why core mm only does trylock
> > on pages if it grabs an entire pile.
> > 
> > Now if you have a lock that nests within the page lock you need to trylock
> > it, or it deadlocks. Which kinda defeats the point of having a bigger lock
> > and moving the entire bo as a unit.
> > 
> > But if that is outside of the page lock (like amdgpu), you still have the
> > issue of the elevated page reference from do_swap_page. Which also blocks
> > migration.
> > 
> 
> See above, it doesn't actually block migration as each thread still make
> forward progress and collectively all complete the migration, at least
> that is what I'm observing.
> 
> > Note that neither is a hard deadlock, as in lockdep complaints, because
> > they're all retrying anyway. They're more like lifelocks, and the bigger
> > your pile of pages the more likely that you'll always have a failed page
> > and need to abort and retry. Which results in threads spinning forever.
> > 
> > > > So the only benefit I'm seeing is the unified lru, which I'm not sure is
> > > > worth it. There's also a bit a lru design tension here, because for the bo
> > > 
> > > Well also not rewriting the world...
> > 
> > Yeah it's tough. I'm still at the "understanding all the tradeoffs" stage,
> > just to make that clear.
> 
> That's basically where I'm at too. Trying balance between simple as
> possible vs. dream solution. Wrote this series fairly quickly to what I
> could get working and help me understand how all of this works. 
> 
> I've also said this a few time throughout my replies, also really want
> UMD / application data to help understand how SVM will be used too. Feel
> like that information will also help determine some design choices (e.g.
> what to optimize for).

My thinking is a bit different: the ttm bo world is, for a lot of its
semantics, fundamentally at odds with core mm. And I didn't realize this
fully until this recent thread, but migrate_device.c and hmm.c really
inflict most of the bonkers "we love all the races" mm semantics onto
drivers. Which means we have a huge semantic design conflict, and the
question is where we should solve it. There's a range, but the two
extremes are roughly:

- Try to make svm look as much as possible as the bo+userptr world. This
  means lots of tensions within our code, and the risk that we design
  ourselves into a corner we cannot fix. Like we cannot trylock the
  range->migration_mutex while holding cpu pagetables in do_swap_page, so
  we'd need a different fix for that, and I haven't come up with one yet.

- Just adopt core mm design, and end up with a bunch of duplicated code.
  This means tension between these two worlds, but there's a clear design
  to address that from core mm (shrinkers for ttm bo, page reclaim for
  svm, running in a loop applying equal eviction pressure to both). So
  much cleaner separation, and really well structured interaction between
  the two worlds like we have on the igpu side already for managing sram.
  But it comes at the cost of more code.

Cheers, Sima
diff mbox series

Patch

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 4372c02a341f..fd8987e0a506 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -217,8 +217,13 @@  static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
 static int __xe_svm_garbage_collector(struct xe_vm *vm,
 				      struct xe_svm_range *range)
 {
+	struct drm_gpusvm_ctx ctx = {};
 	struct dma_fence *fence;
 
+	/* Evict any pages holding references to vram allocation */
+	if (range->base.flags.partial_unmap && IS_DGFX(vm->xe))
+		drm_gpusvm_migrate_to_sram(&vm->svm.gpusvm, &range->base, &ctx);
+
 	xe_vm_lock(vm, false);
 	fence = xe_vm_range_unbind(vm, range);
 	xe_vm_unlock(vm);
@@ -504,21 +509,77 @@  static bool xe_svm_range_is_valid(struct xe_svm_range *range,
 	return (range->tile_present & ~range->tile_invalidated) & BIT(tile->id);
 }
 
+static struct xe_mem_region *tile_to_mr(struct xe_tile *tile)
+{
+	return &tile->mem.vram;
+}
+
+static struct xe_bo *xe_svm_alloc_vram(struct xe_vm *vm, struct xe_tile *tile,
+				       struct xe_svm_range *range,
+				       const struct drm_gpusvm_ctx *ctx)
+{
+	struct xe_mem_region *mr = tile_to_mr(tile);
+	struct drm_buddy_block *block;
+	struct list_head *blocks;
+	struct xe_bo *bo;
+	ktime_t end = 0;
+	int err;
+
+retry:
+	xe_vm_lock(vm, false);
+	bo = xe_bo_create(tile_to_xe(tile), tile, vm, range->base.va.end -
+			  range->base.va.start, ttm_bo_type_device,
+			  XE_BO_FLAG_VRAM_IF_DGFX(tile) |
+			  XE_BO_FLAG_SYSTEM_ALLOC | XE_BO_FLAG_SKIP_CLEAR);
+	xe_vm_unlock(vm);
+	if (IS_ERR(bo)) {
+		err = PTR_ERR(bo);
+		if (xe_vm_validate_should_retry(NULL, err, &end))
+			goto retry;
+		return bo;
+	}
+
+	blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
+	list_for_each_entry(block, blocks, link)
+		block->private = mr;
+
+	/*
+	 * Take ref because as soon as drm_gpusvm_migrate_to_vram succeeds the
+	 * creation ref can be dropped upon CPU fault or unmap.
+	 */
+	xe_bo_get(bo);
+
+	err = drm_gpusvm_migrate_to_vram(&vm->svm.gpusvm, &range->base,
+					 bo, ctx);
+	if (err) {
+		xe_bo_put(bo);	/* Local ref */
+		xe_bo_put(bo);	/* Creation ref */
+		return ERR_PTR(err);
+	}
+
+	return bo;
+}
+
 int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			    struct xe_tile *tile, u64 fault_addr,
 			    bool atomic)
 {
-	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma), };
+	struct drm_gpusvm_ctx ctx = { .read_only = xe_vma_read_only(vma),
+		.vram_possible = IS_DGFX(vm->xe), };
 	struct xe_svm_range *range;
 	struct drm_gpusvm_range *r;
 	struct drm_exec exec;
 	struct dma_fence *fence;
+	struct xe_bo *bo = NULL;
 	ktime_t end = 0;
 	int err;
 
 	lockdep_assert_held_write(&vm->lock);
 
 retry:
+	xe_bo_put(bo);
+	bo = NULL;
+
 	/* Always process UNMAPs first so view SVM ranges is current */
 	err = xe_svm_garbage_collector(vm);
 	if (err)
@@ -534,6 +595,22 @@  int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	if (xe_svm_range_is_valid(range, tile))
 		return 0;
 
+	/* XXX: Add migration policy, for now migrate range once */
+	if (IS_DGFX(vm->xe) && !range->migrated &&
+	    range->base.flags.migrate_vram &&
+	    (range->base.va.end - range->base.va.start) >= SZ_64K) {
+		range->migrated = true;
+
+		bo = xe_svm_alloc_vram(vm, tile, range, &ctx);
+		if (IS_ERR(bo)) {
+			drm_info(&vm->xe->drm,
+				 "VRAM allocation failed, falling back to retrying, asid=%u, errno %ld\n",
+				 vm->usm.asid, PTR_ERR(bo));
+			bo = NULL;
+			goto retry;
+		}
+	}
+
 	err = drm_gpusvm_range_get_pages(&vm->svm.gpusvm, r, &ctx);
 	if (err == -EFAULT || err == -EPERM)	/* Corner where CPU mappings have change */
 	       goto retry;
@@ -567,6 +644,8 @@  int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	dma_fence_put(fence);
 
 err_out:
+	xe_bo_put(bo);
+
 	return err;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 8b72e91cc37d..3f432483a230 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -18,6 +18,7 @@  struct xe_svm_range {
 	struct list_head garbage_collector_link;
 	u8 tile_present;
 	u8 tile_invalidated;
+	u8 migrated	:1;
 };
 
 int xe_devm_add(struct xe_tile *tile, struct xe_mem_region *mr);