Message ID | 20240911030337.870160-2-matthew.brost@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Fix device private page livelock on CPU fault |
Matthew Brost <matthew.brost@intel.com> writes:

> Avoid multiple CPU page faults to the same device page racing by locking
> the page in do_swap_page before taking an additional reference to the
> page. This prevents scenarios where multiple CPU page faults each take
> an extra reference to a device page, which could abort migration in
> folio_migrate_mapping. With the device page locked in do_swap_page, the
> migrate_vma_* functions need to be updated to avoid locking the
> fault_page argument.

I added the get_page() and therefore the fault_page argument to deal with another lifetime issue. Neither Felix nor I particularly liked that solution (see https://lore.kernel.org/all/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com/T/#m715589d57716448386ff9af41da27a952d94615a), and this seems to make it even uglier, so I have suggested an alternative solution below.

> Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> DRM driver) SVM implementation if enough threads faulted the same device
> page.

I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know), but theoretically it seems like it should be possible. However, we serialize migrations of the same virtual address range to avoid this kind of issue, as it can happen the other way too (i.e. multiple threads trying to migrate to GPU).

So I suspect what happens in UVM is that one thread wins and installs the migration entry while the others fail to get the driver migration lock and bail out sufficiently early in the fault path to avoid the livelock.
> Cc: Philip Yang <Philip.Yang@amd.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  mm/memory.c         | 13 +++++++---
>  mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++--------------
>  2 files changed, 50 insertions(+), 23 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3c01d68065be..bbd97d16a96a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			 * Get a page reference while we know the page can't be
>  			 * freed.
>  			 */
> -			get_page(vmf->page);
> -			pte_unmap_unlock(vmf->pte, vmf->ptl);
> -			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> -			put_page(vmf->page);
> +			if (trylock_page(vmf->page)) {
> +				get_page(vmf->page);
> +				pte_unmap_unlock(vmf->pte, vmf->ptl);

This is all beginning to look a lot like migrate_vma_collect_pmd(). So rather than do this and then have to pass all this context (ie. fault_page) down to the migrate_vma_* functions, could we instead just do what migrate_vma_collect_pmd() does here? Ie. we already have the PTL and the page lock, so there's no reason we couldn't just set up the migration entry prior to calling migrate_to_ram().

Obviously calling migrate_vma_setup() would show the page as not migrating, but drivers could easily just fill in the src_pfn info after calling migrate_vma_setup().

This would eliminate the whole fault_page ugliness.
> + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > + put_page(vmf->page); > + unlock_page(vmf->page); > + } else { > + pte_unmap_unlock(vmf->pte, vmf->ptl); > + } > } else if (is_hwpoison_entry(entry)) { > ret = VM_FAULT_HWPOISON; > } else if (is_pte_marker_entry(entry)) { > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > index 6d66dc1c6ffa..049893a5a179 100644 > --- a/mm/migrate_device.c > +++ b/mm/migrate_device.c > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > struct mm_walk *walk) > { > struct migrate_vma *migrate = walk->private; > + struct folio *fault_folio = migrate->fault_page ? > + page_folio(migrate->fault_page) : NULL; > struct vm_area_struct *vma = walk->vma; > struct mm_struct *mm = vma->vm_mm; > unsigned long addr = start, unmapped = 0; > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > folio_get(folio); > spin_unlock(ptl); > - if (unlikely(!folio_trylock(folio))) > + if (unlikely(fault_folio != folio && > + !folio_trylock(folio))) > return migrate_vma_collect_skip(start, end, > walk); > ret = split_folio(folio); > - folio_unlock(folio); > + if (fault_folio != folio) > + folio_unlock(folio); > folio_put(folio); > if (ret) > return migrate_vma_collect_skip(start, end, > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > * optimisation to avoid walking the rmap later with > * try_to_migrate(). 
> */ > - if (folio_trylock(folio)) { > + if (fault_folio == folio || folio_trylock(folio)) { > bool anon_exclusive; > pte_t swp_pte; > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > if (folio_try_share_anon_rmap_pte(folio, page)) { > set_pte_at(mm, addr, ptep, pte); > - folio_unlock(folio); > + if (fault_folio != folio) > + folio_unlock(folio); > folio_put(folio); > mpfn = 0; > goto next; > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > unsigned long npages, > struct page *fault_page) > { > + struct folio *fault_folio = fault_page ? > + page_folio(fault_page) : NULL; > unsigned long i, restore = 0; > bool allow_drain = true; > unsigned long unmapped = 0; > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > remove_migration_ptes(folio, folio, false); > > src_pfns[i] = 0; > - folio_unlock(folio); > + if (fault_folio != folio) > + folio_unlock(folio); > folio_put(folio); > restore--; > } > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) > return -EINVAL; > if (args->fault_page && !is_device_private_page(args->fault_page)) > return -EINVAL; > + if (args->fault_page && !PageLocked(args->fault_page)) > + return -EINVAL; > > memset(args->src, 0, sizeof(*args->src) * nr_pages); > args->cpages = 0; > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) > } > EXPORT_SYMBOL(migrate_vma_pages); > > -/* > - * migrate_device_finalize() - complete page migration > - * @src_pfns: src_pfns returned from migrate_device_range() > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to > - * @npages: number of pages in the range > - * > - * Completes migration of the page by removing special migration entries. > - * Drivers must ensure copying of page data is complete and visible to the CPU > - * before calling this. 
> - */ > -void migrate_device_finalize(unsigned long *src_pfns, > - unsigned long *dst_pfns, unsigned long npages) > +static void __migrate_device_finalize(unsigned long *src_pfns, > + unsigned long *dst_pfns, > + unsigned long npages, > + struct page *fault_page) > { > + struct folio *fault_folio = fault_page ? > + page_folio(fault_page) : NULL; > unsigned long i; > > for (i = 0; i < npages; i++) { > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, > src = page_folio(page); > dst = page_folio(newpage); > remove_migration_ptes(src, dst, false); > - folio_unlock(src); > + if (fault_folio != src) > + folio_unlock(src); > > if (is_zone_device_page(page)) > put_page(page); > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, > } > } > } > + > +/* > + * migrate_device_finalize() - complete page migration > + * @src_pfns: src_pfns returned from migrate_device_range() > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to > + * @npages: number of pages in the range > + * > + * Completes migration of the page by removing special migration entries. > + * Drivers must ensure copying of page data is complete and visible to the CPU > + * before calling this. > + */ > +void migrate_device_finalize(unsigned long *src_pfns, > + unsigned long *dst_pfns, unsigned long npages) > +{ > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); > +} > EXPORT_SYMBOL(migrate_device_finalize); > > /** > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); > */ > void migrate_vma_finalize(struct migrate_vma *migrate) > { > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, > + migrate->fault_page); > } > EXPORT_SYMBOL(migrate_vma_finalize);
On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote:
>
> Matthew Brost <matthew.brost@intel.com> writes:
>
> > Avoid multiple CPU page faults to the same device page racing by locking
> > the page in do_swap_page before taking an additional reference to the
> > page. This prevents scenarios where multiple CPU page faults each take
> > an extra reference to a device page, which could abort migration in
> > folio_migrate_mapping. With the device page locked in do_swap_page, the
> > migrate_vma_* functions need to be updated to avoid locking the
> > fault_page argument.
>
> I added the get_page() and therefore the fault_page argument to deal
> with another lifetime issue. Neither Felix nor I particularly liked the
> solution though (see
> https://lore.kernel.org/all/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com/T/#m715589d57716448386ff9af41da27a952d94615a)
> and this seems to make it even uglier, so I have suggested an
> alternative solution below.

Thanks for the ref, will read this through.

> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> > DRM driver) SVM implementation if enough threads faulted the same device
> > page.
>
> I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know)

Still a driver.

> but theoretically it seems like it should be possible. However we
> serialize migrations of the same virtual address range to avoid these
> kind of issues as they can happen the other way too (ie. multiple
> threads trying to migrate to GPU).
>
> So I suspect what happens in UVM is that one thread wins and installs
> the migration entry while the others fail to get the driver migration
> lock and bail out sufficiently early in the fault path to avoid the
> live-lock.

I had to try hard to show this; I doubt an actual user could trigger it.

I wrote a test which kicked off 8 threads; each thread did a pthread join and then tried to read the same page.
This repeats in a loop for like 512 pages or something. I needed an exclusive lock in the migrate_to_ram vfunc for it to livelock. Without an exclusive lock I think on average I saw about 32k retries (i.e. migrate_to_ram calls on the same page) before a thread won this race.

From reading UVM, pretty sure if you tried hard enough you could trigger a livelock given it appears you take exclusive locks in migrate_to_ram.

> > Cc: Philip Yang <Philip.Yang@amd.com>
> > Cc: Felix Kuehling <felix.kuehling@amd.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  mm/memory.c         | 13 +++++++---
> >  mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++--------------
> >  2 files changed, 50 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3c01d68065be..bbd97d16a96a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			 * Get a page reference while we know the page can't be
> >  			 * freed.
> >  			 */
> > -			get_page(vmf->page);
> > -			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > -			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > -			put_page(vmf->page);
> > +			if (trylock_page(vmf->page)) {
> > +				get_page(vmf->page);
> > +				pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> This is all beginning to look a lot like migrate_vma_collect_pmd(). So
> rather than do this and then have to pass all this context
> (ie. fault_page) down to the migrate_vma_* functions could we instead
> just do what migrate_vma_collect_pmd() does here? Ie. we already have
> the PTL and the page lock so there's no reason we couldn't just setup
> the migration entry prior to calling migrate_to_ram().
>
> Obviously calling migrate_vma_setup() would show the page as not
> migrating, but drivers could easily just fill in the src_pfn info after
> calling migrate_vma_setup().
>
> This would eliminate the whole fault_page ugliness.
>

This seems like it would work, and I agree it would likely be cleaner. Let me play around with this and see what I come up with. I am multi-tasking a bit, so expect some delay here.

Thanks for the input,
Matt

> > +			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > +			put_page(vmf->page);
> > +			unlock_page(vmf->page);
> > +		} else {
> > +			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +		}
> >  	} else if (is_hwpoison_entry(entry)) {
> >  		ret = VM_FAULT_HWPOISON;
> >  	} else if (is_pte_marker_entry(entry)) {
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 6d66dc1c6ffa..049893a5a179 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  			   struct mm_walk *walk)
> >  {
> >  	struct migrate_vma *migrate = walk->private;
> > +	struct folio *fault_folio = migrate->fault_page ?
> > +		page_folio(migrate->fault_page) : NULL;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	unsigned long addr = start, unmapped = 0;
> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >
> >  		folio_get(folio);
> >  		spin_unlock(ptl);
> > -		if (unlikely(!folio_trylock(folio)))
> > +		if (unlikely(fault_folio != folio &&
> > +			     !folio_trylock(folio)))
> >  			return migrate_vma_collect_skip(start, end,
> >  							walk);
> >  		ret = split_folio(folio);
> > -		folio_unlock(folio);
> > +		if (fault_folio != folio)
> > +			folio_unlock(folio);
> >  		folio_put(folio);
> >  		if (ret)
> >  			return migrate_vma_collect_skip(start, end,
> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  		 * optimisation to avoid walking the rmap later with
> >  		 * try_to_migrate().
> > */ > > - if (folio_trylock(folio)) { > > + if (fault_folio == folio || folio_trylock(folio)) { > > bool anon_exclusive; > > pte_t swp_pte; > > > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > if (folio_try_share_anon_rmap_pte(folio, page)) { > > set_pte_at(mm, addr, ptep, pte); > > - folio_unlock(folio); > > + if (fault_folio != folio) > > + folio_unlock(folio); > > folio_put(folio); > > mpfn = 0; > > goto next; > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > unsigned long npages, > > struct page *fault_page) > > { > > + struct folio *fault_folio = fault_page ? > > + page_folio(fault_page) : NULL; > > unsigned long i, restore = 0; > > bool allow_drain = true; > > unsigned long unmapped = 0; > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > remove_migration_ptes(folio, folio, false); > > > > src_pfns[i] = 0; > > - folio_unlock(folio); > > + if (fault_folio != folio) > > + folio_unlock(folio); > > folio_put(folio); > > restore--; > > } > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) > > return -EINVAL; > > if (args->fault_page && !is_device_private_page(args->fault_page)) > > return -EINVAL; > > + if (args->fault_page && !PageLocked(args->fault_page)) > > + return -EINVAL; > > > > memset(args->src, 0, sizeof(*args->src) * nr_pages); > > args->cpages = 0; > > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) > > } > > EXPORT_SYMBOL(migrate_vma_pages); > > > > -/* > > - * migrate_device_finalize() - complete page migration > > - * @src_pfns: src_pfns returned from migrate_device_range() > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > - * @npages: number of pages in the range > > - * > > - * Completes migration of the page by removing special migration entries. 
> > - * Drivers must ensure copying of page data is complete and visible to the CPU > > - * before calling this. > > - */ > > -void migrate_device_finalize(unsigned long *src_pfns, > > - unsigned long *dst_pfns, unsigned long npages) > > +static void __migrate_device_finalize(unsigned long *src_pfns, > > + unsigned long *dst_pfns, > > + unsigned long npages, > > + struct page *fault_page) > > { > > + struct folio *fault_folio = fault_page ? > > + page_folio(fault_page) : NULL; > > unsigned long i; > > > > for (i = 0; i < npages; i++) { > > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, > > src = page_folio(page); > > dst = page_folio(newpage); > > remove_migration_ptes(src, dst, false); > > - folio_unlock(src); > > + if (fault_folio != src) > > + folio_unlock(src); > > > > if (is_zone_device_page(page)) > > put_page(page); > > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, > > } > > } > > } > > + > > +/* > > + * migrate_device_finalize() - complete page migration > > + * @src_pfns: src_pfns returned from migrate_device_range() > > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > + * @npages: number of pages in the range > > + * > > + * Completes migration of the page by removing special migration entries. > > + * Drivers must ensure copying of page data is complete and visible to the CPU > > + * before calling this. 
> > + */ > > +void migrate_device_finalize(unsigned long *src_pfns, > > + unsigned long *dst_pfns, unsigned long npages) > > +{ > > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); > > +} > > EXPORT_SYMBOL(migrate_device_finalize); > > > > /** > > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); > > */ > > void migrate_vma_finalize(struct migrate_vma *migrate) > > { > > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); > > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, > > + migrate->fault_page); > > } > > EXPORT_SYMBOL(migrate_vma_finalize); >
Matthew Brost <matthew.brost@intel.com> writes:

> On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote:
>>
>> Matthew Brost <matthew.brost@intel.com> writes:
>>
>> I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know)
>
> Still a driver.

Indeed, and I'm happy to answer any questions about our implementation.

>> but theoretically it seems like it should be possible. However we
>> serialize migrations of the same virtual address range to avoid these
>> kind of issues as they can happen the other way too (ie. multiple
>> threads trying to migrate to GPU).
>>
>> So I suspect what happens in UVM is that one thread wins and installs
>> the migration entry while the others fail to get the driver migration
>> lock and bail out sufficiently early in the fault path to avoid the
>> live-lock.
>>
>
> I had to try hard to show this, doubt an actual user could trigger this.
>
> I wrote a test which kicked 8 threads, each thread did a pthread join,
> and then tried to read the same page. This repeats in loop for like 512
> pages or something. I needed an exclusive lock in migrate_to_ram vfunc
> for it to livelock. Without an exclusive lock I think on average I saw
> about 32k retries (i.e. migrate_to_ram calls on the same page) before a
> thread won this race.
>
> From reading UVM, pretty sure if you tried hard enough you could trigger
> a livelock given it appears you take excluvise locks in migrate_to_ram.

Yes, I suspect you're correct here and that we just haven't tried hard enough to trigger it.
>> > Cc: Philip Yang <Philip.Yang@amd.com> >> > Cc: Felix Kuehling <felix.kuehling@amd.com> >> > Cc: Christian König <christian.koenig@amd.com> >> > Cc: Andrew Morton <akpm@linux-foundation.org> >> > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> >> > Signed-off-by: Matthew Brost <matthew.brost@intel.com> >> > --- >> > mm/memory.c | 13 +++++++--- >> > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- >> > 2 files changed, 50 insertions(+), 23 deletions(-) >> > >> > diff --git a/mm/memory.c b/mm/memory.c >> > index 3c01d68065be..bbd97d16a96a 100644 >> > --- a/mm/memory.c >> > +++ b/mm/memory.c >> > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > * Get a page reference while we know the page can't be >> > * freed. >> > */ >> > - get_page(vmf->page); >> > - pte_unmap_unlock(vmf->pte, vmf->ptl); >> > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >> > - put_page(vmf->page); >> > + if (trylock_page(vmf->page)) { >> > + get_page(vmf->page); >> > + pte_unmap_unlock(vmf->pte, vmf->ptl); >> >> This is all beginning to look a lot like migrate_vma_collect_pmd(). So >> rather than do this and then have to pass all this context >> (ie. fault_page) down to the migrate_vma_* functions could we instead >> just do what migrate_vma_collect_pmd() does here? Ie. we already have >> the PTL and the page lock so there's no reason we couldn't just setup >> the migration entry prior to calling migrate_to_ram(). >> >> Obviously calling migrate_vma_setup() would show the page as not >> migrating, but drivers could easily just fill in the src_pfn info after >> calling migrate_vma_setup(). >> >> This would eliminate the whole fault_page ugliness. >> > > This seems like it would work and agree it likely be cleaner. Let me > play around with this and see what I come up with. Multi-tasking a bit > so expect a bit of delay here. 
> > Thanks for the input, > Matt > >> > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >> > + put_page(vmf->page); >> > + unlock_page(vmf->page); >> > + } else { >> > + pte_unmap_unlock(vmf->pte, vmf->ptl); >> > + } >> > } else if (is_hwpoison_entry(entry)) { >> > ret = VM_FAULT_HWPOISON; >> > } else if (is_pte_marker_entry(entry)) { >> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c >> > index 6d66dc1c6ffa..049893a5a179 100644 >> > --- a/mm/migrate_device.c >> > +++ b/mm/migrate_device.c >> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > struct mm_walk *walk) >> > { >> > struct migrate_vma *migrate = walk->private; >> > + struct folio *fault_folio = migrate->fault_page ? >> > + page_folio(migrate->fault_page) : NULL; >> > struct vm_area_struct *vma = walk->vma; >> > struct mm_struct *mm = vma->vm_mm; >> > unsigned long addr = start, unmapped = 0; >> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > >> > folio_get(folio); >> > spin_unlock(ptl); >> > - if (unlikely(!folio_trylock(folio))) >> > + if (unlikely(fault_folio != folio && >> > + !folio_trylock(folio))) >> > return migrate_vma_collect_skip(start, end, >> > walk); >> > ret = split_folio(folio); >> > - folio_unlock(folio); >> > + if (fault_folio != folio) >> > + folio_unlock(folio); >> > folio_put(folio); >> > if (ret) >> > return migrate_vma_collect_skip(start, end, >> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > * optimisation to avoid walking the rmap later with >> > * try_to_migrate(). 
>> > */ >> > - if (folio_trylock(folio)) { >> > + if (fault_folio == folio || folio_trylock(folio)) { >> > bool anon_exclusive; >> > pte_t swp_pte; >> > >> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > >> > if (folio_try_share_anon_rmap_pte(folio, page)) { >> > set_pte_at(mm, addr, ptep, pte); >> > - folio_unlock(folio); >> > + if (fault_folio != folio) >> > + folio_unlock(folio); >> > folio_put(folio); >> > mpfn = 0; >> > goto next; >> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, >> > unsigned long npages, >> > struct page *fault_page) >> > { >> > + struct folio *fault_folio = fault_page ? >> > + page_folio(fault_page) : NULL; >> > unsigned long i, restore = 0; >> > bool allow_drain = true; >> > unsigned long unmapped = 0; >> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, >> > remove_migration_ptes(folio, folio, false); >> > >> > src_pfns[i] = 0; >> > - folio_unlock(folio); >> > + if (fault_folio != folio) >> > + folio_unlock(folio); >> > folio_put(folio); >> > restore--; >> > } >> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) >> > return -EINVAL; >> > if (args->fault_page && !is_device_private_page(args->fault_page)) >> > return -EINVAL; >> > + if (args->fault_page && !PageLocked(args->fault_page)) >> > + return -EINVAL; >> > >> > memset(args->src, 0, sizeof(*args->src) * nr_pages); >> > args->cpages = 0; >> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) >> > } >> > EXPORT_SYMBOL(migrate_vma_pages); >> > >> > -/* >> > - * migrate_device_finalize() - complete page migration >> > - * @src_pfns: src_pfns returned from migrate_device_range() >> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to >> > - * @npages: number of pages in the range >> > - * >> > - * Completes migration of the page by removing special migration entries. 
>> > - * Drivers must ensure copying of page data is complete and visible to the CPU >> > - * before calling this. >> > - */ >> > -void migrate_device_finalize(unsigned long *src_pfns, >> > - unsigned long *dst_pfns, unsigned long npages) >> > +static void __migrate_device_finalize(unsigned long *src_pfns, >> > + unsigned long *dst_pfns, >> > + unsigned long npages, >> > + struct page *fault_page) >> > { >> > + struct folio *fault_folio = fault_page ? >> > + page_folio(fault_page) : NULL; >> > unsigned long i; >> > >> > for (i = 0; i < npages; i++) { >> > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, >> > src = page_folio(page); >> > dst = page_folio(newpage); >> > remove_migration_ptes(src, dst, false); >> > - folio_unlock(src); >> > + if (fault_folio != src) >> > + folio_unlock(src); >> > >> > if (is_zone_device_page(page)) >> > put_page(page); >> > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, >> > } >> > } >> > } >> > + >> > +/* >> > + * migrate_device_finalize() - complete page migration >> > + * @src_pfns: src_pfns returned from migrate_device_range() >> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to >> > + * @npages: number of pages in the range >> > + * >> > + * Completes migration of the page by removing special migration entries. >> > + * Drivers must ensure copying of page data is complete and visible to the CPU >> > + * before calling this. 
>> > + */ >> > +void migrate_device_finalize(unsigned long *src_pfns, >> > + unsigned long *dst_pfns, unsigned long npages) >> > +{ >> > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); >> > +} >> > EXPORT_SYMBOL(migrate_device_finalize); >> > >> > /** >> > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); >> > */ >> > void migrate_vma_finalize(struct migrate_vma *migrate) >> > { >> > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); >> > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, >> > + migrate->fault_page); >> > } >> > EXPORT_SYMBOL(migrate_vma_finalize); >>
On 2024-09-18 11:10, Alistair Popple wrote:
> Matthew Brost <matthew.brost@intel.com> writes:
>
>> On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote:
>>> Matthew Brost <matthew.brost@intel.com> writes:
>>>
>>> I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know)
>> Still a driver.
> Indeed, and I'm happy to answer any questions about our implementation.
>
>>> but theoretically it seems like it should be possible. However we
>>> serialize migrations of the same virtual address range to avoid these
>>> kind of issues as they can happen the other way too (ie. multiple
>>> threads trying to migrate to GPU).
>>>
>>> So I suspect what happens in UVM is that one thread wins and installs
>>> the migration entry while the others fail to get the driver migration
>>> lock and bail out sufficiently early in the fault path to avoid the
>>> live-lock.
>>>
>> I had to try hard to show this, doubt an actual user could trigger this.
>>
>> I wrote a test which kicked 8 threads, each thread did a pthread join,
>> and then tried to read the same page. This repeats in loop for like 512
>> pages or something. I needed an exclusive lock in migrate_to_ram vfunc
>> for it to livelock. Without an exclusive lock I think on average I saw
>> about 32k retries (i.e. migrate_to_ram calls on the same page) before a
>> thread won this race.
>>
>> From reading UVM, pretty sure if you tried hard enough you could trigger
>> a livelock given it appears you take excluvise locks in migrate_to_ram.
> Yes, I suspect you're correct here and that we just haven't tried hard
> enough to trigger it.
> >>>> Cc: Philip Yang <Philip.Yang@amd.com> >>>> Cc: Felix Kuehling <felix.kuehling@amd.com> >>>> Cc: Christian König <christian.koenig@amd.com> >>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>> Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> >>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>> --- >>>> mm/memory.c | 13 +++++++--- >>>> mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- >>>> 2 files changed, 50 insertions(+), 23 deletions(-) >>>> >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 3c01d68065be..bbd97d16a96a 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> * Get a page reference while we know the page can't be >>>> * freed. >>>> */ >>>> - get_page(vmf->page); >>>> - pte_unmap_unlock(vmf->pte, vmf->ptl); >>>> - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >>>> - put_page(vmf->page); >>>> + if (trylock_page(vmf->page)) { >>>> + get_page(vmf->page); >>>> + pte_unmap_unlock(vmf->pte, vmf->ptl); >>> This is all beginning to look a lot like migrate_vma_collect_pmd(). So >>> rather than do this and then have to pass all this context >>> (ie. fault_page) down to the migrate_vma_* functions could we instead >>> just do what migrate_vma_collect_pmd() does here? Ie. we already have >>> the PTL and the page lock so there's no reason we couldn't just setup >>> the migration entry prior to calling migrate_to_ram(). >>> >>> Obviously calling migrate_vma_setup() would show the page as not >>> migrating, but drivers could easily just fill in the src_pfn info after >>> calling migrate_vma_setup(). >>> >>> This would eliminate the whole fault_page ugliness. >>> >> This seems like it would work and agree it likely be cleaner. Let me >> play around with this and see what I come up with. Multi-tasking a bit >> so expect a bit of delay here. >> >> Thanks for the input, >> Matt Thanks! Sorry, I'm late catching up after a vacation. 
Please keep Philip, Christian and myself in the loop with future patches
in this area.

Regards,
  Felix
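For comparison, Alistair's suggested alternative (install the migration entry for the faulting page under the PTL in do_swap_page() itself, before calling migrate_to_ram()) would let the fault_page plumbing disappear entirely. The following is only a pseudocode-level, untested sketch of what a driver's migrate_to_ram() might then look like; the function name is hypothetical and the key point is the hand-filled src entry after migrate_vma_setup():

```c
/* Illustrative sketch only: a driver migrate_to_ram() callback assuming
 * do_swap_page() has already installed the migration entry for
 * vmf->page (per the suggestion above), so no fault_page needs to be
 * plumbed into the migrate_vma_* functions. */
static vm_fault_t toy_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src_pfn = 0, dst_pfn = 0;
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.start		= vmf->address & PAGE_MASK,
		.end		= (vmf->address & PAGE_MASK) + PAGE_SIZE,
		.src		= &src_pfn,
		.dst		= &dst_pfn,
		.pgmap_owner	= vmf->page->pgmap->owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};

	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	/*
	 * The fault page already carries a migration entry, so
	 * migrate_vma_setup() reported it as not migrating; fill in
	 * its src info by hand before copying.
	 */
	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
		src_pfn = migrate_pfn(page_to_pfn(vmf->page)) |
			  MIGRATE_PFN_MIGRATE;

	/* ... allocate a system page, set dst_pfn, copy the data ... */

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}
```

The hand-filled src entry is exactly the driver-side adjustment described above ("drivers could easily just fill in the src_pfn info after calling migrate_vma_setup()").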
On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote:
> On 2024-09-18 11:10, Alistair Popple wrote:
> > Matthew Brost <matthew.brost@intel.com> writes:
> >
> > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote:
> > > > Matthew Brost <matthew.brost@intel.com> writes:
> > > >
> > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know)
> > > Still a driver.
> > Indeed, and I'm happy to answer any questions about our implementation.
> >
> > > > but theoretically it seems like it should be possible. However we
> > > > serialize migrations of the same virtual address range to avoid these
> > > > kind of issues as they can happen the other way too (ie. multiple
> > > > threads trying to migrate to GPU).
> > > >
> > > > So I suspect what happens in UVM is that one thread wins and installs
> > > > the migration entry while the others fail to get the driver migration
> > > > lock and bail out sufficiently early in the fault path to avoid the
> > > > live-lock.
> > > >
> > > I had to try hard to show this; doubt an actual user could trigger it.
> > >
> > > I wrote a test which kicked off 8 threads, each thread did a pthread join,
> > > and then tried to read the same page. This repeats in a loop for like 512
> > > pages or something. I needed an exclusive lock in the migrate_to_ram vfunc
> > > for it to livelock. Without an exclusive lock I think on average I saw
> > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a
> > > thread won this race.
> > >
> > > From reading UVM, pretty sure if you tried hard enough you could trigger
> > > a livelock given it appears you take exclusive locks in migrate_to_ram.
> > Yes, I suspect you're correct here and that we just haven't tried hard
> > enough to trigger it.
> > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> > > > > > Cc: Christian König <christian.koenig@amd.com> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > --- > > > > > mm/memory.c | 13 +++++++--- > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > index 3c01d68065be..bbd97d16a96a 100644 > > > > > --- a/mm/memory.c > > > > > +++ b/mm/memory.c > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > * Get a page reference while we know the page can't be > > > > > * freed. > > > > > */ > > > > > - get_page(vmf->page); > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > - put_page(vmf->page); > > > > > + if (trylock_page(vmf->page)) { > > > > > + get_page(vmf->page); > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > > > > rather than do this and then have to pass all this context > > > > (ie. fault_page) down to the migrate_vma_* functions could we instead > > > > just do what migrate_vma_collect_pmd() does here? Ie. we already have > > > > the PTL and the page lock so there's no reason we couldn't just setup > > > > the migration entry prior to calling migrate_to_ram(). > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not > > > > migrating, but drivers could easily just fill in the src_pfn info after > > > > calling migrate_vma_setup(). > > > > > > > > This would eliminate the whole fault_page ugliness. > > > > > > > This seems like it would work and agree it likely be cleaner. 
Let me
> > > play around with this and see what I come up with. Multi-tasking a bit
> > > so expect a bit of delay here.
> > >
> > > Thanks for the input,
> > > Matt
> Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip,
> Christian and myself in the loop with future patches in this area.
>
Will do. Already have another local patch set which helps drivers DMA
map 2M pages for migrations if SRAM is physically contiguous. Seems
helpful for performance on Intel hardware. Probably post that soon for
early feedback.

Longer term I'm thinking 2M migration entries, 2M device pages, and
being able to install 2M THP on VRAM -> SRAM could be really helpful.
I'm finding the migrate_vma_* functions take up like 80-90% of the time
in the CPU / GPU fault handlers on a fault (or prefetch), which doesn't
seem ideal. Seems like 2M entries for everything would really help
here. No idea how feasible this is as the core MM stuff gets confusing
fast. Any input on this idea?

Matt

> Regards,
> Felix
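The effect of the trylock the patch adds to do_swap_page() (trylock the device page before taking the extra reference, so losers of the race bail out without holding a reference that would make folio_migrate_mapping() abort) can be modeled with a small userspace program. This is purely illustrative: atomics stand in for the page lock and refcount, and the names are invented for the sketch:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/*
 * Toy model of the do_swap_page() change: a "device page" with a lock
 * bit and a refcount.  Racers trylock before taking the extra
 * reference, so only the lock holder ever adds a reference and the
 * refcount check (standing in for folio_migrate_mapping()) never
 * aborts because of a losing racer's stray reference.
 */
struct toy_page {
	atomic_flag lock;
	atomic_int refcount;	/* 1 == just the mapping's reference */
	atomic_int migrated;
	atomic_int aborted;
};

static void toy_fault(struct toy_page *p)
{
	if (atomic_flag_test_and_set(&p->lock))
		return;	/* lost the race: bail without taking a reference */

	atomic_fetch_add(&p->refcount, 1);		/* get_page() */
	if (atomic_load(&p->refcount) == 2)		/* only expected refs */
		atomic_fetch_add(&p->migrated, 1);
	else
		atomic_fetch_add(&p->aborted, 1);	/* would retry/livelock */
	atomic_fetch_sub(&p->refcount, 1);		/* put_page() */
	atomic_flag_clear(&p->lock);
}

static void *racer(void *arg)
{
	toy_fault(arg);
	return NULL;
}

/* Run nr_threads racing faults on one page; report migrated/aborted. */
static int race_once(int nr_threads, int *aborted)
{
	struct toy_page p = { .refcount = 1 };
	pthread_t t[64];
	int i;

	atomic_flag_clear(&p.lock);
	if (nr_threads > 64)
		nr_threads = 64;
	for (i = 0; i < nr_threads; i++)
		pthread_create(&t[i], NULL, racer, &p);
	for (i = 0; i < nr_threads; i++)
		pthread_join(t[i], NULL);
	*aborted = atomic_load(&p.aborted);
	return atomic_load(&p.migrated);
}
```

Without the trylock gate, every racer would increment the refcount before the check and the "migration" would intermittently abort, which is the mechanism behind the livelock reported in this thread.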
On 2024-09-20 17:23, Matthew Brost wrote: > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: >> On 2024-09-18 11:10, Alistair Popple wrote: >>> Matthew Brost <matthew.brost@intel.com> writes: >>> >>>> On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: >>>>> Matthew Brost <matthew.brost@intel.com> writes: >>>>> >>>>> I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) >>>> Still a driver. >>> Indeed, and I'm happy to answer any questions about our implementation. >>> >>>>> but theoretically it seems like it should be possible. However we >>>>> serialize migrations of the same virtual address range to avoid these >>>>> kind of issues as they can happen the other way too (ie. multiple >>>>> threads trying to migrate to GPU). >>>>> >>>>> So I suspect what happens in UVM is that one thread wins and installs >>>>> the migration entry while the others fail to get the driver migration >>>>> lock and bail out sufficiently early in the fault path to avoid the >>>>> live-lock. >>>>> >>>> I had to try hard to show this, doubt an actual user could trigger this. >>>> >>>> I wrote a test which kicked 8 threads, each thread did a pthread join, >>>> and then tried to read the same page. This repeats in loop for like 512 >>>> pages or something. I needed an exclusive lock in migrate_to_ram vfunc >>>> for it to livelock. Without an exclusive lock I think on average I saw >>>> about 32k retries (i.e. migrate_to_ram calls on the same page) before a >>>> thread won this race. >>>> >>>> From reading UVM, pretty sure if you tried hard enough you could trigger >>>> a livelock given it appears you take excluvise locks in migrate_to_ram. >>> Yes, I suspect you're correct here and that we just haven't tried hard >>> enough to trigger it. 
>>> >>>>>> Cc: Philip Yang <Philip.Yang@amd.com> >>>>>> Cc: Felix Kuehling <felix.kuehling@amd.com> >>>>>> Cc: Christian König <christian.koenig@amd.com> >>>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>>> Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> >>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>> --- >>>>>> mm/memory.c | 13 +++++++--- >>>>>> mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- >>>>>> 2 files changed, 50 insertions(+), 23 deletions(-) >>>>>> >>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>> index 3c01d68065be..bbd97d16a96a 100644 >>>>>> --- a/mm/memory.c >>>>>> +++ b/mm/memory.c >>>>>> @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>> * Get a page reference while we know the page can't be >>>>>> * freed. >>>>>> */ >>>>>> - get_page(vmf->page); >>>>>> - pte_unmap_unlock(vmf->pte, vmf->ptl); >>>>>> - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >>>>>> - put_page(vmf->page); >>>>>> + if (trylock_page(vmf->page)) { >>>>>> + get_page(vmf->page); >>>>>> + pte_unmap_unlock(vmf->pte, vmf->ptl); >>>>> This is all beginning to look a lot like migrate_vma_collect_pmd(). So >>>>> rather than do this and then have to pass all this context >>>>> (ie. fault_page) down to the migrate_vma_* functions could we instead >>>>> just do what migrate_vma_collect_pmd() does here? Ie. we already have >>>>> the PTL and the page lock so there's no reason we couldn't just setup >>>>> the migration entry prior to calling migrate_to_ram(). >>>>> >>>>> Obviously calling migrate_vma_setup() would show the page as not >>>>> migrating, but drivers could easily just fill in the src_pfn info after >>>>> calling migrate_vma_setup(). >>>>> >>>>> This would eliminate the whole fault_page ugliness. >>>>> >>>> This seems like it would work and agree it likely be cleaner. Let me >>>> play around with this and see what I come up with. Multi-tasking a bit >>>> so expect a bit of delay here. 
>>>>
>>>> Thanks for the input,
>>>> Matt
>> Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip,
>> Christian and myself in the loop with future patches in this area.
>>
> Will do. Already have another local patch set which helps drivers DMA
> map 2M pages for migrations if SRAM is physically contiguous. Seems
> helpful for performance on Intel hardware. Probably post that soon for
> early feedback.

OK.

>
> Longer term I'm thinking 2M migration entries, 2M device pages, and being
> able to install 2M THP on VRAM -> SRAM could be really helpful. I'm
> finding the migrate_vma_* functions take up like 80-90% of the time in the
> CPU / GPU fault handlers on a fault (or prefetch), which doesn't seem
> ideal. Seems like 2M entries for everything would really help here. No
> idea how feasible this is as the core MM stuff gets confusing fast. Any
> input on this idea?

I agree with your observations. We found that the migrate_vma_* code was
the bottleneck for migration performance as well, and not breaking 2M
pages could reduce that overhead a lot. I don't have any specific ideas.
I'm not familiar with the details of that code myself. Philip has looked
at this (and some old NVidia patches from a few years ago) in the past
but never had enough uninterrupted time to make it past prototyping.
Regards,
  Felix
On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: > > On 2024-09-20 17:23, Matthew Brost wrote: > > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: > > > On 2024-09-18 11:10, Alistair Popple wrote: > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) > > > > > Still a driver. > > > > Indeed, and I'm happy to answer any questions about our implementation. > > > > > > > > > > but theoretically it seems like it should be possible. However we > > > > > > serialize migrations of the same virtual address range to avoid these > > > > > > kind of issues as they can happen the other way too (ie. multiple > > > > > > threads trying to migrate to GPU). > > > > > > > > > > > > So I suspect what happens in UVM is that one thread wins and installs > > > > > > the migration entry while the others fail to get the driver migration > > > > > > lock and bail out sufficiently early in the fault path to avoid the > > > > > > live-lock. > > > > > > > > > > > I had to try hard to show this, doubt an actual user could trigger this. > > > > > > > > > > I wrote a test which kicked 8 threads, each thread did a pthread join, > > > > > and then tried to read the same page. This repeats in loop for like 512 > > > > > pages or something. I needed an exclusive lock in migrate_to_ram vfunc > > > > > for it to livelock. Without an exclusive lock I think on average I saw > > > > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a > > > > > thread won this race. > > > > > > > > > > From reading UVM, pretty sure if you tried hard enough you could trigger > > > > > a livelock given it appears you take excluvise locks in migrate_to_ram. 
> > > > Yes, I suspect you're correct here and that we just haven't tried hard > > > > enough to trigger it. > > > > > > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> > > > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> > > > > > > > Cc: Christian König <christian.koenig@amd.com> > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > --- > > > > > > > mm/memory.c | 13 +++++++--- > > > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > > > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) > > > > > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > index 3c01d68065be..bbd97d16a96a 100644 > > > > > > > --- a/mm/memory.c > > > > > > > +++ b/mm/memory.c > > > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > * Get a page reference while we know the page can't be > > > > > > > * freed. > > > > > > > */ > > > > > > > - get_page(vmf->page); > > > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > - put_page(vmf->page); > > > > > > > + if (trylock_page(vmf->page)) { > > > > > > > + get_page(vmf->page); > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > > > > > > rather than do this and then have to pass all this context > > > > > > (ie. fault_page) down to the migrate_vma_* functions could we instead > > > > > > just do what migrate_vma_collect_pmd() does here? Ie. we already have > > > > > > the PTL and the page lock so there's no reason we couldn't just setup > > > > > > the migration entry prior to calling migrate_to_ram(). 
> > > > > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not > > > > > > migrating, but drivers could easily just fill in the src_pfn info after > > > > > > calling migrate_vma_setup(). > > > > > > > > > > > > This would eliminate the whole fault_page ugliness. > > > > > > > > > > > This seems like it would work and agree it likely be cleaner. Let me > > > > > play around with this and see what I come up with. Multi-tasking a bit > > > > > so expect a bit of delay here. > > > > > > > > > > Thanks for the input, > > > > > Matt > > > Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, > > > Christian and myself in the loop with future patches in this area. > > > > > Will do. Already have another local patch set which helps drivers dma > > map 2M pages for migrations if SRAM is physically contiguous. Seems > > helpful for performance on Intel hardware. Probably post that soon for > > early feedack. > > OK. > > > > > > Longer term I thinking 2M migration entries, 2M device pages, and being > > able to install 2M THP on VRAM -> SRAM could be really helpful. I'm > > finding migrate_vma_* functions take up like 80-90% of the time in the > > CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem > > ideal. Seems like 2M entries for everything would really help here. No > > idea how feasible this is as the core MM stuff gets confusing fast. Any > > input on this idea? > > I agree with your observations. We found that the migrate_vma_* code was the > bottle neck for migration performance as well, and not breaking 2M pages > could reduce that overhead a lot. I don't have any specific ideas. I'm not > familiar with the details of that code myself. Philip has looked at this > (and some old NVidia patches from a few years ago) in the past but never had > enough uninterrupted time to make it past prototyping. > Cool good to know this isn't some crazy idea. 
I doubt it will happen anytime soon, as I need to get a working baseline
in before anything else, then start applying optimizations and help get
the remaining features in to complete the design. But eventually I will
probably try to look at this, and may ping Philip and Nvidia when I have
time to dig in here.

Matt

> Regards,
> Felix
On Fri, Sep 20, 2024 at 09:59:51PM +0000, Matthew Brost wrote: > On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: > > > > On 2024-09-20 17:23, Matthew Brost wrote: > > > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: > > > > On 2024-09-18 11:10, Alistair Popple wrote: > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) > > > > > > Still a driver. > > > > > Indeed, and I'm happy to answer any questions about our implementation. > > > > > > > > > > > > but theoretically it seems like it should be possible. However we > > > > > > > serialize migrations of the same virtual address range to avoid these > > > > > > > kind of issues as they can happen the other way too (ie. multiple > > > > > > > threads trying to migrate to GPU). > > > > > > > > > > > > > > So I suspect what happens in UVM is that one thread wins and installs > > > > > > > the migration entry while the others fail to get the driver migration > > > > > > > lock and bail out sufficiently early in the fault path to avoid the > > > > > > > live-lock. > > > > > > > > > > > > > I had to try hard to show this, doubt an actual user could trigger this. > > > > > > > > > > > > I wrote a test which kicked 8 threads, each thread did a pthread join, > > > > > > and then tried to read the same page. This repeats in loop for like 512 > > > > > > pages or something. I needed an exclusive lock in migrate_to_ram vfunc > > > > > > for it to livelock. Without an exclusive lock I think on average I saw > > > > > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a > > > > > > thread won this race. 
> > > > > > > > > > > > From reading UVM, pretty sure if you tried hard enough you could trigger > > > > > > a livelock given it appears you take excluvise locks in migrate_to_ram. > > > > > Yes, I suspect you're correct here and that we just haven't tried hard > > > > > enough to trigger it. > > > > > > > > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> > > > > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> > > > > > > > > Cc: Christian König <christian.koenig@amd.com> > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > --- > > > > > > > > mm/memory.c | 13 +++++++--- > > > > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > > > > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) > > > > > > > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > > index 3c01d68065be..bbd97d16a96a 100644 > > > > > > > > --- a/mm/memory.c > > > > > > > > +++ b/mm/memory.c > > > > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > * Get a page reference while we know the page can't be > > > > > > > > * freed. > > > > > > > > */ > > > > > > > > - get_page(vmf->page); > > > > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > > - put_page(vmf->page); > > > > > > > > + if (trylock_page(vmf->page)) { > > > > > > > > + get_page(vmf->page); > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > > > > > > > rather than do this and then have to pass all this context > > > > > > > (ie. fault_page) down to the migrate_vma_* functions could we instead > > > > > > > just do what migrate_vma_collect_pmd() does here? Ie. 
we already have > > > > > > > the PTL and the page lock so there's no reason we couldn't just setup > > > > > > > the migration entry prior to calling migrate_to_ram(). > > > > > > > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not > > > > > > > migrating, but drivers could easily just fill in the src_pfn info after > > > > > > > calling migrate_vma_setup(). > > > > > > > > > > > > > > This would eliminate the whole fault_page ugliness. > > > > > > > > > > > > > This seems like it would work and agree it likely be cleaner. Let me > > > > > > play around with this and see what I come up with. Multi-tasking a bit > > > > > > so expect a bit of delay here. > > > > > > > > > > > > Thanks for the input, > > > > > > Matt > > > > Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, > > > > Christian and myself in the loop with future patches in this area. > > > > > > > Will do. Already have another local patch set which helps drivers dma > > > map 2M pages for migrations if SRAM is physically contiguous. Seems > > > helpful for performance on Intel hardware. Probably post that soon for > > > early feedack. > > > > OK. > > > > > > > > > > Longer term I thinking 2M migration entries, 2M device pages, and being > > > able to install 2M THP on VRAM -> SRAM could be really helpful. I'm > > > finding migrate_vma_* functions take up like 80-90% of the time in the > > > CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem > > > ideal. Seems like 2M entries for everything would really help here. No > > > idea how feasible this is as the core MM stuff gets confusing fast. Any > > > input on this idea? > > > > I agree with your observations. We found that the migrate_vma_* code was the > > bottle neck for migration performance as well, and not breaking 2M pages > > could reduce that overhead a lot. I don't have any specific ideas. I'm not > > familiar with the details of that code myself. 
Philip has looked at this
> > (and some old NVidia patches from a few years ago) in the past but never had
> > enough uninterrupted time to make it past prototyping.
> >
> > Cool good to know this isn't some crazy idea. Doubt it happen anytime
> soon as I need to get a working baseline in before anything then start
> applying optimizations and help in get other features to get the design
> complete. But eventually will probably try to look at this. May ping
> Philip and Nvidia when I have time to dig in here.

I think the big step will be moving the hmm.c and migrate.c APIs over from
struct page to folios. That should also give us some nice benefits on the
GPU side, since instead of tracking 4K pages we could allocate 2M GPU
pages.

Once we have folios at the driver/core MM API level, doing all the fancy
THP stuff should be at least a well-contained problem. But I might be
delusionally optimistic here :-)

-Sima
On Tue, Sep 24, 2024 at 01:48:29PM +0200, Simona Vetter wrote: > On Fri, Sep 20, 2024 at 09:59:51PM +0000, Matthew Brost wrote: > > On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: > > > > > > On 2024-09-20 17:23, Matthew Brost wrote: > > > > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: > > > > > On 2024-09-18 11:10, Alistair Popple wrote: > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: > > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > > > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) > > > > > > > Still a driver. > > > > > > Indeed, and I'm happy to answer any questions about our implementation. > > > > > > > > > > > > > > but theoretically it seems like it should be possible. However we > > > > > > > > serialize migrations of the same virtual address range to avoid these > > > > > > > > kind of issues as they can happen the other way too (ie. multiple > > > > > > > > threads trying to migrate to GPU). > > > > > > > > > > > > > > > > So I suspect what happens in UVM is that one thread wins and installs > > > > > > > > the migration entry while the others fail to get the driver migration > > > > > > > > lock and bail out sufficiently early in the fault path to avoid the > > > > > > > > live-lock. > > > > > > > > > > > > > > > I had to try hard to show this, doubt an actual user could trigger this. > > > > > > > > > > > > > > I wrote a test which kicked 8 threads, each thread did a pthread join, > > > > > > > and then tried to read the same page. This repeats in loop for like 512 > > > > > > > pages or something. I needed an exclusive lock in migrate_to_ram vfunc > > > > > > > for it to livelock. Without an exclusive lock I think on average I saw > > > > > > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a > > > > > > > thread won this race. 
> > > > > > > > > > > > > > From reading UVM, pretty sure if you tried hard enough you could trigger > > > > > > > a livelock given it appears you take excluvise locks in migrate_to_ram. > > > > > > Yes, I suspect you're correct here and that we just haven't tried hard > > > > > > enough to trigger it. > > > > > > > > > > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> > > > > > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> > > > > > > > > > Cc: Christian König <christian.koenig@amd.com> > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > --- > > > > > > > > > mm/memory.c | 13 +++++++--- > > > > > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > > > > > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > > > index 3c01d68065be..bbd97d16a96a 100644 > > > > > > > > > --- a/mm/memory.c > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > * Get a page reference while we know the page can't be > > > > > > > > > * freed. > > > > > > > > > */ > > > > > > > > > - get_page(vmf->page); > > > > > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > > > - put_page(vmf->page); > > > > > > > > > + if (trylock_page(vmf->page)) { > > > > > > > > > + get_page(vmf->page); > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > > > > > > > > rather than do this and then have to pass all this context > > > > > > > > (ie. 
fault_page) down to the migrate_vma_* functions could we instead > > > > > > > > just do what migrate_vma_collect_pmd() does here? Ie. we already have > > > > > > > > the PTL and the page lock so there's no reason we couldn't just setup > > > > > > > > the migration entry prior to calling migrate_to_ram(). > > > > > > > > > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not > > > > > > > > migrating, but drivers could easily just fill in the src_pfn info after > > > > > > > > calling migrate_vma_setup(). > > > > > > > > > > > > > > > > This would eliminate the whole fault_page ugliness. > > > > > > > > > > > > > > > This seems like it would work and agree it likely be cleaner. Let me > > > > > > > play around with this and see what I come up with. Multi-tasking a bit > > > > > > > so expect a bit of delay here. > > > > > > > > > > > > > > Thanks for the input, > > > > > > > Matt > > > > > Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, > > > > > Christian and myself in the loop with future patches in this area. > > > > > > > > > Will do. Already have another local patch set which helps drivers dma > > > > map 2M pages for migrations if SRAM is physically contiguous. Seems > > > > helpful for performance on Intel hardware. Probably post that soon for > > > > early feedack. > > > > > > OK. > > > > > > > > > > > > > > Longer term I thinking 2M migration entries, 2M device pages, and being > > > > able to install 2M THP on VRAM -> SRAM could be really helpful. I'm > > > > finding migrate_vma_* functions take up like 80-90% of the time in the > > > > CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem > > > > ideal. Seems like 2M entries for everything would really help here. No > > > > idea how feasible this is as the core MM stuff gets confusing fast. Any > > > > input on this idea? > > > > > > I agree with your observations. 
We found that the migrate_vma_* code was the
> > > bottle neck for migration performance as well, and not breaking 2M pages
> > > could reduce that overhead a lot. I don't have any specific ideas. I'm not
> > > familiar with the details of that code myself. Philip has looked at this
> > > (and some old NVidia patches from a few years ago) in the past but never had
> > > enough uninterrupted time to make it past prototyping.
> >
> > Cool good to know this isn't some crazy idea. Doubt it happen anytime
> > soon as I need to get a working baseline in before anything then start
> > applying optimizations and help in get other features to get the design
> > complete. But eventually will probably try to look at this. May ping
> > Philip and Nvidia when I have time to dig in here.
>
> I think the big step will be moving hmm.c and migrate.c apis over from
> struct page to folios. That should also give us some nice benefits on the
> gpu side, since instead of 4k pages to track we could allocate 2m gpu
> pages.

I was thinking we could just encode the order in the migration PFN like HMM
does. Then only every Nth-order entry in the page array needs to be
populated - HMM populates every entry, though, which doesn't seem to be
required. Maybe having a folio API makes more sense?

> Once we have folios at the driver/core mm api level doing all the fancy
> thp stuff should be at least a well-contained problem. But I might be
> delusionally optimistic here :-)

I think it is contained in the sense that the DRM SVM layer just allocates
a THP or a large contiguous chunk of device memory and hands it off to the
migrate layer, and that layer does the right thing. The 'right thing' here,
I believe, is a decent amount of core MM work though.
Matt > -Sima > > > > > Matt > > > > > Regards, > > > Felix > > > > > > > > > > > > > > Matt > > > > > > > > > Regards, > > > > > Felix > > > > > > > > > > > > > > > > > > > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > > > + put_page(vmf->page); > > > > > > > > > + unlock_page(vmf->page); > > > > > > > > > + } else { > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > + } > > > > > > > > > } else if (is_hwpoison_entry(entry)) { > > > > > > > > > ret = VM_FAULT_HWPOISON; > > > > > > > > > } else if (is_pte_marker_entry(entry)) { > > > > > > > > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > > > > > > > > > index 6d66dc1c6ffa..049893a5a179 100644 > > > > > > > > > --- a/mm/migrate_device.c > > > > > > > > > +++ b/mm/migrate_device.c > > > > > > > > > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > struct mm_walk *walk) > > > > > > > > > { > > > > > > > > > struct migrate_vma *migrate = walk->private; > > > > > > > > > + struct folio *fault_folio = migrate->fault_page ? 
> > > > > > > > > + page_folio(migrate->fault_page) : NULL; > > > > > > > > > struct vm_area_struct *vma = walk->vma; > > > > > > > > > struct mm_struct *mm = vma->vm_mm; > > > > > > > > > unsigned long addr = start, unmapped = 0; > > > > > > > > > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > folio_get(folio); > > > > > > > > > spin_unlock(ptl); > > > > > > > > > - if (unlikely(!folio_trylock(folio))) > > > > > > > > > + if (unlikely(fault_folio != folio && > > > > > > > > > + !folio_trylock(folio))) > > > > > > > > > return migrate_vma_collect_skip(start, end, > > > > > > > > > walk); > > > > > > > > > ret = split_folio(folio); > > > > > > > > > - folio_unlock(folio); > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > + folio_unlock(folio); > > > > > > > > > folio_put(folio); > > > > > > > > > if (ret) > > > > > > > > > return migrate_vma_collect_skip(start, end, > > > > > > > > > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > * optimisation to avoid walking the rmap later with > > > > > > > > > * try_to_migrate(). 
> > > > > > > > > */ > > > > > > > > > - if (folio_trylock(folio)) { > > > > > > > > > + if (fault_folio == folio || folio_trylock(folio)) { > > > > > > > > > bool anon_exclusive; > > > > > > > > > pte_t swp_pte; > > > > > > > > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > if (folio_try_share_anon_rmap_pte(folio, page)) { > > > > > > > > > set_pte_at(mm, addr, ptep, pte); > > > > > > > > > - folio_unlock(folio); > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > + folio_unlock(folio); > > > > > > > > > folio_put(folio); > > > > > > > > > mpfn = 0; > > > > > > > > > goto next; > > > > > > > > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > > > > > > > > unsigned long npages, > > > > > > > > > struct page *fault_page) > > > > > > > > > { > > > > > > > > > + struct folio *fault_folio = fault_page ? > > > > > > > > > + page_folio(fault_page) : NULL; > > > > > > > > > unsigned long i, restore = 0; > > > > > > > > > bool allow_drain = true; > > > > > > > > > unsigned long unmapped = 0; > > > > > > > > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > > > > > > > > remove_migration_ptes(folio, folio, false); > > > > > > > > > src_pfns[i] = 0; > > > > > > > > > - folio_unlock(folio); > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > + folio_unlock(folio); > > > > > > > > > folio_put(folio); > > > > > > > > > restore--; > > > > > > > > > } > > > > > > > > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) > > > > > > > > > return -EINVAL; > > > > > > > > > if (args->fault_page && !is_device_private_page(args->fault_page)) > > > > > > > > > return -EINVAL; > > > > > > > > > + if (args->fault_page && !PageLocked(args->fault_page)) > > > > > > > > > + return -EINVAL; > > > > > > > > > memset(args->src, 0, sizeof(*args->src) * nr_pages); > > > > > > > > > args->cpages = 0; > > > > > > > > > @@ -799,19 
+809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL(migrate_vma_pages); > > > > > > > > > -/* > > > > > > > > > - * migrate_device_finalize() - complete page migration > > > > > > > > > - * @src_pfns: src_pfns returned from migrate_device_range() > > > > > > > > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > > > > > > > > - * @npages: number of pages in the range > > > > > > > > > - * > > > > > > > > > - * Completes migration of the page by removing special migration entries. > > > > > > > > > - * Drivers must ensure copying of page data is complete and visible to the CPU > > > > > > > > > - * before calling this. > > > > > > > > > - */ > > > > > > > > > -void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > - unsigned long *dst_pfns, unsigned long npages) > > > > > > > > > +static void __migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > + unsigned long *dst_pfns, > > > > > > > > > + unsigned long npages, > > > > > > > > > + struct page *fault_page) > > > > > > > > > { > > > > > > > > > + struct folio *fault_folio = fault_page ? 
> > > > > > > > > + page_folio(fault_page) : NULL; > > > > > > > > > unsigned long i; > > > > > > > > > for (i = 0; i < npages; i++) { > > > > > > > > > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > src = page_folio(page); > > > > > > > > > dst = page_folio(newpage); > > > > > > > > > remove_migration_ptes(src, dst, false); > > > > > > > > > - folio_unlock(src); > > > > > > > > > + if (fault_folio != src) > > > > > > > > > + folio_unlock(src); > > > > > > > > > if (is_zone_device_page(page)) > > > > > > > > > put_page(page); > > > > > > > > > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > } > > > > > > > > > } > > > > > > > > > } > > > > > > > > > + > > > > > > > > > +/* > > > > > > > > > + * migrate_device_finalize() - complete page migration > > > > > > > > > + * @src_pfns: src_pfns returned from migrate_device_range() > > > > > > > > > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > > > > > > > > + * @npages: number of pages in the range > > > > > > > > > + * > > > > > > > > > + * Completes migration of the page by removing special migration entries. > > > > > > > > > + * Drivers must ensure copying of page data is complete and visible to the CPU > > > > > > > > > + * before calling this. 
> > > > > > > > > + */ > > > > > > > > > +void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > + unsigned long *dst_pfns, unsigned long npages) > > > > > > > > > +{ > > > > > > > > > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); > > > > > > > > > +} > > > > > > > > > EXPORT_SYMBOL(migrate_device_finalize); > > > > > > > > > /** > > > > > > > > > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); > > > > > > > > > */ > > > > > > > > > void migrate_vma_finalize(struct migrate_vma *migrate) > > > > > > > > > { > > > > > > > > > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); > > > > > > > > > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, > > > > > > > > > + migrate->fault_page); > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL(migrate_vma_finalize); > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
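The trylock-and-bail scheme discussed in this thread — one faulting thread wins the page lock and calls migrate_to_ram(), the rest back off and re-take the fault — can be sketched as a user-space analogue. This is illustrative only: a pthread mutex stands in for the page/folio lock, and all names and counters here are made up, not from the patch:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* User-space analogue of the trylock in do_swap_page(): only the
 * thread that wins the trylock does the "migration"; losers bail
 * out, as a faulting thread would return and retry the fault. */

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
atomic_int migrations;  /* threads that won and did the work */
atomic_int bailouts;    /* threads that backed off           */

static void *fault_handler(void *arg)
{
	(void)arg;
	if (pthread_mutex_trylock(&page_lock) == 0) {
		/* Winner: analogous to get_page() + migrate_to_ram(). */
		atomic_fetch_add(&migrations, 1);
		pthread_mutex_unlock(&page_lock);
	} else {
		/* Loser: analogous to bailing out and re-faulting. */
		atomic_fetch_add(&bailouts, 1);
	}
	return NULL;
}

/* Race nthreads "faulting" threads; returns how many completed. */
int run_racing_faults(int nthreads)
{
	pthread_t tids[64];

	atomic_store(&migrations, 0);
	atomic_store(&bailouts, 0);
	if (nthreads > 64)
		nthreads = 64;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, fault_handler, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	return atomic_load(&migrations) + atomic_load(&bailouts);
}
```

Every thread either does the work or bails without blocking, which is what prevents the pile-up of extra page references that aborts folio_migrate_mapping().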
On 2024-09-24 12:42, Matthew Brost wrote: > On Tue, Sep 24, 2024 at 01:48:29PM +0200, Simona Vetter wrote: >> On Fri, Sep 20, 2024 at 09:59:51PM +0000, Matthew Brost wrote: >>> On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: >>>> >>>> On 2024-09-20 17:23, Matthew Brost wrote: >>>>> On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: >>>>>> On 2024-09-18 11:10, Alistair Popple wrote: >>>>>>> Matthew Brost <matthew.brost@intel.com> writes: >>>>>>> >>>>>>>> On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: >>>>>>>>> Matthew Brost <matthew.brost@intel.com> writes: >>>>>>>>> >>>>>>>>> I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) >>>>>>>> Still a driver. >>>>>>> Indeed, and I'm happy to answer any questions about our implementation. >>>>>>> >>>>>>>>> but theoretically it seems like it should be possible. However we >>>>>>>>> serialize migrations of the same virtual address range to avoid these >>>>>>>>> kinds of issues as they can happen the other way too (ie. multiple >>>>>>>>> threads trying to migrate to GPU). >>>>>>>>> >>>>>>>>> So I suspect what happens in UVM is that one thread wins and installs >>>>>>>>> the migration entry while the others fail to get the driver migration >>>>>>>>> lock and bail out sufficiently early in the fault path to avoid the >>>>>>>>> live-lock. >>>>>>>>> >>>>>>>> I had to try hard to show this; I doubt an actual user could trigger it. >>>>>>>> >>>>>>>> I wrote a test which kicked off 8 threads, each thread did a pthread join, >>>>>>>> and then tried to read the same page. This repeats in a loop for around 512 >>>>>>>> pages. I needed an exclusive lock in the migrate_to_ram vfunc >>>>>>>> for it to livelock. Without an exclusive lock I think on average I saw >>>>>>>> about 32k retries (i.e. migrate_to_ram calls on the same page) before a >>>>>>>> thread won this race.
>>>>>>>> >>>>>>>> From reading UVM, I'm pretty sure if you tried hard enough you could trigger >>>>>>>> a livelock given it appears you take exclusive locks in migrate_to_ram. >>>>>>> Yes, I suspect you're correct here and that we just haven't tried hard >>>>>>> enough to trigger it. >>>>>>> >>>>>>>>>> Cc: Philip Yang <Philip.Yang@amd.com> >>>>>>>>>> Cc: Felix Kuehling <felix.kuehling@amd.com> >>>>>>>>>> Cc: Christian König <christian.koenig@amd.com> >>>>>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>>>>>>> Suggested-by: Simona Vetter <simona.vetter@ffwll.ch> >>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>>>>>> --- >>>>>>>>>> mm/memory.c | 13 +++++++--- >>>>>>>>>> mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- >>>>>>>>>> 2 files changed, 50 insertions(+), 23 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>>>>>> index 3c01d68065be..bbd97d16a96a 100644 >>>>>>>>>> --- a/mm/memory.c >>>>>>>>>> +++ b/mm/memory.c >>>>>>>>>> @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>>>>>> * Get a page reference while we know the page can't be >>>>>>>>>> * freed. >>>>>>>>>> */ >>>>>>>>>> - get_page(vmf->page); >>>>>>>>>> - pte_unmap_unlock(vmf->pte, vmf->ptl); >>>>>>>>>> - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >>>>>>>>>> - put_page(vmf->page); >>>>>>>>>> + if (trylock_page(vmf->page)) { >>>>>>>>>> + get_page(vmf->page); >>>>>>>>>> + pte_unmap_unlock(vmf->pte, vmf->ptl); >>>>>>>>> This is all beginning to look a lot like migrate_vma_collect_pmd(). So >>>>>>>>> rather than do this and then have to pass all this context >>>>>>>>> (ie. fault_page) down to the migrate_vma_* functions could we instead >>>>>>>>> just do what migrate_vma_collect_pmd() does here? Ie. we already have >>>>>>>>> the PTL and the page lock so there's no reason we couldn't just set up >>>>>>>>> the migration entry prior to calling migrate_to_ram().
>>>>>>>>> >>>>>>>>> Obviously calling migrate_vma_setup() would show the page as not >>>>>>>>> migrating, but drivers could easily just fill in the src_pfn info after >>>>>>>>> calling migrate_vma_setup(). >>>>>>>>> >>>>>>>>> This would eliminate the whole fault_page ugliness. >>>>>>>>> >>>>>>>> This seems like it would work and I agree it would likely be cleaner. Let me >>>>>>>> play around with this and see what I come up with. Multi-tasking a bit >>>>>>>> so expect a bit of delay here. >>>>>>>> >>>>>>>> Thanks for the input, >>>>>>>> Matt >>>>>> Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, >>>>>> Christian and myself in the loop with future patches in this area. >>>>>> >>>>> Will do. Already have another local patch set which helps drivers dma >>>>> map 2M pages for migrations if SRAM is physically contiguous. Seems >>>>> helpful for performance on Intel hardware. Probably post that soon for >>>>> early feedback. >>>> >>>> OK. >>>> >>>> >>>>> >>>>> Longer term I'm thinking 2M migration entries, 2M device pages, and being >>>>> able to install 2M THP on VRAM -> SRAM could be really helpful. I'm >>>>> finding migrate_vma_* functions take up like 80-90% of the time in the >>>>> CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem >>>>> ideal. Seems like 2M entries for everything would really help here. No >>>>> idea how feasible this is as the core MM stuff gets confusing fast. Any >>>>> input on this idea? >>>> >>>> I agree with your observations. We found that the migrate_vma_* code was the >>>> bottleneck for migration performance as well, and not breaking 2M pages >>>> could reduce that overhead a lot. I don't have any specific ideas. I'm not >>>> familiar with the details of that code myself. Philip has looked at this >>>> (and some old NVidia patches from a few years ago) in the past but never had >>>> enough uninterrupted time to make it past prototyping. >>>> >>> >>> Cool, good to know this isn't some crazy idea.
Doubt it'll happen anytime >>> soon as I need to get a working baseline in before anything else, then start >>> applying optimizations and help get other features in to complete the >>> design. But eventually will probably try to look at this. May ping >>> Philip and Nvidia when I have time to dig in here. >> >> I think the big step will be moving hmm.c and migrate.c apis over from >> struct page to folios. That should also give us some nice benefits on the >> gpu side, since instead of 4k pages to track we could allocate 2m gpu >> pages. >> > > I was thinking of just encoding the order in the migration PFN like HMM > does. Really only every Nth-order entry in the page array needs to be > populated then - HMM populates every entry though, which doesn't seem > like it is required. Maybe having a folio API makes more sense? > >> Once we have folios at the driver/core mm api level doing all the fancy >> thp stuff should be at least a well-contained problem. But I might be >> delusionally optimistic here :-) > > I think it's contained in the sense that the DRM SVM layer just allocates a > THP or large contiguous device memory and hands it off to the migrate layer, > and that layer does the right thing. The 'right thing' here I believe is > a decent amount of core MM work though. Whether or not you allocate contiguous device memory doesn't need to depend on the page size in system memory. You can migrate 2MB of scattered system memory into a contiguous 2MB block of VRAM. The AMD driver does this already. The problem (as I see it) is all the extra book-keeping with 4KB granularity in the migrate_vma_* helpers when the source data is actually contiguous.
Regards, Felix > > Matt > >> -Sima >> >>> >>> Matt >>> >>>> Regards, >>>> Felix >>>> >>>> >>>>> >>>>> Matt >>>>> >>>>>> Regards, >>>>>> Felix >>>>>> >>>>>> >>>>>>>>>> + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >>>>>>>>>> + put_page(vmf->page); >>>>>>>>>> + unlock_page(vmf->page); >>>>>>>>>> + } else { >>>>>>>>>> + pte_unmap_unlock(vmf->pte, vmf->ptl); >>>>>>>>>> + } >>>>>>>>>> } else if (is_hwpoison_entry(entry)) { >>>>>>>>>> ret = VM_FAULT_HWPOISON; >>>>>>>>>> } else if (is_pte_marker_entry(entry)) { >>>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >>>>>>>>>> index 6d66dc1c6ffa..049893a5a179 100644 >>>>>>>>>> --- a/mm/migrate_device.c >>>>>>>>>> +++ b/mm/migrate_device.c >>>>>>>>>> @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>>>>>>> struct mm_walk *walk) >>>>>>>>>> { >>>>>>>>>> struct migrate_vma *migrate = walk->private; >>>>>>>>>> + struct folio *fault_folio = migrate->fault_page ? >>>>>>>>>> + page_folio(migrate->fault_page) : NULL; >>>>>>>>>> struct vm_area_struct *vma = walk->vma; >>>>>>>>>> struct mm_struct *mm = vma->vm_mm; >>>>>>>>>> unsigned long addr = start, unmapped = 0; >>>>>>>>>> @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>>>>>>> folio_get(folio); >>>>>>>>>> spin_unlock(ptl); >>>>>>>>>> - if (unlikely(!folio_trylock(folio))) >>>>>>>>>> + if (unlikely(fault_folio != folio && >>>>>>>>>> + !folio_trylock(folio))) >>>>>>>>>> return migrate_vma_collect_skip(start, end, >>>>>>>>>> walk); >>>>>>>>>> ret = split_folio(folio); >>>>>>>>>> - folio_unlock(folio); >>>>>>>>>> + if (fault_folio != folio) >>>>>>>>>> + folio_unlock(folio); >>>>>>>>>> folio_put(folio); >>>>>>>>>> if (ret) >>>>>>>>>> return migrate_vma_collect_skip(start, end, >>>>>>>>>> @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>>>>>>> * optimisation to avoid walking the rmap later with >>>>>>>>>> * try_to_migrate(). 
>>>>>>>>>> */ >>>>>>>>>> - if (folio_trylock(folio)) { >>>>>>>>>> + if (fault_folio == folio || folio_trylock(folio)) { >>>>>>>>>> bool anon_exclusive; >>>>>>>>>> pte_t swp_pte; >>>>>>>>>> @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>>>>>>> if (folio_try_share_anon_rmap_pte(folio, page)) { >>>>>>>>>> set_pte_at(mm, addr, ptep, pte); >>>>>>>>>> - folio_unlock(folio); >>>>>>>>>> + if (fault_folio != folio) >>>>>>>>>> + folio_unlock(folio); >>>>>>>>>> folio_put(folio); >>>>>>>>>> mpfn = 0; >>>>>>>>>> goto next; >>>>>>>>>> @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, >>>>>>>>>> unsigned long npages, >>>>>>>>>> struct page *fault_page) >>>>>>>>>> { >>>>>>>>>> + struct folio *fault_folio = fault_page ? >>>>>>>>>> + page_folio(fault_page) : NULL; >>>>>>>>>> unsigned long i, restore = 0; >>>>>>>>>> bool allow_drain = true; >>>>>>>>>> unsigned long unmapped = 0; >>>>>>>>>> @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, >>>>>>>>>> remove_migration_ptes(folio, folio, false); >>>>>>>>>> src_pfns[i] = 0; >>>>>>>>>> - folio_unlock(folio); >>>>>>>>>> + if (fault_folio != folio) >>>>>>>>>> + folio_unlock(folio); >>>>>>>>>> folio_put(folio); >>>>>>>>>> restore--; >>>>>>>>>> } >>>>>>>>>> @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) >>>>>>>>>> return -EINVAL; >>>>>>>>>> if (args->fault_page && !is_device_private_page(args->fault_page)) >>>>>>>>>> return -EINVAL; >>>>>>>>>> + if (args->fault_page && !PageLocked(args->fault_page)) >>>>>>>>>> + return -EINVAL; >>>>>>>>>> memset(args->src, 0, sizeof(*args->src) * nr_pages); >>>>>>>>>> args->cpages = 0; >>>>>>>>>> @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) >>>>>>>>>> } >>>>>>>>>> EXPORT_SYMBOL(migrate_vma_pages); >>>>>>>>>> -/* >>>>>>>>>> - * migrate_device_finalize() - complete page migration >>>>>>>>>> - * @src_pfns: src_pfns returned from migrate_device_range() >>>>>>>>>> - * 
@dst_pfns: array of pfns allocated by the driver to migrate memory to >>>>>>>>>> - * @npages: number of pages in the range >>>>>>>>>> - * >>>>>>>>>> - * Completes migration of the page by removing special migration entries. >>>>>>>>>> - * Drivers must ensure copying of page data is complete and visible to the CPU >>>>>>>>>> - * before calling this. >>>>>>>>>> - */ >>>>>>>>>> -void migrate_device_finalize(unsigned long *src_pfns, >>>>>>>>>> - unsigned long *dst_pfns, unsigned long npages) >>>>>>>>>> +static void __migrate_device_finalize(unsigned long *src_pfns, >>>>>>>>>> + unsigned long *dst_pfns, >>>>>>>>>> + unsigned long npages, >>>>>>>>>> + struct page *fault_page) >>>>>>>>>> { >>>>>>>>>> + struct folio *fault_folio = fault_page ? >>>>>>>>>> + page_folio(fault_page) : NULL; >>>>>>>>>> unsigned long i; >>>>>>>>>> for (i = 0; i < npages; i++) { >>>>>>>>>> @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, >>>>>>>>>> src = page_folio(page); >>>>>>>>>> dst = page_folio(newpage); >>>>>>>>>> remove_migration_ptes(src, dst, false); >>>>>>>>>> - folio_unlock(src); >>>>>>>>>> + if (fault_folio != src) >>>>>>>>>> + folio_unlock(src); >>>>>>>>>> if (is_zone_device_page(page)) >>>>>>>>>> put_page(page); >>>>>>>>>> @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> + >>>>>>>>>> +/* >>>>>>>>>> + * migrate_device_finalize() - complete page migration >>>>>>>>>> + * @src_pfns: src_pfns returned from migrate_device_range() >>>>>>>>>> + * @dst_pfns: array of pfns allocated by the driver to migrate memory to >>>>>>>>>> + * @npages: number of pages in the range >>>>>>>>>> + * >>>>>>>>>> + * Completes migration of the page by removing special migration entries. >>>>>>>>>> + * Drivers must ensure copying of page data is complete and visible to the CPU >>>>>>>>>> + * before calling this. 
>>>>>>>>>> + */ >>>>>>>>>> +void migrate_device_finalize(unsigned long *src_pfns, >>>>>>>>>> + unsigned long *dst_pfns, unsigned long npages) >>>>>>>>>> +{ >>>>>>>>>> + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); >>>>>>>>>> +} >>>>>>>>>> EXPORT_SYMBOL(migrate_device_finalize); >>>>>>>>>> /** >>>>>>>>>> @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); >>>>>>>>>> */ >>>>>>>>>> void migrate_vma_finalize(struct migrate_vma *migrate) >>>>>>>>>> { >>>>>>>>>> - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); >>>>>>>>>> + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, >>>>>>>>>> + migrate->fault_page); >>>>>>>>>> } >>>>>>>>>> EXPORT_SYMBOL(migrate_vma_finalize); >> >> -- >> Simona Vetter >> Software Engineer, Intel Corporation >> http://blog.ffwll.ch
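Felix's point above — a contiguous 2MB VRAM block can be filled from 512 scattered 4KB system pages — boils down to a gather copy. A minimal user-space sketch follows; `memcpy` stands in for the device copy engine, and the function and parameter names are made up for illustration, not taken from any driver:

```c
#include <string.h>
#include <stddef.h>

#define PAGE_SZ      4096
#define PAGES_PER_2M 512

/* Gather 512 scattered 4KB source pages into one contiguous 2MB
 * destination buffer. A driver would do this with its DMA engine;
 * plain memcpy stands in for that here. Returns -1 on a hole in
 * the source range, 0 on success. */
int gather_to_contiguous(char **src_pages, char *dst_2m)
{
	for (int i = 0; i < PAGES_PER_2M; i++) {
		if (!src_pages[i])
			return -1;      /* hole in the source range: bail */
		memcpy(dst_2m + (size_t)i * PAGE_SZ, src_pages[i], PAGE_SZ);
	}
	return 0;
}
```

The bookkeeping Felix complains about is exactly the per-4KB iteration here; with 2M entries the loop would collapse to a single copy.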
On Tue, Sep 24, 2024 at 04:42:19PM +0000, Matthew Brost wrote: > On Tue, Sep 24, 2024 at 01:48:29PM +0200, Simona Vetter wrote: > > On Fri, Sep 20, 2024 at 09:59:51PM +0000, Matthew Brost wrote: > > > On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: > > > > > > > > On 2024-09-20 17:23, Matthew Brost wrote: > > > > > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: > > > > > > On 2024-09-18 11:10, Alistair Popple wrote: > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > > > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: > > > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > > > > > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) > > > > > > > > Still a driver. > > > > > > > Indeed, and I'm happy to answer any questions about our implementation. > > > > > > > > > > > > > > > > but theoretically it seems like it should be possible. However we > > > > > > > > > serialize migrations of the same virtual address range to avoid these > > > > > > > > > kind of issues as they can happen the other way too (ie. multiple > > > > > > > > > threads trying to migrate to GPU). > > > > > > > > > > > > > > > > > > So I suspect what happens in UVM is that one thread wins and installs > > > > > > > > > the migration entry while the others fail to get the driver migration > > > > > > > > > lock and bail out sufficiently early in the fault path to avoid the > > > > > > > > > live-lock. > > > > > > > > > > > > > > > > > I had to try hard to show this, doubt an actual user could trigger this. > > > > > > > > > > > > > > > > I wrote a test which kicked 8 threads, each thread did a pthread join, > > > > > > > > and then tried to read the same page. This repeats in loop for like 512 > > > > > > > > pages or something. I needed an exclusive lock in migrate_to_ram vfunc > > > > > > > > for it to livelock. 
Without an exclusive lock I think on average I saw > > > > > > > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a > > > > > > > > thread won this race. > > > > > > > > > > > > > > > > From reading UVM, I'm pretty sure if you tried hard enough you could trigger > > > > > > > > a livelock given it appears you take exclusive locks in migrate_to_ram. > > > > > > > Yes, I suspect you're correct here and that we just haven't tried hard > > > > > > > enough to trigger it. > > > > > > > > > > > > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> > > > > > > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> > > > > > > > > > > Cc: Christian König <christian.koenig@amd.com> > > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > > > Suggested-by: Simona Vetter <simona.vetter@ffwll.ch> > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > --- > > > > > > > > > > mm/memory.c | 13 +++++++--- > > > > > > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > > > > > > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) > > > > > > > > > > > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > > > > index 3c01d68065be..bbd97d16a96a 100644 > > > > > > > > > > --- a/mm/memory.c > > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > > * Get a page reference while we know the page can't be > > > > > > > > > > * freed.
> > > > > > > > > > */ > > > > > > > > > > - get_page(vmf->page); > > > > > > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > > > > - put_page(vmf->page); > > > > > > > > > > + if (trylock_page(vmf->page)) { > > > > > > > > > > + get_page(vmf->page); > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > > > > > > > > > rather than do this and then have to pass all this context > > > > > > > > > (ie. fault_page) down to the migrate_vma_* functions could we instead > > > > > > > > > just do what migrate_vma_collect_pmd() does here? Ie. we already have > > > > > > > > > the PTL and the page lock so there's no reason we couldn't just set up > > > > > > > > > the migration entry prior to calling migrate_to_ram(). > > > > > > > > > > > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not > > > > > > > > > migrating, but drivers could easily just fill in the src_pfn info after > > > > > > > > > calling migrate_vma_setup(). > > > > > > > > > > > > > > > > > > This would eliminate the whole fault_page ugliness. > > > > > > > > > > > > > > > > > This seems like it would work and I agree it would likely be cleaner. Let me > > > > > > > > play around with this and see what I come up with. Multi-tasking a bit > > > > > > > > so expect a bit of delay here. > > > > > > > > > > > > > > > > Thanks for the input, > > > > > > > > Matt > > > > > > Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, > > > > > > Christian and myself in the loop with future patches in this area. > > > > > > > > > > > Will do. Already have another local patch set which helps drivers dma > > > > > map 2M pages for migrations if SRAM is physically contiguous. Seems > > > > > helpful for performance on Intel hardware. Probably post that soon for > > > > > early feedback. > > > > > > > > OK.
> > > > > > > > > > > > > > > > > > > > Longer term I'm thinking 2M migration entries, 2M device pages, and being > > > > > able to install 2M THP on VRAM -> SRAM could be really helpful. I'm > > > > > finding migrate_vma_* functions take up like 80-90% of the time in the > > > > > CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem > > > > > ideal. Seems like 2M entries for everything would really help here. No > > > > > idea how feasible this is as the core MM stuff gets confusing fast. Any > > > > > input on this idea? > > > > > > > > I agree with your observations. We found that the migrate_vma_* code was the > > > > bottleneck for migration performance as well, and not breaking 2M pages > > > > could reduce that overhead a lot. I don't have any specific ideas. I'm not > > > > familiar with the details of that code myself. Philip has looked at this > > > > (and some old NVidia patches from a few years ago) in the past but never had > > > > enough uninterrupted time to make it past prototyping. > > > > > > > > > > Cool, good to know this isn't some crazy idea. Doubt it'll happen anytime > > > soon as I need to get a working baseline in before anything else, then start > > > applying optimizations and help get other features in to complete the > > > design. But eventually will probably try to look at this. May ping > > > Philip and Nvidia when I have time to dig in here. > > > > I think the big step will be moving hmm.c and migrate.c apis over from > > struct page to folios. That should also give us some nice benefits on the > > gpu side, since instead of 4k pages to track we could allocate 2m gpu > > pages. > > > > I was thinking of just encoding the order in the migration PFN like HMM > does. Really only every Nth-order entry in the page array needs to be > populated then - HMM populates every entry though, which doesn't seem > like it is required. Maybe having a folio API makes more sense? Both I'd say, as a first attempt at least.
An array of folios, but only populate the ones we need and jump over empty entries. A bit wasteful, but eh it's just allocations. > > Once we have folios at the driver/core mm api level doing all the fancy > > thp stuff should be at least a well-contained problem. But I might be > > delusionally optimistic here :-) > > I think it's contained in the sense that the DRM SVM layer just allocates a > THP or large contiguous device memory and hands it off to the migrate layer, > and that layer does the right thing. The 'right thing' here I believe is > a decent amount of core MM work though. Yeah that's what I meant, once we have switched the interfaces to be arrays of folios, where for larger folios we leave the entries in between NULL and have some appropriate iterators, then the driver side is done mostly. The core mm side to actually make use of that will be fairly gnarly though. -Sima > > Matt > > > -Sima > > > > > > > > Matt > > > > > > > Regards, > > > > Felix > > > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > Regards, > > > > > > Felix > > > > > > > > > > > > > > > > > > > > > > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > > > > + put_page(vmf->page); > > > > > > > > > > + unlock_page(vmf->page); > > > > > > > > > > + } else { > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > > + } > > > > > > > > > > } else if (is_hwpoison_entry(entry)) { > > > > > > > > > > ret = VM_FAULT_HWPOISON; > > > > > > > > > > } else if (is_pte_marker_entry(entry)) { > > > > > > > > > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > > > > > > > > > > index 6d66dc1c6ffa..049893a5a179 100644 > > > > > > > > > > --- a/mm/migrate_device.c > > > > > > > > > > +++ b/mm/migrate_device.c > > > > > > > > > > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > struct mm_walk *walk) > > > > > > > > > > { > > > > > > > > > > struct migrate_vma *migrate = walk->private; > > > > > > > > > > + struct folio
*fault_folio = migrate->fault_page ? > > > > > > > > > > + page_folio(migrate->fault_page) : NULL; > > > > > > > > > > struct vm_area_struct *vma = walk->vma; > > > > > > > > > > struct mm_struct *mm = vma->vm_mm; > > > > > > > > > > unsigned long addr = start, unmapped = 0; > > > > > > > > > > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > folio_get(folio); > > > > > > > > > > spin_unlock(ptl); > > > > > > > > > > - if (unlikely(!folio_trylock(folio))) > > > > > > > > > > + if (unlikely(fault_folio != folio && > > > > > > > > > > + !folio_trylock(folio))) > > > > > > > > > > return migrate_vma_collect_skip(start, end, > > > > > > > > > > walk); > > > > > > > > > > ret = split_folio(folio); > > > > > > > > > > - folio_unlock(folio); > > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > > + folio_unlock(folio); > > > > > > > > > > folio_put(folio); > > > > > > > > > > if (ret) > > > > > > > > > > return migrate_vma_collect_skip(start, end, > > > > > > > > > > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > * optimisation to avoid walking the rmap later with > > > > > > > > > > * try_to_migrate(). 
> > > > > > > > > > */ > > > > > > > > > > - if (folio_trylock(folio)) { > > > > > > > > > > + if (fault_folio == folio || folio_trylock(folio)) { > > > > > > > > > > bool anon_exclusive; > > > > > > > > > > pte_t swp_pte; > > > > > > > > > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > if (folio_try_share_anon_rmap_pte(folio, page)) { > > > > > > > > > > set_pte_at(mm, addr, ptep, pte); > > > > > > > > > > - folio_unlock(folio); > > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > > + folio_unlock(folio); > > > > > > > > > > folio_put(folio); > > > > > > > > > > mpfn = 0; > > > > > > > > > > goto next; > > > > > > > > > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > > > > > > > > > unsigned long npages, > > > > > > > > > > struct page *fault_page) > > > > > > > > > > { > > > > > > > > > > + struct folio *fault_folio = fault_page ? > > > > > > > > > > + page_folio(fault_page) : NULL; > > > > > > > > > > unsigned long i, restore = 0; > > > > > > > > > > bool allow_drain = true; > > > > > > > > > > unsigned long unmapped = 0; > > > > > > > > > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > > > > > > > > > remove_migration_ptes(folio, folio, false); > > > > > > > > > > src_pfns[i] = 0; > > > > > > > > > > - folio_unlock(folio); > > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > > + folio_unlock(folio); > > > > > > > > > > folio_put(folio); > > > > > > > > > > restore--; > > > > > > > > > > } > > > > > > > > > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) > > > > > > > > > > return -EINVAL; > > > > > > > > > > if (args->fault_page && !is_device_private_page(args->fault_page)) > > > > > > > > > > return -EINVAL; > > > > > > > > > > + if (args->fault_page && !PageLocked(args->fault_page)) > > > > > > > > > > + return -EINVAL; > > > > > > > > > > memset(args->src, 0, sizeof(*args->src) * 
nr_pages); > > > > > > > > > > args->cpages = 0; > > > > > > > > > > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) > > > > > > > > > > } > > > > > > > > > > EXPORT_SYMBOL(migrate_vma_pages); > > > > > > > > > > -/* > > > > > > > > > > - * migrate_device_finalize() - complete page migration > > > > > > > > > > - * @src_pfns: src_pfns returned from migrate_device_range() > > > > > > > > > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > > > > > > > > > - * @npages: number of pages in the range > > > > > > > > > > - * > > > > > > > > > > - * Completes migration of the page by removing special migration entries. > > > > > > > > > > - * Drivers must ensure copying of page data is complete and visible to the CPU > > > > > > > > > > - * before calling this. > > > > > > > > > > - */ > > > > > > > > > > -void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > - unsigned long *dst_pfns, unsigned long npages) > > > > > > > > > > +static void __migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > + unsigned long *dst_pfns, > > > > > > > > > > + unsigned long npages, > > > > > > > > > > + struct page *fault_page) > > > > > > > > > > { > > > > > > > > > > + struct folio *fault_folio = fault_page ? 
> > > > > > > > > > + page_folio(fault_page) : NULL; > > > > > > > > > > unsigned long i; > > > > > > > > > > for (i = 0; i < npages; i++) { > > > > > > > > > > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > src = page_folio(page); > > > > > > > > > > dst = page_folio(newpage); > > > > > > > > > > remove_migration_ptes(src, dst, false); > > > > > > > > > > - folio_unlock(src); > > > > > > > > > > + if (fault_folio != src) > > > > > > > > > > + folio_unlock(src); > > > > > > > > > > if (is_zone_device_page(page)) > > > > > > > > > > put_page(page); > > > > > > > > > > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > + > > > > > > > > > > +/* > > > > > > > > > > + * migrate_device_finalize() - complete page migration > > > > > > > > > > + * @src_pfns: src_pfns returned from migrate_device_range() > > > > > > > > > > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > > > > > > > > > + * @npages: number of pages in the range > > > > > > > > > > + * > > > > > > > > > > + * Completes migration of the page by removing special migration entries. > > > > > > > > > > + * Drivers must ensure copying of page data is complete and visible to the CPU > > > > > > > > > > + * before calling this. 
> > > > > > > > > > + */ > > > > > > > > > > +void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > + unsigned long *dst_pfns, unsigned long npages) > > > > > > > > > > +{ > > > > > > > > > > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); > > > > > > > > > > +} > > > > > > > > > > EXPORT_SYMBOL(migrate_device_finalize); > > > > > > > > > > /** > > > > > > > > > > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); > > > > > > > > > > */ > > > > > > > > > > void migrate_vma_finalize(struct migrate_vma *migrate) > > > > > > > > > > { > > > > > > > > > > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); > > > > > > > > > > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, > > > > > > > > > > + migrate->fault_page); > > > > > > > > > > } > > > > > > > > > > EXPORT_SYMBOL(migrate_vma_finalize); > > > > -- > > Simona Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch
On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: > > Matthew Brost <matthew.brost@intel.com> writes: > > > Avoid multiple CPU page faults to the same device page racing by locking > > the page in do_swap_page before taking an additional reference to the > > page. This prevents scenarios where multiple CPU page faults each take > > an extra reference to a device page, which could abort migration in > > folio_migrate_mapping. With the device page locked in do_swap_page, the > > migrate_vma_* functions need to be updated to avoid locking the > > fault_page argument. > > I added the get_page() and therefore the fault_page argument to deal > with another lifetime issue. Neither Felix nor I particularly liked the > solution though (see > https://lore.kernel.org/all/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com/T/#m715589d57716448386ff9af41da27a952d94615a) > and this seems to make it even uglier, so I have suggested an > alternative solution below. Just an aside, I don't find the fault_page special handling all that offensive, because fundamentally fault_page _is_ special: It's the one page we _really_ have to migrate or userspace blows up because the kernel sucks. But maybe we should stuff it into the vmf struct instead and pass that around? I think that would be conceptually a notch cleaner. -Sima > > > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU > > DRM driver) SVM implementation if enough threads faulted the same device > > page. > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) > but theoretically it seems like it should be possible. However we > serialize migrations of the same virtual address range to avoid these > kind of issues as they can happen the other way too (ie. multiple > threads trying to migrate to GPU). 
> > So I suspect what happens in UVM is that one thread wins and installs > the migration entry while the others fail to get the driver migration > lock and bail out sufficiently early in the fault path to avoid the > live-lock. > > > Cc: Philip Yang <Philip.Yang@amd.com> > > Cc: Felix Kuehling <felix.kuehling@amd.com> > > Cc: Christian König <christian.koenig@amd.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > --- > > mm/memory.c | 13 +++++++--- > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > > 2 files changed, 50 insertions(+), 23 deletions(-) > > > > diff --git a/mm/memory.c b/mm/memory.c > > index 3c01d68065be..bbd97d16a96a 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > * Get a page reference while we know the page can't be > > * freed. > > */ > > - get_page(vmf->page); > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > - put_page(vmf->page); > > + if (trylock_page(vmf->page)) { > > + get_page(vmf->page); > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > rather than do this and then have to pass all this context > (ie. fault_page) down to the migrate_vma_* functions could we instead > just do what migrate_vma_collect_pmd() does here? Ie. we already have > the PTL and the page lock so there's no reason we couldn't just setup > the migration entry prior to calling migrate_to_ram(). > > Obviously calling migrate_vma_setup() would show the page as not > migrating, but drivers could easily just fill in the src_pfn info after > calling migrate_vma_setup(). > > This would eliminate the whole fault_page ugliness. 
> > > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > + put_page(vmf->page); > > + unlock_page(vmf->page); > > + } else { > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > + } > > } else if (is_hwpoison_entry(entry)) { > > ret = VM_FAULT_HWPOISON; > > } else if (is_pte_marker_entry(entry)) { > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > > index 6d66dc1c6ffa..049893a5a179 100644 > > --- a/mm/migrate_device.c > > +++ b/mm/migrate_device.c > > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > struct mm_walk *walk) > > { > > struct migrate_vma *migrate = walk->private; > > + struct folio *fault_folio = migrate->fault_page ? > > + page_folio(migrate->fault_page) : NULL; > > struct vm_area_struct *vma = walk->vma; > > struct mm_struct *mm = vma->vm_mm; > > unsigned long addr = start, unmapped = 0; > > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > folio_get(folio); > > spin_unlock(ptl); > > - if (unlikely(!folio_trylock(folio))) > > + if (unlikely(fault_folio != folio && > > + !folio_trylock(folio))) > > return migrate_vma_collect_skip(start, end, > > walk); > > ret = split_folio(folio); > > - folio_unlock(folio); > > + if (fault_folio != folio) > > + folio_unlock(folio); > > folio_put(folio); > > if (ret) > > return migrate_vma_collect_skip(start, end, > > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > * optimisation to avoid walking the rmap later with > > * try_to_migrate(). 
> > */ > > - if (folio_trylock(folio)) { > > + if (fault_folio == folio || folio_trylock(folio)) { > > bool anon_exclusive; > > pte_t swp_pte; > > > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > if (folio_try_share_anon_rmap_pte(folio, page)) { > > set_pte_at(mm, addr, ptep, pte); > > - folio_unlock(folio); > > + if (fault_folio != folio) > > + folio_unlock(folio); > > folio_put(folio); > > mpfn = 0; > > goto next; > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > unsigned long npages, > > struct page *fault_page) > > { > > + struct folio *fault_folio = fault_page ? > > + page_folio(fault_page) : NULL; > > unsigned long i, restore = 0; > > bool allow_drain = true; > > unsigned long unmapped = 0; > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > remove_migration_ptes(folio, folio, false); > > > > src_pfns[i] = 0; > > - folio_unlock(folio); > > + if (fault_folio != folio) > > + folio_unlock(folio); > > folio_put(folio); > > restore--; > > } > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) > > return -EINVAL; > > if (args->fault_page && !is_device_private_page(args->fault_page)) > > return -EINVAL; > > + if (args->fault_page && !PageLocked(args->fault_page)) > > + return -EINVAL; > > > > memset(args->src, 0, sizeof(*args->src) * nr_pages); > > args->cpages = 0; > > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) > > } > > EXPORT_SYMBOL(migrate_vma_pages); > > > > -/* > > - * migrate_device_finalize() - complete page migration > > - * @src_pfns: src_pfns returned from migrate_device_range() > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > - * @npages: number of pages in the range > > - * > > - * Completes migration of the page by removing special migration entries. 
> > - * Drivers must ensure copying of page data is complete and visible to the CPU > > - * before calling this. > > - */ > > -void migrate_device_finalize(unsigned long *src_pfns, > > - unsigned long *dst_pfns, unsigned long npages) > > +static void __migrate_device_finalize(unsigned long *src_pfns, > > + unsigned long *dst_pfns, > > + unsigned long npages, > > + struct page *fault_page) > > { > > + struct folio *fault_folio = fault_page ? > > + page_folio(fault_page) : NULL; > > unsigned long i; > > > > for (i = 0; i < npages; i++) { > > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, > > src = page_folio(page); > > dst = page_folio(newpage); > > remove_migration_ptes(src, dst, false); > > - folio_unlock(src); > > + if (fault_folio != src) > > + folio_unlock(src); > > > > if (is_zone_device_page(page)) > > put_page(page); > > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, > > } > > } > > } > > + > > +/* > > + * migrate_device_finalize() - complete page migration > > + * @src_pfns: src_pfns returned from migrate_device_range() > > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > + * @npages: number of pages in the range > > + * > > + * Completes migration of the page by removing special migration entries. > > + * Drivers must ensure copying of page data is complete and visible to the CPU > > + * before calling this. 
> > + */ > > +void migrate_device_finalize(unsigned long *src_pfns, > > + unsigned long *dst_pfns, unsigned long npages) > > +{ > > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); > > +} > > EXPORT_SYMBOL(migrate_device_finalize); > > > > /** > > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); > > */ > > void migrate_vma_finalize(struct migrate_vma *migrate) > > { > > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); > > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, > > + migrate->fault_page); > > } > > EXPORT_SYMBOL(migrate_vma_finalize); >
On Wed, Sep 25, 2024 at 01:44:37PM +0200, Simona Vetter wrote: > On Tue, Sep 24, 2024 at 04:42:19PM +0000, Matthew Brost wrote: > > On Tue, Sep 24, 2024 at 01:48:29PM +0200, Simona Vetter wrote: > > > On Fri, Sep 20, 2024 at 09:59:51PM +0000, Matthew Brost wrote: > > > > On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: > > > > > > > > > > On 2024-09-20 17:23, Matthew Brost wrote: > > > > > > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: > > > > > > > On 2024-09-18 11:10, Alistair Popple wrote: > > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > > > > > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: > > > > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > > > > > > > > > > > > > > > > > > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) > > > > > > > > > Still a driver. > > > > > > > > Indeed, and I'm happy to answer any questions about our implementation. > > > > > > > > > > > > > > > > > > but theoretically it seems like it should be possible. However we > > > > > > > > > > serialize migrations of the same virtual address range to avoid these > > > > > > > > > > kind of issues as they can happen the other way too (ie. multiple > > > > > > > > > > threads trying to migrate to GPU). > > > > > > > > > > > > > > > > > > > > So I suspect what happens in UVM is that one thread wins and installs > > > > > > > > > > the migration entry while the others fail to get the driver migration > > > > > > > > > > lock and bail out sufficiently early in the fault path to avoid the > > > > > > > > > > live-lock. > > > > > > > > > > > > > > > > > > > I had to try hard to show this, doubt an actual user could trigger this. > > > > > > > > > > > > > > > > > > I wrote a test which kicked 8 threads, each thread did a pthread join, > > > > > > > > > and then tried to read the same page. This repeats in loop for like 512 > > > > > > > > > pages or something. 
I needed an exclusive lock in migrate_to_ram vfunc > > > > > > > > > for it to livelock. Without an exclusive lock I think on average I saw > > > > > > > > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a > > > > > > > > > thread won this race. > > > > > > > > > > > > > > > > > > From reading UVM, pretty sure if you tried hard enough you could trigger > > > > > > > > > a livelock given it appears you take excluvise locks in migrate_to_ram. > > > > > > > > Yes, I suspect you're correct here and that we just haven't tried hard > > > > > > > > enough to trigger it. > > > > > > > > > > > > > > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> > > > > > > > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> > > > > > > > > > > > Cc: Christian König <christian.koenig@amd.com> > > > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > > > > > > > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> > > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > --- > > > > > > > > > > > mm/memory.c | 13 +++++++--- > > > > > > > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > > > > > > > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > > > > > > > > index 3c01d68065be..bbd97d16a96a 100644 > > > > > > > > > > > --- a/mm/memory.c > > > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > > > * Get a page reference while we know the page can't be > > > > > > > > > > > * freed. 
> > > > > > > > > > > */ > > > > > > > > > > > - get_page(vmf->page); > > > > > > > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > > > > > - put_page(vmf->page); > > > > > > > > > > > + if (trylock_page(vmf->page)) { > > > > > > > > > > > + get_page(vmf->page); > > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > > > > > > > > > > rather than do this and then have to pass all this context > > > > > > > > > > (ie. fault_page) down to the migrate_vma_* functions could we instead > > > > > > > > > > just do what migrate_vma_collect_pmd() does here? Ie. we already have > > > > > > > > > > the PTL and the page lock so there's no reason we couldn't just setup > > > > > > > > > > the migration entry prior to calling migrate_to_ram(). > > > > > > > > > > > > > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not > > > > > > > > > > migrating, but drivers could easily just fill in the src_pfn info after > > > > > > > > > > calling migrate_vma_setup(). > > > > > > > > > > > > > > > > > > > > This would eliminate the whole fault_page ugliness. > > > > > > > > > > > > > > > > > > > This seems like it would work and agree it likely be cleaner. Let me > > > > > > > > > play around with this and see what I come up with. Multi-tasking a bit > > > > > > > > > so expect a bit of delay here. > > > > > > > > > > > > > > > > > > Thanks for the input, > > > > > > > > > Matt > > > > > > > Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, > > > > > > > Christian and myself in the loop with future patches in this area. > > > > > > > > > > > > > Will do. Already have another local patch set which helps drivers dma > > > > > > map 2M pages for migrations if SRAM is physically contiguous. Seems > > > > > > helpful for performance on Intel hardware. 
Probably post that soon for > > > > > > early feedack. > > > > > > > > > > OK. > > > > > > > > > > > > > > > > > > > > > > Longer term I thinking 2M migration entries, 2M device pages, and being > > > > > > able to install 2M THP on VRAM -> SRAM could be really helpful. I'm > > > > > > finding migrate_vma_* functions take up like 80-90% of the time in the > > > > > > CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem > > > > > > ideal. Seems like 2M entries for everything would really help here. No > > > > > > idea how feasible this is as the core MM stuff gets confusing fast. Any > > > > > > input on this idea? > > > > > > > > > > I agree with your observations. We found that the migrate_vma_* code was the > > > > > bottle neck for migration performance as well, and not breaking 2M pages > > > > > could reduce that overhead a lot. I don't have any specific ideas. I'm not > > > > > familiar with the details of that code myself. Philip has looked at this > > > > > (and some old NVidia patches from a few years ago) in the past but never had > > > > > enough uninterrupted time to make it past prototyping. > > > > > > > > > > > > > Cool good to know this isn't some crazy idea. Doubt it happen anytime > > > > soon as I need to get a working baseline in before anything then start > > > > applying optimizations and help in get other features to get the design > > > > complete. But eventually will probably try to look at this. May ping > > > > Philip and Nvidia when I have time to dig in here. > > > > > > I think the big step will be moving hmm.c and migrate.c apis over from > > > struct page to folios. That should also give us some nice benefits on the > > > gpu side, since instead of 4k pages to track we could allocate 2m gpu > > > pages. > > > > > > > I think was thinking just encode the order in the migration PFN like HMM > > does. 
Really only Nth order entry in the page array needs to be > > populated then - HMM populates every entry though which doesn't seem > > like that is required. Maybe having a folio API makes more sense? > > Both I'd say, as a first attempt at least. An array of folios, but only > populate the ones we need and jump over empty entries. A bit wasteful, but > eh it's just allocations. Ok thought some more, I think there's two things going on: - spot contig memory sections so that the gpu side is more efficient and can use bigger pagetables and stuff like that. this is what hmm_range_fault does. - optimize the core mm book-keeping by working on folios instead of individual pages. hmm_range_fault does not care because it doesn't grab references or lock pages or do anything else interesting with them, the entire synchronization is provided by mmu notifier retry loops. But the migration code does do a lot of expensive stuff, so it would need that. For hmm_range_fault it's probably not so important, so maybe we could leave the folio conversion of that to later. I think we need both, meaning: - switch to folio, leave out entries as NULL for compound folios - on top of compound folios still track contig ranges so that the gpu side can additionally benefit when e.g. 2M pages are split into smaller folios but happen to be contig Cheers, Sima > > > > Once we have folios at the driver/core mm api level doing all the fancy > > > thp stuff should be at least a well-contained problem. But I might be > > > dellusionally optimistic here :-) > > > > I think it contained in the sense is the DRM SVM layer just allocates a > > THP or large continous device memory and hands it off to migrate layer > > and that layer does the right thing. The 'right thing' here I believe is > > a decent amount of core MM work though. 
> > Yeah that's what I meant, once we have switched the interfaces to be > arrays of folios, where for larger folios we leave the entries in between > NULL and have some appropriate iterators, then the driver side is done > mostly. The core mm side to actually make use of that will be fairly > gnarly though. > -Sima > > > > > Matt > > > > > -Sima > > > > > > > > > > > Matt > > > > > > > > > Regards, > > > > > Felix > > > > > > > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > > Regards, > > > > > > > Felix > > > > > > > > > > > > > > > > > > > > > > > > > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > > > > > > > > > > > + put_page(vmf->page); > > > > > > > > > > > + unlock_page(vmf->page); > > > > > > > > > > > + } else { > > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > > > > > > + } > > > > > > > > > > > } else if (is_hwpoison_entry(entry)) { > > > > > > > > > > > ret = VM_FAULT_HWPOISON; > > > > > > > > > > > } else if (is_pte_marker_entry(entry)) { > > > > > > > > > > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > > > > > > > > > > > index 6d66dc1c6ffa..049893a5a179 100644 > > > > > > > > > > > --- a/mm/migrate_device.c > > > > > > > > > > > +++ b/mm/migrate_device.c > > > > > > > > > > > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > > struct mm_walk *walk) > > > > > > > > > > > { > > > > > > > > > > > struct migrate_vma *migrate = walk->private; > > > > > > > > > > > + struct folio *fault_folio = migrate->fault_page ? 
> > > > > > > > > > > + page_folio(migrate->fault_page) : NULL; > > > > > > > > > > > struct vm_area_struct *vma = walk->vma; > > > > > > > > > > > struct mm_struct *mm = vma->vm_mm; > > > > > > > > > > > unsigned long addr = start, unmapped = 0; > > > > > > > > > > > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > > folio_get(folio); > > > > > > > > > > > spin_unlock(ptl); > > > > > > > > > > > - if (unlikely(!folio_trylock(folio))) > > > > > > > > > > > + if (unlikely(fault_folio != folio && > > > > > > > > > > > + !folio_trylock(folio))) > > > > > > > > > > > return migrate_vma_collect_skip(start, end, > > > > > > > > > > > walk); > > > > > > > > > > > ret = split_folio(folio); > > > > > > > > > > > - folio_unlock(folio); > > > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > > > + folio_unlock(folio); > > > > > > > > > > > folio_put(folio); > > > > > > > > > > > if (ret) > > > > > > > > > > > return migrate_vma_collect_skip(start, end, > > > > > > > > > > > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > > * optimisation to avoid walking the rmap later with > > > > > > > > > > > * try_to_migrate(). 
> > > > > > > > > > > */ > > > > > > > > > > > - if (folio_trylock(folio)) { > > > > > > > > > > > + if (fault_folio == folio || folio_trylock(folio)) { > > > > > > > > > > > bool anon_exclusive; > > > > > > > > > > > pte_t swp_pte; > > > > > > > > > > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > > > > > > > > > > > if (folio_try_share_anon_rmap_pte(folio, page)) { > > > > > > > > > > > set_pte_at(mm, addr, ptep, pte); > > > > > > > > > > > - folio_unlock(folio); > > > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > > > + folio_unlock(folio); > > > > > > > > > > > folio_put(folio); > > > > > > > > > > > mpfn = 0; > > > > > > > > > > > goto next; > > > > > > > > > > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > > > > > > > > > > unsigned long npages, > > > > > > > > > > > struct page *fault_page) > > > > > > > > > > > { > > > > > > > > > > > + struct folio *fault_folio = fault_page ? > > > > > > > > > > > + page_folio(fault_page) : NULL; > > > > > > > > > > > unsigned long i, restore = 0; > > > > > > > > > > > bool allow_drain = true; > > > > > > > > > > > unsigned long unmapped = 0; > > > > > > > > > > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > > > > > > > > > > > remove_migration_ptes(folio, folio, false); > > > > > > > > > > > src_pfns[i] = 0; > > > > > > > > > > > - folio_unlock(folio); > > > > > > > > > > > + if (fault_folio != folio) > > > > > > > > > > > + folio_unlock(folio); > > > > > > > > > > > folio_put(folio); > > > > > > > > > > > restore--; > > > > > > > > > > > } > > > > > > > > > > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) > > > > > > > > > > > return -EINVAL; > > > > > > > > > > > if (args->fault_page && !is_device_private_page(args->fault_page)) > > > > > > > > > > > return -EINVAL; > > > > > > > > > > > + if (args->fault_page && !PageLocked(args->fault_page)) > > > > > > > > > > > + 
return -EINVAL; > > > > > > > > > > > memset(args->src, 0, sizeof(*args->src) * nr_pages); > > > > > > > > > > > args->cpages = 0; > > > > > > > > > > > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) > > > > > > > > > > > } > > > > > > > > > > > EXPORT_SYMBOL(migrate_vma_pages); > > > > > > > > > > > -/* > > > > > > > > > > > - * migrate_device_finalize() - complete page migration > > > > > > > > > > > - * @src_pfns: src_pfns returned from migrate_device_range() > > > > > > > > > > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > > > > > > > > > > - * @npages: number of pages in the range > > > > > > > > > > > - * > > > > > > > > > > > - * Completes migration of the page by removing special migration entries. > > > > > > > > > > > - * Drivers must ensure copying of page data is complete and visible to the CPU > > > > > > > > > > > - * before calling this. > > > > > > > > > > > - */ > > > > > > > > > > > -void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > > - unsigned long *dst_pfns, unsigned long npages) > > > > > > > > > > > +static void __migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > > + unsigned long *dst_pfns, > > > > > > > > > > > + unsigned long npages, > > > > > > > > > > > + struct page *fault_page) > > > > > > > > > > > { > > > > > > > > > > > + struct folio *fault_folio = fault_page ? 
> > > > > > > > > > > + page_folio(fault_page) : NULL; > > > > > > > > > > > unsigned long i; > > > > > > > > > > > for (i = 0; i < npages; i++) { > > > > > > > > > > > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > > src = page_folio(page); > > > > > > > > > > > dst = page_folio(newpage); > > > > > > > > > > > remove_migration_ptes(src, dst, false); > > > > > > > > > > > - folio_unlock(src); > > > > > > > > > > > + if (fault_folio != src) > > > > > > > > > > > + folio_unlock(src); > > > > > > > > > > > if (is_zone_device_page(page)) > > > > > > > > > > > put_page(page); > > > > > > > > > > > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > + > > > > > > > > > > > +/* > > > > > > > > > > > + * migrate_device_finalize() - complete page migration > > > > > > > > > > > + * @src_pfns: src_pfns returned from migrate_device_range() > > > > > > > > > > > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to > > > > > > > > > > > + * @npages: number of pages in the range > > > > > > > > > > > + * > > > > > > > > > > > + * Completes migration of the page by removing special migration entries. > > > > > > > > > > > + * Drivers must ensure copying of page data is complete and visible to the CPU > > > > > > > > > > > + * before calling this. 
> > > > > > > > > > > + */ > > > > > > > > > > > +void migrate_device_finalize(unsigned long *src_pfns, > > > > > > > > > > > + unsigned long *dst_pfns, unsigned long npages) > > > > > > > > > > > +{ > > > > > > > > > > > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); > > > > > > > > > > > +} > > > > > > > > > > > EXPORT_SYMBOL(migrate_device_finalize); > > > > > > > > > > > /** > > > > > > > > > > > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); > > > > > > > > > > > */ > > > > > > > > > > > void migrate_vma_finalize(struct migrate_vma *migrate) > > > > > > > > > > > { > > > > > > > > > > > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); > > > > > > > > > > > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, > > > > > > > > > > > + migrate->fault_page); > > > > > > > > > > > } > > > > > > > > > > > EXPORT_SYMBOL(migrate_vma_finalize); > > > > > > -- > > > Simona Vetter > > > Software Engineer, Intel Corporation > > > http://blog.ffwll.ch > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
Simona Vetter <simona.vetter@ffwll.ch> writes: > On Wed, Sep 25, 2024 at 01:44:37PM +0200, Simona Vetter wrote: >> On Tue, Sep 24, 2024 at 04:42:19PM +0000, Matthew Brost wrote: >> > On Tue, Sep 24, 2024 at 01:48:29PM +0200, Simona Vetter wrote: >> > > On Fri, Sep 20, 2024 at 09:59:51PM +0000, Matthew Brost wrote: >> > > > On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: >> > > > > >> > > > > On 2024-09-20 17:23, Matthew Brost wrote: >> > > > > > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: >> > > > > > > On 2024-09-18 11:10, Alistair Popple wrote: >> > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: >> > > > > > > > >> > > > > > > > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: >> > > > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: >> > > > > > > > > > >> > > > > > > > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) >> > > > > > > > > Still a driver. >> > > > > > > > Indeed, and I'm happy to answer any questions about our implementation. >> > > > > > > > >> > > > > > > > > > but theoretically it seems like it should be possible. However we >> > > > > > > > > > serialize migrations of the same virtual address range to avoid these >> > > > > > > > > > kind of issues as they can happen the other way too (ie. multiple >> > > > > > > > > > threads trying to migrate to GPU). >> > > > > > > > > > >> > > > > > > > > > So I suspect what happens in UVM is that one thread wins and installs >> > > > > > > > > > the migration entry while the others fail to get the driver migration >> > > > > > > > > > lock and bail out sufficiently early in the fault path to avoid the >> > > > > > > > > > live-lock. >> > > > > > > > > > >> > > > > > > > > I had to try hard to show this, doubt an actual user could trigger this. 
>> > > > > > > > > >> > > > > > > > > I wrote a test which kicked 8 threads, each thread did a pthread join, >> > > > > > > > > and then tried to read the same page. This repeats in loop for like 512 >> > > > > > > > > pages or something. I needed an exclusive lock in migrate_to_ram vfunc >> > > > > > > > > for it to livelock. Without an exclusive lock I think on average I saw >> > > > > > > > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a >> > > > > > > > > thread won this race. >> > > > > > > > > >> > > > > > > > > From reading UVM, pretty sure if you tried hard enough you could trigger >> > > > > > > > > a livelock given it appears you take excluvise locks in migrate_to_ram. >> > > > > > > > Yes, I suspect you're correct here and that we just haven't tried hard >> > > > > > > > enough to trigger it. >> > > > > > > > >> > > > > > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> >> > > > > > > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> >> > > > > > > > > > > Cc: Christian König <christian.koenig@amd.com> >> > > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> >> > > > > > > > > > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> >> > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> >> > > > > > > > > > > --- >> > > > > > > > > > > mm/memory.c | 13 +++++++--- >> > > > > > > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- >> > > > > > > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) >> > > > > > > > > > > >> > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c >> > > > > > > > > > > index 3c01d68065be..bbd97d16a96a 100644 >> > > > > > > > > > > --- a/mm/memory.c >> > > > > > > > > > > +++ b/mm/memory.c >> > > > > > > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > > > > > > > > > > * Get a page reference while we know the page can't be >> > > > > > > > > > > * freed. 
>> > > > > > > > > > > */ >> > > > > > > > > > > - get_page(vmf->page); >> > > > > > > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); >> > > > > > > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >> > > > > > > > > > > - put_page(vmf->page); >> > > > > > > > > > > + if (trylock_page(vmf->page)) { >> > > > > > > > > > > + get_page(vmf->page); >> > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); >> > > > > > > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So >> > > > > > > > > > rather than do this and then have to pass all this context >> > > > > > > > > > (ie. fault_page) down to the migrate_vma_* functions could we instead >> > > > > > > > > > just do what migrate_vma_collect_pmd() does here? Ie. we already have >> > > > > > > > > > the PTL and the page lock so there's no reason we couldn't just setup >> > > > > > > > > > the migration entry prior to calling migrate_to_ram(). >> > > > > > > > > > >> > > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not >> > > > > > > > > > migrating, but drivers could easily just fill in the src_pfn info after >> > > > > > > > > > calling migrate_vma_setup(). >> > > > > > > > > > >> > > > > > > > > > This would eliminate the whole fault_page ugliness. >> > > > > > > > > > >> > > > > > > > > This seems like it would work and agree it likely be cleaner. Let me >> > > > > > > > > play around with this and see what I come up with. Multi-tasking a bit >> > > > > > > > > so expect a bit of delay here. >> > > > > > > > > >> > > > > > > > > Thanks for the input, >> > > > > > > > > Matt >> > > > > > > Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, >> > > > > > > Christian and myself in the loop with future patches in this area. >> > > > > > > >> > > > > > Will do. Already have another local patch set which helps drivers dma >> > > > > > map 2M pages for migrations if SRAM is physically contiguous. 
Seems >> > > > > > helpful for performance on Intel hardware. Probably post that soon for >> > > > > > early feedback. >> > > > > >> > > > > OK. >> > > > > >> > > > > >> > > > > > >> > > > > > Longer term I'm thinking 2M migration entries, 2M device pages, and being >> > > > > > able to install 2M THP on VRAM -> SRAM could be really helpful. I'm >> > > > > > finding migrate_vma_* functions take up like 80-90% of the time in the >> > > > > > CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem >> > > > > > ideal. Seems like 2M entries for everything would really help here. No >> > > > > > idea how feasible this is as the core MM stuff gets confusing fast. Any >> > > > > > input on this idea? >> > > > > >> > > > > I agree with your observations. We found that the migrate_vma_* code was the >> > > > > bottleneck for migration performance as well, and not breaking 2M pages >> > > > > could reduce that overhead a lot. I don't have any specific ideas. I'm not >> > > > > familiar with the details of that code myself. Philip has looked at this >> > > > > (and some old NVidia patches from a few years ago) in the past but never had >> > > > > enough uninterrupted time to make it past prototyping. >> > > > > >> > > > >> > > > Cool, good to know this isn't some crazy idea. Doubt it happens anytime >> > > > soon as I need to get a working baseline in before anything, then start >> > > > applying optimizations and help get other features in to get the design >> > > > complete. But eventually will probably try to look at this. May ping >> > > > Philip and Nvidia when I have time to dig in here. Apologies for my late reply here, I have just returned from vacation. We (Nvidia) are actively looking at this as we have the same bottlenecks. Mostly I've been doing some clean-ups in MM to make compound ZONE_DEVICE pages possible. >> > > >> > > I think the big step will be moving hmm.c and migrate.c apis over from >> > > struct page to folios. 
That should also give us some nice benefits on the >> > > gpu side, since instead of 4k pages to track we could allocate 2m gpu >> > > pages. >> > > >> > >> > I was thinking just encode the order in the migration PFN like HMM >> > does. Really only the Nth-order entry in the page array needs to be >> > populated then - HMM populates every entry though, which doesn't seem >> > like it is required. Maybe having a folio API makes more sense? >> >> Both I'd say, as a first attempt at least. An array of folios, but only >> populate the ones we need and jump over empty entries. A bit wasteful, but >> eh it's just allocations. > > Ok, thought some more, I think there's two things going on: > > - spot contig memory sections so that the gpu side is more efficient and > can use bigger pagetables and stuff like that. this is what > hmm_range_fault does. > > - optimize the core mm book-keeping by working on folios instead of > individual pages. hmm_range_fault does not care because it doesn't grab > references or lock pages or do anything else interesting with them, the > entire synchronization is provided by mmu notifier retry loops. But the > migration code does do a lot of expensive stuff, so it would need that. > For hmm_range_fault it's probably not so important, so maybe we could > leave the folio conversion of that to later. > > I think we need both, meaning: > - switch to folio, leave out entries as NULL for compound folios > - on top of compound folios still track contig ranges so that the gpu side > can additionally benefit when e.g. 2M pages are split into smaller > folios but happen to be contig Definitely I think the folio approach makes sense and I was prototyping a series to allow compound device private pages and to migrate them. However I got caught up on similar questions around the migrate_vma_* API (specifically around whether to allow splitting/merging) so never got around to posting them. Probably the simple array-based approach makes sense though. 
In any case to get something like that merged I'd been asked if we could fix the ZONE_DEVICE refcounting mess for DAX (as it was the only one still relying on the off-by-one refcounts for compound pages). The good news is that series[1] is getting close to merged, and as a side-effect it allows for compound ZONE_DEVICE pages, so extending it to DEVICE_PRIVATE pages to allow THP migration shouldn't be too difficult and was going to be my next step once this was merged. So it's nice to know other people might care about this too. [1] - https://lore.kernel.org/linux-mm/cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apopple@nvidia.com/ > Cheers, Sima > >> >> > > Once we have folios at the driver/core mm api level doing all the fancy >> > > thp stuff should be at least a well-contained problem. But I might be >> > > delusionally optimistic here :-) >> > >> > I think it's contained in the sense that the DRM SVM layer just allocates a >> > THP or large continuous device memory and hands it off to the migrate layer >> > and that layer does the right thing. The 'right thing' here I believe is >> > a decent amount of core MM work though. 
>> -Sima >> >> > >> > Matt >> > >> > > -Sima >> > > >> > > > >> > > > Matt >> > > > >> > > > > Regards, >> > > > > Felix >> > > > > >> > > > > >> > > > > > >> > > > > > Matt >> > > > > > >> > > > > > > Regards, >> > > > > > > Felix >> > > > > > > >> > > > > > > >> > > > > > > > > > > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); >> > > > > > > > > > > + put_page(vmf->page); >> > > > > > > > > > > + unlock_page(vmf->page); >> > > > > > > > > > > + } else { >> > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); >> > > > > > > > > > > + } >> > > > > > > > > > > } else if (is_hwpoison_entry(entry)) { >> > > > > > > > > > > ret = VM_FAULT_HWPOISON; >> > > > > > > > > > > } else if (is_pte_marker_entry(entry)) { >> > > > > > > > > > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c >> > > > > > > > > > > index 6d66dc1c6ffa..049893a5a179 100644 >> > > > > > > > > > > --- a/mm/migrate_device.c >> > > > > > > > > > > +++ b/mm/migrate_device.c >> > > > > > > > > > > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > > > > > > > > > > struct mm_walk *walk) >> > > > > > > > > > > { >> > > > > > > > > > > struct migrate_vma *migrate = walk->private; >> > > > > > > > > > > + struct folio *fault_folio = migrate->fault_page ? 
>> > > > > > > > > > > + page_folio(migrate->fault_page) : NULL; >> > > > > > > > > > > struct vm_area_struct *vma = walk->vma; >> > > > > > > > > > > struct mm_struct *mm = vma->vm_mm; >> > > > > > > > > > > unsigned long addr = start, unmapped = 0; >> > > > > > > > > > > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > > > > > > > > > > folio_get(folio); >> > > > > > > > > > > spin_unlock(ptl); >> > > > > > > > > > > - if (unlikely(!folio_trylock(folio))) >> > > > > > > > > > > + if (unlikely(fault_folio != folio && >> > > > > > > > > > > + !folio_trylock(folio))) >> > > > > > > > > > > return migrate_vma_collect_skip(start, end, >> > > > > > > > > > > walk); >> > > > > > > > > > > ret = split_folio(folio); >> > > > > > > > > > > - folio_unlock(folio); >> > > > > > > > > > > + if (fault_folio != folio) >> > > > > > > > > > > + folio_unlock(folio); >> > > > > > > > > > > folio_put(folio); >> > > > > > > > > > > if (ret) >> > > > > > > > > > > return migrate_vma_collect_skip(start, end, >> > > > > > > > > > > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > > > > > > > > > > * optimisation to avoid walking the rmap later with >> > > > > > > > > > > * try_to_migrate(). 
>> > > > > > > > > > > */ >> > > > > > > > > > > - if (folio_trylock(folio)) { >> > > > > > > > > > > + if (fault_folio == folio || folio_trylock(folio)) { >> > > > > > > > > > > bool anon_exclusive; >> > > > > > > > > > > pte_t swp_pte; >> > > > > > > > > > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> > > > > > > > > > > if (folio_try_share_anon_rmap_pte(folio, page)) { >> > > > > > > > > > > set_pte_at(mm, addr, ptep, pte); >> > > > > > > > > > > - folio_unlock(folio); >> > > > > > > > > > > + if (fault_folio != folio) >> > > > > > > > > > > + folio_unlock(folio); >> > > > > > > > > > > folio_put(folio); >> > > > > > > > > > > mpfn = 0; >> > > > > > > > > > > goto next; >> > > > > > > > > > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, >> > > > > > > > > > > unsigned long npages, >> > > > > > > > > > > struct page *fault_page) >> > > > > > > > > > > { >> > > > > > > > > > > + struct folio *fault_folio = fault_page ? 
>> > > > > > > > > > > + page_folio(fault_page) : NULL; >> > > > > > > > > > > unsigned long i, restore = 0; >> > > > > > > > > > > bool allow_drain = true; >> > > > > > > > > > > unsigned long unmapped = 0; >> > > > > > > > > > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, >> > > > > > > > > > > remove_migration_ptes(folio, folio, false); >> > > > > > > > > > > src_pfns[i] = 0; >> > > > > > > > > > > - folio_unlock(folio); >> > > > > > > > > > > + if (fault_folio != folio) >> > > > > > > > > > > + folio_unlock(folio); >> > > > > > > > > > > folio_put(folio); >> > > > > > > > > > > restore--; >> > > > > > > > > > > } >> > > > > > > > > > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) >> > > > > > > > > > > return -EINVAL; >> > > > > > > > > > > if (args->fault_page && !is_device_private_page(args->fault_page)) >> > > > > > > > > > > return -EINVAL; >> > > > > > > > > > > + if (args->fault_page && !PageLocked(args->fault_page)) >> > > > > > > > > > > + return -EINVAL; >> > > > > > > > > > > memset(args->src, 0, sizeof(*args->src) * nr_pages); >> > > > > > > > > > > args->cpages = 0; >> > > > > > > > > > > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) >> > > > > > > > > > > } >> > > > > > > > > > > EXPORT_SYMBOL(migrate_vma_pages); >> > > > > > > > > > > -/* >> > > > > > > > > > > - * migrate_device_finalize() - complete page migration >> > > > > > > > > > > - * @src_pfns: src_pfns returned from migrate_device_range() >> > > > > > > > > > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to >> > > > > > > > > > > - * @npages: number of pages in the range >> > > > > > > > > > > - * >> > > > > > > > > > > - * Completes migration of the page by removing special migration entries. >> > > > > > > > > > > - * Drivers must ensure copying of page data is complete and visible to the CPU >> > > > > > > > > > > - * before calling this. 
>> > > > > > > > > > > - */ >> > > > > > > > > > > -void migrate_device_finalize(unsigned long *src_pfns, >> > > > > > > > > > > - unsigned long *dst_pfns, unsigned long npages) >> > > > > > > > > > > +static void __migrate_device_finalize(unsigned long *src_pfns, >> > > > > > > > > > > + unsigned long *dst_pfns, >> > > > > > > > > > > + unsigned long npages, >> > > > > > > > > > > + struct page *fault_page) >> > > > > > > > > > > { >> > > > > > > > > > > + struct folio *fault_folio = fault_page ? >> > > > > > > > > > > + page_folio(fault_page) : NULL; >> > > > > > > > > > > unsigned long i; >> > > > > > > > > > > for (i = 0; i < npages; i++) { >> > > > > > > > > > > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns, >> > > > > > > > > > > src = page_folio(page); >> > > > > > > > > > > dst = page_folio(newpage); >> > > > > > > > > > > remove_migration_ptes(src, dst, false); >> > > > > > > > > > > - folio_unlock(src); >> > > > > > > > > > > + if (fault_folio != src) >> > > > > > > > > > > + folio_unlock(src); >> > > > > > > > > > > if (is_zone_device_page(page)) >> > > > > > > > > > > put_page(page); >> > > > > > > > > > > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns, >> > > > > > > > > > > } >> > > > > > > > > > > } >> > > > > > > > > > > } >> > > > > > > > > > > + >> > > > > > > > > > > +/* >> > > > > > > > > > > + * migrate_device_finalize() - complete page migration >> > > > > > > > > > > + * @src_pfns: src_pfns returned from migrate_device_range() >> > > > > > > > > > > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to >> > > > > > > > > > > + * @npages: number of pages in the range >> > > > > > > > > > > + * >> > > > > > > > > > > + * Completes migration of the page by removing special migration entries. >> > > > > > > > > > > + * Drivers must ensure copying of page data is complete and visible to the CPU >> > > > > > > > > > > + * before calling this. 
>> > > > > > > > > > > + */ >> > > > > > > > > > > +void migrate_device_finalize(unsigned long *src_pfns, >> > > > > > > > > > > + unsigned long *dst_pfns, unsigned long npages) >> > > > > > > > > > > +{ >> > > > > > > > > > > + return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL); >> > > > > > > > > > > +} >> > > > > > > > > > > EXPORT_SYMBOL(migrate_device_finalize); >> > > > > > > > > > > /** >> > > > > > > > > > > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize); >> > > > > > > > > > > */ >> > > > > > > > > > > void migrate_vma_finalize(struct migrate_vma *migrate) >> > > > > > > > > > > { >> > > > > > > > > > > - migrate_device_finalize(migrate->src, migrate->dst, migrate->npages); >> > > > > > > > > > > + __migrate_device_finalize(migrate->src, migrate->dst, migrate->npages, >> > > > > > > > > > > + migrate->fault_page); >> > > > > > > > > > > } >> > > > > > > > > > > EXPORT_SYMBOL(migrate_vma_finalize); >> > > >> > > -- >> > > Simona Vetter >> > > Software Engineer, Intel Corporation >> > > http://blog.ffwll.ch >> >> -- >> Simona Vetter >> Software Engineer, Intel Corporation >> http://blog.ffwll.ch
On Tue, Oct 08, 2024 at 12:33:51PM +1100, Alistair Popple wrote: > > Simona Vetter <simona.vetter@ffwll.ch> writes: > > > On Wed, Sep 25, 2024 at 01:44:37PM +0200, Simona Vetter wrote: > >> On Tue, Sep 24, 2024 at 04:42:19PM +0000, Matthew Brost wrote: > >> > On Tue, Sep 24, 2024 at 01:48:29PM +0200, Simona Vetter wrote: > >> > > On Fri, Sep 20, 2024 at 09:59:51PM +0000, Matthew Brost wrote: > >> > > > On Fri, Sep 20, 2024 at 05:50:10PM -0400, Felix Kuehling wrote: > >> > > > > > >> > > > > On 2024-09-20 17:23, Matthew Brost wrote: > >> > > > > > On Fri, Sep 20, 2024 at 04:26:50PM -0400, Felix Kuehling wrote: > >> > > > > > > On 2024-09-18 11:10, Alistair Popple wrote: > >> > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > >> > > > > > > > > >> > > > > > > > > On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote: > >> > > > > > > > > > Matthew Brost <matthew.brost@intel.com> writes: > >> > > > > > > > > > > >> > > > > > > > > > I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know) > >> > > > > > > > > Still a driver. > >> > > > > > > > Indeed, and I'm happy to answer any questions about our implementation. > >> > > > > > > > > >> > > > > > > > > > but theoretically it seems like it should be possible. However we > >> > > > > > > > > > serialize migrations of the same virtual address range to avoid these > >> > > > > > > > > > kind of issues as they can happen the other way too (ie. multiple > >> > > > > > > > > > threads trying to migrate to GPU). > >> > > > > > > > > > > >> > > > > > > > > > So I suspect what happens in UVM is that one thread wins and installs > >> > > > > > > > > > the migration entry while the others fail to get the driver migration > >> > > > > > > > > > lock and bail out sufficiently early in the fault path to avoid the > >> > > > > > > > > > live-lock. > >> > > > > > > > > > > >> > > > > > > > > I had to try hard to show this, doubt an actual user could trigger this. 
> >> > > > > > > > > > >> > > > > > > > > I wrote a test which kicked 8 threads, each thread did a pthread join, > >> > > > > > > > > and then tried to read the same page. This repeats in loop for like 512 > >> > > > > > > > > pages or something. I needed an exclusive lock in migrate_to_ram vfunc > >> > > > > > > > > for it to livelock. Without an exclusive lock I think on average I saw > >> > > > > > > > > about 32k retries (i.e. migrate_to_ram calls on the same page) before a > >> > > > > > > > > thread won this race. > >> > > > > > > > > > >> > > > > > > > > From reading UVM, pretty sure if you tried hard enough you could trigger > >> > > > > > > > > a livelock given it appears you take excluvise locks in migrate_to_ram. > >> > > > > > > > Yes, I suspect you're correct here and that we just haven't tried hard > >> > > > > > > > enough to trigger it. > >> > > > > > > > > >> > > > > > > > > > > Cc: Philip Yang <Philip.Yang@amd.com> > >> > > > > > > > > > > Cc: Felix Kuehling <felix.kuehling@amd.com> > >> > > > > > > > > > > Cc: Christian König <christian.koenig@amd.com> > >> > > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > >> > > > > > > > > > > Suggessted-by: Simona Vetter <simona.vetter@ffwll.ch> > >> > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > >> > > > > > > > > > > --- > >> > > > > > > > > > > mm/memory.c | 13 +++++++--- > >> > > > > > > > > > > mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++-------------- > >> > > > > > > > > > > 2 files changed, 50 insertions(+), 23 deletions(-) > >> > > > > > > > > > > > >> > > > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > >> > > > > > > > > > > index 3c01d68065be..bbd97d16a96a 100644 > >> > > > > > > > > > > --- a/mm/memory.c > >> > > > > > > > > > > +++ b/mm/memory.c > >> > > > > > > > > > > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >> > > > > > > > > > > * Get a page reference while we know the page can't be > >> > > > > 
> > > > > > * freed. > >> > > > > > > > > > > */ > >> > > > > > > > > > > - get_page(vmf->page); > >> > > > > > > > > > > - pte_unmap_unlock(vmf->pte, vmf->ptl); > >> > > > > > > > > > > - ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > >> > > > > > > > > > > - put_page(vmf->page); > >> > > > > > > > > > > + if (trylock_page(vmf->page)) { > >> > > > > > > > > > > + get_page(vmf->page); > >> > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > >> > > > > > > > > > This is all beginning to look a lot like migrate_vma_collect_pmd(). So > >> > > > > > > > > > rather than do this and then have to pass all this context > >> > > > > > > > > > (ie. fault_page) down to the migrate_vma_* functions could we instead > >> > > > > > > > > > just do what migrate_vma_collect_pmd() does here? Ie. we already have > >> > > > > > > > > > the PTL and the page lock so there's no reason we couldn't just setup > >> > > > > > > > > > the migration entry prior to calling migrate_to_ram(). > >> > > > > > > > > > > >> > > > > > > > > > Obviously calling migrate_vma_setup() would show the page as not > >> > > > > > > > > > migrating, but drivers could easily just fill in the src_pfn info after > >> > > > > > > > > > calling migrate_vma_setup(). > >> > > > > > > > > > > >> > > > > > > > > > This would eliminate the whole fault_page ugliness. > >> > > > > > > > > > > >> > > > > > > > > This seems like it would work and agree it likely be cleaner. Let me > >> > > > > > > > > play around with this and see what I come up with. Multi-tasking a bit > >> > > > > > > > > so expect a bit of delay here. > >> > > > > > > > > > >> > > > > > > > > Thanks for the input, > >> > > > > > > > > Matt > >> > > > > > > Thanks! Sorry, I'm late catching up after a vacation. Please keep Philip, > >> > > > > > > Christian and myself in the loop with future patches in this area. > >> > > > > > > > >> > > > > > Will do. 
Already have another local patch set which helps drivers dma > >> > > > > > map 2M pages for migrations if SRAM is physically contiguous. Seems > >> > > > > > helpful for performance on Intel hardware. Probably post that soon for > >> > > > > > early feedack. > >> > > > > > >> > > > > OK. > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > Longer term I thinking 2M migration entries, 2M device pages, and being > >> > > > > > able to install 2M THP on VRAM -> SRAM could be really helpful. I'm > >> > > > > > finding migrate_vma_* functions take up like 80-90% of the time in the > >> > > > > > CPU / GPU fault handlers on a fault (or prefetch) which doesn't seem > >> > > > > > ideal. Seems like 2M entries for everything would really help here. No > >> > > > > > idea how feasible this is as the core MM stuff gets confusing fast. Any > >> > > > > > input on this idea? > >> > > > > > >> > > > > I agree with your observations. We found that the migrate_vma_* code was the > >> > > > > bottle neck for migration performance as well, and not breaking 2M pages > >> > > > > could reduce that overhead a lot. I don't have any specific ideas. I'm not > >> > > > > familiar with the details of that code myself. Philip has looked at this > >> > > > > (and some old NVidia patches from a few years ago) in the past but never had > >> > > > > enough uninterrupted time to make it past prototyping. > >> > > > > > >> > > > > >> > > > Cool good to know this isn't some crazy idea. Doubt it happen anytime > >> > > > soon as I need to get a working baseline in before anything then start > >> > > > applying optimizations and help in get other features to get the design > >> > > > complete. But eventually will probably try to look at this. May ping > >> > > > Philip and Nvidia when I have time to dig in here. > > Apologies for my late reply here, I have just returned from vacation. > > We (Nvidia) are actively looking at this as we have the same bottle > necks. 
Mostly I've been doing some clean-ups in MM to make compound > ZONE_DEVICE pages possible. > That's good to know all of us (Intel, AMD, Nvidia) are in agreement this is a bottleneck and needs fixing. > >> > > > >> > > I think the big step will be moving hmm.c and migrate.c apis over from > >> > > struct page to folios. That should also give us some nice benefits on the > >> > > gpu side, since instead of 4k pages to track we could allocate 2m gpu > >> > > pages. > >> > > > >> > > >> > I think was thinking just encode the order in the migration PFN like HMM > >> > does. Really only Nth order entry in the page array needs to be > >> > populated then - HMM populates every entry though which doesn't seem > >> > like that is required. Maybe having a folio API makes more sense? > >> > >> Both I'd say, as a first attempt at least. An array of folios, but only > >> populate the ones we need and jump over empty entries. A bit wasteful, but > >> eh it's just allocations. > > > > Ok thought some more, I think there's two things going on: > > > > - spot contig memory sections so that the gpu side is more efficient and > > can user bigger pagetables and stuff like that. this is what > > hmm_range_fault does. > > > > - optimize the core mm book-keeping by working on folios instead of > > individual pages. hmm_range_fault does not care because it doesn't grab > > references or lock pages or do anything else interesting with them, the > > entire synchronization is provided by mmu notifier retry loops. But the > > migration code does do a lot of expensive stuff, so it would need that. > > For hmm_range_fault it's probably not so important, so maybe we could > > leave the folio conversion of that to later. > > > > I think we need both, meaning: > > - switch to folio, leave out entries as NULL for compound folios > > - on top of compoung folios still track contig ranges so that the gpu side > > can additionally benefit when e.g. 
2M pages are split into smaller > > folios but happen to be contig > > Definitely I think the folio approach makes sense and I was prototyping > a series to allow compound device private pages and to migrate > them. However I got caught up on similar questions around migrate_vma_* > API (specifically around whether to allow splitting/merging) so never > got around to posting them. Probably the simple array-based approach > makes sense though. > If/when we (Intel) have time to look at this we will reach out, but it seems like you may beat us to it. Happy to pull in code to test out and to help with reviews. > In any case to get something like that merged I'd been asked if we could > fix the ZONE_DEVICE refcounting mess for DAX (as it was the only one > still relying on the off-by-one refcounts for compound pages). The good > news is that series[1] is getting close to merged, and as a side-effect > it allows for compound ZONE_DEVICE pages so extending it to > DEVICE_PRIVATE pages to allow THP migration shouldn't be too difficult > and was going to be my next step once this was merged. So it's nice to > know other people might care about this too. > > [1] - https://lore.kernel.org/linux-mm/cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apopple@nvidia.com/ > Let me look here. Matt > > Cheers, Sima > > > >> > >> > > Once we have folios at the driver/core mm api level doing all the fancy > >> > > thp stuff should be at least a well-contained problem. But I might be > >> > > delusionally optimistic here :-) > >> > > >> > I think it's contained in the sense that the DRM SVM layer just allocates a > >> > THP or large continuous device memory and hands it off to the migrate layer > >> > and that layer does the right thing. The 'right thing' here I believe is > >> > a decent amount of core MM work though. 
> >> > >> Yeah that's what I meant, once we have switched the interfaces to be > >> arrays of folios, where for larger folios we leave the entries in between > >> NULL and have some appropriate iterators, then the driver side is done > >> mostly. The core mm side to actually make use of that will be fairly > >> gnarly though. > >> -Sima > >> > >> > > >> > Matt > >> > > >> > > -Sima > >> > > > >> > > > > >> > > > Matt > >> > > > > >> > > > > Regards, > >> > > > > Felix > >> > > > > > >> > > > > > >> > > > > > > >> > > > > > Matt > >> > > > > > > >> > > > > > > Regards, > >> > > > > > > Felix > >> > > > > > > > >> > > > > > > > >> > > > > > > > > > > + ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > >> > > > > > > > > > > + put_page(vmf->page); > >> > > > > > > > > > > + unlock_page(vmf->page); > >> > > > > > > > > > > + } else { > >> > > > > > > > > > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > >> > > > > > > > > > > + } > >> > > > > > > > > > > } else if (is_hwpoison_entry(entry)) { > >> > > > > > > > > > > ret = VM_FAULT_HWPOISON; > >> > > > > > > > > > > } else if (is_pte_marker_entry(entry)) { > >> > > > > > > > > > > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > >> > > > > > > > > > > index 6d66dc1c6ffa..049893a5a179 100644 > >> > > > > > > > > > > --- a/mm/migrate_device.c > >> > > > > > > > > > > +++ b/mm/migrate_device.c > >> > > > > > > > > > > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> > > > > > > > > > > struct mm_walk *walk) > >> > > > > > > > > > > { > >> > > > > > > > > > > struct migrate_vma *migrate = walk->private; > >> > > > > > > > > > > + struct folio *fault_folio = migrate->fault_page ? 
> >> > > > > > > > > > > + page_folio(migrate->fault_page) : NULL; > >> > > > > > > > > > > struct vm_area_struct *vma = walk->vma; > >> > > > > > > > > > > struct mm_struct *mm = vma->vm_mm; > >> > > > > > > > > > > unsigned long addr = start, unmapped = 0; > >> > > > > > > > > > > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> > > > > > > > > > > folio_get(folio); > >> > > > > > > > > > > spin_unlock(ptl); > >> > > > > > > > > > > - if (unlikely(!folio_trylock(folio))) > >> > > > > > > > > > > + if (unlikely(fault_folio != folio && > >> > > > > > > > > > > + !folio_trylock(folio))) > >> > > > > > > > > > > return migrate_vma_collect_skip(start, end, > >> > > > > > > > > > > walk); > >> > > > > > > > > > > ret = split_folio(folio); > >> > > > > > > > > > > - folio_unlock(folio); > >> > > > > > > > > > > + if (fault_folio != folio) > >> > > > > > > > > > > + folio_unlock(folio); > >> > > > > > > > > > > folio_put(folio); > >> > > > > > > > > > > if (ret) > >> > > > > > > > > > > return migrate_vma_collect_skip(start, end, > >> > > > > > > > > > > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> > > > > > > > > > > * optimisation to avoid walking the rmap later with > >> > > > > > > > > > > * try_to_migrate(). 
> >> > > > > > > > > > > */ > >> > > > > > > > > > > - if (folio_trylock(folio)) { > >> > > > > > > > > > > + if (fault_folio == folio || folio_trylock(folio)) { > >> > > > > > > > > > > bool anon_exclusive; > >> > > > > > > > > > > pte_t swp_pte; > >> > > > > > > > > > > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> > > > > > > > > > > if (folio_try_share_anon_rmap_pte(folio, page)) { > >> > > > > > > > > > > set_pte_at(mm, addr, ptep, pte); > >> > > > > > > > > > > - folio_unlock(folio); > >> > > > > > > > > > > + if (fault_folio != folio) > >> > > > > > > > > > > + folio_unlock(folio); > >> > > > > > > > > > > folio_put(folio); > >> > > > > > > > > > > mpfn = 0; > >> > > > > > > > > > > goto next; > >> > > > > > > > > > > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > >> > > > > > > > > > > unsigned long npages, > >> > > > > > > > > > > struct page *fault_page) > >> > > > > > > > > > > { > >> > > > > > > > > > > + struct folio *fault_folio = fault_page ? 
> >> > > > > > > > > > > + page_folio(fault_page) : NULL; > >> > > > > > > > > > > unsigned long i, restore = 0; > >> > > > > > > > > > > bool allow_drain = true; > >> > > > > > > > > > > unsigned long unmapped = 0; > >> > > > > > > > > > > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, > >> > > > > > > > > > > remove_migration_ptes(folio, folio, false); > >> > > > > > > > > > > src_pfns[i] = 0; > >> > > > > > > > > > > - folio_unlock(folio); > >> > > > > > > > > > > + if (fault_folio != folio) > >> > > > > > > > > > > + folio_unlock(folio); > >> > > > > > > > > > > folio_put(folio); > >> > > > > > > > > > > restore--; > >> > > > > > > > > > > } > >> > > > > > > > > > > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args) > >> > > > > > > > > > > return -EINVAL; > >> > > > > > > > > > > if (args->fault_page && !is_device_private_page(args->fault_page)) > >> > > > > > > > > > > return -EINVAL; > >> > > > > > > > > > > + if (args->fault_page && !PageLocked(args->fault_page)) > >> > > > > > > > > > > + return -EINVAL; > >> > > > > > > > > > > memset(args->src, 0, sizeof(*args->src) * nr_pages); > >> > > > > > > > > > > args->cpages = 0; > >> > > > > > > > > > > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate) > >> > > > > > > > > > > } > >> > > > > > > > > > > EXPORT_SYMBOL(migrate_vma_pages); > >> > > > > > > > > > > -/* > >> > > > > > > > > > > - * migrate_device_finalize() - complete page migration > >> > > > > > > > > > > - * @src_pfns: src_pfns returned from migrate_device_range() > >> > > > > > > > > > > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to > >> > > > > > > > > > > - * @npages: number of pages in the range > >> > > > > > > > > > > - * > >> > > > > > > > > > > - * Completes migration of the page by removing special migration entries. 
diff --git a/mm/memory.c b/mm/memory.c
index 3c01d68065be..bbd97d16a96a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
			 * Get a page reference while we know the page can't be
			 * freed.
			 */
-			get_page(vmf->page);
-			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
-			put_page(vmf->page);
+			if (trylock_page(vmf->page)) {
+				get_page(vmf->page);
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+				put_page(vmf->page);
+				unlock_page(vmf->page);
+			} else {
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+			}
		} else if (is_hwpoison_entry(entry)) {
			ret = VM_FAULT_HWPOISON;
		} else if (is_pte_marker_entry(entry)) {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..049893a5a179 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
				   struct mm_walk *walk)
 {
	struct migrate_vma *migrate = walk->private;
+	struct folio *fault_folio = migrate->fault_page ?
+		page_folio(migrate->fault_page) : NULL;
	struct vm_area_struct *vma = walk->vma;
	struct mm_struct *mm = vma->vm_mm;
	unsigned long addr = start, unmapped = 0;
@@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

			folio_get(folio);
			spin_unlock(ptl);
-			if (unlikely(!folio_trylock(folio)))
+			if (unlikely(fault_folio != folio &&
+				     !folio_trylock(folio)))
				return migrate_vma_collect_skip(start, end,
								walk);
			ret = split_folio(folio);
-			folio_unlock(folio);
+			if (fault_folio != folio)
+				folio_unlock(folio);
			folio_put(folio);
			if (ret)
				return migrate_vma_collect_skip(start, end,
@@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
			 * optimisation to avoid walking the rmap later with
			 * try_to_migrate().
			 */
-			if (folio_trylock(folio)) {
+			if (fault_folio == folio || folio_trylock(folio)) {
				bool anon_exclusive;
				pte_t swp_pte;

@@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
					if (folio_try_share_anon_rmap_pte(folio, page)) {
						set_pte_at(mm, addr, ptep, pte);
-						folio_unlock(folio);
+						if (fault_folio != folio)
+							folio_unlock(folio);
						folio_put(folio);
						mpfn = 0;
						goto next;
@@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
					  unsigned long npages,
					  struct page *fault_page)
 {
+	struct folio *fault_folio = fault_page ?
+		page_folio(fault_page) : NULL;
	unsigned long i, restore = 0;
	bool allow_drain = true;
	unsigned long unmapped = 0;
@@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
		remove_migration_ptes(folio, folio, false);
		src_pfns[i] = 0;
-		folio_unlock(folio);
+		if (fault_folio != folio)
+			folio_unlock(folio);
		folio_put(folio);
		restore--;
	}
@@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
		return -EINVAL;
	if (args->fault_page && !is_device_private_page(args->fault_page))
		return -EINVAL;
+	if (args->fault_page && !PageLocked(args->fault_page))
+		return -EINVAL;

	memset(args->src, 0, sizeof(*args->src) * nr_pages);
	args->cpages = 0;
@@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 }
 EXPORT_SYMBOL(migrate_vma_pages);

-/*
- * migrate_device_finalize() - complete page migration
- * @src_pfns: src_pfns returned from migrate_device_range()
- * @dst_pfns: array of pfns allocated by the driver to migrate memory to
- * @npages: number of pages in the range
- *
- * Completes migration of the page by removing special migration entries.
- * Drivers must ensure copying of page data is complete and visible to the CPU
- * before calling this.
- */
-void migrate_device_finalize(unsigned long *src_pfns,
-			unsigned long *dst_pfns, unsigned long npages)
+static void __migrate_device_finalize(unsigned long *src_pfns,
+				      unsigned long *dst_pfns,
+				      unsigned long npages,
+				      struct page *fault_page)
 {
+	struct folio *fault_folio = fault_page ?
+		page_folio(fault_page) : NULL;
	unsigned long i;

	for (i = 0; i < npages; i++) {
@@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
		src = page_folio(page);
		dst = page_folio(newpage);
		remove_migration_ptes(src, dst, false);
-		folio_unlock(src);
+		if (fault_folio != src)
+			folio_unlock(src);

		if (is_zone_device_page(page))
			put_page(page);
@@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
		}
	}
 }
+
+/*
+ * migrate_device_finalize() - complete page migration
+ * @src_pfns: src_pfns returned from migrate_device_range()
+ * @dst_pfns: array of pfns allocated by the driver to migrate memory to
+ * @npages: number of pages in the range
+ *
+ * Completes migration of the page by removing special migration entries.
+ * Drivers must ensure copying of page data is complete and visible to the CPU
+ * before calling this.
+ */
+void migrate_device_finalize(unsigned long *src_pfns,
+			unsigned long *dst_pfns, unsigned long npages)
+{
+	return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
+}
 EXPORT_SYMBOL(migrate_device_finalize);

 /**
@@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
  */
 void migrate_vma_finalize(struct migrate_vma *migrate)
 {
-	migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
+	__migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
+				  migrate->fault_page);
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
Avoid multiple CPU page faults to the same device page racing by locking
the page in do_swap_page before taking an additional reference to the
page. This prevents scenarios where multiple CPU page faults each take
an extra reference to a device page, which could abort migration in
folio_migrate_mapping. With the device page locked in do_swap_page, the
migrate_vma_* functions need to be updated to avoid locking the
fault_page argument.

Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
DRM driver) SVM implementation if enough threads faulted the same device
page.

Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Simona Vetter <simona.vetter@ffwll.ch>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 mm/memory.c         | 13 +++++++---
 mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++--------------
 2 files changed, 50 insertions(+), 23 deletions(-)