
[6/6] mm/gup: migrate pinned pages out of movable zone

Message ID 20201202052330.474592-7-pasha.tatashin@soleen.com (mailing list archive)
State New, archived
Series prohibit pinning pages in ZONE_MOVABLE

Commit Message

Pasha Tatashin Dec. 2, 2020, 5:23 a.m. UTC
We do not allocate pages for pinning in ZONE_MOVABLE, but if pages were
already allocated before pinning, they need to be migrated to a different
zone. Currently, we migrate movable CMA pages only. Generalize the function
that migrates CMA pages to migrate all movable pages.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/migrate.h        |  1 +
 include/trace/events/migrate.h |  3 +-
 mm/gup.c                       | 56 +++++++++++++---------------------
 3 files changed, 24 insertions(+), 36 deletions(-)

Comments

Jason Gunthorpe Dec. 2, 2020, 4:35 p.m. UTC | #1
On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/gup.c b/mm/gup.c
> index 724d8a65e1df..1d511f65f8a7 100644
> +++ b/mm/gup.c
> @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
>  }
>  #endif
>  
> -#ifdef CONFIG_CMA
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> -					unsigned long start,
> -					unsigned long nr_pages,
> -					struct page **pages,
> -					struct vm_area_struct **vmas,
> -					unsigned int gup_flags)
> +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> +					    unsigned long start,
> +					    unsigned long nr_pages,
> +					    struct page **pages,
> +					    struct vm_area_struct **vmas,
> +					    unsigned int gup_flags)
>  {
>  	unsigned long i;
>  	unsigned long step;
>  	bool drain_allow = true;
>  	bool migrate_allow = true;
> -	LIST_HEAD(cma_page_list);
> +	LIST_HEAD(page_list);
>  	long ret = nr_pages;
>  	struct migration_target_control mtc = {
>  		.nid = NUMA_NO_NODE,
> @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>  		 */
>  		step = compound_nr(head) - (pages[i] - head);
>  		/*
> -		 * If we get a page from the CMA zone, since we are going to
> -		 * be pinning these entries, we might as well move them out
> -		 * of the CMA zone if possible.
> +		 * If we get a movable page, since we are going to be pinning
> +		 * these entries, try to move them out if possible.
>  		 */
> -		if (is_migrate_cma_page(head)) {
> +		if (is_migrate_movable(get_pageblock_migratetype(head))) {
>  			if (PageHuge(head))

It is a good moment to say, I really dislike how this was implemented
in the first place.

Scanning the output of gup just to do the is_migrate_movable() test is
kind of nonsense and slow. It would be better/faster to handle this
directly while gup is scanning the page tables and adding pages to the
list.

Now that this becoming more general, can you take a moment to see if a
better implementation could be possible?

Also, something takes care of the gup fast path too?

Jason
Pasha Tatashin Dec. 3, 2020, 12:19 a.m. UTC | #2
> It is a good moment to say, I really dislike how this was implemented
> in the first place.
>
> Scanning the output of gup just to do the is_migrate_movable() test is
> kind of nonsense and slow. It would be better/faster to handle this
> directly while gup is scanning the page tables and adding pages to the
> list.

Hi Jason,

I assume you mean to migrate pages as soon as they are followed, and
skip those that are faulted in, as we already know that faulted pages
are allocated from a non-movable zone.

The place would be:

__get_user_pages()
      while(more pages)
          get_gate_page()
          follow_hugetlb_page()
          follow_page_mask()

          if (!page)
               faultin_page()

          if (page && !faulted && (gup_flags & FOLL_LONGTERM) )
                check_and_migrate this page
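
As a rough C sketch of that idea (follow_page_mask()/faultin_page() are
the existing calls; migrate_movable_page() is a made-up helper name, not
an existing function):

    page = follow_page_mask(vma, start, foll_flags, &ctx);
    if (!page) {
            ret = faultin_page(vma, start, &foll_flags, locked);
            /* faulted-in pages already come from a non-movable zone */
    } else if ((gup_flags & FOLL_LONGTERM) &&
               is_migrate_movable(get_pageblock_migratetype(page))) {
            if (!migrate_movable_page(mm, page))    /* hypothetical */
                    goto retry;  /* re-resolve the now-migrated address */
    }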

I looked at that function, and I do not think the code will be cleaner
there, as that function already has a complicated loop.  The only
drawback with the current approach that I see is that
check_and_migrate_movable_pages() has to check the faulted pages once
more. While not optimal, this is not horrible. FOLL_LONGTERM should
not happen too frequently, so having one extra nr_pages loop should
not hurt the performance. Also, I checked, and most of the users of
FOLL_LONGTERM pin only one page at a time, which means the extra loop
only checks a single page.

We could discuss improving this code further. For example, I still
think it would be a good idea to pass an appropriate gfp_mask via
fault_flags from gup_flags instead of using the
PF_MEMALLOC_NOMOVABLE (previously PF_MEMALLOC_NOCMA) per-context flag.
However, those changes can come after this series. The current series
fixes a bug where hot-remove is not working, while making a minimal
amount of changes, so it is easy to backport to stable kernels.
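
For reference, the per-context flag approach after this series looks
roughly like this; memalloc_nomovable_save() is assumed to be the
counterpart of the restore call visible in the patch:

    /* In __gup_longterm_locked(), simplified: */
    if (gup_flags & FOLL_LONGTERM)
            flags = memalloc_nomovable_save();   /* sets PF_MEMALLOC_NOMOVABLE */

    rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas_tmp,
                                 NULL, gup_flags);

    if (gup_flags & FOLL_LONGTERM) {
            if (rc > 0)
                    rc = check_and_migrate_movable_pages(mm, start, rc, pages,
                                                         vmas_tmp, gup_flags);
            memalloc_nomovable_restore(flags);
    }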

Thank you,
Pasha



>
> Now that this becoming more general, can you take a moment to see if a
> better implementation could be possible?
>
> Also, something takes care of the gup fast path too?
>
> Jason
Jason Gunthorpe Dec. 3, 2020, 1:08 a.m. UTC | #3
On Wed, Dec 02, 2020 at 07:19:45PM -0500, Pavel Tatashin wrote:
> > It is a good moment to say, I really dislike how this was implemented
> > in the first place.
> >
> > Scanning the output of gup just to do the is_migrate_movable() test is
> > kind of nonsense and slow. It would be better/faster to handle this
> > directly while gup is scanning the page tables and adding pages to the
> > list.
> 
> Hi Jason,
> 
> I assume you mean to migrate pages as soon as they are followed and
> skip those that are faulted, as we already know that faulted pages are
> allocated from nomovable zone.
> 
> The place would be:
> 
> __get_user_pages()
>       while(more pages)
>           get_gate_page()
>           follow_hugetlb_page()
>           follow_page_mask()
> 
>           if (!page)
>                faultin_page()
> 
>           if (page && !faulted && (gup_flags & FOLL_LONGTERM) )
>                 check_and_migrate this page

Either here or perhaps even lower down the call chain when the page is
captured, similar to how GUP fast would detect it. (how is that done
anyhow?)

> I looked at that function, and I do not think the code will be cleaner
> there, as that function already has a complicated loop.  

That function is complicated for its own reasons.. But complicated is
not the point here..

> The only drawback with the current approach that I see is that
> check_and_migrate_movable_pages() has to check once the faulted
> pages.

Yes

> This is while not optimal is not horrible. 

It is.

> The FOLL_LONGTERM should not happen too frequently, so having one
> extra nr_pages loop should not hurt the performance. 

FOLL_LONGTERM is typically used with very large regions, for instance
we are benchmarking around the 300G level. It takes 10s of seconds for
get_user_pages to operate. There are many inefficiencies in this
path. This extra work of re-scanning the list is part of the cost.

Further, having these special wrappers just for FOLL_LONGTERM has a
spill over complexity on the entire rest of the callchain up to here,
we now have endless wrappers and varieties of function calls that
generally are happening because the longterm path needs to end up in a
different place than other paths.

IMHO this is due to the lack of integration with the primary loop
above

> Also, I checked and most of the users of FOLL_LONGTERM pin only one
> page at a time. Which means the extra loop is only to check a single
> page.

Er, I don't know what you checked but those are not the cases I
see. Two big users are vfio and rdma. Both are pinning huge ranges of
memory in very typical use cases.

> However, those changes can come after this series. The current series
> fixes a bug where hot-remove is not working with making minimal amount
> of changes, so it is easy to backport it to stable kernels.

This is a good point, good enough that you should probably continue as
is

Jason
Pasha Tatashin Dec. 3, 2020, 1:34 a.m. UTC | #4
On Wed, Dec 2, 2020 at 8:08 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Dec 02, 2020 at 07:19:45PM -0500, Pavel Tatashin wrote:
> > > It is a good moment to say, I really dislike how this was implemented
> > > in the first place.
> > >
> > > Scanning the output of gup just to do the is_migrate_movable() test is
> > > kind of nonsense and slow. It would be better/faster to handle this
> > > directly while gup is scanning the page tables and adding pages to the
> > > list.
> >
> > Hi Jason,
> >
> > I assume you mean to migrate pages as soon as they are followed and
> > skip those that are faulted, as we already know that faulted pages are
> > allocated from nomovable zone.
> >
> > The place would be:
> >
> > __get_user_pages()
> >       while(more pages)
> >           get_gate_page()
> >           follow_hugetlb_page()
> >           follow_page_mask()
> >
> >           if (!page)
> >                faultin_page()
> >
> >           if (page && !faulted && (gup_flags & FOLL_LONGTERM) )
> >                 check_and_migrate this page
>
> Either here or perhaps even lower down the call chain when the page is
> captured, similar to how GUP fast would detect it. (how is that done
> anyhow?)

Ah, thank you for pointing this out. I think I need to address it here:

https://soleen.com/source/xref/linux/mm/gup.c?r=96e1fac1#94

static __maybe_unused struct page *try_grab_compound_head()
              if (unlikely(flags & FOLL_LONGTERM) &&  is_migrate_cma_page(page))
                   return NULL;

I need to change the is_migrate_cma_page() check to cover all migratable
pages. I will study this and send an update with the fix.
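
Roughly what I have in mind; page_is_pinnable() is only a placeholder
name until the exact predicate is settled:

    /* In try_grab_compound_head(), sketch only: */
    if (unlikely(flags & FOLL_LONGTERM) && !page_is_pinnable(page))
            return NULL;

    /*
     * page_is_pinnable() would be a hypothetical helper that rejects
     * anything we would have to migrate before a long-term pin, not
     * just CMA pages.
     */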

>
> > I looked at that function, and I do not think the code will be cleaner
> > there, as that function already has a complicated loop.
>
> That function is complicated for its own reasons.. But complicated is
> not the point here..
>
> > The only drawback with the current approach that I see is that
> > check_and_migrate_movable_pages() has to check once the faulted
> > pages.
>
> Yes
>
> > This is while not optimal is not horrible.
>
> It is.
>
> > The FOLL_LONGTERM should not happen too frequently, so having one
> > extra nr_pages loop should not hurt the performance.
>
> FOLL_LONGTERM is typically used with very large regions, for instance
> we are benchmarking around the 300G level. It takes 10s of seconds for
> get_user_pages to operate. There are many inefficiencies in this
> path. This extra work of re-scanning the list is part of the cost.

OK, I did not realize that pinning was used for such large regions; the
path must be optimized.

>
> Further, having these special wrappers just for FOLL_LONGTERM has a
> spill over complexity on the entire rest of the callchain up to here,
> we now have endless wrappers and varieties of function calls that
> generally are happening because the longterm path needs to end up in a
> different place than other paths.
>
> IMHO this is due to the lack of integration with the primary loop
> above
>
> > Also, I checked and most of the users of FOLL_LONGTERM pin only one
> > page at a time. Which means the extra loop is only to check a single
> > page.
>
> Er, I don't know what you checked but those are not the cases I
> see. Two big users are vfio and rdma. Both are pinning huge ranges of
> memory in very typical use cases.

What I meant is that the users of the interface do it incrementally,
not in large chunks. For example:

vfio_pin_pages_remote
   vaddr_get_pfn
        ret = pin_user_pages_remote(mm, vaddr, 1, flags |
FOLL_LONGTERM, page, NULL, NULL);
1 -> pin only one pages at a time

RDMA indeed can do it in one chunk though. Regardless, VFIO should
probably be optimized to pin in larger chunks, and the code path
should be optimized for the reasons you gave above.

>
> > However, those changes can come after this series. The current series
> > fixes a bug where hot-remove is not working with making minimal amount
> > of changes, so it is easy to backport it to stable kernels.
>
> This is a good point, good enough that you should probably continue as
> is

I will continue looking into this code and see if I can address your
concerns in follow-up fixes.


Thanks,
Pasha

>
> Jason
John Hubbard Dec. 3, 2020, 8:22 a.m. UTC | #5
On 12/1/20 9:23 PM, Pavel Tatashin wrote:
> We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> allocated before pinning they need to migrated to a different zone.
> Currently, we migrate movable CMA pages only. Generalize the function
> that migrates CMA pages to migrate all movable pages.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> ---
>   include/linux/migrate.h        |  1 +
>   include/trace/events/migrate.h |  3 +-
>   mm/gup.c                       | 56 +++++++++++++---------------------
>   3 files changed, 24 insertions(+), 36 deletions(-)
> 

I like the cleanup so far; even at this point it's a relief to finally
see the nested ifdefs get removed.

One naming nit/idea below, but this looks fine as is, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0f8d1583fa8e..00bab23d1ee5 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -27,6 +27,7 @@ enum migrate_reason {
>   	MR_MEMPOLICY_MBIND,
>   	MR_NUMA_MISPLACED,
>   	MR_CONTIG_RANGE,
> +	MR_LONGTERM_PIN,
>   	MR_TYPES
>   };
>   
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 4d434398d64d..363b54ce104c 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -20,7 +20,8 @@
>   	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>   	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>   	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
> -	EMe(MR_CONTIG_RANGE,	"contig_range")
> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
> +	EMe(MR_LONGTERM_PIN,	"longterm_pin")
>   
>   /*
>    * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/gup.c b/mm/gup.c
> index 724d8a65e1df..1d511f65f8a7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
>   }
>   #endif
>   
> -#ifdef CONFIG_CMA
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> -					unsigned long start,
> -					unsigned long nr_pages,
> -					struct page **pages,
> -					struct vm_area_struct **vmas,
> -					unsigned int gup_flags)
> +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> +					    unsigned long start,
> +					    unsigned long nr_pages,
> +					    struct page **pages,
> +					    struct vm_area_struct **vmas,
> +					    unsigned int gup_flags)
>   {
>   	unsigned long i;
>   	unsigned long step;
>   	bool drain_allow = true;
>   	bool migrate_allow = true;
> -	LIST_HEAD(cma_page_list);
> +	LIST_HEAD(page_list);


Maybe naming it "movable_page_list" would be a nice touch.


thanks,
Jason Gunthorpe Dec. 3, 2020, 2:17 p.m. UTC | #6
On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:

> > Either here or perhaps even lower down the call chain when the page is
> > captured, similar to how GUP fast would detect it. (how is that done
> > anyhow?)
> 
> Ah, thank you for pointing this out. I think I need to address it here:
> 
> https://soleen.com/source/xref/linux/mm/gup.c?r=96e1fac1#94
> 
> static __maybe_unused struct page *try_grab_compound_head()
>               if (unlikely(flags & FOLL_LONGTERM) &&  is_migrate_cma_page(page))
>                    return NULL;
> 
> I need to change is_migrate_cma_page() to all migratable pages. Will
> study, and send an update with this fix.

Yes, missing the two flows is a common error :(

Looking at this code some more.. How is it even correct?

1633  				if (!isolate_lru_page(head)) {
1634  					list_add_tail(&head->lru, &cma_page_list);

Here we are only running under the read side of the mmap sem so multiple
GUPs can be calling that sequence in parallel. I don't see any
obvious exclusion that will prevent corruption of head->lru. The first
GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
GUP thread will be a NOP for isolate_lru_page().

They will both race list_add_tail and other list ops. That is not OK.

> What I meant is the users of the interface do it incrementally not in
> large chunks. For example:
> 
> vfio_pin_pages_remote
>    vaddr_get_pfn
>         ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> FOLL_LONGTERM, page, NULL, NULL);
> 1 -> pin only one pages at a time

I don't know why vfio does this; it is why it is so ridiculously slow,
at least.

Jason
Pasha Tatashin Dec. 3, 2020, 3:55 p.m. UTC | #7
> I like the cleanup so far, even at this point it's a relief to finally
> see the nested ifdefs get removed.
>
> One naming nit/idea below, but this looks fine as is, so:
>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thank you for reviewing this series.

> Maybe naming it "movable_page_list", would be a nice touch.

Sure, I will change it.

Thank you,
Pasha
Pasha Tatashin Dec. 3, 2020, 4:40 p.m. UTC | #8
> Looking at this code some more.. How is it even correct?
>
> 1633                            if (!isolate_lru_page(head)) {
> 1634                                    list_add_tail(&head->lru, &cma_page_list);
>
> Here we are only running under the read side of the mmap sem so multiple
> GUPs can be calling that sequence in parallel. I don't see any
> obvious exclusion that will prevent corruption of head->lru. The first
> GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> GUP thread will be a NOP for isolate_lru_page().
>
> They will both race list_add_tail and other list ops. That is not OK.

Good question. I studied it, and I do not see how this is OK. Worse,
this race is also exposable via a syscall instead of via a driver: two
move_pages() calls running simultaneously. Perhaps in other places too?

move_pages()
  kernel_move_pages()
    mmget()
    do_pages_move()
      add_page_for_migration()
         mmap_read_lock(mm);
         list_add_tail(&head->lru, pagelist); <- Not protected

>
> > What I meant is the users of the interface do it incrementally not in
> > large chunks. For example:
> >
> > vfio_pin_pages_remote
> >    vaddr_get_pfn
> >         ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> > FOLL_LONGTERM, page, NULL, NULL);
> > 1 -> pin only one pages at a time
>
> I don't know why vfio does this, it is why it so ridiculously slow at
> least.

Agreed.

>
> Jason
Jason Gunthorpe Dec. 3, 2020, 4:59 p.m. UTC | #9
On Thu, Dec 03, 2020 at 11:40:15AM -0500, Pavel Tatashin wrote:
> > Looking at this code some more.. How is it even correct?
> >
> > 1633                            if (!isolate_lru_page(head)) {
> > 1634                                    list_add_tail(&head->lru, &cma_page_list);
> >
> > Here we are only running under the read side of the mmap sem so multiple
> > GUPs can be calling that sequence in parallel. I don't see any
> > obvious exclusion that will prevent corruption of head->lru. The first
> > GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> > GUP thread will be a NOP for isolate_lru_page().
> >
> > They will both race list_add_tail and other list ops. That is not OK.
> 
> Good question. I studied it, and I do not see how this is OK. Worse,
> this race is also exposable as a syscall instead of via driver: two
> move_pages() run simultaneously. Perhaps in other places?
> 
> move_pages()
>   kernel_move_pages()
>     mmget()
>     do_pages_move()
>       add_page_for_migratio()
>          mmap_read_lock(mm);
>          list_add_tail(&head->lru, pagelist); <- Not protected

When this was CMA only it might have been rarer to trigger, but this
movable-zone stuff sounds like it makes it much more broadly
reachable, eg on typical servers with RDMA exposed/etc

Seems like it needs fixing as part of this too :\

Page at a time inside the gup loop could address both concerns, unsure
about batching performance here though..

Jason
Pasha Tatashin Dec. 3, 2020, 5:14 p.m. UTC | #10
On Thu, Dec 3, 2020 at 11:59 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 03, 2020 at 11:40:15AM -0500, Pavel Tatashin wrote:
> > > Looking at this code some more.. How is it even correct?
> > >
> > > 1633                            if (!isolate_lru_page(head)) {
> > > 1634                                    list_add_tail(&head->lru, &cma_page_list);
> > >
> > > Here we are only running under the read side of the mmap sem so multiple
> > > GUPs can be calling that sequence in parallel. I don't see any
> > > obvious exclusion that will prevent corruption of head->lru. The first
> > > GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> > > GUP thread will be a NOP for isolate_lru_page().
> > >
> > > They will both race list_add_tail and other list ops. That is not OK.
> >
> > Good question. I studied it, and I do not see how this is OK. Worse,
> > this race is also exposable as a syscall instead of via driver: two
> > move_pages() run simultaneously. Perhaps in other places?
> >
> > move_pages()
> >   kernel_move_pages()
> >     mmget()
> >     do_pages_move()
> >       add_page_for_migratio()
> >          mmap_read_lock(mm);
> >          list_add_tail(&head->lru, pagelist); <- Not protected
>
> When this was CMA only it might have been rarer to trigger, but this
> move stuff sounds like it makes it much more broadly, eg on typical
> servers with RDMA exposed/etc
>
> Seems like it needs fixing as part of this too :\

Just to clarify: the stack that I showed above is outside of gup; it is
the same issue that you pointed out, just happening elsewhere. I suspect
there might be more. All of them should be addressed together.

Pasha

>
> Page at a time inside the gup loop could address both concerns, unsure
> about batching performance here though..
>
> Jason
Pasha Tatashin Dec. 3, 2020, 7:15 p.m. UTC | #11
> > > > Looking at this code some more.. How is it even correct?
> > > >
> > > > 1633                            if (!isolate_lru_page(head)) {
> > > > 1634                                    list_add_tail(&head->lru, &cma_page_list);
> > > >
> > > > Here we are only running under the read side of the mmap sem so multiple
> > > > GUPs can be calling that sequence in parallel. I don't see any
> > > > obvious exclusion that will prevent corruption of head->lru. The first
> > > > GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> > > > GUP thread will be a NOP for isolate_lru_page().
> > > >
> > > > They will both race list_add_tail and other list ops. That is not OK.
> > >
> > > Good question. I studied it, and I do not see how this is OK. Worse,
> > > this race is also exposable as a syscall instead of via driver: two
> > > move_pages() run simultaneously. Perhaps in other places?
> > >
> > > move_pages()
> > >   kernel_move_pages()
> > >     mmget()
> > >     do_pages_move()
> > >       add_page_for_migratio()
> > >          mmap_read_lock(mm);
> > >          list_add_tail(&head->lru, pagelist); <- Not protected
> >
> > When this was CMA only it might have been rarer to trigger, but this
> > move stuff sounds like it makes it much more broadly, eg on typical
> > servers with RDMA exposed/etc
> >
> > Seems like it needs fixing as part of this too :\
>
> Just to clarify the stack that I showed above is outside of gup, it is
> the same issue that you pointed out that happens elsewhere. I suspect
> there might be more. All of them should be addressed together.

Hi Jason,

I studied it some more, and I think this is not a race:
list_add_tail(&head->lru, &cma_page_list) is called only when
isolate_lru_page(head) succeeds.
isolate_lru_page(head) succeeds only when PageLRU(head) is true.
However, in this function we also clear the LRU flag before returning
success.
This means that if we race with another thread, the other thread
won't get to the unprotected list_add_tail(&head->lru, &cma_page_list)
until head is back on the LRU list.
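
Paraphrasing the relevant part of isolate_lru_page() (details trimmed,
not the literal code):

    if (PageLRU(page)) {
            spin_lock_irq(&pgdat->lru_lock);
            if (PageLRU(page)) {            /* re-check under the lru lock */
                    get_page(page);
                    ClearPageLRU(page);     /* racing callers now see !PageLRU and fail */
                    /* ... delete page from the LRU list ... */
                    ret = 0;                /* only this caller reaches list_add_tail() */
            }
            spin_unlock_irq(&pgdat->lru_lock);
    }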

Please let me know if I am missing anything.

Thank you,
Pasha
Jason Gunthorpe Dec. 3, 2020, 7:36 p.m. UTC | #12
On Thu, Dec 03, 2020 at 02:15:36PM -0500, Pavel Tatashin wrote:

> I studied some more, and I think this is not a race:
> list_add_tail(&head->lru, &cma_page_list) is called only when
> isolate_lru_page(head) succeeds.
> isolate_lru_page(head) succeeds only when PageLRU(head) is true.
> However, in this function we also clear LRU flag before returning
> success.
> This means, that if we race with another thread, the other thread
> won't get to unprotected list_add_tail(&head->lru, &cma_page_list)
> until head is is back on LRU list.

Oh interesting, I totally didn't see how that LRU stuff is
working. So.. this creates a ridiculously expensive spin lock? Not
broken, but yikes :|

Jason
Joonsoo Kim Dec. 4, 2020, 4:13 a.m. UTC | #13
On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> allocated before pinning they need to migrated to a different zone.
> Currently, we migrate movable CMA pages only. Generalize the function
> that migrates CMA pages to migrate all movable pages.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/migrate.h        |  1 +
>  include/trace/events/migrate.h |  3 +-
>  mm/gup.c                       | 56 +++++++++++++---------------------
>  3 files changed, 24 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0f8d1583fa8e..00bab23d1ee5 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -27,6 +27,7 @@ enum migrate_reason {
>  	MR_MEMPOLICY_MBIND,
>  	MR_NUMA_MISPLACED,
>  	MR_CONTIG_RANGE,
> +	MR_LONGTERM_PIN,
>  	MR_TYPES
>  };
>  
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 4d434398d64d..363b54ce104c 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -20,7 +20,8 @@
>  	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>  	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>  	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
> -	EMe(MR_CONTIG_RANGE,	"contig_range")
> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
> +	EMe(MR_LONGTERM_PIN,	"longterm_pin")
>  
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/gup.c b/mm/gup.c
> index 724d8a65e1df..1d511f65f8a7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
>  }
>  #endif
>  
> -#ifdef CONFIG_CMA
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> -					unsigned long start,
> -					unsigned long nr_pages,
> -					struct page **pages,
> -					struct vm_area_struct **vmas,
> -					unsigned int gup_flags)
> +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> +					    unsigned long start,
> +					    unsigned long nr_pages,
> +					    struct page **pages,
> +					    struct vm_area_struct **vmas,
> +					    unsigned int gup_flags)
>  {
>  	unsigned long i;
>  	unsigned long step;
>  	bool drain_allow = true;
>  	bool migrate_allow = true;
> -	LIST_HEAD(cma_page_list);
> +	LIST_HEAD(page_list);
>  	long ret = nr_pages;
>  	struct migration_target_control mtc = {
>  		.nid = NUMA_NO_NODE,
> @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>  		 */
>  		step = compound_nr(head) - (pages[i] - head);
>  		/*
> -		 * If we get a page from the CMA zone, since we are going to
> -		 * be pinning these entries, we might as well move them out
> -		 * of the CMA zone if possible.
> +		 * If we get a movable page, since we are going to be pinning
> +		 * these entries, try to move them out if possible.
>  		 */
> -		if (is_migrate_cma_page(head)) {
> +		if (is_migrate_movable(get_pageblock_migratetype(head))) {

is_migrate_movable() isn't a check for the ZONE; it's a check for the
MIGRATE_TYPE. The migrate type doesn't carry a hard guarantee for
migration, and most memory, including ZONE_NORMAL, is
MIGRATE_MOVABLE. With this code, long term gup would always fail due
to not enough memory. I think the correct change would be
"is_migrate_cma_page(head) && zone == ZONE_MOVABLE".

Patch #5 also has this problem. Please fix it too.

Thanks.
Pasha Tatashin Dec. 4, 2020, 4:24 p.m. UTC | #14
On Thu, Dec 3, 2020 at 2:36 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 03, 2020 at 02:15:36PM -0500, Pavel Tatashin wrote:
>
> > I studied some more, and I think this is not a race:
> > list_add_tail(&head->lru, &cma_page_list) is called only when
> > isolate_lru_page(head) succeeds.
> > isolate_lru_page(head) succeeds only when PageLRU(head) is true.
> > However, in this function we also clear LRU flag before returning
> > success.
> > This means, that if we race with another thread, the other thread
> > won't get to unprotected list_add_tail(&head->lru, &cma_page_list)
> > until head is is back on LRU list.
>
> Oh interesting, I totally didn't see how that LRU stuff is
> working. So.. this creates a ridiculously expensive spin lock? Not
> broken, but yikes :|

Not really a spin lock; the second thread won't be able to isolate
this page and will simply skip migrating it.

>
> Jason
Jason Gunthorpe Dec. 4, 2020, 5:06 p.m. UTC | #15
On Fri, Dec 04, 2020 at 11:24:56AM -0500, Pavel Tatashin wrote:
> On Thu, Dec 3, 2020 at 2:36 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Dec 03, 2020 at 02:15:36PM -0500, Pavel Tatashin wrote:
> >
> > > I studied some more, and I think this is not a race:
> > > list_add_tail(&head->lru, &cma_page_list) is called only when
> > > isolate_lru_page(head) succeeds.
> > > isolate_lru_page(head) succeeds only when PageLRU(head) is true.
> > > However, in this function we also clear LRU flag before returning
> > > success.
> > > This means, that if we race with another thread, the other thread
> > > won't get to unprotected list_add_tail(&head->lru, &cma_page_list)
> > > until head is is back on LRU list.
> >
> > Oh interesting, I totally didn't see how that LRU stuff is
> > working. So.. this creates a ridiculously expensive spin lock? Not
> > broken, but yikes :|
> 
> Not really a spin lock, the second thread won't be able to isolate
> this page, and will skip migration of this page.

It looks like the intent is that it will call gup again, then goto
check_again, and once again try to isolate the LRU. ie it loops.

If it gets to a point where all the CMA pages fail to isolate then it
simply exits with success as the cma_page_list will be empty.
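
Spelling out the flow as I read the code (pseudo-code, condensed from
the patch at the bottom of this page):

    check_again:
            /* scan pages[], isolate CMA pages onto cma_page_list;
               pages that fail isolate_lru_page() are silently skipped */
            if (!list_empty(&cma_page_list)) {
                    /* drop the gup references */
                    if (migrate_pages(...))
                            migrate_allow = false;  /* partial failure: stop retrying */
                    ret = __get_user_pages_locked(...);     /* re-pin everything */
                    if (ret > 0 && migrate_allow)
                            goto check_again;
            }
            return ret;     /* possibly still holding pages we meant to migrate */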

Is this a bug? It seems like a bug: the invariant here is to not
return with a CMA page, so why do we have a path that does return with
a CMA page?

Jason
Pasha Tatashin Dec. 4, 2020, 5:43 p.m. UTC | #16
On Thu, Dec 3, 2020 at 11:14 PM Joonsoo Kim <js1304@gmail.com> wrote:
>
> On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> > We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> > allocated before pinning they need to migrated to a different zone.
> > Currently, we migrate movable CMA pages only. Generalize the function
> > that migrates CMA pages to migrate all movable pages.
> >
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  include/linux/migrate.h        |  1 +
> >  include/trace/events/migrate.h |  3 +-
> >  mm/gup.c                       | 56 +++++++++++++---------------------
> >  3 files changed, 24 insertions(+), 36 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 0f8d1583fa8e..00bab23d1ee5 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -27,6 +27,7 @@ enum migrate_reason {
> >       MR_MEMPOLICY_MBIND,
> >       MR_NUMA_MISPLACED,
> >       MR_CONTIG_RANGE,
> > +     MR_LONGTERM_PIN,
> >       MR_TYPES
> >  };
> >
> > diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> > index 4d434398d64d..363b54ce104c 100644
> > --- a/include/trace/events/migrate.h
> > +++ b/include/trace/events/migrate.h
> > @@ -20,7 +20,8 @@
> >       EM( MR_SYSCALL,         "syscall_or_cpuset")            \
> >       EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")              \
> >       EM( MR_NUMA_MISPLACED,  "numa_misplaced")               \
> > -     EMe(MR_CONTIG_RANGE,    "contig_range")
> > +     EM( MR_CONTIG_RANGE,    "contig_range")                 \
> > +     EMe(MR_LONGTERM_PIN,    "longterm_pin")
> >
> >  /*
> >   * First define the enums in the above macros to be exported to userspace
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 724d8a65e1df..1d511f65f8a7 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> >  }
> >  #endif
> >
> > -#ifdef CONFIG_CMA
> > -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > -                                     unsigned long start,
> > -                                     unsigned long nr_pages,
> > -                                     struct page **pages,
> > -                                     struct vm_area_struct **vmas,
> > -                                     unsigned int gup_flags)
> > +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> > +                                         unsigned long start,
> > +                                         unsigned long nr_pages,
> > +                                         struct page **pages,
> > +                                         struct vm_area_struct **vmas,
> > +                                         unsigned int gup_flags)
> >  {
> >       unsigned long i;
> >       unsigned long step;
> >       bool drain_allow = true;
> >       bool migrate_allow = true;
> > -     LIST_HEAD(cma_page_list);
> > +     LIST_HEAD(page_list);
> >       long ret = nr_pages;
> >       struct migration_target_control mtc = {
> >               .nid = NUMA_NO_NODE,
> > @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> >                */
> >               step = compound_nr(head) - (pages[i] - head);
> >               /*
> > -              * If we get a page from the CMA zone, since we are going to
> > -              * be pinning these entries, we might as well move them out
> > -              * of the CMA zone if possible.
> > +              * If we get a movable page, since we are going to be pinning
> > +              * these entries, try to move them out if possible.
> >                */
> > -             if (is_migrate_cma_page(head)) {
> > +             if (is_migrate_movable(get_pageblock_migratetype(head))) {
>
> is_migrate_movable() isn't a check for the ZONE. It's a check for the
> MIGRATE_TYPE. MIGRATE_TYPE doesn't require hard guarantee for
> migration, and, most of memory, including ZONE_NORMAL, is
> MIGRATE_MOVABLE. With this code, long term gup would always fails due
> to not enough memory. I think that correct change would be
> "is_migrate_cma_page(hear) && zone == ZONE_MOVABLE".

Good point. The above should be OR not AND.

zone_idx(page_zone(head)) == ZONE_MOVABLE || is_migrate_cma_page(head)
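
A small helper along these lines could then be shared by the slow and
fast gup paths; the name is just a placeholder:

    static inline bool page_is_pinnable(struct page *page)
    {
            return !(zone_idx(page_zone(page)) == ZONE_MOVABLE ||
                     is_migrate_cma_page(page));
    }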

Pasha
Daniel Jordan Dec. 4, 2020, 8:05 p.m. UTC | #17
Jason Gunthorpe <jgg@ziepe.ca> writes:

> On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
>> What I meant is the users of the interface do it incrementally not in
>> large chunks. For example:
>> 
>> vfio_pin_pages_remote
>>    vaddr_get_pfn
>>         ret = pin_user_pages_remote(mm, vaddr, 1, flags |
>> FOLL_LONGTERM, page, NULL, NULL);
>> 1 -> pin only one pages at a time
>
> I don't know why vfio does this, it is why it so ridiculously slow at
> least.

Well Alex can correct me, but I went digging and a comment from the
first type1 vfio commit says the iommu API didn't promise to unmap
subpages of previous mappings, so doing a page at a time gave
flexibility at the cost of inefficiency.

Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
vfio kept pinning a page at a time.  I couldn't find an explanation for
why that stayed the same.

Yesterday I tried optimizing vfio to skip gup calls for tail pages after
Matthew pointed out this same issue to me by coincidence last week.
Currently debugging, but if there's a fundamental reason this won't work
on the vfio side, it'd be nice to know.
Pasha Tatashin Dec. 4, 2020, 8:16 p.m. UTC | #18
On Fri, Dec 4, 2020 at 3:06 PM Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
>
> Jason Gunthorpe <jgg@ziepe.ca> writes:
>
> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
> >> What I meant is the users of the interface do it incrementally not in
> >> large chunks. For example:
> >>
> >> vfio_pin_pages_remote
> >>    vaddr_get_pfn
> >>         ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> >> FOLL_LONGTERM, page, NULL, NULL);
> >> 1 -> pin only one pages at a time
> >
> > I don't know why vfio does this, it is why it so ridiculously slow at
> > least.
>
> Well Alex can correct me, but I went digging and a comment from the
> first type1 vfio commit says the iommu API didn't promise to unmap
> subpages of previous mappings, so doing page at a time gave flexibility
> at the cost of inefficiency.
>
> Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
> vfio kept pinning pages at a time.  I couldn't find an explanation for
> why that stayed the same.
>
> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
> Matthew pointed out this same issue to me by coincidence last week.
> Currently debugging, but if there's a fundamental reason this won't work
> on the vfio side, it'd be nice to know.

Hi Daniel,

I do not think there are any fundamental reasons why it won't work. I
have also been thinking about increasing VFIO chunking for a different
reason:

If a client touches pages before doing a VFIO DMA map, those pages
might be huge, and pinning and migrating one small page at a time can
break up the huge pages. So, it is not only inefficient to pin this
way, but it can also inadvertently slow down the runtime.

Thank you,
Pasha
Jason Gunthorpe Dec. 4, 2020, 8:52 p.m. UTC | #19
On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
> Jason Gunthorpe <jgg@ziepe.ca> writes:
> 
> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
> >> What I meant is the users of the interface do it incrementally not in
> >> large chunks. For example:
> >> 
> >> vfio_pin_pages_remote
> >>    vaddr_get_pfn
> >>         ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> >> FOLL_LONGTERM, page, NULL, NULL);
> >> 1 -> pin only one pages at a time
> >
> > I don't know why vfio does this, it is why it so ridiculously slow at
> > least.
> 
> Well Alex can correct me, but I went digging and a comment from the
> first type1 vfio commit says the iommu API didn't promise to unmap
> subpages of previous mappings, so doing page at a time gave flexibility
> at the cost of inefficiency.

iommu restrictions are not related to gup. vfio needs to get the
page list from the page tables as efficiently as possible, then
break it up into what you want to feed into the IOMMU, however the
iommu wants it.

vfio must maintain a page list to call unpin_user_pages() anyhow, so
it makes a lot of sense to assemble the page list up front, then do the
iommu work, instead of trying to do both things a page at a time.

It would be smart to rebuild vfio to use scatter lists to store the
page list and then break the sgl into pages for iommu
configuration. SGLs will consume a lot less memory for the usual case
of THPs backing the VFIO registrations.

ib_umem_get() has an example of how to code this; I've been thinking
we could make this a common API, and it could be further optimized.
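
An untested sketch of the shape I mean; pin_range_to_sgt() is made up,
and the partial-pin/error handling is simplified for illustration:

    #include <linux/mm.h>
    #include <linux/scatterlist.h>
    #include <linux/slab.h>

    /* Pin a whole user range in one call and build an SGL from the result. */
    static int pin_range_to_sgt(struct mm_struct *mm, unsigned long vaddr,
                                unsigned long npages, struct sg_table *sgt)
    {
            struct page **pages;
            long pinned;
            int ret;

            pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
            if (!pages)
                    return -ENOMEM;

            pinned = pin_user_pages_remote(mm, vaddr, npages,
                                           FOLL_WRITE | FOLL_LONGTERM,
                                           pages, NULL, NULL);
            if (pinned <= 0) {
                    ret = pinned ? pinned : -EFAULT;
                    goto out;
            }

            /* Physically contiguous pages (eg THP tails) collapse into one entry. */
            ret = sg_alloc_table_from_pages(sgt, pages, pinned, 0,
                                            (unsigned long)pinned << PAGE_SHIFT,
                                            GFP_KERNEL);
            if (ret)
                    unpin_user_pages(pages, pinned);
    out:
            kvfree(pages);
            return ret;
    }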

> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
> Matthew pointed out this same issue to me by coincidence last week.

Please don't just hack up vfio like this. Everyone needs faster gup;
we really need to solve this in the core code. Plus this is tricky:
vfio is already using follow_pfn wrongly, and drivers should not be
open coding MM stuff.

> Currently debugging, but if there's a fundamental reason this won't work
> on the vfio side, it'd be nice to know.

AFAIK there is no guarantee that just because you see a compound head,
the remaining pages in the page tables are actually the tail
pages. This is only true sometimes, for instance if an entire huge
page is placed in a page table level.

I believe Ralph pointed to some case where we might break a huge page
from PMD to PTEs and then later COW one of the PTEs. In this case the
compound head will be visible but the page map will be non-contiguous
and the page flags on each 4k entry will be different.

Only GUP's page walkers know that the compound page is actually at a
PMD level and can safely apply the 'everything is the same'
optimization.

The solution here is to make core gup faster, especially for the cases
where it is returning huge pages. We can approach this by:
 - Batching the compound & tail page acquisition for higher page
   levels, eg gup fast does this already, look at record_subpages()
   (roughly reproduced below); gup slow needs it too
 - Batching unpin for compound & tail page, the opposite of the 'refs'
   arg for try_grab_compound_head()
 - Devise some API where get_user_pages can directly return
   contiguous groups of pages to avoid memory traffic
 - Reduce the cost of a FOLL_LONGTERM pin eg here is a start:
    https://lore.kernel.org/linux-mm/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com
   And CMA should get some similar treatment. Scanning the output page
   list multiple times is slow.
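
For reference, record_subpages() in the gup fast path is roughly:

    /* gup fast: fill in all tail pages of a huge mapping in one go */
    static int record_subpages(struct page *page, unsigned long addr,
                               unsigned long end, struct page **pages)
    {
            int nr;

            for (nr = 0; addr != end; addr += PAGE_SIZE)
                    pages[nr++] = page++;

            return nr;
    }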

I would like to get to a point where the main GUP walker functions can
output in more formats than just a page array. For instance, directly
constructing and chaining a biovec or sgl would dramatically improve
performance and decrease memory consumption. Being able to write in
hmm_range_fault's pfn&flags output format would delete a whole bunch
of duplicated code.

Jason
Joonsoo Kim Dec. 7, 2020, 7:13 a.m. UTC | #20
On Fri, Dec 04, 2020 at 12:43:29PM -0500, Pavel Tatashin wrote:
> On Thu, Dec 3, 2020 at 11:14 PM Joonsoo Kim <js1304@gmail.com> wrote:
> >
> > On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> > > We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> > > allocated before pinning they need to migrated to a different zone.
> > > Currently, we migrate movable CMA pages only. Generalize the function
> > > that migrates CMA pages to migrate all movable pages.
> > >
> > > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > > ---
> > >  include/linux/migrate.h        |  1 +
> > >  include/trace/events/migrate.h |  3 +-
> > >  mm/gup.c                       | 56 +++++++++++++---------------------
> > >  3 files changed, 24 insertions(+), 36 deletions(-)
> > >
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 0f8d1583fa8e..00bab23d1ee5 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -27,6 +27,7 @@ enum migrate_reason {
> > >       MR_MEMPOLICY_MBIND,
> > >       MR_NUMA_MISPLACED,
> > >       MR_CONTIG_RANGE,
> > > +     MR_LONGTERM_PIN,
> > >       MR_TYPES
> > >  };
> > >
> > > diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> > > index 4d434398d64d..363b54ce104c 100644
> > > --- a/include/trace/events/migrate.h
> > > +++ b/include/trace/events/migrate.h
> > > @@ -20,7 +20,8 @@
> > >       EM( MR_SYSCALL,         "syscall_or_cpuset")            \
> > >       EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")              \
> > >       EM( MR_NUMA_MISPLACED,  "numa_misplaced")               \
> > > -     EMe(MR_CONTIG_RANGE,    "contig_range")
> > > +     EM( MR_CONTIG_RANGE,    "contig_range")                 \
> > > +     EMe(MR_LONGTERM_PIN,    "longterm_pin")
> > >
> > >  /*
> > >   * First define the enums in the above macros to be exported to userspace
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 724d8a65e1df..1d511f65f8a7 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> > >  }
> > >  #endif
> > >
> > > -#ifdef CONFIG_CMA
> > > -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > -                                     unsigned long start,
> > > -                                     unsigned long nr_pages,
> > > -                                     struct page **pages,
> > > -                                     struct vm_area_struct **vmas,
> > > -                                     unsigned int gup_flags)
> > > +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> > > +                                         unsigned long start,
> > > +                                         unsigned long nr_pages,
> > > +                                         struct page **pages,
> > > +                                         struct vm_area_struct **vmas,
> > > +                                         unsigned int gup_flags)
> > >  {
> > >       unsigned long i;
> > >       unsigned long step;
> > >       bool drain_allow = true;
> > >       bool migrate_allow = true;
> > > -     LIST_HEAD(cma_page_list);
> > > +     LIST_HEAD(page_list);
> > >       long ret = nr_pages;
> > >       struct migration_target_control mtc = {
> > >               .nid = NUMA_NO_NODE,
> > > @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > >                */
> > >               step = compound_nr(head) - (pages[i] - head);
> > >               /*
> > > -              * If we get a page from the CMA zone, since we are going to
> > > -              * be pinning these entries, we might as well move them out
> > > -              * of the CMA zone if possible.
> > > +              * If we get a movable page, since we are going to be pinning
> > > +              * these entries, try to move them out if possible.
> > >                */
> > > -             if (is_migrate_cma_page(head)) {
> > > +             if (is_migrate_movable(get_pageblock_migratetype(head))) {
> >
> > is_migrate_movable() isn't a check for the ZONE. It's a check for the
> > MIGRATE_TYPE. MIGRATE_TYPE doesn't require hard guarantee for
> > migration, and, most of memory, including ZONE_NORMAL, is
> > MIGRATE_MOVABLE. With this code, long term gup would always fails due
> > to not enough memory. I think that correct change would be
> > "is_migrate_cma_page(hear) && zone == ZONE_MOVABLE".
> 
> Good point. The above should be OR not AND.
> 
> zone_idx(page_zone(head)) == ZONE_MOVABLE || is_migrate_cma_page(hear)

Yep!

Thanks.
Daniel Jordan Dec. 8, 2020, 2:27 a.m. UTC | #21
Pavel Tatashin <pasha.tatashin@soleen.com> writes:

> On Fri, Dec 4, 2020 at 3:06 PM Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
>>
>> Jason Gunthorpe <jgg@ziepe.ca> writes:
>>
>> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
>> >> What I meant is the users of the interface do it incrementally not in
>> >> large chunks. For example:
>> >>
>> >> vfio_pin_pages_remote
>> >>    vaddr_get_pfn
>> >>         ret = pin_user_pages_remote(mm, vaddr, 1, flags |
>> >> FOLL_LONGTERM, page, NULL, NULL);
>> >> 1 -> pin only one pages at a time
>> >
>> > I don't know why vfio does this, it is why it so ridiculously slow at
>> > least.
>>
>> Well Alex can correct me, but I went digging and a comment from the
>> first type1 vfio commit says the iommu API didn't promise to unmap
>> subpages of previous mappings, so doing page at a time gave flexibility
>> at the cost of inefficiency.
>>
>> Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
>> vfio kept pinning pages at a time.  I couldn't find an explanation for
>> why that stayed the same.
>>
>> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
>> Matthew pointed out this same issue to me by coincidence last week.
>> Currently debugging, but if there's a fundamental reason this won't work
>> on the vfio side, it'd be nice to know.
>
> Hi Daniel,
>
> I do not think there are any fundamental reasons why it won't work. I
> have also thinking increasing VFIO chunking for a different reason:
>
> If a client touches pages before doing a VFIO DMA map, those pages
> might be huge, and pinning a small page at a time and migrating a
> small page at a time can break-up the huge pages. So, it is not only
> inefficient to pin, but it can also inadvertently slow down the
> runtime.

Hi Pasha,

I see, and I'm curious, do you experience this case where a user has
touched the pages before doing a VFIO DMA map, and if so where?

The usual situation on my side is that the pages are faulted in during
pinning.

Daniel
Daniel Jordan Dec. 8, 2020, 2:48 a.m. UTC | #22
Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
>> Well Alex can correct me, but I went digging and a comment from the
>> first type1 vfio commit says the iommu API didn't promise to unmap
>> subpages of previous mappings, so doing page at a time gave flexibility
>> at the cost of inefficiency.
>
> iommu restrictions are not related to with gup. vfio needs to get the
> page list from the page tables as efficiently as possible, then you
> break it up into what you want to feed into the IOMMU how the iommu
> wants.
>
> vfio must maintain a page list to call unpin_user_pages() anyhow, so

It does in some cases but not others, namely the expensive
VFIO_IOMMU_MAP_DMA/UNMAP_DMA path where the iommu page tables are used
to find the pfns when unpinning.

I don't see why vfio couldn't do as you say, though, and the worst case
memory overhead of using scatterlist to remember the pfns of a 300g VM
backed by huge but physically discontiguous pages is only a few meg, not
bad at all.
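
(Back of the envelope, assuming 2M THP backing and roughly 32-48 bytes
per struct scatterlist entry: 300G / 2M is about 150k entries, ie on
the order of 5-7M of scatterlist.)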

> it makes alot of sense to assemble the page list up front, then do the
> iommu, instead of trying to do both things page at a time.
>
> It would be smart to rebuild vfio to use scatter lists to store the
> page list and then break the sgl into pages for iommu
> configuration. SGLs will consume alot less memory for the usual case
> of THPs backing the VFIO registrations.
>
> ib_umem_get() has some example of how to code this, I've been thinking
> we could make this some common API, and it could be further optimized.

Agreed, great suggestions, both above and in the rest of your response.

>> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
>> Matthew pointed out this same issue to me by coincidence last week.
>
> Please don't just hack up vfio like this.

Yeah, you've cured me of that idea.  I'll see where I get experimenting
with some of this stuff.
Jason Gunthorpe Dec. 8, 2020, 1:24 p.m. UTC | #23
On Mon, Dec 07, 2020 at 09:48:48PM -0500, Daniel Jordan wrote:
> Jason Gunthorpe <jgg@ziepe.ca> writes:
> > On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
> >> Well Alex can correct me, but I went digging and a comment from the
> >> first type1 vfio commit says the iommu API didn't promise to unmap
> >> subpages of previous mappings, so doing page at a time gave flexibility
> >> at the cost of inefficiency.
> >
> > iommu restrictions are not related to with gup. vfio needs to get the
> > page list from the page tables as efficiently as possible, then you
> > break it up into what you want to feed into the IOMMU how the iommu
> > wants.
> >
> > vfio must maintain a page list to call unpin_user_pages() anyhow, so
> 
> It does in some cases but not others, namely the expensive
> VFIO_IOMMU_MAP_DMA/UNMAP_DMA path where the iommu page tables are used
> to find the pfns when unpinning.

Oh, I see.. Well, that is still possible, but vfio really needs to
batch operations: eg call pin_user_pages() with some larger buffer,
store those into the iommu, and then reverse this to build up
contiguous runs of pages to unpin.

> I don't see why vfio couldn't do as you say, though, and the worst case
> memory overhead of using scatterlist to remember the pfns of a 300g VM
> backed by huge but physically discontiguous pages is only a few meg, not
> bad at all.

Yes, but 0 is still better.. I would start by focusing on batching
pin_user_pages.

Jason

Patch

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0f8d1583fa8e..00bab23d1ee5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@  enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_LONGTERM_PIN,
 	MR_TYPES
 };
 
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@ 
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_LONGTERM_PIN,	"longterm_pin")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index 724d8a65e1df..1d511f65f8a7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1593,19 +1593,18 @@  static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
 }
 #endif
 
-#ifdef CONFIG_CMA
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
-					unsigned long start,
-					unsigned long nr_pages,
-					struct page **pages,
-					struct vm_area_struct **vmas,
-					unsigned int gup_flags)
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+					    unsigned long start,
+					    unsigned long nr_pages,
+					    struct page **pages,
+					    struct vm_area_struct **vmas,
+					    unsigned int gup_flags)
 {
 	unsigned long i;
 	unsigned long step;
 	bool drain_allow = true;
 	bool migrate_allow = true;
-	LIST_HEAD(cma_page_list);
+	LIST_HEAD(page_list);
 	long ret = nr_pages;
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
@@ -1623,13 +1622,12 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 		 */
 		step = compound_nr(head) - (pages[i] - head);
 		/*
-		 * If we get a page from the CMA zone, since we are going to
-		 * be pinning these entries, we might as well move them out
-		 * of the CMA zone if possible.
+		 * If we get a movable page, since we are going to be pinning
+		 * these entries, try to move them out if possible.
 		 */
-		if (is_migrate_cma_page(head)) {
+		if (is_migrate_movable(get_pageblock_migratetype(head))) {
 			if (PageHuge(head))
-				isolate_huge_page(head, &cma_page_list);
+				isolate_huge_page(head, &page_list);
 			else {
 				if (!PageLRU(head) && drain_allow) {
 					lru_add_drain_all();
@@ -1637,7 +1635,7 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 				}
 
 				if (!isolate_lru_page(head)) {
-					list_add_tail(&head->lru, &cma_page_list);
+					list_add_tail(&head->lru, &page_list);
 					mod_node_page_state(page_pgdat(head),
 							    NR_ISOLATED_ANON +
 							    page_is_file_lru(head),
@@ -1649,7 +1647,7 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 		i += step;
 	}
 
-	if (!list_empty(&cma_page_list)) {
+	if (!list_empty(&page_list)) {
 		/*
 		 * drop the above get_user_pages reference.
 		 */
@@ -1659,7 +1657,7 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 			for (i = 0; i < nr_pages; i++)
 				put_page(pages[i]);
 
-		if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
+		if (migrate_pages(&page_list, alloc_migration_target, NULL,
 			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
 			/*
 			 * some of the pages failed migration. Do get_user_pages
@@ -1667,17 +1665,16 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 			 */
 			migrate_allow = false;
 
-			if (!list_empty(&cma_page_list))
-				putback_movable_pages(&cma_page_list);
+			if (!list_empty(&page_list))
+				putback_movable_pages(&page_list);
 		}
 		/*
 		 * We did migrate all the pages, Try to get the page references
-		 * again migrating any new CMA pages which we failed to isolate
-		 * earlier.
+		 * again migrating any pages which we failed to isolate earlier.
 		 */
 		ret = __get_user_pages_locked(mm, start, nr_pages,
-						   pages, vmas, NULL,
-						   gup_flags);
+					      pages, vmas, NULL,
+					      gup_flags);
 
 		if ((ret > 0) && migrate_allow) {
 			nr_pages = ret;
@@ -1688,17 +1685,6 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 
 	return ret;
 }
-#else
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
-					unsigned long start,
-					unsigned long nr_pages,
-					struct page **pages,
-					struct vm_area_struct **vmas,
-					unsigned int gup_flags)
-{
-	return nr_pages;
-}
-#endif /* CONFIG_CMA */
 
 /*
  * __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
@@ -1746,8 +1732,8 @@  static long __gup_longterm_locked(struct mm_struct *mm,
 			goto out;
 		}
 
-		rc = check_and_migrate_cma_pages(mm, start, rc, pages,
-						 vmas_tmp, gup_flags);
+		rc = check_and_migrate_movable_pages(mm, start, rc, pages,
+						     vmas_tmp, gup_flags);
 out:
 		memalloc_nomovable_restore(flags);
 	}