
[v4,04/11] mm/hugetlb: make hugetlb migration callback CMA aware

Message ID 1594107889-32228-5-git-send-email-iamjoonsoo.kim@lge.com (mailing list archive)
State New, archived
Series clean-up the migration target allocation functions

Commit Message

Joonsoo Kim July 7, 2020, 7:44 a.m. UTC
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

new_non_cma_page() in gup.c, which tries to allocate a migration target
page, must allocate a new page that is not in the CMA area.
new_non_cma_page() implements this by removing the __GFP_MOVABLE flag.
This works well for THP and normal pages, but not for hugetlb pages.

The hugetlb page allocation process consists of two steps.  The first is
dequeueing from the pool.  The second, if no page is available in the
pool, is allocating from the page allocator.

new_non_cma_page() can control allocation from the page allocator by
specifying the correct gfp flags.  However, dequeueing could not be
controlled until now, so new_non_cma_page() skips dequeueing completely.
This is suboptimal, since new_non_cma_page() cannot utilize hugetlb pages
already sitting in the pool, so this patch tries to fix this situation.

This patch makes the hugetlb dequeue function CMA aware: it skips CMA
pages when the newly added skip_cma argument is passed as true.

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/hugetlb.h |  6 ++----
 mm/gup.c                |  3 ++-
 mm/hugetlb.c            | 46 ++++++++++++++++++++++++++++++----------------
 mm/mempolicy.c          |  2 +-
 mm/migrate.c            |  2 +-
 5 files changed, 36 insertions(+), 23 deletions(-)
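
To make the two-step flow described above concrete, here is a minimal
sketch of the migration callback as it looks before this patch.  It is
essentially the pre-patch body of the function touched by the diff below,
condensed and with comments added, so read it as an illustration rather
than a substitute for the actual code:

/* Condensed sketch of the pre-patch hugetlb migration callback. */
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
		nodemask_t *nmask, gfp_t gfp_mask)
{
	/*
	 * Step 1: dequeue a pre-allocated page from the hugetlb pool.
	 * Pool pages may sit in a CMA region, and this step does not
	 * consult gfp_mask, which is why new_non_cma_page() used to
	 * bypass it entirely.
	 */
	spin_lock(&hugetlb_lock);
	if (h->free_huge_pages - h->resv_huge_pages > 0) {
		struct page *page;

		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
		if (page) {
			spin_unlock(&hugetlb_lock);
			return page;
		}
	}
	spin_unlock(&hugetlb_lock);

	/* Step 2: fall back to the page allocator, which honours gfp_mask. */
	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
}

The patch below threads a skip_cma argument into step 1 so that the
dequeue path can avoid CMA-backed pool pages.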

Comments

Vlastimil Babka July 7, 2020, 11:22 a.m. UTC | #1
On 7/7/20 9:44 AM, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> new_non_cma_page() in gup.c which try to allocate migration target page
> requires to allocate the new page that is not on the CMA area.
> new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> works well for THP page or normal page but not for hugetlb page.
> 
> hugetlb page allocation process consists of two steps.  First is dequeing
> from the pool.  Second is, if there is no available page on the queue,
> allocating from the page allocator.
> 
> new_non_cma_page() can control allocation from the page allocator by
> specifying correct gfp flag.  However, dequeing cannot be controlled until
> now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> patch tries to fix this situation.
> 
> This patch makes the deque function on hugetlb CMA aware and skip CMA
> pages if newly added skip_cma argument is passed as true.

Hmm, can't you instead change dequeue_huge_page_node_exact() to test the PF_
flag and avoid adding bool skip_cma everywhere?

I think that's what Michal suggested [1] except he said "the code already does
by memalloc_nocma_{save,restore} API". It needs extending a bit though, AFAICS.
__gup_longterm_locked() indeed does the save/restore, but the restore comes
before check_and_migrate_cma_pages(), and thus before new_non_cma_page() is
called, so an adjustment is needed there, but that's all?

Hm, the adjustment should also be done because the save/restore is done
around __get_user_pages_locked(), but check_and_migrate_cma_pages() also
calls __get_user_pages_locked(), and that call not being between the nocma
save and restore is thus also a correctness issue?

[1] https://lore.kernel.org/r/20200629075510.GA32461@dhcp22.suse.cz

> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/hugetlb.h |  6 ++----
>  mm/gup.c                |  3 ++-
>  mm/hugetlb.c            | 46 ++++++++++++++++++++++++++++++----------------
>  mm/mempolicy.c          |  2 +-
>  mm/migrate.c            |  2 +-
>  5 files changed, 36 insertions(+), 23 deletions(-)
>
Michal Hocko July 7, 2020, 11:31 a.m. UTC | #2
On Tue 07-07-20 16:44:42, Joonsoo Kim wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> new_non_cma_page() in gup.c which try to allocate migration target page
> requires to allocate the new page that is not on the CMA area.
> new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> works well for THP page or normal page but not for hugetlb page.
> 
> hugetlb page allocation process consists of two steps.  First is dequeing
> from the pool.  Second is, if there is no available page on the queue,
> allocating from the page allocator.
> 
> new_non_cma_page() can control allocation from the page allocator by
> specifying correct gfp flag.  However, dequeing cannot be controlled until
> now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> patch tries to fix this situation.
> 
> This patch makes the deque function on hugetlb CMA aware and skip CMA
> pages if newly added skip_cma argument is passed as true.

I really dislike this as already mentioned in the previous version of
the patch. You are making hugetlb and only one part of its allocator a
special snowflake which is almost always a bad idea. Your changelog
lacks any real justification for this inconsistency.

Also, by doing so you are keeping an existing bug: the hugetlb allocator
doesn't respect scope gfp flags, as I mentioned when reviewing the previous
version. That bug likely doesn't matter now, but it might in the future,
and as soon as it is fixed all this becomes a pointless exercise.

I do not have the energy and time to burn repeating that argumentation, so
I will let Mike have the final word. Btw. you are keeping his acks even
after considerable changes to the patches, which I am not really sure he is
ok with.

> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

To this particular patch.
[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 5daadae..2c3dab4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1630,11 +1630,12 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private)
>  #ifdef CONFIG_HUGETLB_PAGE
>  	if (PageHuge(page)) {
>  		struct hstate *h = page_hstate(page);
> +
>  		/*
>  		 * We don't want to dequeue from the pool because pool pages will
>  		 * mostly be from the CMA region.
>  		 */
> -		return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
> +		return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask, true);

Let me repeat that this whole thing is running under
memalloc_nocma_save. So the additional parameter is bogus.
[...]
> -static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> +static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid, bool skip_cma)

If you really insist on an additional parameter at this layer, then it
should be checking for PF_MEMALLOC_NOCMA instead.

[...]
> @@ -1971,21 +1977,29 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  
>  /* page migration callback function */
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> -		nodemask_t *nmask, gfp_t gfp_mask)
> +		nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma)
>  {
> +	unsigned int flags = 0;
> +	struct page *page = NULL;
> +
> +	if (skip_cma)
> +		flags = memalloc_nocma_save();

This is pointless for a scope that is already defined up in the call
chain and fundamentally this is breaking the expected use of the scope
API. The primary reason for that API to exist is to define the scope and
have it sticky for _all_ allocation contexts. So if you have to use it
deep in the allocator then you are doing something wrong.
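
The usage pattern being described here (establish the scope once in the
outermost caller and let every allocation below it inherit the
restriction) looks roughly like the sketch below; the pinning helper is a
hypothetical placeholder, not a real function:

	unsigned int flags;

	/*
	 * Enter the no-CMA scope at the caller that knows the pages must
	 * not end up in CMA (e.g. a FOLL_LONGTERM pin).  This sets
	 * PF_MEMALLOC_NOCMA on current, so every allocation in the call
	 * chain below (buddy allocator, hugetlb pool dequeue, THP
	 * allocation) can observe it without extra parameters being
	 * threaded through the individual allocator functions.
	 */
	flags = memalloc_nocma_save();

	do_longterm_pin_and_migrate();	/* hypothetical placeholder */

	memalloc_nocma_restore(flags);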
Michal Hocko July 8, 2020, 6:48 a.m. UTC | #3
On Tue 07-07-20 13:31:19, Michal Hocko wrote:
> Btw. you are keeping his acks even
> after considerable changes to patches which I am not really sure he is
> ok with.

I am sorry but I have missed the last email from Mike in v3.
Joonsoo Kim July 8, 2020, 7:12 a.m. UTC | #4
On Tue, Jul 07, 2020 at 01:31:16PM +0200, Michal Hocko wrote:
> On Tue 07-07-20 16:44:42, Joonsoo Kim wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > new_non_cma_page() in gup.c which try to allocate migration target page
> > requires to allocate the new page that is not on the CMA area.
> > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> > works well for THP page or normal page but not for hugetlb page.
> > 
> > hugetlb page allocation process consists of two steps.  First is dequeing
> > from the pool.  Second is, if there is no available page on the queue,
> > allocating from the page allocator.
> > 
> > new_non_cma_page() can control allocation from the page allocator by
> > specifying correct gfp flag.  However, dequeing cannot be controlled until
> > now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> > since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> > patch tries to fix this situation.
> > 
> > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > pages if newly added skip_cma argument is passed as true.
> 
> I really dislike this as already mentioned in the previous version of
> the patch. You are making hugetlb and only one part of its allocator a
> special snowflake which is almost always a bad idea. Your changelog
> lacks any real justification for this inconsistency.
> 
> Also by doing so you are keeping an existing bug that the hugetlb
> allocator doesn't respect scope gfp flags as I have mentioned when
> reviewing the previous version. That bug likely doesn't matter now but
> it might in future and as soon as it is fixed all this is just a
> pointless exercise.
> 
> I do not have energy and time to burn to repeat that argumentation to I
> will let Mike to have a final word. Btw. you are keeping his acks even
> after considerable changes to patches which I am not really sure he is
> ok with.

As you replied a minute ago, Mike acked.

> > Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> To this particular patch.
> [...]
> 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 5daadae..2c3dab4 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1630,11 +1630,12 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private)
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  	if (PageHuge(page)) {
> >  		struct hstate *h = page_hstate(page);
> > +
> >  		/*
> >  		 * We don't want to dequeue from the pool because pool pages will
> >  		 * mostly be from the CMA region.
> >  		 */
> > -		return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
> > +		return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask, true);
> 
> Let me repeat that this whole thing is running under
> memalloc_nocma_save. So additional parameter is bogus.

As Vlastimil said in another reply, we are not under
memalloc_nocma_save(). Anyway, I now also think that the additional
parameter isn't needed.

> [...]
> > -static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > +static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid, bool skip_cma)
> 
> If you really insist on an additional parameter at this layer than it
> should be checking for the PF_MEMALLOC_NOCMA instead.

I will change the patch to check PF_MEMALLOC_NOCMA instead of
introducing and checking skip_cma.

> [...]
> > @@ -1971,21 +1977,29 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
> >  
> >  /* page migration callback function */
> >  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> > -		nodemask_t *nmask, gfp_t gfp_mask)
> > +		nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma)
> >  {
> > +	unsigned int flags = 0;
> > +	struct page *page = NULL;
> > +
> > +	if (skip_cma)
> > +		flags = memalloc_nocma_save();
> 
> This is pointless for a scope that is already defined up in the call
> chain and fundamentally this is breaking the expected use of the scope
> API. The primary reason for that API to exist is to define the scope and
> have it sticky for _all_ allocation contexts. So if you have to use it
> deep in the allocator then you are doing something wrong.

As mentioned above, we are not under memalloc_nocma_save(). Anyway, I
will rework the patch and attach it to Vlastimil's reply. I'd appreciate
it if you could check it again.

Thanks.
Joonsoo Kim July 8, 2020, 7:16 a.m. UTC | #5
On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
> On 7/7/20 9:44 AM, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > new_non_cma_page() in gup.c which try to allocate migration target page
> > requires to allocate the new page that is not on the CMA area.
> > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> > works well for THP page or normal page but not for hugetlb page.
> > 
> > hugetlb page allocation process consists of two steps.  First is dequeing
> > from the pool.  Second is, if there is no available page on the queue,
> > allocating from the page allocator.
> > 
> > new_non_cma_page() can control allocation from the page allocator by
> > specifying correct gfp flag.  However, dequeing cannot be controlled until
> > now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> > since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> > patch tries to fix this situation.
> > 
> > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > pages if newly added skip_cma argument is passed as true.
> 
> Hmm, can't you instead change dequeue_huge_page_node_exact() to test the PF_
> flag and avoid adding bool skip_cma everywhere?

Okay! Please check the following patch.
> 
> I think that's what Michal suggested [1] except he said "the code already does
> by memalloc_nocma_{save,restore} API". It needs extending a bit though, AFAICS.
> __gup_longterm_locked() indeed does the save/restore, but restore comes before
> check_and_migrate_cma_pages() and thus new_non_cma_page() is called, so an
> adjustment is needed there, but that's all?
> 
> Hm the adjustment should be also done because save/restore is done around
> __get_user_pages_locked(), but check_and_migrate_cma_pages() also calls
> __get_user_pages_locked(), and that call not being between nocma save and
> restore is thus also a correctness issue?

Put simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
would not cause any problem.

------------------>8-------------------
From bcfc57e3c6f2df1ad2940308b89d740cd3f0fba8 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date: Wed, 8 Jul 2020 14:39:26 +0900
Subject: [PATCH] mm/hugetlb: make hugetlb migration callback CMA aware

new_non_cma_page() in gup.c, which tries to allocate a migration target
page, must allocate a new page that is not in the CMA area.
new_non_cma_page() implements this by removing the __GFP_MOVABLE flag.
This works well for THP and normal pages, but not for hugetlb pages.

The hugetlb page allocation process consists of two steps.  The first is
dequeueing from the pool.  The second, if no page is available in the
pool, is allocating from the page allocator.

new_non_cma_page() can control allocation from the page allocator by
specifying the correct gfp flags.  However, dequeueing could not be
controlled until now, so new_non_cma_page() skips dequeueing completely.
This is suboptimal, since new_non_cma_page() cannot utilize hugetlb pages
already sitting in the pool, so this patch tries to fix this situation.

This patch makes new_non_cma_page() use memalloc_nocma_{save,restore} to
exclude CMA memory rather than manually clearing __GFP_MOVABLE.  It also
makes the hugetlb dequeue function CMA aware: in the dequeue function,
CMA memory is skipped if the PF_MEMALLOC_NOCMA flag has been set by
memalloc_nocma_save().

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/hugetlb.h |  2 --
 mm/gup.c                | 32 +++++++++++++++-----------------
 mm/hugetlb.c            | 11 +++++++++--
 3 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bb93e95..34a10e5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -509,8 +509,6 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
-struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
-				     int nid, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
diff --git a/mm/gup.c b/mm/gup.c
index 5daadae..79142a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1623,6 +1623,8 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private)
 	 * allocation memory.
 	 */
 	gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
+	unsigned int flags = memalloc_nocma_save();
+	struct page *new_page = NULL;
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
@@ -1630,33 +1632,29 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private)
 #ifdef CONFIG_HUGETLB_PAGE
 	if (PageHuge(page)) {
 		struct hstate *h = page_hstate(page);
-		/*
-		 * We don't want to dequeue from the pool because pool pages will
-		 * mostly be from the CMA region.
-		 */
-		return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+
+		new_page = alloc_huge_page_nodemask(h, nid, NULL, gfp_mask);
+		goto out;
 	}
 #endif
+
 	if (PageTransHuge(page)) {
-		struct page *thp;
 		/*
 		 * ignore allocation failure warnings
 		 */
 		gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
 
-		/*
-		 * Remove the movable mask so that we don't allocate from
-		 * CMA area again.
-		 */
-		thp_gfpmask &= ~__GFP_MOVABLE;
-		thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
-		if (!thp)
-			return NULL;
-		prep_transhuge_page(thp);
-		return thp;
+		new_page = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
+		if (new_page)
+			prep_transhuge_page(new_page);
+		goto out;
 	}
 
-	return __alloc_pages_node(nid, gfp_mask, 0);
+	new_page = __alloc_pages_node(nid, gfp_mask, 0);
+
+out:
+	memalloc_nocma_restore(flags);
+	return new_page;
 }
 
 static long check_and_migrate_cma_pages(struct task_struct *tsk,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3245aa0..514e29c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@
 #include <linux/numa.h>
 #include <linux/llist.h>
 #include <linux/cma.h>
+#include <linux/sched/mm.h>
 
 #include <asm/page.h>
 #include <asm/tlb.h>
@@ -1036,10 +1037,16 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 {
 	struct page *page;
+	bool nocma = !!(READ_ONCE(current->flags) & PF_MEMALLOC_NOCMA);
+
+	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
+		if (nocma && is_migrate_cma_page(page))
+			continue;
 
-	list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
 		if (!PageHWPoison(page))
 			break;
+	}
+
 	/*
 	 * if 'non-isolated free hugepage' not found on the list,
 	 * the allocation fails.
@@ -1928,7 +1935,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	return page;
 }
 
-struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
 				     int nid, nodemask_t *nmask)
 {
 	struct page *page;
Michal Hocko July 8, 2020, 7:41 a.m. UTC | #6
On Wed 08-07-20 16:16:02, Joonsoo Kim wrote:
> On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
> > On 7/7/20 9:44 AM, js1304@gmail.com wrote:
> > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > 
> > > new_non_cma_page() in gup.c which try to allocate migration target page
> > > requires to allocate the new page that is not on the CMA area.
> > > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> > > works well for THP page or normal page but not for hugetlb page.
> > > 
> > > hugetlb page allocation process consists of two steps.  First is dequeing
> > > from the pool.  Second is, if there is no available page on the queue,
> > > allocating from the page allocator.
> > > 
> > > new_non_cma_page() can control allocation from the page allocator by
> > > specifying correct gfp flag.  However, dequeing cannot be controlled until
> > > now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> > > since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> > > patch tries to fix this situation.
> > > 
> > > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > > pages if newly added skip_cma argument is passed as true.
> > 
> > Hmm, can't you instead change dequeue_huge_page_node_exact() to test the PF_
> > flag and avoid adding bool skip_cma everywhere?
> 
> Okay! Please check following patch.
> > 
> > I think that's what Michal suggested [1] except he said "the code already does
> > by memalloc_nocma_{save,restore} API". It needs extending a bit though, AFAICS.
> > __gup_longterm_locked() indeed does the save/restore, but restore comes before
> > check_and_migrate_cma_pages() and thus new_non_cma_page() is called, so an
> > adjustment is needed there, but that's all?
> > 
> > Hm the adjustment should be also done because save/restore is done around
> > __get_user_pages_locked(), but check_and_migrate_cma_pages() also calls
> > __get_user_pages_locked(), and that call not being between nocma save and
> > restore is thus also a correctness issue?
> 
> Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
> would not cause any problem.

I believe a proper fix is the following. The scope is really defined for
FOLL_LONGTERM pins, and pushing it inside check_and_migrate_cma_pages
would solve the problem as well, but IMHO it makes more sense to do it in
the caller, the same way we do for the others.

Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")

I am not sure this is worth backporting to stable yet.

diff --git a/mm/gup.c b/mm/gup.c
index de9e36262ccb..75980dd5a2fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1794,7 +1794,6 @@ static long __gup_longterm_locked(struct task_struct *tsk,
 				     vmas_tmp, NULL, gup_flags);
 
 	if (gup_flags & FOLL_LONGTERM) {
-		memalloc_nocma_restore(flags);
 		if (rc < 0)
 			goto out;
 
@@ -1802,11 +1801,13 @@ static long __gup_longterm_locked(struct task_struct *tsk,
 			for (i = 0; i < rc; i++)
 				put_page(pages[i]);
 			rc = -EOPNOTSUPP;
+			memalloc_nocma_restore(flags);
 			goto out;
 		}
 
 		rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
 						 vmas_tmp, gup_flags);
+		memalloc_nocma_restore(flags);
 	}
 
 out:
Vlastimil Babka July 8, 2020, 9:26 a.m. UTC | #7
On 7/8/20 9:41 AM, Michal Hocko wrote:
> On Wed 08-07-20 16:16:02, Joonsoo Kim wrote:
>> On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
>> 
>> Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
>> would not cause any problem.
> 
> I believe a proper fix is the following. The scope is really defined for
> FOLL_LONGTERM pins and pushing it inside check_and_migrate_cma_pages
> will solve the problem as well but it imho makes more sense to do it in
> the caller the same way we do for any others. 
> 
> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")

Agreed.

> 
> I am not sure this is worth backporting to stable yet.

CC Aneesh.

Context: since check_and_migrate_cma_pages() calls __get_user_pages_locked(), it
should also be called under memalloc_nocma_save().

> diff --git a/mm/gup.c b/mm/gup.c
> index de9e36262ccb..75980dd5a2fc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1794,7 +1794,6 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>  				     vmas_tmp, NULL, gup_flags);
>  
>  	if (gup_flags & FOLL_LONGTERM) {
> -		memalloc_nocma_restore(flags);
>  		if (rc < 0)
>  			goto out;
>  
> @@ -1802,11 +1801,13 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>  			for (i = 0; i < rc; i++)
>  				put_page(pages[i]);
>  			rc = -EOPNOTSUPP;
> +			memalloc_nocma_restore(flags);
>  			goto out;
>  		}
>  
>  		rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
>  						 vmas_tmp, gup_flags);
> +		memalloc_nocma_restore(flags);
>  	}
>  
>  out:
>
Aneesh Kumar K.V July 8, 2020, 10:57 a.m. UTC | #8
Vlastimil Babka <vbabka@suse.cz> writes:

> On 7/8/20 9:41 AM, Michal Hocko wrote:
>> On Wed 08-07-20 16:16:02, Joonsoo Kim wrote:
>>> On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
>>> 
>>> Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
>>> would not cause any problem.
>> 
>> I believe a proper fix is the following. The scope is really defined for
>> FOLL_LONGTERM pins and pushing it inside check_and_migrate_cma_pages
>> will solve the problem as well but it imho makes more sense to do it in
>> the caller the same way we do for any others. 
>> 
>> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
>
> Agreed.
>
>> 
>> I am not sure this is worth backporting to stable yet.
>
> CC Aneesh.
>
> Context: since check_and_migrate_cma_pages() calls __get_user_pages_locked(), it
> should also be called under memalloc_nocma_save().

But by then we have faulted in all relevant pages and migrated them out of
the CMA area, right?


>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index de9e36262ccb..75980dd5a2fc 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1794,7 +1794,6 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>>  				     vmas_tmp, NULL, gup_flags);
>>  
>>  	if (gup_flags & FOLL_LONGTERM) {
>> -		memalloc_nocma_restore(flags);
>>  		if (rc < 0)
>>  			goto out;
>>  
>> @@ -1802,11 +1801,13 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>>  			for (i = 0; i < rc; i++)
>>  				put_page(pages[i]);
>>  			rc = -EOPNOTSUPP;
>> +			memalloc_nocma_restore(flags);
>>  			goto out;
>>  		}
>>  
>>  		rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
>>  						 vmas_tmp, gup_flags);
>> +		memalloc_nocma_restore(flags);
>>  	}
>>  
>>  out:
>> 

-aneesh
Michal Hocko July 8, 2020, 11:32 a.m. UTC | #9
On Wed 08-07-20 16:27:16, Aneesh Kumar K.V wrote:
> Vlastimil Babka <vbabka@suse.cz> writes:
> 
> > On 7/8/20 9:41 AM, Michal Hocko wrote:
> >> On Wed 08-07-20 16:16:02, Joonsoo Kim wrote:
> >>> On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
> >>> 
> >>> Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
> >>> would not cause any problem.
> >> 
> >> I believe a proper fix is the following. The scope is really defined for
> >> FOLL_LONGTERM pins and pushing it inside check_and_migrate_cma_pages
> >> will solve the problem as well but it imho makes more sense to do it in
> >> the caller the same way we do for any others. 
> >> 
> >> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> >
> > Agreed.
> >
> >> 
> >> I am not sure this is worth backporting to stable yet.
> >
> > CC Aneesh.
> >
> > Context: since check_and_migrate_cma_pages() calls __get_user_pages_locked(), it
> > should also be called under memalloc_nocma_save().
> 
> But by then we faulted in all relevant pages and migrated them out of
> CMA rea right?

check_and_migrate_cma_pages() will allocate target pages that you want to
migrate off the CMA region, unless I am misreading the code. And those
allocations need to be placed outside of CMA.
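
After the fix above, the relevant call chain is roughly the following
(simplified, with argument lists trimmed); the point is that the nocma
scope now stays active while the migration targets are allocated and the
migrated range is re-pinned:

	__gup_longterm_locked()
		flags = memalloc_nocma_save();
		__get_user_pages_locked();	/* initial pin */
		check_and_migrate_cma_pages()
			migrate_pages(..., new_non_cma_page, ...);	/* allocate migration targets */
			__get_user_pages_locked();	/* re-pin the migrated range */
		memalloc_nocma_restore(flags);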
Mike Kravetz July 9, 2020, 12:27 a.m. UTC | #10
On 7/8/20 12:16 AM, Joonsoo Kim wrote:
> On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
>> On 7/7/20 9:44 AM, js1304@gmail.com wrote:
>>> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>
<...>
>>> This patch makes the deque function on hugetlb CMA aware and skip CMA
>>> pages if newly added skip_cma argument is passed as true.
>>
>> Hmm, can't you instead change dequeue_huge_page_node_exact() to test the PF_
>> flag and avoid adding bool skip_cma everywhere?
> 
> Okay! Please check following patch.
>>
>> I think that's what Michal suggested [1] except he said "the code already does
>> by memalloc_nocma_{save,restore} API". It needs extending a bit though, AFAICS.
>> __gup_longterm_locked() indeed does the save/restore, but restore comes before
>> check_and_migrate_cma_pages() and thus new_non_cma_page() is called, so an
>> adjustment is needed there, but that's all?
>>
>> Hm the adjustment should be also done because save/restore is done around
>> __get_user_pages_locked(), but check_and_migrate_cma_pages() also calls
>> __get_user_pages_locked(), and that call not being between nocma save and
>> restore is thus also a correctness issue?
> 
> Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
> would not cause any problem.
> 
> ------------------>8-------------------
> From bcfc57e3c6f2df1ad2940308b89d740cd3f0fba8 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Wed, 8 Jul 2020 14:39:26 +0900
> Subject: [PATCH] mm/hugetlb: make hugetlb migration callback CMA aware
> 
<...>
> 
> This patch makes new_non_cma_page() uses memalloc_nocma_{save,restore}
> to exclude CMA memory rather than manually clearing __GFP_MOVABLE. And,
> this patch also makes the deque function on hugetlb CMA aware. In the
> deque function, CMA memory is skipped if PF_MEMALLOC_NOCMA flag is set
> by memalloc_nocma_{save,restore}.
> 
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

I did ACK the previous version of the patch, but I like this much better.
I assume there will be a new version built on top of Michal's patch to
change the placement of memalloc_nocma_restore calls in __gup_longterm_locked.
Michal Hocko July 9, 2020, 6:43 a.m. UTC | #11
On Wed 08-07-20 09:41:06, Michal Hocko wrote:
> On Wed 08-07-20 16:16:02, Joonsoo Kim wrote:
> > On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
> > > On 7/7/20 9:44 AM, js1304@gmail.com wrote:
> > > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > 
> > > > new_non_cma_page() in gup.c which try to allocate migration target page
> > > > requires to allocate the new page that is not on the CMA area.
> > > > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> > > > works well for THP page or normal page but not for hugetlb page.
> > > > 
> > > > hugetlb page allocation process consists of two steps.  First is dequeing
> > > > from the pool.  Second is, if there is no available page on the queue,
> > > > allocating from the page allocator.
> > > > 
> > > > new_non_cma_page() can control allocation from the page allocator by
> > > > specifying correct gfp flag.  However, dequeing cannot be controlled until
> > > > now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> > > > since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> > > > patch tries to fix this situation.
> > > > 
> > > > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > > > pages if newly added skip_cma argument is passed as true.
> > > 
> > > Hmm, can't you instead change dequeue_huge_page_node_exact() to test the PF_
> > > flag and avoid adding bool skip_cma everywhere?
> > 
> > Okay! Please check following patch.
> > > 
> > > I think that's what Michal suggested [1] except he said "the code already does
> > > by memalloc_nocma_{save,restore} API". It needs extending a bit though, AFAICS.
> > > __gup_longterm_locked() indeed does the save/restore, but restore comes before
> > > check_and_migrate_cma_pages() and thus new_non_cma_page() is called, so an
> > > adjustment is needed there, but that's all?
> > > 
> > > Hm the adjustment should be also done because save/restore is done around
> > > __get_user_pages_locked(), but check_and_migrate_cma_pages() also calls
> > > __get_user_pages_locked(), and that call not being between nocma save and
> > > restore is thus also a correctness issue?
> > 
> > Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
> > would not cause any problem.
> 
> I believe a proper fix is the following. The scope is really defined for
> FOLL_LONGTERM pins and pushing it inside check_and_migrate_cma_pages
> will solve the problem as well but it imho makes more sense to do it in
> the caller the same way we do for any others. 
> 
> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> 
> I am not sure this is worth backporting to stable yet.

Should I post it as a separate patch, or do you plan to include this in your next version?

> 
> diff --git a/mm/gup.c b/mm/gup.c
> index de9e36262ccb..75980dd5a2fc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1794,7 +1794,6 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>  				     vmas_tmp, NULL, gup_flags);
>  
>  	if (gup_flags & FOLL_LONGTERM) {
> -		memalloc_nocma_restore(flags);
>  		if (rc < 0)
>  			goto out;
>  
> @@ -1802,11 +1801,13 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>  			for (i = 0; i < rc; i++)
>  				put_page(pages[i]);
>  			rc = -EOPNOTSUPP;
> +			memalloc_nocma_restore(flags);
>  			goto out;
>  		}
>  
>  		rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
>  						 vmas_tmp, gup_flags);
> +		memalloc_nocma_restore(flags);
>  	}
>  
>  out:
> -- 
> Michal Hocko
> SUSE Labs
Joonsoo Kim July 9, 2020, 7:03 a.m. UTC | #12
On Thu, Jul 9, 2020 at 3:43 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 08-07-20 09:41:06, Michal Hocko wrote:
> > On Wed 08-07-20 16:16:02, Joonsoo Kim wrote:
> > > On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
> > > > On 7/7/20 9:44 AM, js1304@gmail.com wrote:
> > > > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > >
> > > > > new_non_cma_page() in gup.c which try to allocate migration target page
> > > > > requires to allocate the new page that is not on the CMA area.
> > > > > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> > > > > works well for THP page or normal page but not for hugetlb page.
> > > > >
> > > > > hugetlb page allocation process consists of two steps.  First is dequeing
> > > > > from the pool.  Second is, if there is no available page on the queue,
> > > > > allocating from the page allocator.
> > > > >
> > > > > new_non_cma_page() can control allocation from the page allocator by
> > > > > specifying correct gfp flag.  However, dequeing cannot be controlled until
> > > > > now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> > > > > since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> > > > > patch tries to fix this situation.
> > > > >
> > > > > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > > > > pages if newly added skip_cma argument is passed as true.
> > > >
> > > > Hmm, can't you instead change dequeue_huge_page_node_exact() to test the PF_
> > > > flag and avoid adding bool skip_cma everywhere?
> > >
> > > Okay! Please check following patch.
> > > >
> > > > I think that's what Michal suggested [1] except he said "the code already does
> > > > by memalloc_nocma_{save,restore} API". It needs extending a bit though, AFAICS.
> > > > __gup_longterm_locked() indeed does the save/restore, but restore comes before
> > > > check_and_migrate_cma_pages() and thus new_non_cma_page() is called, so an
> > > > adjustment is needed there, but that's all?
> > > >
> > > > Hm the adjustment should be also done because save/restore is done around
> > > > __get_user_pages_locked(), but check_and_migrate_cma_pages() also calls
> > > > __get_user_pages_locked(), and that call not being between nocma save and
> > > > restore is thus also a correctness issue?
> > >
> > > Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
> > > would not cause any problem.
> >
> > I believe a proper fix is the following. The scope is really defined for
> > FOLL_LONGTERM pins and pushing it inside check_and_migrate_cma_pages
> > will solve the problem as well but it imho makes more sense to do it in
> > the caller the same way we do for any others.
> >
> > Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> >
> > I am not sure this is worth backporting to stable yet.
>
> Should I post it as a separate patch do you plan to include this into your next version?

It's better to include it in my next version, since this patch would cause
a conflict with the next tree, which includes v3 of this patchset.

Thanks.

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bb93e95..5a9ddf1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -506,11 +506,9 @@  struct huge_bootmem_page {
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
-				nodemask_t *nmask, gfp_t gfp_mask);
+				nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
-struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
-				     int nid, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
@@ -770,7 +768,7 @@  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 static inline struct page *
 alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
-			nodemask_t *nmask, gfp_t gfp_mask)
+			nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma)
 {
 	return NULL;
 }
diff --git a/mm/gup.c b/mm/gup.c
index 5daadae..2c3dab4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1630,11 +1630,12 @@  static struct page *new_non_cma_page(struct page *page, unsigned long private)
 #ifdef CONFIG_HUGETLB_PAGE
 	if (PageHuge(page)) {
 		struct hstate *h = page_hstate(page);
+
 		/*
 		 * We don't want to dequeue from the pool because pool pages will
 		 * mostly be from the CMA region.
 		 */
-		return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+		return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask, true);
 	}
 #endif
 	if (PageTransHuge(page)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3245aa0..bcf4abe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@ 
 #include <linux/numa.h>
 #include <linux/llist.h>
 #include <linux/cma.h>
+#include <linux/sched/mm.h>
 
 #include <asm/page.h>
 #include <asm/tlb.h>
@@ -1033,13 +1034,18 @@  static void enqueue_huge_page(struct hstate *h, struct page *page)
 	h->free_huge_pages_node[nid]++;
 }
 
-static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
+static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid, bool skip_cma)
 {
 	struct page *page;
 
-	list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
+	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
+		if (skip_cma && is_migrate_cma_page(page))
+			continue;
+
 		if (!PageHWPoison(page))
 			break;
+	}
+
 	/*
 	 * if 'non-isolated free hugepage' not found on the list,
 	 * the allocation fails.
@@ -1054,7 +1060,7 @@  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 }
 
 static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask, int nid,
-		nodemask_t *nmask)
+		nodemask_t *nmask, bool skip_cma)
 {
 	unsigned int cpuset_mems_cookie;
 	struct zonelist *zonelist;
@@ -1079,7 +1085,7 @@  static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 			continue;
 		node = zone_to_nid(zone);
 
-		page = dequeue_huge_page_node_exact(h, node);
+		page = dequeue_huge_page_node_exact(h, node, skip_cma);
 		if (page)
 			return page;
 	}
@@ -1115,7 +1121,7 @@  static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
-	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask, false);
 	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
 		SetPagePrivate(page);
 		h->resv_huge_pages--;
@@ -1928,7 +1934,7 @@  static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	return page;
 }
 
-struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
 				     int nid, nodemask_t *nmask)
 {
 	struct page *page;
@@ -1971,21 +1977,29 @@  struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 
 /* page migration callback function */
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
-		nodemask_t *nmask, gfp_t gfp_mask)
+		nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma)
 {
+	unsigned int flags = 0;
+	struct page *page = NULL;
+
+	if (skip_cma)
+		flags = memalloc_nocma_save();
+
 	spin_lock(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
-		struct page *page;
-
-		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
-		if (page) {
-			spin_unlock(&hugetlb_lock);
-			return page;
-		}
+		page = dequeue_huge_page_nodemask(h, gfp_mask,
+					preferred_nid, nmask, skip_cma);
 	}
 	spin_unlock(&hugetlb_lock);
 
-	return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
+	if (!page)
+		page = alloc_migrate_huge_page(h, gfp_mask,
+					preferred_nid, nmask);
+
+	if (skip_cma)
+		memalloc_nocma_restore(flags);
+
+	return page;
 }
 
 /* mempolicy aware migration callback */
@@ -2000,7 +2014,7 @@  struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 
 	gfp_mask = htlb_alloc_mask(h);
 	node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
-	page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask);
+	page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask, false);
 	mpol_cond_put(mpol);
 
 	return page;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9034a53..667b453 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1072,7 +1072,7 @@  struct page *alloc_new_node_page(struct page *page, unsigned long node)
 		struct hstate *h = page_hstate(compound_head(page));
 		gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
-		return alloc_huge_page_nodemask(h, node, NULL, gfp_mask);
+		return alloc_huge_page_nodemask(h, node, NULL, gfp_mask, false);
 	} else if (PageTransHuge(page)) {
 		struct page *thp;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 3b3d918..02b31fe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1543,7 +1543,7 @@  struct page *new_page_nodemask(struct page *page,
 
 		gfp_mask = htlb_alloc_mask(h);
 		return alloc_huge_page_nodemask(h, preferred_nid,
-						nodemask, gfp_mask);
+						nodemask, gfp_mask, false);
 	}
 
 	if (PageTransHuge(page)) {