
[v6,1/2] mm: migration: fix the FOLL_GET failure on following huge page

Message ID 20220816022102.582865-2-haiyue.wang@intel.com (mailing list archive)
State Superseded
Series: fix follow_page related issues

Commit Message

Haiyue Wang Aug. 16, 2022, 2:21 a.m. UTC
Not all huge page APIs support the FOLL_GET option, so the move_pages()
syscall will fail to get the page node information for some huge pages.

For example, on x86 with Linux 5.19, the 1GB huge page API
follow_huge_pud() returns a NULL page for FOLL_GET; calling the
move_pages() syscall with a NULL 'nodes' parameter then reports a '-2'
error in the 'status' array.

Note: follow_huge_pud() now supports FOLL_GET in Linux 6.0.
      Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev

But these huge page APIs don't support FOLL_GET:
  1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
  2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
     It will cause WARN_ON_ONCE for FOLL_GET.
  3. follow_huge_pgd() in mm/hugetlb.c

This is a temporary solution to mitigate the side effect of the race
condition fix, which calls follow_page() with FOLL_GET set for huge
pages.

Once following huge pages with FOLL_GET is fully supported, this fix
can be reverted safely.

Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/migrate.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
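
For reference, the failure mode described in the commit message can be
reproduced from userspace with a minimal sketch like the one below. It
assumes a 1GB hugetlb page has been reserved (e.g. hugepagesz=1G
hugepages=1 on the kernel command line), elides error handling, and
spells out MAP_HUGE_1GB in case the libc headers lack it:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>		/* move_pages(), link with -lnuma */

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* 30 == log2(1GB), 26 == MAP_HUGE_SHIFT */
#endif

int main(void)
{
	size_t sz = 1UL << 30;
	void *addr = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
			  -1, 0);
	int status = -1;

	*(volatile char *)addr = 0;	/* fault the huge page in */

	/* nodes == NULL asks move_pages() to report each page's node */
	move_pages(0, 1, &addr, NULL, &status, 0);

	/* On an affected kernel (e.g. 5.19 on x86), this prints -2
	 * (-ENOENT) instead of a node id. */
	printf("status = %d\n", status);
	return 0;
}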

Comments

Baolin Wang Aug. 16, 2022, 8:54 a.m. UTC | #1
On 8/16/2022 10:21 AM, Haiyue Wang wrote:
> Not all huge page APIs support the FOLL_GET option, so the move_pages()
> syscall will fail to get the page node information for some huge pages.
> 
> For example, on x86 with Linux 5.19, the 1GB huge page API
> follow_huge_pud() returns a NULL page for FOLL_GET; calling the
> move_pages() syscall with a NULL 'nodes' parameter then reports a '-2'
> error in the 'status' array.
> 
> Note: follow_huge_pud() now supports FOLL_GET in Linux 6.0.
>        Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> 
> But these huge page APIs don't support FOLL_GET:
>    1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
>    2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
>       It will cause WARN_ON_ONCE for FOLL_GET.
>    3. follow_huge_pgd() in mm/hugetlb.c
> 
> This is a temporary solution to mitigate the side effect of the race
> condition fix, which calls follow_page() with FOLL_GET set for huge
> pages.
> 
> Once following huge pages with FOLL_GET is fully supported, this fix
> can be reverted safely.
> 
> Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> ---

Looks good to me.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

>   mm/migrate.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6a1597c92261..581dfaad9257 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>   
>   	for (i = 0; i < nr_pages; i++) {
>   		unsigned long addr = (unsigned long)(*pages);
> +		unsigned int foll_flags = FOLL_DUMP;
>   		struct vm_area_struct *vma;
>   		struct page *page;
>   		int err = -EFAULT;
> @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>   		if (!vma)
>   			goto set_status;
>   
> +		/* Not all huge page follow APIs support 'FOLL_GET' */
> +		if (!is_vm_hugetlb_page(vma))
> +			foll_flags |= FOLL_GET;
> +
>   		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> +		page = follow_page(vma, addr, foll_flags);
>   
>   		err = PTR_ERR(page);
>   		if (IS_ERR(page))
> @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>   
>   		if (page && !is_zone_device_page(page)) {
>   			err = page_to_nid(page);
> -			put_page(page);
> +			if (foll_flags & FOLL_GET)
> +				put_page(page);
>   		} else {
>   			err = -ENOENT;
>   		}
Andrew Morton Aug. 17, 2022, 12:58 a.m. UTC | #2
On Tue, 16 Aug 2022 10:21:00 +0800 Haiyue Wang <haiyue.wang@intel.com> wrote:

> Not all huge page APIs support the FOLL_GET option, so the move_pages()
> syscall will fail to get the page node information for some huge pages.
> 
> For example, on x86 with Linux 5.19, the 1GB huge page API
> follow_huge_pud() returns a NULL page for FOLL_GET; calling the
> move_pages() syscall with a NULL 'nodes' parameter then reports a '-2'
> error in the 'status' array.
> 
> Note: follow_huge_pud() now supports FOLL_GET in Linux 6.0.
>       Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> 
> But these huge page APIs don't support FOLL_GET:
>   1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c

Let's tell the s390 maintainers.

>   2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
>      It will cause WARN_ON_ONCE for FOLL_GET.

ia64 doesn't have maintainers :( Can we come up with something local to
arch/ia64 for this?

>   3. follow_huge_pgd() in mm/hugetlb.c

Hi, Mike.

> This is a temporary solution to mitigate the side effect of the race
> condition fix, which calls follow_page() with FOLL_GET set for huge
> pages.
> 
> Once following huge pages with FOLL_GET is fully supported, this fix
> can be reverted safely.
> 
> ...
>
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1848,6 +1848,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		unsigned long addr = (unsigned long)(*pages);
> +		unsigned int foll_flags = FOLL_DUMP;
>  		struct vm_area_struct *vma;
>  		struct page *page;
>  		int err = -EFAULT;
> @@ -1856,8 +1857,12 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  		if (!vma)
>  			goto set_status;
>  
> +		/* Not all huge page follow APIs support 'FOLL_GET' */
> +		if (!is_vm_hugetlb_page(vma))
> +			foll_flags |= FOLL_GET;
> +
>  		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> +		page = follow_page(vma, addr, foll_flags);
>  
>  		err = PTR_ERR(page);
>  		if (IS_ERR(page))
> @@ -1865,7 +1870,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
>  
>  		if (page && !is_zone_device_page(page)) {
>  			err = page_to_nid(page);
> -			put_page(page);
> +			if (foll_flags & FOLL_GET)
> +				put_page(page);
>  		} else {
>  			err = -ENOENT;
>  		}

It would be better to fix this for real at those three client code sites?
Miaohe Lin Aug. 17, 2022, 2:12 a.m. UTC | #3
On 2022/8/16 10:21, Haiyue Wang wrote:
> Not all huge page APIs support the FOLL_GET option, so the move_pages()
> syscall will fail to get the page node information for some huge pages.
> 
> For example, on x86 with Linux 5.19, the 1GB huge page API
> follow_huge_pud() returns a NULL page for FOLL_GET; calling the
> move_pages() syscall with a NULL 'nodes' parameter then reports a '-2'
> error in the 'status' array.
> 
> Note: follow_huge_pud() now supports FOLL_GET in Linux 6.0.
>       Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> 
> But these huge page APIs don't support FOLL_GET:
>   1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
>   2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
>      It will cause WARN_ON_ONCE for FOLL_GET.
>   3. follow_huge_pgd() in mm/hugetlb.c
> 
> This is a temporary solution to mitigate the side effect of the race
> condition fix, which calls follow_page() with FOLL_GET set for huge
> pages.
> 
> Once following huge pages with FOLL_GET is fully supported, this fix
> can be reverted safely.
> 
> Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
> Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

LGTM. Many thanks for fixing this.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
Miaohe Lin
Haiyue Wang Aug. 17, 2022, 3:31 a.m. UTC | #4
> On Tue, 16 Aug 2022 10:21:00 +0800 Haiyue Wang <haiyue.wang@intel.com> wrote:
> 
> > Not all huge page APIs support the FOLL_GET option, so the move_pages()
> > syscall will fail to get the page node information for some huge pages.
> >
> > For example, on x86 with Linux 5.19, the 1GB huge page API
> > follow_huge_pud() returns a NULL page for FOLL_GET; calling the
> > move_pages() syscall with a NULL 'nodes' parameter then reports a '-2'
> > error in the 'status' array.
> >
> > Note: follow_huge_pud() now supports FOLL_GET in Linux 6.0.
> >       Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
> >
> > But these huge page APIs don't support FOLL_GET:
> >   1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
> 
> Let's tell the s390 maintainers.
> 
> >   2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
> >      It will cause WARN_ON_ONCE for FOLL_GET.
> 
> ia64 doesn't have maintainers :( Can we come up with something local to
> arch/ia64 for this?

The 'follow_huge_addr' function itself only cares about "FOLL_WRITE":
struct page *
follow_huge_addr(struct mm_struct *mm, unsigned long address,
			      int write)

And arch/ia64 added this definition 17 years ago ...

But I found that the check behind "WARN_ON_ONCE for FOLL_GET" was
introduced (as a BUG_ON back then) on 2005-10-29 by this commit:

[PATCH] mm: follow_page with inner ptlock

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=deceb6cd17e6dfafe4c4f81b1b4153bc41b2cb70

-	page = follow_huge_addr(mm, address, write);
-	if (! IS_ERR(page))
-		return page;
+	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
+	if (!IS_ERR(page)) {
+		BUG_ON(flags & FOLL_GET);
+		goto out;
+	}

> 
> >   3. follow_huge_pgd() in mm/hugetlb.c
> 
> Hi, Mike.
> 


> >  		}
> 
> It would be better to fix this for real at those three client code sites?

Then 5.19 will stay broken for a while, waiting for the final BIG patch?
Andrew Morton Aug. 17, 2022, 5:43 a.m. UTC | #5
On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:

> > >  		}
> > 
> > It would be better to fix this for real at those three client code sites?
> 
> Then 5.19 will stay broken for a while, waiting for the final BIG patch?

If that's the proposal then your [1/2] should have had a cc:stable and
changelog words describing the plan for 6.0.

But before we do that I'd like to see at least a prototype of the final
fixes to s390 and hugetlb, so we can assess those as preferable for
backporting.  I don't think they'll be terribly intrusive or risky?
Haiyue Wang Aug. 17, 2022, 5:47 a.m. UTC | #6
> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> 
> > > >  		}
> > >
> > > It would be better to fix this for real at those three client code sites?
> >
> > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
> 
> If that's the proposal then your [1/2] should have had a cc:stable and
> changelog words describing the plan for 6.0.
> 
> But before we do that I'd like to see at least a prototype of the final
> fixes to s390 and hugetlb, so we can assess those as preferable for

Got it, makes sense. ;-)

> backporting.  I don't think they'll be terribly intrusive or risky?
Mike Kravetz Aug. 17, 2022, 5:26 p.m. UTC | #7
On 08/16/22 22:43, Andrew Morton wrote:
> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> 
> > > >  		}
> > > 
> > > It would be better to fix this for real at those three client code sites?
> > 
> > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
> 
> If that's the proposal then your [1/2] should have had a cc:stable and
> changelog words describing the plan for 6.0.
> 
> But before we do that I'd like to see at least a prototype of the final
> fixes to s390 and hugetlb, so we can assess those as preferable for
> backporting.  I don't think they'll be terribly intrusive or risky?

I will start on adding follow_huge_pgd() support.  Although, I may need
some help with verification from the powerpc folks, as that is the only
architecture which supports hugetlb pages at that level.

mpe any suggestions?

When Naoya recently modified follow_huge_pud, it ended up looking very
much like follow_huge_pmd (as it should).  I expect follow_huge_pgd
will be similar.  In the end we may want to factor out the common code.
However, for a backport it would be better to just modify follow_huge_pgd
and do the cleanup/condensing of common code later.
Mike Kravetz Aug. 17, 2022, 9:58 p.m. UTC | #8
On 08/17/22 10:26, Mike Kravetz wrote:
> On 08/16/22 22:43, Andrew Morton wrote:
> > On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> > 
> > > > >  		}
> > > > 
> > > > It would be better to fix this for real at those three client code sites?
> > >
> > > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
> > 
> > If that's the proposal then your [1/2] should have had a cc:stable and
> > changelog words describing the plan for 6.0.
> > 
> > But before we do that I'd like to see at least a prototype of the final
> > fixes to s390 and hugetlb, so we can assess those as preferable for
> > backporting.  I don't think they'll be terribly intrusive or risky?
> 
> I will start on adding follow_huge_pgd() support.  Although, I may need
> some help with verification from the powerpc folks, as that is the only
> architecture which supports hugetlb pages at that level.
> 
> mpe any suggestions?

From 4925a98a6857dbb5a23bd97063ded2648863e65e Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Wed, 17 Aug 2022 14:32:10 -0700
Subject: [PATCH] hugetlb: make follow_huge_pgd support FOLL_GET

The existing version of follow_huge_pgd was very primitive and only
provided limited functionality.  Specifically, it did not support
FOLL_GET.  Update follow_huge_pgd with modifications similar to those
made for follow_huge_pud in commit 3a194f3f8ad0 ("mm/hugetlb: make
pud_huge() and follow_huge_pud() aware of non-present pud entry").

Note, common code should be factored out of follow_huge_p*d routines.
This will be done in future modifications.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ea1c7bfa1cc3..6f32d2bd1ca9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7055,10 +7055,38 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & (FOLL_GET | FOLL_PIN))
+	struct page *page = NULL;
+	spinlock_t *ptl;
+	pte_t pte;
+
+	if (WARN_ON_ONCE(flags & FOLL_PIN))
 		return NULL;
 
-	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
+retry:
+	ptl = huge_pte_lock(hstate_sizelog(PGDIR_SHIFT), mm, (pte_t *)pgd);
+	if (!pgd_huge(*pgd))
+		goto out;
+	pte = huge_ptep_get((pte_t *)pgd);
+	if (pte_present(pte)) {
+		page = pgd_page(*pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
+	} else {
+		if (is_hugetlb_entry_migration(pte)) {
+			spin_unlock(ptl);
+			__migration_entry_wait(mm, (pte_t *)pgd, ptl);
+			goto retry;
+		}
+		/*
+		 * hwpoisoned entry is treated as no_page_table in
+		 * follow_page_mask().
+		 */
+	}
+out:
+	spin_unlock(ptl);
+	return page;
 }
 
 int isolate_hugetlb(struct page *page, struct list_head *list)
Haiyue Wang Aug. 18, 2022, 12:32 a.m. UTC | #9
> On 08/17/22 10:26, Mike Kravetz wrote:
> > On 08/16/22 22:43, Andrew Morton wrote:
> > > On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> > >
> > > > > >  		}
> > > > >
> > > > > It would be better to fix this for real at those three client code sites?
> > > >
> > > > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
> > >
> > > If that's the proposal then your [1/2] should have had a cc:stable and
> > > changelog words describing the plan for 6.0.
> > >
> > > But before we do that I'd like to see at least a prototype of the final
> > > fixes to s390 and hugetlb, so we can assess those as preferable for
> > > backporting.  I don't think they'll be terribly intrusive or risky?
> >
> > I will start on adding follow_huge_pgd() support.  Although, I may need
> > some help with verification from the powerpc folks, as that is the only
> > architecture which supports hugetlb pages at that level.
> >
> > mpe any suggestions?
> 
> From 4925a98a6857dbb5a23bd97063ded2648863e65e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Wed, 17 Aug 2022 14:32:10 -0700
> Subject: [PATCH] hugetlb: make follow_huge_pgd support FOLL_GET
> 
> The existing version of follow_huge_pgd was very primitive and only
> provided limited functionality.  Specifically, it did not support
> FOLL_GET.  Update follow_huge_pgd with modifications similar to those
> made for follow_huge_pud in commit 3a194f3f8ad0 ("mm/hugetlb: make
> pud_huge() and follow_huge_pud() aware of non-present pud entry").
> 
> Note, common code should be factored out of follow_huge_p*d routines.
> This will be done in future modifications.
> 

I found "Anshuman Khandual <khandual@linux.vnet.ibm.com>" submit the similar
patch on "Apr 2016 11:07:37 +0530"

[PATCH 03/10] mm/hugetlb: Protect follow_huge_(pud|pgd) functions from race
https://lore.kernel.org/all/1460007464-26726-4-git-send-email-khandual@linux.vnet.ibm.com/

> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ea1c7bfa1cc3..6f32d2bd1ca9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7055,10 +7055,38 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  struct page * __weak
>  follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
>  {
> -	if (flags & (FOLL_GET | FOLL_PIN))
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
> +	pte_t pte;
> +
> +	if (WARN_ON_ONCE(flags & FOLL_PIN))
>  		return NULL;
> 
> -	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
> +retry:
> +	ptl = huge_pte_lock(hstate_sizelog(PGDIR_SHIFT), mm, (pte_t *)pgd);
> +	if (!pgd_huge(*pgd))
> +		goto out;
> +	pte = huge_ptep_get((pte_t *)pgd);
> +	if (pte_present(pte)) {
> +		page = pgd_page(*pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> +			page = NULL;
> +			goto out;
> +		}
> +	} else {
> +		if (is_hugetlb_entry_migration(pte)) {
> +			spin_unlock(ptl);
> +			__migration_entry_wait(mm, (pte_t *)pgd, ptl);
> +			goto retry;
> +		}
> +		/*
> +		 * hwpoisoned entry is treated as no_page_table in
> +		 * follow_page_mask().
> +		 */
> +	}
> +out:
> +	spin_unlock(ptl);
> +	return page;
>  }
> 
>  int isolate_hugetlb(struct page *page, struct list_head *list)
> --
> 2.37.1
Gerald Schaefer Aug. 18, 2022, 11:51 a.m. UTC | #10
On Tue, 16 Aug 2022 22:43:22 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> 
> > > >  		}
> > > 
> > > It would be better to fix this for real at those three client code sites?
> > 
> > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
> 
> If that's the proposal then your [1/2] should have had a cc:stable and
> changelog words describing the plan for 6.0.
> 
> But before we do that I'd like to see at least a prototype of the final
> fixes to s390 and hugetlb, so we can assess those as preferable for
> backporting.  I don't think they'll be terribly intrusive or risky?
> 

The private follow_huge_pud() for s390 is just a leftover; the only
reason for it was that the generic version used pte_page() instead of
pud_page(), which would not work for s390. See also commit
97534127012f ("mm/hugetlb: use pmd_page() in follow_huge_pmd()").

Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
follow_huge_pud() aware of non-present pud entry") made
follow_huge_pud() behave similarly to follow_huge_pmd(), in particular
also using pud_page(), we can now switch to the generic version.

Note that we cannot support migration / hwpoison for hugetlb or THP,
because of different layout for PTE and PMD/PUD on s390. The generic
swp_entry functions all require proper PTEs, which wouldn't work on
PMD/PUD entries. In theory, at least for hugetlb, due to the "fake
PTE" conversion logic in huge_ptep_get(), we might be able to also
fake swp_entries, but the other problem is that we do not have enough
free bits in the PMD/PUD, so there probably will never be migration
support for huge pages on s390.

Anyway, that should not matter wrt switching to the generic
follow_huge_pud(), because is_hugetlb_entry_migration() should always
return false, and no special change to the pud_huge() check should be
needed, unlike on x86.
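
For comparison, after commit 3a194f3f8ad01 the generic follow_huge_pud()
in mm/hugetlb.c looks roughly like this (an abridged sketch, mirroring
the follow_huge_pgd() prototype earlier in this thread; note the
pud_page() call that makes the generic version usable on s390):

struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
		pud_t *pud, int flags)
{
	struct page *page = NULL;
	spinlock_t *ptl;
	pte_t pte;

	if (WARN_ON_ONCE(flags & FOLL_PIN))
		return NULL;

retry:
	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
	if (!pud_huge(*pud))
		goto out;
	pte = huge_ptep_get((pte_t *)pud);
	if (pte_present(pte)) {
		/* pud_page(), not pte_page(), so this also works on s390 */
		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
			page = NULL;
			goto out;
		}
	} else if (is_hugetlb_entry_migration(pte)) {
		/* never taken on s390, per the analysis above */
		spin_unlock(ptl);
		__migration_entry_wait(mm, (pte_t *)pud, ptl);
		goto retry;
	}
out:
	spin_unlock(ptl);
	return page;
}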
Gerald Schaefer Aug. 18, 2022, 11:57 a.m. UTC | #11
On Thu, 18 Aug 2022 13:51:49 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:

> On Tue, 16 Aug 2022 22:43:22 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> > 
> > > > >  		}
> > > > 
> > > > It would be better to fix this for real at those three client code sites?
> > > 
> > > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
> > 
> > If that's the proposal then your [1/2] should have had a cc:stable and
> > changelog words describing the plan for 6.0.
> > 
> > But before we do that I'd like to see at least a prototype of the final
> > fixes to s390 and hugetlb, so we can assess those as preferable for
> > backporting.  I don't think they'll be terribly intrusive or risky?
> > 
> 
> The private follow_huge_pud() for s390 is just a leftover; the only
> reason for it was that the generic version used pte_page() instead of
> pud_page(), which would not work for s390. See also commit
> 97534127012f ("mm/hugetlb: use pmd_page() in follow_huge_pmd()").
> 
> Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
> follow_huge_pud() aware of non-present pud entry") made
> follow_huge_pud() behave similarly to follow_huge_pmd(), in particular
> also using pud_page(), we can now switch to the generic version.
> 
> Note that we cannot support migration / hwpoison for hugetlb or THP,
> because of different layout for PTE and PMD/PUD on s390. The generic
> swp_entry functions all require proper PTEs, which wouldn't work on
> PMD/PUD entries. In theory, at least for hugetlb, due to the "fake
> PTE" conversion logic in huge_ptep_get(), we might be able to also
> fake swp_entries, but the other problem is that we do not have enough
> free bits in the PMD/PUD, so there probably will never be migration
> support for huge pages on s390.
> 
> Anyway, that should not matter wrt switching to the generic
> follow_huge_pud(), because is_hugetlb_entry_migration() should always
> return false, and no special change to the pud_huge() check should be
> needed, unlike on x86.

From ce0150cd6f80425c702ccdc4cd8a511c47e99b67 Mon Sep 17 00:00:00 2001
From: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Date: Thu, 18 Aug 2022 13:19:23 +0200
Subject: [PATCH] s390/hugetlb: switch to generic version of follow_huge_pud()

When pud-sized hugepages were introduced for s390, the generic version
of follow_huge_pud() was using pte_page() instead of pud_page(). This
would be wrong for s390, see also commit 97534127012f ("mm/hugetlb: use
pmd_page() in follow_huge_pmd()"). Therefore, and probably because not
all archs were supporting pud_page() at that time, a private version of
follow_huge_pud() was added for s390, correctly using pud_page().

Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
follow_huge_pud() aware of non-present pud entry"), the generic version
of follow_huge_pud() is now also using pud_page(), and in general
behaves similarly to follow_huge_pmd().

Therefore we can now switch to the generic version and get rid of the
s390-specific follow_huge_pud().

Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/mm/hugetlbpage.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 10e51ef9c79a..c299a18273ff 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -237,16 +237,6 @@ int pud_huge(pud_t pud)
 	return pud_large(pud);
 }
 
-struct page *
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-		pud_t *pud, int flags)
-{
-	if (flags & FOLL_GET)
-		return NULL;
-
-	return pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
-}
-
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
 	if (MACHINE_HAS_EDAT1 && size == PMD_SIZE)
Michael Ellerman Aug. 19, 2022, 11:22 a.m. UTC | #12
Mike Kravetz <mike.kravetz@oracle.com> writes:
> On 08/16/22 22:43, Andrew Morton wrote:
>> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
>>
>> > > >  		}
>> > >
>> > > It would be better to fix this for real at those three client code sites?
>> >
>> > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
>>
>> If that's the proposal then your [1/2] should have had a cc:stable and
>> changelog words describing the plan for 6.0.
>>
>> But before we do that I'd like to see at least a prototype of the final
>> fixes to s390 and hugetlb, so we can assess those as preferable for
>> backporting.  I don't think they'll be terribly intrusive or risky?
>
> I will start on adding follow_huge_pgd() support.  Although, I may need
> some help with verification from the powerpc folks, as that is the only
> architecture which supports hugetlb pages at that level.
>
> mpe any suggestions?

I'm happy to test.

I have a system where I can allocate 1GB huge pages.

I'm not sure how to actually test this path though. I hacked up the
vm/migration.c test to allocate 1GB hugepages, but I can't see it going
through follow_huge_pgd() (using ftrace).

Maybe I hacked it up badly; I'll have a closer look on Monday. But if
you have any tips on how to trigger that path, let me know :)

cheers
Mike Kravetz Aug. 19, 2022, 4:55 p.m. UTC | #13
On 08/19/22 21:22, Michael Ellerman wrote:
> Mike Kravetz <mike.kravetz@oracle.com> writes:
> > On 08/16/22 22:43, Andrew Morton wrote:
> >> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
> >>
> >> > > >  		}
> >> > >
> >> > > It would be better to fix this for real at those three client code sites?
> >> >
> >> > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
> >>
> >> If that's the proposal then your [1/2] should have had a cc:stable and
> >> changelog words describing the plan for 6.0.
> >>
> >> But before we do that I'd like to see at least a prototype of the final
> >> fixes to s390 and hugetlb, so we can assess those as preferable for
> >> backporting.  I don't think they'll be terribly intrusive or risky?
> >
> > I will start on adding follow_huge_pgd() support.  Although, I may need
> > some help with verification from the powerpc folks, as that is the only
> > architecture which supports hugetlb pages at that level.
> >
> > mpe any suggestions?
> 
> I'm happy to test.
> 
> I have a system where I can allocate 1GB huge pages.
> 
> I'm not sure how to actually test this path though. I hacked up the
> vm/migration.c test to allocate 1GB hugepages, but I can't see it going
> through follow_huge_pgd() (using ftrace).

I think you need to use 16GB pages to trigger this code path.  Anshuman introduced
support for page offline (and migration) at this level in commit 94310cbcaa3c
("mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level").
When asked about the use case, he mentioned:

"Yes, its in the context of 16GB pages on POWER8 system where all the
 gigantic pages are pre allocated from the platform and passed on to
 the kernel through the device tree. We dont allocate these gigantic
 pages on runtime."
Michael Ellerman Aug. 26, 2022, 1:07 p.m. UTC | #14
Mike Kravetz <mike.kravetz@oracle.com> writes:
> On 08/19/22 21:22, Michael Ellerman wrote:
>> Mike Kravetz <mike.kravetz@oracle.com> writes:
>> > On 08/16/22 22:43, Andrew Morton wrote:
>> >> On Wed, 17 Aug 2022 03:31:37 +0000 "Wang, Haiyue" <haiyue.wang@intel.com> wrote:
>> >>
>> >> > > >  		}
>> >> > >
>> >> > > It would be better to fix this for real at those three client code sites?
>> >> >
>> >> > Then 5.19 will stay broken for a while, waiting for the final BIG patch?
>> >>
>> >> If that's the proposal then your [1/2] should have had a cc:stable and
>> >> changelog words describing the plan for 6.0.
>> >>
>> >> But before we do that I'd like to see at least a prototype of the final
>> >> fixes to s390 and hugetlb, so we can assess those as preferable for
>> >> backporting.  I don't think they'll be terribly intrusive or risky?
>> >
>> > I will start on adding follow_huge_pgd() support.  Although, I may need
>> > some help with verification from the powerpc folks, as that is the only
>> > architecture which supports hugetlb pages at that level.
>> >
>> > mpe any suggestions?
>>
>> I'm happy to test.
>>
>> I have a system where I can allocate 1GB huge pages.
>>
>> I'm not sure how to actually test this path though. I hacked up the
>> vm/migration.c test to allocate 1GB hugepages, but I can't see it going
>> through follow_huge_pgd() (using ftrace).
>
> I think you need to use 16GB pages to trigger this code path.  Anshuman introduced
> support for page offline (and migration) at this level in commit 94310cbcaa3c
> ("mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level").
> When asked about the use case, he mentioned:
>
> "Yes, its in the context of 16GB pages on POWER8 system where all the
>  gigantic pages are pre allocated from the platform and passed on to
>  the kernel through the device tree. We dont allocate these gigantic
>  pages on runtime."

That was true, but isn't anymore.

I must have been insufficiently caffeinated the other day. On our newer
machines 1GB is the largest huge page size, but it's obviously way too
small to sit at the PGD level. So that was a waste of my time :)

We used to support 16GB at the PGD level, but we reworked the page table
geometry a few years ago, and now they sit at the PUD level on machines
that support 16GB pages:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba95b5d0359609b4ec8010f77c40ab3c595a6ac6

Note the author :}

So the good news is we no longer have any configuration where a huge
page entry is expected in the PGD. That means we can drop our pgd_huge()
definitions, and since ours are the last non-zero definitions, I think
it can all go away.

I'll send a patch to remove the powerpc pgd_huge() definitions after
I've run it through some tests.

cheers

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..581dfaad9257 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1848,6 +1848,7 @@  static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
+		unsigned int foll_flags = FOLL_DUMP;
 		struct vm_area_struct *vma;
 		struct page *page;
 		int err = -EFAULT;
@@ -1856,8 +1857,12 @@  static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma)
 			goto set_status;
 
+		/* Not all huge page follow APIs support 'FOLL_GET' */
+		if (!is_vm_hugetlb_page(vma))
+			foll_flags |= FOLL_GET;
+
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, foll_flags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1865,7 +1870,8 @@  static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 
 		if (page && !is_zone_device_page(page)) {
 			err = page_to_nid(page);
-			put_page(page);
+			if (foll_flags & FOLL_GET)
+				put_page(page);
 		} else {
 			err = -ENOENT;
 		}