
[v2,2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

Message ID 1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com (mailing list archive)
State New, archived
Series fix return value issue of soft offlining hugepages

Commit Message

Naoya Horiguchi June 10, 2019, 8:18 a.m. UTC
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when soft-offlining
hugepages with overcommitting enabled. This is caused by suboptimal logic
in the current soft-offline code. See the following part:

    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                            MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
            ...
    } else {
            /*
             * We set PG_hwpoison only when the migration source hugepage
             * was successfully dissolved, because otherwise hwpoisoned
             * hugepage remains on free hugepage list, then userspace will
             * find it as SIGBUS by allocation failure. That's not expected
             * in soft-offlining.
             */
            ret = dissolve_free_huge_page(page);
            if (!ret) {
                    if (set_hwpoison_free_buddy_page(page))
                            num_poisoned_pages_inc();
            }
    }
    return ret;

Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
have a chance that set_hwpoison_free_buddy_page() succeeds. So the current
code gives up offlining too early.

dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving. It should return success in the !PageHuge() case, because the
given hugepage is then considered already dissolved.
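
The difference can be illustrated with a small userspace model. This is
only a sketch and not kernel code: struct mock_page, dissolve_old(),
dissolve_new(), set_hwpoison_free_buddy() and offline_tail() are
hypothetical stand-ins that track just enough state to show how the
return value decides whether the poisoning step runs.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the state this sketch needs. */
struct mock_page {
	bool is_huge;   /* what PageHuge() would report */
	int refcount;   /* what page_count() would report */
	bool in_buddy;  /* page is free in the buddy allocator */
	bool hwpoison;  /* models PG_hwpoison */
};

/* Old behaviour: anything that is not a free hugepage yields -EBUSY. */
static int dissolve_old(struct mock_page *p)
{
	if (p->is_huge && p->refcount == 0) {
		p->is_huge = false;
		p->in_buddy = true;
		return 0;
	}
	return -EBUSY;
}

/* Patched behaviour: a non-hugepage counts as already dissolved. */
static int dissolve_new(struct mock_page *p)
{
	if (!p->is_huge)
		return 0;
	if (p->refcount == 0) {
		p->is_huge = false;
		p->in_buddy = true;
		return 0;
	}
	return -EBUSY;
}

/* Models set_hwpoison_free_buddy_page(): succeeds only for free buddy pages. */
static bool set_hwpoison_free_buddy(struct mock_page *p)
{
	if (!p->in_buddy)
		return false;
	p->hwpoison = true;
	return true;
}

/* Tail of the soft-offline path after migrate_pages() has succeeded. */
static int offline_tail(struct mock_page *p,
			int (*dissolve)(struct mock_page *))
{
	int ret = dissolve(p);

	if (!ret)
		set_hwpoison_free_buddy(p);
	return ret;
}
```

For a source page that migrate_pages() already released to buddy (the
overcommit case), offline_tail() with dissolve_old() returns -EBUSY and
never attempts the poisoning, while with dissolve_new() it returns 0 and
the page is marked.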

This change also affects other callers of dissolve_free_huge_page(),
which are cleaned up together.

Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
Cc: <stable@vger.kernel.org> # v4.19+
---
 mm/hugetlb.c        | 15 +++++++++------
 mm/memory-failure.c |  5 +----
 2 files changed, 10 insertions(+), 10 deletions(-)

Comments

Anshuman Khandual June 11, 2019, 9:50 a.m. UTC | #1
On 06/10/2019 01:48 PM, Naoya Horiguchi wrote:
> madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
> for hugepages with overcommitting enabled. That was caused by the suboptimal
> code in current soft-offline code. See the following part:
> 
>     ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>                             MIGRATE_SYNC, MR_MEMORY_FAILURE);
>     if (ret) {
>             ...
>     } else {
>             /*
>              * We set PG_hwpoison only when the migration source hugepage
>              * was successfully dissolved, because otherwise hwpoisoned
>              * hugepage remains on free hugepage list, then userspace will
>              * find it as SIGBUS by allocation failure. That's not expected
>              * in soft-offlining.
>              */
>             ret = dissolve_free_huge_page(page);
>             if (!ret) {
>                     if (set_hwpoison_free_buddy_page(page))
>                             num_poisoned_pages_inc();
>             }
>     }
>     return ret;
> 
> Here dissolve_free_huge_page() returns -EBUSY if the migration source page
> was freed into buddy in migrate_pages(), but even in that case we actually

Overcommitted source pages will be released into buddy and the normal ones
will not be? dissolve_free_huge_page() returns -EBUSY because PageHuge()
returns false on already released pages? How will dissolve_free_huge_page()
behave differently with overcommitted pages? I might be missing some
recent developments here.

> has a chance that set_hwpoison_free_buddy_page() succeeds. So that means
> current code gives up offlining too early now.

Hmm. It gives up early because the -EBUSY from dissolve_free_huge_page()
is passed back as the return code of soft_offline_huge_page() without
attempting set_hwpoison_free_buddy_page(), which still has a chance to
succeed for freed normal buddy pages.

> 
> dissolve_free_huge_page() checks that a given hugepage is suitable for
> dissolving, where we should return success for !PageHuge() case because
> the given hugepage is considered as already dissolved.

Right. It should return 0 (as a success) for freed normal buddy pages.
Shouldn't it then check explicitly for PageBuddy() as well?

> 
> This change also affects other callers of dissolve_free_huge_page(),
> which are cleaned up together.
> 
> Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
> Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> Cc: <stable@vger.kernel.org> # v4.19+
> ---
>  mm/hugetlb.c        | 15 +++++++++------
>  mm/memory-failure.c |  5 +----
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
> index ac843d3..048d071 100644
> --- v5.2-rc3/mm/hugetlb.c
> +++ v5.2-rc3_patched/mm/hugetlb.c
> @@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page *page)
>  	int rc = -EBUSY;
>  
>  	spin_lock(&hugetlb_lock);
> -	if (PageHuge(page) && !page_count(page)) {
> +	if (!PageHuge(page)) {
> +		rc = 0;
> +		goto out;
> +	}

With this early bail-out it maintains the functionality when called from
soft_offline_free_page() for normal pages. For huge pages, it continues
on the previous path.

> +
> +	if (!page_count(page)) {
>  		struct page *head = compound_head(page);
>  		struct hstate *h = page_hstate(head);
>  		int nid = page_to_nid(head);
> @@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
>  		page = pfn_to_page(pfn);
> -		if (PageHuge(page) && !page_count(page)) {

Right. These checks are now redundant.

Mike Kravetz June 11, 2019, 5:16 p.m. UTC | #2
On 6/10/19 1:18 AM, Naoya Horiguchi wrote:
> madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
> for hugepages with overcommitting enabled. That was caused by the suboptimal
> code in current soft-offline code. See the following part:
> 
>     ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>                             MIGRATE_SYNC, MR_MEMORY_FAILURE);
>     if (ret) {
>             ...
>     } else {
>             /*
>              * We set PG_hwpoison only when the migration source hugepage
>              * was successfully dissolved, because otherwise hwpoisoned
>              * hugepage remains on free hugepage list, then userspace will
>              * find it as SIGBUS by allocation failure. That's not expected
>              * in soft-offlining.
>              */
>             ret = dissolve_free_huge_page(page);
>             if (!ret) {
>                     if (set_hwpoison_free_buddy_page(page))
>                             num_poisoned_pages_inc();
>             }
>     }
>     return ret;
> 
> Here dissolve_free_huge_page() returns -EBUSY if the migration source page
> was freed into buddy in migrate_pages(), but even in that case we actually
> has a chance that set_hwpoison_free_buddy_page() succeeds. So that means
> current code gives up offlining too early now.
> 
> dissolve_free_huge_page() checks that a given hugepage is suitable for
> dissolving, where we should return success for !PageHuge() case because
> the given hugepage is considered as already dissolved.
> 
> This change also affects other callers of dissolve_free_huge_page(),
> which are cleaned up together.
> 
> Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
> Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> Cc: <stable@vger.kernel.org> # v4.19+
> ---
>  mm/hugetlb.c        | 15 +++++++++------
>  mm/memory-failure.c |  5 +----
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
> index ac843d3..048d071 100644
> --- v5.2-rc3/mm/hugetlb.c
> +++ v5.2-rc3_patched/mm/hugetlb.c
> @@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page *page)

Please update the function description for dissolve_free_huge_page() as
well.  It currently says, "Returns -EBUSY if the dissolution fails because
a give page is not a free hugepage" which is no longer true as a result of
this change.

>  	int rc = -EBUSY;
>  
>  	spin_lock(&hugetlb_lock);
> -	if (PageHuge(page) && !page_count(page)) {
> +	if (!PageHuge(page)) {
> +		rc = 0;
> +		goto out;
> +	}
> +
> +	if (!page_count(page)) {
>  		struct page *head = compound_head(page);
>  		struct hstate *h = page_hstate(head);
>  		int nid = page_to_nid(head);
> @@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
>  		page = pfn_to_page(pfn);
> -		if (PageHuge(page) && !page_count(page)) {
> -			rc = dissolve_free_huge_page(page);
> -			if (rc)
> -				break;
> -		}

We may want to consider keeping at least the PageHuge(page) check before
calling dissolve_free_huge_page().  dissolve_free_huge_pages is called as
part of memory offline processing.  We do not know if the memory to be offlined
contains huge pages or not.  With your changes, we are taking hugetlb_lock
on each call to dissolve_free_huge_page just to discover that the page is
not a huge page.

You 'could' add a PageHuge(page) check to dissolve_free_huge_page before
taking the lock.  However, you would need to check again after taking the
lock.
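
The check-then-recheck pattern suggested here could look roughly like the
userspace sketch below. It is not the kernel implementation: struct
range_page, mock_lock(), mock_unlock(), dissolve_with_precheck() and
dissolve_range() are hypothetical names, and the lock is modelled by a
counter so that the number of acquisitions can be observed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page model: just the two properties the checks look at. */
struct range_page {
	bool is_huge;  /* what PageHuge() would report */
	int refcount;  /* what page_count() would report */
};

/* Mock lock standing in for hugetlb_lock; it only counts acquisitions. */
static unsigned long lock_acquisitions;

static void mock_lock(void)
{
	lock_acquisitions++;
}

static void mock_unlock(void)
{
}

static int dissolve_with_precheck(struct range_page *p)
{
	int rc = -EBUSY;

	/* Lockless precheck: skip the lock for pages that are not hugepages. */
	if (!p->is_huge)
		return 0;

	mock_lock();
	/* Re-check under the lock: the page may have changed in between. */
	if (!p->is_huge) {
		rc = 0;
		goto out;
	}
	if (p->refcount == 0) {
		p->is_huge = false; /* dissolved */
		rc = 0;
	}
out:
	mock_unlock();
	return rc;
}

/* Models the loop in dissolve_free_huge_pages() over a range of pages. */
static int dissolve_range(struct range_page *pages, int n)
{
	int rc = 0;

	for (int i = 0; i < n; i++) {
		rc = dissolve_with_precheck(&pages[i]);
		if (rc)
			break;
	}
	return rc;
}
```

In this model, a range that contains mostly normal pages takes the lock
only for the pages that looked like hugepages at precheck time.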

Naoya Horiguchi June 12, 2019, 7:09 a.m. UTC | #3
On Tue, Jun 11, 2019 at 03:20:26PM +0530, Anshuman Khandual wrote:
> 
> On 06/10/2019 01:48 PM, Naoya Horiguchi wrote:
> > madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
> > for hugepages with overcommitting enabled. That was caused by the suboptimal
> > code in current soft-offline code. See the following part:
> > 
> >     ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> >                             MIGRATE_SYNC, MR_MEMORY_FAILURE);
> >     if (ret) {
> >             ...
> >     } else {
> >             /*
> >              * We set PG_hwpoison only when the migration source hugepage
> >              * was successfully dissolved, because otherwise hwpoisoned
> >              * hugepage remains on free hugepage list, then userspace will
> >              * find it as SIGBUS by allocation failure. That's not expected
> >              * in soft-offlining.
> >              */
> >             ret = dissolve_free_huge_page(page);
> >             if (!ret) {
> >                     if (set_hwpoison_free_buddy_page(page))
> >                             num_poisoned_pages_inc();
> >             }
> >     }
> >     return ret;
> > 
> > Here dissolve_free_huge_page() returns -EBUSY if the migration source page
> > was freed into buddy in migrate_pages(), but even in that case we actually
> 
> Over committed source pages will be released into buddy and the normal ones
> will not be ? dissolve_free_huge_page() returns -EBUSY because PageHuge()
> return negative on already released pages ? 

The answer to both questions is yes.

> How dissolve_free_huge_page()
> will behave differently with over committed pages. I might be missing some
> recent developments here.

Here dissolve_free_huge_page() should see a (free or reused) 4kB page in the
overcommitting case, and a (free or reused) huge page in the
non-overcommitting case.

> 
> > has a chance that set_hwpoison_free_buddy_page() succeeds. So that means
> > current code gives up offlining too early now.
> 
> Hmm. It gives up early as the return value from dissolve_free_huge_page(EBUSY)
> gets back as the return code for soft_offline_huge_page() without attempting
> set_hwpoison_free_buddy_page() which still has a chance to succeed for freed
> normal buddy pages.

Exactly.

> 
> > 
> > dissolve_free_huge_page() checks that a given hugepage is suitable for
> > dissolving, where we should return success for !PageHuge() case because
> > the given hugepage is considered as already dissolved.
> 
> Right. It should return 0 (as a success) for freed normal buddy pages. Should
> not it then check explicitly for PageBuddy() as well ?

In the new semantics, dissolve_free_huge_page() returns:

  0: successfully dissolved a free hugepage, or the page is already dissolved
  -EBUSY: failed to dissolve a free hugepage, or the hugepage is in use.

So for any type of non-hugepage, the return value is 0.
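
As a rough illustration of these semantics (a userspace sketch only;
enum dissolve_case and expected_dissolve_rc() are hypothetical names, not
kernel symbols):

```c
#include <errno.h>

/* Hypothetical classification of the page handed to the function. */
enum dissolve_case {
	CASE_NOT_HUGEPAGE,        /* any kind of non-hugepage */
	CASE_FREE_HUGE_DISSOLVED, /* free hugepage, dissolution succeeded */
	CASE_FREE_HUGE_FAILED,    /* free hugepage, dissolution failed */
	CASE_INUSE_HUGEPAGE,      /* hugepage that is still in use */
};

/* Return value expected under the new semantics described above. */
static int expected_dissolve_rc(enum dissolve_case c)
{
	switch (c) {
	case CASE_NOT_HUGEPAGE:        /* treated as already dissolved */
	case CASE_FREE_HUGE_DISSOLVED:
		return 0;
	case CASE_FREE_HUGE_FAILED:
	case CASE_INUSE_HUGEPAGE:
	default:
		return -EBUSY;
	}
}
```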

Thanks,
- Naoya 

> > 
> > This change also affects other callers of dissolve_free_huge_page(),
> > which are cleaned up together.
> > 
> > Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
> > Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> > Cc: <stable@vger.kernel.org> # v4.19+
> > ---
> >  mm/hugetlb.c        | 15 +++++++++------
> >  mm/memory-failure.c |  5 +----
> >  2 files changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
> > index ac843d3..048d071 100644
> > --- v5.2-rc3/mm/hugetlb.c
> > +++ v5.2-rc3_patched/mm/hugetlb.c
> > @@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page *page)
> >  	int rc = -EBUSY;
> >  
> >  	spin_lock(&hugetlb_lock);
> > -	if (PageHuge(page) && !page_count(page)) {
> > +	if (!PageHuge(page)) {
> > +		rc = 0;
> > +		goto out;
> > +	}
> 
> With this early bail out it maintains the functionality when called from
> soft_offline_free_page() for normal pages. For huge page, it continues
> on the previous path.
> 
> > +
> > +	if (!page_count(page)) {
> >  		struct page *head = compound_head(page);
> >  		struct hstate *h = page_hstate(head);
> >  		int nid = page_to_nid(head);
> > @@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> >  
> >  	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
> >  		page = pfn_to_page(pfn);
> > -		if (PageHuge(page) && !page_count(page)) {
> 
> Right. These checks are now redundant.
>

Naoya Horiguchi June 12, 2019, 7:24 a.m. UTC | #4
On Tue, Jun 11, 2019 at 10:16:03AM -0700, Mike Kravetz wrote:
> On 6/10/19 1:18 AM, Naoya Horiguchi wrote:
> > madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
> > for hugepages with overcommitting enabled. That was caused by the suboptimal
> > code in current soft-offline code. See the following part:
> > 
> >     ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> >                             MIGRATE_SYNC, MR_MEMORY_FAILURE);
> >     if (ret) {
> >             ...
> >     } else {
> >             /*
> >              * We set PG_hwpoison only when the migration source hugepage
> >              * was successfully dissolved, because otherwise hwpoisoned
> >              * hugepage remains on free hugepage list, then userspace will
> >              * find it as SIGBUS by allocation failure. That's not expected
> >              * in soft-offlining.
> >              */
> >             ret = dissolve_free_huge_page(page);
> >             if (!ret) {
> >                     if (set_hwpoison_free_buddy_page(page))
> >                             num_poisoned_pages_inc();
> >             }
> >     }
> >     return ret;
> > 
> > Here dissolve_free_huge_page() returns -EBUSY if the migration source page
> > was freed into buddy in migrate_pages(), but even in that case we actually
> > has a chance that set_hwpoison_free_buddy_page() succeeds. So that means
> > current code gives up offlining too early now.
> > 
> > dissolve_free_huge_page() checks that a given hugepage is suitable for
> > dissolving, where we should return success for !PageHuge() case because
> > the given hugepage is considered as already dissolved.
> > 
> > This change also affects other callers of dissolve_free_huge_page(),
> > which are cleaned up together.
> > 
> > Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
> > Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
> > Cc: <stable@vger.kernel.org> # v4.19+
> > ---
> >  mm/hugetlb.c        | 15 +++++++++------
> >  mm/memory-failure.c |  5 +----
> >  2 files changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
> > index ac843d3..048d071 100644
> > --- v5.2-rc3/mm/hugetlb.c
> > +++ v5.2-rc3_patched/mm/hugetlb.c
> > @@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page *page)
> 
> Please update the function description for dissolve_free_huge_page() as
> well.  It currently says, "Returns -EBUSY if the dissolution fails because
> a give page is not a free hugepage" which is no longer true as a result of
> this change.

Thanks for pointing that out, I completely missed it.

> 
> >  	int rc = -EBUSY;
> >  
> >  	spin_lock(&hugetlb_lock);
> > -	if (PageHuge(page) && !page_count(page)) {
> > +	if (!PageHuge(page)) {
> > +		rc = 0;
> > +		goto out;
> > +	}
> > +
> > +	if (!page_count(page)) {
> >  		struct page *head = compound_head(page);
> >  		struct hstate *h = page_hstate(head);
> >  		int nid = page_to_nid(head);
> > @@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> >  
> >  	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
> >  		page = pfn_to_page(pfn);
> > -		if (PageHuge(page) && !page_count(page)) {
> > -			rc = dissolve_free_huge_page(page);
> > -			if (rc)
> > -				break;
> > -		}
> 
> We may want to consider keeping at least the PageHuge(page) check before
> calling dissolve_free_huge_page().  dissolve_free_huge_pages is called as
> part of memory offline processing.  We do not know if the memory to be offlined
> contains huge pages or not.  With your changes, we are taking hugetlb_lock
> on each call to dissolve_free_huge_page just to discover that the page is
> not a huge page.
> 
> You 'could' add a PageHuge(page) check to dissolve_free_huge_page before
> taking the lock.  However, you would need to check again after taking the
> lock.

Right, I'll do this.

What I had in mind when writing this was that I actually don't like
PageHuge() because it's slow (not inlined) and called all over the mm code,
so I'd like to reduce its use where possible.
But I now see that dissolve_free_huge_page() is a relatively rare event
compared to hugepage allocation/free, so dissolve_free_huge_page() should
bear the cost of prechecking PageHuge() instead of speculatively taking
hugetlb_lock and disrupting the hot path.

Thanks,
- Naoya

Patch

diff --git v5.2-rc3/mm/hugetlb.c v5.2-rc3_patched/mm/hugetlb.c
index ac843d3..048d071 100644
--- v5.2-rc3/mm/hugetlb.c
+++ v5.2-rc3_patched/mm/hugetlb.c
@@ -1519,7 +1519,12 @@  int dissolve_free_huge_page(struct page *page)
 	int rc = -EBUSY;
 
 	spin_lock(&hugetlb_lock);
-	if (PageHuge(page) && !page_count(page)) {
+	if (!PageHuge(page)) {
+		rc = 0;
+		goto out;
+	}
+
+	if (!page_count(page)) {
 		struct page *head = compound_head(page);
 		struct hstate *h = page_hstate(head);
 		int nid = page_to_nid(head);
@@ -1564,11 +1569,9 @@  int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
 		page = pfn_to_page(pfn);
-		if (PageHuge(page) && !page_count(page)) {
-			rc = dissolve_free_huge_page(page);
-			if (rc)
-				break;
-		}
+		rc = dissolve_free_huge_page(page);
+		if (rc)
+			break;
 	}
 
 	return rc;
diff --git v5.2-rc3/mm/memory-failure.c v5.2-rc3_patched/mm/memory-failure.c
index 7ea485e..3a83e27 100644
--- v5.2-rc3/mm/memory-failure.c
+++ v5.2-rc3_patched/mm/memory-failure.c
@@ -1859,11 +1859,8 @@  static int soft_offline_in_use_page(struct page *page, int flags)
 
 static int soft_offline_free_page(struct page *page)
 {
-	int rc = 0;
-	struct page *head = compound_head(page);
+	int rc = dissolve_free_huge_page(page);
 
-	if (PageHuge(head))
-		rc = dissolve_free_huge_page(page);
 	if (!rc) {
 		if (set_hwpoison_free_buddy_page(page))
 			num_poisoned_pages_inc();