[v6,08/12] mm,hwpoison: Rework soft offline for in-use pages

Message ID 20200806184923.7007-9-nao.horiguchi@gmail.com (mailing list archive)
State New, archived
Series HWPOISON: soft offline rework

Commit Message

Naoya Horiguchi Aug. 6, 2020, 6:49 p.m. UTC
From: Oscar Salvador <osalvador@suse.de>

This patch changes the way we set and handle in-use poisoned pages.
Until now, poisoned pages were released to the buddy allocator, trusting
that the checks that take place prior to handing out the page would act
as a safety net and would skip that page.

This has proved to be wrong, as there are pfn walkers out there, like
compaction, that only care that the page is PageBuddy and sits in a
freelist.
Although this might not be the only user, having poisoned pages
in the buddy allocator seems a bad idea, as we should only have
free pages that are ready and meant to be used as such.
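
As a rough userspace illustration of the problem (a sketch only, not
kernel code: struct toy_page and toy_walker_pick are made-up names),
a walker that only tests the buddy bit will happily pick up a poisoned
page that was released to the freelists:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the two bits this example needs. */
struct toy_page {
	bool buddy;	/* models PageBuddy: the page sits in a freelist */
	bool hwpoison;	/* models PageHWPoison */
};

/*
 * Models a pfn walker that only asks "is this page free in the buddy?"
 * and never looks at the poison flag.
 * Returns the index of the first page it would grab, or -1 if none.
 */
static int toy_walker_pick(const struct toy_page *pages, size_t n)
{
	assert(pages != NULL);

	for (size_t i = 0; i < n; i++)
		if (pages[i].buddy)
			return (int)i;
	return -1;
}
```

With a poisoned page marked as buddy, the walker returns its index; once
the poisoned page is kept out of the freelists, the walker can only see
healthy pages.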

Before explaining the approach taken, let us break down the kinds
of pages we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

  Either way, do not call put_page directly from those paths.
  Instead, we keep the page and send it to page_handle_poison to perform
  the right handling.

  page_handle_poison sets the HWPoison flag and does the last put_page.
  This call to put_page is mainly to be able to call __page_cache_release,
  since this function is not exported.

  Down the chain, we placed a check for HWPoison pages in
  free_pages_prepare that just skips any poisoned page, so those pages
  do not end up in any pcplist/freelist.

  After that, we set the refcount on the page to 1 and we increment
  the poisoned pages counter.
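
  The sequence above can be sketched as a small userspace model. This is
  only an illustration under simplifying assumptions: every toy_* name is
  invented here, the freelist is a single bool, and the caller is assumed
  to hold the last reference.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	int refcount;
	bool hwpoison;
	bool in_freelist;
};

static long toy_num_poisoned_pages;

/* Models the check in free_pages_prepare: false means "do not free". */
static bool toy_free_pages_prepare(const struct toy_page *page)
{
	return !page->hwpoison;
}

/* Models put_page: the last reference tries to hand the page back. */
static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0 && toy_free_pages_prepare(page))
		page->in_freelist = true;
}

/* Models the handling described above for an in-use page. */
static void toy_handle_poison(struct toy_page *page)
{
	page->hwpoison = true;		/* flag the page */
	toy_put_page(page);		/* last put: the free path skips it */
	page->refcount++;		/* 0 -> 1, keeps the page pinned */
	toy_num_poisoned_pages++;
}
```

  A healthy page dropped with toy_put_page lands in the freelist, while
  a page that went through toy_handle_poison ends up flagged, pinned with
  a refcount of 1, and outside of any freelist.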

  We could do as we do for free pages:
  1) wait until the page hits buddy's freelists
  2) take it off
  3) flag it

  The problem is that we could race with an allocation, so by the time we
  want to take the page off the buddy, the page has already been
  allocated and we cannot soft-offline it.
  This is not fatal of course, but it is better if we can close the race,
  as doing so does not require a lot of code.
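
  The race in the alternative scheme can be replayed deterministically in
  a toy model (a sketch with invented names, where the "allocator" and
  the "soft-offline" side are just two functions called in a chosen
  order):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool in_freelist;
	bool hwpoison;
};

/* Models an allocation: takes the page if it sits in the freelist. */
static bool toy_alloc(struct toy_page *page)
{
	assert(page != NULL);
	if (!page->in_freelist)
		return false;
	page->in_freelist = false;
	return true;
}

/*
 * Models steps 2) and 3): take the page off the freelist, then flag it.
 * Fails if somebody allocated the page after it was freed.
 */
static bool toy_take_off_and_flag(struct toy_page *page)
{
	assert(page != NULL);
	if (!page->in_freelist)
		return false;		/* lost the race with toy_alloc */
	page->in_freelist = false;
	page->hwpoison = true;
	return true;
}
```

  If toy_alloc runs between the free and toy_take_off_and_flag, the
  latter fails and the page stays unflagged, which is the window the
  approach taken in this patch avoids.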

* Hugetlb pages

  - We isolate-and-migrate them

  After the migration has been successful, we call dissolve_free_huge_page,
  and we set HWPoison on the page if we succeed.
  Hugetlb has a slightly different handling though.

  While for non-hugetlb pages we cared about closing the race with an
  allocation, doing so for hugetlb pages requires quite some additional
  code (we would need to hook in free_huge_page and some other places).
  So I decided to not make the code overly complicated and just fail
  normally if the page was allocated in the meantime.
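
  A minimal model of the hugetlb path, assuming the migration already
  succeeded (toy_hpage and both toy_* functions are hypothetical
  stand-ins, not the kernel implementation):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_hpage {
	bool free;	/* still free after the migration */
	bool hwpoison;
};

/* Stand-in for dissolving: 0 on success, -EBUSY if the page is in use. */
static int toy_dissolve_free_huge_page(const struct toy_hpage *hp)
{
	assert(hp != NULL);
	return hp->free ? 0 : -EBUSY;
}

/* Flag the page only if dissolving worked; otherwise just fail. */
static int toy_soft_offline_huge_page(struct toy_hpage *hp)
{
	if (toy_dissolve_free_huge_page(hp))
		return -EBUSY;	/* allocated in the meantime, give up */
	hp->free = false;
	hp->hwpoison = true;
	return 0;
}
```

  Unlike the normal-page path, nothing here prevents the page from being
  allocated again before the dissolve; the model simply reports -EBUSY in
  that case and leaves the page unflagged.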

Because of the way we now handle in-use pages, we no longer need the
put-as-isolation-migratetype dance, which was guarding against poisoned
pages ending up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/page-flags.h |  5 -----
 mm/memory-failure.c        | 45 ++++++++++++++------------------------
 mm/migrate.c               | 11 +++-------
 mm/page_alloc.c            | 28 ------------------------
 4 files changed, 19 insertions(+), 70 deletions(-)

Comments

Oscar Salvador Sept. 18, 2020, 7:58 a.m. UTC | #1
Hi Andrew,

I just found out yesterday that the patchset Naoya sent has diverged 
from mine in some aspects that lead to some bugs [1].
This was due to a misunderstanding so no blame here.
So, patch#8 and patch#9 need a little tweak [2].

I was wondering which of these you prefer:

1) I paste the chunks that need to be changed in the offending patches 
(this and patch#9)
2) I just send them as standalone patches and you apply them on top

I am asking this because although patches had hit -next, we might still 
have time to change them.
If not, do not worry, I will send them as standalone.

[1] https://patchwork.kernel.org/comment/23622881/
[2] https://patchwork.kernel.org/comment/23622985/

I will go ahead and paste the chunks here, in case you lean towards 
option#1:

Patch#8:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f68cb5e3b320..4ffaaa5c2603 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);

  static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
  {
-	if (release) {
-		put_page(page);
-		drain_all_pages(page_zone(page));
-	}
-
  	if (hugepage_or_freepage) {
  		/*
  		 * Doing this check for free pages is also fine since dissolve_free_huge_page
@@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
  	}

  	SetPageHWPoison(page);
+
+	if (release) {
+		put_page(page);
+		drain_all_pages(page_zone(page));
+	}
+
  	page_ref_inc(page);
  	num_poisoned_pages_inc();
  	return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..8a6ab05f074c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct page *page,

  	trace_mm_page_free(page, order);

+	if (unlikely(PageHWPoison(page)) && !order) {
+		/*
+		 * Untie memcg state and reset page's owner
+		 */
+		if (memcg_kmem_enabled() && PageKmemcg(page))
+			__memcg_kmem_uncharge_page(page, order);
+		reset_page_owner(page, order);
+		return false;
+	}
+
  	/*
  	 * Check tail pages before head page information is cleared to
  	 * avoid checking PageCompound for order-0 pag
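
To illustrate why the chunk above moves the release after
SetPageHWPoison, here is a toy userspace comparison of the two
orderings (a sketch only; the toy_* names are invented and the free
path is reduced to a single flag check):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	int refcount;
	bool hwpoison;
	bool in_freelist;
};

/* Models put_page with a poison check in the free path. */
static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0 && !page->hwpoison)
		page->in_freelist = true;
}

/*
 * flag_first == true:  set the flag, then release (order after the fix).
 * flag_first == false: release, then set the flag (order before the fix).
 */
static void toy_poison(struct toy_page *page, bool flag_first)
{
	if (flag_first) {
		page->hwpoison = true;
		toy_put_page(page);
	} else {
		toy_put_page(page);
		page->hwpoison = true;
	}
}
```

With the release done first, the free path cannot see the flag yet, so
the page ends up both poisoned and in the freelist; with the flag set
first, the free path skips it.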

Patch#9:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c3b96ca5c86d..a1bc573d91d5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1835,7 +1835,7 @@ static int __soft_offline_page(struct page *page)
  		if (!ret) {
  			bool release = !huge;

-			if (!page_handle_poison(page, true, release))
+			if (!page_handle_poison(page, huge, release))
  				ret = -EBUSY;
  		} else {
  			if (!list_empty(&pagelist)

Thanks and sorry for the inconvenience.
Andrew Morton Sept. 19, 2020, 12:23 a.m. UTC | #2
On Fri, 18 Sep 2020 09:58:22 +0200 osalvador@suse.de wrote:

> I just found out yesterday that the patchset Naoya sent has diverged 
> from mine in some aspects that lead to some bugs [1].
> This was due to a misunderstanding so no blame here.
> So, patch#8 and patch#9 need a little tweak [2].
> 
> I was wondering what do you prefer?

Well.  I (and I suspect the rest of the world) have lost track of
what's going on here.

So please let's have a full resend of the whole series?
Oscar Salvador Sept. 19, 2020, 8:26 a.m. UTC | #3
On 2020-09-19 02:23, Andrew Morton wrote:
> On Fri, 18 Sep 2020 09:58:22 +0200 osalvador@suse.de wrote:
> 
>> I just found out yesterday that the patchset Naoya sent has diverged
>> from mine in some aspects that lead to some bugs [1].
>> This was due to a misunderstanding so no blame here.
>> So, patch#8 and patch#9 need a little tweak [2].
>> 
>> I was wondering what do you prefer?
> 
> Well.  I (and I suspect the rest of the world) have lost track of
> what's going on here.
> 
> So please let's have a full resend of the whole series?

Sure, I will resend a new version with all changes squashed into it on 
Monday.

Patch

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/page-flags.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/page-flags.h
index 9fa5d4e2d69a..d1df51ed6eeb 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/page-flags.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/page-flags.h
@@ -422,14 +422,9 @@  PAGEFLAG_FALSE(Uncached)
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
-extern bool set_hwpoison_free_buddy_page(struct page *page);
 extern bool take_page_off_buddy(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	return 0;
-}
 #define __PG_HWPOISON 0
 #endif
 
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 0e619012e050..95bf8aa44a9a 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -65,8 +65,12 @@  int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
-static void page_handle_poison(struct page *page)
+static void page_handle_poison(struct page *page, bool release)
 {
+	if (release) {
+		put_page(page);
+		drain_all_pages(page_zone(page));
+	}
 	SetPageHWPoison(page);
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
@@ -1756,19 +1760,13 @@  static int soft_offline_huge_page(struct page *page, int flags)
 			ret = -EIO;
 	} else {
 		/*
-		 * We set PG_hwpoison only when the migration source hugepage
-		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
+		 * We set PG_hwpoison only when we were able to take the page
+		 * off the buddy.
 		 */
-		ret = dissolve_free_huge_page(page);
-		if (!ret) {
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-			else
-				ret = -EBUSY;
-		}
+		if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
+			page_handle_poison(page, false);
+		else
+			ret = -EBUSY;
 	}
 	return ret;
 }
@@ -1807,10 +1805,8 @@  static int __soft_offline_page(struct page *page, int flags)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
+		page_handle_poison(page, true);
 		return 0;
 	}
 
@@ -1841,7 +1837,9 @@  static int __soft_offline_page(struct page *page, int flags)
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
 			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
-		if (ret) {
+		if (!ret) {
+			page_handle_poison(page, true);
+		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
@@ -1860,27 +1858,16 @@  static int __soft_offline_page(struct page *page, int flags)
 static int soft_offline_in_use_page(struct page *page, int flags)
 {
 	int ret;
-	int mt;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	/*
-	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-	 * to free list immediately (not via pcplist) when released after
-	 * successful page migration. Otherwise we can't guarantee that the
-	 * page is really free after put_page() returns, so
-	 * set_hwpoison_free_buddy_page() highly likely fails.
-	 */
-	mt = get_pageblock_migratetype(page);
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page, flags);
 	else
 		ret = __soft_offline_page(page, flags);
-	set_pageblock_migratetype(page, mt);
 	return ret;
 }
 
@@ -1889,7 +1876,7 @@  static int soft_offline_free_page(struct page *page)
 	int rc = -EBUSY;
 
 	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
-		page_handle_poison(page);
+		page_handle_poison(page, false);
 		rc = 0;
 	}
 
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/migrate.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/migrate.c
index 2c809ffcf0e1..d7a9379c343b 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/migrate.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/migrate.c
@@ -1222,16 +1222,11 @@  static int unmap_and_move(new_page_t get_new_page,
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason != MR_MEMORY_FAILURE)
 			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
+			 * We release the page in page_handle_poison.
 			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
+			put_page(page);
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/page_alloc.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/page_alloc.c
index aab89f7db4ac..e4896e674594 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/page_alloc.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/page_alloc.c
@@ -8843,32 +8843,4 @@  bool take_page_off_buddy(struct page *page)
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
-
-/*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
- */
-bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
-	unsigned int order;
-	bool hwpoisoned = false;
-
-	spin_lock_irqsave(&zone->lock, flags);
-	for (order = 0; order < MAX_ORDER; order++) {
-		struct page *page_head = page - (pfn & ((1 << order) - 1));
-
-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	return hwpoisoned;
-}
 #endif