diff mbox series

[v1,6/6] mm/hwpoison: fix unpoison_memory()

Message ID 20210614021212.223326-7-nao.horiguchi@gmail.com (mailing list archive)
State New
Headers show
Series mm/hwpoison: fix unpoison_memory() | expand

Commit Message

Naoya Horiguchi June 14, 2021, 2:12 a.m. UTC
From: Naoya Horiguchi <naoya.horiguchi@nec.com>

After recent soft-offline rework, error pages can be taken off from
buddy allocator, but the existing unpoison_memory() does not properly
undo the operation.  Moreover, due to the recent change on
__get_hwpoison_page(), get_page_unless_zero() is hardly called for
hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
to fail to grab page refcount) and unpoison just clears PG_hwpoison
without releasing a refcount.  That does not lead to a critical issue
like kernel panic, but unpoisoned pages never get back to buddy (leaked
permanently), which is not good.

To fix this, we need to identify "taken off" pages from other types of
hwpoisoned pages.  We can't use refcount or page flags for this purpose,
so a pseudo flag is defined by hacking ->private field.

Sometimes hwpoisoned pages can be still in-use, where the refcount should
be more than 1, so we can't unpoison them immediately and need to wait
until the all users release their refcount.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h         |  1 +
 include/linux/page-flags.h |  4 ++
 mm/memory-failure.c        | 84 ++++++++++++++++++++++++++++----------
 mm/page_alloc.c            | 19 +++++++++
 4 files changed, 86 insertions(+), 22 deletions(-)

Comments

Ding Hui June 17, 2021, 10 a.m. UTC | #1
On 2021/6/14 10:12, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> After recent soft-offline rework, error pages can be taken off from
> buddy allocator, but the existing unpoison_memory() does not properly
> undo the operation.  Moreover, due to the recent change on
> __get_hwpoison_page(), get_page_unless_zero() is hardly called for
> hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
> to fail to grab page refcount) and unpoison just clears PG_hwpoison
> without releasing a refcount.  That does not lead to a critical issue
> like kernel panic, but unpoisoned pages never get back to buddy (leaked
> permanently), which is not good.

As I mention in [1], I'm not sure about the exactly meaning of "broken" 
in unpoison_memory().

Maybe the misunderstanding is:

I think __get_hwpoison_page() mostly returns one for hwpoisoned page.
In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"), 
page_handle_poison() is introduced, it will add refcount for all 
soft-offlineed hwpoison page.
In memory_failure() for hard-offline,page_ref_inc() called on free page 
too, and for used page, we do not call put_page() after 
get_hwpoison_page() != 0.
So all hwpoisoned page refcount must be great than zero when 
unpoison_memory() if regardless of racy.

Recently I tested loop soft-offline random pages and unpoison them for 
days, it works fine to me. (with bac9c6fa1f92 patched)

[1]: 
https://lore.kernel.org/lkml/6af291a0-41fa-8112-5297-6a4cdf2337b6@sangfor.com.cn/

> 
> To fix this, we need to identify "taken off" pages from other types of
> hwpoisoned pages.  We can't use refcount or page flags for this purpose,
> so a pseudo flag is defined by hacking ->private field.
> 
> Sometimes hwpoisoned pages can be still in-use, where the refcount should
> be more than 1, so we can't unpoison them immediately and need to wait
> until the all users release their refcount.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
HORIGUCHI NAOYA(堀口 直也) June 18, 2021, 8:36 a.m. UTC | #2
On Thu, Jun 17, 2021 at 06:00:21PM +0800, Ding Hui wrote:
> On 2021/6/14 10:12, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > After recent soft-offline rework, error pages can be taken off from
> > buddy allocator, but the existing unpoison_memory() does not properly
> > undo the operation.  Moreover, due to the recent change on
> > __get_hwpoison_page(), get_page_unless_zero() is hardly called for
> > hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
> > to fail to grab page refcount) and unpoison just clears PG_hwpoison
> > without releasing a refcount.  That does not lead to a critical issue
> > like kernel panic, but unpoisoned pages never get back to buddy (leaked
> > permanently), which is not good.
> 
> As I mention in [1], I'm not sure about the exactly meaning of "broken" in
> unpoison_memory().
> 
> Maybe the misunderstanding is:
> 
> I think __get_hwpoison_page() mostly returns one for hwpoisoned page.
> In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"),
> page_handle_poison() is introduced, it will add refcount for all
> soft-offlineed hwpoison page.
> In memory_failure() for hard-offline,page_ref_inc() called on free page
> too, and for used page, we do not call put_page() after get_hwpoison_page()
> != 0.
> So all hwpoisoned page refcount must be great than zero when
> unpoison_memory() if regardless of racy.

Hi, Ding,

Thanks for the comment.  I feel that I failed to define the exact issue in
unpoison.  Maybe I saw and misinterpreted some random error as unpoison's
issue during developing other hwpoison patches, so please don't take serious
my previous wrong word "broken", sorry about that.

Anyway I reconsider how to handle this 6/6, maybe it will be a clear
description of the problem, and will be simplified.

> 
> Recently I tested loop soft-offline random pages and unpoison them for days,
> it works fine to me. (with bac9c6fa1f92 patched)

Thank you for testing,

Thanks,
Naoya Horiguchi
Ding Hui June 19, 2021, 12:22 p.m. UTC | #3
On 2021/6/18 4:36 下午, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Jun 17, 2021 at 06:00:21PM +0800, Ding Hui wrote:
>> On 2021/6/14 10:12, Naoya Horiguchi wrote:
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>
>>> After recent soft-offline rework, error pages can be taken off from
>>> buddy allocator, but the existing unpoison_memory() does not properly
>>> undo the operation.  Moreover, due to the recent change on
>>> __get_hwpoison_page(), get_page_unless_zero() is hardly called for
>>> hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
>>> to fail to grab page refcount) and unpoison just clears PG_hwpoison
>>> without releasing a refcount.  That does not lead to a critical issue
>>> like kernel panic, but unpoisoned pages never get back to buddy (leaked
>>> permanently), which is not good.
>>
>> As I mention in [1], I'm not sure about the exactly meaning of "broken" in
>> unpoison_memory().
>>
>> Maybe the misunderstanding is:
>>
>> I think __get_hwpoison_page() mostly returns one for hwpoisoned page.
>> In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"),
>> page_handle_poison() is introduced, it will add refcount for all
>> soft-offlineed hwpoison page.
>> In memory_failure() for hard-offline,page_ref_inc() called on free page
>> too, and for used page, we do not call put_page() after get_hwpoison_page()
>> != 0.
>> So all hwpoisoned page refcount must be great than zero when
>> unpoison_memory() if regardless of racy.
> 
> Hi, Ding,
> 
> Thanks for the comment.  I feel that I failed to define the exact issue in
> unpoison.  Maybe I saw and misinterpreted some random error as unpoison's
> issue during developing other hwpoison patches, so please don't take serious
> my previous wrong word "broken", sorry about that.
> 
> Anyway I reconsider how to handle this 6/6, maybe it will be a clear
> description of the problem, and will be simplified.
> 
>>
>> Recently I tested loop soft-offline random pages and unpoison them for days,
>> it works fine to me. (with bac9c6fa1f92 patched)
> 
> Thank you for testing,
> 

Hi Naoya,

I'm afraid of my description about testing is ambiguous for others, let 
me clarify that I ran stress soft-offline test case from mce-test 
project (https://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git) for 
days to verify my modify about NR_FREE_PAGES (bac9c6fa1f92), without 
your current patchset, the case is loop soft-offline random pages and 
unpoison them, and it works basic fine to me.
diff mbox series

Patch

diff --git v5.13-rc5/include/linux/mm.h v5.13-rc5_patched/include/linux/mm.h
index f1e3b82e1a93..0baf3fc97415 100644
--- v5.13-rc5/include/linux/mm.h
+++ v5.13-rc5_patched/include/linux/mm.h
@@ -3077,6 +3077,7 @@  enum mf_flags {
 	MF_ACTION_REQUIRED = 1 << 1,
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
+	MF_UNPOISON = 1 << 4,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git v5.13-rc5/include/linux/page-flags.h v5.13-rc5_patched/include/linux/page-flags.h
index 04a34c08e0a6..00bddb3a058d 100644
--- v5.13-rc5/include/linux/page-flags.h
+++ v5.13-rc5_patched/include/linux/page-flags.h
@@ -430,7 +430,11 @@  PAGEFLAG_FALSE(Uncached)
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
+#define MAGIC_HWPOISON	0x4857504f49534f4e
+extern void SetPageHWPoisonTakenOff(struct page *page);
+extern void ClearPageHWPoisonTakenOff(struct page *page);
 extern bool take_page_off_buddy(struct page *page);
+extern void take_page_back_buddy(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
index 0d51067f0129..41b0ef96e2aa 100644
--- v5.13-rc5/mm/memory-failure.c
+++ v5.13-rc5_patched/mm/memory-failure.c
@@ -1105,6 +1105,22 @@  static int page_action(struct page_state *ps, struct page *p,
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 
+static inline bool PageHWPoisonTakenOff(struct page *page)
+{
+	return PageHWPoison(page) && page_private(page) == MAGIC_HWPOISON;
+}
+
+void SetPageHWPoisonTakenOff(struct page *page)
+{
+	set_page_private(page, MAGIC_HWPOISON);
+}
+
+void ClearPageHWPoisonTakenOff(struct page *page)
+{
+	if (PageHWPoison(page))
+		set_page_private(page, 0);
+}
+
 /*
  * Return true if a page type of a given page is supported by hwpoison
  * mechanism (while handling could fail), otherwise false.  This function
@@ -1117,7 +1133,7 @@  static inline bool HWPoisonHandlable(struct page *page)
 		PageSlab(page) || PageTable(page) || PageReserved(page);
 }
 
-static int __get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page, unsigned long flags)
 {
 	struct page *head = compound_head(page);
 	int ret = 0;
@@ -1127,12 +1143,19 @@  static int __get_hwpoison_page(struct page *page)
 	if (hugetlb)
 		return ret;
 
+	/*
+	 * Finding taken-off pages, so it could be true only when called
+	 * from unpoison_memory().
+	 */
+	if (PageHWPoisonTakenOff(head))
+		return -EHWPOISON;
+
 	/*
 	 * This check prevents from calling get_hwpoison_unless_zero()
 	 * for any unsupported type of page in order to reduce the risk of
 	 * unexpected races caused by taking a page refcount.
 	 */
-	if (!HWPoisonHandlable(head))
+	if (!(flags & MF_UNPOISON) && !HWPoisonHandlable(head))
 		return 0;
 
 	if (PageTransHuge(head)) {
@@ -1171,7 +1194,7 @@  static int get_any_page(struct page *p, unsigned long flags)
 
 try_again:
 	if (!count_increased) {
-		ret = __get_hwpoison_page(p);
+		ret = __get_hwpoison_page(p, flags);
 		if (!ret) {
 			if (page_count(p)) {
 				/* We raced with an allocation, retry. */
@@ -1190,6 +1213,8 @@  static int get_any_page(struct page *p, unsigned long flags)
 			if (pass++ < 3)
 				goto try_again;
 			goto out;
+		} else if (ret == -EHWPOISON) {
+			goto out;
 		}
 	}
 
@@ -1233,14 +1258,18 @@  static int get_any_page(struct page *p, unsigned long flags)
  *         1 on success for in-use pages in a well-defined state,
  *         -EIO for pages on which we can not handle memory errors,
  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
- *         operations like allocation and free.
+ *         operations like allocation and free,
+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
  */
 static int get_hwpoison_page(struct page *p, unsigned long flags)
 {
 	int ret;
 
 	zone_pcp_disable(page_zone(p));
-	ret = get_any_page(p, flags);
+	if (flags & MF_UNPOISON)
+		ret = __get_hwpoison_page(p, flags);
+	else
+		ret = get_any_page(p, flags);
 	zone_pcp_enable(page_zone(p));
 
 	return ret;
@@ -1895,6 +1924,17 @@  core_initcall(memory_failure_init);
 		pr_info(fmt, pfn);			\
 })
 
+static inline int clear_page_hwpoison(struct ratelimit_state *rs, struct page *p)
+{
+	if (TestClearPageHWPoison(p)) {
+		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
+				 page_to_pfn(p), rs);
+		num_poisoned_pages_dec();
+		return 1;
+	}
+	return 0;
+}
+
 /**
  * unpoison_memory - Unpoison a previously poisoned page
  * @pfn: Page number of the to be unpoisoned page
@@ -1911,9 +1951,7 @@  int unpoison_memory(unsigned long pfn)
 {
 	struct page *page;
 	struct page *p;
-	int freeit = 0;
 	int ret = 0;
-	unsigned long flags = 0;
 	static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
 					DEFAULT_RATELIMIT_BURST);
 
@@ -1949,24 +1987,26 @@  int unpoison_memory(unsigned long pfn)
 		goto unlock_mutex;
 	}
 
-	if (!get_hwpoison_page(p, flags)) {
-		if (TestClearPageHWPoison(p))
-			num_poisoned_pages_dec();
-		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
-				 pfn, &unpoison_rs);
+	ret = get_hwpoison_page(p, MF_UNPOISON);
+	if (!ret) {
+		clear_page_hwpoison(&unpoison_rs, p);
 		goto unlock_mutex;
-	}
-
-	if (TestClearPageHWPoison(page)) {
-		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
-				 pfn, &unpoison_rs);
-		num_poisoned_pages_dec();
-		freeit = 1;
-	}
+	} else if (ret < 0) {
+		if (ret == -EHWPOISON) {
+			clear_page_hwpoison(&unpoison_rs, p);
+			take_page_back_buddy(p);
+			ret = 0;
+		} else
+			unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
+					 pfn, &unpoison_rs);
+		goto unlock_mutex;
+	} else {
+		int freeit = clear_page_hwpoison(&unpoison_rs, p);
 
-	put_page(page);
-	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
 		put_page(page);
+		if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
+			put_page(page);
+	}
 
 unlock_mutex:
 	mutex_unlock(&mf_mutex);
diff --git v5.13-rc5/mm/page_alloc.c v5.13-rc5_patched/mm/page_alloc.c
index d1f5de1c1283..325e91d92b7e 100644
--- v5.13-rc5/mm/page_alloc.c
+++ v5.13-rc5_patched/mm/page_alloc.c
@@ -9158,6 +9158,7 @@  bool take_page_off_buddy(struct page *page)
 			del_page_from_free_list(page_head, zone, page_order);
 			break_down_buddy_pages(zone, page_head, page, 0,
 						page_order, migratetype);
+			SetPageHWPoisonTakenOff(page);
 			if (!is_migrate_isolate(migratetype))
 				__mod_zone_freepage_state(zone, -1, migratetype);
 			ret = true;
@@ -9169,4 +9170,22 @@  bool take_page_off_buddy(struct page *page)
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
+
+/*
+ * Cancel takeoff done by take_page_off_buddy().
+ */
+void take_page_back_buddy(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long flags;
+	int migratetype = get_pfnblock_migratetype(page, pfn);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	if (put_page_testzero(page)) {
+		ClearPageHWPoisonTakenOff(page);
+		__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
 #endif