[v5,00/16] HWPOISON: soft offline rework

Message ID 20200731122112.11263-1-nao.horiguchi@gmail.com (mailing list archive)

Message

Naoya Horiguchi July 31, 2020, 12:20 p.m. UTC
This patchset is the latest version of the soft offline rework patchset
targeted for v5.9.

The main focus of this series is to stabilize soft offline.  Historically,
soft offlined pages have suffered from race conditions because PageHWPoison
is used a little too aggressively, which (directly or indirectly) invades
other mm code that cares little about hwpoison.  This results in unexpected
behavior or kernel panics, which is very far from soft offline's "do not
disturb userspace or other kernel components" policy.

The main point of this change set is to contain the target page "via the
buddy allocator": we first free the target page as we do for normal pages,
and take it off the buddy free list only once we confirm that it has
actually landed there.  There is of course a race window with page
allocation, but that's fine, because it means someone really wants that
page and the page still works, so soft offline can happily give up.
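
To illustrate the idea, here is a rough sketch of that containment step (not
the actual patch; it assumes a helper along the lines of take_page_off_buddy()
that pulls a page off the buddy free lists, and it omits locking, hugepage
handling and the poisoned-page accounting):

/*
 * Illustrative sketch only, not the code added by this series.
 */
static int soft_offline_free_page_sketch(struct page *page)
{
	int rc = -EBUSY;

	/*
	 * The target page has already been freed like any normal page.
	 * Only if it is still sitting on a buddy free list do we pull it
	 * out and mark it poisoned.  If somebody allocated it in the
	 * meantime, the page evidently still works, so just give up.
	 */
	if (take_page_off_buddy(page)) {
		SetPageHWPoison(page);
		rc = 0;
	}

	return rc;
}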

v4 from Oscar tries to handle the race around reallocation, but that part
still seems to be a work in progress, so I decided to leave it out of the
changes targeted for v5.9.  Thank you for your contribution, Oscar.

The issue reported by Qian Cai is fixed by patch 16/16.

This patchset is based on v5.8-rc7-mmotm-2020-07-27-18-18, but I applied
this series after reverting the previous version.
https://github.com/Naoya-Horiguchi/linux/commits/soft-offline-rework.v5
shows more precisely what I did.

Any other comment/suggestion/help would be appreciated.

Thanks,
Naoya Horiguchi
---
Previous versions:
  v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
  v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
  v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
  v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
---
Summary:

Naoya Horiguchi (8):
      mm,hwpoison: cleanup unused PageHuge() check
      mm, hwpoison: remove recalculating hpage
      mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
      mm,hwpoison-inject: don't pin for hwpoison_filter
      mm,hwpoison: remove MF_COUNT_INCREASED
      mm,hwpoison: remove flag argument from soft offline functions
      mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
      mm,hwpoison: double-check page count in __get_any_page()

Oscar Salvador (8):
      mm,madvise: Refactor madvise_inject_error
      mm,hwpoison: Un-export get_hwpoison_page and make it static
      mm,hwpoison: Kill put_hwpoison_page
      mm,hwpoison: Unify THP handling for hard and soft offline
      mm,hwpoison: Rework soft offline for free pages
      mm,hwpoison: Rework soft offline for in-use pages
      mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
      mm,hwpoison: Return 0 if the page is already poisoned in soft-offline

 drivers/base/memory.c      |   2 +-
 include/linux/mm.h         |  12 +-
 include/linux/page-flags.h |   6 +-
 include/ras/ras_event.h    |   3 +
 mm/hwpoison-inject.c       |  18 +--
 mm/madvise.c               |  39 +++---
 mm/memory-failure.c        | 334 ++++++++++++++++++++-------------------------
 mm/migrate.c               |  11 +-
 mm/page_alloc.c            |  60 ++++++--
 9 files changed, 233 insertions(+), 252 deletions(-)

Comments

HORIGUCHI NAOYA(堀口 直也) Aug. 3, 2020, 1:36 p.m. UTC | #1
Hello,

On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote:
> On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > [...]
> > 
> > The issue reported by Qian Cai is fixed by patch 16/16.
> 
> I am still getting EIO everywhere on next-20200803 (which includes this v5).
> 
> # ./random 1
> - start: migrate_huge_offline
> - use NUMA nodes 0,8.
> - mmap and free 8388608 bytes hugepages on node 0
> - mmap and free 8388608 bytes hugepages on node 8
> madvise: Input/output error
> 
> From the serial console,
> 
> [  637.164222][ T8357] soft offline: 0x118ee0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [  637.164890][ T8357] Soft offlining pfn 0x20001380 at process virtual address 0x7fff9f000000
> [  637.165422][ T8357] Soft offlining pfn 0x3ba00 at process virtual address 0x7fff9f200000
> [  637.166409][ T8357] Soft offlining pfn 0x201914a0 at process virtual address 0x7fff9f000000
> [  637.166833][ T8357] Soft offlining pfn 0x12b9a0 at process virtual address 0x7fff9f200000
> [  637.168044][ T8357] Soft offlining pfn 0x1abb60 at process virtual address 0x7fff9f000000
> [  637.168557][ T8357] Soft offlining pfn 0x20014820 at process virtual address 0x7fff9f200000
> [  637.169493][ T8357] Soft offlining pfn 0x119720 at process virtual address 0x7fff9f000000
> [  637.169603][ T8357] soft offline: 0x119720: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [  637.169756][ T8357] Soft offlining pfn 0x118ee0 at process virtual address 0x7fff9f200000
> [  637.170653][ T8357] Soft offlining pfn 0x200e81e0 at process virtual address 0x7fff9f000000
> [  637.171067][ T8357] Soft offlining pfn 0x201c5f60 at process virtual address 0x7fff9f200000
> [  637.172101][ T8357] Soft offlining pfn 0x201c8f00 at process virtual address 0x7fff9f000000
> [  637.172241][ T8357] __get_any_page: 0x201c8f00: unknown zero refcount page type 87fff8000000000

I might have misjudged in skipping the following patch, sorry about that.
Could you try with it applied?

---
From eafe6fde94cd15e67631540f1b2b000b6e33a650 Mon Sep 17 00:00:00 2001
From: Oscar Salvador <osalvador@suse.de>
Date: Mon, 3 Aug 2020 22:25:10 +0900
Subject: [PATCH] mm,hwpoison: Drain pcplists before bailing out for non-buddy
 zero-refcount page

A page with a refcount of 0 and !PageBuddy could perfectly well be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we handle pcppages from neither the hard-offline nor the soft-offline
path.

Fix this by draining the pcplists whenever we find this kind of page and
retrying the check.  The pcplists might have been spilled into the buddy
allocator, in which case we can handle the page.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b2753ce2b85b..02be529445c0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -949,13 +949,13 @@ static int page_action(struct page_state *ps, struct page *p,
 }
 
 /**
- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
  * @page:	raw error page (hit by memory error)
  *
  * Return: return 0 if failed to grab the refcount, otherwise true (some
  * non-zero value.)
  */
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
@@ -985,6 +985,28 @@ static int get_hwpoison_page(struct page *page)
 	return 0;
 }
 
+static int get_hwpoison_page(struct page *p)
+{
+	int ret;
+	bool drained = false;
+
+retry:
+	ret = __get_hwpoison_page(p);
+	if (!ret) {
+		if (!is_free_buddy_page(p) && !page_count(p) && !drained) {
+			/*
+			 * The page might be in a pcplist, so try to drain
+			 * those and see if we are lucky.
+			 */
+			drain_all_pages(page_zone(p));
+			drained = true;
+			goto retry;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
@@ -1683,10 +1705,6 @@ static int __get_any_page(struct page *p, unsigned long pfn)
 {
 	int ret;
 
-	/*
-	 * When the target page is a free hugepage, just remove it
-	 * from free hugepage list.
-	 */
 	if (!get_hwpoison_page(p)) {
 		if (PageHuge(p)) {
 			pr_info("%s: %#lx free huge page\n", __func__, pfn);
HORIGUCHI NAOYA(堀口 直也) Aug. 4, 2020, 1:16 a.m. UTC | #2
On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote:
> On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > [...]
> 
> There is another issue with this patchset (with and without the patch [1]).
> 
> [1] https://lore.kernel.org/lkml/20200803133657.GA13307@hori.linux.bs1.fc.nec.co.jp/
> 
> Arm64 using 512M-size hugepages starts to fail allocations prematurely.
> 
> # ./random 1
> - start: migrate_huge_offline
> - use NUMA nodes 0,1.
> - mmap and free 2147483648 bytes hugepages on node 0
> - mmap and free 2147483648 bytes hugepages on node 1
> madvise: Cannot allocate memory
> 
> [  284.388061][ T3706] soft offline: 0x956000: hugepage isolation failed: 0, page count 2, type 17ffff80001000e (referenced|uptodate|dirty|head)
> [  284.400777][ T3706] Soft offlining pfn 0x8e000 at process virtual address 0xffff80000000
> [  284.893412][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
> [  284.901539][ T3706] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
> [  284.914129][ T3706] Soft offlining pfn 0x8c000 at process virtual address 0xffff80000000
> [  285.433497][ T3706] Soft offlining pfn 0x88000 at process virtual address 0xffff60000000
> [  285.720377][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff80000000
> [  286.281620][ T3706] Soft offlining pfn 0xa000 at process virtual address 0xffff60000000
> [  286.290065][ T3706] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)

I think that this is due to the lack of contiguous memory.
This test program iterates soft offlining of hugepages many times, so
eventually one page in every 512MB will be removed from the buddy allocator,
and then we can't allocate hugepages any more even if we have enough free
pages.  This is not good for heavy hugepage users, but it should be the
intended behavior.

It seems that random.c calls madvise(MADV_SOFT_OFFLINE) for 2 hugepages
and iterates that 1000 (== NR_LOOP) times, so if the system doesn't have
enough memory to cover the range of 2000 hugepages (2000 * 512MB, roughly
1000GB on the Arm64 system), this ENOMEM should reproduce as expected.
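
For reference, the reproducer loop described above looks roughly like this
(an illustrative sketch only, not the actual random.c; the hugepage size,
loop count and error handling are assumptions, and MADV_SOFT_OFFLINE needs
CAP_SYS_ADMIN plus reserved hugepages):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101
#endif

#define HPAGE_SIZE	(512UL << 20)	/* assumes 512MB default hugepages */
#define NR_LOOP		1000

int main(void)
{
	for (int i = 0; i < NR_LOOP; i++) {
		void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
			       -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0, HPAGE_SIZE);	/* fault the hugepage in */
		/*
		 * Each successful soft offline takes one hugepage worth of
		 * memory out of the buddy allocator for good.
		 */
		if (madvise(p, HPAGE_SIZE, MADV_SOFT_OFFLINE))
			perror("madvise");
		munmap(p, HPAGE_SIZE);
	}
	return 0;
}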

> 
> Reverting this patchset and its dependency patchset [2] (reverting the
> dependency alone did not help) fixed it,

But it's still not clear to me why this was not visible before this
patchset, so I need to look into it further.

Thanks,
Naoya Horiguchi

> 
> # ./random 1
> - start: migrate_huge_offline
> - use NUMA nodes 0,1.
> - mmap and free 2147483648 bytes hugepages on node 0
> - mmap and free 2147483648 bytes hugepages on node 1
> - pass: mmap_offline_node_huge
> 
> [2] https://lore.kernel.org/linux-mm/1594622517-20681-1-git-send-email-iamjoonsoo.kim@lge.com/ 
> 
> > [...]
Qian Cai Aug. 4, 2020, 1:49 a.m. UTC | #3
On Tue, Aug 04, 2020 at 01:16:45AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote:
> > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > > [...]
> > 
> > There is another issue with this patchset (with and without the patch [1]).
> > 
> > [1] https://lore.kernel.org/lkml/20200803133657.GA13307@hori.linux.bs1.fc.nec.co.jp/
> > 
> > Arm64 using 512M-size hugepages starts to fail allocations prematurely.
> > 
> > # ./random 1
> > - start: migrate_huge_offline
> > - use NUMA nodes 0,1.
> > - mmap and free 2147483648 bytes hugepages on node 0
> > - mmap and free 2147483648 bytes hugepages on node 1
> > madvise: Cannot allocate memory
> > 
> > [  284.388061][ T3706] soft offline: 0x956000: hugepage isolation failed: 0, page count 2, type 17ffff80001000e (referenced|uptodate|dirty|head)
> > [  284.400777][ T3706] Soft offlining pfn 0x8e000 at process virtual address 0xffff80000000
> > [  284.893412][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
> > [  284.901539][ T3706] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
> > [  284.914129][ T3706] Soft offlining pfn 0x8c000 at process virtual address 0xffff80000000
> > [  285.433497][ T3706] Soft offlining pfn 0x88000 at process virtual address 0xffff60000000
> > [  285.720377][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff80000000
> > [  286.281620][ T3706] Soft offlining pfn 0xa000 at process virtual address 0xffff60000000
> > [  286.290065][ T3706] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)
> 
> I think that this is due to the lack of contiguous memory.
> This test program iterates soft offlining of hugepages many times, so
> eventually one page in every 512MB will be removed from the buddy allocator,
> and then we can't allocate hugepages any more even if we have enough free
> pages.  This is not good for heavy hugepage users, but it should be the
> intended behavior.
> 
> It seems that random.c calls madvise(MADV_SOFT_OFFLINE) for 2 hugepages
> and iterates that 1000 (== NR_LOOP) times, so if the system doesn't have
> enough memory to cover the range of 2000 hugepages (2000 * 512MB, roughly
> 1000GB on the Arm64 system), this ENOMEM should reproduce as expected.

Well, each iteration will mmap/munmap, so there should be no leaking. 

https://gitlab.com/cailca/linux-mm/-/blob/master/random.c#L376

It also seems to me that madvise(MADV_SOFT_OFFLINE) does start to fragment
memory somehow, because after this "madvise: Cannot allocate memory" happened,
I immediately checked /proc/meminfo and found no hugepage usage at all.

> 
> > 
> > Reverting this patchset and its dependency patchset [2] (reverting the
> > dependency alone did not help) fixed it,
> 
> But it's still not clear to me why this was not visible before this
> patchset, so I need more check for it.
> 
> Thanks,
> Naoya Horiguchi
> 
> > 
> > # ./random 1
> > - start: migrate_huge_offline
> > - use NUMA nodes 0,1.
> > - mmap and free 2147483648 bytes hugepages on node 0
> > - mmap and free 2147483648 bytes hugepages on node 1
> > - pass: mmap_offline_node_huge
> > 
> > [2] https://lore.kernel.org/linux-mm/1594622517-20681-1-git-send-email-iamjoonsoo.kim@lge.com/ 
> > 
> > > 
> > > Thanks,
> > > Naoya Horiguchi
> > > ---
> > > Previous versions:
> > >   v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> > >   v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
> > >   v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
> > >   v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
> > > ---
> > > Summary:
> > > 
> > > Naoya Horiguchi (8):
> > >       mm,hwpoison: cleanup unused PageHuge() check
> > >       mm, hwpoison: remove recalculating hpage
> > >       mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
> > >       mm,hwpoison-inject: don't pin for hwpoison_filter
> > >       mm,hwpoison: remove MF_COUNT_INCREASED
> > >       mm,hwpoison: remove flag argument from soft offline functions
> > >       mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
> > >       mm,hwpoison: double-check page count in __get_any_page()
> > > 
> > > Oscar Salvador (8):
> > >       mm,madvise: Refactor madvise_inject_error
> > >       mm,hwpoison: Un-export get_hwpoison_page and make it static
> > >       mm,hwpoison: Kill put_hwpoison_page
> > >       mm,hwpoison: Unify THP handling for hard and soft offline
> > >       mm,hwpoison: Rework soft offline for free pages
> > >       mm,hwpoison: Rework soft offline for in-use pages
> > >       mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
> > >       mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
> > > 
> > >  drivers/base/memory.c      |   2 +-
> > >  include/linux/mm.h         |  12 +-
> > >  include/linux/page-flags.h |   6 +-
> > >  include/ras/ras_event.h    |   3 +
> > >  mm/hwpoison-inject.c       |  18 +--
> > >  mm/madvise.c               |  39 +++---
> > >  mm/memory-failure.c        | 334 ++++++++++++++++++++-------------------------
> > >  mm/migrate.c               |  11 +-
> > >  mm/page_alloc.c            |  60 ++++++--
> > >  9 files changed, 233 insertions(+), 252 deletions(-)
> >
Oscar Salvador Aug. 4, 2020, 8:13 a.m. UTC | #4
On 2020-08-04 03:49, Qian Cai wrote:
> 
> Well, each iteration will mmap/munmap, so there should be no leaking.
> 
> https://gitlab.com/cailca/linux-mm/-/blob/master/random.c#L376
> 
> It also seems to me that madvise(MADV_SOFT_OFFLINE) does start to fragment
> memory somehow, because after this "madvise: Cannot allocate memory"
> happened, I immediately checked /proc/meminfo and found no hugepage usage
> at all.

Unfortunately I will be off for a week, but out of curiosity, could you
try out the tree below [1] and see if you still see those issues?

Thanks for your time

[1] https://github.com/leberus/linux-mm-1/tree/hwpoison
HORIGUCHI NAOYA(堀口 直也) Aug. 5, 2020, 8:43 p.m. UTC | #5
On Mon, Aug 03, 2020 at 11:19:09AM -0400, Qian Cai wrote:
> On Mon, Aug 03, 2020 at 01:36:58PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hello,
> > 
> > On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote:
> > > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > > > [...]
> > > > 
> > > > The issue reported by Qian Cai is fixed by patch 16/16.
> > > 
> > > I am still getting EIO everywhere on next-20200803 (which includes this v5).
> > > 
> > > [...]
> > 
> > I might have misjudged in skipping the following patch, sorry about that.
> > Could you try with it applied?
> 
> Still getting EIO after applying this patch.
> 
> [ 1215.499030][T88982] soft offline: 0x201bdc20: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.499775][T88982] Soft offlining pfn 0x201bdc20 at process virtual address 0x7fff91a00000
> [ 1215.500189][T88982] Soft offlining pfn 0x201c19c0 at process virtual address 0x7fff91c00000
> [ 1215.500297][T88982] soft offline: 0x201c19c0: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.500982][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91a00000
> [ 1215.501086][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.501237][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91c00000
> [ 1215.501355][T88982] soft offline: 0x1f4520: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.502196][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91a00000
> [ 1215.502573][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91c00000
> [ 1215.502687][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.503245][T88982] Soft offlining pfn 0x201c3cc0 at process virtual address 0x7fff91a00000
> [ 1215.503594][T88982] Soft offlining pfn 0x201c3ce0 at process virtual address 0x7fff91c00000
> [ 1215.503755][T88982] __get_any_page: 0x201c3ce0: unknown zero refcount page type 87fff8000000000

So I'm leaning toward cancelling patch 3/16, where madvise_inject_error()
releases the page refcount before calling soft_offline_page().  That should
solve the issue because the refcount of the target page then never becomes
zero.  Patches 4/16, 8/16 and 9/16 depend on it, so they should be removed
too.  I'll resubmit the series later, but I've prepared the delta below
(based on next-20200805), so it would be appreciated if you could run
random.c with it to confirm the fix.

Thanks,
Naoya Horiguchi
---
From 533090c0869aeca88d8ff14d27c3cef8fc060ccd Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Thu, 6 Aug 2020 01:29:08 +0900
Subject: [PATCH] (for testing) revert change around removing
 MF_COUNT_INCREASED

revert the following patches
- mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
- mm,madvise: Refactor madvise_inject_error
- mm,hwpoison: remove MF_COUNT_INCREASED
- mm,hwpoison: remove flag argument from soft offline functions

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 drivers/base/memory.c |  2 +-
 include/linux/mm.h    |  9 +++++----
 mm/madvise.c          | 36 +++++++++++++++++++-----------------
 mm/memory-failure.c   | 31 ++++++++++++++++++++-----------
 4 files changed, 45 insertions(+), 33 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 3e6d27c9dff6..4db3c660de83 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -463,7 +463,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = soft_offline_page(pfn);
+	ret = soft_offline_page(pfn, 0);
 	return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4f12b2465e80..442921a004a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2976,9 +2976,10 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long nr_pages);
 
 enum mf_flags {
-	MF_ACTION_REQUIRED = 1 << 0,
-	MF_MUST_KILL = 1 << 1,
-	MF_SOFT_OFFLINE = 1 << 2,
+	MF_COUNT_INCREASED = 1 << 0,
+	MF_ACTION_REQUIRED = 1 << 1,
+	MF_MUST_KILL = 1 << 2,
+	MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
@@ -2988,7 +2989,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(unsigned long pfn);
+extern int soft_offline_page(unsigned long pfn, int flags);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index 843f6fad3b89..5fa5f66468b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -886,15 +886,16 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
 		unsigned long start, unsigned long end)
 {
+	struct page *page;
 	struct zone *zone;
 	unsigned long size;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+
 	for (; start < end; start += size) {
 		unsigned long pfn;
-		struct page *page;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
@@ -909,26 +910,27 @@ static int madvise_inject_error(int behavior,
 		 */
 		size = page_size(compound_head(page));
 
-		/*
-		 * The get_user_pages_fast() is just to get the pfn of the
-		 * given address, and the refcount has nothing to do with
-		 * what we try to test, so it should be released immediately.
-		 * This is racy but it's intended because the real hardware
-		 * errors could happen at any moment and memory error handlers
-		 * must properly handle the race.
-		 */
-		put_page(page);
-
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-			ret = soft_offline_page(pfn);
-		} else {
-			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-			ret = memory_failure(pfn, 0);
+					pfn, start);
+
+			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+			if (ret)
+				return ret;
+			continue;
 		}
 
+		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+				pfn, start);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
+		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
 	}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a229d4694954..fd50a6f9a60d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1167,7 +1167,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!get_hwpoison_page(p)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1363,7 +1363,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!get_hwpoison_page(p)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
 			return 0;
@@ -1392,7 +1392,10 @@ int memory_failure(unsigned long pfn, int flags)
 	shake_page(p, 0);
 	/* shake_page could have turned it free. */
 	if (!PageLRU(p) && is_free_buddy_page(p)) {
-		action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+		if (flags & MF_COUNT_INCREASED)
+			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+		else
+			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
 		return 0;
 	}
 
@@ -1540,7 +1543,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (!gotten)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
-			soft_offline_page(entry.pfn);
+			soft_offline_page(entry.pfn, entry.flags);
 		else
 			memory_failure(entry.pfn, entry.flags);
 	}
@@ -1679,10 +1682,13 @@ EXPORT_SYMBOL(unpoison_memory);
  * that is not free, and 1 for any other page type.
  * For 1 the page is returned with increased page count, otherwise not.
  */
-static int __get_any_page(struct page *p, unsigned long pfn)
+static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 {
 	int ret;
 
+	if (flags & MF_COUNT_INCREASED)
+		return 1;
+
 	/*
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
@@ -1709,12 +1715,12 @@ static int __get_any_page(struct page *p, unsigned long pfn)
 	return ret;
 }
 
-static int get_any_page(struct page *page, unsigned long pfn)
+static int get_any_page(struct page *page, unsigned long pfn, int flags)
 {
-	int ret = __get_any_page(page, pfn);
+	int ret = __get_any_page(page, pfn, flags);
 
 	if (ret == -EBUSY)
-		ret = __get_any_page(page, pfn);
+		ret = __get_any_page(page, pfn, flags);
 
 	if (ret == 1 && !PageHuge(page) &&
 	    !PageLRU(page) && !__PageMovable(page)) {
@@ -1727,7 +1733,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
 		/*
 		 * Did it turn free?
 		 */
-		ret = __get_any_page(page, pfn);
+		ret = __get_any_page(page, pfn, 0);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
 			put_page(page);
@@ -1869,6 +1875,7 @@ static int soft_offline_free_page(struct page *page)
 /**
  * soft_offline_page - Soft offline a page.
  * @pfn: pfn to soft-offline
+ * @flags: flags. Same as memory_failure().
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1887,7 +1894,7 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(unsigned long pfn)
+int soft_offline_page(unsigned long pfn, int flags)
 {
 	int ret;
 	struct page *page;
@@ -1901,11 +1908,13 @@ int soft_offline_page(unsigned long pfn)
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
+		if (flags & MF_COUNT_INCREASED)
+			put_page(page);
 		return 0;
 	}
 
 	get_online_mems();
-	ret = get_any_page(page, pfn);
+	ret = get_any_page(page, pfn, flags);
 	put_online_mems();
 
 	if (ret > 0)
HORIGUCHI NAOYA(堀口 直也) Aug. 5, 2020, 8:44 p.m. UTC | #6
On Mon, Aug 03, 2020 at 09:49:42PM -0400, Qian Cai wrote:
> On Tue, Aug 04, 2020 at 01:16:45AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote:
> > > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > > > [...]
> > > 
> > > There is another issue with this patchset (with and without the patch [1]).
> > > 
> > > [1] https://lore.kernel.org/lkml/20200803133657.GA13307@hori.linux.bs1.fc.nec.co.jp/
> > > 
> > > Arm64 using 512M-size hugepages starts to fail allocations prematurely.
> > > 
> > > [...]
> > 
> > I think that this is due to the lack of contiguous memory.
> > This test program iterates soft offlining of hugepages many times, so
> > eventually one page in every 512MB will be removed from the buddy
> > allocator, and then we can't allocate hugepages any more even if we have
> > enough free pages.  This is not good for heavy hugepage users, but it
> > should be the intended behavior.
> > 
> > It seems that random.c calls madvise(MADV_SOFT_OFFLINE) for 2 hugepages
> > and iterates that 1000 (== NR_LOOP) times, so if the system doesn't have
> > enough memory to cover the range of 2000 hugepages (2000 * 512MB, roughly
> > 1000GB on the Arm64 system), this ENOMEM should reproduce as expected.
> 
> Well, each iteration will mmap/munmap, so there should be no leaking. 
> 
> https://gitlab.com/cailca/linux-mm/-/blob/master/random.c#L376
> 
> It also seems to me that madvise(MADV_SOFT_OFFLINE) does start to fragment
> memory somehow, because after this "madvise: Cannot allocate memory" happened,
> I immediately checked /proc/meminfo and found no hugepage usage at all.
> 
> > 
> > > 
> > > Reverting this patchset and its dependency patchset [2] (reverting the
> > > dependency alone did not help) fixed it,
> > 
> > But it's still not clear to me why this was not visible before this
> > patchset, so I need to look into it further.

I've reproduced the ENOMEM with v5.8 (without this patchset) simply by using
a VM with a small amount of memory (4GB).  So this specific error does not
seem to be caused by this series.

Thanks,
Naoya Horiguchi