[RFC] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

Message ID 20181203100309.14784-1-mhocko@kernel.org (mailing list archive)
State New, archived

Commit Message

Michal Hocko Dec. 3, 2018, 10:03 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding. The underlying reason is that the
HWPoison page has an elevated reference count and the migration keeps
failing. There are two problems with that. First of all it is dubious
to migrate the poisoned page because we know that accessing that memory
may fail. Secondly it doesn't make any sense to migrate
potentially broken content and preserve the memory corruption over to a
new location.

Oscar has found out that it is the elevated reference count from
memory_failure that is confusing the offlining path. HWPoisoned pages
are isolated from the LRU list but __offline_pages might still try to
migrate them if there are any preceding migratable pages in the pfn
range. Such a migration would fail due to the reference count but
the migration code would put it back on the LRU list. This is quite
wrong in itself but it would also make scan_movable_pages stumble over
it again without any way out.

This means that hotremove with hwpoisoned pages has never really
worked (except by luck). HWPoisoning really needs a larger surgery
but an immediate and backportable fix is to skip over these pages during
offlining. Even if they are still mapped for some reason then
try_to_unmap should turn those mappings into hwpoison ptes and cause
SIGBUS on access. Nobody should be really touching the content of the
page so it should be safe to ignore them even when there is a pending
reference count.

Debugged-by: Oscar Salvador <osalvador@suse.com>
Cc: stable
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
I am sending this as an RFC now because I am not fully sure I see all
the consequences myself yet. This has passed testing by Oscar but I
would highly appreciate a review from Naoya about my assumptions about
hwpoisoning. E.g. it is not entirely clear to me whether there is a
potential case where the page might be still mapped. I have put
try_to_unmap just to be sure. It would be really great if I could drop
that part because then it is not really clear which of the TTU flags to
use to cover all potential cases.

I have marked the patch for stable but I have no idea how far back it
should go. Probably everything that already has hotremove and hwpoison
code.

Thanks in advance!

 mm/memory_hotplug.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Naoya Horiguchi Dec. 4, 2018, 7:21 a.m. UTC | #1
On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> [...]
> E.g. it is not entirely clear to me whether there is a
> potential case where the page might be still mapped.

One potential case is a ksm page, for which we give up unmapping and leave
it mapped. Other than that I don't have any idea, but any new type of
page could potentially fall into this class.

> I have put
> try_to_unmap just to be sure. It would be really great if I could drop
> that part because then it is not really clear which of the TTU flags to
> use to cover all potential cases.
> 
> I have marked the patch for stable but I have no idea how far back it
> should go. Probably everything that already has hotremove and hwpoison
> code.

Yes, maybe this could be ported to all active stable trees.

> 
> Thanks in advance!
> 
>  mm/memory_hotplug.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c6c42a7425e5..08c576d5a633 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -34,6 +34,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memblock.h>
>  #include <linux/compaction.h>
> +#include <linux/rmap.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1366,6 +1367,17 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  			pfn = page_to_pfn(compound_head(page))
>  				+ hpage_nr_pages(page) - 1;
>  
> +		/*
> +		 * HWPoison pages have elevated reference counts so the migration would
> +		 * fail on them. It also doesn't make any sense to migrate them in the
> +		 * first place. Still try to unmap such a page in case it is still mapped.
> +		 */
> +		if (PageHWPoison(page)) {
> +			if (page_mapped(page))
> +				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
> +			continue;
> +		}
> +

I think this looks OK (no better idea.)

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

I wondered why I hadn't found this for so long, and found that my testing only
covered the case where the PageHWPoison page is the first page of a memory block.
scan_movable_pages() considers PageHWPoison as non-movable, so do_migrate_range()
started with the pfn after the PageHWPoison page and never tried to migrate it
(so it effectively ignored every PageHWPoison page, as the above code does.)

Thanks,
Naoya Horiguchi

>  		if (!get_page_unless_zero(page))
>  			continue;
>  		/*
> -- 
> 2.19.1
> 
>
Michal Hocko Dec. 4, 2018, 8:48 a.m. UTC | #2
On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote:
> On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> > [...]
> > E.g. it is not entirely clear to me whether there is a
> > potential case where the page might be still mapped.
> 
> One potential case is a ksm page, for which we give up unmapping and leave
> it mapped. Other than that I don't have any idea, but any new type of
> page could potentially fall into this class.

Could you be more specific why hwpoison code gives up on ksm pages while
we can safely unmap here?

[...]
> 
> I think this looks OK (no better idea.)
> 
> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Thanks!

> I wondered why I hadn't found this for so long, and found that my testing only
> covered the case where the PageHWPoison page is the first page of a memory block.
> scan_movable_pages() considers PageHWPoison as non-movable, so do_migrate_range()
> started with the pfn after the PageHWPoison page and never tried to migrate it
> (so it effectively ignored every PageHWPoison page, as the above code does.)

Yeah, it seems that hotremove has worked only by chance in the presence of
hwpoison pages so far. The specific usecase which triggered this patch
is a system with heavy memory utilization running an in-memory database,
IIRC. So it is quite likely that hwpoison pages are punched into otherwise
used memory.

Thanks for the review Naoya!
Naoya Horiguchi Dec. 4, 2018, 9:11 a.m. UTC | #3
On Tue, Dec 04, 2018 at 09:48:26AM +0100, Michal Hocko wrote:
> On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote:
> > On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> > [...]
> 
> Could you be more specific why hwpoison code gives up on ksm pages while
> we can safely unmap here?

Actually no big reason. Ksm pages never dominate memory, so we simply didn't
have strong motivation to save the pages.

> [...]
> Thanks for the review Naoya!

You're welcome, and thank you for reporting/fixing the issue.

- Naoya
Michal Hocko Dec. 4, 2018, 9:35 a.m. UTC | #4
On Tue 04-12-18 09:11:05, Naoya Horiguchi wrote:
> On Tue, Dec 04, 2018 at 09:48:26AM +0100, Michal Hocko wrote:
> > On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote:
> > > On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> > > [...]
> > 
> > Could you be more specific why hwpoison code gives up on ksm pages while
> > we can safely unmap here?
> 
> Actually no big reason. Ksm pages never dominate memory, so we simply didn't
> have strong motivation to save the pages.

OK, so the unmapping is safe. I will drop a comment. Does this look good
to you?
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 08c576d5a633..ef5d42759aa2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1370,7 +1370,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		/*
 		 * HWPoison pages have elevated reference counts so the migration would
 		 * fail on them. It also doesn't make any sense to migrate them in the
-		 * first place. Still try to unmap such a page in case it is still mapped.
+		 * first place. Still try to unmap such a page in case it is still mapped
+		 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
+		 * the unmap as the catch all safety net).
 		 */
 		if (PageHWPoison(page)) {
 			if (page_mapped(page))
David Hildenbrand Dec. 4, 2018, 11:22 a.m. UTC | #5
On 03.12.18 11:03, Michal Hocko wrote:
> [...]

This sounds good to me. We treat all HWPoison pages already as movable
in has_unmovable_pages() when isolating pages to migrate pages away (and
as !movable when trying to isolate a contig range for allocation).

If this scenario should not be supported (if HWPoison page that is
mapped cannot be offlined), we would have to bail out on such pages way
earlier (e.g. in has_unmovable_pages()), failing in do_migrate_range()
would be too late.

+1 to "HWPoisoning really needs a larger surgery"

With the comment update

Acked-by: David Hildenbrand <david@redhat.com>


Oscar Salvador Dec. 4, 2018, 12:30 p.m. UTC | #6
On 2018-12-03 11:03, Michal Hocko wrote:
> Debugged-by: Oscar Salvador <osalvador@suse.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Bit by bit memory-hotplug is getting trained :-)

Reviewed-by: Oscar Salvador <osalvador@suse.com>
Naoya Horiguchi Dec. 5, 2018, 1:14 a.m. UTC | #7
On Tue, Dec 04, 2018 at 10:35:49AM +0100, Michal Hocko wrote:
> On Tue 04-12-18 09:11:05, Naoya Horiguchi wrote:
> > On Tue, Dec 04, 2018 at 09:48:26AM +0100, Michal Hocko wrote:
> > > On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote:
> > > > On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> > > > [...]
> > > 
> > > Could you be more specific why hwpoison code gives up on ksm pages while
> > > we can safely unmap here?
> > 
> > Actually no big reason. Ksm pages never dominate memory, so we simply didn't
> > have strong motivation to save the pages.
> 
> OK, so the unmapping is safe. I will drop a comment. Does this look good
> to you?
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 08c576d5a633..ef5d42759aa2 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1370,7 +1370,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  		/*
>  		 * HWPoison pages have elevated reference counts so the migration would
>  		 * fail on them. It also doesn't make any sense to migrate them in the
> -		 * first place. Still try to unmap such a page in case it is still mapped.
> +		 * first place. Still try to unmap such a page in case it is still mapped
> +		 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
> +		 * the unmap as the catch all safety net).
>  		 */
>  		if (PageHWPoison(page)) {
>  			if (page_mapped(page))

Thanks, I'm fine with this part, which explains why we unmap here.

- Naoya
Michal Hocko Dec. 5, 2018, 12:29 p.m. UTC | #8
On Mon 03-12-18 11:03:09, Michal Hocko wrote:
> [...]
> This means that hotremove with hwpoisoned pages has never really
> worked (except by luck). HWPoisoning really needs a larger surgery
> but an immediate and backportable fix is to skip over these pages during
> offlining. Even if they are still mapped for some reason then
> try_to_unmap should turn those mappings into hwpoison ptes and cause
> SIGBUS on access. Nobody should be really touching the content of the
> page so it should be safe to ignore them even when there is a pending
> reference count.

After some more thinking I am not really sure the above reasoning is
still true with the current upstream kernel. Maybe I just managed to
confuse myself so please hold off on this patch for now. Testing by
Oscar has shown this patch is helping but the changelog might need to be
updated.
Michal Hocko Dec. 5, 2018, 4:57 p.m. UTC | #9
On Wed 05-12-18 13:29:18, Michal Hocko wrote:
[...]
> After some more thinking I am not really sure the above reasoning is
> still true with the current upstream kernel. Maybe I just managed to
> confuse myself so please hold off on this patch for now. Testing by
> Oscar has shown this patch is helping but the changelog might need to be
> updated.

OK, so Oscar has nailed it down and it seems that the 4.4 kernel we have
been debugging on behaves slightly differently. The underlying problem is
the same though. So I have reworded the changelog and added "just in
case" PageLRU handling. Naoya, maybe you have an argument that would
make this void for current upstream kernels.

I have dropped all the reviewed tags as the patch has changed slightly.
Thanks a lot to Oscar for his patience and the testing he has devoted to
this issue.

Btw. the way we drop all the work on the first page that we cannot
isolate is just goofy. Why don't we simply migrate all that we already
have on the list and go on? Something for a followup cleanup though.
---
From 909521051f41ae46a841b481acaf1ed9c695ae7b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 3 Dec 2018 10:27:18 +0100
Subject: [PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be
 offlined

We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding on a 4.4-based kernel. The underlying
reason was that the HWPoison page has an elevated reference count and
the migration keeps failing. There are two problems with that. First
of all it is dubious to migrate the poisoned page because we know that
accessing that memory may fail. Secondly it doesn't make any
sense to migrate potentially broken content and preserve the memory
corruption over to a new location.

Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase
===

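#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Userspace stand-in for the kernel macro; assumes 4K pages. */
#define PAGE_ALIGN(addr) (((addr) + 4095UL) & ~4095UL)

int main(void)
{
        int ret;
        int i;
        int fd;
        char *array = malloc(4096);
        char *array_locked = malloc(4096);

        /* Read some file data so that page cache pages are in use. */
        fd = open("/tmp/data", O_RDONLY);
        read(fd, array, 4095);

        /* Touch the buffer so that it is backed by real pages. */
        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        /* The length is the buffer size, not sizeof() of the pointer. */
        ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), 4096);
        if (ret)
                perror("mlock");

        sleep (20);

        /* Inject the poison; needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE. */
        ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
        if (ret)
                perror("madvise");

        /* Touch the buffer again after poisoning. */
        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        return 0;
}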
===

+ offline this memory.
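
The "offline this memory" step is not spelled out in the testcase. A minimal
sketch, assuming the block is offlined through its sysfs state file
(/sys/devices/system/memory/memoryN/state). The helper name
offline_memory_block() is made up for illustration, and the caller has to
pick the memory block that contains the poisoned pfn:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "offline" to a memory block's sysfs state
 * file, e.g. /sys/devices/system/memory/memory42/state (the block
 * number 42 is only an example). Returns 0 on success or a negative
 * errno value.
 */
static int offline_memory_block(const char *state_path)
{
	static const char cmd[] = "offline";
	ssize_t written;
	int fd, err = 0;

	assert(state_path != NULL);

	fd = open(state_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The kernel reports a failed offline through the write() error. */
	written = write(fd, cmd, strlen(cmd));
	if (written < 0)
		err = -errno;
	else if ((size_t)written != strlen(cmd))
		err = -EIO;

	if (close(fd) < 0 && !err)
		err = -errno;
	return err;
}
```

Writing to the state file needs root. On an affected kernel the write might
not return at all, because the offlining loop keeps retrying.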

In 4.4 kernels he saw the hwpoisoned page being returned to the LRU
list:
kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80

And that later confuses the hotremove path because it attempts to
migrate an LRU page and that fails due to the elevated reference
count. It is quite possible that the reuse of the HWPoisoned page is
some kind of race condition that has since been fixed, but I am not
really sure about that.

With the upstream kernel the failure is slightly different. The page
doesn't seem to have the LRU bit set, but isolate_movable_page simply
fails and do_migrate_range puts all the isolated pages back on the LRU.
Therefore no progress is made and scan_movable_pages finds the same set
of pages over and over again.

Fix both cases by explicitly checking for HWPoisoned pages before we even
try to get a reference on the page, and try to unmap the page if it is
still mapped. As explained by Naoya regarding KSM pages, which the current
hwpoison code leaves mapped:
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages, which shouldn't happen, but I couldn't convince myself
of that.

Debugged-by: Oscar Salvador <osalvador@suse.com>
Cc: stable
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c6c42a7425e5..cfa1a2736876 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memblock.h>
 #include <linux/compaction.h>
+#include <linux/rmap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1366,6 +1367,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			pfn = page_to_pfn(compound_head(page))
 				+ hpage_nr_pages(page) - 1;
 
+		/*
+		 * HWPoison pages have elevated reference counts so the migration would
+		 * fail on them. It also doesn't make any sense to migrate them in the
+		 * first place. Still try to unmap such a page in case it is still mapped
+		 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
+		 * the unmap as the catch all safety net).
+		 */
+		if (PageHWPoison(page)) {
+			if (WARN_ON(PageLRU(page)))
+				isolate_lru_page(page);
+			if (page_mapped(page))
+				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+			continue;
+		}
+
 		if (!get_page_unless_zero(page))
 			continue;
 		/*
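
To make the effect of the hunk above easier to follow, here is a userspace
toy model of one do_migrate_range() pass. It is only a sketch under
simplifying assumptions: struct sim_page, sim_migrate_range() and the
skip_hwpoison switch are invented for illustration, and a failed isolation
is modelled as dropping everything isolated so far, as described in the
changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for struct page, reduced to what the model needs. */
struct sim_page {
	bool hwpoison;
	bool mapped;
	bool migrated;
};

/*
 * One pass over a pfn range. Returns the number of pages migrated.
 * skip_hwpoison == false models the old behaviour, true the patched one.
 */
static size_t sim_migrate_range(struct sim_page *pages, size_t n,
				bool skip_hwpoison)
{
	size_t isolated = 0;
	size_t i;

	assert(pages != NULL || n == 0);

	for (i = 0; i < n; i++) {
		struct sim_page *p = &pages[i];

		if (p->migrated)
			continue;
		if (p->hwpoison) {
			if (!skip_hwpoison)
				return 0;	/* isolation fails, all put back */
			p->mapped = false;	/* stands in for try_to_unmap() */
			continue;
		}
		isolated++;
	}

	/* Everything that was isolated can now be migrated. */
	for (i = 0; i < n; i++)
		if (!pages[i].hwpoison)
			pages[i].migrated = true;

	return isolated;
}
```

Without the skip, every pass over a range that contains a poisoned page
returns 0, which corresponds to scan_movable_pages finding the same pages
again and again. With the skip, the healthy pages are migrated and the
poisoned one is merely unmapped.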
Naoya Horiguchi Dec. 6, 2018, 5:21 a.m. UTC | #10
On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> [...]
> > After some more thinking I am not really sure the above reasoning is
> > still true with the current upstream kernel. Maybe I just managed to
> > confuse myself so please hold off on this patch for now. Testing by
> > Oscar has shown this patch is helping but the changelog might need to be
> > updated.
> 
> OK, so Oscar has nailed it down and it seems that 4.4 kernel we have
> been debugging on behaves slightly different. The underlying problem is
> the same though. So I have reworded the changelog and added "just in
> case" PageLRU handling. Naoya, maybe you have an argument that would
> make this void for current upstream kernels.

The following commit (not in 4.4.x stable tree) might explain the
difference you experienced:

  commit 286c469a988fbaf68e3a97ddf1e6c245c1446968                          
  Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>                      
  Date:   Wed May 3 14:56:22 2017 -0700                                    
                                                                           
      mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page

This commit adds shake_page() for mlocked pages to make sure that the target
page is flushed out of the LRU cache. Without this shake_page(), the subsequent
delete_from_lru_cache() (from me_pagecache_clean()) fails to isolate it and
the page eventually returns to the LRU list. So this scenario leads to a
"hwpoisoned but still linked to LRU list" page.

Thanks,
Naoya Horiguchi
Oscar Salvador Dec. 6, 2018, 6:43 a.m. UTC | #11
> Btw. the way how we drop all the work on the first page that we cannot
> isolate is just goofy. Why don't we simply migrate all that we already
> have on the list and go on? Something for a followup cleanup though.

Indeed, that is just wrong.
I will try to send a followup cleanup to fix that.


> Debugged-by: Oscar Salvador <osalvador@suse.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>

It has been a fun bug to chase down, thanks for the patch ;-)

Reviewed-by: Oscar Salvador <osalvador@suse.com>
Tested-by: Oscar Salvador <osalvador@suse.com>
Michal Hocko Dec. 6, 2018, 8:32 a.m. UTC | #12
On Thu 06-12-18 05:21:38, Naoya Horiguchi wrote:
> On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> > On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> > [...]
> > > After some more thinking I am not really sure the above reasoning is
> > > still true with the current upstream kernel. Maybe I just managed to
> > > confuse myself so please hold off on this patch for now. Testing by
> > > Oscar has shown this patch is helping but the changelog might need to be
> > > updated.
> > 
> > OK, so Oscar has nailed it down and it seems that 4.4 kernel we have
> > been debugging on behaves slightly different. The underlying problem is
> > the same though. So I have reworded the changelog and added "just in
> > case" PageLRU handling. Naoya, maybe you have an argument that would
> > make this void for current upstream kernels.
> 
> The following commit (not in 4.4.x stable tree) might explain the
> difference you experienced:
> 
>   commit 286c469a988fbaf68e3a97ddf1e6c245c1446968                          
>   Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>                      
>   Date:   Wed May 3 14:56:22 2017 -0700                                    
>                                                                            
>       mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
> 
> This commit adds shake_page() for mlocked pages to make sure that the target
> page is flushed out from LRU cache. Without this shake_page(), subsequent
> delete_from_lru_cache() (from me_pagecache_clean()) fails to isolate it and
> the page will finally return back to LRU list.  So this scenario leads to
> "hwpoisoned by still linked to LRU list" page.

OK, I see. So does that mean that the LRU handling is no longer needed
and there is a guarantee that all kernels with the above commit cannot
ever get an LRU page?
Oscar Salvador Dec. 6, 2018, 8:40 a.m. UTC | #13
>> This commit adds shake_page() for mlocked pages to make sure that the target
>> page is flushed out from LRU cache. Without this shake_page(), subsequent
>> delete_from_lru_cache() (from me_pagecache_clean()) fails to isolate it and
>> the page will finally return back to LRU list.  So this scenario leads to
>> "hwpoisoned but still linked to LRU list" page.
> 
> OK, I see. So does that mean that the LRU handling is no longer needed
> and there is a guarantee that all kernels with the above commit cannot
> ever get an LRU page?

For the sake of completeness:

I made a quick test reverting 286c469a988 on the upstream kernel.
As expected, the poisoned page is on the LRU when it hits do_migrate_range,
so the migration path is taken and I see the exact failure I saw
on 4.4.


Oscar Salvador
---
Suse L3
David Hildenbrand Dec. 6, 2018, 9:02 a.m. UTC | #14
On 05.12.18 17:57, Michal Hocko wrote:
> On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> [...]
>> After some more thinking I am not really sure the above reasoning is
>> still true with the current upstream kernel. Maybe I just managed to
>> confuse myself so please hold off on this patch for now. Testing by
>> Oscar has shown this patch is helping but the changelog might need to be
>> updated.
> 
> OK, so Oscar has nailed it down and it seems that 4.4 kernel we have
> been debugging on behaves slightly different. The underlying problem is
> the same though. So I have reworded the changelog and added "just in
> case" PageLRU handling. Naoya, maybe you have an argument that would
> make this void for current upstream kernels.
> 
> I have dropped all the reviewed tags as the patch has changed slightly.
> Thanks a lot to Oscar for his patience and testing he has devoted to
> this issue.
> 
> Btw. the way how we drop all the work on the first page that we cannot
> isolate is just goofy. Why don't we simply migrate all that we already
> have on the list and go on? Something for a followup cleanup though.
> ---
> From 909521051f41ae46a841b481acaf1ed9c695ae7b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 3 Dec 2018 10:27:18 +0100
> Subject: [PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be
>  offlined
> 
> We have received a bug report that an injected MCE about faulty memory
> prevents memory offline to succeed on 4.4 base kernel. The underlying
> reason was that the HWPoison page has an elevated reference count and
> the migration keeps failing. There are two problems with that. First
> of all it is dubious to migrate the poisoned page because we know that
> accessing that memory is possible to fail. Secondly it doesn't make any
> sense to migrate a potentially broken content and preserve the memory
> corruption over to a new location.
> 
> Oscar has found out that 4.4 and the current upstream kernels behave
> slightly differently with his simply testcase
> ===
> 
> int main(void)
> {
>         int ret;
>         int i;
>         int fd;
>         char *array = malloc(4096);
>         char *array_locked = malloc(4096);
> 
>         fd = open("/tmp/data", O_RDONLY);
>         read(fd, array, 4095);
> 
>         for (i = 0; i < 4096; i++)
>                 array_locked[i] = 'd';
> 
>         ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
>         if (ret)
>                 perror("mlock");
> 
>         sleep (20);
> 
>         ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
>         if (ret)
>                 perror("madvise");
> 
>         for (i = 0; i < 4096; i++)
>                 array_locked[i] = 'd';
> 
>         return 0;
> }
> ===
> 
> + offline this memory.
> 
> In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU
> list
> kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
> kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
> kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
> kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
> kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
> kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
> kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
> kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
> kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
> kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
> kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
> kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
> kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
> kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
> kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
> kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
> kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
> kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
> kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
> kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80
> 
> And that later confuses the hotremove path because an LRU page is
> attempted to be migrated and that fails due to an elevated reference
> count. It is quite possible that the reuse of the HWPoisoned page is
> some kind of fixed race condition but I am not really sure about that.
> 
> With the upstream kernel the failure is slightly different. The page
> doesn't seem to have LRU bit set but isolate_movable_page simply fails
> and do_migrate_range simply puts all the isolated pages back to LRU and
> therefore no progress is made and scan_movable_pages finds same set of
> pages over and over again.
> 
> Fix both cases by explicitly checking HWPoisoned pages before we even
> try to get a reference on the page, try to unmap it if it is still
> mapped. As explained by Naoya
> : Hwpoison code never unmapped those for no big reason because
> : Ksm pages never dominate memory, so we simply didn't have strong
> : motivation to save the pages.
> 
> Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
> HWPoison pages which shouldn't happen but I couldn't convince myself
> about that.
> 
> Debugged-by: Oscar Salvador <osalvador@suse.com>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory_hotplug.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c6c42a7425e5..cfa1a2736876 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -34,6 +34,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memblock.h>
>  #include <linux/compaction.h>
> +#include <linux/rmap.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1366,6 +1367,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  			pfn = page_to_pfn(compound_head(page))
>  				+ hpage_nr_pages(page) - 1;
>  
> +		/*
> +		 * HWPoison pages have elevated reference counts so the migration would
> +		 * fail on them. It also doesn't make any sense to migrate them in the
> +		 * first place. Still try to unmap such a page in case it is still mapped
> +		 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
> +		 * the unmap as the catch all safety net).
> +		 */
> +		if (PageHWPoison(page)) {
> +			if (WARN_ON(PageLRU(page)))
> +				isolate_lru_page(page);
> +			if (page_mapped(page))
> +				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
> +			continue;
> +		}
> +
>  		if (!get_page_unless_zero(page))
>  			continue;
>  		/*
> 

Complicated stuff. With or without the LRU handling

Acked-by: David Hildenbrand <david@redhat.com>
Naoya Horiguchi Dec. 6, 2018, 9:15 a.m. UTC | #15
On Thu, Dec 06, 2018 at 09:32:06AM +0100, Michal Hocko wrote:
> On Thu 06-12-18 05:21:38, Naoya Horiguchi wrote:
> > On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> > > On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> > > [...]
> > > > After some more thinking I am not really sure the above reasoning is
> > > > still true with the current upstream kernel. Maybe I just managed to
> > > > confuse myself so please hold off on this patch for now. Testing by
> > > > Oscar has shown this patch is helping but the changelog might need to be
> > > > updated.
> > > 
> > > OK, so Oscar has nailed it down and it seems that 4.4 kernel we have
> > > been debugging on behaves slightly different. The underlying problem is
> > > the same though. So I have reworded the changelog and added "just in
> > > case" PageLRU handling. Naoya, maybe you have an argument that would
> > > make this void for current upstream kernels.
> > 
> > The following commit (not in 4.4.x stable tree) might explain the
> > difference you experienced:
> > 
> >   commit 286c469a988fbaf68e3a97ddf1e6c245c1446968                          
> >   Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>                      
> >   Date:   Wed May 3 14:56:22 2017 -0700                                    
> >                                                                            
> >       mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
> > 
> > This commit adds shake_page() for mlocked pages to make sure that the target
> > page is flushed out from LRU cache. Without this shake_page(), subsequent
> > delete_from_lru_cache() (from me_pagecache_clean()) fails to isolate it and
> > the page will finally return back to LRU list.  So this scenario leads to
> > "hwpoisoned by still linked to LRU list" page.
> 
> OK, I see. So does that mean that the LRU handling is no longer needed
> and there is a guanratee that all kernels with the above commit cannot
> ever get an LRU page?

Theoretically there is no such guarantee, because try_to_unmap() is not
guaranteed to succeed, and memory_failure() returns immediately
when hwpoison_user_mappings() fails.
The following code (which comes after the hwpoison_user_mappings() block)
also implies that the target page can still have the PageLRU flag set.

        /*
         * Torn down by someone else?
         */
        if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
                action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
                res = -EBUSY;
                goto out;
        }

So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
current version of your patch.

Feel free to add my ack.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Thanks,
Naoya Horiguchi
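
The "if (WARN_ON(PageLRU(page)))" block works because the kernel's WARN_ON()
evaluates to the truth value of its condition, so one statement both reports
the unexpected state and handles it. A userspace sketch of that pattern;
TOY_WARN_ON, struct warn_page and toy_skip_hwpoison() are invented for
illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static unsigned int toy_warnings;	/* how many times the warning fired */

/* Report the condition if it is true and hand it back to the caller. */
static bool toy_warn_on(bool cond, const char *expr)
{
	if (cond) {
		toy_warnings++;
		fprintf(stderr, "WARNING: %s\n", expr);
	}
	return cond;
}
#define TOY_WARN_ON(cond)	toy_warn_on(!!(cond), #cond)

/* Invented stand-in for struct page with just the two flags we need. */
struct warn_page {
	bool hwpoison;
	bool lru;
};

/* Returns true when the page is hwpoisoned and must be skipped. */
static bool toy_skip_hwpoison(struct warn_page *page)
{
	assert(page != NULL);

	if (!page->hwpoison)
		return false;
	if (TOY_WARN_ON(page->lru))
		page->lru = false;	/* stands in for isolate_lru_page() */
	return true;
}
```

The common case (poisoned, not on the LRU) stays silent; only the case that
is not supposed to happen produces a warning and still gets cleaned up.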
Michal Hocko Dec. 6, 2018, 12:02 p.m. UTC | #16
On Thu 06-12-18 09:15:53, Naoya Horiguchi wrote:
> On Thu, Dec 06, 2018 at 09:32:06AM +0100, Michal Hocko wrote:
> > On Thu 06-12-18 05:21:38, Naoya Horiguchi wrote:
> > > On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> > > > On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> > > > [...]
> > > > > After some more thinking I am not really sure the above reasoning is
> > > > > still true with the current upstream kernel. Maybe I just managed to
> > > > > confuse myself so please hold off on this patch for now. Testing by
> > > > > Oscar has shown this patch is helping but the changelog might need to be
> > > > > updated.
> > > > 
> > > > OK, so Oscar has nailed it down and it seems that 4.4 kernel we have
> > > > been debugging on behaves slightly different. The underlying problem is
> > > > the same though. So I have reworded the changelog and added "just in
> > > > case" PageLRU handling. Naoya, maybe you have an argument that would
> > > > make this void for current upstream kernels.
> > > 
> > > The following commit (not in 4.4.x stable tree) might explain the
> > > difference you experienced:
> > > 
> > >   commit 286c469a988fbaf68e3a97ddf1e6c245c1446968                          
> > >   Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>                      
> > >   Date:   Wed May 3 14:56:22 2017 -0700                                    
> > >                                                                            
> > >       mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
> > > 
> > > This commit adds shake_page() for mlocked pages to make sure that the target
> > > page is flushed out from LRU cache. Without this shake_page(), subsequent
> > > delete_from_lru_cache() (from me_pagecache_clean()) fails to isolate it and
> > > the page will finally return back to LRU list.  So this scenario leads to
> > > "hwpoisoned by still linked to LRU list" page.
> > 
> > OK, I see. So does that mean that the LRU handling is no longer needed
> > and there is a guanratee that all kernels with the above commit cannot
> > ever get an LRU page?
> 
> Theoretically no such gurantee, because try_to_unmap() doesn't have a
> guarantee of success and then memory_failure() returns immediately
> when hwpoison_user_mappings fails.
> Or the following code (comes after hwpoison_user_mappings block) also implies
> that the target page can still have PageLRU flag.
> 
>         /*
>          * Torn down by someone else?
>          */
>         if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
>                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
>                 res = -EBUSY;
>                 goto out;
>         }
> 
> So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
> current version of your patch.
> 
> Feel free to add my ack.
> 
> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Thanks a lot Naoya! I will extend the changelog with your wording.

Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c6c42a7425e5..08c576d5a633 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -34,6 +34,7 @@ 
 #include <linux/hugetlb.h>
 #include <linux/memblock.h>
 #include <linux/compaction.h>
+#include <linux/rmap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1366,6 +1367,17 @@  do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			pfn = page_to_pfn(compound_head(page))
 				+ hpage_nr_pages(page) - 1;
 
+		/*
+		 * HWPoison pages have elevated reference counts so the migration would
+		 * fail on them. It also doesn't make any sense to migrate them in the
+		 * first place. Still try to unmap such a page in case it is still mapped.
+		 */
+		if (PageHWPoison(page)) {
+			if (page_mapped(page))
+				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+			continue;
+		}
+
 		if (!get_page_unless_zero(page))
 			continue;
 		/*