[RFC,v2,00/16] Hwpoison rework {hard,soft}-offline

Message ID

20191017142123.24245-1-osalvador@suse.de

Headers

Return-Path: <SRS0=7nHD=YK=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6306721D80
From: Oscar Salvador <osalvador@suse.de>
To: n-horiguchi@ah.jp.nec.com
Cc: mhocko@kernel.org,
	mike.kravetz@oracle.com,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Oscar Salvador <osalvador@suse.de>
Subject: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline
Date: Thu, 17 Oct 2019 16:21:07 +0200
Message-Id: <20191017142123.24245-1-osalvador@suse.de>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

Hwpoison rework {hard,soft}-offline
[RFC,v2,00/16] Hwpoison rework {hard,soft}-offline
[RFC,v2,01/16] mm,hwpoison: cleanup unused PageHuge() check
[RFC,v2,02/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
[RFC,v2,03/16] mm,madvise: Refactor madvise_inject_error
[RFC,v2,04/16] mm,hwpoison-inject: don't pin for hwpoison_filter
[RFC,v2,05/16] mm,hwpoison: Un-export get_hwpoison_page and make it static
[RFC,v2,06/16] mm,hwpoison: Kill put_hwpoison_page
[RFC,v2,07/16] mm,hwpoison: remove MF_COUNT_INCREASED
[RFC,v2,08/16] mm,hwpoison: remove flag argument from soft offline functions
[RFC,v2,09/16] mm,hwpoison: Unify THP handling for hard and soft offline
[RFC,v2,10/16] mm,hwpoison: Rework soft offline for free pages
[RFC,v2,11/16] mm,hwpoison: Rework soft offline for in-use pages
[RFC,v2,12/16] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
[RFC,v2,13/16] mm,hwpoison: Take pages off the buddy when hard-offlining
[RFC,v2,14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
[RFC,v2,15/16] mm/hwpoison-inject: Rip off duplicated checks
[RFC,v2,16/16] mm, soft-offline: convert parameter to pfn
[17/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP

Message

Oscar Salvador Oct. 17, 2019, 2:21 p.m. UTC

[NOTE]
Although I think the patchset is ready to go since a) it fixes
the original issues and b) survives all my tests, I wanted to
give it a last RFC spin.
If no further objections are presented, I will drop the RFC.

This patchset was initially based on Naoya's hwpoison rework [1], so
thanks to him for the initial work.
I would also like to thank Naoya for testing the patchset off-line
and reporting any issues he found; that was quite helpful.

This patchset aims to fix some issues lying in {soft,hard}-offline handling,
but it also takes the chance to perform some cleanups and refactoring
along the way.

While this patchset was initially meant for soft-offlining, I think
that the hard-offline part can be cleaned up further.
But that would be on top of this work.

- Motivation:

A customer and I were facing an issue where processes were killed
after having soft-offlined some of their pages.
This should not happen when soft-offlining, as it is meant to be non-disruptive.
I was able to reproduce the issue by stressing the memory while
soft-offlining pages in the meantime.

After debugging the issue, I saw that the problem was that pages were handed
back to user-space after they had been offlined properly.
So, when those pages were faulted in, the fault handler returned VM_FAULT_POISON
all the way down to the arch handler, and it simply killed the process.

After further analysis, it became clear that the problem was that when
kcompactd kicked in to migrate pages over, compaction_alloc callback
was handing poisoned pages to the migrate routine.

All this could happen because isolate_freepages_block and
fast_isolate_freepages just check for the page to be PageBuddy,
and since 1) poisoned pages can be part of a higher order page
and 2) poisoned pages are also PageBuddy, they can sneak in easily.

I also saw some other problems with swap pages, but I suspected it
to be the same sort of problem, so I did not follow that trace.

The above refers to soft-offline.
But I also saw problems with hard-offline, especially hugetlb corruption,
and some other weird stuff. (I could paste the logs)

The full explanation referring to the soft-offline case can be found at [2].

- Approach:

The approach taken is to contain those pages and never let them hit
either the pcplists or the buddy freelists.
Only when they are completely out of reach do we flag them as poisoned.

A full explanation of this can be found in patch#10 and patch#11.

- Outcome:

With this patchset, I no longer see the issues with soft-offline and
hard-offline.

[1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u

Naoya Horiguchi (6):
mm,hwpoison: cleanup unused PageHuge() check
mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
mm,hwpoison-inject: don't pin for hwpoison_filter
mm,hwpoison: remove MF_COUNT_INCREASED
mm,hwpoison: remove flag argument from soft offline functions
mm, soft-offline: convert parameter to pfn

Oscar Salvador (10):
mm,madvise: Refactor madvise_inject_error
mm,hwpoison: Un-export get_hwpoison_page and make it static
mm,hwpoison: Kill put_hwpoison_page
mm,hwpoison: Unify THP handling for hard and soft offline
mm,hwpoison: Rework soft offline for free pages
mm,hwpoison: Rework soft offline for in-use pages
mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
mm,hwpoison: Take pages off the buddy when hard-offlining
mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
mm/hwpoison-inject: Rip off duplicated checks

Comments

Dmitry Yakunin June 11, 2020, 4:43 p.m. UTC | #1

Hello!

We are facing similar problems with hwpoisoned pages
on one of our production clusters after a kernel update to stable 4.19.
An application that does a lot of memory allocations sometimes caught a SIGBUS
signal, with a message in dmesg about a hardware memory corruption fault.
In kernel and mce logs we saw messages about soft offlining pages with
correctable errors. Those events always happened before the application
was killed. This is not the behavior we expect. We want our application to
continue working on a smaller set of available pages in the system.

This issue is difficult to reproduce, but we suppose that the reason for such
behavior is that compaction does not check whether a page is poisoned while
processing free pages, so as a result valid userspace data gets migrated to
bad pages.
We wrote a simple test:
  - soft offline the first 4 pages in every 64 contiguous pages in ZONE_NORMAL
    through writing pfn to /sys/devices/system/memory/soft_offline_page
  - force compaction by echo 1 >> /proc/sys/vm/compact_memory
Without this patch series, after these steps bash became unusable
and every attempt to run any command led to SIGBUS with a message about a
hardware memory corruption fault. After applying this series to our kernel
tree, we cannot reproduce such SIGBUSes with our test. On upstream kernel 5.7
this behavior is still reproducible.

So we would like to know why this patchset wasn't merged upstream.
Are there any problems with such a rework of {soft,hard}-offline handling?
BTW, this patchset should be updated with upstream changes in mm.

Thanks for your replies.

--
Dmitry Yakunin

HORIGUCHI NAOYA(堀口直也) June 15, 2020, 6:19 a.m. UTC | #2

Hi Dmitry,

On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
> 
> We are facing similar problems with hwpoisoned pages
> on one of our production clusters after a kernel update to stable 4.19.
> An application that does a lot of memory allocations sometimes caught a SIGBUS
> signal, with a message in dmesg about a hardware memory corruption fault.
> In kernel and mce logs we saw messages about soft offlining pages with
> correctable errors. Those events always happened before the application
> was killed. This is not the behavior we expect. We want our application to
> continue working on a smaller set of available pages in the system.
> 
> This issue is difficult to reproduce, but we suppose that the reason for such
> behavior is that compaction does not check whether a page is poisoned while
> processing free pages, so as a result valid userspace data gets migrated to
> bad pages.
> We wrote a simple test:
>   - soft offline the first 4 pages in every 64 contiguous pages in ZONE_NORMAL
>     through writing pfn to /sys/devices/system/memory/soft_offline_page
>   - force compaction by echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series, after these steps bash became unusable
> and every attempt to run any command led to SIGBUS with a message about a
> hardware memory corruption fault. After applying this series to our kernel
> tree, we cannot reproduce such SIGBUSes with our test. On upstream kernel 5.7
> this behavior is still reproducible.
> 
> So we would like to know why this patchset wasn't merged upstream.
> Are there any problems with such a rework of {soft,hard}-offline handling?

No technical reason, it's just because I didn't have enough power to push
this to be merged. Really sorry about that.

> BTW, this patchset should be updated with upstream changes in mm.

I'm working on this now and still need more testing to confirm, but I hope
I'll update and post this for 5.9.

Thanks,
Naoya Horiguchi