[v4,0/7] HWpoison: further fixes and cleanups

Message ID	20200917081049.27428-1-osalvador@suse.de (mailing list archive)
Headers	show Return-Path: <SRS0=Mu/N=C2=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5D58020872 From: Oscar Salvador <osalvador@suse.de> To: akpm@linux-foundation.org Cc: aris@ruivo.org, naoya.horiguchi@nec.com, mhocko@kernel.org, tony.luck@intel.com, cai@lca.pw, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Oscar Salvador <osalvador@suse.de> Subject: [PATCH v4 0/7] HWpoison: further fixes and cleanups Date: Thu, 17 Sep 2020 10:10:42 +0200 Message-Id: <20200917081049.27428-1-osalvador@suse.de> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	HWpoison: further fixes and cleanups \| expand [v4,0/7] HWpoison: further fixes and cleanups [v4,1/7] mm,hwpoison: take free pages off the buddy freelists [v4,2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally [v4,3/7] mm,hwpoison: Try to narrow window race for free pages [v4,4/7] mm,hwpoison: refactor madvise_inject_error [v4,5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page [v4,6/7] mm,hwpoison: drop unneeded pcplist draining [v4,7/7] mm,hwpoison: remove stale code

Oscar Salvador Sept. 17, 2020, 8:10 a.m. UTC

This patchset includes some fixups (patch#1,patch#2 and patch#3)
and some cleanups (patch#4-7).

Patch#1 is a fix to take off HWPoison pages off a buddy freelist since
it can lead us to having HWPoison pages back in the game without no one
noticing it.
So fix it (we did that already for soft_offline_page [1]).

Patch#2 is fixing a rebasing problem that made the call
to page_handle_poison from _soft_offline_page set the
wrong value for hugepage_or_freepage. [2]

Patch#3 is not really a fixup, but tries to re-handle a page
in case it was allocated under us.

[1] https://patchwork.kernel.org/cover/11704083/
[2] https://patchwork.kernel.org/comment/23619775/

Thanks

Oscar Salvador (7):
  mm,hwpoison: take free pages off the buddy freelists
  mm,hwpoison: Do not set hugepage_or_freepage unconditionally
  mm,hwpoison: Try to narrow window race for free pages
  mm,hwpoison: refactor madvise_inject_error
  mm,hwpoison: drain pcplists before bailing out for non-buddy
    zero-refcount page
  mm,hwpoison: drop unneeded pcplist draining
  mm,hwpoison: remove stale code

 mm/madvise.c        | 36 +++++++++++----------------
 mm/memory-failure.c | 59 +++++++++++++++++++++++++++++++++------------
 2 files changed, 58 insertions(+), 37 deletions(-)

HORIGUCHI NAOYA(堀口直也) Sept. 17, 2020, 11:39 a.m. UTC | #1

On Thu, Sep 17, 2020 at 10:10:42AM +0200, Oscar Salvador wrote:
> This patchset includes some fixups (patch#1,patch#2 and patch#3)
> and some cleanups (patch#4-7).
> 
> Patch#1 is a fix to take off HWPoison pages off a buddy freelist since
> it can lead us to having HWPoison pages back in the game without no one
> noticing it.
> So fix it (we did that already for soft_offline_page [1]).
> 
> Patch#2 is fixing a rebasing problem that made the call
> to page_handle_poison from _soft_offline_page set the
> wrong value for hugepage_or_freepage. [2]
> 
> Patch#3 is not really a fixup, but tries to re-handle a page
> in case it was allocated under us.

Thanks for the update.
This patchset triggers the following BUG_ON() with Aristeu's workload:

    [ 1010.400900] Soft offlining pfn 0xbff8c at process virtual address 0x7fe6c99c8000
    [ 1010.402931] page:00000000f5670686 refcount:1 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0xbff89
    [ 1010.405604] flags: 0xfffe000800000(hwpoison)
    [ 1010.406755] raw: 000fffe000800000 ffffcddf029ab848 ffffcddf02ff9448 0000000000000000
    [ 1010.408824] raw: 0000000000000001 0000000000000000 00000001ffffff7f 0000000000000000
    [ 1010.410877] page dumped because: VM_BUG_ON_PAGE(page_count(buddy) != 0)
    [ 1010.412673] ------------[ cut here ]------------
    [ 1010.413930] kernel BUG at mm/page_alloc.c:800!
    [ 1010.415143] invalid opcode: 0000 [#1] SMP PTI
    [ 1010.416320] CPU: 3 PID: 1340 Comm: kworker/3:0 Not tainted 5.9.0-rc2-mm1-v5.9-rc2-200917-1952-00212-gf1a0765b04cb+ #33
    [ 1010.419101] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    [ 1010.422645] Workqueue: mm_percpu_wq drain_local_pages_wq
    [ 1010.424075] RIP: 0010:__free_one_page+0x552/0x580
    [ 1010.425344] Code: 48 c7 c6 90 6c 0f 84 4c 89 e7 e8 69 7e fd ff 0f 0b 0f 1f 44 00 00 e9 e5 fc ff ff 48 c7 c6 c8 f3 11 84 4c 89 f7 e8 4e 7e fd ff <0f> 0b 83 fb 08 0f 86 cb fc ff ff 48 83 c4 20 5b 5d 41 5c 41 5d 41
    [ 1010.430231] RSP: 0018:ffffaa96c171fda0 EFLAGS: 00010082
    [ 1010.431651] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
    [ 1010.433598] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8dc8bbd18d08
    [ 1010.435627] RBP: 00000000000bff88 R08: ffff8dc8bbd18d00 R09: 6573756163656220
    [ 1010.437544] R10: 6163656220646570 R11: 6d75642065676170 R12: ffffcddf02ffe200
    [ 1010.439376] R13: 00000000000bff89 R14: ffffcddf02ffe240 R15: ffff8dc7bffd5680
    [ 1010.441271] FS:  0000000000000000(0000) GS:ffff8dc8bbd00000(0000) knlGS:0000000000000000
    [ 1010.443349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1010.444892] CR2: 00007f6b69f92000 CR3: 0000000139c4a000 CR4: 00000000001506e0
    [ 1010.446746] Call Trace:
    [ 1010.447424]  free_pcppages_bulk+0x1d4/0x2c0
    [ 1010.448553]  drain_pages_zone+0x42/0x50
    [ 1010.449585]  drain_local_pages_wq+0xe/0x10
    [ 1010.450702]  process_one_work+0x1b0/0x360
    [ 1010.451769]  worker_thread+0x50/0x3a0
    [ 1010.452940]  ? process_one_work+0x360/0x360
    [ 1010.454072]  kthread+0xfe/0x140
    [ 1010.454989]  ? kthread_park+0x90/0x90
    [ 1010.455970]  ret_from_fork+0x22/0x30

This message seems to show that the pages to be moved to buddy have refcount.
Could you review how changes in v3 -> v4 make it?

Here's my reproducer.

    [build1:~]$ cat test_ksm_madv_soft.c
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <errno.h>
    #include <stdlib.h>
    
    #define MADV_SOFT_OFFLINE 101
    
    #define err(x) perror(x),exit(EXIT_FAILURE)
    
    int main() {
            int ret;
            int size = 100000*0x1000;
    
            char *p1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("p1 %p\n", p1);
            char *p2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("p2 %p\n", p2);
    
            ret = madvise(p1, size, MADV_MERGEABLE);
            printf("madvise(p1) %d\n", ret);
            ret = madvise(p2, size, MADV_MERGEABLE);
            printf("madvise(p2) %d\n", ret);
    
            printf("writing p1 ... ");
            memset(p1, 'a', size);
            printf("done\n");
            printf("writing p2 ... ");
            memset(p2, 'a', size);
            printf("done\n");
    
            usleep(10000000);
            printf("soft offline\n");
            ret = madvise(p1, size, MADV_SOFT_OFFLINE);
            printf("soft offline returns %d\n", ret);
            if (ret)
                    err("madvise");
    
            madvise(p1, size, MADV_UNMERGEABLE);
            madvise(p2, size, MADV_UNMERGEABLE);
    
            printf("OK\n");
    }
    
    [build1:~/upstream/mm_regression/lib]$ cat tmp_run_ksm_madv.sh
    
    rm test_ksm_madv_soft 2> /dev/null
    gcc -o test_ksm_madv_soft test_ksm_madv_soft.c || exit 1
    
    echo 0 > /sys/kernel/mm/ksm/sleep_millisecs
    echo 100000 > /sys/kernel/mm/ksm/pages_to_scan
    echo 100000 > /sys/kernel/mm/ksm/max_page_sharing
    echo 2 > /sys/kernel/mm/ksm/run
    echo 1 > /sys/kernel/mm/ksm/run
    
    ./test_ksm_madv_soft

Thanks,
Naoya Horiguchi

Oscar Salvador Sept. 17, 2020, 1:09 p.m. UTC | #2

On Thu, Sep 17, 2020 at 11:39:21AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Thanks for the update.
> This patchset triggers the following BUG_ON() with Aristeu's workload:

I just took a look, but I found more oddities.
The patchset you sent seems to be a bit buggy and it is missing some things
my patchset contains, e.g:

static __always_inline bool free_pages_prepare(struct page *page,
					unsigned int order, bool check_free)
{
	...
	if (unlikely(PageHWPoison(page)) && !order) {
	...
		return false;
}

Moreover, in page_handle_poison, you managed it wrong because:

static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
{
        if (release) {
                put_page(page);
                drain_all_pages(page_zone(page));
        }

	...
        SetPageHWPoison(page);
        page_ref_inc(page);

1) You are freeing the page first, which means it goes to buddy
2) Then you set it as poisoned and you update its refcount.

Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.

Honestly, I do not know how your patchset diverged so much from mine, but is
not right.

I will go over my patchset and yours and compare/fix things.

Oscar Salvador Sept. 17, 2020, 1:40 p.m. UTC | #3

On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote:
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> {
>         if (release) {
>                 put_page(page);
>                 drain_all_pages(page_zone(page));
>         }
> 
> 	...
>         SetPageHWPoison(page);
>         page_ref_inc(page);
> 
> 1) You are freeing the page first, which means it goes to buddy
> 2) Then you set it as poisoned and you update its refcount.
> 
> Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.

Hi Naoya,

Ok, I tested it and with the following changes on top I cannot reproduce the issue:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f68cb5e3b320..4ffaaa5c2603 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
 {
-	if (release) {
-		put_page(page);
-		drain_all_pages(page_zone(page));
-	}
-
 	if (hugepage_or_freepage) {
 		/*
 		 * Doing this check for free pages is also fine since dissolve_free_huge_page
@@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
 	}
 
 	SetPageHWPoison(page);
+
+	if (release) {
+                put_page(page);
+                drain_all_pages(page_zone(page));
+        }
+
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
 	return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..8a6ab05f074c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 	trace_mm_page_free(page, order);
 
+	if (unlikely(PageHWPoison(page)) && !order) {
+		/*
+		 * Untie memcg state and reset page's owner
+		 */
+		if (memcg_kmem_enabled() && PageKmemcg(page))
+			__memcg_kmem_uncharge_page(page, order);
+		reset_page_owner(page, order);
+		return false;
+	}
+
 	/*
 	 * Check tail pages before head page information is cleared to
 	 * avoid checking PageCompound for order-0 pages.


# sh tmp_run_ksm_madv.sh 
p1 0x7f6b6b08e000
p2 0x7f6b529ee000
madvise(p1) 0
madvise(p2) 0
writing p1 ... done
writing p2 ... done
soft offline
soft offline returns 0
OK


Can you try to re-run your tests and see if they come clean?
If they do, I will try to see if Andrew can squezee above changes into [1],
where they belong to.

Otherwise I will craft a v5 containing all fixups (pretty unfortunate).

[1] https://patchwork.kernel.org/patch/11704099/

HORIGUCHI NAOYA(堀口直也) Sept. 17, 2020, 3:27 p.m. UTC | #4

On Thu, Sep 17, 2020 at 03:40:06PM +0200, Oscar Salvador wrote:
> On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote:
> > static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> > {
> >         if (release) {
> >                 put_page(page);
> >                 drain_all_pages(page_zone(page));
> >         }
> >
> > 	...
> >         SetPageHWPoison(page);
> >         page_ref_inc(page);
> >
> > 1) You are freeing the page first, which means it goes to buddy
> > 2) Then you set it as poisoned and you update its refcount.
> >
> > Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.
>
> Hi Naoya,
>
> Ok, I tested it and with the following changes on top I cannot reproduce the issue:
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f68cb5e3b320..4ffaaa5c2603 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
>  static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
>  {
> -	if (release) {
> -		put_page(page);
> -		drain_all_pages(page_zone(page));
> -	}
> -
>  	if (hugepage_or_freepage) {
>  		/*
>  		 * Doing this check for free pages is also fine since dissolve_free_huge_page
> @@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
>  	}
>
>  	SetPageHWPoison(page);
> +
> +	if (release) {
> +                put_page(page);
> +                drain_all_pages(page_zone(page));
> +        }
> +
>  	page_ref_inc(page);
>  	num_poisoned_pages_inc();
>  	return true;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0d9f9bd0e06c..8a6ab05f074c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>
>  	trace_mm_page_free(page, order);
>
> +	if (unlikely(PageHWPoison(page)) && !order) {
> +		/*
> +		 * Untie memcg state and reset page's owner
> +		 */
> +		if (memcg_kmem_enabled() && PageKmemcg(page))
> +			__memcg_kmem_uncharge_page(page, order);
> +		reset_page_owner(page, order);
> +		return false;
> +	}
> +

Sorry, I modified the patches based on the different assumption from yours.
I firstly thought of taking page off after confirming the error page
is freed back to buddy. This approach leaves the possibility of reusing
the error page (which is acceptable), but simpler and less invasive one.

Your approach removes the error page from page allocator's control in
freeing time. It has no possibility of reusing the error page but changes
are tightly coupled with page free code.

This is a tradeoff between complexity and completeness of soft offline,
Now I'm not sure I could persist on my own opinion without providing
working code, and it's OK for me to take your one.

>  	/*
>  	 * Check tail pages before head page information is cleared to
>  	 * avoid checking PageCompound for order-0 pages.
>
>

...
(let me reply to older email here...)
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> {
>         if (release) {
>                 put_page(page);
>                 drain_all_pages(page_zone(page));
>         }
>
>       ...
>         SetPageHWPoison(page);
>         page_ref_inc(page);
>
> 1) You are freeing the page first, which means it goes to buddy
> 2) Then you set it as poisoned and you update its refcount.
>
> Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.

This order was correct in my approach. Isolation operation is done
after confirming it's in the free list. This prevents PageHWPoison pages
from being in pcpclists. But the page_ref_inc() may not be necessary in my approach.


> # sh tmp_run_ksm_madv.sh
> p1 0x7f6b6b08e000
> p2 0x7f6b529ee000
> madvise(p1) 0
> madvise(p2) 0
> writing p1 ... done
> writing p2 ... done
> soft offline
> soft offline returns 0
> OK
>
>
> Can you try to re-run your tests and see if they come clean?

The test passed in my environment, so this is fine.

> If they do, I will try to see if Andrew can squezee above changes into [1],
> where they belong to.

Yes, proposing the fix for mmhwpoison-rework-soft-offline-for-in-use-pages.patch
seems fine to me.

Again, sorry for modifying code without asking.

Naoya Horiguchi

Oscar Salvador Sept. 18, 2020, 5:49 a.m. UTC | #5

On 2020-09-17 17:27, HORIGUCHI NAOYA wrote:
> Sorry, I modified the patches based on the different assumption from 
> yours.
> I firstly thought of taking page off after confirming the error page
> is freed back to buddy. This approach leaves the possibility of reusing
> the error page (which is acceptable), but simpler and less invasive 
> one.
> 
> Your approach removes the error page from page allocator's control in
> freeing time. It has no possibility of reusing the error page but 
> changes
> are tightly coupled with page free code.
> 
> This is a tradeoff between complexity and completeness of soft offline,
> Now I'm not sure I could persist on my own opinion without providing
> working code, and it's OK for me to take your one.

Yeah, you are right it is a trade off.
I would suggest taking this path now, and if it proofs to be problematic 
in some way, we can always
do the:

free_page
  take_it_off_buddy
   OK: mark it as hwpoison and increment refcount
   NOT_OK (raced with allocation): oops, sorry

> The test passed in my environment, so this is fine.

Thanks for trying it out.

> 
>> If they do, I will try to see if Andrew can squezee above changes into 
>> [1],
>> where they belong to.
> 
> Yes, proposing the fix for 
> mmhwpoison-rework-soft-offline-for-in-use-pages.patch
> seems fine to me.
> 
> Again, sorry for modifying code without asking.

No worries, I wil do a couple of tests on my own and then I will talk to 
Andrew to see if we can squeeze the changes in there.

Aristeu Rozanski Sept. 18, 2020, 7:25 p.m. UTC | #6

On Thu, Sep 17, 2020 at 03:27:18PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> The test passed in my environment, so this is fine.

With everything applied, it works for me as well.
(apologies for the delay, the machine I was using stopped working and took a
while to get another one)

[v4,0/7] HWpoison: further fixes and cleanups

Message

Comments