[v6,1/2] mm,hwpoison: fix race with hugetlb page allocation

Message ID	20210603233632.2964832-2-nao.horiguchi@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=W4zG=K5=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8DAE460231 From: Naoya Horiguchi <nao.horiguchi@gmail.com> To: Oscar Salvador <osalvador@suse.de>, Muchun Song <songmuchun@bytedance.com>, Mike Kravetz <mike.kravetz@oracle.com>, linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org>, Michal Hocko <mhocko@suse.com>, Tony Luck <tony.luck@intel.com>, Naoya Horiguchi <naoya.horiguchi@nec.com>, linux-kernel@vger.kernel.org Subject: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation Date: Fri, 4 Jun 2021 08:36:31 +0900 Message-Id: <20210603233632.2964832-2-nao.horiguchi@gmail.com> In-Reply-To: <20210603233632.2964832-1-nao.horiguchi@gmail.com> References: <20210603233632.2964832-1-nao.horiguchi@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	hwpoison: fix race with hugetlb page allocation \| expand [v6,0/2] hwpoison: fix race with hugetlb page allocation [v6,1/2] mm,hwpoison: fix race with hugetlb page allocation [v6,2/2] mm,hwpoison: make get_hwpoison_page() call get_any_page()

Naoya Horiguchi June 3, 2021, 11:36 p.m. UTC

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.

So make __get_hwpoison_page() check page status a bit more for hugetlb
pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not transitive.
It's notable that another new function, HWPoisonHandlable(), is helpful
to prevent a race against other transitive page states (like a generic
compound page just before PageHuge becomes true).

Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org # 5.12+
---
ChangeLog v6:
- defined HWPoisonHandlable(),
- updated comment and patch description.
---
 include/linux/hugetlb.h |  6 ++++++
 mm/hugetlb.c            | 15 +++++++++++++++
 mm/memory-failure.c     | 29 +++++++++++++++++++++++++++--
 3 files changed, 48 insertions(+), 2 deletions(-)

Mike Kravetz June 4, 2021, 11:55 p.m. UTC | #1

On 6/3/21 4:36 PM, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is not enough because
> there's a time window where compound pages have non-zero refcount during
> hugetlb page initialization.
> 
> So make __get_hwpoison_page() check page status a bit more for hugetlb
> pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
> under hugetlb_lock makes sure that the hugetlb page is not transitive.
> It's notable that another new function, HWPoisonHandlable(), is helpful
> to prevent a race against other transitive page states (like a generic
> compound page just before PageHuge becomes true).

Thanks!

I believe this is good enough to prevent the race.  Since access to the
page is not synchronized in any way, it would still be "theoretically
possible" to race.  For example,

    CPU0:                           CPU1:

    HWPoisonHandlable true
    				    page freed to buddy
				    page allocated from buddy
				    page added to hugetlb pool
				    alloc_surplus_huge_page

But, this is very Very VERY unlikely to ever happen.

> 
> Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Reported-by: Muchun Song <songmuchun@bytedance.com>
> Cc: stable@vger.kernel.org # 5.12+
> ---
> ChangeLog v6:
> - defined HWPoisonHandlable(),
> - updated comment and patch description.
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/hugetlb.c            | 15 +++++++++++++++
>  mm/memory-failure.c     | 29 +++++++++++++++++++++++++++--
>  3 files changed, 48 insertions(+), 2 deletions(-)

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

Luck, Tony Aug. 12, 2021, 4:28 a.m. UTC | #2

On Fri, Jun 04, 2021 at 08:36:31AM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is not enough because
> there's a time window where compound pages have non-zero refcount during
> hugetlb page initialization.
> 
> So make __get_hwpoison_page() check page status a bit more for hugetlb
> pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
> under hugetlb_lock makes sure that the hugetlb page is not transitive.
> It's notable that another new function, HWPoisonHandlable(), is helpful
> to prevent a race against other transitive page states (like a generic
> compound page just before PageHuge becomes true).

I'm seeing some strange results when doing a simple injection/recovery.

Current upstream often fails to offline the page with messages like:

	"high-order kernel page"
or
	"unknown page"

Things were working in v5.12. Broken in v5.13.

Bisect says that:

25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")

is the culprit (though it is possible that there is more than one
issue ... failure symptoms changed a bit during the bisection).

This commit doesn't revert automatically from upstream. But it
does revert from v5.13. Running with this reverted from v5.13
gives kernel that recovers normally[1] from hundreds of consecutive
error injections.

-Tony

[1] Almost normally. My test catches SIGBUS and prints the virtual
address from the siginfo_t structure. Sometimes the address is correct
other times it is NULL.

HORIGUCHI NAOYA(堀口直也) Aug. 12, 2021, 9:03 a.m. UTC | #3

On Wed, Aug 11, 2021 at 09:28:13PM -0700, Luck, Tony wrote:
> On Fri, Jun 04, 2021 at 08:36:31AM +0900, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > When hugetlb page fault (under overcommitting situation) and
> > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> > 
> >     CPU0:                           CPU1:
> > 
> >                                     gather_surplus_pages()
> >                                       page = alloc_surplus_huge_page()
> >     memory_failure_hugetlb()
> >       get_hwpoison_page(page)
> >         __get_hwpoison_page(page)
> >           get_page_unless_zero(page)
> >                                       zero = put_page_testzero(page)
> >                                       VM_BUG_ON_PAGE(!zero, page)
> >                                       enqueue_huge_page(h, page)
> >       put_page(page)
> > 
> > __get_hwpoison_page() only checks the page refcount before taking an
> > additional one for memory error handling, which is not enough because
> > there's a time window where compound pages have non-zero refcount during
> > hugetlb page initialization.
> > 
> > So make __get_hwpoison_page() check page status a bit more for hugetlb
> > pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
> > under hugetlb_lock makes sure that the hugetlb page is not transitive.
> > It's notable that another new function, HWPoisonHandlable(), is helpful
> > to prevent a race against other transitive page states (like a generic
> > compound page just before PageHuge becomes true).
> 
> I'm seeing some strange results when doing a simple injection/recovery.
> 
> Current upstream often fails to offline the page with messages like:
> 
> 	"high-order kernel page"
> or
> 	"unknown page"
> 
> Things were working in v5.12. Broken in v5.13.
> 
> Bisect says that:
> 
> 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
> 
> is the culprit (though it is possible that there is more than one
> issue ... failure symptoms changed a bit during the bisection).

Sorry for the failures. I think that the following patch (and dependencies)
should solve the issue.
https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
I'll submit the update (maybe the patchset will be smaller by feedbacks)
later soon.

Thanks,
Naoya Horiguchi

> 
> This commit doesn't revert automatically from upstream. But it
> does revert from v5.13. Running with this reverted from v5.13
> gives kernel that recovers normally[1] from hundreds of consecutive
> error injections.
> 
> -Tony
> 
> [1] Almost normally. My test catches SIGBUS and prints the virtual
> address from the siginfo_t structure. Sometimes the address is correct
> other times it is NULL.

Luck, Tony Aug. 12, 2021, 3:25 p.m. UTC | #4

On Thu, Aug 12, 2021 at 09:03:04AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Sorry for the failures. I think that the following patch (and dependencies)
> should solve the issue.
> https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
> I'll submit the update (maybe the patchset will be smaller by feedbacks)
> later soon.

I was uncertain about which dependencies you meant. So I followed
the advice in the cover letter for the patch series containing that
patch and did:

$ git fetch https://github.com/nhoriguchi/linux hwpoison

This kernel still has some odd issues and the poison page
did not get unmapped from my test user application.

See git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
for my test program. In this case I was just running with default settings
to inject an error into a user data page and then consume it.

Here's the dmesg output. There are multiple calls to memory_failure()
because the poison address is signalled both by the memory controller
(CMCI with UCNA signature) and the DCU (#MC with SRAR signature).

Note that the first message says: "recovery action for unknown page: Ignored"

[   70.331253] EINJ: Error INJection is initialized.
[   76.949490] process '/aegl/ras-tools/einj_mem_uc' started with executable stack
[   77.481846] Disabling lock debugging due to kernel taint
[   77.482004] mce: [Hardware Error]: Machine check events logged
[   77.487176] mce: Uncorrected hardware memory error in user-access at 7e025e400
[   77.493225] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   77.508704] {1}[Hardware Error]: event severity: recoverable
[   77.514361] {1}[Hardware Error]:  Error 0, type: recoverable
[   77.520011] {1}[Hardware Error]:  fru_text: Card01, ChnG, DIMM0
[   77.525921] {1}[Hardware Error]:   section_type: memory error
[   77.531659] {1}[Hardware Error]:   error_status: 0x0000000000000400
[   77.537914] {1}[Hardware Error]:   physical_address: 0x00000007e025e400
[   77.544518] {1}[Hardware Error]:   node: 0 card: 6 module: 0 rank: 1 bank: 8 device: 0 row: 15105 column: 896
[   77.554503] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[   77.561548] {1}[Hardware Error]:   DIMM location: NODE 3 CPU0_DIMM_G1
[   77.568135] Memory failure: 0x7e025e: recovery action for unknown page: Ignored
[   77.575445] Memory failure: 0x7e025e: already hardware poisoned
[   77.627894] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[   77.633600] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 25: 0xac00000200a00090
[   77.633601] EDAC skx MC3: TSC 0x18d57ce28f4f25
[   77.633602] EDAC skx MC3: ADDR 0x7e025e400
[   77.633603] EDAC skx MC3: MISC 0x9000201d809c086
[   77.633603] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
[   77.633608] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x00a0:0x0090  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
[   77.633611] Memory failure: 0x7e025e: Sending SIGBUS to einj_mem_uc:12283 due to hardware memory corruption
[   77.668827] mce: [Hardware Error]: Machine check events logged
[   77.678605] Memory failure: 0x7e025e: already hardware poisoned
[   77.736685] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[   77.742392] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
[   77.742394] EDAC skx MC3: TSC 0x0
[   77.742394] EDAC skx MC3: ADDR 0x7e025e400
[   77.742395] EDAC skx MC3: MISC 0x0
[   77.742395] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
[   77.742397] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
[   77.777612] Memory failure: 0x7e025e: already hardware poisoned

-Tony

HORIGUCHI NAOYA(堀口直也) Aug. 13, 2021, 6:29 a.m. UTC | #5

On Thu, Aug 12, 2021 at 08:25:48AM -0700, Luck, Tony wrote:
> On Thu, Aug 12, 2021 at 09:03:04AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Sorry for the failures. I think that the following patch (and dependencies)
> > should solve the issue.
> > https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
> > I'll submit the update (maybe the patchset will be smaller by feedbacks)
> > later soon.
>
> I was uncertain about which dependencies you meant. So I followed
> the advice in the cover letter for the patch series containing that
> patch and did:
>
> $ git fetch https://github.com/nhoriguchi/linux hwpoison
>
> This kernel still has some odd issues and the poison page
> did not get unmapped from my test user application.
>
> See git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> for my test program. In this case I was just running with default settings
> to inject an error into a user data page and then consume it.
>
> Here's the dmesg output. There are multiple calls to memory_failure()
> because the poison address is signalled both by the memory controller
> (CMCI with UCNA signature) and the DCU (#MC with SRAR signature).

Thanks for the details.

The following dmesg implies that the failure happened in einj_mem_uc
which has 13 sub-testcases (specified by "testname"). In which one
the "unknonw page" event was triggered?

>
> Note that the first message says: "recovery action for unknown page: Ignored"
>
> [   70.331253] EINJ: Error INJection is initialized.
> [   76.949490] process '/aegl/ras-tools/einj_mem_uc' started with executable stack
> [   77.481846] Disabling lock debugging due to kernel taint
> [   77.482004] mce: [Hardware Error]: Machine check events logged
> [   77.487176] mce: Uncorrected hardware memory error in user-access at 7e025e400
> [   77.493225] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> [   77.508704] {1}[Hardware Error]: event severity: recoverable
> [   77.514361] {1}[Hardware Error]:  Error 0, type: recoverable
> [   77.520011] {1}[Hardware Error]:  fru_text: Card01, ChnG, DIMM0
> [   77.525921] {1}[Hardware Error]:   section_type: memory error
> [   77.531659] {1}[Hardware Error]:   error_status: 0x0000000000000400

Accourding to https://www.kernel.org/doc/html/latest/firmware-guide/acpi/apei/einj.html
this error status means "Platform Uncorrectable non-fatal", so memory_failure()
is called with MF_ACTION_REQUIRED set?

> [   77.537914] {1}[Hardware Error]:   physical_address: 0x00000007e025e400
> [   77.544518] {1}[Hardware Error]:   node: 0 card: 6 module: 0 rank: 1 bank: 8 device: 0 row: 15105 column: 896
> [   77.554503] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
> [   77.561548] {1}[Hardware Error]:   DIMM location: NODE 3 CPU0_DIMM_G1
> [   77.568135] Memory failure: 0x7e025e: recovery action for unknown page: Ignored

This "unknown page" should come from the following line in memory_failure():

   int memory_failure(unsigned long pfn, int flags)
   ...
           if (!(flags & MF_COUNT_INCREASED)) {
                   res = get_hwpoison_page(p, flags);
                   if (!res) {
                           ...  // code for HWPoisonHandlable pages.
                   } else if (res < 0) {
                           action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); /// HERE
                           res = -EBUSY;
                           goto unlock_mutex;
                   }
           }

This path is chosen when HWPoisonHandlable() returns false in get_hwpoison_page()
and HWPoisonHandlable() is like this in the current version:

    static inline bool HWPoisonHandlable(struct page *page)
    {
            return PageLRU(page) || __PageMovable(page) ||
                    PageSlab(page) || PageTable(page) || PageReserved(page);
    }

So I wonder why HWPoisonHandlable() returned false in your case ("to inject
an error into a user data page and then consume it" sounds a basic testcase
so it's expected to be detected by PageLRU() or __PageMovable().)

I think that we are going to the direction of stopping taking refcount of
error pages blindly to reduce the risk of critical races.  In order to
improve HWPoisonHandlable() check instead of reverting it, I'd like to
understand error page's state in the failure case.  Could you try testing
again with inserting a dump_page() like below?  (I'll try to reproduce by
myself somehow, but it might be hard because I have no machine with firmware
supporting EINJ.)

    @@ -1703,6 +1703,7 @@ int memory_failure(unsigned long pfn, int flags)
                            goto unlock_mutex;
                    } else if (res < 0) {
                            action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
    +                       dump_page(p, "hwpoison unknown page");
                            res = -EBUSY;
                            goto unlock_mutex;
                    }

Thanks,
Naoya Horiguchi

> [   77.575445] Memory failure: 0x7e025e: already hardware poisoned
> [   77.627894] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [   77.633600] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 25: 0xac00000200a00090
> [   77.633601] EDAC skx MC3: TSC 0x18d57ce28f4f25
> [   77.633602] EDAC skx MC3: ADDR 0x7e025e400
> [   77.633603] EDAC skx MC3: MISC 0x9000201d809c086
> [   77.633603] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
> [   77.633608] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x00a0:0x0090  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
> [   77.633611] Memory failure: 0x7e025e: Sending SIGBUS to einj_mem_uc:12283 due to hardware memory corruption
> [   77.668827] mce: [Hardware Error]: Machine check events logged
> [   77.678605] Memory failure: 0x7e025e: already hardware poisoned
> [   77.736685] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [   77.742392] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
> [   77.742394] EDAC skx MC3: TSC 0x0
> [   77.742394] EDAC skx MC3: ADDR 0x7e025e400
> [   77.742395] EDAC skx MC3: MISC 0x0
> [   77.742395] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
> [   77.742397] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
> [   77.777612] Memory failure: 0x7e025e: already hardware poisoned
>
> -Tony
>

Luck, Tony Aug. 13, 2021, 3:07 p.m. UTC | #6

I'm running the default case from my einj_mem_uc test. That just:

1) allocates a page using:

	mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1, 0);

2) fills the page with random data (to make sure it has been allocated, and that the kernel can't
do KSM tricks to share this physical page with some other user).

3) injects the error at a 1KB offset within the page.

4) does a memory read of the poison address.


>                            action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
>   +                       dump_page(p, "hwpoison unknown page");
>                            res = -EBUSY;
>                            goto unlock_mutex;
>                    }

I added that patch against upstream (v5.14-rc5).  Here's the dump. The "pfn" matches the physical address where I injected,
and it has the hwpoison flag bit that was set early in memory_failure() --- so this is the right page.

[   79.368212] Memory failure: 0x623889: recovery action for unknown page: Ignored
[   79.375525] page:0000000065ad9479 refcount:3 mapcount:1 mapping:00000000a4ac843b index:0x0 pfn:0x623889
[   79.384909] memcg:ff40a569f2966000
[   79.388313] aops:shmem_aops ino:4c00 dentry name:"dev/zero"
[   79.393896] flags: 0x17ffffc088000c(uptodate|dirty|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0x1fffff)
[   79.403455] raw: 0017ffffc088000c 0000000000000000 dead000000000122 ff40a569f45a7160
[   79.411191] raw: 0000000000000000 0000000000000000 0000000300000000 ff40a569f2966000
[   79.418931] page dumped because: hwpoison unknown page


-Tony

Naoya Horiguchi Aug. 16, 2021, 5:12 p.m. UTC | #7

On Fri, Aug 13, 2021 at 03:07:20PM +0000, Luck, Tony wrote:
> I'm running the default case from my einj_mem_uc test. That just:
> 
> 1) allocates a page using:
> 
> 	mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1, 0);
> 
> 2) fills the page with random data (to make sure it has been allocated, and that the kernel can't
> do KSM tricks to share this physical page with some other user).
> 
> 3) injects the error at a 1KB offset within the page.
> 
> 4) does a memory read of the poison address.
> 
> 
> >                            action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
> >   +                       dump_page(p, "hwpoison unknown page");
> >                            res = -EBUSY;
> >                            goto unlock_mutex;
> >                    }
> 
> I added that patch against upstream (v5.14-rc5).  Here's the dump. The "pfn" matches the physical address where I injected,
> and it has the hwpoison flag bit that was set early in memory_failure() --- so this is the right page.
> 
> [   79.368212] Memory failure: 0x623889: recovery action for unknown page: Ignored
> [   79.375525] page:0000000065ad9479 refcount:3 mapcount:1 mapping:00000000a4ac843b index:0x0 pfn:0x623889
> [   79.384909] memcg:ff40a569f2966000
> [   79.388313] aops:shmem_aops ino:4c00 dentry name:"dev/zero"
> [   79.393896] flags: 0x17ffffc088000c(uptodate|dirty|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0x1fffff)
> [   79.403455] raw: 0017ffffc088000c 0000000000000000 dead000000000122 ff40a569f45a7160
> [   79.411191] raw: 0000000000000000 0000000000000000 0000000300000000 ff40a569f2966000
> [   79.418931] page dumped because: hwpoison unknown page

Thank you for your help.

This dump indicates that HWPoisonHandlable() returned false due to
the lack of PG_lru flag.  In older code before 5.13, get_any_page() does
retry with shake_page(), but does not since 5.13, which seems to me
the root cause of the issue. So my suggestion is to call shake_page()
when HWPoisonHandlable() is false.

Could you try checking that the following diff fixes the issue?
I could still have better fix (like inserting shake_page() to other
retry paths in get_any_page()), but the below is the minimum one.

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 76cc53b2999a..3e770e4f259e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,7 +1146,7 @@ static int __get_hwpoison_page(struct page *page)
         * unexpected races caused by taking a page refcount.
         */
        if (!HWPoisonHandlable(head))
-               return 0;
+               return -EBUSY;

        if (PageTransHuge(head)) {
                /*
@@ -1199,9 +1199,14 @@ static int get_any_page(struct page *p, unsigned long flags)
                        }
                        goto out;
                } else if (ret == -EBUSY) {
-                       /* We raced with freeing huge page to buddy, retry. */
-                       if (pass++ < 3)
+                       /*
+                        * We raced with (possibly temporary) unhandlable
+                        * page, retry.
+                        */
+                       if (pass++ < 3) {
+                               shake_page(p, 1);
                                goto try_again;
+                       }
                        goto out;
                }
        }


Thanks,
Naoya Horiguchi

Luck, Tony Aug. 16, 2021, 5:56 p.m. UTC | #8

On Tue, Aug 17, 2021 at 02:12:07AM +0900, Naoya Horiguchi wrote:
> This dump indicates that HWPoisonHandlable() returned false due to
> the lack of PG_lru flag.  In older code before 5.13, get_any_page() does
> retry with shake_page(), but does not since 5.13, which seems to me
> the root cause of the issue. So my suggestion is to call shake_page()
> when HWPoisonHandlable() is false.
> 
> Could you try checking that the following diff fixes the issue?
> I could still have better fix (like inserting shake_page() to other
> retry paths in get_any_page()), but the below is the minimum one.

Tried it ... and it works! Injected and recovered from a thousand
errors without seeing any problems.

-Tony

P.S. Somewhere in the mail system your patch arrived with <TAB>s
changed to spaces. Here's what I applied to v5.14-rc6 (hopefully with
TABS preserved) ... just in case anyone else is following along with
this thread and wants to try some tests.

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eefd823deb67..aa6592540f17 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,7 +1146,7 @@ static int __get_hwpoison_page(struct page *page)
 	 * unexpected races caused by taking a page refcount.
 	 */
 	if (!HWPoisonHandlable(head))
-		return 0;
+		return -EBUSY;
 
 	if (PageTransHuge(head)) {
 		/*
@@ -1199,9 +1199,14 @@ static int get_any_page(struct page *p, unsigned long flags)
 			}
 			goto out;
 		} else if (ret == -EBUSY) {
-			/* We raced with freeing huge page to buddy, retry. */
-			if (pass++ < 3)
+			/*
+			 * We raced with (possibly temporary) unhandlable
+			 * page, retry.
+			 */
+			if (pass++ < 3) {
+				shake_page(p, 1);
 				goto try_again;
+			}
 			goto out;
 		}
 	}

HORIGUCHI NAOYA(堀口直也) Aug. 17, 2021, 5:40 a.m. UTC | #9

On Mon, Aug 16, 2021 at 10:56:46AM -0700, Luck, Tony wrote:
> On Tue, Aug 17, 2021 at 02:12:07AM +0900, Naoya Horiguchi wrote:
> > This dump indicates that HWPoisonHandlable() returned false due to
> > the lack of PG_lru flag.  In older code before 5.13, get_any_page() does
> > retry with shake_page(), but does not since 5.13, which seems to me
> > the root cause of the issue. So my suggestion is to call shake_page()
> > when HWPoisonHandlable() is false.
> > 
> > Could you try checking that the following diff fixes the issue?
> > I could still have better fix (like inserting shake_page() to other
> > retry paths in get_any_page()), but the below is the minimum one.
> 
> Tried it ... and it works! Injected and recovered from a thousand
> errors without seeing any problems.

Thank you for testing, I've submitted a patch just now.

- Naoya

[v6,1/2] mm,hwpoison: fix race with hugetlb page allocation

Commit Message

Comments

Patch