
[v4,2/2] mm,hwpoison: send SIGBUS when the page has already been poisoned

Message ID 20210427062953.2080293-3-nao.horiguchi@gmail.com (mailing list archive)
State New, archived
Series mm,hwpoison: fix sending SIGBUS for Action Required MCE

Commit Message

Naoya Horiguchi April 27, 2021, 6:29 a.m. UTC
From: Naoya Horiguchi <naoya.horiguchi@nec.com>

When memory_failure() is called with MF_ACTION_REQUIRED on a page
that has already been hwpoisoned, memory_failure() can fail to send
SIGBUS to the affected process, which results in an infinite loop
of MCEs.

Currently memory_failure() returns 0 if it's called for an already
hwpoisoned page, so the caller, kill_me_maybe(), can return without
sending SIGBUS to the current process.  An action required MCE is
raised when the current process accesses the broken memory, so no
SIGBUS means that the current process keeps running and soon accesses
the error page again, running into the MCE loop.

This issue can arise for example in the following scenarios:

  - Two or more threads access the poisoned page concurrently.  If
    local MCE is enabled, each MCE handler handles its MCE event
    independently.  So there's a race among the MCE events, and the
    second and later threads fall into the situation in question.

  - If there was a preceding memory error event and memory_failure()
    for that event failed to unmap the error page for some reason, a
    subsequent memory access to the error page triggers the MCE loop
    situation.

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned.  This allows the memory error
handler to control how it sends signals to userspace.  Also make sure
that any process touching a hwpoisoned page gets a SIGBUS (if possible)
with the error virtual address, even in the "already hwpoisoned" path
of memory_failure(), as is done in the page fault path.

kill_accessing_process() does a page table walk to find the error
virtual address.  If multiple virtual addresses are found in the page
table walk, there is no way to tell which address is the correct one,
so we fall back to sending SIGBUS in kill_me_maybe() without error
address info, as we do now.  This corner case is left to be solved in
the future.
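
For reference, here is a minimal userspace sketch (illustration only,
not part of this patch) showing how the reported address arrives in the
SIGBUS siginfo.  The handler is only an example; a real handler should
restrict itself to async-signal-safe calls:

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
  {
          (void)sig; (void)ctx;
          /* BUS_MCEERR_AR: SIGBUS for an action required machine check */
          if (si->si_code == BUS_MCEERR_AR)
                  fprintf(stderr, "hwpoison at vaddr %p (lsb %d)\n",
                          si->si_addr, (int)si->si_addr_lsb);
          _exit(1);
  }

  int main(void)
  {
          struct sigaction sa = { 0 };

          sigemptyset(&sa.sa_mask);
          sa.sa_sigaction = sigbus_handler;
          sa.sa_flags = SA_SIGINFO;
          sigaction(SIGBUS, &sa, NULL);
          /* ... touch memory that may be poisoned ... */
          pause();
          return 0;
  }

With this patch, si_addr is expected to carry the error virtual address
also in the "already hwpoisoned" case, except for the multiple-mapping
corner case described above.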

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
---
change log v3 -> v4:
- refactored hwpoison_pte_range to save indentation,
- updated patch description

change log v1 -> v2:
- initialize local variables in check_hwpoisoned_entry() and
  hwpoison_pte_range()
- fix and improve logic to calculate error address offset.
---
 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h        |   5 ++
 mm/memory-failure.c            | 143 ++++++++++++++++++++++++++++++++-
 3 files changed, 158 insertions(+), 3 deletions(-)

Comments

yaoaili [么爱利] May 7, 2021, 9:38 a.m. UTC | #1
On Tue, 27 Apr 2021 15:29:53 +0900
Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:

...

Sorry for my late response, I just got time to rethink the pagewalk patch.  Please let me share my thoughts;
if anything is wrong, just point it out, thanks!

This whole pagewalk patch is meant to fix the invalid virtual address sent along with SIGBUS.  This invalid
virtual address issue seems to be a pre-existing issue from before this race issue was reported, and it has
not been fixed for a long time.

Then I wonder why this issue has not been fixed; maybe no process cares about this virtual address, since the
process will be killed anyway.  Maybe a virtual guest needs this address to forward it to the vCPU, but until
now the memory recovery function in the VM doesn't work at all, so without this address the impact doesn't
seem big.

Maybe there are some other cases that care about the virtual address; if anyone knows of one, please point it out.

But an invalid virtual address is still no good.

Before this, I posted an RFC patch trying to fix this issue, with one known issue: it fails for multiple pte entries.
Then this patch was posted trying to address that.

When I first read this patch, I thought this method was good and right, and I tested it.  But thinking it over
again, I am wondering: even if the process has multiple pte entries and gets a wrong virtual address, that
address is still pointing to the same page, right?
If the process doesn't exit and gets the wrong virtual address, what wrong action will it take?  Am I missing
something?
I can only think of the virtual machine example: will qemu translate the wrong virtual address to the right
guest physical address?  I am not sure whether a VM will have multiple pte entries.

And I think the virtual address sent along with SIGBUS is not meant to backtrace the code; it just tells where
the error memory is.  For multiple pte entries, isn't one virtual address for the same physical page enough?

Comparing this patch with my RFC patch, the differences are:
1. This patch only fixes the invalid virtual address for the race issue, while my RFC patch covers all the error cases for recovery;
2. For multiple entries, this patch does a force_sig with no other information, but the RFC patch takes one possibly-correct address.  I don't know which one is better.

And if this multiple pte entry case is a real issue, it seems the normal recovery work would also trigger it; would it be better to fix that first?

Thanks!
Aili Yao
HORIGUCHI NAOYA(堀口 直也) May 10, 2021, 7:21 a.m. UTC | #2
On Fri, May 07, 2021 at 05:38:52PM +0800, Aili Yao wrote:
> On Tue, 27 Apr 2021 15:29:53 +0900
> Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:
...
> Sorry for my late response, I just got time to rethink the pagewalk patch.  Please let me share my thoughts;
> if anything is wrong, just point it out, thanks!

Thank you for the feedback.

...
> When I first read this patch, I thought this method was good and right, and I tested it.  But thinking it over
> again, I am wondering: even if the process has multiple pte entries and gets a wrong virtual address, that
> address is still pointing to the same page, right?

Yes, it is.

> If the process doesn't exit and gets the wrong virtual address, what wrong action will it take?

I have no clear idea.  The typical action for SIGBUS is to kill the process with
some logging, so an obviously wrong action like killing the wrong process never happens.
A possible wrong result is an invalid address in the log, which might not be critical.

> I can only think of the virtual machine example: will qemu translate the wrong virtual address to the right
> guest physical address?  I am not sure whether a VM will have multiple pte entries.

As far as I know, qemu maintains a one-to-one mapping between host virtual address
and guest physical address, so no multiple-entry issue should happen around qemu.

> 
> And I think the virtual address sent along with SIGBUS is not meant to backtrace the code; it just tells where
> the error memory is.  For multiple pte entries, isn't one virtual address for the same physical page enough?
> 
> Comparing this patch with my RFC patch, the differences are:
> 1. This patch only fixes the invalid virtual address for the race issue, while my RFC patch covers all the error cases for recovery;
> 2. For multiple entries, this patch does a force_sig with no other information, but the RFC patch takes one possibly-correct address.  I don't know which one is better.
> 
> And if this multiple pte entry case is a real issue, it seems the normal recovery work would also trigger it; would it be better to fix that first?

Assuming that your RFC is https://lore.kernel.org/lkml/20210317162304.58ff188c@alex-virtual-machine/,
it simply uses the first-found virtual address.  I'm starting to think that this
approach could be fine.  And it's easy to change the patch to this approach.
I have no preference, so if you like, I'll switch to the "first-found" approach.
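
For illustration, a rough sketch of what the "first-found" variant could
look like on top of the kill_accessing_process() added in this patch
(just a sketch of the idea, not the actual RFC code):

static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
				  int flags)
{
	int ret;
	struct hwp_walk priv = {
		.pfn = pfn,
	};
	priv.tk.tsk = p;

	mmap_read_lock(p->mm);
	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &hwp_walk_ops,
			      (void *)&priv);
	if (ret < 0 || !priv.tk.addr) {
		mmap_read_unlock(p->mm);
		return -EFAULT;
	}
	/*
	 * ret == 1 only means the walk stopped at a second mapping;
	 * priv.tk.addr still holds the first-found virtual address,
	 * so report that one instead of giving up.
	 */
	kill_proc(&priv.tk, pfn, flags);
	mmap_read_unlock(p->mm);
	return -EHWPOISON;
}

The only difference from the patch as posted is the fall-back policy:
stop at the second mapping and report the first-found address, instead
of stopping and giving up.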

Thanks,
Naoya Horiguchi
yaoaili [么爱利] May 10, 2021, 8 a.m. UTC | #3
On Mon, 10 May 2021 07:21:28 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> On Fri, May 07, 2021 at 05:38:52PM +0800, Aili Yao wrote:
> > On Tue, 27 Apr 2021 15:29:53 +0900
> > Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:
...
> Assuming that your RFC is https://lore.kernel.org/lkml/20210317162304.58ff188c@alex-virtual-machine/,
> it simply uses the first-found virtual address.  I'm starting to think that this
> approach could be fine.  And it's easy to change the patch to this approach.
> I have no preference, so if you like, I'll switch to the "first-found" approach.

Hi Naoya:
  Thanks for your reply!
  Yes, you can change to that RFC approach, but there may be some unidentified issues, and it needs more
consideration.
  And there may be other methods to address this; you can also dig into those, get one realized and posted.
  I am OK with any option.
  But here, from the beginning, I think the invalid address issue and the race issue are two different issues;
they may have some relationship, but they are still two issues in my mind.
  Would you please separate this patch series into three again?
  Great thanks!

Aili Yao
HORIGUCHI NAOYA(堀口 直也) May 10, 2021, 8:31 a.m. UTC | #4
On Mon, May 10, 2021 at 04:00:21PM +0800, Aili Yao wrote:
...
> 
> Hi Naoya:
>   Thanks for your reply!
>   Yes, you can change to that RFC approach, but there may be some unidentified issues, and it needs more
> consideration.
>   And there may be other methods to address this; you can also dig into those, get one realized and posted.
>   I am OK with any option.
>   But here, from the beginning, I think the invalid address issue and the race issue are two different issues;
> they may have some relationship, but they are still two issues in my mind.
>   Would you please separate this patch series into three again?

OK, I'll do it.
Maybe that's helpful if we consider sending some part of the series to stable.

- Naoya

Patch

diff --git v5.12/arch/x86/kernel/cpu/mce/core.c v5.12_patched/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..3ce23445a48c 100644
--- v5.12/arch/x86/kernel/cpu/mce/core.c
+++ v5.12_patched/arch/x86/kernel/cpu/mce/core.c
@@ -1257,19 +1257,28 @@  static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int ret;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
-	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
+	if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
 
+	/*
+	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+	 * to the current process with the proper error info, so no need to
+	 * send it here again.
+	 */
+	if (ret == -EHWPOISON)
+		return;
+
 	if (p->mce_vaddr != (void __user *)-1l) {
 		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
 	} else {
diff --git v5.12/include/linux/swapops.h v5.12_patched/include/linux/swapops.h
index d9b7c9132c2f..98ea67fcf360 100644
--- v5.12/include/linux/swapops.h
+++ v5.12_patched/include/linux/swapops.h
@@ -323,6 +323,11 @@  static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
diff --git v5.12/mm/memory-failure.c v5.12_patched/mm/memory-failure.c
index 4087308e4b32..a3659619d293 100644
--- v5.12/mm/memory-failure.c
+++ v5.12_patched/mm/memory-failure.c
@@ -56,6 +56,7 @@ 
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -554,6 +555,140 @@  static void collect_procs(struct page *page, struct list_head *tokill,
 		collect_procs_file(page, tokill, force_early);
 }
 
+struct hwp_walk {
+	struct to_kill tk;
+	unsigned long pfn;
+	int flags;
+};
+
+static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
+{
+	/* Abort pagewalk when finding multiple mappings to the error page. */
+	if (tk->addr)
+		return 1;
+	tk->addr = addr;
+	tk->size_shift = shift;
+	return 0;
+}
+
+static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
+				unsigned long poisoned_pfn, struct to_kill *tk)
+{
+	unsigned long pfn = 0;
+
+	if (pte_present(pte)) {
+		pfn = pte_pfn(pte);
+	} else {
+		swp_entry_t swp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp))
+			pfn = hwpoison_entry_to_pfn(swp);
+	}
+
+	if (!pfn || pfn != poisoned_pfn)
+		return 0;
+
+	return set_to_kill(tk, addr, shift);
+}
+
+static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	int ret = 0;
+	pte_t *ptep;
+	pmd_t pmd;
+	spinlock_t *ptl;
+	unsigned long pfn;
+	unsigned long hwpoison_vaddr;
+
+	ptl = pmd_trans_huge_lock(pmdp, walk->vma);
+	if (!ptl)
+		goto pte_loop;
+	pmd = *pmdp;
+	if (!pmd_present(pmd))
+		goto unlock;
+	pfn = pmd_pfn(pmd);
+	if (pfn <= hwp->pfn && hwp->pfn < pfn + HPAGE_PMD_NR) {
+		hwpoison_vaddr = addr + ((hwp->pfn - pfn) << PAGE_SHIFT);
+		ret = set_to_kill(&hwp->tk, hwpoison_vaddr, PAGE_SHIFT);
+	}
+unlock:
+	spin_unlock(ptl);
+	goto out;
+pte_loop:
+	if (pmd_trans_unstable(pmdp))
+		goto out;
+
+	ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != end; ptep++, addr += PAGE_SIZE) {
+		ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
+					     hwp->pfn, &hwp->tk);
+		if (ret == 1)
+			break;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+out:
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
+			    unsigned long addr, unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
+	pte_t pte = huge_ptep_get(ptep);
+	struct hstate *h = hstate_vma(walk->vma);
+
+	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
+				      hwp->pfn, &hwp->tk);
+}
+#else
+#define hwpoison_hugetlb_range	NULL
+#endif
+
+static struct mm_walk_ops hwp_walk_ops = {
+	.pmd_entry = hwpoison_pte_range,
+	.hugetlb_entry = hwpoison_hugetlb_range,
+};
+
+/*
+ * Sends SIGBUS to the current process with the error info.
+ *
+ * This function is intended to handle "Action Required" MCEs on already
+ * hardware poisoned pages. They could happen, for example, when
+ * memory_failure() failed to unmap the error page at the first call, or
+ * when multiple local machine checks happened on different CPUs.
+ *
+ * MCE handler currently has no easy access to the error virtual address,
+ * so this function walks the page table to find it. One challenge here is
+ * to reliably get the proper virtual address of the error to report to
+ * applications via SIGBUS. A process could map a page multiple times to
+ * different virtual addresses, then we now have no way to tell which virtual
+ * address was accessed when the Action Required MCE was generated.
+ * So in such a corner case, we now give up and fall back to sending SIGBUS
+ * with no error info.
+ */
+static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
+				  int flags)
+{
+	int ret;
+	struct hwp_walk priv = {
+		.pfn = pfn,
+	};
+	priv.tk.tsk = p;
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &hwp_walk_ops,
+			      (void *)&priv);
+	if (!ret && priv.tk.addr)
+		kill_proc(&priv.tk, pfn, flags);
+	mmap_read_unlock(p->mm);
+	return ret ? -EFAULT : -EHWPOISON;
+}
+
 static const char *action_name[] = {
 	[MF_IGNORED] = "Ignored",
 	[MF_FAILED] = "Failed",
@@ -1228,7 +1363,10 @@  static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, page_to_pfn(head), flags);
+		return res;
 	}
 
 	num_poisoned_pages_inc();
@@ -1437,6 +1575,9 @@  int memory_failure(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, pfn, flags);
 		goto unlock_mutex;
 	}