diff mbox series

mm,hwpoison: return -EBUSY when page already poisoned

Message ID 20210224151619.67c29731@alex-virtual-machine (mailing list archive)
State New, archived
Headers show
Series mm,hwpoison: return -EBUSY when page already poisoned | expand

Commit Message

yaoaili [么爱利] Feb. 24, 2021, 7:16 a.m. UTC
When the page is already poisoned, another memory_failure() call in the
same page now return 0, meaning OK. For nested memory mce handling, this
behavior may lead real serious problem, Example:

1.When LCME is enabled, and there are two processes A && B running on
different core X && Y separately, which will access one same page, then
the page corrupted when process A access it, a MCE will be rasied to
core X and the error process is just underway.

2.Then B access the page and trigger another MCE to core Y, it will also
do error process, it will see TestSetPageHWPoison be true, and 0 is
returned.

3.The kill_me_maybe will check the return:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {

1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
p->mce_whole_page);
1257                 sync_core();
1258                 return;
1259         }

1267 }

4. The error process for B will end, and may nothing happened if
kill-early is not set, We may let the wrong data go into effect.

For other cases which care the return value of memory_failure() should
check why they want to process a memory error which have already been
processed. This behavior seems reasonable.

In kill_me_maybe, log the fact about the memory may not recovered, and
we will kill the related process.

Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 ++
 mm/memory-failure.c            | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

Comments

David Hildenbrand Feb. 24, 2021, 10:10 a.m. UTC | #1
On 24.02.21 08:16, Aili Yao wrote:
> When the page is already poisoned, another memory_failure() call in the
> same page now return 0, meaning OK. For nested memory mce handling, this
> behavior may lead real serious problem, Example:
> 
> 1.When LCME is enabled, and there are two processes A && B running on
> different core X && Y separately, which will access one same page, then
> the page corrupted when process A access it, a MCE will be rasied to
> core X and the error process is just underway.
> 
> 2.Then B access the page and trigger another MCE to core Y, it will also
> do error process, it will see TestSetPageHWPoison be true, and 0 is
> returned.
> 
> 3.The kill_me_maybe will check the return:
> 
> 1244 static void kill_me_maybe(struct callback_head *cb)
> 1245 {
> 
> 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> p->mce_whole_page);
> 1257                 sync_core();
> 1258                 return;
> 1259         }
> 
> 1267 }
> 
> 4. The error process for B will end, and may nothing happened if
> kill-early is not set, We may let the wrong data go into effect.
> 
> For other cases which care the return value of memory_failure() should
> check why they want to process a memory error which have already been
> processed. This behavior seems reasonable.
> 
> In kill_me_maybe, log the fact about the memory may not recovered, and
> we will kill the related process.
>

Is -EBUSY then the right return value?

I'd expect if it's already poisoned that we would get something like 
EHWPOISON.

Does this affect existing user space interfaces (especially, via madvise?)?
Oscar Salvador Feb. 24, 2021, 10:31 a.m. UTC | #2
On Wed, Feb 24, 2021 at 03:16:19PM +0800, Aili Yao wrote:
> When the page is already poisoned, another memory_failure() call in the
> same page now return 0, meaning OK. For nested memory mce handling, this
> behavior may lead real serious problem, Example:

I have some questions:

> 1.When LCME is enabled, and there are two processes A && B running on
> different core X && Y separately, which will access one same page, then
> the page corrupted when process A access it, a MCE will be rasied to
> core X and the error process is just underway.

When !LMCE, that is not a problem because new MCE needs to wait for the ongoing MCE?

> 2.Then B access the page and trigger another MCE to core Y, it will also
> do error process, it will see TestSetPageHWPoison be true, and 0 is
> returned.

For non-nested calls, that is no problem because the page will be taken out
of business(unmapped from the processes), right? So no more MCE are possible.

> 
> 3.The kill_me_maybe will check the return:
> 
> 1244 static void kill_me_maybe(struct callback_head *cb)
> 1245 {
> 
> 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,

So, IIUC, in case of a LMCE nested call, the second MCE will reach here.
set_mce_nospec() will either mark the underlying page as not mapped/cached.

Should not have memory_failure()->hwpoison_user_mappings() unmapped the page
from both process A and B? Or this is in case the ongoing MCE(process A) has
not still unmapped anything, so process B can still access this page.

So with your change, process B will be sent a SIGBUG, while process A is still
handling the MCE, right?

> p->mce_whole_page);
> 1257                 sync_core();
> 1258                 return;
> 1259         }
> 
> 1267 }
> 
> 4. The error process for B will end, and may nothing happened if
> kill-early is not set, We may let the wrong data go into effect.
> 
> For other cases which care the return value of memory_failure() should
> check why they want to process a memory error which have already been
> processed. This behavior seems reasonable.
> 
> In kill_me_maybe, log the fact about the memory may not recovered, and
> we will kill the related process.
> 
> Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 2 ++
>  mm/memory-failure.c            | 4 ++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index e133ce1e562b..db4afc5bf15a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1259,6 +1259,8 @@ static void kill_me_maybe(struct callback_head *cb)
>  	}
>  
>  	if (p->mce_vaddr != (void __user *)-1l) {
> +		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> +			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
>  		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
>  	} else {
>  		pr_err("Memory error not recovered");
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e9481632fcd1..06f006174b8c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  	if (TestSetPageHWPoison(head)) {
>  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
>  		       pfn);
> -		return 0;
> +		return -EBUSY;

As David said, madvise_inject_error() will start returning -EBUSY now in case
we madvise(MADV_HWPOISON) on an already hwpoisoned page.

AFAICS, memory_failure() can return 0, -Eerrors, and MF_XXX.
Would it make sense to unify that? That way we could declare error codes that
make somse sense (like MF_ALREADY_HWPOISONED).
yaoaili [么爱利] Feb. 25, 2021, 3:43 a.m. UTC | #3
On Wed, 24 Feb 2021 11:31:55 +0100
Oscar Salvador <osalvador@suse.de> wrote:


> I have some questions:
> 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different core X && Y separately, which will access one same page, then
> > the page corrupted when process A access it, a MCE will be rasied to
> > core X and the error process is just underway.  
> 
> When !LMCE, that is not a problem because new MCE needs to wait for the ongoing MCE?

I am not sure whether this case will happen when !LMCE, when I realized this place may be an issue
I tried to reproduce it and my configuration is LMCE enabled.

> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.  
> 
> For non-nested calls, that is no problem because the page will be taken out
> of business(unmapped from the processes), right? So no more MCE are possible.

Yes, I think after the recovery jod is finished, other processes still access the page
will meet a page fault and error will be returned;
 
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,  
> 
> So, IIUC, in case of a LMCE nested call, the second MCE will reach here.
> set_mce_nospec() will either mark the underlying page as not mapped/cached.
>
This set_mce_nospec() is not proper when the recovery job is on the fly. In my test
this function failed.
 
> Should not have memory_failure()->hwpoison_user_mappings() unmapped the page
> from both process A and B? Or this is in case the ongoing MCE(process A) has
> not still unmapped anything, so process B can still access this page.
> 
What I care is the process B triggered the error again after process A,
I don't know how it return and proceed.

> So with your change, process B will be sent a SIGBUG, while process A is still
> handling the MCE, right?

Right!

> > p->mce_whole_page);
> > 1257                 sync_core();
> > 1258                 return;
> > 1259         }
> > 
> > 1267 }
> > 
> > 4. The error process for B will end, and may nothing happened if
> > kill-early is not set, We may let the wrong data go into effect.
> > 
> > For other cases which care the return value of memory_failure() should
> > check why they want to process a memory error which have already been
> > processed. This behavior seems reasonable.
> > 
> > In kill_me_maybe, log the fact about the memory may not recovered, and
> > we will kill the related process.
> > 
> > Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
> > ---
> >  arch/x86/kernel/cpu/mce/core.c | 2 ++
> >  mm/memory-failure.c            | 4 ++--
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index e133ce1e562b..db4afc5bf15a 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -1259,6 +1259,8 @@ static void kill_me_maybe(struct callback_head *cb)
> >  	}
> >  
> >  	if (p->mce_vaddr != (void __user *)-1l) {
> > +		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > +			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
> >  		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> >  	} else {
> >  		pr_err("Memory error not recovered");
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index e9481632fcd1..06f006174b8c 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
> >  	if (TestSetPageHWPoison(head)) {
> >  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
> >  		       pfn);
> > -		return 0;
> > +		return -EBUSY;  
> 
> As David said, madvise_inject_error() will start returning -EBUSY now in case
> we madvise(MADV_HWPOISON) on an already hwpoisoned page.
> 
> AFAICS, memory_failure() can return 0, -Eerrors, and MF_XXX.
> Would it make sense to unify that? That way we could declare error codes that
> make somse sense (like MF_ALREADY_HWPOISONED).
> 

@David:

I checked the code again, and find a few places will care the exact return value, like:

1: drivers/base/memory.c:483:      ret = memory_failure(pfn, 0);
This is for hard page offline, I see the code in mcelog:
static void offline_action(struct mempage *mp, u64 addr)
{
	if (offline <= OFFLINE_ACCOUNT)
		return;
	Lprintf("Offlining page %llx\n", addr);
	if (memory_offline(addr) < 0) {
		Lprintf("Offlining page %llx failed: %s\n", addr, strerror(errno));
		mp->offlined = PAGE_OFFLINE_FAILED;
	} else
		mp->offlined = PAGE_OFFLINE;
}
I think return an negative value will be more proper? As the related killing function may not be performed, and we can't say
it's a success operation?

2:mm/hwpoison-inject.c:51:        return memory_failure(pfn, 0);
mm/madvise.c:910:               ret = memory_failure(pfn, MF_COUNT_INCREASED);

These two cases are mainly for error injections, I checked the test codes, mostly it only care if the value is 0 or < 0;
I do the related test, normally it work well, but for stress test, sometimes in some case, I do meet some fail cases along with the -EBUSY return.
I will dig more.

Other place will only care if the return value is 0. or just ignore it.

Hi naoya, what's your opnion for this possible issue, I need your inputs!

Thanks
Aili Yao
HORIGUCHI NAOYA(堀口 直也) Feb. 25, 2021, 11:28 a.m. UTC | #4
On Thu, Feb 25, 2021 at 11:43:29AM +0800, Aili Yao wrote:
> On Wed, 24 Feb 2021 11:31:55 +0100 Oscar Salvador <osalvador@suse.de> wrote:
...
>  
> > > 
> > > 3.The kill_me_maybe will check the return:
> > > 
> > > 1244 static void kill_me_maybe(struct callback_head *cb)
> > > 1245 {
> > > 
> > > 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > > 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > > 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,  
> > 
> > So, IIUC, in case of a LMCE nested call, the second MCE will reach here.
> > set_mce_nospec() will either mark the underlying page as not mapped/cached.
> >
> This set_mce_nospec() is not proper when the recovery job is on the fly. In my test
> this function failed.

Hi Aili,

I agree that this set_mce_nospec() is not expected to be called for
"already hwpoisoned" page because in the reported case the error
page is already contained and no need to resort changing cache mode.

...

> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index e9481632fcd1..06f006174b8c 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
> > >  	if (TestSetPageHWPoison(head)) {
> > >  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
> > >  		       pfn);
> > > -		return 0;
> > > +		return -EBUSY;  
> > 
> > As David said, madvise_inject_error() will start returning -EBUSY now in case
> > we madvise(MADV_HWPOISON) on an already hwpoisoned page.
> > 
> > AFAICS, memory_failure() can return 0, -Eerrors, and MF_XXX.
> > Would it make sense to unify that? That way we could declare error codes that
> > make somse sense (like MF_ALREADY_HWPOISONED).

It seems to me that memory_failure() does not return MF_XXX.  But yes,
returning some positive value for the reported case could be a solution.

> > 
> 
> @David:
> 
> I checked the code again, and find a few places will care the exact return value, like:
> 
> 1: drivers/base/memory.c:483:      ret = memory_failure(pfn, 0);
> This is for hard page offline, I see the code in mcelog:
> static void offline_action(struct mempage *mp, u64 addr)
> {
> 	if (offline <= OFFLINE_ACCOUNT)
> 		return;
> 	Lprintf("Offlining page %llx\n", addr);
> 	if (memory_offline(addr) < 0) {
> 		Lprintf("Offlining page %llx failed: %s\n", addr, strerror(errno));
> 		mp->offlined = PAGE_OFFLINE_FAILED;
> 	} else
> 		mp->offlined = PAGE_OFFLINE;
> }
> I think return an negative value will be more proper? As the related killing function may not be performed, and we can't say
> it's a success operation?
> 
> 2:mm/hwpoison-inject.c:51:        return memory_failure(pfn, 0);
> mm/madvise.c:910:               ret = memory_failure(pfn, MF_COUNT_INCREASED);
> 
> These two cases are mainly for error injections, I checked the test codes, mostly it only care if the value is 0 or < 0;
> I do the related test, normally it work well, but for stress test, sometimes in some case, I do meet some fail cases along with the -EBUSY return.
> I will dig more.
> 
> Other place will only care if the return value is 0. or just ignore it.
> 
> Hi naoya, what's your opnion for this possible issue, I need your inputs!

We could use some negative value (error code) to report the reported case,
then as you mentioned above, some callers need change to handle the
new case, and the same is true if you use some positive value.
My preference is -EHWPOISON, but other options are fine if justified well.

Thanks,
Naoya Horiguchi
Oscar Salvador Feb. 25, 2021, 11:39 a.m. UTC | #5
On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Hi Aili,
> 
> I agree that this set_mce_nospec() is not expected to be called for
> "already hwpoisoned" page because in the reported case the error
> page is already contained and no need to resort changing cache mode.

Out of curiosity, what is the current behavour now?
Say we have an ongoing MCE which has marked the page as HWPoison but
memory_failure did not take any action on the page yet.
And then, we have another MCE, which ends up there.
set_mce_nospec might clear _PAGE_PRESENT bit.

Does that have any impact on the first MCE?

> It seems to me that memory_failure() does not return MF_XXX.  But yes,
> returning some positive value for the reported case could be a solution.

No, you are right. I somehow managed to confuse myself.
I see now that MF_XXX return codes are filtered out in page_action.

> We could use some negative value (error code) to report the reported case,
> then as you mentioned above, some callers need change to handle the
> new case, and the same is true if you use some positive value.
> My preference is -EHWPOISON, but other options are fine if justified well.

-EHWPOISON seems like a good fit.
HORIGUCHI NAOYA(堀口 直也) Feb. 25, 2021, 12:38 p.m. UTC | #6
On Thu, Feb 25, 2021 at 12:39:30PM +0100, Oscar Salvador wrote:
> On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hi Aili,
> > 
> > I agree that this set_mce_nospec() is not expected to be called for
> > "already hwpoisoned" page because in the reported case the error
> > page is already contained and no need to resort changing cache mode.
> 
> Out of curiosity, what is the current behavour now?
> Say we have an ongoing MCE which has marked the page as HWPoison but
> memory_failure did not take any action on the page yet.
> And then, we have another MCE, which ends up there.
> set_mce_nospec might clear _PAGE_PRESENT bit.
> 
> Does that have any impact on the first MCE?

Hi Oscar,

Thank you for shedding light on this, this race looks worrisome to me.
We call try_to_unmap() inside memory_failure(), where we find affected
ptes by page_vma_mapped_walk() and convert into hwpoison entires in
try_to_unmap_one().  So there seems two racy cases:

  1)
     CPU 0                          CPU 1
     page_vma_mapped_walk
                                    clear _PAGE_PRESENT bit
       // skipped the entry

  2)
     CPU 0                          CPU 1
     page_vma_mapped_walk
       try_to_unmap_one
                                    clear _PAGE_PRESENT bit
         convert the entry
         set_pte_at

In case 1, the affected processes get signals on later access,
so although the info in SIGBUS could be different, that's OK.
And we have no impact in case 2.

Thanks,
Naoya Horiguchi
Luck, Tony Feb. 25, 2021, 6:15 p.m. UTC | #7
On Thu, Feb 25, 2021 at 12:38:06PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Thank you for shedding light on this, this race looks worrisome to me.
> We call try_to_unmap() inside memory_failure(), where we find affected
> ptes by page_vma_mapped_walk() and convert into hwpoison entires in
> try_to_unmap_one().  So there seems two racy cases:
> 
>   1)
>      CPU 0                          CPU 1
>      page_vma_mapped_walk
>                                     clear _PAGE_PRESENT bit
>        // skipped the entry
> 
>   2)
>      CPU 0                          CPU 1
>      page_vma_mapped_walk
>        try_to_unmap_one
>                                     clear _PAGE_PRESENT bit
>          convert the entry
>          set_pte_at
> 
> In case 1, the affected processes get signals on later access,
> so although the info in SIGBUS could be different, that's OK.
> And we have no impact in case 2.

I've been debugging a similar issue for a few days and finally
got enough traces to partially understand what happened.

The test case is a multi-threaded pointer chasing micro-benchmark
running on all logical CPUs. We then inject poison into the address
space of the process.

All works fine if one thread consumes poison and completes all
Linux machine check processing before any other threads read the
poison. The page is unmapped, a SIGBUS is sent (which kills all
threads).

But in the problem case I see:

CPU1 reads poison, takes a machine check. Gets to the
kill_me_maybe() task work, which calls memory_failure()
this CPU sets the page poison flag, but is still executing the
rest of the flow to hunt down tasks/mappings to invalidate pages
and send SIGBUS if required.

CPU2 reads the poison. When it gets to memory_failure()
there's an early return because the poison flag is already
set. So in current code it returns and takes the machine
check again.

CPU3 reads the poison and starts along same path that CPU2
did.

Meanwhile CPU1 gets far enough along in memory failure and hits
a problem. It prints:

[ 1867.409837] Memory failure: 0x42a9ff6: reserved kernel page still referenced by 1 users
[ 1867.409850] Memory failure: 0x42a9ff6: recovery action for reserved kernel page: Failed

and doesn't complete unmapping the page that CPU2 and CPU3 are touching.

Other CPUs gradually reach the poison and join in the fun of repeatedly
taking machine checks.

I have not yet tracked why this user access is reporting as a "reserved kernel page".
Some traces showed that futex(2) syscall was in use from this benchmark,
so maybe the kernel locked a user page that was a contended futex???


Idea for what we should do next ... Now that x86 is calling memory_failure()
from user context ... maybe parallel calls for the same page should
be blocked until the first caller completes so we can:
a) know that pages are unmapped (if that happens)
b) all get the same success/fail status

-Tony
HORIGUCHI NAOYA(堀口 直也) Feb. 26, 2021, 2:19 a.m. UTC | #8
On Thu, Feb 25, 2021 at 10:15:42AM -0800, Luck, Tony wrote:
> On Thu, Feb 25, 2021 at 12:38:06PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Thank you for shedding light on this, this race looks worrisome to me.
> > We call try_to_unmap() inside memory_failure(), where we find affected
> > ptes by page_vma_mapped_walk() and convert into hwpoison entires in
> > try_to_unmap_one().  So there seems two racy cases:
> > 
> >   1)
> >      CPU 0                          CPU 1
> >      page_vma_mapped_walk
> >                                     clear _PAGE_PRESENT bit
> >        // skipped the entry
> > 
> >   2)
> >      CPU 0                          CPU 1
> >      page_vma_mapped_walk
> >        try_to_unmap_one
> >                                     clear _PAGE_PRESENT bit
> >          convert the entry
> >          set_pte_at
> > 
> > In case 1, the affected processes get signals on later access,
> > so although the info in SIGBUS could be different, that's OK.
> > And we have no impact in case 2.
> 
> I've been debugging a similar issue for a few days and finally
> got enough traces to partially understand what happened.
> 
> The test case is a multi-threaded pointer chasing micro-benchmark
> running on all logical CPUs. We then inject poison into the address
> space of the process.
> 
> All works fine if one thread consumes poison and completes all
> Linux machine check processing before any other threads read the
> poison. The page is unmapped, a SIGBUS is sent (which kills all
> threads).
> 
> But in the problem case I see:

Thanks for the description, it's helpful to understand the problem.

> 
> CPU1 reads poison, takes a machine check. Gets to the
> kill_me_maybe() task work, which calls memory_failure()
> this CPU sets the page poison flag, but is still executing the
> rest of the flow to hunt down tasks/mappings to invalidate pages
> and send SIGBUS if required.
> 
> CPU2 reads the poison. When it gets to memory_failure()
> there's an early return because the poison flag is already
> set. So in current code it returns and takes the machine
> check again.
> 
> CPU3 reads the poison and starts along same path that CPU2
> did.

I think that the MCE loop happening on CPU2 and CPU3 is unexpected
and these threads should immediately kill the current process on
each CPU.  force_sig_mceerr() in kill_me_maybe() is supposed to do it,
so Aili's patch would fix this issue too?

> 
> Meanwhile CPU1 gets far enough along in memory failure and hits
> a problem. It prints:
> 
> [ 1867.409837] Memory failure: 0x42a9ff6: reserved kernel page still referenced by 1 users
> [ 1867.409850] Memory failure: 0x42a9ff6: recovery action for reserved kernel page: Failed
> 
> and doesn't complete unmapping the page that CPU2 and CPU3 are touching.
> 
> Other CPUs gradually reach the poison and join in the fun of repeatedly
> taking machine checks.
> 
> I have not yet tracked why this user access is reporting as a "reserved kernel page".
> Some traces showed that futex(2) syscall was in use from this benchmark,
> so maybe the kernel locked a user page that was a contended futex???

This might imply that current logic to identify page state does
not work properly on this exotic type of user page, I'll take a
look on this from futex's viewpoint.

> 
> Idea for what we should do next ... Now that x86 is calling memory_failure()
> from user context ... maybe parallel calls for the same page should
> be blocked until the first caller completes so we can:
> a) know that pages are unmapped (if that happens)
> b) all get the same success/fail status

One memory_failure() call changes the target page's status and
affects all mappings to all affected processes, so I think that
(ideally) we don't have to block other threads (letting them
early return seems fine).  Sometimes memory_failure() fails,
but even in such case, PG_hwpoison is set on the page and other
threads properly get SIGBUSs with this patch, so I think that
we can avoid the worst scenario (like system stall by MCE loop).

Thanks,
Naoya Horiguchi
yaoaili [么爱利] Feb. 26, 2021, 2:52 a.m. UTC | #9
Hi naoya,Oscar,david:
> 
> > We could use some negative value (error code) to report the reported case,
> > then as you mentioned above, some callers need change to handle the
> > new case, and the same is true if you use some positive value.
> > My preference is -EHWPOISON, but other options are fine if justified well.  
> 
> -EHWPOISON seems like a good fit.
> 
I am OK with the -EHWPOISON error code, But I have one doubt here:
When we return this -EHWPOISON error code, Does this means we have to add a new error code
to error-base.h or errno.h? Is this easy realized?

Thanks!
Aili Yao
yaoaili [么爱利] Feb. 26, 2021, 2:59 a.m. UTC | #10
Hi naoya, tony:
> > 
> > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > from user context ... maybe parallel calls for the same page should
> > be blocked until the first caller completes so we can:
> > a) know that pages are unmapped (if that happens)
> > b) all get the same success/fail status  
> 
> One memory_failure() call changes the target page's status and
> affects all mappings to all affected processes, so I think that
> (ideally) we don't have to block other threads (letting them
> early return seems fine).  Sometimes memory_failure() fails,
> but even in such case, PG_hwpoison is set on the page and other
> threads properly get SIGBUSs with this patch, so I think that
> we can avoid the worst scenario (like system stall by MCE loop).
> 
I agree with naoya's point, if we block for this issue, Does this change the result
that the process should be killed? Or is there something other still need to be considered?

Thanks!
Aili Yao
Tony Luck Feb. 26, 2021, 3:26 a.m. UTC | #11
On Thu, Feb 25, 2021 at 6:23 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Thu, Feb 25, 2021 at 10:15:42AM -0800, Luck, Tony wrote:

> > CPU3 reads the poison and starts along same path that CPU2
> > did.
>
> I think that the MCE loop happening on CPU2 and CPU3 is unexpected
> and these threads should immediately kill the current process on
> each CPU.  force_sig_mceerr() in kill_me_maybe() is supposed to do it,
> so Aili's patch would fix this issue too?

It would stop the looping. But for the case where the error came from
user code we don't have
the virtual address that was accessed at this point (normally this
address is found during the
reverse lokup from the physical address inside memory_failure()).

So we can send a generic SIGBUS, but not one with the usual extra
information about the
location of the error.

-Tony
Luck, Tony Feb. 26, 2021, 5:58 p.m. UTC | #12
On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote:
> Hi naoya,Oscar,david:
> > 
> > > We could use some negative value (error code) to report the reported case,
> > > then as you mentioned above, some callers need change to handle the
> > > new case, and the same is true if you use some positive value.
> > > My preference is -EHWPOISON, but other options are fine if justified well.  
> > 
> > -EHWPOISON seems like a good fit.
> > 
> I am OK with the -EHWPOISON error code, But I have one doubt here:
> When we return this -EHWPOISON error code, Does this means we have to add a new error code
> to error-base.h or errno.h? Is this easy realized?

The page already poisoned isn't really an error though. Just the result
of a race condition.  What if we added an extra argument to memory_failure()
so it can tell the caller that the specific reason for the early successful
return is that the page was already poisoned?

Something like this (untested - patch against v5.11):

-Tony

---

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..0e32c4d879fb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -637,6 +637,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 {
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
+	int already = 0;
 
 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
@@ -646,8 +647,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		return NOTIFY_DONE;
 
 	pfn = mce->addr >> PAGE_SHIFT;
-	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+	if (!memory_failure(pfn, 0, &already)) {
+		if (!already)
+			set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1245,15 +1247,19 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int already = 0;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		if (already)
+			force_sig(SIGBUS); // BETTER CODE NEEDED HERE!!!
+		else
+			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
@@ -1440,7 +1446,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	/* mce_severity() should not hand us an ACTION_REQUIRED error */
 	BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index eef4ffb6122c..24c36623e492 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = memory_failure(pfn, 0);
+	ret = memory_failure(pfn, 0, NULL);
 	return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..88b92820465c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3045,7 +3045,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	return memory_failure(pfn, 0, NULL);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..ade1956632aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -907,7 +907,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = memory_failure(pfn, MF_COUNT_INCREASED);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e9481632fcd1..e8508e4d70e5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1388,7 +1388,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks hold.
  */
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	struct page *p;
 	struct page *hpage;
@@ -1418,6 +1418,8 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
 	if (TestSetPageHWPoison(p)) {
+		if (already)
+			*already = 1;
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		return 0;
@@ -1624,7 +1626,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
 		else
-			memory_failure(entry.pfn, entry.flags);
+			memory_failure(entry.pfn, entry.flags, NULL);
 	}
 }
yaoaili [么爱利] March 2, 2021, 4:32 a.m. UTC | #13
On Fri, 26 Feb 2021 09:58:37 -0800
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote:
> > Hi naoya,Oscar,david:  
> > >   
> > > > We could use some negative value (error code) to report the reported case,
> > > > then as you mentioned above, some callers need change to handle the
> > > > new case, and the same is true if you use some positive value.
> > > > My preference is -EHWPOISON, but other options are fine if justified well.    
> > > 
> > > -EHWPOISON seems like a good fit.
> > >   
> > I am OK with the -EHWPOISON error code, But I have one doubt here:
> > When we return this -EHWPOISON error code, Does this means we have to add a new error code
> > to error-base.h or errno.h? Is this easy realized?  
> 
> The page already poisoned isn't really an error though. Just the result
> of a race condition.  What if we added an extra argument to memory_failure()
> so it can tell the caller that the specific reason for the early successful
> return is that the page was already poisoned?
> 

It may be not an error, Is it reasonable to return a positive value like MF_HWPOISON, it seems the 0
return code donesn't tell the whole story. 

Your patch seems more safer, But I don't know if it's worth such multi module modifications for this case.
It really should be referenced to other maintainers and reviewers and thet can give more expert suggestions.

Thanks!
Aili Yao
Luck, Tony March 3, 2021, 3:39 a.m. UTC | #14
On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:
> Hi naoya, tony:
> > > 
> > > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > > from user context ... maybe parallel calls for the same page should
> > > be blocked until the first caller completes so we can:
> > > a) know that pages are unmapped (if that happens)
> > > b) all get the same success/fail status  
> > 
> > One memory_failure() call changes the target page's status and
> > affects all mappings to all affected processes, so I think that
> > (ideally) we don't have to block other threads (letting them
> > early return seems fine).  Sometimes memory_failure() fails,
> > but even in such case, PG_hwpoison is set on the page and other
> > threads properly get SIGBUSs with this patch, so I think that
> > we can avoid the worst scenario (like system stall by MCE loop).
> > 
> I agree with naoya's point, if we block for this issue, Does this change the result
> that the process should be killed? Or is there something other still need to be considered?

Ok ... no blocking ... I think someone in this thread suggested
scanning the page tables to make sure the poisoned page had been
unmapped.

There's a walk_page_range() function that does all the work for that.
Just need to supply some callback routines that check whether a
mapping includes the bad PFN and clear the PRESENT bit.

RFC patch below against v5.12-rc1

-Tony

From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Tue, 2 Mar 2021 15:06:33 -0800
Subject: [PATCH] x86/mce: Handle races between machine checks

When multiple CPUs hit the same poison memory there is a race. The
first CPU into memory_failure() atomically marks the page as poison
and continues processing to hunt down all the tasks that map this page
so that the virtual addresses can be marked not-present and SIGBUS
sent to the task that did the access.

Later CPUs get an early return from memory_failure() and may return
to user mode and access the poison again.

Add a new argument to memory_failure() so that it can indicate when
the race has been lost. Fix kill_me_maybe() to scan page tables in
this case to unmap pages.
---
 arch/x86/kernel/cpu/mce/core.c | 61 +++++++++++++++++++++++++++++++---
 drivers/base/memory.c          |  2 +-
 include/linux/mm.h             |  2 +-
 mm/hwpoison-inject.c           |  2 +-
 mm/madvise.c                   |  2 +-
 mm/memory-failure.c            |  6 ++--
 6 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..2c6c560f3f92 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -30,6 +30,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -637,6 +638,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 {
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
+	int already = 0;
 
 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
@@ -646,8 +648,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		return NOTIFY_DONE;
 
 	pfn = mce->addr >> PAGE_SHIFT;
-	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+	if (!memory_failure(pfn, 0, &already)) {
+		if (!already)
+			set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1248,6 +1251,50 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 	*m = *final;
 }
 
+static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 pfn = (u64)walk->private;
+
+	if (pte_pfn(*pte) == pfn)
+		pte->pte = pte->pte & ~_PAGE_PRESENT;
+
+	return 0;
+}
+
+static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PMD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+
+	if (!pmd_large(*pmd))
+		return 0;
+
+	if (pmd_pfn(*pmd) >> shift == pfn >> shift)
+		pmd->pmd = pmd->pmd & ~_PAGE_PRESENT;
+
+	return 0;
+}
+
+static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PUD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+
+	if (!pud_large(*pud))
+		return 0;
+
+	if (pud_pfn(*pud) >> shift == pfn >> shift)
+		pud->pud = pud->pud & ~_PAGE_PRESENT;
+
+	return 0;
+}
+
+static struct mm_walk_ops walk = {
+	.pte_entry = pte_entry,
+	.pmd_entry = pmd_entry,
+	.pud_entry = pud_entry
+};
+
 static void kill_me_now(struct callback_head *ch)
 {
 	force_sig(SIGBUS);
@@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int already = 0;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		if (already)
+			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
+		else
+			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
 		sync_core();
 		return;
 	}
@@ -1452,7 +1503,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	/* mce_severity() should not hand us an ACTION_REQUIRED error */
 	BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..144500983656 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = memory_failure(pfn, 0);
+	ret = memory_failure(pfn, 0, NULL);
 	return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..beaa6e871cbe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3003,7 +3003,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	return memory_failure(pfn, 0, NULL);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index df692d2e35d4..09f569fed68d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = memory_failure(pfn, MF_COUNT_INCREASED);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..9a8911aa5fc9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks hold.
  */
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	struct page *p;
 	struct page *hpage;
@@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
 	if (TestSetPageHWPoison(p)) {
+		if (already)
+			*already = 1;
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		return 0;
@@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
 		else
-			memory_failure(entry.pfn, entry.flags);
+			memory_failure(entry.pfn, entry.flags, NULL);
 	}
 }
yaoaili [么爱利] March 3, 2021, 3:57 a.m. UTC | #15
On Tue, 2 Mar 2021 19:39:53 -0800
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:
> > Hi naoya, tony:  
> > > > 
> > > > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > > > from user context ... maybe parallel calls for the same page should
> > > > be blocked until the first caller completes so we can:
> > > > a) know that pages are unmapped (if that happens)
> > > > b) all get the same success/fail status    
> > > 
> > > One memory_failure() call changes the target page's status and
> > > affects all mappings to all affected processes, so I think that
> > > (ideally) we don't have to block other threads (letting them
> > > early return seems fine).  Sometimes memory_failure() fails,
> > > but even in such case, PG_hwpoison is set on the page and other
> > > threads properly get SIGBUSs with this patch, so I think that
> > > we can avoid the worst scenario (like system stall by MCE loop).
> > >   
> > I agree with naoya's point, if we block for this issue, Does this change the result
> > that the process should be killed? Or is there something other still need to be considered?  
> 
> Ok ... no blocking ... I think someone in this thread suggested
> scanning the page tables to make sure the poisoned page had been
> unmapped.
> 
> There's a walk_page_range() function that does all the work for that.
> Just need to supply some callback routines that check whether a
> mapping includes the bad PFN and clear the PRESENT bit.
> 
> RFC patch below against v5.12-rc1
> 
> -Tony
> 
> From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001
> From: Tony Luck <tony.luck@intel.com>
> Date: Tue, 2 Mar 2021 15:06:33 -0800
> Subject: [PATCH] x86/mce: Handle races between machine checks
> 
> When multiple CPUs hit the same poison memory there is a race. The
> first CPU into memory_failure() atomically marks the page as poison
> and continues processing to hunt down all the tasks that map this page
> so that the virtual addresses can be marked not-present and SIGBUS
> sent to the task that did the access.
> 
> Later CPUs get an early return from memory_failure() and may return
> to user mode and access the poison again.
> 
> Add a new argument to memory_failure() so that it can indicate when
> the race has been lost. Fix kill_me_maybe() to scan page tables in
> this case to unmap pages.

> +
>  static void kill_me_now(struct callback_head *ch)
>  {
>  	force_sig(SIGBUS);
> @@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct callback_head *cb)
>  {
>  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
>  	int flags = MF_ACTION_REQUIRED;
> +	int already = 0;
>  
>  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
>  
>  	if (!p->mce_ripv)
>  		flags |= MF_MUST_KILL;
>  
> -	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> +	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
>  	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> -		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +		if (already)
> +			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
> +		else
> +			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
>  		sync_core();
>  		return;

>  	MF_MUST_KILL = 1 << 2,
>  	MF_SOFT_OFFLINE = 1 << 3,
>  };

I have one doubt here, when "walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));" is done,
so how is the process triggering this error returned if it have taken the wrong data?
yaoaili [么爱利] March 3, 2021, 8:39 a.m. UTC | #16
Hi tony:

> On Tue, 2 Mar 2021 19:39:53 -0800
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:  
> > > Hi naoya, tony:    
> > > > > 
> > > > > Idea for what we should do next ... Now that x86 is calling memory_failure()
> > > > > from user context ... maybe parallel calls for the same page should
> > > > > be blocked until the first caller completes so we can:
> > > > > a) know that pages are unmapped (if that happens)
> > > > > b) all get the same success/fail status      
> > > > 
> > > > One memory_failure() call changes the target page's status and
> > > > affects all mappings to all affected processes, so I think that
> > > > (ideally) we don't have to block other threads (letting them
> > > > early return seems fine).  Sometimes memory_failure() fails,
> > > > but even in such case, PG_hwpoison is set on the page and other
> > > > threads properly get SIGBUSs with this patch, so I think that
> > > > we can avoid the worst scenario (like system stall by MCE loop).
> > > >     
> > > I agree with naoya's point, if we block for this issue, Does this change the result
> > > that the process should be killed? Or is there something other still need to be considered?    
> > 
> > Ok ... no blocking ... 

I do think about blocking method and the error address issue with sigbus,here is my opinion, maybe helpful:

For blocking, if we block here, there are some undefine work i think should be done.
As we don't know the process B triggering this error again is early-kill or not, so the previous memory_failure() call may 
not add B on kill_list, even if B is on kill_list, the error level for B is not proper set, as B should get an AR SIGBUS;

So we can't just wait, We must have some logic adding the process B to kill list, and as this is an AR error
another change should be done to current code, we need more logic in kill_proc or some other place.

Even if all the work is done right. There is one more serious scenario though, we even don't know the current step the previous memory_failure() is on,
So previous modification may not be usefull at all; When this scenario happens, what we can do?  block or return ?
if finally we return, an error code should be taken back; so we have to go to error process logic and a signal without right address will be sent;

For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
to the error brach again, remaining current code may be an easy option;

Any methods or patchs can solve the issue in a better way is OK to me, i want this issue fixed and in more complete way!
Luck, Tony March 3, 2021, 3:41 p.m. UTC | #17
> For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> to the error brach again, remaining current code may be an easy option;

My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should
do a bit more that just clear the PRESENT bit).

So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe()

If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF

The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
(just like it does if the page had been removed by an earlier UCNA/SRAO error).

At least that's the theory.

-Tony
yaoaili [么爱利] March 4, 2021, 2:16 a.m. UTC | #18
On Wed, 3 Mar 2021 15:41:35 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> > to the error brach again, remaining current code may be an easy option;  
> 
> My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
> tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should
> do a bit more that just clear the PRESENT bit).
> 
> So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe()
> 
> If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF

Got this, I have some error thoughts here.


> The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
> (just like it does if the page had been removed by an earlier UCNA/SRAO error).

if your methods works, should it be like this?

1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
1583                         if (PageHuge(page)) {
1584                                 hugetlb_count_sub(compound_nr(page), mm);
1585                                 set_huge_swap_pte_at(mm, address,
1586                                                      pvmw.pte, pteval,
1587                                                      vma_mmu_pagesize(vma));
1588                         } else {
1589                                 dec_mm_counter(mm, mm_counter(page));
1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
1591                         }

the page fault check if it's a poison page using is_hwpoison_entry(),
yaoaili [么爱利] March 4, 2021, 4:19 a.m. UTC | #19
On Thu, 4 Mar 2021 10:16:53 +0800
Aili Yao <yaoaili@kingsoft.com> wrote:

> On Wed, 3 Mar 2021 15:41:35 +0000
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> > > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> > > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> > > to the error brach again, remaining current code may be an easy option;    
> > 
> > My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
> > tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should
> > do a bit more that just clear the PRESENT bit).
> > 
> > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe()

And when this happen, the process will receive an SIGBUS with AO level, is it proper as not an AR?

> > If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF  
> 
> Got this, I have some error thoughts here.
> 
> 
> > The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
> > (just like it does if the page had been removed by an earlier UCNA/SRAO error).  
> 
> if your methods works, should it be like this?
> 
> 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> 1583                         if (PageHuge(page)) {
> 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> 1585                                 set_huge_swap_pte_at(mm, address,
> 1586                                                      pvmw.pte, pteval,
> 1587                                                      vma_mmu_pagesize(vma));
> 1588                         } else {
> 1589                                 dec_mm_counter(mm, mm_counter(page));
> 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> 1591                         }
> 
> the page fault check if it's a poison page using is_hwpoison_entry(),
> 

And if it works, does we need some locking mechanism before we call walk_page_range();
if we lock, does we need to process the blocking interrupted error as other places will do?
yaoaili [么爱利] March 4, 2021, 6:45 a.m. UTC | #20
On Thu, 4 Mar 2021 12:19:41 +0800
Aili Yao <yaoaili@kingsoft.com> wrote:

> On Thu, 4 Mar 2021 10:16:53 +0800
> Aili Yao <yaoaili@kingsoft.com> wrote:
> 
> > On Wed, 3 Mar 2021 15:41:35 +0000
> > "Luck, Tony" <tony.luck@intel.com> wrote:
> >   
> > > > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
> > > > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
> > > > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
> > > > to the error brach again, remaining current code may be an easy option;      
> > > 
> > > My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page
> > > tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should
> > > do a bit more that just clear the PRESENT bit).
> > > 
> > > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe()  
> 
> And when this happen, the process will receive an SIGBUS with AO level, is it proper as not an AR?
> 
> > > If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF    
> > 
> > Got this, I have some error thoughts here.
> > 
> >   
> > > The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS
> > > (just like it does if the page had been removed by an earlier UCNA/SRAO error).    
> > 
> > if your methods works, should it be like this?
> > 
> > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > 1583                         if (PageHuge(page)) {
> > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > 1585                                 set_huge_swap_pte_at(mm, address,
> > 1586                                                      pvmw.pte, pteval,
> > 1587                                                      vma_mmu_pagesize(vma));
> > 1588                         } else {
> > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > 1591                         }
> > 
> > the page fault check if it's a poison page using is_hwpoison_entry(),
> >   
> 
> And if it works, does we need some locking mechanism before we call walk_page_range();
> if we lock, does we need to process the blocking interrupted error as other places will do?
> 

And another thing:
Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?
Luck, Tony March 4, 2021, 11:57 p.m. UTC | #21
On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > if your methods works, should it be like this?
> > > 
> > > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > 1583                         if (PageHuge(page)) {
> > > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > > 1585                                 set_huge_swap_pte_at(mm, address,
> > > 1586                                                      pvmw.pte, pteval,
> > > 1587                                                      vma_mmu_pagesize(vma));
> > > 1588                         } else {
> > > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > > 1591                         }
> > > 
> > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > >   
> > 
> > And if it works, does we need some locking mechanism before we call walk_page_range();
> > if we lock, does we need to process the blocking interrupted error as other places will do?
> > 
> 
> And another thing:
> Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?

Thanks for all the pointers.  I added them to the patch (see below).
[The pmd/pud cases may need some tweaking ... but just trying to get
the 4K page case working first]

I tried testing by skipping the call to memory_failure() and just
using this new code to search the page tables for current page and
marking it hwpoison (to simulate the case where 2nd process gets the
early return from memory_failure(). Something is still missing because I get:

[  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
[  481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400
[  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
[  482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6
[  482.951355] memcg:ffff94a809834000
[  482.955153] aops:shmem_aops ino:3c04
[  482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison)
[  482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8
[  482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000
[  482.984310] page dumped because: still mapped when deleted

-Tony

commit e5de44560b33e2d407704243566253a70f858a59
Author: Tony Luck <tony.luck@intel.com>
Date:   Tue Mar 2 15:06:33 2021 -0800

    x86/mce: Handle races between machine checks
    
    When multiple CPUs hit the same poison memory there is a race. The
    first CPU into memory_failure() atomically marks the page as poison
    and continues processing to hunt down all the tasks that map this page
    so that the virtual addresses can be marked not-present and SIGBUS
    sent to the task that did the access.
    
    Later CPUs get an early return from memory_failure() and may return
    to user mode and access the poison again.
    
    Add a new argument to memory_failure() so that it can indicate when
    the race has been lost. Fix kill_me_maybe() to scan page tables in
    this case to unmap pages.

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..a52c6a772de2 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 {
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
+	int already = 0;
 
 	if (!mce || !mce_usable_address(mce))
 		return NOTIFY_DONE;
@@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		return NOTIFY_DONE;
 
 	pfn = mce->addr >> PAGE_SHIFT;
-	if (!memory_failure(pfn, 0)) {
-		set_mce_nospec(pfn, whole_page(mce));
+	if (!memory_failure(pfn, 0, &already)) {
+		if (!already)
+			set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
 
@@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 	*m = *final;
 }
 
+static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 pfn = (u64)walk->private;
+	struct page *page;
+	pte_t pteval;
+
+	if (pte_pfn(*pte) == pfn) {
+pr_info("pte_entry: matched pfn - mark poison & zap pte\n");
+		page = pfn_to_page(pfn);
+		lock_page(page);
+SetPageHWPoison(page);
+		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
+		dec_mm_counter(walk->mm, mm_counter(page));
+		set_pte_at(current->mm, addr, pte, pteval);
+		unlock_page(page);
+		flush_tlb_page(walk->vma, addr);
+	}
+
+	return 0;
+}
+
+static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PMD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+	struct page *page;
+	pte_t pteval;
+
+	if (!pmd_large(*pmd))
+		return 0;
+
+	if (pmd_pfn(*pmd) >> shift == pfn >> shift) {
+		page = pfn_to_page(pfn);
+		lock_page(page);
+		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
+		hugetlb_count_sub(compound_nr(page), walk->mm);
+		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma));
+		unlock_page(page);
+		flush_tlb_page(walk->vma, addr);
+	}
+
+	return 0;
+}
+
+static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	int shift = PUD_SHIFT - PAGE_SHIFT;
+	u64 pfn = (u64)walk->private;
+	struct page *page;
+	pte_t pteval;
+
+	if (!pud_large(*pud))
+		return 0;
+
+	if (pud_pfn(*pud) >> shift == pfn >> shift) {
+		page = pfn_to_page(pfn);
+		lock_page(page);
+		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
+		hugetlb_count_sub(compound_nr(page), walk->mm);
+		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma));
+		unlock_page(page);
+		flush_tlb_page(walk->vma, addr);
+	}
+
+	return 0;
+}
+
+static struct mm_walk_ops walk = {
+	.pte_entry = pte_entry,
+	.pmd_entry = pmd_entry,
+	.pud_entry = pud_entry
+};
+
 static void kill_me_now(struct callback_head *ch)
 {
 	force_sig(SIGBUS);
@@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb)
 {
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
+	int already = 0;
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		if (already) {
+			mmap_read_lock(current->mm);
+			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
+			mmap_read_unlock(current->mm);
+		} else {
+			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+		}
 		sync_core();
 		return;
 	}
@@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	/* mce_severity() should not hand us an ACTION_REQUIRED error */
 	BUG_ON(flags & MF_ACTION_REQUIRED);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..144500983656 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = memory_failure(pfn, 0);
+	ret = memory_failure(pfn, 0, NULL);
 	return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..beaa6e871cbe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3003,7 +3003,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure(unsigned long pfn, int flags, int *already);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..bfd5151dcd3f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	return memory_failure(pfn, 0, NULL);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index df692d2e35d4..09f569fed68d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = memory_failure(pfn, MF_COUNT_INCREASED);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..9a8911aa5fc9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks hold.
  */
-int memory_failure(unsigned long pfn, int flags)
+int memory_failure(unsigned long pfn, int flags, int *already)
 {
 	struct page *p;
 	struct page *hpage;
@@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
 	if (TestSetPageHWPoison(p)) {
+		if (already)
+			*already = 1;
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
 		return 0;
@@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
 		else
-			memory_failure(entry.pfn, entry.flags);
+			memory_failure(entry.pfn, entry.flags, NULL);
 	}
 }
yaoaili [么爱利] March 5, 2021, 1:30 a.m. UTC | #22
On Thu, 4 Mar 2021 15:57:20 -0800
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > > if your methods works, should it be like this?
> > > > 
> > > > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > > 1583                         if (PageHuge(page)) {
> > > > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > > > 1585                                 set_huge_swap_pte_at(mm, address,
> > > > 1586                                                      pvmw.pte, pteval,
> > > > 1587                                                      vma_mmu_pagesize(vma));
> > > > 1588                         } else {
> > > > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > > > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > > > 1591                         }
> > > > 
> > > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > > >     
> > > 
> > > And if it works, does we need some locking mechanism before we call walk_page_range();
> > > if we lock, does we need to process the blocking interrupted error as other places will do?
> > >   
> > 
> > And another thing:
> > Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?  
> 
> Thanks for all the pointers.  I added them to the patch (see below).
> [The pmd/pud cases may need some tweaking ... but just trying to get
> the 4K page case working first]
> 
> I tried testing by skipping the call to memory_failure() and just
> using this new code to search the page tables for current page and
> marking it hwpoison (to simulate the case where 2nd process gets the
> early return from memory_failure(). Something is still missing because I get:
> 
> [  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
> [  481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400
> [  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
> [  482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6
> [  482.951355] memcg:ffff94a809834000
> [  482.955153] aops:shmem_aops ino:3c04
> [  482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison)
> [  482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8
> [  482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000
> [  482.984310] page dumped because: still mapped when deleted

From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?



> commit e5de44560b33e2d407704243566253a70f858a59
> Author: Tony Luck <tony.luck@intel.com>
> Date:   Tue Mar 2 15:06:33 2021 -0800
> 
>     x86/mce: Handle races between machine checks
>     
>     When multiple CPUs hit the same poison memory there is a race. The
>     first CPU into memory_failure() atomically marks the page as poison
>     and continues processing to hunt down all the tasks that map this page
>     so that the virtual addresses can be marked not-present and SIGBUS
>     sent to the task that did the access.
>     
>     Later CPUs get an early return from memory_failure() and may return
>     to user mode and access the poison again.
>     
>     Add a new argument to memory_failure() so that it can indicate when
>     the race has been lost. Fix kill_me_maybe() to scan page tables in
>     this case to unmap pages.
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7962355436da..a52c6a772de2 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -28,8 +28,12 @@
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
>  #include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include <linux/init.h>
>  #include <linux/kmod.h>
> +#include <linux/pagewalk.h>
>  #include <linux/poll.h>
>  #include <linux/nmi.h>
>  #include <linux/cpu.h>
> @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  {
>  	struct mce *mce = (struct mce *)data;
>  	unsigned long pfn;
> +	int already = 0;
>  
>  	if (!mce || !mce_usable_address(mce))
>  		return NOTIFY_DONE;
> @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  		return NOTIFY_DONE;
>  
>  	pfn = mce->addr >> PAGE_SHIFT;
> -	if (!memory_failure(pfn, 0)) {
> -		set_mce_nospec(pfn, whole_page(mce));
> +	if (!memory_failure(pfn, 0, &already)) {
> +		if (!already)
> +			set_mce_nospec(pfn, whole_page(mce));
>  		mce->kflags |= MCE_HANDLED_UC;
>  	}
>  
> @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>  	*m = *final;
>  }
>  
> +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	u64 pfn = (u64)walk->private;
> +	struct page *page;
> +	pte_t pteval;
> +
> +	if (pte_pfn(*pte) == pfn) {
> +pr_info("pte_entry: matched pfn - mark poison & zap pte\n");
> +		page = pfn_to_page(pfn);
> +		lock_page(page);
> +SetPageHWPoison(page);
> +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> +		dec_mm_counter(walk->mm, mm_counter(page));
> +		set_pte_at(current->mm, addr, pte, pteval);
> +		unlock_page(page);
> +		flush_tlb_page(walk->vma, addr);
> +	}
> +
> +	return 0;
> +}
> +
> +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	int shift = PMD_SHIFT - PAGE_SHIFT;
> +	u64 pfn = (u64)walk->private;
> +	struct page *page;
> +	pte_t pteval;
> +
> +	if (!pmd_large(*pmd))
> +		return 0;
> +
> +	if (pmd_pfn(*pmd) >> shift == pfn >> shift) {
> +		page = pfn_to_page(pfn);
> +		lock_page(page);
> +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> +		hugetlb_count_sub(compound_nr(page), walk->mm);
> +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma));
> +		unlock_page(page);
> +		flush_tlb_page(walk->vma, addr);
> +	}
> +
> +	return 0;
> +}
> +
> +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	int shift = PUD_SHIFT - PAGE_SHIFT;
> +	u64 pfn = (u64)walk->private;
> +	struct page *page;
> +	pte_t pteval;
> +
> +	if (!pud_large(*pud))
> +		return 0;
> +
> +	if (pud_pfn(*pud) >> shift == pfn >> shift) {
> +		page = pfn_to_page(pfn);
> +		lock_page(page);
> +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> +		hugetlb_count_sub(compound_nr(page), walk->mm);
> +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma));
> +		unlock_page(page);
> +		flush_tlb_page(walk->vma, addr);
> +	}
> +
> +	return 0;
> +}
> +
> +static struct mm_walk_ops walk = {
> +	.pte_entry = pte_entry,
> +	.pmd_entry = pmd_entry,
> +	.pud_entry = pud_entry
> +};
> +
>  static void kill_me_now(struct callback_head *ch)
>  {
>  	force_sig(SIGBUS);
> @@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb)
>  {
>  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
>  	int flags = MF_ACTION_REQUIRED;
> +	int already = 0;
>  
>  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
>  
>  	if (!p->mce_ripv)
>  		flags |= MF_MUST_KILL;
>  
> -	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> +	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
>  	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> -		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +		if (already) {
> +			mmap_read_lock(current->mm);
> +			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
> +			mmap_read_unlock(current->mm);
> +		} else {
> +			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> +		}
>  		sync_core();
>  		return;
>  	}
> @@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  EXPORT_SYMBOL_GPL(do_machine_check);
>  
>  #ifndef CONFIG_MEMORY_FAILURE
> -int memory_failure(unsigned long pfn, int flags)
> +int memory_failure(unsigned long pfn, int flags, int *already)
>  {
>  	/* mce_severity() should not hand us an ACTION_REQUIRED error */
>  	BUG_ON(flags & MF_ACTION_REQUIRED);
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f35298425575..144500983656 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
>  	if (kstrtoull(buf, 0, &pfn) < 0)
>  		return -EINVAL;
>  	pfn >>= PAGE_SHIFT;
> -	ret = memory_failure(pfn, 0);
> +	ret = memory_failure(pfn, 0, NULL);
>  	return ret ? ret : count;
>  }
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77e64e3eac80..beaa6e871cbe 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3003,7 +3003,7 @@ enum mf_flags {
>  	MF_MUST_KILL = 1 << 2,
>  	MF_SOFT_OFFLINE = 1 << 3,
>  };
> -extern int memory_failure(unsigned long pfn, int flags);
> +extern int memory_failure(unsigned long pfn, int flags, int *already);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
>  extern int unpoison_memory(unsigned long pfn);
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index 1ae1ebc2b9b1..bfd5151dcd3f 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
>  
>  inject:
>  	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
> -	return memory_failure(pfn, 0);
> +	return memory_failure(pfn, 0, NULL);
>  }
>  
>  static int hwpoison_unpoison(void *data, u64 val)
> diff --git a/mm/madvise.c b/mm/madvise.c
> index df692d2e35d4..09f569fed68d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
>  		} else {
>  			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>  				 pfn, start);
> -			ret = memory_failure(pfn, MF_COUNT_INCREASED);
> +			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
>  		}
>  
>  		if (ret)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..9a8911aa5fc9 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>   * Must run in process context (e.g. a work queue) with interrupts
>   * enabled and no spinlocks hold.
>   */
> -int memory_failure(unsigned long pfn, int flags)
> +int memory_failure(unsigned long pfn, int flags, int *already)
>  {
>  	struct page *p;
>  	struct page *hpage;
> @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (PageHuge(p))
>  		return memory_failure_hugetlb(pfn, flags);
>  	if (TestSetPageHWPoison(p)) {
> +		if (already)
> +			*already = 1;
>  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
>  			pfn);
>  		return 0;
> @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
>  		if (entry.flags & MF_SOFT_OFFLINE)
>  			soft_offline_page(entry.pfn, entry.flags);
>  		else
> -			memory_failure(entry.pfn, entry.flags);
> +			memory_failure(entry.pfn, entry.flags, NULL);
>  	}
>  }
>
yaoaili [么爱利] March 5, 2021, 1:36 a.m. UTC | #23
On Fri, 5 Mar 2021 09:30:16 +0800
Aili Yao <yaoaili@kingsoft.com> wrote:

> On Thu, 4 Mar 2021 15:57:20 -0800
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote:
> > > > > if your methods works, should it be like this?
> > > > > 
> > > > > 1582                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > > > > 1583                         if (PageHuge(page)) {
> > > > > 1584                                 hugetlb_count_sub(compound_nr(page), mm);
> > > > > 1585                                 set_huge_swap_pte_at(mm, address,
> > > > > 1586                                                      pvmw.pte, pteval,
> > > > > 1587                                                      vma_mmu_pagesize(vma));
> > > > > 1588                         } else {
> > > > > 1589                                 dec_mm_counter(mm, mm_counter(page));
> > > > > 1590                                 set_pte_at(mm, address, pvmw.pte, pteval);
> > > > > 1591                         }
> > > > > 
> > > > > the page fault check if it's a poison page using is_hwpoison_entry(),
> > > > >     
> > > > 
> > > > And if it works, does we need some locking mechanism before we call walk_page_range();
> > > > if we lock, does we need to process the blocking interrupted error as other places will do?
> > > >   
> > > 
> > > And another thing:
> > > Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?  
> > 
> > Thanks for all the pointers.  I added them to the patch (see below).
> > [The pmd/pud cases may need some tweaking ... but just trying to get
> > the 4K page case working first]
> > 
> > I tried testing by skipping the call to memory_failure() and just
> > using this new code to search the page tables for current page and
> > marking it hwpoison (to simulate the case where 2nd process gets the
> > early return from memory_failure(). Something is still missing because I get:
> > 
> > [  481.911298] mce: pte_entry: matched pfn - mark poison & zap pte
> > [  481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400
> > [  482.933775] BUG: Bad page cache in process einj_mem_uc  pfn:408b6d6
> > [  482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6
> > [  482.951355] memcg:ffff94a809834000
> > [  482.955153] aops:shmem_aops ino:3c04
> > [  482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison)
> > [  482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8
> > [  482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000
> > [  482.984310] page dumped because: still mapped when deleted
> 
> From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?

Does this walk proper for other memory-failure error cases, can it be applied to if (p->mce_vaddr != (void __user *)-1l) branch
in kill_me_maybe()?

> > commit e5de44560b33e2d407704243566253a70f858a59
> > Author: Tony Luck <tony.luck@intel.com>
> > Date:   Tue Mar 2 15:06:33 2021 -0800
> > 
> >     x86/mce: Handle races between machine checks
> >     
> >     When multiple CPUs hit the same poison memory there is a race. The
> >     first CPU into memory_failure() atomically marks the page as poison
> >     and continues processing to hunt down all the tasks that map this page
> >     so that the virtual addresses can be marked not-present and SIGBUS
> >     sent to the task that did the access.
> >     
> >     Later CPUs get an early return from memory_failure() and may return
> >     to user mode and access the poison again.
> >     
> >     Add a new argument to memory_failure() so that it can indicate when
> >     the race has been lost. Fix kill_me_maybe() to scan page tables in
> >     this case to unmap pages.
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index 7962355436da..a52c6a772de2 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -28,8 +28,12 @@
> >  #include <linux/sysfs.h>
> >  #include <linux/types.h>
> >  #include <linux/slab.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  #include <linux/init.h>
> >  #include <linux/kmod.h>
> > +#include <linux/pagewalk.h>
> >  #include <linux/poll.h>
> >  #include <linux/nmi.h>
> >  #include <linux/cpu.h>
> > @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
> >  {
> >  	struct mce *mce = (struct mce *)data;
> >  	unsigned long pfn;
> > +	int already = 0;
> >  
> >  	if (!mce || !mce_usable_address(mce))
> >  		return NOTIFY_DONE;
> > @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
> >  		return NOTIFY_DONE;
> >  
> >  	pfn = mce->addr >> PAGE_SHIFT;
> > -	if (!memory_failure(pfn, 0)) {
> > -		set_mce_nospec(pfn, whole_page(mce));
> > +	if (!memory_failure(pfn, 0, &already)) {
> > +		if (!already)
> > +			set_mce_nospec(pfn, whole_page(mce));
> >  		mce->kflags |= MCE_HANDLED_UC;
> >  	}
> >  
> > @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> >  	*m = *final;
> >  }
> >  
> > +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	u64 pfn = (u64)walk->private;
> > +	struct page *page;
> > +	pte_t pteval;
> > +
> > +	if (pte_pfn(*pte) == pfn) {
> > +pr_info("pte_entry: matched pfn - mark poison & zap pte\n");
> > +		page = pfn_to_page(pfn);
> > +		lock_page(page);
> > +SetPageHWPoison(page);
> > +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> > +		dec_mm_counter(walk->mm, mm_counter(page));
> > +		set_pte_at(current->mm, addr, pte, pteval);
> > +		unlock_page(page);
> > +		flush_tlb_page(walk->vma, addr);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	int shift = PMD_SHIFT - PAGE_SHIFT;
> > +	u64 pfn = (u64)walk->private;
> > +	struct page *page;
> > +	pte_t pteval;
> > +
> > +	if (!pmd_large(*pmd))
> > +		return 0;
> > +
> > +	if (pmd_pfn(*pmd) >> shift == pfn >> shift) {
> > +		page = pfn_to_page(pfn);
> > +		lock_page(page);
> > +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> > +		hugetlb_count_sub(compound_nr(page), walk->mm);
> > +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma));
> > +		unlock_page(page);
> > +		flush_tlb_page(walk->vma, addr);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	int shift = PUD_SHIFT - PAGE_SHIFT;
> > +	u64 pfn = (u64)walk->private;
> > +	struct page *page;
> > +	pte_t pteval;
> > +
> > +	if (!pud_large(*pud))
> > +		return 0;
> > +
> > +	if (pud_pfn(*pud) >> shift == pfn >> shift) {
> > +		page = pfn_to_page(pfn);
> > +		lock_page(page);
> > +		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
> > +		hugetlb_count_sub(compound_nr(page), walk->mm);
> > +		set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma));
> > +		unlock_page(page);
> > +		flush_tlb_page(walk->vma, addr);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static struct mm_walk_ops walk = {
> > +	.pte_entry = pte_entry,
> > +	.pmd_entry = pmd_entry,
> > +	.pud_entry = pud_entry
> > +};
> > +
> >  static void kill_me_now(struct callback_head *ch)
> >  {
> >  	force_sig(SIGBUS);
> > @@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb)
> >  {
> >  	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> >  	int flags = MF_ACTION_REQUIRED;
> > +	int already = 0;
> >  
> >  	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
> >  
> >  	if (!p->mce_ripv)
> >  		flags |= MF_MUST_KILL;
> >  
> > -	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > +	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) &&
> >  	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > -		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> > +		if (already) {
> > +			mmap_read_lock(current->mm);
> > +			walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));
> > +			mmap_read_unlock(current->mm);
> > +		} else {
> > +			set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> > +		}
> >  		sync_core();
> >  		return;
> >  	}
> > @@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  EXPORT_SYMBOL_GPL(do_machine_check);
> >  
> >  #ifndef CONFIG_MEMORY_FAILURE
> > -int memory_failure(unsigned long pfn, int flags)
> > +int memory_failure(unsigned long pfn, int flags, int *already)
> >  {
> >  	/* mce_severity() should not hand us an ACTION_REQUIRED error */
> >  	BUG_ON(flags & MF_ACTION_REQUIRED);
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index f35298425575..144500983656 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev,
> >  	if (kstrtoull(buf, 0, &pfn) < 0)
> >  		return -EINVAL;
> >  	pfn >>= PAGE_SHIFT;
> > -	ret = memory_failure(pfn, 0);
> > +	ret = memory_failure(pfn, 0, NULL);
> >  	return ret ? ret : count;
> >  }
> >  
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 77e64e3eac80..beaa6e871cbe 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3003,7 +3003,7 @@ enum mf_flags {
> >  	MF_MUST_KILL = 1 << 2,
> >  	MF_SOFT_OFFLINE = 1 << 3,
> >  };
> > -extern int memory_failure(unsigned long pfn, int flags);
> > +extern int memory_failure(unsigned long pfn, int flags, int *already);
> >  extern void memory_failure_queue(unsigned long pfn, int flags);
> >  extern void memory_failure_queue_kick(int cpu);
> >  extern int unpoison_memory(unsigned long pfn);
> > diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> > index 1ae1ebc2b9b1..bfd5151dcd3f 100644
> > --- a/mm/hwpoison-inject.c
> > +++ b/mm/hwpoison-inject.c
> > @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val)
> >  
> >  inject:
> >  	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
> > -	return memory_failure(pfn, 0);
> > +	return memory_failure(pfn, 0, NULL);
> >  }
> >  
> >  static int hwpoison_unpoison(void *data, u64 val)
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index df692d2e35d4..09f569fed68d 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior,
> >  		} else {
> >  			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
> >  				 pfn, start);
> > -			ret = memory_failure(pfn, MF_COUNT_INCREASED);
> > +			ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL);
> >  		}
> >  
> >  		if (ret)
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 24210c9bd843..9a8911aa5fc9 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> >   * Must run in process context (e.g. a work queue) with interrupts
> >   * enabled and no spinlocks hold.
> >   */
> > -int memory_failure(unsigned long pfn, int flags)
> > +int memory_failure(unsigned long pfn, int flags, int *already)
> >  {
> >  	struct page *p;
> >  	struct page *hpage;
> > @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags)
> >  	if (PageHuge(p))
> >  		return memory_failure_hugetlb(pfn, flags);
> >  	if (TestSetPageHWPoison(p)) {
> > +		if (already)
> > +			*already = 1;
> >  		pr_err("Memory failure: %#lx: already hardware poisoned\n",
> >  			pfn);
> >  		return 0;
> > @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work)
> >  		if (entry.flags & MF_SOFT_OFFLINE)
> >  			soft_offline_page(entry.pfn, entry.flags);
> >  		else
> > -			memory_failure(entry.pfn, entry.flags);
> > +			memory_failure(entry.pfn, entry.flags, NULL);
> >  	}
> >  }
> >  
>
Luck, Tony March 5, 2021, 3:55 p.m. UTC | #24
> From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?

If the walk wins the race and the pte for the poisoned page is still valid, then yes.

But we could have:

CPU1                            CPU2
memory_failure sets poison
bit for struct page


rmap finds page in task
on CPU2 and sets PTE
to not-valid-poison

                                memory_failure returns
                                early because struct page
                                already marked as poison

                                walk page tables looking
                                for mapping - don't find it

-Tony
Luck, Tony March 5, 2021, 10:11 p.m. UTC | #25
This whole page table walking patch is trying to work around the
races caused by multiple calls to memory_failure() for the same
page.

Maybe better to just avoid the races.  The comment right above
memory_failure says:

 * Must run in process context (e.g. a work queue) with interrupts
 * enabled and no spinlocks hold.

So it should be safe to grab and hold a mutex.  See patch below.

-Tony

commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9
Author: Tony Luck <tony.luck@intel.com>
Date:   Fri Mar 5 10:40:48 2021 -0800

    Use a mutex to avoid memory_failure() races

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto out2;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
+			mutex_unlock(&mf_mutex);
 			return res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 	}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 	if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
 	res = identify_page_state(pfn, p, page_flags);
 out:
 	unlock_page(p);
+out2:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
HORIGUCHI NAOYA(堀口 直也) March 8, 2021, 6:45 a.m. UTC | #26
On Fri, Mar 05, 2021 at 02:11:43PM -0800, Luck, Tony wrote:
> This whole page table walking patch is trying to work around the
> races caused by multiple calls to memory_failure() for the same
> page.
> 
> Maybe better to just avoid the races.  The comment right above
> memory_failure says:
> 
>  * Must run in process context (e.g. a work queue) with interrupts
>  * enabled and no spinlocks hold.
> 
> So it should be safe to grab and hold a mutex.  See patch below.

The mutex approach looks simpler and safer, so I'm fine with it.

> 
> -Tony
> 
> commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9
> Author: Tony Luck <tony.luck@intel.com>
> Date:   Fri Mar 5 10:40:48 2021 -0800
> 
>     Use a mutex to avoid memory_failure() races
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..c1509f4b565e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +static DEFINE_MUTEX(mf_mutex);
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>  		return -ENXIO;
>  	}
>  
> +	mutex_lock(&mf_mutex);

Is it better to take mutex before memory_failure_dev_pagemap() block?
Or we don't have to protect against race for device memory?

Thanks,
Naoya Horiguchi
Luck, Tony March 8, 2021, 6:54 p.m. UTC | #27
>> So it should be safe to grab and hold a mutex.  See patch below.
>
> The mutex approach looks simpler and safer, so I'm fine with it.

Thanks.  Is that an "Acked-by:"?

>>  /**
>>   * memory_failure - Handle memory failure of a page.
>>   * @pfn: Page Number of the corrupted page
>> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
>>  		return -ENXIO;
>>  	}
>>  
>> +	mutex_lock(&mf_mutex);
>
> Is it better to take mutex before memory_failure_dev_pagemap() block?
> Or we don't have to protect against race for device memory?

No races (recovery is only attempted for errors in normal memory).

-Tony
HORIGUCHI NAOYA(堀口 直也) March 8, 2021, 10:38 p.m. UTC | #28
On Mon, Mar 08, 2021 at 06:54:02PM +0000, Luck, Tony wrote:
> >> So it should be safe to grab and hold a mutex.  See patch below.
> >
> > The mutex approach looks simpler and safer, so I'm fine with it.
> 
> Thanks.  Is that an "Acked-by:"?

Not yet, I intended to add it after full patch is submitted
(with your Signed-off-by and commit log).

> 
> >>  /**
> >>   * memory_failure - Handle memory failure of a page.
> >>   * @pfn: Page Number of the corrupted page
> >> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
> >>  		return -ENXIO;
> >>  	}
> >>  
> >> +	mutex_lock(&mf_mutex);
> >
> > Is it better to take mutex before memory_failure_dev_pagemap() block?
> > Or we don't have to protect against race for device memory?
> 
> No races (recovery is only attempted for errors in normal memory).

OK, thanks.

- Naoya
yaoaili [么爱利] March 10, 2021, 6:10 a.m. UTC | #29
On Fri, 5 Mar 2021 15:55:25 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?  
> 
> If the walk wins the race and the pte for the poisoned page is still valid, then yes.
> 
> But we could have:
> 
> CPU1                            CPU2
> memory_failure sets poison
> bit for struct page
> 
> 
> rmap finds page in task
> on CPU2 and sets PTE
> to not-valid-poison
> 
>                                 memory_failure returns
>                                 early because struct page
>                                 already marked as poison
> 
>                                 walk page tables looking
>                                 for mapping - don't find it
> 
> -Tony

While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not
proper, I think following patch maybe one way.
I copy your abandon code, and make a little modification, and just now it pass
my simple test.

And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right.

Thanks!

From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
From: Aili Yao <yaoaili@kingsoft.com>
Date: Wed, 10 Mar 2021 13:59:18 +0800
Subject: [PATCH] x86/mce: fix invalid SIGBUS address

walk the current process pte and compare with the pfn;
1. only test for normal page and 2M hugetlb page;
2. 1G hugetlb and transparentHuge is not support currently;
3. May other fails is not recognized, This is a RFC version.

---
 arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..65d7ef7 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 	/* mce_clear_state will clear *final, save locally for use later */
 	*m = *final;
 }
+static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+
+	if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
+		goto find;
+	else if (pte_pfn(*pte) == pfn)
+		goto find;
+
+	return 0;
+find:
+	buff[0] = addr;
+	buff[1] = PAGE_SHIFT;
+	return true;
+}
+
+extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
+
+static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	int shift = PMD_SHIFT;
+	pte_t pte =  huge_ptep_get(ptep);
+
+	if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
+		goto find;
+
+	if (pte_pfn(*ptep) == pfn)
+		goto find;
+
+	return 0;
+find:
+	buff[0] = addr;
+	buff[1] = shift;
+	return true;
+}
+
+static struct mm_walk_ops walk = {
+	.pte_entry = mc_pte_entry,
+	.hugetlb_entry	= mc_hugetlb_range
+};
+
+void mc_memory_failure_error(struct task_struct *p, unsigned long pfn)
+{
+	u64 buff[2] = {pfn, 0};
+	struct page *page;
+	int ret = -1;
+
+	page = pfn_to_page(pfn);
+	if (!page)
+		goto force_sigbus;
+
+	if (is_zone_device_page(page))
+		goto force_sigbus;
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
+	mmap_read_unlock(p->mm);
+
+	if (ret && buff[0]) {
+		pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+		buff[0], p->comm, p->pid);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
+	} else {
+force_sigbus:
+		pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+		pfn, p->comm, p->pid);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT);
+	}
+
+}
 
 static void kill_me_now(struct callback_head *ch)
 {
@@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	}
 
 	if (p->mce_vaddr != (void __user *)-1l) {
-		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
+		mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
 	} else {
 		pr_err("Memory error not recovered");
 		kill_me_now(cb);
HORIGUCHI NAOYA(堀口 直也) March 11, 2021, 8:55 a.m. UTC | #30
On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote:
> On Fri, 5 Mar 2021 15:55:25 +0000
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?  
> > 
> > If the walk wins the race and the pte for the poisoned page is still valid, then yes.
> > 
> > But we could have:
> > 
> > CPU1                            CPU2
> > memory_failure sets poison
> > bit for struct page
> > 
> > 
> > rmap finds page in task
> > on CPU2 and sets PTE
> > to not-valid-poison
> > 
> >                                 memory_failure returns
> >                                 early because struct page
> >                                 already marked as poison
> > 
> >                                 walk page tables looking
> >                                 for mapping - don't find it
> > 
> > -Tony
> 
> While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not
> proper, I think following patch maybe one way.
> I copy your abandon code, and make a little modification, and just now it pass
> my simple test.
> 
> And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right.
> 
> Thanks!
> 
> From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
> From: Aili Yao <yaoaili@kingsoft.com>
> Date: Wed, 10 Mar 2021 13:59:18 +0800
> Subject: [PATCH] x86/mce: fix invalid SIGBUS address
> 
> walk the current process pte and compare with the pfn;
> 1. only test for normal page and 2M hugetlb page;
> 2. 1G hugetlb and transparentHuge is not support currently;
> 3. May other fails is not recognized, This is a RFC version.
> 
> ---
>  arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 80 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index db4afc5..65d7ef7 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -28,8 +28,12 @@
>  #include <linux/sysfs.h>
>  #include <linux/types.h>
>  #include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include <linux/init.h>
>  #include <linux/kmod.h>
> +#include <linux/pagewalk.h>
>  #include <linux/poll.h>
>  #include <linux/nmi.h>
>  #include <linux/cpu.h>

Maybe requiring many dependencies like this implies that you might be better
to do below in mm/memory-failure.c instead of in this file.

> @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>  	/* mce_clear_state will clear *final, save locally for use later */
>  	*m = *final;
>  }
> +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> +{
> +	u64 *buff = (u64 *)walk->private;
> +	u64 pfn = buff[0];
> +
> +	if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
> +		goto find;
> +	else if (pte_pfn(*pte) == pfn)
> +		goto find;
> +
> +	return 0;
> +find:
> +	buff[0] = addr;
> +	buff[1] = PAGE_SHIFT;
> +	return true;

Returning true means you stop walking when you find the first entry pointing
to a given pfn. But there could be multiple such entries, so if MCE SRAR is
triggered by memory access to the larger address in hwpoisoned entries, the
returned virtual address might be wrong.

> +}
> +
> +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
> +
> +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
> +				 unsigned long addr, unsigned long end,
> +				 struct mm_walk *walk)
> +{
> +	u64 *buff = (u64 *)walk->private;
> +	u64 pfn = buff[0];
> +	int shift = PMD_SHIFT;
> +	pte_t pte =  huge_ptep_get(ptep);
> +
> +	if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
> +		goto find;
> +
> +	if (pte_pfn(*ptep) == pfn)
> +		goto find;
> +
> +	return 0;
> +find:
> +	buff[0] = addr;
> +	buff[1] = shift;
> +	return true;
> +}
> +
> +static struct mm_walk_ops walk = {
> +	.pte_entry = mc_pte_entry,
> +	.hugetlb_entry	= mc_hugetlb_range
> +};
> +
> +void mc_memory_failure_error(struct task_struct *p, unsigned long pfn)
> +{
> +	u64 buff[2] = {pfn, 0};
> +	struct page *page;
> +	int ret = -1;
> +
> +	page = pfn_to_page(pfn);
> +	if (!page)
> +		goto force_sigbus;
> +
> +	if (is_zone_device_page(page))
> +		goto force_sigbus;
> +
> +	mmap_read_lock(p->mm);
> +	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
> +	mmap_read_unlock(p->mm);
> +
> +	if (ret && buff[0]) {
> +		pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> +		buff[0], p->comm, p->pid);
> +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
> +	} else {
> +force_sigbus:
> +		pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> +		pfn, p->comm, p->pid);
> +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT);
> +	}
> +
> +}
>  
>  static void kill_me_now(struct callback_head *ch)
>  {
> @@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb)
>  	}
>  
>  	if (p->mce_vaddr != (void __user *)-1l) {
> -		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> -			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
> -		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> +		mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);

I guess that p->mce_vaddr stores the virtual address of the error here.
If so, sending SIGBUS with the address looks enough as we do now, so why
do you walk page table to find the error virtual address?

Thanks,
Naoya Horiguchi
yaoaili [么爱利] March 11, 2021, 11:23 a.m. UTC | #31
On Thu, 11 Mar 2021 08:55:30 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote:
> > On Fri, 5 Mar 2021 15:55:25 +0000
> > "Luck, Tony" <tony.luck@intel.com> wrote:
> >   
> > > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?    
> > > 
> > > If the walk wins the race and the pte for the poisoned page is still valid, then yes.
> > > 
> > > But we could have:
> > > 
> > > CPU1                            CPU2
> > > memory_failure sets poison
> > > bit for struct page
> > > 
> > > 
> > > rmap finds page in task
> > > on CPU2 and sets PTE
> > > to not-valid-poison
> > > 
> > >                                 memory_failure returns
> > >                                 early because struct page
> > >                                 already marked as poison
> > > 
> > >                                 walk page tables looking
> > >                                 for mapping - don't find it
> > > 
> > > -Tony  
> > 
> > While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not
> > proper, I think following patch maybe one way.
> > I copy your abandon code, and make a little modification, and just now it pass
> > my simple test.
> > 
> > And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right.
> > 
> > Thanks!
> > 
> > From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001
> > From: Aili Yao <yaoaili@kingsoft.com>
> > Date: Wed, 10 Mar 2021 13:59:18 +0800
> > Subject: [PATCH] x86/mce: fix invalid SIGBUS address
> > 
> > walk the current process pte and compare with the pfn;
> > 1. only test for normal page and 2M hugetlb page;
> > 2. 1G hugetlb and transparentHuge is not support currently;
> > 3. May other fails is not recognized, This is a RFC version.
> > 
> > ---
> >  arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 80 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index db4afc5..65d7ef7 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -28,8 +28,12 @@
> >  #include <linux/sysfs.h>
> >  #include <linux/types.h>
> >  #include <linux/slab.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  #include <linux/init.h>
> >  #include <linux/kmod.h>
> > +#include <linux/pagewalk.h>
> >  #include <linux/poll.h>
> >  #include <linux/nmi.h>
> >  #include <linux/cpu.h>  
> 
> Maybe requiring many dependencies like this implies that you might be better
> to do below in mm/memory-failure.c instead of in this file.

Yes, agree, I will change this, Thanks!
 
> > @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> >  	/* mce_clear_state will clear *final, save locally for use later */
> >  	*m = *final;
> >  }
> > +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
> > +{
> > +	u64 *buff = (u64 *)walk->private;
> > +	u64 pfn = buff[0];
> > +
> > +	if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))
> > +		goto find;
> > +	else if (pte_pfn(*pte) == pfn)
> > +		goto find;
> > +
> > +	return 0;
> > +find:
> > +	buff[0] = addr;
> > +	buff[1] = PAGE_SHIFT;
> > +	return true;  
> 
> Returning true means you stop walking when you find the first entry pointing
> to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> triggered by memory access to the larger address in hwpoisoned entries, the
> returned virtual address might be wrong.

Yes, We need to consider multiple posion page entries, I will fix this. Thanks for
you sugguestion!


> > +}
> > +
> > +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte);
> > +
> > +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask,
> > +				 unsigned long addr, unsigned long end,
> > +				 struct mm_walk *walk)
> > +{
> > +	u64 *buff = (u64 *)walk->private;
> > +	u64 pfn = buff[0];
> > +	int shift = PMD_SHIFT;
> > +	pte_t pte =  huge_ptep_get(ptep);
> > +
> > +	if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
> > +		goto find;
> > +
> > +	if (pte_pfn(*ptep) == pfn)
> > +		goto find;
> > +
> > +	return 0;
> > +find:
> > +	buff[0] = addr;
> > +	buff[1] = shift;
> > +	return true;
> > +}
> > +
> > +static struct mm_walk_ops walk = {
> > +	.pte_entry = mc_pte_entry,
> > +	.hugetlb_entry	= mc_hugetlb_range
> > +};
> > +
> > +void mc_memory_failure_error(struct task_struct *p, unsigned long pfn)
> > +{
> > +	u64 buff[2] = {pfn, 0};
> > +	struct page *page;
> > +	int ret = -1;
> > +
> > +	page = pfn_to_page(pfn);
> > +	if (!page)
> > +		goto force_sigbus;
> > +
> > +	if (is_zone_device_page(page))
> > +		goto force_sigbus;
> > +
> > +	mmap_read_lock(p->mm);
> > +	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
> > +	mmap_read_unlock(p->mm);
> > +
> > +	if (ret && buff[0]) {
> > +		pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > +		buff[0], p->comm, p->pid);
> > +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
> > +	} else {
> > +force_sigbus:
> > +		pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > +		pfn, p->comm, p->pid);
> > +		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT);
> > +	}
> > +
> > +}
> >  
> >  static void kill_me_now(struct callback_head *ch)
> >  {
> > @@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb)
> >  	}
> >  
> >  	if (p->mce_vaddr != (void __user *)-1l) {
> > -		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > -			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
> > -		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> > +		mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);  
> 
> I guess that p->mce_vaddr stores the virtual address of the error here.
> If so, sending SIGBUS with the address looks enough as we do now, so why
> do you walk page table to find the error virtual address?

I check the code again, yes, I should have placed mc_memory_failure_error in else branch, but it seems p->mce_vaddr is not correctly
initialized and for my test, it has a zero value then code goes into if (p->mce_vaddr != (void __user *)-1l) branch; It seems this is
another issue needing to fix;

And from the p->mce_vaddr, possibly there is a better way to get vaddr, i will dig more about this.
Luck, Tony March 11, 2021, 5:05 p.m. UTC | #32
> I guess that p->mce_vaddr stores the virtual address of the error here.
> If so, sending SIGBUS with the address looks enough as we do now, so why
> do you walk page table to find the error virtual address?

p->mce_vaddr only has the virtual address for the COPYIN case. In that code
path we decode the kernel instruction that hit the fault in order to find the virtual
address. That's easy because:

1) The kernel RIP is known to be good (can't page fault etc. on kernel address).
2) There are only a half dozen instructions used by the kernel for get_user() or
     copy_from_user().

When the machine check happens during user execution accessing poison data
we only have the physical address (from MCi_ADDR).

-Tony
yaoaili [么爱利] March 12, 2021, 5:55 a.m. UTC | #33
On Thu, 11 Mar 2021 17:05:53 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > I guess that p->mce_vaddr stores the virtual address of the error here.
> > If so, sending SIGBUS with the address looks enough as we do now, so why
> > do you walk page table to find the error virtual address?  
> 
> p->mce_vaddr only has the virtual address for the COPYIN case. In that code
> path we decode the kernel instruction that hit the fault in order to find the virtual
> address. That's easy because:
> 
> 1) The kernel RIP is known to be good (can't page fault etc. on kernel address).
> 2) There are only a half dozen instructions used by the kernel for get_user() or
>      copy_from_user().
> 
> When the machine check happens during user execution accessing poison data
> we only have the physical address (from MCi_ADDR).
> 
> -Tony

Sorry to interrupt as I am really confused here:
If it's a copyin case, has the page been mapped for the current process?
will memory_failure() find it and unmap it? if succeed, then the current will be
signaled with correct vaddr and shift?

Maybe the mce_vaddr is set correctly, but we may lost the correct page shift?

And for copyin case, we don't need to call set_mce_nospec()?
Luck, Tony March 12, 2021, 4:29 p.m. UTC | #34
> Sorry to interrupt as I am really confused here:
> If it's a copyin case, has the page been mapped for the current process?

Yes. The kernel has the current process mapped to the low address range
and code like get_user(), put_user() "simply" reaches down to those addresses
(maybe not so simply, as the access needs to be within a STAC/CLAC section
of code on modern CPUs, and the access instruction needs an entry in EXTABLE
so that the page fault handler can fix it if there isn't a page mapped at the user
address).

> will memory_failure() find it and unmap it? if succeed, then the current will be
> signaled with correct vaddr and shift?

That's a very good question.  I didn't see a SIGBUS when I first wrote this code,
hence all the p->mce_vaddr.  But now I'm
a) not sure why there wasn't a signal
b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS

> Maybe the mce_vaddr is set correctly, but we may lost the correct page shift?
Yes. There is a hard-coded PAGE_SHIFT for this case - which may not match the actual page size.

> And for copyin case, we don't need to call set_mce_nospec()?
Yes. We should.

Thanks for your good questions.

-Tony
Luck, Tony March 12, 2021, 11:48 p.m. UTC | #35
>> will memory_failure() find it and unmap it? if succeed, then the current will be
>> signaled with correct vaddr and shift?
>
> That's a very good question.  I didn't see a SIGBUS when I first wrote this code,
> hence all the p->mce_vaddr.  But now I'm
> a) not sure why there wasn't a signal
> b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS

Tests on upstream kernel today show that memory_failure() is both unmapping the page
and sending a SIGBUS.


My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code
to mark the page not present while we are still in do_machine_check().  That's resulted
in recovery working for simple cases where there is a single get_user() call followed by
an error return if that failed. But more complex cases require more machine checks and
a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't).

Thanks to the decode of the instruction we do have the virtual address. So we just need
a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
of a "not-present" value. Maybe a different poison type from the one we get from
memory_failure() so that the #PF code can recognize this as a special case and do any
other work that we avoided because we were in #MC context.

-Tony
HORIGUCHI NAOYA(堀口 直也) March 16, 2021, 6:42 a.m. UTC | #36
On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> >> will memory_failure() find it and unmap it? if succeed, then the current will be
> >> signaled with correct vaddr and shift?
> >
> > That's a very good question.  I didn't see a SIGBUS when I first wrote this code,
> > hence all the p->mce_vaddr.  But now I'm
> > a) not sure why there wasn't a signal

We have a recent change around this behavior, which might be an answer for you.
See commit 30c9cf492704 ("mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events").

> > b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS
> 
> Tests on upstream kernel today show that memory_failure() is both unmapping the page
> and sending a SIGBUS.

Sorry if I misunderstood the exact problem, but if the point is to allow
user processes to find poisoned virtual address without SIGBUS, one possible
way is to expose hwpoison entry via /proc/pid/pagemap (attached a draft patch below).
With this patch, processes easily find hwpoison entries without any new interface.

> 
> My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code
> to mark the page not present while we are still in do_machine_check(). 

It sounds to me that even if we have such code, it just narrows down the race window
between multiple MCEs on different CPUs. Or does it completely prevent the race?
(Another thread could touch the poisoned page just before the first thread marks
the page non-present?)

> That's resulted
> in recovery working for simple cases where there is a single get_user() call followed by
> an error return if that failed. But more complex cases require more machine checks and
> a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't).
> 
> Thanks to the decode of the instruction we do have the virtual address. So we just need
> a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> of a "not-present" value. Maybe a different poison type from the one we get from
> memory_failure() so that the #PF code can recognize this as a special case and do any
> other work that we avoided because we were in #MC context.

As you answered in another email, p->mce_vaddr is set only on KERNEL_COPYIN case,
then if p->mce_vaddr is available also for generic SRAR MCE (I'm assuming that we are
talking about issues on race between generic SRAR MCE not only for KERNEL_COPYIN case),
that might be helpful, although it might be hard to implement.
And I'm afraid that walking page table could find the wrong virtual address if a process
have multiple entries to the single page. We could send multiple SIGBUSs for such case,
but maybe that's not an optimal solution.

Thanks,
Naoya Horiguchi

----
From 147449a97e2ea3420ac3523f13523f5d30a13614 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Tue, 16 Mar 2021 14:22:21 +0900
Subject: [PATCH] pagemap: expose hwpoison entry

not-signed-off-by-yet: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 fs/proc/task_mmu.c      |  6 ++++++
 include/linux/swapops.h | 12 ++++++++++++
 tools/vm/page-types.c   |  5 ++++-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 602e3a52884d..08cea209bae7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1300,6 +1300,7 @@ struct pagemapread {
 #define PM_PFRAME_MASK		GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
 #define PM_SOFT_DIRTY		BIT_ULL(55)
 #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
+#define PM_HWPOISON	       	BIT_ULL(60)
 #define PM_FILE			BIT_ULL(61)
 #define PM_SWAP			BIT_ULL(62)
 #define PM_PRESENT		BIT_ULL(63)
@@ -1385,6 +1386,11 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
 
+		if (is_hwpoison_entry(entry)) {
+			page = hwpoison_entry_to_page(entry);
+			flags |= PM_HWPOISON;
+		}
+
 		if (is_device_private_entry(entry))
 			page = device_private_entry_to_page(entry);
 	}
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..1b9dedbd06ab 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -323,6 +323,13 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	struct page *p = pfn_to_page(swp_offset(entry));
+	WARN_ON(!PageHWPoison(p));
+	return p;
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
@@ -345,6 +352,11 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 	return 0;
 }
 
+static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline void num_poisoned_pages_inc(void)
 {
 }
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 0517c744b04e..1160d5a14955 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -53,6 +53,7 @@
 #define PM_SWAP_OFFSET(x)	(((x) & PM_PFRAME_MASK) >> MAX_SWAPFILES_SHIFT)
 #define PM_SOFT_DIRTY		(1ULL << 55)
 #define PM_MMAP_EXCLUSIVE	(1ULL << 56)
+#define PM_HWPOISON		(1ULL << 60)
 #define PM_FILE			(1ULL << 61)
 #define PM_SWAP			(1ULL << 62)
 #define PM_PRESENT		(1ULL << 63)
@@ -311,6 +312,8 @@ static unsigned long pagemap_pfn(uint64_t val)
 
 	if (val & PM_PRESENT)
 		pfn = PM_PFRAME(val);
+	else if (val & PM_HWPOISON)
+		pfn = PM_SWAP_OFFSET(val);
 	else
 		pfn = 0;
 
@@ -742,7 +745,7 @@ static void walk_vma(unsigned long index, unsigned long count)
 			pfn = pagemap_pfn(buf[i]);
 			if (pfn)
 				walk_pfn(index + i, pfn, 1, buf[i]);
-			if (buf[i] & PM_SWAP)
+			else if (buf[i] & PM_SWAP)
 				walk_swap(index + i, buf[i]);
 		}
yaoaili [么爱利] March 16, 2021, 7:54 a.m. UTC | #37
> As you answered in another email, p->mce_vaddr is set only on KERNEL_COPYIN case,
> then if p->mce_vaddr is available also for generic SRAR MCE (I'm assuming that we are
> talking about issues on race between generic SRAR MCE not only for KERNEL_COPYIN case),
> that might be helpful, although it might be hard to implement.
> And I'm afraid that walking page table could find the wrong virtual address if a process
> have multiple entries to the single page. We could send multiple SIGBUSs for such case,
> but maybe that's not an optimal solution.

Yes, agree, I can't find one way to address this multiple entries case, to make sure we get
the exact correct virtual address.

But also I will post a v2 RFC patch for this issue for only discussion purpose!

Thanks!

> ----
> From 147449a97e2ea3420ac3523f13523f5d30a13614 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Tue, 16 Mar 2021 14:22:21 +0900
> Subject: [PATCH] pagemap: expose hwpoison entry
> 
> not-signed-off-by-yet: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  fs/proc/task_mmu.c      |  6 ++++++
>  include/linux/swapops.h | 12 ++++++++++++
>  tools/vm/page-types.c   |  5 ++++-
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..08cea209bae7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1300,6 +1300,7 @@ struct pagemapread {
>  #define PM_PFRAME_MASK		GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
>  #define PM_SOFT_DIRTY		BIT_ULL(55)
>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
> +#define PM_HWPOISON	       	BIT_ULL(60)
>  #define PM_FILE			BIT_ULL(61)
>  #define PM_SWAP			BIT_ULL(62)
>  #define PM_PRESENT		BIT_ULL(63)
> @@ -1385,6 +1386,11 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		if (is_migration_entry(entry))
>  			page = migration_entry_to_page(entry);
>  
> +		if (is_hwpoison_entry(entry)) {
> +			page = hwpoison_entry_to_page(entry);
> +			flags |= PM_HWPOISON;
> +		}
> +
>  		if (is_device_private_entry(entry))
>  			page = device_private_entry_to_page(entry);
>  	}
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index d9b7c9132c2f..1b9dedbd06ab 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -323,6 +323,13 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
>  	return swp_type(entry) == SWP_HWPOISON;
>  }
>  
> +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
> +{
> +	struct page *p = pfn_to_page(swp_offset(entry));
> +	WARN_ON(!PageHWPoison(p));
> +	return p;
> +}
> +
>  static inline void num_poisoned_pages_inc(void)
>  {
>  	atomic_long_inc(&num_poisoned_pages);
> @@ -345,6 +352,11 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
>  	return 0;
>  }
>  
> +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry)
> +{
> +	return NULL;
> +}
> +
>  static inline void num_poisoned_pages_inc(void)
>  {
>  }
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 0517c744b04e..1160d5a14955 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -53,6 +53,7 @@
>  #define PM_SWAP_OFFSET(x)	(((x) & PM_PFRAME_MASK) >> MAX_SWAPFILES_SHIFT)
>  #define PM_SOFT_DIRTY		(1ULL << 55)
>  #define PM_MMAP_EXCLUSIVE	(1ULL << 56)
> +#define PM_HWPOISON		(1ULL << 60)
>  #define PM_FILE			(1ULL << 61)
>  #define PM_SWAP			(1ULL << 62)
>  #define PM_PRESENT		(1ULL << 63)
> @@ -311,6 +312,8 @@ static unsigned long pagemap_pfn(uint64_t val)
>  
>  	if (val & PM_PRESENT)
>  		pfn = PM_PFRAME(val);
> +	else if (val & PM_HWPOISON)
> +		pfn = PM_SWAP_OFFSET(val);
>  	else
>  		pfn = 0;
>  
> @@ -742,7 +745,7 @@ static void walk_vma(unsigned long index, unsigned long count)
>  			pfn = pagemap_pfn(buf[i]);
>  			if (pfn)
>  				walk_pfn(index + i, pfn, 1, buf[i]);
> -			if (buf[i] & PM_SWAP)
> +			else if (buf[i] & PM_SWAP)
>  				walk_swap(index + i, buf[i]);
>  		}
>
Luck, Tony March 17, 2021, 12:29 a.m. UTC | #38
On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> Thanks to the decode of the instruction we do have the virtual address. So we just need
> a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> of a "not-present" value. Maybe a different poison type from the one we get from
> memory_failure() so that the #PF code can recognize this as a special case and do any
> other work that we avoided because we were in #MC context.

There is no such thing as the safe walk I describe above. In a multi
threaded application other threads might munmap(2) bits of the address
space which could free some of the p4d->pud->pmd->pte tables.

But the pgd table is guaranteed to exist as long as any of the threads
in the process exist. Which gave rise to a new approach (partial credit
to Dave Hansen who helped me understand why a more extreme patch that
I wanted to use wasn't working ... and he pointed out the pgd persistence).

N.B. The code below DOES NOT WORK. My test application tried to do a write(2)
syscall with some poison in the buffer. Goal is that write should return a
short count (with only the bytes from the buffer leading up to the poison
written to the file). Currently the process gets a suprise SIGSEGV in
some glibc code :-(

The code would also need a bunch more work to handle the fact that
in a multi-threaded application multiple threads might consume the
poison, and some innocent threads not consuming the poison may also end
up drawn into the melee in the page fault handler.


The big picture.
---------------

This approach passes the buck for the recovery from the #MC handler
(which has very limited options to do anything useful) to the #PF
handler (which is a much friendlier environment where locks and mutexes
can be obtained as needed).

The mechanism for this buck passing is for the #MC handler to set a
reserved bit in the PGD entry for the user address where the machine
check happened, flush the TLB for that address, and then return from
the #MC handler to the kernel get_user()/copy_from_user() code which
will re-execute the instruction to access the poison address, but this
time take a #PF because of the reserved bit in the PGD.

There's a couple of changes bundled in here to help my debugging:
1) Turn off UCNA calls to memory_failure() ... just avoids races
   and make tests more determinstic for now
2) Disable "fast string" support ... avoid "REP MOVS" copy routines
   since I'm testing on an old system that treats poison consumed
   by REP MOVS as fatal.

Posted here for general comments on the approach.  On the plus
side it avoids taking multiple machine checks in the kernel when
it is accessing user data. I think it can also meet the goal set
by Andy Lutomirski of avoiding SIGBUS from syscalls that happen
to touch user poison. They should see either short counts for
calls like write(2) that may partially succeed or -EFAULT if
the system call totally failed.

-Tony

---

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f24d7ef8fffa..e533eaf20834 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,6 +23,7 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
+#define _PAGE_BIT_RESERVED1	51	/* Reserved bit */
 #define _PAGE_BIT_SOFTW4	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
@@ -56,6 +57,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_RESERVED1	(_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1)
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 #define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
 #define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7f7200021bd1..269e8ee04c11 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message,
 				      unsigned long fault_address);
 #endif
 
+static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr)
+{
+	pgd_t	pgd;
+
+	pgdp += pgd_index(addr);
+	pgd = *pgdp;
+	pgd.pgd |= _PAGE_RESERVED1;
+	WRITE_ONCE(*pgdp, pgd);
+}
+
+static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr)
+{
+	pgd_t	pgd;
+
+	pgdp += pgd_index(addr);
+	pgd = *pgdp;
+	pgd.pgd &= ~_PAGE_RESERVED1;
+	WRITE_ONCE(*pgdp, pgd);
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0e422a544835..974e1eb5d1aa 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -287,6 +287,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	 */
 	if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) {
 		rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
+misc_enable = 0;
 		if (!(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING)) {
 			pr_info("Disabled fast string operations\n");
 			setup_clear_cpu_cap(X86_FEATURE_REP_GOOD);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7962355436da..41bedc961928 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -635,6 +635,7 @@ static struct notifier_block early_nb = {
 static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 			      void *data)
 {
+#if 0
 	struct mce *mce = (struct mce *)data;
 	unsigned long pfn;
 
@@ -650,7 +651,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
 		set_mce_nospec(pfn, whole_page(mce));
 		mce->kflags |= MCE_HANDLED_UC;
 	}
-
+#endif
 	return NOTIFY_OK;
 }
 
@@ -1443,8 +1444,18 @@ noinstr void do_machine_check(struct pt_regs *regs)
 				mce_panic("Failed kernel mode recovery", &m, msg);
 		}
 
-		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+		/*
+		 * Machine check while copying from user space. Note that we
+		 * do not call fixup_exception(). Instead we ensure a page fault
+		 * when we return to the faulting instruction.
+		 * Let the page fault handler do the rest of the
+		 * recovery.
+		 */
+		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
+			flush_tlb_one_user((unsigned long)current->mce_vaddr);
+			pgd_set_reserved(current->mm->pgd, (unsigned long)current->mce_vaddr);
+			current->mce_addr = m.addr;
+		}
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 83df991314c5..da917945150d 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -282,7 +282,6 @@ static int error_context(struct mce *m, struct pt_regs *regs)
 		return IN_KERNEL_RECOV;
 	}
 	if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
-		m->kflags |= MCE_IN_KERNEL_RECOV;
 		m->kflags |= MCE_IN_KERNEL_COPYIN;
 		return IN_KERNEL_RECOV;
 	}
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 77b9b2a3b5c8..e0e71ca023ce 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
  */
 SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
 	movl %edx,%ecx
-	cmp $X86_TRAP_MC,%eax		/* check if X86_TRAP_MC */
-	je 3f
 1:	rep movsb
 2:	mov %ecx,%eax
 	ASM_CLAC
 	ret
 
-	/*
-	 * Return zero to pretend that this copy succeeded. This
-	 * is counter-intuitive, but needed to prevent the code
-	 * in lib/iov_iter.c from retrying and running back into
-	 * the poison cache line again. The machine check handler
-	 * will ensure that a SIGBUS is sent to the task.
-	 */
-3:	xorl %eax,%eax
-	ASM_CLAC
-	ret
-
 	_ASM_EXTABLE_CPY(1b, 2b)
 SYM_CODE_END(.Lcopy_user_handle_tail)
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a73347e2cdfc..49232264e893 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1245,9 +1245,13 @@ void do_user_addr_fault(struct pt_regs *regs,
 	/*
 	 * Reserved bits are never expected to be set on
 	 * entries in the user portion of the page tables.
+	 * Except when machine check handling of a copy from
+	 * user passed the buck to #PF.
 	 */
-	if (unlikely(error_code & X86_PF_RSVD))
-		pgtable_bad(regs, error_code, address);
+	if (unlikely(error_code & X86_PF_RSVD)) {
+		if (!current->mce_vaddr)
+			pgtable_bad(regs, error_code, address);
+	}
 
 	/*
 	 * If SMAP is on, check for invalid kernel (supervisor) access to user
@@ -1291,6 +1295,15 @@ void do_user_addr_fault(struct pt_regs *regs,
 			local_irq_enable();
 	}
 
+	if (current->mce_vaddr) {
+		pgd_clr_reserved(current->mm->pgd,
+				 (unsigned long)current->mce_vaddr);
+		memory_failure(current->mce_addr >> PAGE_SHIFT, 0);
+		current->mce_vaddr = 0;
+		error_code &= ~X86_PF_RSVD;
+		pr_info("#PF: maybe fixed? Try to continue\n");
+	}
+
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
 	if (error_code & X86_PF_WRITE)
yaoaili [么爱利] March 17, 2021, 7:48 a.m. UTC | #39
> 
> Returning true means you stop walking when you find the first entry pointing
> to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> triggered by memory access to the larger address in hwpoisoned entries, the
> returned virtual address might be wrong.
> 

I can't find the way to fix this, maybe the virtual address is contained in
related register, but this is really beyong my knowledge.

This is a v2 RFC patch, add support for thp and 1G huge page errors.

Thanks
Aili Yao

From 31b685609610b3b06c8fd98d866913dbfeb7e159 Mon Sep 17 00:00:00 2001
From: Aili Yao <yaoaili@kingsoft.com>
Date: Wed, 17 Mar 2021 15:34:15 +0800
Subject: [PATCH] fix invalid SIGBUS address for recovery fail

Walk the current process pages and compare with the pfn, then get the
user address and related page_shift.

For thp pages, we can only split anonoums thp page, so I think there may
be no race condition for walking and searching the thp error page for such
case; For non anonymous thp, the page flag and pte will not be change. so
when code goes into this place, it may be race condition for non-anonoums
thp page or from a recovery fail for anonoums thp, the page status will
not change, i am not so sure about this;

For the case we don't find the related virtual address, Maybe sending one
BUS_MCEERR_AR signal with invalid address NULL is a better option, but i am
not sure.

And this may get the wrong virtual address if process have multiple entry
for a same page, I don't find a way to get it correct.

Maybe other issues is not recognized.
---
 arch/x86/kernel/cpu/mce/core.c |  16 ++---
 include/linux/mm.h             |   1 +
 mm/memory-failure.c            | 145 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 152 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..1bf21cc 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -28,8 +28,12 @@
 #include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/init.h>
 #include <linux/kmod.h>
+#include <linux/pagewalk.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/cpu.h>
@@ -1246,7 +1250,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+	pr_err("Uncorrected hardware memory error in user-access at %llx, %llx", p->mce_addr, p->mce_vaddr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
@@ -1258,14 +1262,8 @@ static void kill_me_maybe(struct callback_head *cb)
 		return;
 	}
 
-	if (p->mce_vaddr != (void __user *)-1l) {
-		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
-	} else {
-		pr_err("Memory error not recovered");
-		kill_me_now(cb);
-	}
+	memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
+
 }
 
 static void queue_task_work(struct mce *m, int kill_current_task)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..cff2f02 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3046,6 +3046,7 @@ enum mf_flags {
 	MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
+extern void memory_failure_error(struct task_struct *p, unsigned long pfn);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06f0061..aaf99a7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,8 +56,10 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
+#include <linux/delay.h>
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
 
@@ -240,7 +242,7 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
 	int ret = 0;
 
 	pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			pfn, t->comm, t->pid);
+			tk->addr, t->comm, t->pid);
 
 	if (flags & MF_ACTION_REQUIRED) {
 		WARN_ON_ONCE(t != current);
@@ -1417,6 +1419,7 @@ int memory_failure(unsigned long pfn, int flags)
 try_again:
 	if (PageHuge(p))
 		return memory_failure_hugetlb(pfn, flags);
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
@@ -1553,6 +1556,146 @@ int memory_failure(unsigned long pfn, int flags)
 }
 EXPORT_SYMBOL_GPL(memory_failure);
 
+static int pte_range(pte_t *ptep, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte = *ptep;
+
+	if (!pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+		
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) {
+			printk("%s, %d \n", __FUNCTION__, __LINE__);
+			goto find;
+		}
+	} else if (pte_pfn(pte) == pfn) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		goto find;
+	}
+
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = PAGE_SHIFT;
+	return true;
+}
+
+static int pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	struct page *page = (struct page *)buff[0];
+	u64 pfn = buff[1];
+	pmd_t pmd = *pmdp;
+
+	if (likely(!pmd_trans_huge(pmd)))
+		return 0;
+
+	if (pmd_none(pmd) || !pmd_present(pmd))
+		return 0;
+
+	if (pmd_page(pmd) != page)
+		return 0;
+
+	for (; addr != end; page++, addr += PAGE_SIZE) {
+		if (page_to_pfn(page) == pfn) {
+			printk("%s, %d \n", __FUNCTION__, __LINE__);
+			buff[0] = addr;
+			buff[1] = PAGE_SHIFT;
+			return true;
+		}
+	}
+
+	return 0;
+}
+
+static int hugetlb_range(pte_t *ptep, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte =  huge_ptep_get(ptep);
+	struct page *page = pfn_to_page(pfn);
+
+	if (!huge_pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) {
+			printk("%s, %d \n", __FUNCTION__, __LINE__);
+			goto find;
+		}
+	}
+	if (pte_pfn(pte) == pfn) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		goto find;
+	}
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = (huge_page_size(page_hstate(page)) > PMD_SIZE) ? PUD_SHIFT : PMD_SHIFT;
+	return true;
+}
+
+void memory_failure_error(struct task_struct *p, unsigned long pfn)
+{
+	u64 buff[2] = {0};
+	struct page *page;
+	int ret = -1;
+	struct mm_walk_ops walk = {0};
+
+	if (p->mce_vaddr != (void __user *)-1l && p->mce_vaddr != (void __user *)0) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
+		return;
+	}
+
+	page = pfn_to_page(pfn);
+	if (!page)
+		goto force_sigbus;
+
+	if (is_zone_device_page(page))
+		goto force_sigbus;
+
+	page = compound_head(page);
+
+	if (PageHuge(page)) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		walk.hugetlb_entry = hugetlb_range;
+		buff[0] = page_to_pfn(page);
+	} else if (PageTransHuge(page)) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		walk.pmd_entry = pmd_range;
+		buff[0] = (u64)page;
+		buff[1] = pfn;
+	} else {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		walk.pte_entry = pte_range;
+		buff[0] = pfn;
+	}
+	msleep(1000);
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
+	mmap_read_unlock(p->mm);
+
+	pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+	pfn, p->comm, p->pid);
+
+	if (ret) {
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
+	} else {
+force_sigbus:
+		printk("%s, %d \n", __FUNCTION__, __LINE__);
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)0, PAGE_SHIFT);
+	}
+}
+EXPORT_SYMBOL_GPL(memory_failure_error);
+
 #define MEMORY_FAILURE_FIFO_ORDER	4
 #define MEMORY_FAILURE_FIFO_SIZE	(1 << MEMORY_FAILURE_FIFO_ORDER)
yaoaili [么爱利] March 17, 2021, 8:23 a.m. UTC | #40
> > 
> > Returning true means you stop walking when you find the first entry pointing
> > to a given pfn. But there could be multiple such entries, so if MCE SRAR is
> > triggered by memory access to the larger address in hwpoisoned entries, the
> > returned virtual address might be wrong.
> >   
> 
> I can't find the way to fix this, maybe the virtual address is contained in
> related register, but this is really beyong my knowledge.
> 
> This is a v2 RFC patch, add support for thp and 1G huge page errors.
> 

Sorry for the debug info and other unclean modifications.

Post a clean one.

Thanks
Aili Yao

From 2289276ba943cdcddbf3b5b2cdbcaff78690e2e8 Mon Sep 17 00:00:00 2001
From: Aili Yao <yaoaili@kingsoft.com>
Date: Wed, 17 Mar 2021 16:12:41 +0800
Subject: [PATCH] fix invalid SIGBUS address for recovery fail

Walk the current process pages and compare with the pfn, then get the
user address and related page_shift.

For thp pages, we can only split anonoums thp page, so I think there may
be no race condition for walking and searching the thp error page for such
case; For non anonymous thp, the page flag and pte will not be change. so
when code goes into this place, it may be race condition for non-anonoums
thp page or from a recovery fail for anonoums thp, the page status will
not change, i am not so sure about this;

For the case we don't find the related virtual address, Maybe sending one
BUS_MCEERR_AR signal with invalid address NULL is a better option, but i am
not sure.

And this may get the wrong virtual address if process have multiple entry
for a same page, I don't find a way to get it correct.

Maybe other issues is not recognized.
---
 arch/x86/kernel/cpu/mce/core.c |  12 +---
 include/linux/mm.h             |   1 +
 mm/memory-failure.c            | 127 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 131 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index db4afc5..4cb873c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1246,7 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+	pr_err("Uncorrected hardware memory error in user-access at %llx\n", p->mce_addr);
 
 	if (!p->mce_ripv)
 		flags |= MF_MUST_KILL;
@@ -1258,14 +1258,8 @@ static void kill_me_maybe(struct callback_head *cb)
 		return;
 	}
 
-	if (p->mce_vaddr != (void __user *)-1l) {
-		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
-		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
-	} else {
-		pr_err("Memory error not recovered");
-		kill_me_now(cb);
-	}
+	memory_failure_error(current, p->mce_addr >> PAGE_SHIFT);
+
 }
 
 static void queue_task_work(struct mce *m, int kill_current_task)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..cff2f02 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3046,6 +3046,7 @@ enum mf_flags {
 	MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
+extern void memory_failure_error(struct task_struct *p, unsigned long pfn);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06f0061..359b42f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/pagewalk.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -1553,6 +1554,132 @@ int memory_failure(unsigned long pfn, int flags)
 }
 EXPORT_SYMBOL_GPL(memory_failure);
 
+static int pte_range(pte_t *ptep, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte = *ptep;
+
+	if (!pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn)
+			goto find;
+	} else if (pte_pfn(pte) == pfn) {
+		goto find;
+	}
+
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = PAGE_SHIFT;
+	return true;
+}
+
+static int pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	struct page *page = (struct page *)buff[0];
+	u64 pfn = buff[1];
+	pmd_t pmd = *pmdp;
+
+	if (likely(!pmd_trans_huge(pmd)))
+		return 0;
+
+	if (pmd_none(pmd) || !pmd_present(pmd))
+		return 0;
+
+	if (pmd_page(pmd) != page)
+		return 0;
+
+	for (; addr != end; page++, addr += PAGE_SIZE) {
+		if (page_to_pfn(page) == pfn) {
+			buff[0] = addr;
+			buff[1] = PAGE_SHIFT;
+			return true;
+		}
+	}
+
+	return 0;
+}
+
+static int hugetlb_range(pte_t *ptep, unsigned long hmask,
+				 unsigned long addr, unsigned long end,
+				 struct mm_walk *walk)
+{
+	u64 *buff = (u64 *)walk->private;
+	u64 pfn = buff[0];
+	pte_t pte =  huge_ptep_get(ptep);
+	struct page *page = pfn_to_page(pfn);
+
+	if (!huge_pte_none(pte) && !pte_present(pte)) {
+		swp_entry_t swp_temp = pte_to_swp_entry(pte);
+
+		if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn)
+			goto find;
+	}
+	if (pte_pfn(pte) == pfn)
+		goto find;
+
+	return 0;
+
+find:
+	buff[0] = addr;
+	buff[1] = (huge_page_size(page_hstate(page)) > PMD_SIZE) ? PUD_SHIFT : PMD_SHIFT;
+	return true;
+}
+
+void memory_failure_error(struct task_struct *p, unsigned long pfn)
+{
+	u64 buff[2] = {0};
+	struct page *page;
+	int ret = -1;
+	struct mm_walk_ops walk = {0};
+
+	if (p->mce_vaddr != (void __user *)-1l && p->mce_vaddr != (void __user *)0) {
+		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
+		return;
+	}
+
+	page = pfn_to_page(pfn);
+	if (!page)
+		goto force_sigbus;
+
+	if (is_zone_device_page(page))
+		goto force_sigbus;
+
+	page = compound_head(page);
+
+	if (PageHuge(page)) {
+		walk.hugetlb_entry = hugetlb_range;
+		buff[0] = page_to_pfn(page);
+	} else if (PageTransHuge(page)) {
+		walk.pmd_entry = pmd_range;
+		buff[0] = (u64)page;
+		buff[1] = pfn;
+	} else {
+		walk.pte_entry = pte_range;
+		buff[0] = pfn;
+	}
+
+	mmap_read_lock(p->mm);
+	ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff);
+	mmap_read_unlock(p->mm);
+
+	pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+	pfn, p->comm, p->pid);
+
+	if (ret) {
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]);
+	} else {
+force_sigbus:
+		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)0, PAGE_SHIFT);
+	}
+}
+EXPORT_SYMBOL_GPL(memory_failure_error);
+
 #define MEMORY_FAILURE_FIFO_ORDER	4
 #define MEMORY_FAILURE_FIFO_SIZE	(1 << MEMORY_FAILURE_FIFO_ORDER)
yaoaili [么爱利] March 17, 2021, 9:07 a.m. UTC | #41
On Tue, 16 Mar 2021 17:29:39 -0700
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> > Thanks to the decode of the instruction we do have the virtual address. So we just need
> > a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> > of a "not-present" value. Maybe a different poison type from the one we get from
> > memory_failure() so that the #PF code can recognize this as a special case and do any
> > other work that we avoided because we were in #MC context.  
> 
> There is no such thing as the safe walk I describe above. In a multi
> threaded application other threads might munmap(2) bits of the address
> space which could free some of the p4d->pud->pmd->pte tables.
> 
> But the pgd table is guaranteed to exist as long as any of the threads
> in the process exist. Which gave rise to a new approach (partial credit
> to Dave Hansen who helped me understand why a more extreme patch that
> I wanted to use wasn't working ... and he pointed out the pgd persistence).
> 
> N.B. The code below DOES NOT WORK. My test application tried to do a write(2)
> syscall with some poison in the buffer. Goal is that write should return a
> short count (with only the bytes from the buffer leading up to the poison
> written to the file). Currently the process gets a suprise SIGSEGV in
> some glibc code :-(
> 
> The code would also need a bunch more work to handle the fact that
> in a multi-threaded application multiple threads might consume the
> poison, and some innocent threads not consuming the poison may also end
> up drawn into the melee in the page fault handler.
> 
> 
> The big picture.
> ---------------
> 
> This approach passes the buck for the recovery from the #MC handler
> (which has very limited options to do anything useful) to the #PF
> handler (which is a much friendlier environment where locks and mutexes
> can be obtained as needed).
> 
> The mechanism for this buck passing is for the #MC handler to set a
> reserved bit in the PGD entry for the user address where the machine
> check happened, flush the TLB for that address, and then return from
> the #MC handler to the kernel get_user()/copy_from_user() code which
> will re-execute the instruction to access the poison address, but this
> time take a #PF because of the reserved bit in the PGD.
> 
> There's a couple of changes bundled in here to help my debugging:
> 1) Turn off UCNA calls to memory_failure() ... just avoids races
>    and make tests more determinstic for now
> 2) Disable "fast string" support ... avoid "REP MOVS" copy routines
>    since I'm testing on an old system that treats poison consumed
>    by REP MOVS as fatal.
> 
> Posted here for general comments on the approach.  On the plus
> side it avoids taking multiple machine checks in the kernel when
> it is accessing user data. I think it can also meet the goal set
> by Andy Lutomirski of avoiding SIGBUS from syscalls that happen
> to touch user poison. They should see either short counts for
> calls like write(2) that may partially succeed or -EFAULT if
> the system call totally failed.
> 

I have another view here, but maybe wrong, post here for discussion:
When the process is signaled SIGBUS with BUS_MCEERR_AR, i think it should be from
a read, with the data read, the process can proceed, like process the data, or make decision.
When for poison, the data can't be returned, the process don't know what to do, and kill
is the first option.

while for a copyin case, the read is excuted by kernel, and for poison kernel will refuse to
excute current read and further operation. For the process, it seems it have a change to proceed.

if just error code is returned, the process may care or not, it may not correctly process the error.
It seems the worst case here is the process will touch the poison page again, trigger another MCE and
accordingly be killed.

It's not that bad?

Thanks
Aili Yao

> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index f24d7ef8fffa..e533eaf20834 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -23,6 +23,7 @@
>  #define _PAGE_BIT_SOFTW2	10	/* " */
>  #define _PAGE_BIT_SOFTW3	11	/* " */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> +#define _PAGE_BIT_RESERVED1	51	/* Reserved bit */
>  #define _PAGE_BIT_SOFTW4	58	/* available for programmer */
>  #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
>  #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
> @@ -56,6 +57,7 @@
>  #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
>  #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
>  #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
> +#define _PAGE_RESERVED1	(_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1)
>  #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  #define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
>  #define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 7f7200021bd1..269e8ee04c11 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message,
>  				      unsigned long fault_address);
>  #endif
>  
> +static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr)
> +{
> +	pgd_t	pgd;
> +
> +	pgdp += pgd_index(addr);
> +	pgd = *pgdp;
> +	pgd.pgd |= _PAGE_RESERVED1;
> +	WRITE_ONCE(*pgdp, pgd);
> +}
> +
> +static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr)
> +{
> +	pgd_t	pgd;
> +
> +	pgdp += pgd_index(addr);
> +	pgd = *pgdp;
> +	pgd.pgd &= ~_PAGE_RESERVED1;
> +	WRITE_ONCE(*pgdp, pgd);
> +}
> +
>  #endif /* _ASM_X86_TRAPS_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 0e422a544835..974e1eb5d1aa 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -287,6 +287,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>  	 */
>  	if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) {
>  		rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
> +misc_enable = 0;
>  		if (!(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING)) {
>  			pr_info("Disabled fast string operations\n");
>  			setup_clear_cpu_cap(X86_FEATURE_REP_GOOD);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7962355436da..41bedc961928 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -635,6 +635,7 @@ static struct notifier_block early_nb = {
>  static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  			      void *data)
>  {
> +#if 0
>  	struct mce *mce = (struct mce *)data;
>  	unsigned long pfn;
>  
> @@ -650,7 +651,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
>  		set_mce_nospec(pfn, whole_page(mce));
>  		mce->kflags |= MCE_HANDLED_UC;
>  	}
> -
> +#endif
>  	return NOTIFY_OK;
>  }
>  
> @@ -1443,8 +1444,18 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  				mce_panic("Failed kernel mode recovery", &m, msg);
>  		}
>  
> -		if (m.kflags & MCE_IN_KERNEL_COPYIN)
> -			queue_task_work(&m, kill_current_task);
> +		/*
> +		 * Machine check while copying from user space. Note that we
> +		 * do not call fixup_exception(). Instead we ensure a page fault
> +		 * when we return to the faulting instruction.
> +		 * Let the page fault handler do the rest of the
> +		 * recovery.
> +		 */
> +		if (m.kflags & MCE_IN_KERNEL_COPYIN) {
> +			flush_tlb_one_user((unsigned long)current->mce_vaddr);
> +			pgd_set_reserved(current->mm->pgd, (unsigned long)current->mce_vaddr);
> +			current->mce_addr = m.addr;
> +		}
>  	}
>  out:
>  	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index 83df991314c5..da917945150d 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -282,7 +282,6 @@ static int error_context(struct mce *m, struct pt_regs *regs)
>  		return IN_KERNEL_RECOV;
>  	}
>  	if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) {
> -		m->kflags |= MCE_IN_KERNEL_RECOV;
>  		m->kflags |= MCE_IN_KERNEL_COPYIN;
>  		return IN_KERNEL_RECOV;
>  	}
> diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> index 77b9b2a3b5c8..e0e71ca023ce 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
>   */
>  SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
>  	movl %edx,%ecx
> -	cmp $X86_TRAP_MC,%eax		/* check if X86_TRAP_MC */
> -	je 3f
>  1:	rep movsb
>  2:	mov %ecx,%eax
>  	ASM_CLAC
>  	ret
>  
> -	/*
> -	 * Return zero to pretend that this copy succeeded. This
> -	 * is counter-intuitive, but needed to prevent the code
> -	 * in lib/iov_iter.c from retrying and running back into
> -	 * the poison cache line again. The machine check handler
> -	 * will ensure that a SIGBUS is sent to the task.
> -	 */
> -3:	xorl %eax,%eax
> -	ASM_CLAC
> -	ret
> -
>  	_ASM_EXTABLE_CPY(1b, 2b)
>  SYM_CODE_END(.Lcopy_user_handle_tail)
>  
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index a73347e2cdfc..49232264e893 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1245,9 +1245,13 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	/*
>  	 * Reserved bits are never expected to be set on
>  	 * entries in the user portion of the page tables.
> +	 * Except when machine check handling of a copy from
> +	 * user passed the buck to #PF.
>  	 */
> -	if (unlikely(error_code & X86_PF_RSVD))
> -		pgtable_bad(regs, error_code, address);
> +	if (unlikely(error_code & X86_PF_RSVD)) {
> +		if (!current->mce_vaddr)
> +			pgtable_bad(regs, error_code, address);
> +	}
>  
>  	/*
>  	 * If SMAP is on, check for invalid kernel (supervisor) access to user
> @@ -1291,6 +1295,15 @@ void do_user_addr_fault(struct pt_regs *regs,
>  			local_irq_enable();
>  	}
>  
> +	if (current->mce_vaddr) {
> +		pgd_clr_reserved(current->mm->pgd,
> +				 (unsigned long)current->mce_vaddr);
> +		memory_failure(current->mce_addr >> PAGE_SHIFT, 0);
> +		current->mce_vaddr = 0;
> +		error_code &= ~X86_PF_RSVD;
> +		pr_info("#PF: maybe fixed? Try to continue\n");
> +	}
> +
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  
>  	if (error_code & X86_PF_WRITE)
yaoaili [么爱利] March 31, 2021, 10:56 a.m. UTC | #42
On Thu, 25 Feb 2021 12:39:30 +0100
Oscar Salvador <osalvador@suse.de> wrote:

> On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hi Aili,
> > 
> > I agree that this set_mce_nospec() is not expected to be called for
> > "already hwpoisoned" page because in the reported case the error
> > page is already contained and no need to resort changing cache mode.  
> 
> Out of curiosity, what is the current behavour now?
> Say we have an ongoing MCE which has marked the page as HWPoison but
> memory_failure did not take any action on the page yet.
> And then, we have another MCE, which ends up there.
> set_mce_nospec might clear _PAGE_PRESENT bit.
> 
> Does that have any impact on the first MCE?
> 
> > It seems to me that memory_failure() does not return MF_XXX.  But yes,
> > returning some positive value for the reported case could be a solution.  
> 
> No, you are right. I somehow managed to confuse myself.
> I see now that MF_XXX return codes are filtered out in page_action.
> 
> > We could use some negative value (error code) to report the reported case,
> > then as you mentioned above, some callers need change to handle the
> > new case, and the same is true if you use some positive value.
> > My preference is -EHWPOISON, but other options are fine if justified well.  
> 
> -EHWPOISON seems like a good fit.
> 

Hi Oscar, david:

Long away fron this topic, but i noticed today I made a stupid mistake that EHWPOISON is already
been declared, so we should better return EHWPOISON for this case.

Really sorry for this!

As the patch is still under review, I will post a new version for this, if I change this, may I add
your review tag here please?
David Hildenbrand March 31, 2021, 10:58 a.m. UTC | #43
On 31.03.21 12:56, Aili Yao wrote:
> On Thu, 25 Feb 2021 12:39:30 +0100
> Oscar Salvador <osalvador@suse.de> wrote:
> 
>> On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> Hi Aili,
>>>
>>> I agree that this set_mce_nospec() is not expected to be called for
>>> "already hwpoisoned" page because in the reported case the error
>>> page is already contained and no need to resort changing cache mode.
>>
>> Out of curiosity, what is the current behavour now?
>> Say we have an ongoing MCE which has marked the page as HWPoison but
>> memory_failure did not take any action on the page yet.
>> And then, we have another MCE, which ends up there.
>> set_mce_nospec might clear _PAGE_PRESENT bit.
>>
>> Does that have any impact on the first MCE?
>>
>>> It seems to me that memory_failure() does not return MF_XXX.  But yes,
>>> returning some positive value for the reported case could be a solution.
>>
>> No, you are right. I somehow managed to confuse myself.
>> I see now that MF_XXX return codes are filtered out in page_action.
>>
>>> We could use some negative value (error code) to report the reported case,
>>> then as you mentioned above, some callers need change to handle the
>>> new case, and the same is true if you use some positive value.
>>> My preference is -EHWPOISON, but other options are fine if justified well.
>>
>> -EHWPOISON seems like a good fit.
>>
> 
> Hi Oscar, david:
> 
> Long away fron this topic, but i noticed today I made a stupid mistake that EHWPOISON is already
> been declared, so we should better return EHWPOISON for this case.
> 
> Really sorry for this!
> 
> As the patch is still under review, I will post a new version for this, if I change this, may I add
> your review tag here please?

Just resend as v2. We will review and post our RBs there.
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..db4afc5bf15a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1259,6 +1259,8 @@  static void kill_me_maybe(struct callback_head *cb)
 	}
 
 	if (p->mce_vaddr != (void __user *)-1l) {
+		pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+			p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
 		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
 	} else {
 		pr_err("Memory error not recovered");
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e9481632fcd1..06f006174b8c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1224,7 +1224,7 @@  static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EBUSY;
 	}
 
 	num_poisoned_pages_inc();
@@ -1420,7 +1420,7 @@  int memory_failure(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
-		return 0;
+		return -EBUSY;
 	}
 
 	orig_head = hpage = compound_head(p);