
[RFC,v1,1/2] mm/memory-failure: introduce global MFR policy

Message ID 20240924043924.3562257-2-jiaqiyan@google.com (mailing list archive)
State New
Series Userspace Can Control Memory Failure Recovery

Commit Message

Jiaqi Yan Sept. 24, 2024, 4:39 a.m. UTC
Give userspace the control to enable or disable HARD_OFFLINE of an error folio
(either a raw page or a hugepage). By default, HARD_OFFLINE is enabled, which is
consistent with the existing memory_failure behavior.

Userspace should be able to control whether to keep or discard a large chunk
of memory in the event of uncorrectable memory errors. There are two major
use cases in cloud environments.

The 1st case is a 1G HugeTLB-backed database workload. Compared to discarding
the hugepage when only a single PFN is impacted by an uncorrectable memory error,
if the kernel simply leaves the 1G hugepage mapped, accesses to the majority of
clean PFNs within the poisoned 1G region still work well for the VM and workload.

The 2nd case is MMIO device memory or EGM [1] mapped to userspace via a huge
VM_PFNMAP [2]. If the kernel does not zap the PUD or PMD, there is no need for
the VFIO driver that manages the memory to intercept page faults for clean PFNs
and to reinstall PTEs.

In addition, in both cases there is no EPT or stage-2 (S2) violation, so there is
no performance cost for accessing clean guest pages already mapped in the EPT or S2.

See the cover letter for more details on why userspace needs such control, and
the implications when userspace chooses to disable HARD_OFFLINE.
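
For illustration, here is a minimal C sketch (not part of this patch) of how a
privileged control plane could disable hard offline at runtime; the sysctl path
comes from this patch, everything else is illustrative:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Write "0" to disable hard offline, "1" to restore the default. */
  static int set_hard_offline(int enable)
  {
          int fd = open("/proc/sys/vm/enable_hard_offline", O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return -1;
          }
          if (write(fd, enable ? "1" : "0", 1) != 1) {
                  perror("write");
                  close(fd);
                  return -1;
          }
          close(fd);
          return 0;
  }

  int main(void)
  {
          return set_hard_offline(0) ? 1 : 0;
  }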

If this RFC receives generally positive feedback, I will add a selftest in v2.

[1] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[2] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/memory-failure.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

Comments

Jane Chu Oct. 2, 2024, 11:50 p.m. UTC | #1
Hi,

On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
>   
> +	/*
> +	 * On ARM64, if APEI fails to claim the SEA (e.g. GHES driver doesn't
> +	 * register for SEA notifications from firmware), memory_failure will
> +	 * never be synchronous to the error-consuming thread. Notifying
> +	 * it via SIGBUS synchronously has to be done by either the core kernel in
> +	 * do_mem_abort, or KVM in kvm_handle_guest_abort.
> +	 */
> +	if (!sysctl_enable_hard_offline) {
> +		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
> +		kill_procs_now(p, pfn, flags, page_folio(p));
> +		res = -EOPNOTSUPP;
> +		goto unlock_mutex;
> +	}
> +

I am curious why the SIGBUS is sent without setting PG_hwpoison in the 
page.   In 0/2 there seems to be indication about threads coordinate 
with each other such that clean subpages in a poisoned hugetlb page 
continue to be accessible, and at some point, (or perhaps I misread), 
the poisoned page (sub- or huge-) will eventually be isolated, because, 
it's unthinkable to let a poisoned page laying around and kernel treats 
it like a clean page ?  But I'm not sure how do you plan to handle it 
without PG_hwpoison while hard_offline is disabled globally.

Another thing I'm curious at is whether you have tested with real 
hardware UE - the one that triggers MCE.  When a real UE is consumed by 
the training process, the user process must longjmp out in order to 
avoid getting stuck at the same instruction that fetched a UE memory.  
Given a longjmp is needed (unless I am missing something), the training 
process is already in a situation where it has to figure out things like 
rewind, where-to-restart-from, does it even keep states? etc. On the 
whole, whether the burden to ask user application to deal with what's 
lacking in the kernel, namely the lack of splitting up a hugetlb page, 
is worthwhile, is something that need to be weighed over.
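
(A minimal sketch of that recovery pattern, assuming the application wraps its
reads; sigbus_handler/checked_read are illustrative names, not from this series:)

  #include <setjmp.h>
  #include <signal.h>

  static sigjmp_buf recover_point;

  static void sigbus_handler(int sig, siginfo_t *info, void *uc)
  {
          /* Jump out so we don't re-execute the load that hit poison. */
          siglongjmp(recover_point, 1);
  }

  static long checked_read(volatile long *p)
  {
          struct sigaction sa = {
                  .sa_sigaction = sigbus_handler,
                  .sa_flags = SA_SIGINFO,
          };

          sigaction(SIGBUS, &sa, NULL);
          if (sigsetjmp(recover_point, 1))
                  return -1;      /* landed here via the handler: data lost */
          return *p;              /* may consume poison and raise SIGBUS */
  }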

Thanks,

-jane
Jiaqi Yan Oct. 3, 2024, 11:51 p.m. UTC | #2
Hi Jane,

On Wed, Oct 2, 2024 at 4:50 PM <jane.chu@oracle.com> wrote:
>
> Hi,
>
> On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
> >
> > +     /*
> > +      * On ARM64, if APEI fails to claim the SEA (e.g. GHES driver doesn't
> > +      * register for SEA notifications from firmware), memory_failure will
> > +      * never be synchronous to the error-consuming thread. Notifying
> > +      * it via SIGBUS synchronously has to be done by either the core kernel in
> > +      * do_mem_abort, or KVM in kvm_handle_guest_abort.
> > +      */
> > +     if (!sysctl_enable_hard_offline) {
> > +             pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
> > +             kill_procs_now(p, pfn, flags, page_folio(p));
> > +             res = -EOPNOTSUPP;
> > +             goto unlock_mutex;
> > +     }
> > +
>
> I am curious why the SIGBUS is sent without setting PG_hwpoison in the
> page.   In 0/2 there seems to be indication about threads coordinate
> with each other such that clean subpages in a poisoned hugetlb page
> continue to be accessible, and at some point, (or perhaps I misread),
> the poisoned page (sub- or huge-) will eventually be isolated, because,

The code here is "global policy". The "per-VMA policy", proposed in
0/2 but code not sent, should be able to support isolation + offline
at some point (all VMAs are gone and page becomes free).

> it's unthinkable to let a poisoned page laying around and kernel treats
> it like a clean page ?  But I'm not sure how do you plan to handle it
> without PG_hwpoison while hard_offline is disabled globally.

It will become the responsibility of a control plane running in
userspace. For example, the control plane immediately prevents starting
of any new workload/VM, but chooses to wait until memory errors exceed
a certain threshold, or hold on to the hosts until all workloads/VMs
are migrated and then repair the machine. Not setting PG_hwpoison is
indeed a big difference and risk, so it needs to be carefully handled
by userspace.

>
> Another thing I'm curious at is whether you have tested with real
> hardware UE - the one that triggers MCE.  When a real UE is consumed by

Yes, with our workload. Can you share more about what is the "training
process"? Is it something to train memory or screen memory errors?

> the training process, the user process must longjmp out in order to
> avoid getting stuck at the same instruction that fetched a UE memory.
> Given a longjmp is needed (unless I am missing something), the training
> process is already in a situation where it has to figure out things like
> rewind, where-to-restart-from, does it even keep states? etc. On the
> whole, whether the burden to ask user application to deal with what's
> lacking in the kernel, namely the lack of splitting up a hugetlb page,
> is worthwhile, is something that need to be weighed over.

For sure, and that's why I put a lot of words in the cover letter
into the two use cases where asking the "user application to deal with
what's lacking in the kernel" is worthwhile.

>
> Thanks,
>
> -jane
>
>
Jane Chu Oct. 7, 2024, 5:24 p.m. UTC | #3
On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> soned page (sub- or huge-) will eventually be isolated, because,
> The code here is "global policy". The "per-VMA policy", proposed in
> 0/2 but code not sent, should be able to support isolation + offline
> at some point (all VMAs are gone and page becomes free).
"per-VMA policy" sounds interesting.
>> Another thing I'm curious at is whether you have tested with real
>> hardware UE - the one that triggers MCE.  When a real UE is consumed by
> Yes, with our workload. Can you share more about what is the "training
> process"? Is it something to train memory or screen memory errors?

The cover letter mentioned "Machine Learning (ML) workloads", so I used 
it as an example.

-jane
Jiaqi Yan Oct. 10, 2024, 11:21 p.m. UTC | #4
On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@oracle.com> wrote:
>
> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> > soned page (sub- or huge-) will eventually be isolated, because,
> > The code here is "global policy". The "per-VMA policy", proposed in
> > 0/2 but code not sent, should be able to support isolation + offline
> > at some point (all VMAs are gone and page becomes free).
> "per-VMA policy" sounds interesting.
> >> Another thing I'm curious at is whether you have tested with real
> >> hardware UE - the one that triggers MCE.  When a real UE is consumed by
> > Yes, with our workload. Can you share more about what is the "training
> > process"? Is it something to train memory or screen memory errors?
>
> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> it as an example.

Got you. In that case, if the ML workload (running in a VM) wants to
do what you described, wouldn't losing 1G hugetlb page due to kernel
offline make the VM/workload even harder to execute recover logic?

>
> -jane
>
Miaohe Lin Oct. 11, 2024, 7:04 a.m. UTC | #5
On 2024/10/4 7:51, Jiaqi Yan wrote:
> Hi Jane,
> 
> On Wed, Oct 2, 2024 at 4:50 PM <jane.chu@oracle.com> wrote:
>>
>> Hi,
>>
>> On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
>>>
>>> +     /*
>>> +      * On ARM64, if APEI failed to claims SEA, (e.g. GHES driver doesn't
>>> +      * register to SEA notifications from firmware), memory_failure will
>>> +      * never be synchrounous to the error consumption thread. Notifying
>>> +      * it via SIGBUS synchrnously has to be done by either core kernel in
>>> +      * do_mem_abort, or KVM in kvm_handle_guest_abort.
>>> +      */
>>> +     if (!sysctl_enable_hard_offline) {
>>> +             pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
>>> +             kill_procs_now(p, pfn, flags, page_folio(p));
>>> +             res = -EOPNOTSUPP;
>>> +             goto unlock_mutex;
>>> +     }
>>> +
>>
>> I am curious why the SIGBUS is sent without setting PG_hwpoison in the
>> page.   In 0/2 there seems to be indication about threads coordinate
>> with each other such that clean subpages in a poisoned hugetlb page
>> continue to be accessible, and at some point, (or perhaps I misread),
>> the poisoned page (sub- or huge-) will eventually be isolated, because,
> 
> The code here is "global policy". The "per-VMA policy", proposed in
> 0/2 but code not sent, should be able to support isolation + offline
> at some point (all VMAs are gone and page becomes free).
> 
>> it's unthinkable to let a poisoned page laying around and kernel treats
>> it like a clean page ?  But I'm not sure how do you plan to handle it
>> without PG_hwpoison while hard_offline is disabled globally.
> 
> > It will become the responsibility of a control plane running in
> > userspace. For example, the control plane immediately prevents starting
> of any new workload/VM, but chooses to wait until memory errors exceed
> a certain threshold, or hold on to the hosts until all workloads/VMs
> are migrated and then repair the machine. Not setting PG_hwpoison is
> indeed a big difference and risk, so it needs to be carefully handled
> by userspace.
> 

Could you explain why PG_hwpoison cannot be set in this case? It seems a control plane running in
userspace can work with PG_hwpoison set. PG_hwpoison makes sure hwpoisoned pages won't be re-used
by the kernel while the control plane prevents them from being re-accessed from userspace. Or am I missing something?

Thanks.
.
Jane Chu Oct. 11, 2024, 6:28 p.m. UTC | #6
On 10/10/2024 4:21 PM, Jiaqi Yan wrote:

> On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@oracle.com> wrote:
>> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
>>> soned page (sub- or huge-) will eventually be isolated, because,
>>> The code here is "global policy". The "per-VMA policy", proposed in
>>> 0/2 but code not sent, should be able to support isolation + offline
>>> at some point (all VMAs are gone and page becomes free).
>> "per-VMA policy" sounds interesting.
>>>> Another thing I'm curious at is whether you have tested with real
>>>> hardware UE - the one that triggers MCE.  When a real UE is consumed by
>>> Yes, with our workload. Can you share more about what is the "training
>>> process"? Is it something to train memory or screen memory errors?
>> The cover letter mentioned "Machine Learning (ML) workloads", so I used
>> it as an example.
> Got you. In that case, if the ML workload (running in a VM) wants to
> do what you described, wouldn't losing 1G hugetlb page due to kernel
> offline make the VM/workload even harder to execute recover logic?

Indeed.

As the user application gets more sophisticated at recovering from 
poison, what about making the kernel do the heavy lifting?

Something like by way of userfaultfd,  kernel provides a new/clean 
hugetlb page, copied over good data from the clean subpages and then 
present the clean hugetlb page to user process with indication that 
subpage x is a substitute of the poisoned old subpage x, hence its data 
might need a refill?  I am not sure how exactly to pull this through as 
the event is not a page fault, but just wondering whether something like 
this is possible.

thanks,

-jane

>
>> -jane
>>
Tony Luck Oct. 11, 2024, 7:44 p.m. UTC | #7
> Something like by way of userfaultfd,  kernel provides a new/clean 
> hugetlb page, copied over good data from the clean subpages and then 
> present the clean hugetlb page to user process with indication that 
> subpage x is a substitute of the poisoned old subpage x, hence its data 
> might need a refill?  I am not sure how exactly to pull this through as 
> the event is not a page fault, but just wondering whether something like
> this is possible.

This requires serious levels of sophistication from the application.
If some thread still accesses the "lost" data, there's no signal that
anything went wrong. It just reads whatever data the kernel filled the
poisoned area with. For some applications there might be some
data pattern that would help track this down. But no general answer.

On the plus side, the amount of "lost" data need not be a page.
On Intel the poison unit is a cache line (64 bytes). So more of the
original data can potentially be preserved. This might be useful
for applications using regular pages as well as those using huge pages.

When Linux first implemented recovery, we had hopes that applications
like databases would be able to implement their own recovery. Losing
a whole page turned out to be problematic as in some implementations
the metadata for a database entry was stored at the start of the memory
block. So the SIGBUS would provide the virtual address, and it wasn't
of any practical use to determine which data structure(s) were affected
without some massive restructure of the code to separate metadata
from data.

-Tony
Jane Chu Oct. 11, 2024, 8:15 p.m. UTC | #8
On 10/11/2024 12:44 PM, Luck, Tony wrote:

>> Something like by way of userfaultfd,  kernel provides a new/clean
>> hugetlb page, copied over good data from the clean subpages and then
>> present the clean hugetlb page to user process with indication that
>> subpage x is a substitute of the poisoned old subpage x, hence its data
>> might need a refill?  I am not sure how exactly to pull this through as
>> the event is not a page fault, but just wondering whether something like
>> this is possible.
> This requires serious levels of sophistication from the application.
> If some thread still accesses the "lost" data, there's no signal that
> anything went wrong. It just reads whatever data the kernel filled the
> poisoned area with. For some applications there might be some
> data pattern that would help track this down. But no general answer.
Is it possible to rely on mf_mutex to hold off subsequent threads 
accessing the poisoned spot until the 1st poison event has been handled 
and the page replaced by a joint effort of the application and the kernel?  
I mean, until the poisoned page is removed from the page table, other 
threads accessing it would hit an MCE, right?
>
> On the plus side, the amount of "lost" data need not be a page.
> On Intel the poison unit is a cache line (64 bytes). So more of the
> original data can potentially be preserved. This might be useful
> for applications using regular pages as well as those using huge pages.
That requires the kernel to provide a finer-grained SIGBUS payload, such as 
an untrimmed vaddr and si_lsb=6.
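
(For reference, a sketch of how a handler decodes the payload the kernel already
sends; si_code/si_addr/si_addr_lsb are the existing BUS_MCEERR_* fields, and
si_addr_lsb == 6 is the hypothetical cache-line case discussed here:)

  #define _GNU_SOURCE     /* for BUS_MCEERR_* and si_addr_lsb on some libcs */
  #include <signal.h>
  #include <stdio.h>

  /* SIGBUS handler installed with SA_SIGINFO. */
  static void sigbus_decode(int sig, siginfo_t *info, void *uc)
  {
          if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
                  /*
                   * si_addr_lsb gives the granularity of the loss as a power
                   * of two: 12 for a 4KiB page, 30 for a 1GiB hugetlb page
                   * today; 6 would mean a single 64-byte cache line.
                   */
                  unsigned long lost = 1UL << info->si_addr_lsb;

                  /* fprintf is not async-signal-safe; fine for a sketch. */
                  fprintf(stderr, "lost %lu bytes at %p\n", lost, info->si_addr);
          }
  }
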
>
> When Linux first implemented recovery, we had hopes that applications
> like databases would be able to implement their own recovery. Losing
> a whole page turned out to be problematic as in some implementations
> the metadata for a database entry was stored at the start of the memory
> block. So the SIGBUS would provide the virtual address, and it wasn't
> of any practical use to determine which data structure(s) were affected
> without some massive restructure of the code to separate metadata
> from data.
>
> -Tony
-jane
Jiaqi Yan Oct. 15, 2024, 11:45 p.m. UTC | #9
On Fri, Oct 11, 2024 at 11:28 AM <jane.chu@oracle.com> wrote:
>
> On 10/10/2024 4:21 PM, Jiaqi Yan wrote:
>
> > On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@oracle.com> wrote:
> >> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> >>> soned page (sub- or huge-) will eventually be isolated, because,
> >>> The code here is "global policy". The "per-VMA policy", proposed in
> >>> 0/2 but code not sent, should be able to support isolation + offline
> >>> at some point (all VMAs are gone and page becomes free).
> >> "per-VMA policy" sounds interesting.
> >>>> Another thing I'm curious at is whether you have tested with real
> >>>> hardware UE - the one that triggers MCE.  When a real UE is consumed by
> >>> Yes, with our workload. Can you share more about what is the "training
> >>> process"? Is it something to train memory or screen memory errors?
> >> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> >> it as an example.
> > Got you. In that case, if the ML workload (running in a VM) wants to
> > do what you described, wouldn't losing 1G hugetlb page due to kernel
> > offline make the VM/workload even harder to execute recover logic?
>
> Indeed.
>
> As the user application gets more sophisticated at recovering from
> poison, what about making the kernel do the heavy lifting?

I think there are two things.

First, if userspace claims it has enough or sophisticated recovery
ability (assume we trust it), can it take full control of what happens
to the hardware poisoned memory page it **owns**?
My answer to this question is yes. The reason is I believe the kernel
has a limited ability to do memory failure recovery (MFR) optimally
for all userspace. Current hard offline support in the kernel has also
made userspace recovery hard, so userspace deserves a position in MFR.

Second, what is the granularity of the control? This patch makes the
control applicable to every process. So what about making it
controllable only by the userspace process that owns the memory page?
The kernel can still do the heavy lifting (hard offline, set
HWPoison) **after** the owning userspace unclaims the control, or
exits.

Another way to "disable hardoffline but still set HWPoison" I can
think of is, make the HWPOISON flag apply at page_size level, instead
of always set at the compound head. At least from hugetlb's
perspective, is it a good idea?

>
> Something like by way of userfaultfd,  kernel provides a new/clean
> hugetlb page, copied over good data from the clean subpages and then
> present the clean hugetlb page to user process with indication that
> subpage x is a substitute of the poisoned old subpage x, hence its data
> might need a refill?  I am not sure how exactly to pull this through as
> > the event is not a page fault, but just wondering whether something like
> this is possible.
>
> thanks,
>
> -jane
>
> >
> >> -jane
> >>
Tony Luck Oct. 15, 2024, 11:56 p.m. UTC | #10
> Another way to "disable hardoffline but still set HWPoison" I can
> think of is, make the HWPOISON flag apply at page_size level, instead
> of always set at the compound head. At least from hugetlb's
> perspective, is it a good idea?

Many years ago someone looked at breaking up hugetlb pages
when a memory error occurred so that just 4K was lost instead
of the entire huge page. At that time the conclusion was that
doing so would require locks to be taken/released around all
hugetlb map/unmap operations. An unacceptable performance
issue for common operations to handle very rare memory error
events.

I don't know if that is still true. There's been a lot of restructure
to memory management code since then.

-Tony
Jiaqi Yan Oct. 15, 2024, 11:58 p.m. UTC | #11
On Fri, Oct 11, 2024 at 12:05 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2024/10/4 7:51, Jiaqi Yan wrote:
> > Hi Jane,
> >
> > On Wed, Oct 2, 2024 at 4:50 PM <jane.chu@oracle.com> wrote:
> >>
> >> Hi,
> >>
> >> On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
> >>>
> >>> +     /*
> >>> +      * On ARM64, if APEI fails to claim the SEA (e.g. GHES driver doesn't
> >>> +      * register for SEA notifications from firmware), memory_failure will
> >>> +      * never be synchronous to the error-consuming thread. Notifying
> >>> +      * it via SIGBUS synchronously has to be done by either the core kernel in
> >>> +      * do_mem_abort, or KVM in kvm_handle_guest_abort.
> >>> +      */
> >>> +     if (!sysctl_enable_hard_offline) {
> >>> +             pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
> >>> +             kill_procs_now(p, pfn, flags, page_folio(p));
> >>> +             res = -EOPNOTSUPP;
> >>> +             goto unlock_mutex;
> >>> +     }
> >>> +
> >>
> >> I am curious why the SIGBUS is sent without setting PG_hwpoison in the
> >> page.   In 0/2 there seems to be indication about threads coordinate
> >> with each other such that clean subpages in a poisoned hugetlb page
> >> continue to be accessible, and at some point, (or perhaps I misread),
> >> the poisoned page (sub- or huge-) will eventually be isolated, because,
> >
> > The code here is "global policy". The "per-VMA policy", proposed in
> > 0/2 but code not sent, should be able to support isolation + offline
> > at some point (all VMAs are gone and page becomes free).
> >
> >> it's unthinkable to let a poisoned page laying around and kernel treats
> >> it like a clean page ?  But I'm not sure how do you plan to handle it
> >> without PG_hwpoison while hard_offline is disabled globally.
> >
> > It will become the responsibility of a control plane running in
> > userspace. For example, the control plane immediately prevents starting
> > of any new workload/VM, but chooses to wait until memory errors exceed
> > a certain threshold, or hold on to the hosts until all workloads/VMs
> > are migrated and then repair the machine. Not setting PG_hwpoison is
> > indeed a big difference and risk, so it needs to be carefully handled
> > by userspace.
> >
>
> Could you explain why PG_hwpoison cannot be set in this case? It seems a control plane running in
> userspace can work with PG_hwpoison set. PG_hwpoison makes sure hwpoisoned pages won't be re-used
> by the kernel while the control plane prevents them from being re-accessed from userspace. Or am I missing something?
>

[Resend to include more people and linux-mm]

Sorry I almost missed your comment/question.

I think for hugetlb and transparent hugepages, say we keep them mapped
but set the HWPoison flag: the flag will be set on the compound head, and
any future userspace page fault on **any** part of the hugepage will
result in SIGBUS, meaning the whole hugepage is lost to userspace,
making "keep them mapped" a meaningless action.

> Thanks.
> .
>
Jane Chu Oct. 16, 2024, 12:19 a.m. UTC | #12
On 10/15/2024 4:56 PM, Luck, Tony wrote:

>> Another way to "disable hardoffline but still set HWPoison" I can
>> think of is, make the HWPOISON flag apply at page_size level, instead
>> of always set at the compound head. At least from hugetlb's
>> perspective, is it a good idea?
> Many years ago someone looked at breaking up hugetlb pages
> when a memory error occurred so that just 4K was lost instead
> of the entire huge page. At that time the conclusion was that
> doing so would require locks to be taken/released around all
> hugetlb map/unmap operations. An unacceptable performance
> issue for common operations to handle very rare memory error
> events.
>
> I don't know if that is still true. There's been a lot of restructure
> to memory management code since then.

The HGM for hugetlbfs project attempted this as well:

  https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey/

-jane

>
> -Tony
>
>

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7066fc84f351..a7b85b98d61e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -70,6 +70,8 @@  static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_enable_hard_offline __read_mostly = 1;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -151,6 +153,15 @@  static struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "enable_hard_offline",
+		.data		= &sysctl_enable_hard_offline,
+		.maxlen		= sizeof(sysctl_enable_hard_offline),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -2223,6 +2234,14 @@  int memory_failure(unsigned long pfn, int flags)
 
 	p = pfn_to_online_page(pfn);
 	if (!p) {
+		/*
+		 * For ZONE_DEVICE memory and memory on special architectures,
+		 * assume they have opted out of the core kernel's MFR. Since
+		 * this memory can still be mapped to userspace, let userspace
+		 * know MFR doesn't apply.
+		 */
+		pr_info_once("%#lx: can't apply global MFR policy\n", pfn);
+
 		res = arch_memory_failure(pfn, flags);
 		if (res == 0)
 			goto unlock_mutex;
@@ -2241,6 +2260,20 @@  int memory_failure(unsigned long pfn, int flags)
 		goto unlock_mutex;
 	}
 
+	/*
+	 * On ARM64, if APEI fails to claim the SEA (e.g. GHES driver doesn't
+	 * register for SEA notifications from firmware), memory_failure will
+	 * never be synchronous to the error-consuming thread. Notifying
+	 * it via SIGBUS synchronously has to be done by either the core kernel in
+	 * do_mem_abort, or KVM in kvm_handle_guest_abort.
+	 */
+	if (!sysctl_enable_hard_offline) {
+		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
+		kill_procs_now(p, pfn, flags, page_folio(p));
+		res = -EOPNOTSUPP;
+		goto unlock_mutex;
+	}
+
 try_again:
 	res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
 	if (hugetlb)