diff mbox series

x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

Message ID 20230127015030.30074-1-tony.luck@intel.com (mailing list archive)
State New, archived
Headers show
Series x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic | expand

Commit Message

Luck, Tony Jan. 27, 2023, 1:50 a.m. UTC
From: Zhiquan Li <zhiquan1.li@intel.com>

Kdump can exclude the HWPosion page to avoid touch the error page
again, the prerequisite is the PG_hwpoison page flag is set.
However, for some MCE fatal error cases, there are no opportunity
to queue a task for calling memory_failure(), as a result,
the capture kernel touches the error page again and panics.

Add function mce_set_page_hwpoison_now() which mark a page as
HWPoison before kernel panic() for MCE error, so that the dump
program can check and skip the error page and prevent the capture
kernel panic.

[Tony: Changed TestSetPageHWPoison() to SetPageHWPoison()]

Co-developedd-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

HORIGUCHI NAOYA(堀口 直也) Jan. 30, 2023, 2:32 a.m. UTC | #1
On Thu, Jan 26, 2023 at 05:50:30PM -0800, Tony Luck wrote:
> From: Zhiquan Li <zhiquan1.li@intel.com>
> 
> Kdump can exclude the HWPosion page to avoid touch the error page
> again, the prerequisite is the PG_hwpoison page flag is set.
> However, for some MCE fatal error cases, there are no opportunity
> to queue a task for calling memory_failure(), as a result,
> the capture kernel touches the error page again and panics.
> 
> Add function mce_set_page_hwpoison_now() which mark a page as
> HWPoison before kernel panic() for MCE error, so that the dump
> program can check and skip the error page and prevent the capture
> kernel panic.
> 
> [Tony: Changed TestSetPageHWPoison() to SetPageHWPoison()]
> 
> Co-developedd-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Hi,
Thank you for the patch.

> ---
>  arch/x86/kernel/cpu/mce/core.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 2c8ec5c71712..0630999c6311 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -162,6 +162,24 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
>  }
>  EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
>  
> +/*
> + * Kdump can exclude the HWPosion page to avoid touch the error page again,
> + * the prerequisite is the PG_hwpoison page flag is set. However, for some
> + * MCE fatal error cases, there are no opportunity to queue a task
> + * for calling memory_failure(), as a result, the capture kernel panic.

s/panic/panics/ ?

> + * This function mark the page as HWPoison before kernel panic() for MCE.

s/mark/marks/

> + */
> +static void mce_set_page_hwpoison_now(unsigned long pfn)
> +{
> +	struct page *p;
> +
> +	/* TODO: need to handle other sort of page, like SGX, PMEM and
> +	 * HugeTLB pages*/

Although I'm not sure that SGX memory or PMEM pages are expected to be
included in kdump, but simply setting PageHWPoison does not work for them?
(Maybe that depends on how kdump handles these types of memory.)

As for HugeTLB, kdump utility should parse the struct page and be aware of
HugeTLB pages, so maybe setting PageHWPoison on the head page could work.

> +	p = pfn_to_online_page(pfn);
> +	if (p)
> +		SetPageHWPoison(p);
> +}
> +
>  static void __print_mce(struct mce *m)
>  {
>  	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
> @@ -292,6 +310,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  	if (!fake_panic) {
>  		if (panic_timeout == 0)
>  			panic_timeout = mca_cfg.panic_timeout;
> +		if (final && (final->status & MCI_STATUS_ADDRV))
> +			 mce_set_page_hwpoison_now(final->addr >> PAGE_SHIFT);
>  		panic(msg);

I think that setting PageHWPoison outside hwpoison subsystem is OK here,
because this is called just before calling panic() so it's expected to not
conflict with other hwpoison-related code.

Thanks,
Naoya Horiguchi
Luck, Tony Jan. 30, 2023, 7:21 p.m. UTC | #2
> Although I'm not sure that SGX memory or PMEM pages are expected to be
> included in kdump, but simply setting PageHWPoison does not work for them?
> (Maybe that depends on how kdump handles these types of memory.)

SGX/TDX pages can't be dumped. They are encrypted with no way for kdump to
get the key.

PMEM seems pointless (but I don't know what kdump does here).

> As for HugeTLB, kdump utility should parse the struct page and be aware of
> HugeTLB pages, so maybe setting PageHWPoison on the head page could work.

Or maybe kdump can take not of the PageHWPoison flag on the sub-page of the
huge page? It depends on whether there is any benefit to the dump to include the
not-poisoned parts of a huge page.


> I think that setting PageHWPoison outside hwpoison subsystem is OK here,
> because this is called just before calling panic() so it's expected to not
> conflict with other hwpoison-related code.

Thanks for the review.

-Tony
HORIGUCHI NAOYA(堀口 直也) Jan. 31, 2023, 4:50 a.m. UTC | #3
On Mon, Jan 30, 2023 at 07:21:15PM +0000, Luck, Tony wrote:
> > Although I'm not sure that SGX memory or PMEM pages are expected to be
> > included in kdump, but simply setting PageHWPoison does not work for them?
> > (Maybe that depends on how kdump handles these types of memory.)
> 
> SGX/TDX pages can't be dumped. They are encrypted with no way for kdump to
> get the key.
> 
> PMEM seems pointless (but I don't know what kdump does here).
> 
> > As for HugeTLB, kdump utility should parse the struct page and be aware of
> > HugeTLB pages, so maybe setting PageHWPoison on the head page could work.
> 
> Or maybe kdump can take not of the PageHWPoison flag on the sub-page of the
> huge page? It depends on whether there is any benefit to the dump to include the
> not-poisoned parts of a huge page.

I think that many kdump users filter out HugeTLB pages (setting dump_level
to filter "User pages") to reduce the size of kdump.  User pages are not
much helpful to investigate kernel problems, so filtering all sub-pages in
hwpoisoned hugepage seems to me not so harmful.

I don't say that saving healthy subpages has no benefit, but I don't know
much about usecases where user pages in kdump file help.

Thanks,
Naoya Horiguchi
Zhiquan Li Feb. 1, 2023, 2:47 a.m. UTC | #4
On 2023/1/31 12:50, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Jan 30, 2023 at 07:21:15PM +0000, Luck, Tony wrote:
>>> Although I'm not sure that SGX memory or PMEM pages are expected to be
>>> included in kdump, but simply setting PageHWPoison does not work for them?
>>> (Maybe that depends on how kdump handles these types of memory.)
>>
>> SGX/TDX pages can't be dumped. They are encrypted with no way for kdump to
>> get the key.
>>
>> PMEM seems pointless (but I don't know what kdump does here).
>>
>>> As for HugeTLB, kdump utility should parse the struct page and be aware of
>>> HugeTLB pages, so maybe setting PageHWPoison on the head page could work.
>>
>> Or maybe kdump can take not of the PageHWPoison flag on the sub-page of the
>> huge page? It depends on whether there is any benefit to the dump to include the
>> not-poisoned parts of a huge page.
> 
> I think that many kdump users filter out HugeTLB pages (setting dump_level
> to filter "User pages") to reduce the size of kdump.  User pages are not
> much helpful to investigate kernel problems, so filtering all sub-pages in
> hwpoisoned hugepage seems to me not so harmful.
> 
> I don't say that saving healthy subpages has no benefit, but I don't know
> much about usecases where user pages in kdump file help.
> 

Many thanks, Naoya and Tony.

The "TODO" comment was initially added by me. I referenced
memory_failure() that gets rid of the three cases specifically, so I'd
like the cases are fully discussed here to make sure it is a strong fix.

- SGX, as Tony said those pages can't be dumped.
- PMEM, although the memory region is figured out by kexec, kdump
utility doesn't have corresponding implementation for it.  So we can
ignore it.
- HugeTLB
  As we known there are complicated processes for HugeTLB pages at
memory_failure(). However, it doesn't make sense for a routine going to
call panic().  Kernel marks the page hwpoisoned would be enough.

The information is sufficient for kdump to make decision on how to deal
with a hwpoisoned huge pages, even though setting the PageHWPoison flag
on the sub-page of a huge page.  It's kdump's responsibility for the
not-poisoned parts of a huge page.  Currently the implementation is to
exclude the whole huge page simply:

-
https://github.com/makedumpfile/makedumpfile/commit/e8b4f93b3260defe86f5e13ca7536c07f2e32914
- https://bugzilla.redhat.com/show_bug.cgi?id=1181649

Best Regards,
Zhiquan
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c8ec5c71712..0630999c6311 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -162,6 +162,24 @@  void mce_unregister_decode_chain(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
 
+/*
+ * Kdump can exclude the HWPosion page to avoid touch the error page again,
+ * the prerequisite is the PG_hwpoison page flag is set. However, for some
+ * MCE fatal error cases, there are no opportunity to queue a task
+ * for calling memory_failure(), as a result, the capture kernel panic.
+ * This function mark the page as HWPoison before kernel panic() for MCE.
+ */
+static void mce_set_page_hwpoison_now(unsigned long pfn)
+{
+	struct page *p;
+
+	/* TODO: need to handle other sort of page, like SGX, PMEM and
+	 * HugeTLB pages*/
+	p = pfn_to_online_page(pfn);
+	if (p)
+		SetPageHWPoison(p);
+}
+
 static void __print_mce(struct mce *m)
 {
 	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
@@ -292,6 +310,8 @@  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
+		if (final && (final->status & MCI_STATUS_ADDRV))
+			 mce_set_page_hwpoison_now(final->addr >> PAGE_SHIFT);
 		panic(msg);
 	} else
 		pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);