diff mbox series

[v3] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

Message ID 20231014051754.3759099-1-zhiquan1.li@intel.com (mailing list archive)
State New, archived
Headers show
Series [v3] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic | expand

Commit Message

Zhiquan Li Oct. 14, 2023, 5:17 a.m. UTC
Memory errors don't happen very often, especially the severity is fatal.
However, in large-scale scenarios, such as data centers, it might still
happen.  For some MCE fatal error cases, the kernel might call
mce_panic() to terminate the production kernel directly, thus there is
no opportunity to queue a task for calling memory_failure() which will
try to make the kernel survive via memory failure handling.

Unfortunately, the capture kernel will panic for the same reason if it
touches the error memory again.  The consequence is that only an
incomplete vmcore is left for sustaining engineers, it's a big headache
for them to make clear what happened in the past.

The main task of kdump kernel is providing a "window" - /proc/vmcore,
for the dump program to access old memory.  A dump program running in
userspace determines the "policy".  Which pages need to be dumped is
determined by the configuration of dump program, it reads out the pages
that the sustaining engineer is interested in and excludes the rest.

Likewise, the dump program can exclude the poisoned page to avoid
touching the error page again, the prerequisite is the PG_hwpoison page
flag is set correctly by kernel.  The de facto dump program
(makedumpfile) already supports this function in a decade ago.  To set
the PG_hwpoison flag in the production kernel just before it panics is
the only missing step to make everything work.

And it would not introduce additional overhead in capture kernel or
conflict with other hwpoision-related code in production kernel.  It
leverages the already existing mechanisms to fix the issue as much as
possible, so the code changes are lightweight.

[ Tony: Changed TestSetPageHWPoison() to SetPageHWPoison() ]
[ mingo: Fixed the comments & changelog ]

Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/all/20230719211625.298785-1-tony.luck@intel.com/#t

---

V2: https://lore.kernel.org/all/20230914030539.1622477-1-zhiquan1.li@intel.com/

Changes since V2:
- Rebased to v6.6-rc5.
- Explained full scenario in commit message per Boris's suggestion.
- Included Ingo's fixes.
  Link: https://lore.kernel.org/all/ZRsUpM%2FXtPAE50Rm@gmail.com/

V1: https://lore.kernel.org/all/20230127015030.30074-1-tony.luck@intel.com/

Changes since V1:
- Revised the commit message as per Naoya's suggestion.
- Replaced "TODO" comment in code with comments based on mailing list
  discussion on the lack of value in covering other page types.
- Added the tag from Naoya.
  Link: https://lore.kernel.org/all/20230327083739.GA956278@hori.linux.bs1.fc.nec.co.jp/
---
 arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Luck, Tony Oct. 14, 2023, 5:12 a.m. UTC | #1
> Co-developed-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Link: https://lore.kernel.org/all/20230719211625.298785-1-tony.luck@intel.com/#t

This still has problems.  You should have removed my Signed-off-by. You should NOT
have added Ingo's Signed-off-by (the only time to add someone else's Signed-off-by
is when paired with a Co-developed-by).

-Tony
Zhiquan Li Oct. 14, 2023, 9:34 a.m. UTC | #2
On 2023/10/14 13:12, Luck, Tony wrote:
>> Co-developed-by: Youquan Song <youquan.song@intel.com>
>> Signed-off-by: Youquan Song <youquan.song@intel.com>
>> Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Link: https://lore.kernel.org/all/20230719211625.298785-1-tony.luck@intel.com/#t
> 
> This still has problems.  You should have removed my Signed-off-by. You should NOT
> have added Ingo's Signed-off-by (the only time to add someone else's Signed-off-by
> is when paired with a Co-developed-by).
> 

Oh, this is V3, not RESEND, I should reset the SoB chain.
Thanks for your reminder, Tony.

If my understanding is correct, the version below fixes the tag list.

Best Regards,
Zhiquan

========>
From: Zhiquan Li <zhiquan1.li@intel.com>
Date: Sat, 14 Oct 2023 13:03:17 -0600
Subject: [PATCH v3] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

Memory errors don't happen very often, especially the severity is fatal.
However, in large-scale scenarios, such as data centers, it might still
happen.  For some MCE fatal error cases, the kernel might call
mce_panic() to terminate the production kernel directly, thus there is
no opportunity to queue a task for calling memory_failure() which will
try to make the kernel survive via memory failure handling.

Unfortunately, the capture kernel will panic for the same reason if it
touches the error memory again.  The consequence is that only an
incomplete vmcore is left for sustaining engineers, it's a big headache
for them to make clear what happened in the past.

The main task of kdump kernel is providing a "window" - /proc/vmcore,
for the dump program to access old memory.  A dump program running in
userspace determines the "policy".  Which pages need to be dumped is
determined by the configuration of dump program, it reads out the pages
that the sustaining engineer is interested in and excludes the rest.

Likewise, the dump program can exclude the poisoned page to avoid
touching the error page again, the prerequisite is the PG_hwpoison page
flag is set correctly by kernel.  The de facto dump program
(makedumpfile) already supports this function in a decade ago.  To set
the PG_hwpoison flag in the production kernel just before it panics is
the only missing step to make everything work.

And it would not introduce additional overhead in capture kernel or
conflict with other hwpoision-related code in production kernel.  It
leverages the already existing mechanisms to fix the issue as much as
possible, so the code changes are lightweight.

Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Borislav Petkov <bp@alien8.de>

---

V2: https://lore.kernel.org/all/20230914030539.1622477-1-zhiquan1.li@intel.com/

Changes since V2:
- Rebased to v6.6-rc5.
- Explained full scenario in commit message per Boris's suggestion.
- Included Ingo's fixes.
  Link: https://lore.kernel.org/all/ZRsUpM%2FXtPAE50Rm@gmail.com/

V1: https://lore.kernel.org/all/20230127015030.30074-1-tony.luck@intel.com/

Changes since V1:
- Revised the commit message as per Naoya's suggestion.
- Replaced "TODO" comment in code with comments based on mailing list
  discussion on the lack of value in covering other page types.
- Added the tag from Naoya.
  Link: https://lore.kernel.org/all/20230327083739.GA956278@hori.linux.bs1.fc.nec.co.jp/
---
 arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..905e80c776b8 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -233,6 +233,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	struct llist_node *pending;
 	struct mce_evt_llist *l;
 	int apei_err = 0;
+	struct page *p;
 
 	/*
 	 * Allow instrumentation around external facilities usage. Not that it
@@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
+		/*
+		 * Kdump can exclude the HWPoison page to avoid touching the error
+		 * page again, the prerequisite is that the PG_hwpoison page flag is
+		 * set.  However, for some MCE fatal error cases, there is no
+		 * opportunity to queue a task for calling memory_failure(), and as a
+		 * result, the capture kernel panics.  So mark the page as HWPoison
+		 * before kernel panic() for MCE.
+		 */
+		p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+		if (final && (final->status & MCI_STATUS_ADDRV) && p)
+			SetPageHWPoison(p);
 		panic(msg);
 	} else
 		pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
Borislav Petkov Oct. 14, 2023, 10:18 a.m. UTC | #3
On Sat, Oct 14, 2023 at 05:34:12PM +0800, Zhiquan Li wrote:
> Oh, this is V3, not RESEND, I should reset the SoB chain.

You should slow down and take the time to read the document I pointed
you at.
Borislav Petkov Oct. 16, 2023, 9:11 a.m. UTC | #4
On Sat, Oct 14, 2023 at 05:34:12PM +0800, Zhiquan Li wrote:
> Memory errors don't happen very often, especially the severity is fatal.
> However, in large-scale scenarios, such as data centers, it might still
> happen.  For some MCE fatal error cases, the kernel might call
> mce_panic() to terminate the production kernel directly, thus there is
> no opportunity to queue a task for calling memory_failure() which will
> try to make the kernel survive via memory failure handling.

You can't "make the kernel survive" if the error has been deemed
critical. That's mce_severity()'s job. If it grades the error's severity
wrongly and memory_failure() should run after all, then this is
a different story.

> @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  	if (!fake_panic) {
>  		if (panic_timeout == 0)
>  			panic_timeout = mca_cfg.panic_timeout;

This whole thing...

> +		/*
> +		 * Kdump can exclude the HWPoison page to avoid touching the error
> +		 * page again, the prerequisite is that the PG_hwpoison page flag is
> +		 * set.  However, for some MCE fatal error cases, there is no
> +		 * opportunity to queue a task for calling memory_failure(), and as a
> +		 * result, the capture kernel panics.  So mark the page as HWPoison
> +		 * before kernel panic() for MCE.
> +		 */
> +		p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
> +		if (final && (final->status & MCI_STATUS_ADDRV) && p)
> +			SetPageHWPoison(p);

... needs to be inside:

	if (kexec_crash_loaded() {
		...
	}

otherwise it'll be useless work on the panic path.

Thx.
Zhiquan Li Oct. 17, 2023, 1:05 a.m. UTC | #5
On 2023/10/16 17:11, Borislav Petkov wrote:
>> For some MCE fatal error cases, the kernel might call
>> mce_panic() to terminate the production kernel directly, thus there is
>> no opportunity to queue a task for calling memory_failure() which will
>> try to make the kernel survive via memory failure handling.
> You can't "make the kernel survive" if the error has been deemed
> critical. That's mce_severity()'s job. If it grades the error's severity
> wrongly and memory_failure() should run after all, then this is
> a different story.
> 

I understand what you mean.  Looks I didn't express myself well on this
point and caused ambiguity.  Maybe removing the attributive clause would
make it brief and clear? Such as,

	For some MCE fatal error cases, the kernel might call
	mce_panic() to terminate the production kernel directly, there
	is no opportunity to queue a task for calling memory_failure().

Or do you have other better suggestions? Thanks.


>> @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>>  	if (!fake_panic) {
>>  		if (panic_timeout == 0)
>>  			panic_timeout = mca_cfg.panic_timeout;
> This whole thing...
> 
>> +		/*
>> +		 * Kdump can exclude the HWPoison page to avoid touching the error
>> +		 * page again, the prerequisite is that the PG_hwpoison page flag is
>> +		 * set.  However, for some MCE fatal error cases, there is no
>> +		 * opportunity to queue a task for calling memory_failure(), and as a
>> +		 * result, the capture kernel panics.  So mark the page as HWPoison
>> +		 * before kernel panic() for MCE.
>> +		 */
>> +		p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
>> +		if (final && (final->status & MCI_STATUS_ADDRV) && p)
>> +			SetPageHWPoison(p);
> ... needs to be inside:
> 
> 	if (kexec_crash_loaded() {
> 		...
> 	}
> 
> otherwise it'll be useless work on the panic path.

Good idea! The condition makes it more robust.  I'll validate it and
send V4.

Best Regards,
Zhiquan
Luck, Tony Oct. 17, 2023, 1:24 a.m. UTC | #6
> I understand what you mean.  Looks I didn't express myself well on this
> point and caused ambiguity.  Maybe removing the attributive clause would
> make it brief and clear? Such as,
>
>	For some MCE fatal error cases, the kernel might call
>	mce_panic() to terminate the production kernel directly, there
>	is no opportunity to queue a task for calling memory_failure().

How about:

When there is a fatal machine check Linux calls mce_panic()
without checking to see if bad data at some memory address
was reported in the machine check banks.

If kexec is enabled, check for memory errors and mark the
page as poisoned so that the kexec'd kernel can avoid accessing
the page.

-Tony
Zhiquan Li Oct. 17, 2023, 1:39 a.m. UTC | #7
On 2023/10/14 18:18, Borislav Petkov wrote:
> You should slow down and take the time to read the document I pointed
> you at.

I'd like to revise the tag list as below for next version, and reference
rules are following each action.  Please correct me if I still
understand the rules in submitting-patches.rst wrongly.

	Co-developed-by: Youquan Song <youquan.song@intel.com>
	Signed-off-by: Youquan Song <youquan.song@intel.com>
	Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
	Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
	Cc: Borislav Petkov <bp@alien8.de>


1) As the author of the initial patch and the person who submitting the
patch, I put my SoB following Youquan's tags per below rule:

	Notably, the last Signed-off-by: must always be that of the
	developer submitting the patch.


2) Remove Tony's SoB.  I had confirmed with him, he needn't
Co-developed-by: tag, so the SoB added by himself in V1 and V2 should be
removed.
In fact, I'm not sure how to deal with such SoB for "RESEND" case.
While resent V2 I retained the SoB to reflect the chain.  According to
my understanding, the RESEND process emphasizes "not modifying".


3) Remove Ingo's SoB.  Because a new version means a new review cycle,
the SoB added in V2 should be reset to reflect the *new* real route,
unless Ingo needs a Co-developed-by: tag. Relative rule is following:

	Any further SoBs (Signed-off-by:'s) following the author's SoB
	are from people handling and transporting the patch, but were
	not involved in its development. SoB chains should reflect the
	*real* route a patch took as it was propagated to the
	maintainers and ultimately to Linus, with the first SoB entry
	signalling primary authorship of a single author.

I missed this point while I sent V3 :-(


4) Keep Naoya's Reviewed-by: according to below rule, because there is
no substantial change in V3.

	Both Tested-by and Reviewed-by tags, once received on mailing
	list from tester or reviewer, should be added by author to the
	applicable patches when sending next versions. However if the
	patch has changed substantially in following version, these tags
	might not be applicable anymore and thus should be removed.


5) Add Cc: tag to you per below rule :-)

	This is the only tag which might be added without an explicit
	action by the person it names - but it should indicate that this
	person was copied on the patch. This tag documents that
	potentially interested parties have been included in the
	discussion.

Best Regards,
Zhiquan
Borislav Petkov Oct. 17, 2023, 11:18 a.m. UTC | #8
On Tue, Oct 17, 2023 at 01:24:53AM +0000, Luck, Tony wrote:
> How about:
>
> When there is a fatal machine check Linux calls mce_panic()
> without checking to see if bad data at some memory address
> was reported in the machine check banks.

... for the simple reason that the kernel cannot allow itself to do any
unnecessary work but panic immediately so that it can stop the
propagation of bad data.

Now, it's a whole different story whether that's the right thing to do
and whether the data has already propagated so that the panic is moot.

The whole point I'm trying to make is that the machine panics because
the error severity dictates it to do so. And there's no opportunity to
queue recovery work because it simply cannot in that case. So the commit
message should simply state that we're marking the page as poison for
the kexec'ed kernel's sake and not because of anything else.

> If kexec is enabled, check for memory errors and mark the
> page as poisoned so that the kexec'd kernel can avoid accessing
> the page.

Yap, yours makes sense.

Thx.
Zhiquan Li Oct. 17, 2023, 3 p.m. UTC | #9
On 2023/10/17 19:18, Borislav Petkov wrote:
> ... for the simple reason that the kernel cannot allow itself to do any
> unnecessary work but panic immediately so that it can stop the
> propagation of bad data.
> 
> Now, it's a whole different story whether that's the right thing to do
> and whether the data has already propagated so that the panic is moot.
> 
> The whole point I'm trying to make is that the machine panics because
> the error severity dictates it to do so. And there's no opportunity to
> queue recovery work because it simply cannot in that case. So the commit
> message should simply state that we're marking the page as poison for
> the kexec'ed kernel's sake and not because of anything else.
> 

Wonderful!  Thanks for your detail explanation, Boris!

I think I got the point why you emphasized "can't make the kernel
survive" before.  In such scenario the consideration for recovery
doesn’t make sense at all, even thought there is opportunity it
shouldn’t do that, the only choice is panic ASAP.


>> If kexec is enabled, check for memory errors and mark the
>> page as poisoned so that the kexec'd kernel can avoid accessing
>> the page.
> Yap, yours makes sense.

Tony, your commit message made me realize how verbose my commit message
is. May I simplify the whole commit message as following for next version?

---start---
Memory errors don't happen very often, especially the severity is fatal.
 However, in large-scale scenarios, such as data centers, it might still
happen.  When there is a fatal machine check Linux calls mce_panic()
without checking to see if bad data at some memory address
was reported in the machine check banks.

If kexec is enabled, check for memory errors and mark the page as
poisoned so that the kexec'ed kernel can avoid accessing the page.
---end---

It already covers the scenario, root cause and solution, and focuses on
kernel.  No need to talk something else.

Thanks to both of you for great insights.

Best Regards,
Zhiquan
Luck, Tony Oct. 17, 2023, 5:35 p.m. UTC | #10
> Tony, your commit message made me realize how verbose my commit message
> is. May I simplify the whole commit message as following for next version?
>
> ---start---
> Memory errors don't happen very often, especially the severity is fatal.
>  However, in large-scale scenarios, such as data centers, it might still
> happen.  When there is a fatal machine check Linux calls mce_panic()
> without checking to see if bad data at some memory address
> was reported in the machine check banks.
>
> If kexec is enabled, check for memory errors and mark the page as
> poisoned so that the kexec'ed kernel can avoid accessing the page.
> ---end---
>
> It already covers the scenario, root cause and solution, and focuses on
> kernel.  No need to talk something else.

Yes, you can use that. Being concise (but keeping the important details)
is the art of a good commit message.

-Tony
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..905e80c776b8 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -233,6 +233,7 @@  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	struct llist_node *pending;
 	struct mce_evt_llist *l;
 	int apei_err = 0;
+	struct page *p;
 
 	/*
 	 * Allow instrumentation around external facilities usage. Not that it
@@ -286,6 +287,17 @@  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
+		/*
+		 * Kdump can exclude the HWPoison page to avoid touching the error
+		 * page again, the prerequisite is that the PG_hwpoison page flag is
+		 * set.  However, for some MCE fatal error cases, there is no
+		 * opportunity to queue a task for calling memory_failure(), and as a
+		 * result, the capture kernel panics.  So mark the page as HWPoison
+		 * before kernel panic() for MCE.
+		 */
+		p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+		if (final && (final->status & MCI_STATUS_ADDRV) && p)
+			SetPageHWPoison(p);
 		panic(msg);
 	} else
 		pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);