
[v2] x86/mce: Always call memory_failure() when there is a valid address

Message ID 20230418180343.19167-1-tony.luck@intel.com (mailing list archive)
State New, archived
Series [v2] x86/mce: Always call memory_failure() when there is a valid address

Commit Message

Tony Luck April 18, 2023, 6:03 p.m. UTC
Linux should always take poisoned pages offline when there is an error
report with a valid physical address.

Note 1: kill_me_maybe() will correctly handle the case currently
covered by the test of "kill_current_task" that is deleted by this
change, because it will set the MF_MUST_KILL flag when p->mce_ripv is
not set.

Note 2: This also provides defense against the case where the logged
error doesn't provide a physical address.

Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
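Note 1 depends on kill_me_maybe() adding MF_MUST_KILL to the flags it passes to memory_failure() whenever p->mce_ripv is clear. Below is a minimal userspace model of that flag selection, not kernel code. The MODEL_* flag values, struct model_task and model_kill_me_maybe_flags() are hypothetical stand-ins, and the assumption that MF_ACTION_REQUIRED is always set on this path should be checked against the mce/core.c being patched. The real function also computes the PFN and handles memory_failure()'s return value, which this sketch omits.

```c
#include <stdbool.h>

/*
 * Placeholder flag values for illustration only; these are NOT the
 * kernel's MF_* definitions.
 */
#define MODEL_MF_ACTION_REQUIRED (1u << 0)
#define MODEL_MF_MUST_KILL       (1u << 1)

/* Hypothetical stand-in for the per-task MCE state (p->mce_ripv). */
struct model_task {
	bool mce_ripv; /* was MCG_STATUS[RIPV] set for this error? */
};

/* Flags the kill_me_maybe() path would hand to memory_failure(). */
static unsigned int model_kill_me_maybe_flags(const struct model_task *p)
{
	unsigned int flags = MODEL_MF_ACTION_REQUIRED;

	/* No valid restart IP: the interrupted task cannot simply resume. */
	if (!p->mce_ripv)
		flags |= MODEL_MF_MUST_KILL;

	return flags;
}
```

This is why the deleted kill_current_task test is redundant: the "no restart IP" information survives in p->mce_ripv and still forces the kill, but the page is now taken offline as well.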

Comments

Yazen Ghannam April 18, 2023, 6:17 p.m. UTC | #1
On 4/18/23 14:03, Tony Luck wrote:
> Linux should always take poisoned pages offline when there is an error
> report with a valid physical address.
> 
> Note 1: kill_me_maybe() will correctly handle the case currently
> covered by the test of "kill_current_task" that is deleted by this
> change, because it will set the MF_MUST_KILL flag when p->mce_ripv is
> not set.
> 
> Note 2: This also provides defense against the case where the logged
> error doesn't provide a physical address.
> 
> Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 2eec60f50057..f72c97860524 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1533,7 +1533,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  		/* If this triggers there is no way to recover. Die hard. */
>  		BUG_ON(!on_thread_stack() || !user_mode(regs));
>  
> -		if (kill_current_task)
> +		if (mce_usable_address(&m))

This should be !mce_usable_address().

>  			queue_task_work(&m, msg, kill_me_now);
>  		else
>  			queue_task_work(&m, msg, kill_me_maybe);
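The effect of the missing negation shows up in a two-row truth table. This sketch models only which task_work callback gets queued. The enum and both helper functions are hypothetical, and it assumes, as the thread implies, that kill_me_now() signals the task without calling memory_failure() while kill_me_maybe() does call it.

```c
#include <stdbool.h>

/* Which task_work callback do_machine_check() would queue. */
enum model_callback { MODEL_KILL_ME_NOW, MODEL_KILL_ME_MAYBE };

/* Condition as posted in v2: if (mce_usable_address(&m)) kill_me_now */
static enum model_callback model_posted_v2(bool usable_address)
{
	return usable_address ? MODEL_KILL_ME_NOW : MODEL_KILL_ME_MAYBE;
}

/* Condition as corrected in review: if (!mce_usable_address(&m)) kill_me_now */
static enum model_callback model_corrected(bool usable_address)
{
	return !usable_address ? MODEL_KILL_ME_NOW : MODEL_KILL_ME_MAYBE;
}
```

With the condition as posted, every error that has a usable address is routed to kill_me_now(), so the page is never taken offline, which is the opposite of what the subject line promises.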

Thanks,
Yazen

P.S. I had the exact same change in mind. :)

Copying old patch here. Feel free to reuse any of the commit message if
it helps.


From 1123d883470c49babe7c390c67e604b658acb913 Mon Sep 17 00:00:00 2001
From: Yazen Ghannam <yazen.ghannam@amd.com>
Date: Fri, 8 Jan 2021 04:00:35 +0000
Subject: [PATCH] x86/MCE: Call kill_me_maybe() for errors with usable address

Call kill_me_maybe() for machine check errors with a usable address.
This ensures that any memory associated with the error is properly
marked as poison.

This is needed for errors in memory that do not have
MCG_STATUS[RIPV] set. One example is data poison consumption through the
instruction fetch units on AMD Zen-based systems.

The MF_MUST_KILL flag is passed to memory_failure() when
MCG_STATUS[RIPV] is not set, so the associated process will still be
killed.

The scenario described above occurs when hardware can precisely identify
the address of poisoned memory, but execution cannot reliably continue
for the interrupted hardware thread.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 89c81e9992d4..bc2523384357 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1555,7 +1555,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		if (kill_current_task)
+		if (!mce_usable_address(&m))
 			queue_task_work(&m, msg, kill_me_now);
 		else
 			queue_task_work(&m, msg, kill_me_maybe);
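Yazen's scenario (a precise poison address, but RIPV clear) is the case whose behavior changes. Below is a rough before/after model, assuming, as Note 1 in Tony's message implies, that the deleted kill_current_task test was true exactly when RIPV was clear. All types and functions are hypothetical; must_kill is a simplification meaning the task is killed whether or not memory_failure() could otherwise recover the page.

```c
#include <stdbool.h>

/* Hypothetical summary of one machine-check event (not struct mce). */
struct model_mce {
	bool ripv;           /* MCG_STATUS[RIPV]: can the task resume? */
	bool usable_address; /* what mce_usable_address(&m) would return */
};

/* Simplified result of handling the error in user mode. */
struct model_outcome {
	bool memory_failure_called; /* page handed to memory_failure()? */
	bool must_kill;             /* task killed unconditionally? */
};

/* Before the patch: assumes kill_current_task == !RIPV. */
static struct model_outcome model_before(const struct model_mce *m)
{
	struct model_outcome o = { .memory_failure_called = false, .must_kill = false };
	bool kill_current_task = !m->ripv;

	if (kill_current_task) {
		/* kill_me_now(): SIGBUS only, page stays online. */
		o.must_kill = true;
	} else {
		/* kill_me_maybe(): memory_failure() without MF_MUST_KILL. */
		o.memory_failure_called = true;
	}
	return o;
}

/* After the patch, with the negation from review applied. */
static struct model_outcome model_after(const struct model_mce *m)
{
	struct model_outcome o = { .memory_failure_called = false, .must_kill = false };

	if (!m->usable_address) {
		/* kill_me_now(): nothing valid to hand to memory_failure(). */
		o.must_kill = true;
	} else {
		/* kill_me_maybe(): page goes offline; MF_MUST_KILL if !RIPV. */
		o.memory_failure_called = true;
		o.must_kill = !m->ripv;
	}
	return o;
}
```

For the AMD instruction-fetch example ({ .ripv = false, .usable_address = true }), model_before() kills the task but leaves the page online, while model_after() kills the task and also calls memory_failure().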

Tony Luck April 18, 2023, 7:34 p.m. UTC | #2
>> +		if (mce_usable_address(&m))
>
> This should be !mce_usable_address().

> Copying old patch here. Feel free to reuse any of the commit message if
> it helps.

Might as well just take your version. The commit message seems fine.

Reviewed-by: Tony Luck <tony.luck@intel.com>


> From: Yazen Ghannam <yazen.ghannam@amd.com>
> Date: Fri, 8 Jan 2021 04:00:35 +0000

2021 - wow!

> Subject: [PATCH] x86/MCE: Call kill_me_maybe() for errors with usable address

-Tony

Yazen Ghannam April 19, 2023, 1:08 p.m. UTC | #3
On 4/18/23 15:34, Luck, Tony wrote:
>>> +		if (mce_usable_address(&m))
>>
>> This should be !mce_usable_address().
> 
>> Copying old patch here. Feel free to reuse any of the commit message if
>> it helps.
> 
> Might as well just take your version. The commit message seems fine.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> 

Thanks!

> 
>> From: Yazen Ghannam <yazen.ghannam@amd.com>
>> Date: Fri, 8 Jan 2021 04:00:35 +0000
> 
> 2021 - wow!
> 

Yeah, I've been distracted lately with other things. >_>

-Yazen

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..f72c97860524 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1533,7 +1533,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		if (kill_current_task)
+		if (mce_usable_address(&m))
 			queue_task_work(&m, msg, kill_me_now);
 		else
 			queue_task_work(&m, msg, kill_me_maybe);