[-next,v4,2/3] x86/mce: rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC

Message ID 20240111135548.3207437-3-tongtiangen@huawei.com (mailing list archive)
State New
Series minor improvements for x86 mce processing

Commit Message

Tong Tiangen Jan. 11, 2024, 1:55 p.m. UTC
In x86 MCE processing, the macro MCE_IN_KERNEL_COPYIN is used to identify
a machine check hit while copying from user space. do_machine_check() uses
this flag to isolate the poisoned page via memory_failure(). There is
nothing wrong with that, but the use of this macro can be expanded.

Currently, there are also some kernel memory copy scenarios that are
MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage(). In these
scenarios, poisoned pages need to be isolated too. Therefore, a macro
similar to MCE_IN_KERNEL_COPYIN is required. For this reason, rename
MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC; the new macro can be
applied to both user-to-kernel and kernel-to-kernel MC-safe copies.

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
---
 arch/x86/include/asm/mce.h         | 8 ++++----
 arch/x86/kernel/cpu/mce/core.c     | 2 +-
 arch/x86/kernel/cpu/mce/severity.c | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)
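
For orientation, a minimal sketch of the kind of kernel-to-kernel MC-safe
copy the commit message refers to; dst, src and len are hypothetical
names, and the -EIO handling is illustrative only, relying on
copy_mc_to_kernel() returning the number of bytes left uncopied:

	/*
	 * Illustration only: a non-zero return means poison was hit and
	 * 'rem' trailing bytes were not copied, so the destination must
	 * be treated as incomplete.
	 */
	unsigned long rem = copy_mc_to_kernel(dst, src, len);

	if (rem)
		return -EIO;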

Comments

Borislav Petkov Jan. 31, 2024, 7:02 a.m. UTC | #1
On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
> Currently, there are also some kernel memory copy scenarios that are
> MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage().

Both of those end up in copy_mc_enhanced_fast_string() which does
EX_TYPE_DEFAULT_MCE_SAFE.
Tong Tiangen Feb. 1, 2024, 11:37 a.m. UTC | #2
On 2024/1/31 15:02, Borislav Petkov wrote:
> On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
>> Currently, there are also some kernel memory copy scenarios that are
>> MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage().
> 
> Both of those end up in copy_mc_enhanced_fast_string() which does
> EX_TYPE_DEFAULT_MCE_SAFE.

OK, how about this commit msg change? :)

Currently, there are also some kernel memory copy scenarios that are
MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage(); **both of
those end up in copy_mc_enhanced_fast_string() or copy_mc_fragile(),
which use EX_TYPE_DEFAULT_MCE_SAFE.**

In these scenarios, poisoned pages need to be isolated too. Therefore, a
macro similar to MCE_IN_KERNEL_COPYIN is required. For this reason,
rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC; the new macro can
be applied to both user-to-kernel and kernel-to-kernel MC-safe copies.

Thanks.
Tong.

>
Borislav Petkov Feb. 1, 2024, 2:20 p.m. UTC | #3
On Thu, Feb 01, 2024 at 07:37:25PM +0800, Tong Tiangen wrote:
> On 2024/1/31 15:02, Borislav Petkov wrote:
> > On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
> > > Currently, there are also some kernel memory copy scenarios that are
> > > MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage().
> > 
> > Both of those end up in copy_mc_enhanced_fast_string() which does
> > EX_TYPE_DEFAULT_MCE_SAFE.
> 
> OK, how about this commit msg change? :)
> 
> Currently, there are also some kernel memory copy scenarios that are
> MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage(); **both of
> those end up in copy_mc_enhanced_fast_string() or copy_mc_fragile(),
> which use EX_TYPE_DEFAULT_MCE_SAFE.**
> 
> In these scenarios, poisoned pages need to be isolated too. Therefore, a
> macro similar to MCE_IN_KERNEL_COPYIN is required. For this reason,
> rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC; the new macro can
> be applied to both user-to-kernel and kernel-to-kernel MC-safe copies.

Maybe my question wasn't clear: why is that renaming churn needed at
all? What are you "fixing" here?

What is the problem that you're encountering which needs fixing?

Thx.
Tong Tiangen Feb. 2, 2024, 7:51 a.m. UTC | #4
On 2024/2/1 22:20, Borislav Petkov wrote:
> On Thu, Feb 01, 2024 at 07:37:25PM +0800, Tong Tiangen wrote:
>> On 2024/1/31 15:02, Borislav Petkov wrote:
>>> On Thu, Jan 11, 2024 at 09:55:47PM +0800, Tong Tiangen wrote:
>>>> Currently, there are also some kernel memory copy scenarios that are
>>>> MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage().
>>>
>>> Both of those end up in copy_mc_enhanced_fast_string() which does
>>> EX_TYPE_DEFAULT_MCE_SAFE.
>>
>> OK, how about this commit msg change? :)
>>
>> Currently, there are also some kernel memory copy scenarios that are
>> MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage(); **both of
>> those end up in copy_mc_enhanced_fast_string() or copy_mc_fragile(),
>> which use EX_TYPE_DEFAULT_MCE_SAFE.**
>>
>> In these scenarios, poisoned pages need to be isolated too. Therefore, a
>> macro similar to MCE_IN_KERNEL_COPYIN is required. For this reason,
>> rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC; the new macro can
>> be applied to both user-to-kernel and kernel-to-kernel MC-safe copies.
> 
> Maybe my question wasn't clear: why is that renaming churn needed at
> all? What are you "fixing" here?
> 
> What is the problem that you're encountering which needs fixing?

This patch is a preparation patch and the next one is the fix; the
complete logic of the two patches is as follows:

The problem I'm encountering:
-------------------------------
In x86 MCE processing, error_context() sets the macro
MCE_IN_KERNEL_COPYIN to identify a copy from user space (user-to-kernel
copy) for fixup type EX_TYPE_UACCESS.

Then do_machine_check() uses MCE_IN_KERNEL_COPYIN to isolate the
poisoned page via memory_failure().

Currently, there are also some kernel memory copy scenarios that are
MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage(); these
kernel-to-kernel copies use fixup type EX_TYPE_DEFAULT_MCE_SAFE. In
these scenarios, poisoned pages need to be isolated too, and the current
implementation is to actively call memory_failure_queue() when the copy
fails.

Calling memory_failure_queue() separately at each call site is not a
good implementation; calling it uniformly from do_machine_check() is
more reasonable.
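
For reference, the pattern in question looks roughly like this (compare
the snippet quoted later in the thread; the -EHWPOISON return value is
illustrative, exact call sites in mm/ differ):

	/*
	 * Current approach: on a machine-check-aborted copy, the caller
	 * itself queues the poisoned source page for deferred handling.
	 */
	if (copy_mc_user_highpage(dst, src, addr, vma)) {
		memory_failure_queue(page_to_pfn(src), 0);
		return -EHWPOISON;
	}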

Solution:
----------
A macro similar to MCE_IN_KERNEL_COPYIN is required, so rename
MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY_MC. The new macro can be
applied to both user-to-kernel and kernel-to-kernel MC-safe copies: in
error_context(), set MCE_IN_KERNEL_COPY_MC for both fixup types
EX_TYPE_UACCESS and EX_TYPE_DEFAULT_MCE_SAFE, as sketched below.
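
A rough sketch of that error_context() change (the real change is in the
next patch; the MCE_IN_KERNEL_RECOV handling shown here just mirrors the
existing severity.c structure):

	switch (fixup_type) {
	case EX_TYPE_UACCESS:
		if (!copy_user)
			return IN_KERNEL;
		m->kflags |= MCE_IN_KERNEL_COPY_MC;
		fallthrough;

	case EX_TYPE_DEFAULT_MCE_SAFE:
		/* New: also flag kernel-to-kernel MC-safe copies */
		m->kflags |= MCE_IN_KERNEL_COPY_MC;
		fallthrough;

	case EX_TYPE_FAULT_MCE_SAFE:
		m->kflags |= MCE_IN_KERNEL_RECOV;
		return IN_KERNEL_RECOV;
	}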



Do you think it's clear to put it this way? Then we can merge the two
patches to make the complete logic clearer in the commit msg.

Many thanks.
Tong.

> 
> Thx.
>
Borislav Petkov Feb. 2, 2024, 1:39 p.m. UTC | #5
On Fri, Feb 02, 2024 at 03:51:12PM +0800, Tong Tiangen wrote:
> Currently, there are also some kernel memory copy scenarios that are
> MC-safe, using copy_mc_to_kernel() or copy_mc_user_highpage(); these
> kernel-to-kernel copies use fixup type EX_TYPE_DEFAULT_MCE_SAFE. In
> these scenarios, poisoned pages need to be isolated too, and the current

So you have, for example:

  unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigned len)

Now imagine you get an MCE for *dst, which is some kernel page that
cannot be poisoned: direct map, kernel text, and so on.

Attempting to poison such a page would not work, to put it mildly.

So, again, what *exactly* are you "fixing" here?

When I read "Currently, there are some kernel memory copy scenarios" and
there's nothing more explaining what those scenarios are, I'm tempted to
ignore this completely until you give a detailed and concrete example
what the problem is:

What exactly are you doing, what goes wrong, why does this need to be
fixed and so on...

If there isn't such a real-life use case you're encountering, then this
all is waste of time.

Thx.
Luck, Tony Feb. 2, 2024, 6:44 p.m. UTC | #6
> So you have, for example:
>
>  unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigned len)
>
> Now imagine you get an MCE for *dst, which is some kernel page that
> cannot be poisoned: direct map, kernel text, and so on.

At least on Intel you can only get a machine check for a LOAD operation on poison data.
Not for a STORE. I believe that is generally true - other arches to confirm.

So there can't be a machine check on "*dst".

-Tony
Borislav Petkov Feb. 2, 2024, 7:42 p.m. UTC | #7
On Fri, Feb 02, 2024 at 06:44:32PM +0000, Luck, Tony wrote:
> At least on Intel you can only get a machine check for a LOAD operation on poison data.
> Not for a STORE. I believe that is generally true - other arches to confirm.

So what happens if you store to a poisoned cacheline on Intel? It'll
raise a poison consumption error when that cacheline is loaded into the
cache? Because you need to load that line into the cache for writing,
I'd presume...

What happens if you have bits flipped in the cacheline you want to write
to?

That's fine because you're overwriting them anyway?

I'd presume ECC check gets performed on cacheline load and then you'll
have to raise an #MC...
Luck, Tony Feb. 2, 2024, 9:36 p.m. UTC | #8
> > At least on Intel you can only get a machine check for a LOAD operation on poison data.
> > Not for a STORE. I believe that is generally true - other arches to confirm.
>
> So what happens if you store to a poisoned cacheline on Intel? It'll
> raise a poison consumption error when that cacheline is loaded into the
> cache? Because you need to load that line into the cache for writing,
> I'd presume...

There are two places in the pipeline where poison is significant.

1) When the memory controller gets a request to fetch some data. If the ECC
check on the bits returned from the DIMMs fails, the memory controller will log
a "UCNA" signature error to a machine check bank for the memory channel
where the DIMMs live. If CMCI is enabled for that bank, then a CMCI is
sent to all logical CPUs that are in the scope of that bank (generally a
CPU socket). The data is marked with a POISON signature and passed
to the entity that requested it. Caches support this POISON signature
and preserve it as data is moved between caches, or written back to
memory. This may have been a prefetch or a speculative read. In these
cases there won't be a machine check. Linux uc_decode_notifier() will
try to offline pages when it sees UCNA signatures.

2) When a CPU core tries to retire an instruction that consumes poison
data, or needs to retire a poisoned instruction. These log an SRAR signature
into a core scoped bank (on most Xeons to date bank 0 for poisoned instructions,
bank 1 for poisoned data consumption). Then they signal a machine check.

> What happens if you have bits flipped in the cacheline you want to write
> to?
>
> That's fine because you're overwriting them anyway?
>
> I'd presume ECC check gets performed on cacheline load and then you'll
> have to raise an #MC...

Partial cacheline stores to data marked as POISON in the cache maintain
the poison status. Full cacheline writes (certainly with the MOVDIR64B instruction,
possibly with some AVX512 instructions) can clear the POISON status (since
you have all new data). A sequence of partial cache line stores that overwrite
all data in a cache line will NOT clear the POISON status.

Nothing is logged or signaled when updating data in the cache.

-Tony
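
A tiny illustration of the full-cacheline-write case described above; the
kernel wraps the instruction as movdir64b() in
arch/x86/include/asm/special_insns.h, and line_addr here is a
hypothetical 64-byte-aligned destination:

	/*
	 * Overwrite a whole 64-byte line in one shot. Per the description
	 * above this can clear the line's POISON status, whereas a
	 * sequence of partial stores covering the same 64 bytes would not.
	 */
	char new_data[64] __aligned(64) = { 0 };

	if (boot_cpu_has(X86_FEATURE_MOVDIR64B))
		movdir64b(line_addr, new_data);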
Borislav Petkov Feb. 2, 2024, 10:22 p.m. UTC | #9
On Fri, Feb 02, 2024 at 09:36:27PM +0000, Luck, Tony wrote:
> There are two places in the pipeline where poison is significant.
> 
> 1) When the memory controller gets a request to fetch some data. If the ECC
> check on the bits returned from the DIMMs fails, the memory controller will log
> a "UCNA" signature error to a machine check bank for the memory channel
> where the DIMMs live. If CMCI is enabled for that bank, then a CMCI is
> sent to all logical CPUs that are in the scope of that bank (generally a
> CPU socket). The data is marked with a POISON signature and passed
> to the entity that requested it. Caches support this POISON signature
> and preserve it as data is moved between caches, or written back to
> memory. This may have been a prefetch or a speculative read. In these
> cases there won't be a machine check. Linux uc_decode_notifier() will
> try to offline pages when it sees UCNA signatures.

Yap, deferred errors.

> 2) When a CPU core tries to retire an instruction that consumes poison
> data, or needs to retire a poisoned instruction. These log an SRAR signature
> into a core scoped bank (on most Xeons to date bank 0 for poisoned instructions,
> bank 1 for poisoned data consumption). Then they signal a machine check.

And that is the #MC on a poison data load thing. FWIW, the other vendor
does it very similarly.

> Partial cacheline stores to data marked as POISON in the cache maintain
> the poison status. Full cacheline writes (certainly with the MOVDIR64B instruction,
> possibly with some AVX512 instructions) can clear the POISON status (since
> you have all new data). A sequence of partial cache line stores that overwrite
> all data in a cache line will NOT clear the POISON status.

That's interesting - partial stores don't clear poison data.

> Nothing is logged or signaled when updating data in the cache.

Ok, no #MC on writing to poisoned cachelines.

Ok, so long story short, #MC only on loads. Good.

Now, since you're explaining things today :) pls explain to me what this
patchset is all about? You having reviewed patch 3 and all?

Why is this pattern:

	if (copy_mc_user_highpage(dst, src, addr, vma)) {
		memory_failure_queue(page_to_pfn(src), 0);

not good anymore?

Or is the goal here to poison straight from the #MC handler and not
waste time and potentially get another #MC while memory_failure_queue()
on the source address is done?

Or something completely different?

Thx.
Luck, Tony Feb. 2, 2024, 10:46 p.m. UTC | #10
> Now, since you're explaining things today :) pls explain to me what this
> patchset is all about? You having reviewed patch 3 and all?
>
> Why is this pattern:
>
>       if (copy_mc_user_highpage(dst, src, addr, vma)) {
>               memory_failure_queue(page_to_pfn(src), 0);
>
> not good anymore?
>
> Or is the goal here to poison straight from the #MC handler and not
> waste time and potentially get another #MC while memory_failure_queue()
> on the source address is done?
>
> Or something completely different?

See the comment above memory_failure_queue()

 * The function is primarily of use for corruptions that
 * happen outside the current execution context (e.g. when
 * detected by a background scrubber)

In the copy_mc_user_highpage() case the fault happens in
the current execution context. So scheduling someone else
to handle it at some future point is risky. Just deal with it
right away.

-Tony
Tong Tiangen Feb. 3, 2024, 7:56 a.m. UTC | #11
On 2024/2/3 6:46, Luck, Tony wrote:
>> Now, since you're explaining things today :) pls explain to me what this
>> patchset is all about? You having reviewed patch 3 and all?
>>
>> Why is this pattern:
>>
>>        if (copy_mc_user_highpage(dst, src, addr, vma)) {
>>                memory_failure_queue(page_to_pfn(src), 0);
>>
>> not good anymore?
>>
>> Or is the goal here to poison straight from the #MC handler and not
>> waste time and potentially get another #MC while memory_failure_queue()
>> on the source address is done?
>>
>> Or something completely different?
> 
> See the comment above memory_failure_queue()
> 
> * The function is primarily of use for corruptions that
>   * happen outside the current execution context (e.g. when
>   * detected by a background scrubber)
> 
> In the copy_mc_user_highpage() case the fault happens in
> the current execution context. So scheduling someone else
> to handle it at some future point is risky. Just deal with it
> right away.
> 
> -Tony

The goal of this patch:
   When #MC is triggered by copy_mc_user_highpage(), the #MC is handled
directly in the synchronously triggered path do_machine_check() ->
kill_me_never() -> memory_failure().

The current handling instead calls memory_failure_queue() ->
schedule_work_on() from the execution context; I think that's what
"scheduling someone else to handle it at some future point is risky"
refers to.

Thanks.
Tong.
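
Put side by side (simplified sketch, not the exact kernel code), the two
paths are:

	/*
	 * Deferred path, used today by kernel-to-kernel MC-safe copies:
	 * the copy's caller queues the pfn and memory_failure() runs
	 * later from a workqueue.
	 */
	memory_failure_queue(pfn, 0);		/* -> schedule_work_on() */

	/*
	 * Synchronous path, already used for copy-from-user and extended
	 * by this series: do_machine_check() queues task work that runs
	 * memory_failure() before the interrupted task resumes.
	 */
	if (m.kflags & MCE_IN_KERNEL_COPY_MC)
		queue_task_work(&m, msg, kill_me_never);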
Borislav Petkov Feb. 3, 2024, 9:43 a.m. UTC | #12
On Sat, Feb 03, 2024 at 03:56:04PM +0800, Tong Tiangen wrote:
> The goal of this patch:
>   When #MC is triggered by copy_mc_user_highpage(), the #MC is handled
> directly in the synchronously triggered path do_machine_check() ->
> kill_me_never() -> memory_failure().
> 
> The current handling instead calls memory_failure_queue() ->
> schedule_work_on() from the execution context; I think that's what
> "scheduling someone else to handle it at some future point is risky"
> refers to.

Ok, now take everything that was discussed on the thread and use it to
rewrite all your commit messages to explain *why* you're doing this, not
*what* you're doing - that is visible from the diff.

A possible way to structure them is:

1. Prepare the context for the explanation briefly.

2. Explain the problem at hand.

3. "It happens because of <...>"

4. "Fix it by doing X"

5. "(Potentially do Y)."

And some of those above are optional depending on the issue being
explained.

For more detailed info, see
Documentation/process/submitting-patches.rst,
Section "2) Describe your changes".

Also, to the tone, from Documentation/process/submitting-patches.rst:

 "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

Also, do not talk about what your patch does - that should (hopefully) be
visible from the diff itself. Rather, talk about *why* you're doing what
you're doing.

Thx.
Tong Tiangen Feb. 4, 2024, 1:52 a.m. UTC | #13
On 2024/2/3 17:43, Borislav Petkov wrote:
> On Sat, Feb 03, 2024 at 03:56:04PM +0800, Tong Tiangen wrote:
>> The goal of this patch:
>>    When #MC is triggered by copy_mc_user_highpage(), the #MC is handled
>> directly in the synchronously triggered path do_machine_check() ->
>> kill_me_never() -> memory_failure().
>>
>> The current handling instead calls memory_failure_queue() ->
>> schedule_work_on() from the execution context; I think that's what
>> "scheduling someone else to handle it at some future point is risky"
>> refers to.
> 
> Ok, now take everything that was discussed on the thread and use it to
> rewrite all your commit messages to explain *why* you're doing this, not
> *what* you're doing - that is visible from the diff.
> 
> A possible way to structure them is:
> 
> 1. Prepare the context for the explanation briefly.
> 
> 2. Explain the problem at hand.
> 
> 3. "It happens because of <...>"
> 
> 4. "Fix it by doing X"
> 
> 5. "(Potentially do Y)."
> 
> And some of those above are optional depending on the issue being
> explained.
> 
> For more detailed info, see
> Documentation/process/submitting-patches.rst,
> Section "2) Describe your changes".
> 
> Also, to the tone, from Documentation/process/submitting-patches.rst:
> 
>   "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>    instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>    to do frotz", as if you are giving orders to the codebase to change
>    its behaviour."
> 
> Also, do not talk about what your patch does - that should (hopefully) be
> visible from the diff itself. Rather, talk about *why* you're doing what
> you're doing.

OK, will improve this in the next version.

Thanks.
Tong.

> 
> Thx.
>

Patch

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index de3118305838..cb628ab2f32f 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -151,11 +151,11 @@ 
 
 /*
  * Indicates an MCE that happened in kernel space while copying data
- * from user. In this case fixup_exception() gets the kernel to the
- * error exit for the copy function. Machine check handler can then
- * treat it like a fault taken in user mode.
+ * from user or kernel. In this case fixup_exception() gets the kernel
+ * to the error exit for the copy function. Machine check handler can
+ * then treat it like a fault taken in user or kernel mode.
  */
-#define MCE_IN_KERNEL_COPYIN	BIT_ULL(7)
+#define MCE_IN_KERNEL_COPY_MC	BIT_ULL(7)
 
 /*
  * This structure contains all data related to the MCE log.  Also
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2bfd0e3c62e4..bd1a31ed148b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1607,7 +1607,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 				mce_panic("Failed kernel mode recovery", &m, msg);
 		}
 
-		if (m.kflags & MCE_IN_KERNEL_COPYIN)
+		if (m.kflags & MCE_IN_KERNEL_COPY_MC)
 			queue_task_work(&m, msg, kill_me_never);
 	}
 
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index bca780fa5e57..df67a7a13034 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -292,7 +292,7 @@  static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	case EX_TYPE_UACCESS:
 		if (!copy_user)
 			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
+		m->kflags |= MCE_IN_KERNEL_COPY_MC;
 		fallthrough;
 
 	case EX_TYPE_FAULT_MCE_SAFE: