
[v5] x86/mce: Avoid infinite loop for copy from user recovery

Message ID 20210125225509.GA7149@agluck-desk2.amr.corp.intel.com (mailing list archive)
State New

Commit Message

Luck, Tony Jan. 25, 2021, 10:55 p.m. UTC
On Thu, Jan 21, 2021 at 01:09:59PM -0800, Luck, Tony wrote:
> On Wed, Jan 20, 2021 at 01:18:12PM +0100, Borislav Petkov wrote:
> But, on a whim, I changed the type of mce_count from "int" to "atomic_t" and
> fixed up the increment & clear to use atomic_inc_return() and atomic_set().
> See updated patch below.
> 
> I can't see why this should have made any difference. But the "int" version
> crashes in a few dozen machine checks. The "atomic_t" version has run many
> thousands of machine checks (10213 in the current boot according to /proc/interrupts)
> with no issues.

And now I've changed it back to non-atomic (but keeping the
slightly cleaner looking code style that I used for the atomic
version).  This one also works for thousands of injections and
recoveries.  Maybe take it now before it stops working again :-)

-Tony

From: Tony Luck <tony.luck@intel.com>

When get_user() triggers a machine check, the recovery action uses the fixup
path to make get_user() return -EFAULT.  queue_task_work() also arranges
for kill_me_maybe() to be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check, queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before the task_work is executed. The first machine check saves the address
information and calls task_work_add(). Subsequent machine checks taken before
that task_work callback executes check that the address is in the same page as
the first machine check (since the callback will offline exactly one page).

The expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/core.c | 35 +++++++++++++++++++++++++++-------
 include/linux/sched.h          |  1 +
 2 files changed, 29 insertions(+), 7 deletions(-)

Comments

Borislav Petkov Jan. 26, 2021, 11:03 a.m. UTC | #1
On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> And now I've changed it back to non-atomic (but keeping the
> slightly cleaner looking code style that I used for the atomic
> version).  This one also works for thousands of injections and
> recoveries.  Maybe take it now before it stops working again :-)

Hmm, so the only differences I see between your v4 and this are:

-@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
+@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
  
  static void kill_me_now(struct callback_head *ch)
  {
++	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
++
 +	p->mce_count = 0;
  	force_sig(SIGBUS);
  }

Could the container_of() macro have changed something?

Because we don't know yet (right?) why it would fail. Would it read
stale ->mce_count data? If so, then a barrier is missing somewhere.

Or what is the failure exactly?

Because if I take it now without us knowing what the issue is, it will
start failing somewhere - Murphy's our friend - and then we'll have to
deal with breaking people's boxes. Not fun.

The other difference is:

@@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
 -	current->mce_kflags = m->kflags;
 -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
 -	current->mce_whole_page = whole_page(m);
++	int count = ++current->mce_count;
++
 +	/* First call, save all the details */
-+	if (current->mce_count++ == 0) {
++	if (count == 1) {
 +		current->mce_addr = m->addr;
 +		current->mce_kflags = m->kflags;
 +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);

Hmm, a local variable and a pre-increment. Can that have an effect somehow?

> +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */

Typo: likely.
Luck, Tony Jan. 26, 2021, 10:36 p.m. UTC | #2
On Tue, Jan 26, 2021 at 12:03:14PM +0100, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> > And now I've changed it back to non-atomic (but keeping the
> > slightly cleaner looking code style that I used for the atomic
> > version).  This one also works for thousands of injections and
> > recoveries.  Maybe take it now before it stops working again :-)
> 
> Hmm, so the only differences I see between your v4 and this are:
> 
> -@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> +@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>   
>   static void kill_me_now(struct callback_head *ch)
>   {
> ++	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
> ++
>  +	p->mce_count = 0;
>   	force_sig(SIGBUS);
>   }
> 
> Could the container_of() macro have changed something?

That change was to fix my brown paper bag moment (the code does not
compile without a variable named "p" in scope to be used on the
next line).

> Because we don't know yet (right?) why would it fail? Would it read
> stale ->mce_count data? If so, then a barrier is missing somewhere.

I don't see how a barrier would make a difference. In the common case
all this code is executed on the same logical CPU. Return from the
do_machine_check() tries to return to user mode and finds that there
is some "task_work" to execute first.

In some cases Linux might context switch to something else. Perhaps
this task even gets picked up by another CPU to run the task work
queued functions.  But I imagine that the context switch should act
as a barrier ... shouldn't it?

> Or what is the failure exactly?

After a few cycles of the test injection to user mode, I saw an
overflow in the machine check bank. As if it hadn't been cleared
from the previous iteration ... but all the banks are cleared as
soon as we find that the machine check is recoverable. A while before
getting to the code I changed.

When the tests were failing, code was on top of v5.11-rc3. Latest
experiments moved to -rc5.  There's just a tracing fix from
PeterZ between rc3 and rc5 to mce/core.c:

737495361d44 ("x86/mce: Remove explicit/superfluous tracing")

which doesn't appear to be a candidate for the problems I saw.

> Because if I take it now without us knowing what the issue is, it will
> start failing somewhere - Murphy's our friend - and then we'll have to
> deal with breaking people's boxes. Not fun.

Fair point.

> The other difference is:
> 
> @@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
>  -	current->mce_kflags = m->kflags;
>  -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
>  -	current->mce_whole_page = whole_page(m);
> ++	int count = ++current->mce_count;
> ++
>  +	/* First call, save all the details */
> -+	if (current->mce_count++ == 0) {
> ++	if (count == 1) {
>  +		current->mce_addr = m->addr;
>  +		current->mce_kflags = m->kflags;
>  +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> 
> Hmm, a local variable and a pre-increment. Can that have an effect somehow?

This is the bit that changed during my detour using atomic_t mce_count.
I added the local variable to capture value from atomic_inc_return(), then
used it later, instead of a bunch of atomic_read() calls.

I kept it this way because "if (count == 1)" is marginally easier to read
than "if (current->mce_count++ == 0)"

> > +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
> 
> Typo: likely.

Oops. Fixed.

-Tony
Borislav Petkov Jan. 28, 2021, 5:57 p.m. UTC | #3
On Tue, Jan 26, 2021 at 02:36:05PM -0800, Luck, Tony wrote:
> In some cases Linux might context switch to something else. Perhaps
> this task even gets picked up by another CPU to run the task work
> queued functions.  But I imagine that the context switch should act
> as a barrier ... shouldn't it?

I'm given to understand that the #MC from user is likely to schedule and
a context switch has a barrier character.

> After a few cycles of the test injection to user mode, I saw an
> overflow in the machine check bank. As if it hadn't been cleared
> from the previous iteration ...

This sounds weird. As if something else is happening which we haven't
thought of yet...

> When the tests were failing, code was on top of v5.11-rc3. Latest
> experiments moved to -rc5.  There's just a tracing fix from
> PeterZ between rc3 and rc5 to mce/core.c:
> 
> 737495361d44 ("x86/mce: Remove explicit/superfluous tracing")
> 
> which doesn't appear to be a candidate for the problems I saw.

Doesn't look like it.

> This is the bit that changed during my detour using atomic_t mce_count.
> I added the local variable to capture value from atomic_inc_return(), then
> used it later, instead of a bunch of atomic_read() calls.
> 
> I kept it this way because "if (count == 1)" is marginally easier to read
> than "if (current->mce_count++ == 0)"

Right.

So still no explanation why it would fail before. ;-\

Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
if you apply the patch on -rc3 and it explodes and if you apply the same
patch on -rc5 and it works, then that could be a start... Yeah, don't
have a better idea here. :-\
Luck, Tony Feb. 1, 2021, 6:58 p.m. UTC | #4
On Thu, Jan 28, 2021 at 06:57:35PM +0100, Borislav Petkov wrote:
> Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
> if you apply the patch on -rc3 and it explodes and if you apply the same
> patch on -rc5 and it works, then that could be a start... Yeah, don't
> have a better idea here. :-\

I tried reproducing (applied the original patch I posted back to -rc3) and
the same issue stubbornly refused to show up again.

But I did hit something with the same signature (overflow bit set in
bank 1) while running my futex test (which has two processes mapping
the poison page).  This time I *do* understand what happened.  The test
failed when the two processes were running on the two hyperthreads of
the same core. Seeing overflow in this case is understandable because
bank 1 MSRs on my test machine are shared between the HT threads. When
I run the test again using taskset(1) to only allow running on
thread 0 of each core, it keeps going for hundreds of iterations.

I'm not sure I can stitch together how this overflow also happened for
my single process test. Maybe a migration from one HT thread to the
other at an awkward moment?

-Tony
Borislav Petkov Feb. 2, 2021, 11:01 a.m. UTC | #5
On Mon, Feb 01, 2021 at 10:58:12AM -0800, Luck, Tony wrote:
> On Thu, Jan 28, 2021 at 06:57:35PM +0100, Borislav Petkov wrote:
> > Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
> > if you apply the patch on -rc3 and it explodes and if you apply the same
> > patch on -rc5 and it works, then that could be a start... Yeah, don't
> > have a better idea here. :-\
> 
> I tried reproducing (applied the original patch I posted back to -rc3) and
> the same issue stubbornly refused to show up again.
> 
> But I did hit something with the same signature (overflow bit set in
> bank 1) while running my futex test (which has two processes mapping
> the poison page).  This time I *do* understand what happened.  The test
> failed when the two processes were running on the two hyperthreads of
> the same core. Seeing overflow in this case is understandable because
> bank 1 MSRs on my test machine are shared between the HT threads. When
> I run the test again using taskset(1) to only allow running on
> thread 0 of each core, it keeps going for hundreds of iterations.
> 
> I'm not sure I can stitch together how this overflow also happened for
> my single process test. Maybe a migration from one HT thread to the
> other at an awkward moment?

Sounds plausible.

And the much more important question is, what is the code supposed to
do when that overflow *actually* happens in real life? Because IINM,
an overflow condition on the same page would mean killing the task to
contain the error and not killing the machine...
Luck, Tony Feb. 2, 2021, 4:04 p.m. UTC | #6
> And the much more important question is, what is the code supposed to
> do when that overflow *actually* happens in real life? Because IINM,
> an overflow condition on the same page would mean killing the task to
> contain the error and not killing the machine...

Correct. In the cases I've actually hit, the second machine check is on the
same address as the first. But from a recovery perspective Linux is going
to take away the whole page anyway ... so not complaining if the second
(or subsequent) access is within the same page makes sense (and that's
what the patch does).

The code can't handle it if a subsequent #MC is to a different page (because
we only have a single spot in the task structure to store the physical page
address).  But that looks adequate. If the code is wildly accessing different
pages *and* getting machine checks from those different pages ... then
something is very seriously wrong with the system.

-Tony
Borislav Petkov Feb. 2, 2021, 9:06 p.m. UTC | #7
On Tue, Feb 02, 2021 at 04:04:17PM +0000, Luck, Tony wrote:
> > And the much more important question is, what is the code supposed to
> > do when that overflow *actually* happens in real life? Because IINM,
> > an overflow condition on the same page would mean killing the task to
> > contain the error and not killing the machine...
> 
> Correct. The cases I've actually hit, the second machine check is on the
> same address as the first. But from a recovery perspective Linux is going
> to take away the whole page anyway ... so not complaining if the second
> (or subsequent) access is within the same page makes sense (and that's
> what the patch does).
> 
> The code can't handle it if a subsequent #MC is to a different page (because
> we only have a single spot in the task structure to store the physical page
> address).  But that looks adequate. If the code is wildly accessing different
> pages *and* getting machine checks from those different pages ... then
> something is very seriously wrong with the system.

Right, that's clear.

But I meant something else: this thread had somewhere upthread:

"But the real validation folks took my patch and found that it has
destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
more times). System either hangs or panics. Generally before 100
injection/consumption cycles."

To which I asked:

"Or what is the failure exactly?"

and you said:

"After a few cycles of the test injection to user mode, I saw an
overflow in the machine check bank. As if it hadn't been cleared from
the previous iteration ... but all the banks are cleared as soon as we
find that the machine check is recoverable. A while before getting to
the code I changed."

And you also said:

"I tried reproducing (applied the original patch I posted back to -rc3)
and the same issue stubbornly refused to show up again."

So that "system hang or panic" which the validation folks triggered,
that cannot be reproduced anymore? Did they run the latest version of
the patch?

And the overflow issue is something else which the patch handles fine,
as you say.

So that original mysterious hang is still unsolved.

Or am I missing something and am misrepresenting the situation?

Thx.
Luck, Tony Feb. 2, 2021, 10:12 p.m. UTC | #8
> So that "system hang or panic" which the validation folks triggered,
> that cannot be reproduced anymore? Did they run the latest version of
> the patch?

I will get the validation folks to run the latest version (and play around with
hyperthreading if they see problems).

-Tony

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e133ce1e562b..30dedffd6f88 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1238,6 +1238,9 @@  static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+	p->mce_count = 0;
 	force_sig(SIGBUS);
 }
 
@@ -1246,6 +1249,7 @@  static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 
+	p->mce_count = 0;
 	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
 
 	if (!p->mce_ripv)
@@ -1266,12 +1270,29 @@  static void kill_me_maybe(struct callback_head *cb)
 	}
 }
 
-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	int count = ++current->mce_count;
+
+	/* First call, save all the details */
+	if (count == 1) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
+	}
+
+	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
+	if (count > 10)
+		mce_panic("Too many machine checks while accessing user data", m, msg);
+
+	/* Second or later call, make sure page address matches the one from first call */
+	if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+		mce_panic("Machine checks to different user pages", m, msg);
+
+	/* Do not call task_work_add() more than once */
+	if (count > 1)
+		return;
 
 	if (kill_current_task)
 		current->mce_kill_me.func = kill_me_now;
@@ -1414,7 +1435,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-		queue_task_work(&m, kill_current_task);
+		queue_task_work(&m, msg, kill_current_task);
 
 	} else {
 		/*
@@ -1432,7 +1453,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		}
 
 		if (m.kflags & MCE_IN_KERNEL_COPYIN)
-			queue_task_work(&m, kill_current_task);
+			queue_task_work(&m, msg, kill_current_task);
 	}
 out:
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..386366c9c757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1362,6 +1362,7 @@  struct task_struct {
 					mce_whole_page : 1,
 					__mce_reserved : 62;
 	struct callback_head		mce_kill_me;
+	int				mce_count;
 #endif
 
 #ifdef CONFIG_KRETPROBES