
[RFC,2/3] x86, MCE: Avoid potential deadlock in MCE context

Message ID 1405478082-30757-3-git-send-email-gong.chen@linux.intel.com (mailing list archive)
State Not Applicable, archived

Commit Message

Chen Gong July 16, 2014, 2:34 a.m. UTC
When an uncorrected error happens, an MCE is raised and some
notification callbacks are called in MCE context. If a callback
calls printk, it can deadlock because MCE can preempt normal
interrupts, just as NMI does.

Since printk is not safe in MCE context, a lock-less memory
allocator (genpool) is used to save the information, which is
organized via a lock-less list. Printing is deferred to IRQ
context via irq_work. This idea is inspired by the APEI/GHES driver.

Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
---
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/asm/mce.h                |   2 +-
 arch/x86/kernel/cpu/mcheck/Makefile       |   2 +-
 arch/x86/kernel/cpu/mcheck/mce-apei.c     |   2 +-
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 109 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   9 +++
 arch/x86/kernel/cpu/mcheck/mce.c          |  21 ++++--
 arch/x86/kernel/cpu/mcheck/mce_amd.c      |   2 +-
 8 files changed, 137 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mcheck/mce-genpool.c
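
In short, the mechanism works like this (a condensed, illustrative sketch of the pattern with simplified names; the real code is in mce-genpool.c at the bottom of this page):

/* Illustrative sketch of the defer-to-irq_work pattern (names simplified). */
#include <linux/kernel.h>
#include <linux/genalloc.h>
#include <linux/llist.h>
#include <linux/irq_work.h>
#include <asm/mce.h>

struct evt {
	struct llist_node node;
	struct mce m;
};

static struct gen_pool *pool;	/* lock-less allocator, usable in #MC */
static LLIST_HEAD(event_list);	/* lock-less list of pending records */
static struct irq_work work;	/* init_irq_work(&work, drain_events) at init */

/* #MC context: no printk, no locks; copy the record and kick irq_work. */
static void queue_event(struct mce *m)
{
	struct evt *e = (struct evt *)gen_pool_alloc(pool, sizeof(*e));

	if (!e)
		return;			/* pool exhausted: drop the record */
	e->m = *m;
	llist_add(&e->node, &event_list);
	irq_work_queue(&work);
}

/* IRQ context: printk and the notifier chain are safe here. */
static void drain_events(struct irq_work *w)
{
	struct llist_node *head = llist_reverse_order(llist_del_all(&event_list));
	struct evt *e, *tmp;

	llist_for_each_entry_safe(e, tmp, head, node) {
		pr_err("MCE: bank %d status 0x%llx\n", e->m.bank, e->m.status);
		gen_pool_free(pool, (unsigned long)e, sizeof(*e));
	}
}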

Comments

Borislav Petkov July 21, 2014, 8:47 a.m. UTC | #1
On Tue, Jul 15, 2014 at 10:34:41PM -0400, Chen, Gong wrote:
> When an uncorrected error happens, an MCE is raised and some
> notification callbacks are called in MCE context. If a callback
> calls printk, it can deadlock because MCE can preempt normal
> interrupts, just as NMI does.
>
> Since printk is not safe in MCE context, a lock-less memory
> allocator (genpool) is used to save the information, which is
> organized via a lock-less list. Printing is deferred to IRQ
> context via irq_work. This idea is inspired by the APEI/GHES driver.

This patch is overengineered even though we already have both process
context work and irq work facilities in place.

We also already have mce_ring where we add MCE signatures in #MC
context. Well, only for AO errors with usable addresses for now, at
least.

And we empty that ring in mce_process_work().

I think it would be a *lot* simpler if you modify the logic to put all
errors into the ring and remove the call chain call from mce_log(). I'm
looking at mce_report_event() which even does the irq_work stuff if
we need to raise an IPI *during* the MCE so that stuff gets processed
*before* we return to userspace.
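
For reference, mce_ring is essentially a small per-CPU ring of pfns, filled in #MC context and drained from process context by mce_process_work(); a simplified sketch (not the exact mce.c code):

#include <linux/percpu.h>

#define RING_SIZE 16			/* small fixed capacity */

struct pfn_ring {
	unsigned short start;		/* consumer index */
	unsigned short end;		/* producer index */
	unsigned long pfn[RING_SIZE];
};
static DEFINE_PER_CPU(struct pfn_ring, pfn_ring);

/* #MC context: remember the pfn of an action-optional error. */
static int ring_add(unsigned long pfn)
{
	struct pfn_ring *r = this_cpu_ptr(&pfn_ring);
	unsigned int next = (r->end + 1) % RING_SIZE;

	if (next == r->start)
		return -1;		/* ring full: drop it */
	r->pfn[r->end] = pfn;
	smp_wmb();			/* publish the entry before the index */
	r->end = next;
	return 0;
}

/* Process context (mce_process_work()): drain and offline the pages. */
static int ring_get(unsigned long *pfn)
{
	struct pfn_ring *r = this_cpu_ptr(&pfn_ring);

	if (r->start == r->end)
		return 0;
	*pfn = r->pfn[r->start];
	r->start = (r->start + 1) % RING_SIZE;
	return 1;
}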
Tony Luck July 21, 2014, 5:14 p.m. UTC | #2
> This patch is overengineered even though we already have both process
> context work and irq work facilities in place.
>
> We also already have mce_ring where we add MCE signatures in #MC
> context. Well, only for AO errors with usable addresses for now, at
> least.

We've evolved a bunch of mechanisms:

1) mce_ring: to pass pfn for AO errors from MCE context to a work thread
2) mce_info: to pass pfn for AR errors from MCE context to same process running in process context
3) mce_log: to pass entire "mce" structures from any context (MCE, CMCI, or init-time) to /dev/mcelog

something simpler might be nice - but a generic thing that is overkill for each of the
specialized uses might not necessarily be an improvement.

E.g. #3 above has a fixed capacity (MCE_LOG_LEN) and just drops any extras if it should fill
up (deliberately, because we almost always prefer to see the first bunch of errors rather
than the newest).
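
(Illustrative only: the drop-when-full semantics described above boil down to roughly this; the real mce_log() allocates entries lock-free and sets an overflow flag when it drops one.)

#define LOG_LEN 32			/* fixed capacity, like MCE_LOG_LEN */

static struct mce buf[LOG_LEN];
static unsigned int next_slot;

/* Returns 0 if stored, -ENOSPC if the newest record was dropped. */
static int log_record(const struct mce *m)
{
	if (next_slot >= LOG_LEN)
		return -ENOSPC;		/* full: keep the oldest, drop the newest */
	buf[next_slot++] = *m;
	return 0;
}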

> I think it would be a *lot* simpler if you modify the logic to put all
> errors into the ring and remove the call chain call from mce_log().

I was actually wondering about going in the other direction. Make the
/dev/mcelog code register a notifier on x86_mce_decoder_chain (and
perhaps move all the /dev/mcelog functions out of mce.c into an actual
driver file).  Then use Chen Gong's NMI safe code to just unconditionally
make safe copies of anything that gets passed to mce_log() and run all
the notifiers from his do_mce_irqwork().

-Tony
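
A sketch of what that suggestion could look like, building on mce_evt_save_ll() and do_mce_irqwork() from the patch at the bottom of this page (illustrative only; the mcelog notifier shown here is hypothetical):

#include <linux/notifier.h>
#include <asm/mce.h>
#include "mce-internal.h"

/*
 * mce_log() would no longer need an in_mce flag: it always makes an NMI-safe
 * copy via the lock-less pool and lets do_mce_irqwork() run the notifiers.
 */
void mce_log(struct mce *m)
{
	mce_evt_save_ll(m);	/* lock-less copy + irq_work_queue() */
}

/* /dev/mcelog becomes just another notifier on x86_mce_decoder_chain. */
static int mcelog_notify(struct notifier_block *nb, unsigned long unused,
			 void *data)
{
	/* data is the struct mce; copy it into the /dev/mcelog buffer here */
	return NOTIFY_OK;
}

static struct notifier_block mcelog_nb = {
	.notifier_call	= mcelog_notify,
};
/* registered once at init time: mce_register_decode_chain(&mcelog_nb); */
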
Borislav Petkov July 21, 2014, 9:41 p.m. UTC | #3
On Mon, Jul 21, 2014 at 05:14:06PM +0000, Luck, Tony wrote:
> We've evolved a bunch of mechanisms:
> 
> 1) mce_ring: to pass pfn for AO errors from MCE context to a work thread
> 2) mce_info: to pass pfn for AR errors from MCE context to same process running in process context
> 3) mce_log: to pass entire "mce" structures from any context (MCE, CMCI, or init-time) to /dev/mcelog
> 
> something simpler might be nice - but a generic thing that is overkill for each of the
> specialized uses might not necessarily be an improvement.
> 
> E.g. #3 above has a fixed capacity (MCE_LOG_LEN) and just drops any extras if it should fill

Gong's too. Famous last words:

	/* two pages should be enough */
	pages = 2;

> up (deliberately, because we almost always prefer to see the first bunch of errors rather
> than the newest).
> 
> > I think it would be a *lot* simpler if you modify the logic to put all
> > errors into the ring and remove the call chain call from mce_log().
> 
> I was actually wondering about going in the other direction. Make the
> /dev/mcelog code register a notifier on x86_mce_decoder_chain (and
> perhaps move all the /dev/mcelog functions out of mce.c into an actual
> driver file).

For easier deletion later. :-P

> Then use Chen Gong's NMI safe code to just unconditionally make safe
> copies of anything that gets passed to mce_log() and run all the
> notifiers from his do_mce_irqwork().

And drop all the homegrown other stuff like mce_ring and all? If this
gets designed right and it is well thought out - not hastily coded out -
it will probably be better, yes.
Tony Luck July 21, 2014, 10:03 p.m. UTC | #4
> And drop all the homegrown other stuff like mce_ring and all?

mce_ring should be easy ... the "mce" structure has the address
from which we can easily get the pfn to pass into the action-optional
recovery path.  Only thing missing is a direct indication that this mce
does contain an AO error that needs to be processed. We could
re-invoke mce_severity() to figure it out again - or just add a flag
somewhere.

Not so sure about mce_info. This one passes from the
MCE context to the same task when we catch it in process
context (set_thread_flag(MCE_NOTIFY)).  Back when I was pushing
this code, I really wanted to just add a field to the thread_info
structure to hold the address ... because this really is some
information that belongs to the thread. But I was unable to
convince people back then.  We must be able to find the
page frame when we arrive in mce_notify_process(). So we
can't stash it in some limited size pool of "mce" structures that
might decide to just drop this one.

-Tony
Borislav Petkov July 22, 2014, 5:26 p.m. UTC | #5
On Mon, Jul 21, 2014 at 10:03:04PM +0000, Luck, Tony wrote:
> > And drop all the homegrown other stuff like mce_ring and all?
> 
> mce_ring should be easy ... the "mce" structure has the address
> from which we can easily get the pfn to pass into the action-optional
> recovery path.  Only thing missing is a direct indication that this mce
> does contain an AO error that needs to be processed. We could
> re-invoke mce_severity() to figure it out again - or just add a flag
> somewhere.

Right, so if we're going to clean up this mess, I think we should strive
for having a single ring buffer which contains all the MCEs and which we
can iterate over at leisure, either in IRQ context if some of them have
been reported through real exceptions or in process context if they're
simple CEs.

Once they've been eaten by something, we simply remove them from that
buffer and that's it. But sure, one lockless buffer which works in all
contexts would be much better than growing stuff here and there with
different semantics and context usage.

Thanks.
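
Concretely, inside mce-genpool.c from the patch below, a single drain routine could serve both contexts (a sketch of that idea; the workqueue consumer is an assumption, it is not in the posted patch):

/*
 * Sketch: one lock-less buffer, two consumers.  Records logged from #MC/IRQ
 * context get drained via irq_work, polled CEs via a workqueue (the latter
 * is an assumption, not part of the posted patch).  llist_del_all() is safe
 * with multiple consumers.  Assumes mce-genpool.c's includes plus
 * <linux/workqueue.h> and <linux/notifier.h>.
 */
static void mce_drain_events(void)
{
	struct llist_node *head;
	struct mce_evt_llist *node, *tmp;

	head = llist_del_all(&mce_event_llist);
	head = llist_reverse_order(head);	/* oldest first */

	llist_for_each_entry_safe(node, tmp, head, llnode) {
		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &node->mce);
		gen_pool_free(mce_evt_pool, (unsigned long)node, mce_evt_len);
	}
}

static void mce_drain_irqwork(struct irq_work *w)   { mce_drain_events(); }
static void mce_drain_workfn(struct work_struct *w) { mce_drain_events(); }
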
Tony Luck July 22, 2014, 9:24 p.m. UTC | #6
On Tue, Jul 22, 2014 at 10:26 AM, Borislav Petkov <bp@alien8.de> wrote:
> Once they've been eaten by something, we simply remove them from that
> buffer and that's it.

swap that for

Once everyone who registered a notifier has had a chance to see each
logged entry, we simply remove ...

I'm not a fan of the current NOTIFY_STOP behavior where one registrant
can say they are so important that nobody else should be allowed to see
what was logged.

-Tony
Chen Gong July 23, 2014, 7:48 a.m. UTC | #7
On Mon, Jul 21, 2014 at 05:14:06PM +0000, Luck, Tony wrote:
> We've evolved a bunch of mechanisms:
> 
> 1) mce_ring: to pass pfn for AO errors from MCE context to a work thread
> 2) mce_info: to pass pfn for AR errors from MCE context to same process running in process context
> 3) mce_log: to pass entire "mce" structures from any context (MCE, CMCI, or init-time) to /dev/mcelog
> 
> I was actually wondering about going in the other direction. Make the
> /dev/mcelog code register a notifier on x86_mce_decoder_chain (and
> perhaps move all the /dev/mcelog functions out of mce.c into an actual
> driver file).  Then use Chen Gong's NMI safe code to just unconditionally
> make safe copies of anything that gets passed to mce_log() and run all
> the notifiers from his do_mce_irqwork().
> 
> -Tony

OK, I can cook some patches based on Tony's suggestion:
patch 1: add a generic lock-less memory pool to save error records
patch 2: remove mce_info (Tony has done a draft)
patch 3: remove mce_ring
patch 4: remove mce log buffer
patch 5: move all mce log related logic into a separate file like mcelog.c
         in the same directory as mce.c

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a8f749e..2632732 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -861,6 +861,7 @@  config X86_REROUTE_FOR_BROKEN_BOOT_IRQS
 
 config X86_MCE
 	bool "Machine Check / overheating reporting"
+	select GENERIC_ALLOCATOR
 	default y
 	---help---
 	  Machine Check support allows the processor to notify the
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 958b90f..e90d8ac 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -141,7 +141,7 @@  static inline void enable_p5_mce(void) {}
 #endif
 
 void mce_setup(struct mce *m);
-void mce_log(struct mce *m);
+void mce_log(struct mce *m, int);
 DECLARE_PER_CPU(struct device *, mce_device);
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
index bb34b03..a3311c8 100644
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ b/arch/x86/kernel/cpu/mcheck/Makefile
@@ -1,4 +1,4 @@ 
-obj-y				=  mce.o mce-severity.o
+obj-y				=  mce.o mce-severity.o mce-genpool.o
 
 obj-$(CONFIG_X86_ANCIENT_MCE)	+= winchip.o p5.o
 obj-$(CONFIG_X86_MCE_INTEL)	+= mce_intel.o
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index a1aef95..7b92089 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -56,7 +56,7 @@  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 		m.status |= MCI_STATUS_PCC;
 
 	m.addr = mem_err->physical_addr;
-	mce_log(&m);
+	mce_log(&m, 0);
 	mce_notify_irq();
 }
 EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
new file mode 100644
index 0000000..1c08c37
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -0,0 +1,109 @@ 
+/*
+ * Print memory pool management in MCE context
+ *
+ * Copyright (C) 2014 Intel Corp.
+ * Author: Chen, Gong <gong.chen@linux.intel.com>
+ *
+ * This file is licensed under GPLv2.
+ */
+#include <linux/mm.h>
+#include <linux/genalloc.h>
+#include <linux/llist.h>
+#include <linux/irq_work.h>
+#include "mce-internal.h"
+
+/*
+ * printk is not safe in MCE context. So a lock-less memory allocator
+ * (genpool) is used to save information which are organized via a lock-less
+ * list. Print will be delayed into IRQ context via irq_work.
+ */
+static struct gen_pool *mce_evt_pool;
+static struct llist_head mce_event_llist;
+static struct irq_work mce_irqwork;
+static int mce_evt_len;
+static bool pool_inactive;
+
+/*
+ * This memory pool is only to be used to save MCE records in MCE context.
+ * The MCE event is rare so a fixed size memory pool should be enough.
+ */
+static int mce_genpool_create(void)
+{
+	int i, ret, pages;
+	unsigned long addr;
+
+	mce_evt_len = sizeof(struct mce_evt_llist);
+	mce_evt_pool = gen_pool_create(ilog2(mce_evt_len), -1);
+	if (!mce_evt_pool)
+		return -ENOMEM;
+
+	/* two pages should be enough */
+	pages = 2;
+	for (i = 0; i < pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			return -ENOMEM;
+		ret = gen_pool_add(mce_evt_pool, addr, PAGE_SIZE, -1);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static void free_chunk_page(struct gen_pool *pool,
+		struct gen_pool_chunk *chunk, void *data)
+{
+	free_page(chunk->start_addr);
+}
+
+static void mce_evt_pool_destroy(void)
+{
+	gen_pool_for_each_chunk(mce_evt_pool, free_chunk_page, NULL);
+	gen_pool_destroy(mce_evt_pool);
+}
+
+static void do_mce_irqwork(struct irq_work *irq_work)
+{
+	struct llist_node *llnode, *next;
+	struct mce_evt_llist *node;
+	struct mce *mce;
+
+	llnode = llist_del_all(&mce_event_llist);
+	/* print MCE event record from the oldest */
+	llnode = llist_reverse_order(llnode);
+	while (llnode) {
+		next = llnode->next;
+		node = llist_entry(llnode, struct mce_evt_llist, llnode);
+		mce = &node->mce;
+		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+		gen_pool_free(mce_evt_pool, (unsigned long)node, mce_evt_len);
+		llnode = next;
+	}
+}
+
+void mce_evt_save_ll(struct mce *mce)
+{
+	struct mce_evt_llist *node;
+
+	if (unlikely(pool_inactive))
+		return;
+
+	node = (void *)gen_pool_alloc(mce_evt_pool, mce_evt_len);
+	if (node) {
+		memcpy(&node->mce, mce, mce_evt_len);
+		llist_add(&node->llnode, &mce_event_llist);
+		irq_work_queue(&mce_irqwork);
+	}
+}
+
+void __init mce_genpool_init(void)
+{
+	if (mce_genpool_create() != 0)
+		pool_inactive = true;
+
+	if (pool_inactive)
+		mce_evt_pool_destroy();
+	else
+		init_irq_work(&mce_irqwork, do_mce_irqwork);
+}
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 09edd0b..005a688 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -11,6 +11,8 @@  enum severity_level {
 	MCE_PANIC_SEVERITY,
 };
 
+extern struct atomic_notifier_head x86_mce_decoder_chain;
+
 #define ATTR_LEN		16
 
 /* One object for each MCE bank, shared by all CPUs */
@@ -21,6 +23,13 @@  struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
+struct mce_evt_llist {
+	struct llist_node llnode;
+	struct mce mce;
+};
+
+void mce_evt_save_ll(struct mce *mce);
+void __init mce_genpool_init(void);
 int mce_severity(struct mce *a, int tolerant, char **msg);
 struct dentry *mce_get_debugfs_dir(void);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bb92f38..dcca519 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -147,7 +147,7 @@  static struct mce_log mcelog = {
 	.recordlen	= sizeof(struct mce),
 };
 
-void mce_log(struct mce *mce)
+void mce_log(struct mce *mce, int in_mce)
 {
 	unsigned next, entry;
 	int ret = 0;
@@ -155,9 +155,14 @@  void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
-	if (ret == NOTIFY_STOP)
-		return;
+	if (in_mce)
+		mce_evt_save_ll(mce);
+	else {
+		ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0,
+						 mce);
+		if (ret == NOTIFY_STOP)
+			return;
+	}
 
 	mce->finished = 0;
 	wmb();
@@ -635,7 +640,7 @@  void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		 * have anything to do with the actual error location.
 		 */
 		if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
-			mce_log(&m);
+			mce_log(&m, 0);
 
 		/*
 		 * Clear state for this bank.
@@ -1124,7 +1129,7 @@  void do_machine_check(struct pt_regs *regs, long error_code)
 		if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
 			mce_ring_add(m.addr >> PAGE_SHIFT);
 
-		mce_log(&m);
+		mce_log(&m, 1);
 
 		if (severity > worst) {
 			*final = m;
@@ -1254,7 +1259,7 @@  void mce_log_therm_throt_event(__u64 status)
 	mce_setup(&m);
 	m.bank = MCE_THERMAL_BANK;
 	m.status = status;
-	mce_log(&m);
+	mce_log(&m, 0);
 }
 #endif /* CONFIG_X86_MCE_INTEL */
 
@@ -2466,6 +2471,8 @@  static __init int mcheck_init_device(void)
 	if (err)
 		goto err_register;
 
+	mce_genpool_init();
+
 	return 0;
 
 err_register:
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 603df4f..ca64641 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -319,7 +319,7 @@  static void amd_threshold_interrupt(void)
 				m.bank = K8_MCE_THRESHOLD_BASE
 				       + bank * NR_BLOCKS
 				       + block;
-				mce_log(&m);
+				mce_log(&m, 0);
 				return;
 			}
 		}