diff mbox

[V1,4/6] arm64: exception: handle Synchronous External Abort

Message ID 1454699608-22760-5-git-send-email-tbaicar@codeaurora.org (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Tyler Baicar Feb. 5, 2016, 7:13 p.m. UTC
SEA exceptions are often caused by an uncorrected hardware
error and are handled when data abort and instruction abort
exception classes have specific values for their Fault Status
Code.

When SEA occurs, before killing the process, go through
the handlers registered in the notification list.

Update fault_info[] with specific SEA faults so that the
new SEA handler is used.

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 arch/arm64/include/asm/system_misc.h | 13 ++++++++
 arch/arm64/mm/fault.c                | 58 +++++++++++++++++++++++++++++-------
 2 files changed, 61 insertions(+), 10 deletions(-)

Comments

Will Deacon Feb. 10, 2016, 6:03 p.m. UTC | #1
On Fri, Feb 05, 2016 at 12:13:26PM -0700, Tyler Baicar wrote:
> SEA exceptions are often caused by an uncorrected hardware
> error and are handled when data abort and instruction abort
> exception classes have specific values for their Fault Status
> Code.
> 
> When SEA occurs, before killing the process, go through
> the handlers registered in the notification list.
> 
> Update fault_info[] with specific SEA faults so that the
> new SEA handler is used.
> 
> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
> ---
>  arch/arm64/include/asm/system_misc.h | 13 ++++++++
>  arch/arm64/mm/fault.c                | 58 +++++++++++++++++++++++++++++-------
>  2 files changed, 61 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
> index 57f110b..90daf4a 100644
> --- a/arch/arm64/include/asm/system_misc.h
> +++ b/arch/arm64/include/asm/system_misc.h
> @@ -64,4 +64,17 @@ extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
>  
>  #endif	/* __ASSEMBLY__ */
>  
> +/*
> + * The functions below are used to register and unregister callbacks
> + * that are to be invoked when a Synchronous External Abort (SEA)
> + * occurs. An SEA is raised by certain fault status codes that have
> + * either data or instruction abort as the exception class, and
> + * callbacks may be registered to parse or handle such hardware errors.
> + *
> + * Registered callbacks are run in an interrupt/atomic context. They
> + * are not allowed to block or sleep.
> + */
> +int sea_register_handler_chain(struct notifier_block *nb);
> +void sea_unregister_handler_chain(struct notifier_block *nb);
> +
>  #endif	/* __ASM_SYSTEM_MISC_H */
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 92ddac1..d6fa691 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -39,6 +39,22 @@
>  #include <asm/pgtable.h>
>  #include <asm/tlbflush.h>
>  
> +/*
> + * GHES SEA handler code may register a notifier call here to
> + * handle HW error record passed from platform.
> + */
> +static ATOMIC_NOTIFIER_HEAD(sea_handler_chain);
> +
> +int sea_register_handler_chain(struct notifier_block *nb)
> +{
> +	return atomic_notifier_chain_register(&sea_handler_chain, nb);
> +}
> +
> +void sea_unregister_handler_chain(struct notifier_block *nb)
> +{
> +	atomic_notifier_chain_unregister(&sea_handler_chain, nb);
> +}
> +
>  static const char *fault_name(unsigned int esr);
>  
>  /*
> @@ -379,6 +395,28 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  	return 1;
>  }
>  
> +/*
> + * This abort handler deals with Synchronous External Abort.
> + * It calls notifiers, and then returns "fault".
> + */
> +static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
> +{
> +	struct siginfo info;
> +
> +	atomic_notifier_call_chain(&sea_handler_chain, 0, NULL);
> +
> +	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
> +		 fault_name(esr), esr, addr);
> +
> +	info.si_signo = SIGBUS;
> +	info.si_errno = 0;
> +	info.si_code  = 0;
> +	info.si_addr  = (void __user *)addr;
> +	arm64_notify_die("", regs, &info, esr);

Surely we don't want to call this if the notifier chain handled the
exception?

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Abdulhamid, Harb Feb. 11, 2016, 2:40 a.m. UTC | #2
On 2/10/2016 1:03 PM, Will Deacon wrote:
> On Fri, Feb 05, 2016 at 12:13:26PM -0700, Tyler Baicar wrote:

<snip>

>> +static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>> +{
>> +	struct siginfo info;
>> +
>> +	atomic_notifier_call_chain(&sea_handler_chain, 0, NULL);
>> +
>> +	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>> +		 fault_name(esr), esr, addr);
>> +
>> +	info.si_signo = SIGBUS;
>> +	info.si_errno = 0;
>> +	info.si_code  = 0;
>> +	info.si_addr  = (void __user *)addr;
>> +	arm64_notify_die("", regs, &info, esr);
> 
> Surely we don't want to call this if the notifier chain handled the
> exception?
You are correct, Ideally you should not die if the notifier chain
handled the exception (e.g. via memory fault handling).  However, this
patch was intended as a first step to provide the user with more useful
information about the hardware error (e.g. details of a cache error, bus
error, or memory error that led to the SEA).

The thought was to do what your suggesting as a next step (i.e. adding
actually recovery mechanisms in the SEA handler). However, there are a
couple of questions enumerated below that I think need more discussion.

First, you need a way to get information returned from the notifier
chain to understand whether or not it recovered from the error. (If this
easier than I'm making it out to be, please set me straight here, as it
was not clear to me at first glance on how to do that)

Second, you need a way to kill/abort the thread that encountered this
error, which (I assume) would only be valid/possible thing to do if it
was a user thread that encountered the hardware error.

For example, let's say we encounter an SEA due to a memory error that
was successfully handled by the memory fault handling code (e.g. offline
a page owned by some user application).  Since this is a synchronous
error that may have occurred either on a load, store, or instruction
fetch, the SEA handler must also know to kill the user thread that
encountered that hardware error.  It is not clear to me how we do that
cleanly, and what the repercussions would be. Would it get handled
naturally after the page has become invalid (e.g. it would just result
in a translation fault when attempting to continue the thread, existing
kernel software error handling takes it from there)?

Also, keep in mind that our current assumption is that *all* kernel data
and threads should be considered critical, and any
corruption/termination of kernel data/threads should always be treated
as fatal.  Please let us know if you disagree.

Harb
diff mbox

Patch

diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index 57f110b..90daf4a 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -64,4 +64,17 @@  extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
 
 #endif	/* __ASSEMBLY__ */
 
+/*
+ * The functions below are used to register and unregister callbacks
+ * that are to be invoked when a Synchronous External Abort (SEA)
+ * occurs. An SEA is raised by certain fault status codes that have
+ * either data or instruction abort as the exception class, and
+ * callbacks may be registered to parse or handle such hardware errors.
+ *
+ * Registered callbacks are run in an interrupt/atomic context. They
+ * are not allowed to block or sleep.
+ */
+int sea_register_handler_chain(struct notifier_block *nb);
+void sea_unregister_handler_chain(struct notifier_block *nb);
+
 #endif	/* __ASM_SYSTEM_MISC_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 92ddac1..d6fa691 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -39,6 +39,22 @@ 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
+/*
+ * GHES SEA handler code may register a notifier call here to
+ * handle HW error record passed from platform.
+ */
+static ATOMIC_NOTIFIER_HEAD(sea_handler_chain);
+
+int sea_register_handler_chain(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&sea_handler_chain, nb);
+}
+
+void sea_unregister_handler_chain(struct notifier_block *nb)
+{
+	atomic_notifier_chain_unregister(&sea_handler_chain, nb);
+}
+
 static const char *fault_name(unsigned int esr);
 
 /*
@@ -379,6 +395,28 @@  static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	return 1;
 }
 
+/*
+ * This abort handler deals with Synchronous External Abort.
+ * It calls notifiers, and then returns "fault".
+ */
+static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+	struct siginfo info;
+
+	atomic_notifier_call_chain(&sea_handler_chain, 0, NULL);
+
+	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
+		 fault_name(esr), esr, addr);
+
+	info.si_signo = SIGBUS;
+	info.si_errno = 0;
+	info.si_code  = 0;
+	info.si_addr  = (void __user *)addr;
+	arm64_notify_die("", regs, &info, esr);
+
+	return 0;
+}
+
 static struct fault_info {
 	int	(*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
 	int	sig;
@@ -401,22 +439,22 @@  static struct fault_info {
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 permission fault"	},
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort"	},
+	{ do_sea,		SIGBUS,  0,		"synchronous external abort"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 17"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 18"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 19"			},
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error"	},
+	{ do_sea,		SIGBUS,  0,		"level 0 SEA (trans tbl walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 1 SEA (trans tbl walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 2 SEA (trans tbl walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 3 SEA (trans tbl walk)"	},
+	{ do_sea,		SIGBUS,  0,		"synchronous parity or ECC err" },
 	{ do_bad,		SIGBUS,  0,		"unknown 25"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 26"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 27"			},
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
+	{ do_sea,		SIGBUS,  0,		"level 0 synch parity error"	},
+	{ do_sea,		SIGBUS,  0,		"level 1 synch parity error"	},
+	{ do_sea,		SIGBUS,  0,		"level 2 synch parity error"	},
+	{ do_sea,		SIGBUS,  0,		"level 3 synch parity error"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 32"			},
 	{ do_bad,		SIGBUS,  BUS_ADRALN,	"alignment fault"		},
 	{ do_bad,		SIGBUS,  0,		"unknown 34"			},