arm: mm: fault: check ADFSR in case of abort

Message ID HE1PR0702MB37567D530C1C1D45949C164BFAF30@HE1PR0702MB3756.eurprd07.prod.outlook.com (mailing list archive)
State New, archived

Commit Message

Wiebe, Wladislav (Nokia - DE/Ulm) Oct. 29, 2018, 2:20 p.m. UTC
When running into situations like:
"Unhandled fault: synchronous external abort (0x210) at 0xXXX"
or
"Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
it is useful to know the content of ADFSR (Auxiliary Data Fault Status
Register) to indicate an ECC double-bit error in L1 or L2 cache.

Refer to:
Cortex-A15 Technical Reference Manual, Revision: r2p1
[6.4.8. Error Correction Code]

Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
---
 arch/arm/mm/fault.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Robin Murphy Oct. 29, 2018, 2:52 p.m. UTC | #1
On 29/10/2018 14:20, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
> 
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]

The contents of ADFSR are implementation-defined, though, so this 
interpretation is *only* valid on Cortex-A15. Other processors may use 
those bit positions to report something else, at which point printing a 
message about ECC errors would be totally misleading.

Robin.

> Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> ---
>   arch/arm/mm/fault.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 3232afb6fdc0..5e240deb6ed6 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
>   	fsr_info[nr].name = name;
>   }
>   
> +/*
> + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> + */
> +static void check_adfsr_for_ecc(void)
> +{
> +	u32 adfsr = 0;
> +
> +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> +
> +	if (adfsr & (BIT(31) | BIT(23))) {
> +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> +			 "ECC double-bit error occurred at some time.\n",
> +			  adfsr);
> +	}
> +}
> +
>   /*
>    * Dispatch a data abort to the relevant handler.
>    */
> @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>   	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
>   		return;
>   
> +	check_adfsr_for_ecc();
>   	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
>   		inf->name, fsr, addr);
>   	show_pte(current->mm, addr);
> @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
>   	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
>   		return;
>   
> +	check_adfsr_for_ecc();
>   	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
>   		inf->name, ifsr, addr);
>   
>
Russell King (Oracle) Oct. 29, 2018, 3:12 p.m. UTC | #2
On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
> 
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]

This is CPU-independent code, and so must only access registers that are
present on all CPUs which may run that code.

Here's the extract from the ARM ARM for the ADFSR and AIFSR:

  The position of these registers is architecturally-defined, but the
  content and use of the registers is IMPLEMENTATION DEFINED. An
  implementation can use these registers to return additional fault
  status information. An example use of these registers is to return
  more information for diagnosing parity errors.

So by testing bits in this register, you are making use of
implementation defined values.

It also goes on to say:

  These registers are not implemented in architecture versions before
  ARMv7.

So before ARMv7, we have to take note of the unimplemented CP15 rules:

2. In an allocated CP15 primary register, accesses to all unallocated
   encodings are UNPREDICTABLE for accesses at PL1 or higher.  This
   means that any MCR or MRC access from PL1 or higher with a
   combination of <CRn>, <opc1>, <CRm> and <opc2> values not shown in,
   or referenced from, Full list of VMSA CP15 registers, by coprocessor
   register number on page B3-1481, that would access an allocated
   CP15 primary register, is UNPREDICTABLE. As indicated by rule 1, for
   the ARMv7-A architecture, the allocated CP15 primary registers are:
   • in any VMSA implementation, c0-c3, c5-c11, c13, and c15
   ...

So I'd prefer if we didn't attempt to read this register on CPUs where
this isn't explicitly implemented.
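
A minimal userspace sketch of the constraint described above, for illustration only: the register access is skipped entirely on anything older than ARMv7. The enum, the reader callback, and the helper names are all hypothetical; in the kernel the architecture check would presumably compare cpu_architecture() against CPU_ARCH_ARMv7, and the reader would be the mrc from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's CPU architecture levels. */
enum cpu_arch { ARCH_ARMV5, ARCH_ARMV6, ARCH_ARMV7 };

/* Hypothetical callback that performs the actual register read. */
typedef uint32_t (*adfsr_reader)(void);

/*
 * Read the ADFSR only where the architecture defines it.
 * Returns false, without calling read_reg, on pre-ARMv7 CPUs,
 * because the encoding is unallocated there and the access
 * would be UNPREDICTABLE.
 */
static bool read_adfsr_if_implemented(enum cpu_arch arch,
				      adfsr_reader read_reg,
				      uint32_t *out)
{
	if (arch < ARCH_ARMV7)
		return false;

	*out = read_reg();
	return true;
}

/* Fake reader used to observe whether an access was issued. */
static unsigned int fake_reads;

static uint32_t fake_adfsr(void)
{
	fake_reads++;
	return 0x80000000u;
}
```

The important property is that the guard sits in front of the access itself, not just in front of the decoding of the value.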
Wiebe, Wladislav (Nokia - DE/Ulm) Oct. 29, 2018, 3:30 p.m. UTC | #3
Hi Robin, Russell,

> -----Original Message-----
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Monday, October 29, 2018 3:52 PM
[..]
> On 29/10/2018 14:20, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> > When running into situations like:
> > "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> > or
> > "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> > it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> > Register) to indicate an ECC double-bit error in L1 or L2 cache.
> >
> > Refer to:
> > Cortex-A15 Technical Reference Manual, Revision: r2p1 [6.4.8. Error
> > Correction Code]
> 
> The contents of ADFSR are implementation-defined, though, so this
> interpretation is *only* valid on Cortex-A15. Other processors may use those
> bit positions to report something else, at which point printing a message
> about ECC errors would be totally misleading.

Good point, I initially thought it was valid for other CPUs as well.

Do you think we can go with this approach:
	if (read_cpuid_part() == ARM_CPU_PART_CORTEX_A15) {
		asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
		xxxx
	}

?
Thanks a lot for the fast feedback!

- Wladislav
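
A minimal userspace sketch of the part-number gate proposed above, for illustration only. The mask and part-number constants are assumptions meant to mirror ARM_CPU_PART_MASK and ARM_CPU_PART_CORTEX_A15 in arch/arm/include/asm/cputype.h, the helper names are hypothetical, and the ECC mask is simply the BIT(31) | BIT(23) test from the patch. The MIDR is passed in as a parameter so the logic can be exercised off-target; in the kernel the comparison would be done with read_cpuid_part().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: implementer and part-number fields of the MIDR. */
#define CPU_PART_MASK		0xff00fff0u
#define CPU_PART_CORTEX_A15	0x4100c0f0u

/* Same bits the patch tests; only meaningful on Cortex-A15. */
#define ADFSR_A15_ECC_BITS	((1u << 31) | (1u << 23))

/* True when the MIDR identifies an ARM Ltd. Cortex-A15, any revision. */
static bool midr_is_cortex_a15(uint32_t midr)
{
	return (midr & CPU_PART_MASK) == CPU_PART_CORTEX_A15;
}

/*
 * Interpret the ADFSR as ECC status only on Cortex-A15. On any other
 * part the bits are IMPLEMENTATION DEFINED, so nothing is reported.
 */
static bool adfsr_reports_ecc_error(uint32_t midr, uint32_t adfsr)
{
	if (!midr_is_cortex_a15(midr))
		return false;

	return (adfsr & ADFSR_A15_ECC_BITS) != 0;
}
```

With this gate, a value read on a different part is never described as an ECC error. It does not by itself settle when the register should be read.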

> 
> Robin.
> 
> > Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> > ---
> >   arch/arm/mm/fault.c | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> > index 3232afb6fdc0..5e240deb6ed6 100644
> > --- a/arch/arm/mm/fault.c
> > +++ b/arch/arm/mm/fault.c
> > @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> >   	fsr_info[nr].name = name;
> >   }
> >
> > +/*
> > + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> > + */
> > +static void check_adfsr_for_ecc(void)
> > +{
> > +	u32 adfsr = 0;
> > +
> > +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> > +
> > +	if (adfsr & (BIT(31) | BIT(23))) {
> > +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> > +			 "ECC double-bit error occurred at some time.\n",
> > +			  adfsr);
> > +	}
> > +}
> > +
> >   /*
> >    * Dispatch a data abort to the relevant handler.
> >    */
> > @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> >   	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> >   		return;
> >
> > +	check_adfsr_for_ecc();
> >   	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> >   		inf->name, fsr, addr);
> >   	show_pte(current->mm, addr);
> > @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> >   	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> >   		return;
> >
> > +	check_adfsr_for_ecc();
> >   	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> >   		inf->name, ifsr, addr);
> >
> >
Mark Rutland Oct. 29, 2018, 3:54 p.m. UTC | #4
On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
> 
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]
> 
> Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> ---
>  arch/arm/mm/fault.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 3232afb6fdc0..5e240deb6ed6 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
>  	fsr_info[nr].name = name;
>  }
>  
> +/*
> + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> + */
> +static void check_adfsr_for_ecc(void)
> +{
> +	u32 adfsr = 0;
> +
> +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> +
> +	if (adfsr & (BIT(31) | BIT(23))) {
> +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> +			 "ECC double-bit error occurred at some time.\n",
> +			  adfsr);
> +	}
> +}
> +
>  /*
>   * Dispatch a data abort to the relevant handler.
>   */
> @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
>  		return;
>  
> +	check_adfsr_for_ecc();
>  	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
>  		inf->name, fsr, addr);
>  	show_pte(current->mm, addr);
> @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
>  	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
>  		return;
>  
> +	check_adfsr_for_ecc();
>  	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
>  		inf->name, ifsr, addr);

IIUC at this point the task is preemptible (and interruptible), so I
believe this is too late to snapshot the ADFSR. The task could have been
migrated to a different core, with an irrelevant ADFSR, or a fault could
have occurred within an interrupt handler, etc.

Thanks,
Mark.
Russell King (Oracle) Oct. 29, 2018, 4:43 p.m. UTC | #5
On Mon, Oct 29, 2018 at 03:54:36PM +0000, Mark Rutland wrote:
> On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> > When running into situations like:
> > "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> > or
> > "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> > it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> > Register) to indicate an ECC double-bit error in L1 or L2 cache.
> > 
> > Refer to:
> > Cortex-A15 Technical Reference Manual, Revision: r2p1
> > [6.4.8. Error Correction Code]
> > 
> > Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
> > ---
> >  arch/arm/mm/fault.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> > index 3232afb6fdc0..5e240deb6ed6 100644
> > --- a/arch/arm/mm/fault.c
> > +++ b/arch/arm/mm/fault.c
> > @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> >  	fsr_info[nr].name = name;
> >  }
> >  
> > +/*
> > + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> > + */
> > +static void check_adfsr_for_ecc(void)
> > +{
> > +	u32 adfsr = 0;
> > +
> > +	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> > +
> > +	if (adfsr & (BIT(31) | BIT(23))) {
> > +		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> > +			 "ECC double-bit error occurred at some time.\n",
> > +			  adfsr);
> > +	}
> > +}
> > +
> >  /*
> >   * Dispatch a data abort to the relevant handler.
> >   */
> > @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> >  	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> >  		return;
> >  
> > +	check_adfsr_for_ecc();
> >  	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> >  		inf->name, fsr, addr);
> >  	show_pte(current->mm, addr);
> > @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> >  	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> >  		return;
> >  
> > +	check_adfsr_for_ecc();
> >  	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> >  		inf->name, ifsr, addr);
> 
> IIUC at this point the task is preemptible (and interruptible),

It may be preemptible, but isn't necessarily so.  It depends on whether
the called FSR-specific function enabled interrupts or not.

So, it would be better to read the ADFSR before calling the FSR-specific
function, to guarantee that we read the values that correspond to _this_
fault.
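
A minimal userspace sketch of that ordering, for illustration only: the register is sampled first, the fault-specific handler runs second, and only the saved value is reported. Every name here is hypothetical; fake_adfsr_reg stands in for the per-CPU register, and the clobbering handler models a handler that enables interrupts and lets something else change the register.

```c
#include <stdint.h>

/* Stands in for the per-CPU ADFSR; a real read would use mrc. */
static uint32_t fake_adfsr_reg;

/* Fault-specific handler: returns 0 when the fault was handled. */
typedef int (*fault_handler)(void);

struct abort_report {
	int unhandled;		/* 1 if the fault fell through */
	uint32_t adfsr;		/* value sampled at fault entry */
};

static struct abort_report dispatch_abort(fault_handler fn)
{
	struct abort_report report = { 0, 0 };
	/* Sample before the handler can enable interrupts or migrate. */
	uint32_t snapshot = fake_adfsr_reg;

	if (fn() == 0)
		return report;

	/* Report the saved value, not whatever the register holds now. */
	report.unhandled = 1;
	report.adfsr = snapshot;
	return report;
}

/* Models a handler after which the register no longer matches. */
static int clobbering_handler(void)
{
	fake_adfsr_reg = 0;
	return 1;
}
```

Reading the register after the handler returns would report 0 in this model, which is the stale-value problem raised in the review.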
Patch

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 3232afb6fdc0..5e240deb6ed6 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -547,6 +547,22 @@  hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
 	fsr_info[nr].name = name;
 }
 
+/*
+ * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
+ */
+static void check_adfsr_for_ecc(void)
+{
+	u32 adfsr = 0;
+
+	asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
+
+	if (adfsr & (BIT(31) | BIT(23))) {
+		pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
+			 "ECC double-bit error occurred at some time.\n",
+			  adfsr);
+	}
+}
+
 /*
  * Dispatch a data abort to the relevant handler.
  */
@@ -559,6 +575,7 @@  do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
 		return;
 
+	check_adfsr_for_ecc();
 	pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
 		inf->name, fsr, addr);
 	show_pte(current->mm, addr);
@@ -593,6 +610,7 @@  do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
 	if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
 		return;
 
+	check_adfsr_for_ecc();
 	pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
 		inf->name, ifsr, addr);