trace: skip hwasan

Message ID 20190217043434.46233-1-cai@lca.pw (mailing list archive)
State New, archived

Commit Message

Qian Cai Feb. 17, 2019, 4:34 a.m. UTC
Enabling the function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) freezes
the whole system on ThunderX2 systems with 256 CPUs. The tracer produces
a burst of pointer accesses, and KASAN dereferences the shadow address
for tag checking on every one of them, which kills all the CPUs.

Signed-off-by: Qian Cai <cai@lca.pw>
---
 kernel/trace/Makefile | 5 +++++
 1 file changed, 5 insertions(+)
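
For readers unfamiliar with the tag checking the commit message refers to, here is a rough user-space model of the idea. It is only a sketch and not the kernel's implementation: it assumes an 8-bit tag in the top byte of the pointer, one shadow byte per 16-byte granule, and 0xff as a match-all tag. The function names and the flat shadow array are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants for a user-space model (assumptions, not kernel code). */
#define TAG_SHIFT      56     /* the tag lives in the top byte of the pointer */
#define GRANULE_SHIFT  4      /* one shadow byte covers a 16-byte granule */
#define TAG_MATCH_ALL  0xffu  /* assumed tag that matches any memory */

/* Extract the tag stored in the top byte of a 64-bit pointer value. */
uint8_t ptr_tag(uint64_t ptr)
{
	return (uint8_t)(ptr >> TAG_SHIFT);
}

/* Strip the tag, leaving the address bits (here: an offset into the model). */
uint64_t ptr_untag(uint64_t ptr)
{
	return ptr & ((UINT64_C(1) << TAG_SHIFT) - 1);
}

/*
 * Check an access of `size` bytes at `ptr`. shadow[i] holds the tag of
 * granule i. Returns true when every granule touched by the access
 * carries the same tag as the pointer.
 */
bool tag_check(const uint8_t *shadow, uint64_t ptr, size_t size)
{
	uint8_t tag = ptr_tag(ptr);
	uint64_t addr = ptr_untag(ptr);
	uint64_t first, last, g;

	if (size == 0 || tag == TAG_MATCH_ALL)
		return true;

	first = addr >> GRANULE_SHIFT;
	last = (addr + size - 1) >> GRANULE_SHIFT;
	for (g = first; g <= last; g++) {
		if (shadow[g] != tag)
			return false;	/* tag mismatch: a real checker would report here */
	}
	return true;
}
```

The point of the model is that every instrumented load or store costs at least one extra read of shadow memory, so heavily instrumented code generates correspondingly more memory traffic.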

Comments

Dmitry Vyukov Feb. 17, 2019, 7:30 a.m. UTC | #1
On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
>
> Enabling the function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) freezes
> the whole system on ThunderX2 systems with 256 CPUs. The tracer produces
> a burst of pointer accesses, and KASAN dereferences the shadow address
> for tag checking on every one of them, which kills all the CPUs.

Hi Qian,

Could you please elaborate on what exactly happens and who or what kills
the CPUs, and why? The number of memory accesses should not make any
difference.
With hardware support (MTE) it won't be possible to disable
instrumentation (loads and stores check tags themselves), so it would
be useful to keep track of the exact reasons we disable instrumentation,
to know how to deal with them once we have hardware support.
It would be useful to keep this info in a comment in the Makefile.

Thanks

> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  kernel/trace/Makefile | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index c2b2148bb1d2..fdd547a68385 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -28,6 +28,11 @@ ifdef CONFIG_GCOV_PROFILE_FTRACE
>  GCOV_PROFILE := y
>  endif
>
> +# Too much pointer access will kill hwasan.
> +ifdef CONFIG_KASAN_SW_TAGS
> +KASAN_SANITIZE := n
> +endif
> +
>  CFLAGS_trace_benchmark.o := -I$(src)
>  CFLAGS_trace_events_filter.o := -I$(src)
>
> --
> 2.17.2 (Apple Git-113)

Will Deacon Feb. 18, 2019, 10:43 a.m. UTC | #2
On Sat, Feb 16, 2019 at 11:34:34PM -0500, Qian Cai wrote:
> Enabling the function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) freezes
> the whole system on ThunderX2 systems with 256 CPUs. The tracer produces
> a burst of pointer accesses, and KASAN dereferences the shadow address
> for tag checking on every one of them, which kills all the CPUs.
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  kernel/trace/Makefile | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index c2b2148bb1d2..fdd547a68385 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -28,6 +28,11 @@ ifdef CONFIG_GCOV_PROFILE_FTRACE
>  GCOV_PROFILE := y
>  endif
>  
> +# Too much pointer access will kill hwasan.
> +ifdef CONFIG_KASAN_SW_TAGS
> +KASAN_SANITIZE := n
> +endif

I don't maintain this file, but I think that my comments on your related
patch are relevant here as well:

https://lkml.org/lkml/2019/2/18/223

Will

Qian Cai Feb. 18, 2019, 1:27 p.m. UTC | #3
On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
>>
>> Enabling the function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) freezes
>> the whole system on ThunderX2 systems with 256 CPUs. The tracer produces
>> a burst of pointer accesses, and KASAN dereferences the shadow address
>> for tag checking on every one of them, which kills all the CPUs.
> 
> Hi Qian,
> 
> Could you please elaborate on what exactly happens and who or what kills
> the CPUs, and why? The number of memory accesses should not make any
> difference.
> With hardware support (MTE) it won't be possible to disable
> instrumentation (loads and stores check tags themselves), so it would
> be useful to keep track of the exact reasons we disable instrumentation,
> to know how to deal with them once we have hardware support.
> It would be useful to keep this info in a comment in the Makefile.

It turns out that enabling the function tracer sometimes triggers a
hardware error.

# echo function > /sys/kernel/debug/tracing/current_tracer

RAS CONTROLLER: Fatal unrecoverable error detected

	*** NBU BAR Error ***


  MPIDR= 0x81000000
  CTX_X0= ffff10001032eb9c
  CTX_X1= ffff100010205f08
  CTX_X2= 0
  CTX_X3= ffff100010205efc
  CTX_X4= 8
  CTX_X5= 40
  CTX_X6= 3f
  CTX_X7= 0
  CTX_X8= ff
  CTX_X9= ffff0808ba65ab46
  CTX_X10= ffff0808ba65ab45
  CTX_X11= da
  CTX_X12= 10071651
  CTX_X13= fff60658
  CTX_X14= ffff1000140d5000
  CTX_X15= ffff100013855578
  CTX_X16= 804b004a
  CTX_X17= 1000100
  CTX_X18= 0
  CTX_X19= ffff100010205f08
  CTX_X20= ffff100012531cd0
  CTX_X21= ffff100010205f08
  CTX_X22= ffff10001032eb9c
  CTX_X23= 0
  CTX_X24= ffff100012531cc0
  CTX_X25= 12af
  CTX_X26= fffdba05
  CTX_X27= daff808ba65ab460
  CTX_X28= ffff100012531cc0
  CTX_X29= ffff808a2c617320
  CTX_X30= ffff10001009b5a4
  CTX_X31= ffff100012531cc0
  CTX_SCR_EL3= 735
  CTX_RUNTIME_SP= 6e545c0
  CTX_SPSR_EL3= 604003c9
  CTX_ELR_EL3= ffff100010205ecc
 Node 0 NBU 0 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff00
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011ff00

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 1 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff40
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011ff40

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 2 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff80
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011ff80

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 3 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ffc0
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011ffc0

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 4 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe00
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fe00

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 5 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe40
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fe40

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 6 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe80
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fe80

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 7 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fee0
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fee0

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 8 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fd30
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fd30

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 9 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fd60
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fd60

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 10 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fda0
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fda0

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 11 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fdc0
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fdc0

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 12 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fc00
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fc00

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 13 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fc40
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fc40

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 14 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fc80
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fc80

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back
 Node 0 NBU 15 Error report :
 NBU BAR Error
      NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
      NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fcc0
      NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
      Physical Address : 0x40011fcc0

NBU BAR Error : Decoded info :
        Agent info : CPU
            Core ID : 21
            Thread ID : 1
        Requ: type : 4 : Write Back

Current NBU DRAM BAR setting:
Node0 BAR0 Base 00004000 Limit 00007FFC chan_xlation 00004008 node_xlation 00000000
Node0 BAR1 Base 00080001 Limit 000FEFFC chan_xlation 0007C008 node_xlation 00000000
Node0 BAR2 Base 00880001 Limit 00FFCFFC chan_xlation 007FD008 node_xlation 00000000
Node0 BAR3 Base 00FFD001 Limit 00FFFFDF chan_xlation 00FFD000 node_xlation 00000000
Node0 BAR4 Base 08800001 Limit 08BFCFDF chan_xlation 087FD000 node_xlation 00000000
Node0 BAR5 Base 08BFD001 Limit 093FCFEE chan_xlation 08BFD008 node_xlation 00000002
Node0 BAR6 Base 093FD001 Limit 097FCFDF chan_xlation 093FD000 node_xlation 00000002
Node0 BAR7 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR8 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR9 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR10 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR11 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR12 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR13 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR14 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node0 BAR15 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR0 Base 00004000 Limit 00007FFC chan_xlation 00004008 node_xlation 00000000
Node1 BAR1 Base 00080001 Limit 000FEFFC chan_xlation 0007C008 node_xlation 00000000
Node1 BAR2 Base 00880001 Limit 00FFCFFC chan_xlation 007FD008 node_xlation 00000000
Node1 BAR3 Base 00FFD001 Limit 00FFFFDF chan_xlation 00FFD000 node_xlation 00000000
Node1 BAR4 Base 08800001 Limit 08BFCFDF chan_xlation 087FD000 node_xlation 00000000
Node1 BAR5 Base 08BFD001 Limit 093FCFEE chan_xlation 08BFD008 node_xlation 00000002
Node1 BAR6 Base 093FD001 Limit 097FCFDF chan_xlation 093FD000 node_xlation 00000002
Node1 BAR7 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR8 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR9 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR10 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR11 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR12 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR13 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR14 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
Node1 BAR15 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000

0.0.0:
  00: AF00177D
  04: 00100006
  08: 06000000
  0C: 00000010
  10: 00000000
  14: 00000000
  18: 00000000
  1C: 00000000
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 0000177D
  30: 00000000
  34: 00000090
  38: 00000000
  3C: 00000000

0.1.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00010100
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.2.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00020200
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.3.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00030300
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.4.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00040400
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.5.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00050500
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.6.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00060600
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.7.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00070700
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.8.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00080800
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.9.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00090900
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.a.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 000A0A00
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.b.0:
  00: AF84177D
  04: 00100106
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 000C0B00
  1C: 20000000
  20: 43104300
  24: 03F10001
  28: 00000100
  2C: 00000100
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 000201FF

0.c.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 000D0D00
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.d.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 000E0E00
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

0.e.0:
  00: AF84177D
  04: 00100106
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00100F00
  1C: 20000000
  20: 42F04000
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 000201FF

0.f.0:
  00: 902614E4
  04: 00100406
  08: 0C033000
  0C: 00800010
  10: 0400000C
  14: 00000100
  18: 0401000C
  1C: 00000100
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 00000000

0.f.1:
  00: 902614E4
  04: 00100406
  08: 0C033000
  0C: 00800010
  10: 0402000C
  14: 00000100
  18: 0403000C
  1C: 00000100
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 00000000

0.10.0:
  00: 902714E4
  04: 00100406
  08: 01060100
  0C: 00800010
  10: 00000000
  14: 00000000
  18: 0404000C
  1C: 00000100
  20: 00000000
  24: 43200000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 000000FF

0.10.1:
  00: 902714E4
  04: 00100406
  08: 01060100
  0C: 00800010
  10: 00000000
  14: 00000000
  18: 0405000C
  1C: 00000100
  20: 00000000
  24: 43210000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 000000FF

b.0.0:
  00: 101515B3
  04: 00100506
  08: 02000000
  0C: 00800000
  10: 0000000C
  14: 00000100
  18: 00000000
  1C: 00000000
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 028A1590
  30: FFF00000
  34: 00000060
  38: 00000000
  3C: 000001FF

b.0.1:
  00: 101515B3
  04: 00100506
  08: 02000000
  0C: 00800000
  10: 0200000C
  14: 00000100
  18: 00000000
  1C: 00000000
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 028A1590
  30: FFF00000
  34: 00000060
  38: 00000000
  3C: 000002FF

f.0.0:
  00: 11501A03
  04: 00100107
  08: 06040004
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 0010100F
  1C: 022001F1
  20: 42F04000
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000050
  38: 00000000
  3C: 000201FF

10.0.0:
  00: 20001A03
  04: 02100102
  08: 03000041
  0C: 00000000
  10: 40000000
  14: 42000000
  18: 00000001
  1C: 00000000
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 20001A03
  30: 00000000
  34: 00000040
  38: 00000000
  3C: 000001FF

80.0.0:
  00: AF00177D
  04: 00100002
  08: 06000000
  0C: 00000010
  10: 00000000
  14: 00000000
  18: 00000000
  1C: 00000000
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 0000177D
  30: 00000000
  34: 00000090
  38: 00000000
  3C: 00000000

80.1.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00818180
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

80.9.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00828280
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

80.b.0:
  00: AF84177D
  04: 00100000
  08: 06040000
  0C: 00010000
  10: 00000000
  14: 00000000
  18: 00838380
  1C: 00000000
  20: 0000FFF0
  24: 0001FFF1
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000048
  38: 00000000
  3C: 00000100

80.f.0:
  00: 902614E4
  04: 00100406
  08: 0C033000
  0C: 00800010
  10: 0000000C
  14: 00000140
  18: 0001000C
  1C: 00000140
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 00000000

80.f.1:
  00: 902614E4
  04: 00100406
  08: 0C033000
  0C: 00800010
  10: 0002000C
  14: 00000140
  18: 0003000C
  1C: 00000140
  20: 00000000
  24: 00000000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 00000000

80.10.0:
  00: 902714E4
  04: 00100406
  08: 01060100
  0C: 00800010
  10: 00000000
  14: 00000000
  18: 0004000C
  1C: 00000140
  20: 00000000
  24: 60000000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 000000FF

80.10.1:
  00: 902714E4
  04: 00100406
  08: 01060100
  0C: 00800010
  10: 00000000
  14: 00000000
  18: 0005000C
  1C: 00000140
  20: 00000000
  24: 60010000
  28: 00000000
  2C: 00000000
  30: 00000000
  34: 00000080
  38: 00000000
  3C: 000000FF
RAS CONTROLLER: SYSTEM HALTED...
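
A note for anyone decoding the NBU error reports above: in all 16 reports the printed "Physical Address" equals ERROR_REG2 placed above ERROR_REG1 as the low 32 bits. That relationship is inferred from the dump alone, not from any firmware documentation, so treat the sketch below as an assumption. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Combine the two address registers of one NBU BAR error report into a
 * physical address.
 * ASSUMPTION (inferred from the dump, not from documentation): REG2 holds
 * the bits above bit 31 and REG1 holds the low 32 bits.
 */
uint64_t nbu_bar_error_phys_addr(uint32_t reg1, uint32_t reg2)
{
	return ((uint64_t)reg2 << 32) | reg1;
}
```

For the NBU 0 report, REG1 = 0x0011ff00 and REG2 = 0x00000004 combine to 0x40011ff00, which is the address the firmware printed.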
Dmitry Vyukov Feb. 18, 2019, 1:56 p.m. UTC | #4
On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <cai@lca.pw> wrote:
>
>
>
> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
> > On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
> >>
> >> Enabling the function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) freezes
> >> the whole system on ThunderX2 systems with 256 CPUs. The tracer produces
> >> a burst of pointer accesses, and KASAN dereferences the shadow address
> >> for tag checking on every one of them, which kills all the CPUs.
> >
> > Hi Qian,
> >
> > Could you please elaborate on what exactly happens and who or what kills
> > the CPUs, and why? The number of memory accesses should not make any
> > difference.
> > With hardware support (MTE) it won't be possible to disable
> > instrumentation (loads and stores check tags themselves), so it would
> > be useful to keep track of the exact reasons we disable instrumentation,
> > to know how to deal with them once we have hardware support.
> > It would be useful to keep this info in a comment in the Makefile.
>
> It turns out that enabling the function tracer sometimes triggers a
> hardware error.

Please add to the comment that this hardware error occurs, that the
reason is unknown, and that it happens only from time to time.
"Too much pointer access" is confusing and does not seem to be the
root cause (lots of source files cause lots of pointer accesses).


> # echo function > /sys/kernel/debug/tracing/current_tracer
>
> RAS CONTROLLER: Fatal unrecoverable error detected
>
>         *** NBU BAR Error ***
>
>
>   MPIDR= 0x81000000
>   CTX_X0= ffff10001032eb9c
>   CTX_X1= ffff100010205f08
>   CTX_X2= 0
>   CTX_X3= ffff100010205efc
>   CTX_X4= 8
>   CTX_X5= 40
>   CTX_X6= 3f
>   CTX_X7= 0
>   CTX_X8= ff
>   CTX_X9= ffff0808ba65ab46
>   CTX_X10= ffff0808ba65ab45
>   CTX_X11= da
>   CTX_X12= 10071651
>   CTX_X13= fff60658
>   CTX_X14= ffff1000140d5000
>   CTX_X15= ffff100013855578
>   CTX_X16= 804b004a
>   CTX_X17= 1000100
>   CTX_X18= 0
>   CTX_X19= ffff100010205f08
>   CTX_X20= ffff100012531cd0
>   CTX_X21= ffff100010205f08
>   CTX_X22= ffff10001032eb9c
>   CTX_X23= 0
>   CTX_X24= ffff100012531cc0
>   CTX_X25= 12af
>   CTX_X26= fffdba05
>   CTX_X27= daff808ba65ab460
>   CTX_X28= ffff100012531cc0
>   CTX_X29= ffff808a2c617320
>   CTX_X30= ffff10001009b5a4
>   CTX_X31= ffff100012531cc0
>   CTX_SCR_EL3= 735
>   CTX_RUNTIME_SP= 6e545c0
>   CTX_SPSR_EL3= 604003c9
>   CTX_ELR_EL3= ffff100010205ecc
>  Node 0 NBU 0 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff00
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011ff00
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 1 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff40
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011ff40
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 2 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff80
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011ff80
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 3 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ffc0
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011ffc0
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 4 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe00
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fe00
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 5 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe40
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fe40
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 6 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe80
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fe80
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 7 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fee0
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fee0
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 8 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fd30
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fd30
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 9 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fd60
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fd60
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 10 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fda0
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fda0
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 11 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fdc0
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fdc0
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 12 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fc00
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fc00
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 13 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fc40
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fc40
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 14 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fc80
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fc80
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>  Node 0 NBU 15 Error report :
>  NBU BAR Error
>       NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
>       NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fcc0
>       NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x00000004
>       Physical Address : 0x40011fcc0
>
> NBU BAR Error : Decoded info :
>         Agent info : CPU
>             Core ID : 21
>             Thread ID : 1
>         Requ: type : 4 : Write Back
>
> Current NBU DRAM BAR setting:
> Node0 BAR0 Base 00004000 Limit 00007FFC chan_xlation 00004008 node_xlation 00000000
> Node0 BAR1 Base 00080001 Limit 000FEFFC chan_xlation 0007C008 node_xlation 00000000
> Node0 BAR2 Base 00880001 Limit 00FFCFFC chan_xlation 007FD008 node_xlation 00000000
> Node0 BAR3 Base 00FFD001 Limit 00FFFFDF chan_xlation 00FFD000 node_xlation 00000000
> Node0 BAR4 Base 08800001 Limit 08BFCFDF chan_xlation 087FD000 node_xlation 00000000
> Node0 BAR5 Base 08BFD001 Limit 093FCFEE chan_xlation 08BFD008 node_xlation 00000002
> Node0 BAR6 Base 093FD001 Limit 097FCFDF chan_xlation 093FD000 node_xlation 00000002
> Node0 BAR7 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR8 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR9 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR10 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR11 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR12 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR13 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR14 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node0 BAR15 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR0 Base 00004000 Limit 00007FFC chan_xlation 00004008 node_xlation 00000000
> Node1 BAR1 Base 00080001 Limit 000FEFFC chan_xlation 0007C008 node_xlation 00000000
> Node1 BAR2 Base 00880001 Limit 00FFCFFC chan_xlation 007FD008 node_xlation 00000000
> Node1 BAR3 Base 00FFD001 Limit 00FFFFDF chan_xlation 00FFD000 node_xlation 00000000
> Node1 BAR4 Base 08800001 Limit 08BFCFDF chan_xlation 087FD000 node_xlation 00000000
> Node1 BAR5 Base 08BFD001 Limit 093FCFEE chan_xlation 08BFD008 node_xlation 00000002
> Node1 BAR6 Base 093FD001 Limit 097FCFDF chan_xlation 093FD000 node_xlation 00000002
> Node1 BAR7 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR8 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR9 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR10 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR11 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR12 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR13 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR14 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
> Node1 BAR15 Base FFFFF000 Limit 00000000 chan_xlation 00000000 node_xlation 00000000
>
> 0.0.0:
>   00: AF00177D
>   04: 00100006
>   08: 06000000
>   0C: 00000010
>   10: 00000000
>   14: 00000000
>   18: 00000000
>   1C: 00000000
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 0000177D
>   30: 00000000
>   34: 00000090
>   38: 00000000
>   3C: 00000000
>
> 0.1.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00010100
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.2.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00020200
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.3.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00030300
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.4.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00040400
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.5.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00050500
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.6.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00060600
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.7.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00070700
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.8.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00080800
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.9.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00090900
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.a.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 000A0A00
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.b.0:
>   00: AF84177D
>   04: 00100106
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 000C0B00
>   1C: 20000000
>   20: 43104300
>   24: 03F10001
>   28: 00000100
>   2C: 00000100
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 000201FF
>
> 0.c.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 000D0D00
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.d.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 000E0E00
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 0.e.0:
>   00: AF84177D
>   04: 00100106
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00100F00
>   1C: 20000000
>   20: 42F04000
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 000201FF
>
> 0.f.0:
>   00: 902614E4
>   04: 00100406
>   08: 0C033000
>   0C: 00800010
>   10: 0400000C
>   14: 00000100
>   18: 0401000C
>   1C: 00000100
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 00000000
>
> 0.f.1:
>   00: 902614E4
>   04: 00100406
>   08: 0C033000
>   0C: 00800010
>   10: 0402000C
>   14: 00000100
>   18: 0403000C
>   1C: 00000100
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 00000000
>
> 0.10.0:
>   00: 902714E4
>   04: 00100406
>   08: 01060100
>   0C: 00800010
>   10: 00000000
>   14: 00000000
>   18: 0404000C
>   1C: 00000100
>   20: 00000000
>   24: 43200000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 000000FF
>
> 0.10.1:
>   00: 902714E4
>   04: 00100406
>   08: 01060100
>   0C: 00800010
>   10: 00000000
>   14: 00000000
>   18: 0405000C
>   1C: 00000100
>   20: 00000000
>   24: 43210000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 000000FF
>
> b.0.0:
>   00: 101515B3
>   04: 00100506
>   08: 02000000
>   0C: 00800000
>   10: 0000000C
>   14: 00000100
>   18: 00000000
>   1C: 00000000
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 028A1590
>   30: FFF00000
>   34: 00000060
>   38: 00000000
>   3C: 000001FF
>
> b.0.1:
>   00: 101515B3
>   04: 00100506
>   08: 02000000
>   0C: 00800000
>   10: 0200000C
>   14: 00000100
>   18: 00000000
>   1C: 00000000
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 028A1590
>   30: FFF00000
>   34: 00000060
>   38: 00000000
>   3C: 000002FF
>
> f.0.0:
>   00: 11501A03
>   04: 00100107
>   08: 06040004
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 0010100F
>   1C: 022001F1
>   20: 42F04000
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000050
>   38: 00000000
>   3C: 000201FF
>
> 10.0.0:
>   00: 20001A03
>   04: 02100102
>   08: 03000041
>   0C: 00000000
>   10: 40000000
>   14: 42000000
>   18: 00000001
>   1C: 00000000
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 20001A03
>   30: 00000000
>   34: 00000040
>   38: 00000000
>   3C: 000001FF
>
> 80.0.0:
>   00: AF00177D
>   04: 00100002
>   08: 06000000
>   0C: 00000010
>   10: 00000000
>   14: 00000000
>   18: 00000000
>   1C: 00000000
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 0000177D
>   30: 00000000
>   34: 00000090
>   38: 00000000
>   3C: 00000000
>
> 80.1.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00818180
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 80.9.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00828280
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 80.b.0:
>   00: AF84177D
>   04: 00100000
>   08: 06040000
>   0C: 00010000
>   10: 00000000
>   14: 00000000
>   18: 00838380
>   1C: 00000000
>   20: 0000FFF0
>   24: 0001FFF1
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000048
>   38: 00000000
>   3C: 00000100
>
> 80.f.0:
>   00: 902614E4
>   04: 00100406
>   08: 0C033000
>   0C: 00800010
>   10: 0000000C
>   14: 00000140
>   18: 0001000C
>   1C: 00000140
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 00000000
>
> 80.f.1:
>   00: 902614E4
>   04: 00100406
>   08: 0C033000
>   0C: 00800010
>   10: 0002000C
>   14: 00000140
>   18: 0003000C
>   1C: 00000140
>   20: 00000000
>   24: 00000000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 00000000
>
> 80.10.0:
>   00: 902714E4
>   04: 00100406
>   08: 01060100
>   0C: 00800010
>   10: 00000000
>   14: 00000000
>   18: 0004000C
>   1C: 00000140
>   20: 00000000
>   24: 60000000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 000000FF
>
> 80.10.1:
>   00: 902714E4
>   04: 00100406
>   08: 01060100
>   0C: 00800010
>   10: 00000000
>   14: 00000000
>   18: 0005000C
>   1C: 00000140
>   20: 00000000
>   24: 60010000
>   28: 00000000
>   2C: 00000000
>   30: 00000000
>   34: 00000080
>   38: 00000000
>   3C: 000000FF
> RAS CONTROLLER: SYSTEM HALTED...
Will Deacon Feb. 18, 2019, 1:59 p.m. UTC | #5
[+James, who knows how to decode these things]

On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <cai@lca.pw> wrote:
> > On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
> > > On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
> > >>
> > >> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
> > >> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
> > >> because there is a burst of too much pointer access, and then KASAN will
> > >> dereference each byte of the shadow address for the tag checking which
> > >> will kill all the CPUs.
> > >
> > > Could you please elaborate what exactly happens and who/why kills
> > > CPUs? Number of memory accesses should not make any difference.
> > > With hardware support (MTE) it won't be possible to disable
> > > instrumentation (loads and stores check tags themselves), so it would
> > > be useful to keep track of exact reasons we disable instrumentation to
> > > know how to deal with them with hardware support.
> > > It would be useful to keep this info in the comment in the Makefile.
> >
> > It turns out sometimes it will trigger a hardware error.
> 
> Please add this to the comment that there is that error, reason is
> unknown, happens from time to time.
> "Too much pointer access" is confusing and does not seem to be the
> root cause (there are lots of source files that cause lots of pointer
> accesses).

I don't think this is directly related to KASAN, as I'm sure we've seen this
RAS error before.

Will

> > # echo function > /sys/kernel/debug/tracing/current_tracer
> >
> > RAS CONTROLLER: Fatal unrecoverable error detected
> >
> >         *** NBU BAR Error ***
> >
> > RAS CONTROLLER: SYSTEM HALTED...
Andrey Konovalov Feb. 18, 2019, 3:25 p.m. UTC | #6
On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
>
> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
> because there is a burst of too much pointer access, and then KASAN will
> dereference each byte of the shadow address for the tag checking which
> will kill all the CPUs.

Hi Qian,

Could you check if adding "CFLAGS_REMOVE_tags.o = -pg" into
mm/kasan/Makefile helps with that?
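
For reference, a minimal sketch of what that change might look like in mm/kasan/Makefile. The reasoning here is an assumption rather than something established in this thread: presumably the function tracer ends up calling into the tag-checking code in tags.c, which is itself traced, so removing -pg from that object would stop the tracer from instrumenting it. The exact placement within the file is also assumed.

```make
# Sketch only, not the final patch: strip the ftrace instrumentation
# flag (-pg) from the software tag-based KASAN runtime object so that
# the function tracer does not trace the tag-checking code itself.
CFLAGS_REMOVE_tags.o = -pg
```

This follows the usual kbuild per-object CFLAGS_REMOVE_<object>.o convention, so it only affects how tags.o is compiled and leaves KASAN instrumentation of kernel/trace/ untouched.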

Thanks!

>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  kernel/trace/Makefile | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index c2b2148bb1d2..fdd547a68385 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -28,6 +28,11 @@ ifdef CONFIG_GCOV_PROFILE_FTRACE
>  GCOV_PROFILE := y
>  endif
>
> +# Too much pointer access will kill hwasan.
> +ifdef CONFIG_KASAN_SW_TAGS
> +KASAN_SANITIZE := n
> +endif
> +
>  CFLAGS_trace_benchmark.o := -I$(src)
>  CFLAGS_trace_events_filter.o := -I$(src)
>
> --
> 2.17.2 (Apple Git-113)
>
Qian Cai Feb. 18, 2019, 3:53 p.m. UTC | #7
On 2/18/19 10:25 AM, Andrey Konovalov wrote:
> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
>>
>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>> because there is a burst of too much pointer access, and then KASAN will
>> dereference each byte of the shadow address for the tag checking which
>> will kill all the CPUs.
> 
> Hi Qian,
> 
> Could you check if adding "CFLAGS_REMOVE_tags.o = -pg" into
> mm/kasan/Makefile helps with that?

Yes, you nailed it!
Andrey Konovalov Feb. 18, 2019, 3:56 p.m. UTC | #8
On Mon, Feb 18, 2019 at 4:53 PM Qian Cai <cai@lca.pw> wrote:
>
>
>
> On 2/18/19 10:25 AM, Andrey Konovalov wrote:
> > On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
> >>
> >> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
> >> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
> >> because there is a burst of too much pointer access, and then KASAN will
> >> dereference each byte of the shadow address for the tag checking which
> >> will kill all the CPUs.
> >
> > Hi Qian,
> >
> > Could you check if adding "CFLAGS_REMOVE_tags.o = -pg" into
> > mm/kasan/Makefile helps with that?
>
> Yes, you nailed it!

Great! I'll send the patch.
Steven Rostedt Feb. 18, 2019, 5:23 p.m. UTC | #9
On Mon, 18 Feb 2019 16:56:44 +0100
Andrey Konovalov <andreyknvl@google.com> wrote:

> On Mon, Feb 18, 2019 at 4:53 PM Qian Cai <cai@lca.pw> wrote:
> >
> >
> >
> > On 2/18/19 10:25 AM, Andrey Konovalov wrote:  
> > > On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:  
> > >>
> > >> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
> > >> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
> > >> because there is a burst of too much pointer access, and then KASAN will
> > >> dereference each byte of the shadow address for the tag checking which
> > >> will kill all the CPUs.  
> > >
> > > Hi Qian,
> > >
> > > Could you check if adding "CFLAGS_REMOVE_tags.o = -pg" into
> > > mm/kasan/Makefile helps with that?  
> >
> > Yes, you nailed it!  
> 
> Great! I'll send the patch.

OK, then I'll ignore the original patch in this thread.

-- Steve
James Morse Feb. 21, 2019, 2:19 p.m. UTC | #10
Hi!

On 18/02/2019 13:59, Will Deacon wrote:
> [+James, who knows how to decode these things]

Decode is a strong term!

This stuff is printed by Cavium's secure-world software. All I'm doing is spotting the
bits that vary between the outputs we've seen!


> On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <cai@lca.pw> wrote:
>>> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
>>>> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@lca.pw> wrote:
>>>>>
>>>>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>>>>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>>>>> because there is a burst of too much pointer access, and then KASAN will
>>>>> dereference each byte of the shadow address for the tag checking which
>>>>> will kill all the CPUs.
>>>>
>>>> Could you please elaborate what exactly happens and who/why kills
>>>> CPUs? Number of memory accesses should not make any difference.
>>>> With hardware support (MTE) it won't be possible to disable
>>>> instrumentation (loads and stores check tags themselves), so it would
>>>> be useful to keep track of exact reasons we disable instrumentation to
>>>> know how to deal with them with hardware support.
>>>> It would be useful to keep this info in the comment in the Makefile.
>>>
>>> It turns out sometimes it will trigger a hardware error.
>>
>> Please add this to the comment that there is that error, reason is
>> unknown, happens from time to time.
>> "Too much pointer access" is confusing and does not seem to be the
>> root cause (there are lots of source files that cause lots of pointer
>> accesses).

> I don't think this is directly related to KASAN, as I'm sure we've seen this
> RAS error before.

Not quite like this. I've had one choke on some PCIe transaction[0].

This looks like corruption detected in a cache associated with a CPU. 'Write back' and
'Physical Address' suggest it's the data cache:


>>>  Node 0 NBU 0 Error report :
>>>  NBU BAR Error
[..]
>>>       Physical Address : 0x40011ff00
>>>
>>> NBU BAR Error : Decoded info :
>>>         Agent info : CPU
>>>             Core ID : 21
>>>             Thread ID : 1
>>>         Requ: type : 4 : Write Back
>>>  Node 0 NBU 1 Error report :
>>>  NBU BAR Error
[..]
>>>       Physical Address : 0x40011ff40
>>>
>>> NBU BAR Error : Decoded info :
>>>         Agent info : CPU
>>>             Core ID : 21
>>>             Thread ID : 1
>>>         Requ: type : 4 : Write Back
>>>  Node 0 NBU 2 Error report :
>>>  NBU BAR Error
[..]
>>>       Physical Address : 0x40011ff80

If you can reproduce it, and it always affects Core:21,Thread:1, I'd suggest offlining
all the threads/CPUs in that core. It may be that one cache is close to some threshold, and you
can offline the core that it's part of.


Thanks,

James


[0] For comparison, I've had one of these during kexec:
# NBU BAR Error : Decoded info :
#        Agent info : IO
#                   : PCIE0
#        Requ: type : 2 : Read

Patch

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c2b2148bb1d2..fdd547a68385 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -28,6 +28,11 @@  ifdef CONFIG_GCOV_PROFILE_FTRACE
 GCOV_PROFILE := y
 endif
 
+# Too much pointer access will kill hwasan.
+ifdef CONFIG_KASAN_SW_TAGS
+KASAN_SANITIZE := n
+endif
+
 CFLAGS_trace_benchmark.o := -I$(src)
 CFLAGS_trace_events_filter.o := -I$(src)