diff mbox series

bcm2711_thermal: Kernel panic - not syncing: Asynchronous SError Interrupt

Message ID 20210210114829.2915de78@gollum (mailing list archive)
State New, archived
Headers show
Series bcm2711_thermal: Kernel panic - not syncing: Asynchronous SError Interrupt | expand

Commit Message

Juerg Haefliger Feb. 10, 2021, 10:48 a.m. UTC
Trying to dump the BCM2711 registers kills the kernel:

# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
0-efc
# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers 

[   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError
[   62.857671] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
[   62.857674] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
[   62.857676] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
[   62.857679] pc : regmap_mmio_read32le+0x1c/0x34
[   62.857681] lr : regmap_mmio_read+0x50/0x80
[   62.857682] sp : ffff8000105c3c00
[   62.857685] x29: ffff8000105c3c00 x28: 0000000000000014 
[   62.857694] x27: 0000000000000014 x26: ffffd2ea1c2060b0 
[   62.857699] x25: ffff4e34408ecc00 x24: 0000000000000efc 
[   62.857704] x23: ffff8000105c3e20 x22: ffff8000105c3d3c 
[   62.857710] x21: ffff8000105c3d3c x20: 0000000000000014 
[   62.857715] x19: ffff4e344037a900 x18: 0000000000000020 
[   62.857720] x17: 0000000000000000 x16: 0000000000000000 
[   62.857725] x15: ffff4e3447ac40f0 x14: 0000000000000003 
[   62.857730] x13: ffff4e34422c0000 x12: ffff4e34422a0046 
[   62.857735] x11: ffffd2ea1c8765e0 x10: 0000000000000000 
[   62.857741] x9 : ffffd2ea1b9495a0 x8 : ffff4e34429ef980 
[   62.857746] x7 : 000000000000000f x6 : ffff4e34422a004b 
[   62.857751] x5 : 00000000fffffff9 x4 : 0000000000000000 
[   62.857757] x3 : ffffd2ea1b949550 x2 : ffffd2ea1b949330 
[   62.857761] x1 : 0000000000000014 x0 : 0000000000000000 
[   62.857767] Kernel panic - not syncing: Asynchronous SError Interrupt
[   62.857770] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
[   62.857773] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
[   62.857775] Call trace:
[   62.857777]  dump_backtrace+0x0/0x1e0
[   62.857778]  show_stack+0x24/0x70
[   62.857780]  dump_stack+0xd0/0x12c
[   62.857782]  panic+0x168/0x370
[   62.857783]  nmi_panic+0x98/0xa0
[   62.857786]  arm64_serror_panic+0x8c/0x98
[   62.857787]  do_serror+0x3c/0x6c
[   62.857789]  el1_error+0x78/0xf0
[   62.857791]  regmap_mmio_read32le+0x1c/0x34
[   62.857793]  _regmap_bus_reg_read+0x24/0x30
[   62.857795]  _regmap_read+0x6c/0x17c
[   62.857797]  regmap_read+0x58/0x84
[   62.857799]  regmap_read_debugfs+0x138/0x3f4
[   62.857801]  regmap_map_read_file+0x34/0x40
[   62.857803]  full_proxy_read+0x6c/0xc0
[   62.857805]  vfs_read+0xb8/0x1e4
[   62.857807]  ksys_read+0x78/0x10c
[   62.857809]  __arm64_sys_read+0x28/0x34
[   62.857811]  el0_svc_common.constprop.0+0x7c/0x194
[   62.857813]  do_el0_svc+0x30/0x9c
[   62.857814]  el0_svc+0x20/0x30
[   62.857816]  el0_sync_handler+0x1a4/0x1b0
[   62.857818]  el0_sync+0x174/0x180
[   62.857842] SMP: stopping secondary CPUs
[   62.857845] Kernel Offset: 0x52ea0b080000 from 0xffff800010000000
[   62.857847] PHYS_OFFSET: 0xffffb1cc00000000
[   62.857849] CPU features: 0x00240022,61806000
[   62.857851] Memory Limit: none

Sprinkling printks around regmap_read [1] shows that reading from 0x14 (20)
seems to cause the issue:


[   40.456230] map=ffff020a069c9c00, from=0, to=3836, count=131072
[   40.462520] map=ffff020a069c9c00, i=0
[   40.466319] ret=0, val=0
[   40.468922] map=ffff020a069c9c00, i=4
[   40.472684] ret=0, val=0
[   40.475292] map=ffff020a069c9c00, i=8
[   40.479048] ret=0, val=0
[   40.481649] map=ffff020a069c9c00, i=12
[   40.485492] ret=0, val=0
[   40.488080] map=ffff020a069c9c00, i=16
[   40.491922] ret=0, val=0
[   40.494523] map=ffff020a069c9c00, i=20
[   40.498497] SError Interrupt on CPU0, code 0xbf000002 -- SError
[   40.498499] CPU: 0 PID: 486 Comm: cat Not tainted 5.11.0-rc7+ #8
[   40.498501] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)


...Juerg

[1]

Comments

Nicolas Saenz Julienne Feb. 10, 2021, 1:15 p.m. UTC | #1
[ Add Robin, Catalin and Florian in case they want to chime in ]

Hi Juerg, thanks for the report!

On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:
> Trying to dump the BCM2711 registers kills the kernel:
> 
> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
> 0-efc
> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers 
> 
> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError

So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
SError,' hence IIUC the rest of the error code is meaningless to anyone outside
of Broadcom/RPi.

The regmap is created through the following syscon device:

	avs_monitor: avs-monitor@7d5d2000 {
		compatible = "brcm,bcm2711-avs-monitor",
			     "syscon", "simple-mfd";
		reg = <0x7d5d2000 0xf00>;

		thermal: thermal {
			compatible = "brcm,bcm2711-thermal";
			#thermal-sensor-cells = <0>;
		};
	};

I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
full of addresses that trigger this same error. Also note that as per Florian's
comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
tell, at least 0x7d5d22b0 seems to be faulty too.

Any ideas/comments? My guess is that those addresses are marked somehow as
secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
the solution is to narrow the register range exposed by avs-monitor to whatever
bcm2711-thermal needs (which is ATM a single 32bit register).

Regards,
Nicolas

[1] https://lore.kernel.org/linux-pm/82125042-684a-b4e2-fbaa-45a393b2ce5e@gmx.net/

> [   62.857671] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
> [   62.857674] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
> [   62.857676] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
> [   62.857679] pc : regmap_mmio_read32le+0x1c/0x34
> [   62.857681] lr : regmap_mmio_read+0x50/0x80
> [   62.857682] sp : ffff8000105c3c00
> [   62.857685] x29: ffff8000105c3c00 x28: 0000000000000014 
> [   62.857694] x27: 0000000000000014 x26: ffffd2ea1c2060b0 
> [   62.857699] x25: ffff4e34408ecc00 x24: 0000000000000efc 
> [   62.857704] x23: ffff8000105c3e20 x22: ffff8000105c3d3c 
> [   62.857710] x21: ffff8000105c3d3c x20: 0000000000000014 
> [   62.857715] x19: ffff4e344037a900 x18: 0000000000000020 
> [   62.857720] x17: 0000000000000000 x16: 0000000000000000 
> [   62.857725] x15: ffff4e3447ac40f0 x14: 0000000000000003 
> [   62.857730] x13: ffff4e34422c0000 x12: ffff4e34422a0046 
> [   62.857735] x11: ffffd2ea1c8765e0 x10: 0000000000000000 
> [   62.857741] x9 : ffffd2ea1b9495a0 x8 : ffff4e34429ef980 
> [   62.857746] x7 : 000000000000000f x6 : ffff4e34422a004b 
> [   62.857751] x5 : 00000000fffffff9 x4 : 0000000000000000 
> [   62.857757] x3 : ffffd2ea1b949550 x2 : ffffd2ea1b949330 
> [   62.857761] x1 : 0000000000000014 x0 : 0000000000000000 
> [   62.857767] Kernel panic - not syncing: Asynchronous SError Interrupt
> [   62.857770] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
> [   62.857773] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
> [   62.857775] Call trace:
> [   62.857777]  dump_backtrace+0x0/0x1e0
> [   62.857778]  show_stack+0x24/0x70
> [   62.857780]  dump_stack+0xd0/0x12c
> [   62.857782]  panic+0x168/0x370
> [   62.857783]  nmi_panic+0x98/0xa0
> [   62.857786]  arm64_serror_panic+0x8c/0x98
> [   62.857787]  do_serror+0x3c/0x6c
> [   62.857789]  el1_error+0x78/0xf0
> [   62.857791]  regmap_mmio_read32le+0x1c/0x34
> [   62.857793]  _regmap_bus_reg_read+0x24/0x30
> [   62.857795]  _regmap_read+0x6c/0x17c
> [   62.857797]  regmap_read+0x58/0x84
> [   62.857799]  regmap_read_debugfs+0x138/0x3f4
> [   62.857801]  regmap_map_read_file+0x34/0x40
> [   62.857803]  full_proxy_read+0x6c/0xc0
> [   62.857805]  vfs_read+0xb8/0x1e4
> [   62.857807]  ksys_read+0x78/0x10c
> [   62.857809]  __arm64_sys_read+0x28/0x34
> [   62.857811]  el0_svc_common.constprop.0+0x7c/0x194
> [   62.857813]  do_el0_svc+0x30/0x9c
> [   62.857814]  el0_svc+0x20/0x30
> [   62.857816]  el0_sync_handler+0x1a4/0x1b0
> [   62.857818]  el0_sync+0x174/0x180
> [   62.857842] SMP: stopping secondary CPUs
> [   62.857845] Kernel Offset: 0x52ea0b080000 from 0xffff800010000000
> [   62.857847] PHYS_OFFSET: 0xffffb1cc00000000
> [   62.857849] CPU features: 0x00240022,61806000
> [   62.857851] Memory Limit: none
> 
> Sprinkling printks around regmap_read [1] shows that reading from 0x14 (20)
> seems to cause the issue:
> 
> 
> [   40.456230] map=ffff020a069c9c00, from=0, to=3836, count=131072
> [   40.462520] map=ffff020a069c9c00, i=0
> [   40.466319] ret=0, val=0
> [   40.468922] map=ffff020a069c9c00, i=4
> [   40.472684] ret=0, val=0
> [   40.475292] map=ffff020a069c9c00, i=8
> [   40.479048] ret=0, val=0
> [   40.481649] map=ffff020a069c9c00, i=12
> [   40.485492] ret=0, val=0
> [   40.488080] map=ffff020a069c9c00, i=16
> [   40.491922] ret=0, val=0
> [   40.494523] map=ffff020a069c9c00, i=20
> [   40.498497] SError Interrupt on CPU0, code 0xbf000002 -- SError
> [   40.498499] CPU: 0 PID: 486 Comm: cat Not tainted 5.11.0-rc7+ #8
> [   40.498501] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
> 
> 
> ...Juerg
> 
> [1]
> diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
> index ff2ee87987c7..9465f5a2f3b8 100644
> --- a/drivers/base/regmap/regmap-debugfs.c
> +++ b/drivers/base/regmap/regmap-debugfs.c
> @@ -229,6 +229,7 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
>         if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
>                 count = PAGE_SIZE << (MAX_ORDER - 1);
>  
> 
> +       printk("map=%px, from=%d, to=%d, count=%ld\n", map, from, to, count);
>         buf = kmalloc(count, GFP_KERNEL);
>         if (!buf)
>                 return -ENOMEM;
> @@ -253,7 +254,9 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
>                         buf_pos += map->debugfs_reg_len + 2;
>  
> 
>                         /* Format the value, write all X if we can't read */
> +                       printk("map=%px, i=%d\n", map, i);
>                         ret = regmap_read(map, i, &val);
> +                       printk("ret=%ld, val=%x\n", ret, val);
>                         if (ret == 0)
>                                 snprintf(buf + buf_pos, count - buf_pos,
>                                          "%.*x", map->debugfs_val_len, val);
>
Juerg Haefliger Feb. 10, 2021, 2:54 p.m. UTC | #2
On Wed, 10 Feb 2021 14:15:46 +0100
Nicolas Saenz Julienne <nsaenzjulienne@suse.de> wrote:

> [ Add Robin, Catalin and Florian in case they want to chime in ]
> 
> Hi Juerg, thanks for the report!
> 
> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:
> > Trying to dump the BCM2711 registers kills the kernel:
> > 
> > # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
> > 0-efc
> > # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers 
> > 
> > [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError  
> 
> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
> of Broadcom/RPi.
> 
> The regmap is created through the following syscon device:
> 
> 	avs_monitor: avs-monitor@7d5d2000 {
> 		compatible = "brcm,bcm2711-avs-monitor",
> 			     "syscon", "simple-mfd";
> 		reg = <0x7d5d2000 0xf00>;
> 
> 		thermal: thermal {
> 			compatible = "brcm,bcm2711-thermal";
> 			#thermal-sensor-cells = <0>;
> 		};
> 	};
> 
> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
> full of addresses that trigger this same error. Also note that as per Florian's
> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
> tell, at least 0x7d5d22b0 seems to be faulty too.
> 
> Any ideas/comments? My guess is that those addresses are marked somehow as
> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
> the solution is to narrow the register range exposed by avs-monitor to whatever
> bcm2711-thermal needs (which is ATM a single 32bit register).

Yeah, that's what I tried but wasn't sure if that's the correct approach or
if there was something wrong with the DTB (which I know virtually nothing
about).

With [1] I get seemingly the correct behavior:

# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2200/range 
0-0
# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2200/registers 
0: 000106fa
# cat /sys/class/thermal/thermal_zone0/temp 
39433

On a different note, how did you come up with that address range in the DTB?
Is that public information?

...Juerg


[1]
diff --git a/arch/arm/boot/dts/bcm2711.dtsi b/arch/arm/boot/dts/bcm2711.dtsi
index 4847dd305317..a7059967aab1 100644
--- a/arch/arm/boot/dts/bcm2711.dtsi
+++ b/arch/arm/boot/dts/bcm2711.dtsi
@@ -65,10 +65,10 @@ gicv2: interrupt-controller@40041000 {
                                                 IRQ_TYPE_LEVEL_HIGH)>;
                };
 
-               avs_monitor: avs-monitor@7d5d2000 {
+               avs_monitor: avs-monitor@7d5d2200 {
                        compatible = "brcm,bcm2711-avs-monitor",
                                     "syscon", "simple-mfd";
-                       reg = <0x7d5d2000 0xf00>;
+                       reg = <0x7d5d2200 0x4>;
 
                        thermal: thermal {
                                compatible = "brcm,bcm2711-thermal";
diff --git a/drivers/thermal/broadcom/bcm2711_thermal.c b/drivers/thermal/broadcom/bcm2711_thermal.c
index 67c2a737bc9d..3b5a84402b89 100644
--- a/drivers/thermal/broadcom/bcm2711_thermal.c
+++ b/drivers/thermal/broadcom/bcm2711_thermal.c
@@ -22,7 +22,7 @@
 
 #include "../thermal_hwmon.h"
 
-#define AVS_RO_TEMP_STATUS             0x200
+#define AVS_RO_TEMP_STATUS             0x0 /* address 0x7d5d2200 */
 #define AVS_RO_TEMP_STATUS_VALID_MSK   (BIT(16) | BIT(10))
 #define AVS_RO_TEMP_STATUS_DATA_MSK    GENMASK(9, 0)




> Regards,
> Nicolas
> 
> [1] https://lore.kernel.org/linux-pm/82125042-684a-b4e2-fbaa-45a393b2ce5e@gmx.net/
> 
> > [   62.857671] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
> > [   62.857674] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
> > [   62.857676] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
> > [   62.857679] pc : regmap_mmio_read32le+0x1c/0x34
> > [   62.857681] lr : regmap_mmio_read+0x50/0x80
> > [   62.857682] sp : ffff8000105c3c00
> > [   62.857685] x29: ffff8000105c3c00 x28: 0000000000000014 
> > [   62.857694] x27: 0000000000000014 x26: ffffd2ea1c2060b0 
> > [   62.857699] x25: ffff4e34408ecc00 x24: 0000000000000efc 
> > [   62.857704] x23: ffff8000105c3e20 x22: ffff8000105c3d3c 
> > [   62.857710] x21: ffff8000105c3d3c x20: 0000000000000014 
> > [   62.857715] x19: ffff4e344037a900 x18: 0000000000000020 
> > [   62.857720] x17: 0000000000000000 x16: 0000000000000000 
> > [   62.857725] x15: ffff4e3447ac40f0 x14: 0000000000000003 
> > [   62.857730] x13: ffff4e34422c0000 x12: ffff4e34422a0046 
> > [   62.857735] x11: ffffd2ea1c8765e0 x10: 0000000000000000 
> > [   62.857741] x9 : ffffd2ea1b9495a0 x8 : ffff4e34429ef980 
> > [   62.857746] x7 : 000000000000000f x6 : ffff4e34422a004b 
> > [   62.857751] x5 : 00000000fffffff9 x4 : 0000000000000000 
> > [   62.857757] x3 : ffffd2ea1b949550 x2 : ffffd2ea1b949330 
> > [   62.857761] x1 : 0000000000000014 x0 : 0000000000000000 
> > [   62.857767] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [   62.857770] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
> > [   62.857773] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
> > [   62.857775] Call trace:
> > [   62.857777]  dump_backtrace+0x0/0x1e0
> > [   62.857778]  show_stack+0x24/0x70
> > [   62.857780]  dump_stack+0xd0/0x12c
> > [   62.857782]  panic+0x168/0x370
> > [   62.857783]  nmi_panic+0x98/0xa0
> > [   62.857786]  arm64_serror_panic+0x8c/0x98
> > [   62.857787]  do_serror+0x3c/0x6c
> > [   62.857789]  el1_error+0x78/0xf0
> > [   62.857791]  regmap_mmio_read32le+0x1c/0x34
> > [   62.857793]  _regmap_bus_reg_read+0x24/0x30
> > [   62.857795]  _regmap_read+0x6c/0x17c
> > [   62.857797]  regmap_read+0x58/0x84
> > [   62.857799]  regmap_read_debugfs+0x138/0x3f4
> > [   62.857801]  regmap_map_read_file+0x34/0x40
> > [   62.857803]  full_proxy_read+0x6c/0xc0
> > [   62.857805]  vfs_read+0xb8/0x1e4
> > [   62.857807]  ksys_read+0x78/0x10c
> > [   62.857809]  __arm64_sys_read+0x28/0x34
> > [   62.857811]  el0_svc_common.constprop.0+0x7c/0x194
> > [   62.857813]  do_el0_svc+0x30/0x9c
> > [   62.857814]  el0_svc+0x20/0x30
> > [   62.857816]  el0_sync_handler+0x1a4/0x1b0
> > [   62.857818]  el0_sync+0x174/0x180
> > [   62.857842] SMP: stopping secondary CPUs
> > [   62.857845] Kernel Offset: 0x52ea0b080000 from 0xffff800010000000
> > [   62.857847] PHYS_OFFSET: 0xffffb1cc00000000
> > [   62.857849] CPU features: 0x00240022,61806000
> > [   62.857851] Memory Limit: none
> > 
> > Sprinkling printks around regmap_read [1] shows that reading from 0x14 (20)
> > seems to cause the issue:
> > 
> > 
> > [   40.456230] map=ffff020a069c9c00, from=0, to=3836, count=131072
> > [   40.462520] map=ffff020a069c9c00, i=0
> > [   40.466319] ret=0, val=0
> > [   40.468922] map=ffff020a069c9c00, i=4
> > [   40.472684] ret=0, val=0
> > [   40.475292] map=ffff020a069c9c00, i=8
> > [   40.479048] ret=0, val=0
> > [   40.481649] map=ffff020a069c9c00, i=12
> > [   40.485492] ret=0, val=0
> > [   40.488080] map=ffff020a069c9c00, i=16
> > [   40.491922] ret=0, val=0
> > [   40.494523] map=ffff020a069c9c00, i=20
> > [   40.498497] SError Interrupt on CPU0, code 0xbf000002 -- SError
> > [   40.498499] CPU: 0 PID: 486 Comm: cat Not tainted 5.11.0-rc7+ #8
> > [   40.498501] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
> > 
> > 
> > ...Juerg
> > 
> > [1]
> > diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
> > index ff2ee87987c7..9465f5a2f3b8 100644
> > --- a/drivers/base/regmap/regmap-debugfs.c
> > +++ b/drivers/base/regmap/regmap-debugfs.c
> > @@ -229,6 +229,7 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
> >         if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
> >                 count = PAGE_SIZE << (MAX_ORDER - 1);
> >  
> > 
> > +       printk("map=%px, from=%d, to=%d, count=%ld\n", map, from, to, count);
> >         buf = kmalloc(count, GFP_KERNEL);
> >         if (!buf)
> >                 return -ENOMEM;
> > @@ -253,7 +254,9 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
> >                         buf_pos += map->debugfs_reg_len + 2;
> >  
> > 
> >                         /* Format the value, write all X if we can't read */
> > +                       printk("map=%px, i=%d\n", map, i);
> >                         ret = regmap_read(map, i, &val);
> > +                       printk("ret=%ld, val=%x\n", ret, val);
> >                         if (ret == 0)
> >                                 snprintf(buf + buf_pos, count - buf_pos,
> >                                          "%.*x", map->debugfs_val_len, val);
> >   
> 
> 
>
Robin Murphy Feb. 10, 2021, 4:25 p.m. UTC | #3
On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:
> [ Add Robin, Catalin and Florian in case they want to chime in ]
> 
> Hi Juerg, thanks for the report!
> 
> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:
>> Trying to dump the BCM2711 registers kills the kernel:
>>
>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
>> 0-efc
>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
>>
>> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError
> 
> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
> of Broadcom/RPi.

It's imp-def from the architecture's PoV, but the implementation in this 
case is Cortex-A72, where 0x000002 means an attributable, containable 
Slave Error:

https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en

In other words, the thing at the other end of an interconnect 
transaction said "no" :)

(The fact that Cortex-A72 gets too far ahead of itself to take it as a 
synchronous external abort is a mild annoyance, but hey...)

> The regmap is created through the following syscon device:
> 
> 	avs_monitor: avs-monitor@7d5d2000 {
> 		compatible = "brcm,bcm2711-avs-monitor",
> 			     "syscon", "simple-mfd";
> 		reg = <0x7d5d2000 0xf00>;
> 
> 		thermal: thermal {
> 			compatible = "brcm,bcm2711-thermal";
> 			#thermal-sensor-cells = <0>;
> 		};
> 	};
> 
> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
> full of addresses that trigger this same error. Also note that as per Florian's
> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
> tell, at least 0x7d5d22b0 seems to be faulty too.
> 
> Any ideas/comments? My guess is that those addresses are marked somehow as
> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
> the solution is to narrow the register range exposed by avs-monitor to whatever
> bcm2711-thermal needs (which is ATM a single 32bit register).

When a peripheral decodes a region of address space, nobody says it has 
to accept accesses to *every* address in that space; registers may be 
sparsely populated, and although some devices might be "nice" and make 
unused areas behave as RAZ/WI, others may throw slave errors if you poke 
at the wrong places. As you note, in a TrustZone-aware device some 
registers may only exist in one or other of the Secure/Non-Secure 
address spaces.

Even when there is a defined register at a given address, it still 
doesn't necessarily accept all possible types of access; it wouldn't be 
particularly friendly, but a device *could* have, say, some registers 
that support 32-bit accesses and others that only support 16-bit 
accesses, and thus throw slave errors if you do the wrong thing in the 
wrong place.

It really all depends on the device itself.

Robin.

> 
> Regards,
> Nicolas
> 
> [1] https://lore.kernel.org/linux-pm/82125042-684a-b4e2-fbaa-45a393b2ce5e@gmx.net/
> 
>> [   62.857671] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
>> [   62.857674] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
>> [   62.857676] pstate: 20000085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
>> [   62.857679] pc : regmap_mmio_read32le+0x1c/0x34
>> [   62.857681] lr : regmap_mmio_read+0x50/0x80
>> [   62.857682] sp : ffff8000105c3c00
>> [   62.857685] x29: ffff8000105c3c00 x28: 0000000000000014
>> [   62.857694] x27: 0000000000000014 x26: ffffd2ea1c2060b0
>> [   62.857699] x25: ffff4e34408ecc00 x24: 0000000000000efc
>> [   62.857704] x23: ffff8000105c3e20 x22: ffff8000105c3d3c
>> [   62.857710] x21: ffff8000105c3d3c x20: 0000000000000014
>> [   62.857715] x19: ffff4e344037a900 x18: 0000000000000020
>> [   62.857720] x17: 0000000000000000 x16: 0000000000000000
>> [   62.857725] x15: ffff4e3447ac40f0 x14: 0000000000000003
>> [   62.857730] x13: ffff4e34422c0000 x12: ffff4e34422a0046
>> [   62.857735] x11: ffffd2ea1c8765e0 x10: 0000000000000000
>> [   62.857741] x9 : ffffd2ea1b9495a0 x8 : ffff4e34429ef980
>> [   62.857746] x7 : 000000000000000f x6 : ffff4e34422a004b
>> [   62.857751] x5 : 00000000fffffff9 x4 : 0000000000000000
>> [   62.857757] x3 : ffffd2ea1b949550 x2 : ffffd2ea1b949330
>> [   62.857761] x1 : 0000000000000014 x0 : 0000000000000000
>> [   62.857767] Kernel panic - not syncing: Asynchronous SError Interrupt
>> [   62.857770] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
>> [   62.857773] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
>> [   62.857775] Call trace:
>> [   62.857777]  dump_backtrace+0x0/0x1e0
>> [   62.857778]  show_stack+0x24/0x70
>> [   62.857780]  dump_stack+0xd0/0x12c
>> [   62.857782]  panic+0x168/0x370
>> [   62.857783]  nmi_panic+0x98/0xa0
>> [   62.857786]  arm64_serror_panic+0x8c/0x98
>> [   62.857787]  do_serror+0x3c/0x6c
>> [   62.857789]  el1_error+0x78/0xf0
>> [   62.857791]  regmap_mmio_read32le+0x1c/0x34
>> [   62.857793]  _regmap_bus_reg_read+0x24/0x30
>> [   62.857795]  _regmap_read+0x6c/0x17c
>> [   62.857797]  regmap_read+0x58/0x84
>> [   62.857799]  regmap_read_debugfs+0x138/0x3f4
>> [   62.857801]  regmap_map_read_file+0x34/0x40
>> [   62.857803]  full_proxy_read+0x6c/0xc0
>> [   62.857805]  vfs_read+0xb8/0x1e4
>> [   62.857807]  ksys_read+0x78/0x10c
>> [   62.857809]  __arm64_sys_read+0x28/0x34
>> [   62.857811]  el0_svc_common.constprop.0+0x7c/0x194
>> [   62.857813]  do_el0_svc+0x30/0x9c
>> [   62.857814]  el0_svc+0x20/0x30
>> [   62.857816]  el0_sync_handler+0x1a4/0x1b0
>> [   62.857818]  el0_sync+0x174/0x180
>> [   62.857842] SMP: stopping secondary CPUs
>> [   62.857845] Kernel Offset: 0x52ea0b080000 from 0xffff800010000000
>> [   62.857847] PHYS_OFFSET: 0xffffb1cc00000000
>> [   62.857849] CPU features: 0x00240022,61806000
>> [   62.857851] Memory Limit: none
>>
>> Sprinkling printks around regmap_read [1] shows that reading from 0x14 (20)
>> seems to cause the issue:
>>
>>
>> [   40.456230] map=ffff020a069c9c00, from=0, to=3836, count=131072
>> [   40.462520] map=ffff020a069c9c00, i=0
>> [   40.466319] ret=0, val=0
>> [   40.468922] map=ffff020a069c9c00, i=4
>> [   40.472684] ret=0, val=0
>> [   40.475292] map=ffff020a069c9c00, i=8
>> [   40.479048] ret=0, val=0
>> [   40.481649] map=ffff020a069c9c00, i=12
>> [   40.485492] ret=0, val=0
>> [   40.488080] map=ffff020a069c9c00, i=16
>> [   40.491922] ret=0, val=0
>> [   40.494523] map=ffff020a069c9c00, i=20
>> [   40.498497] SError Interrupt on CPU0, code 0xbf000002 -- SError
>> [   40.498499] CPU: 0 PID: 486 Comm: cat Not tainted 5.11.0-rc7+ #8
>> [   40.498501] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
>>
>>
>> ...Juerg
>>
>> [1]
>> diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
>> index ff2ee87987c7..9465f5a2f3b8 100644
>> --- a/drivers/base/regmap/regmap-debugfs.c
>> +++ b/drivers/base/regmap/regmap-debugfs.c
>> @@ -229,6 +229,7 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
>>          if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
>>                  count = PAGE_SIZE << (MAX_ORDER - 1);
>>   
>>
>> +       printk("map=%px, from=%d, to=%d, count=%ld\n", map, from, to, count);
>>          buf = kmalloc(count, GFP_KERNEL);
>>          if (!buf)
>>                  return -ENOMEM;
>> @@ -253,7 +254,9 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
>>                          buf_pos += map->debugfs_reg_len + 2;
>>   
>>
>>                          /* Format the value, write all X if we can't read */
>> +                       printk("map=%px, i=%d\n", map, i);
>>                          ret = regmap_read(map, i, &val);
>> +                       printk("ret=%ld, val=%x\n", ret, val);
>>                          if (ret == 0)
>>                                  snprintf(buf + buf_pos, count - buf_pos,
>>                                           "%.*x", map->debugfs_val_len, val);
>>
> 
> 
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
Nicolas Saenz Julienne Feb. 10, 2021, 4:55 p.m. UTC | #4
Hi Robin,

On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:
> On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:
> > [ Add Robin, Catalin and Florian in case they want to chime in ]
> > 
> > Hi Juerg, thanks for the report!
> > 
> > On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:
> > > Trying to dump the BCM2711 registers kills the kernel:
> > > 
> > > # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
> > > 0-efc
> > > # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
> > > 
> > > [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError
> > 
> > So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
> > SError,' hence IIUC the rest of the error code is meaningless to anyone outside
> > of Broadcom/RPi.
> 
> It's imp-def from the architecture's PoV, but the implementation in this 
> case is Cortex-A72, where 0x000002 means an attributable, containable 
> Slave Error:
> 
> https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en
> 
> In other words, the thing at the other end of an interconnect 
> transaction said "no" :)
> 
> (The fact that Cortex-A72 gets too far ahead of itself to take it as a 
> synchronous external abort is a mild annoyance, but hey...)

Thanks for both your clarifications! Reading arm documentation is a skill on
its own.

> > The regmap is created through the following syscon device:
> > 
> > 	avs_monitor: avs-monitor@7d5d2000 {
> > 		compatible = "brcm,bcm2711-avs-monitor",
> > 			     "syscon", "simple-mfd";
> > 		reg = <0x7d5d2000 0xf00>;
> > 
> > 		thermal: thermal {
> > 			compatible = "brcm,bcm2711-thermal";
> > 			#thermal-sensor-cells = <0>;
> > 		};
> > 	};
> > 
> > I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
> > full of addresses that trigger this same error. Also note that as per Florian's
> > comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
> > tell, at least 0x7d5d22b0 seems to be faulty too.
> > 
> > Any ideas/comments? My guess is that those addresses are marked somehow as
> > secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
> > the solution is to narrow the register range exposed by avs-monitor to whatever
> > bcm2711-thermal needs (which is ATM a single 32bit register).
> 
> When a peripheral decodes a region of address space, nobody says it has 
> to accept accesses to *every* address in that space; registers may be 
> sparsely populated, and although some devices might be "nice" and make 
> unused areas behave as RAZ/WI, others may throw slave errors if you poke 
> at the wrong places. As you note, in a TrustZone-aware device some 
> registers may only exist in one or other of the Secure/Non-Secure 
> address spaces.
> 
> Even when there is a defined register at a given address, it still 
> doesn't necessarily accept all possible types of access; it wouldn't be 
> particularly friendly, but a device *could* have, say, some registers 
> that support 32-bit accesses and others that only support 16-bit 
> accesses, and thus throw slave errors if you do the wrong thing in the 
> wrong place.
> 
> It really all depends on the device itself.

All in all, assuming there is no special device quirk to apply, the feeling I'm
getting is to just let the error be. As you hint, firmware has no blame here,
and debugfs is a 'best effort, zero guarantees' interface after all.

Regards,
Nicolas
Florian Fainelli Feb. 10, 2021, 10:59 p.m. UTC | #5
On 2/10/2021 8:55 AM, Nicolas Saenz Julienne wrote:
> Hi Robin,
> 
> On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:
>> On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:
>>> [ Add Robin, Catalin and Florian in case they want to chime in ]
>>>
>>> Hi Juerg, thanks for the report!
>>>
>>> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:
>>>> Trying to dump the BCM2711 registers kills the kernel:
>>>>
>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
>>>> 0-efc
>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
>>>>
>>>> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError
>>>
>>> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
>>> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
>>> of Broadcom/RPi.
>>
>> It's imp-def from the architecture's PoV, but the implementation in this 
>> case is Cortex-A72, where 0x000002 means an attributable, containable 
>> Slave Error:
>>
>> https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en
>>
>> In other words, the thing at the other end of an interconnect 
>> transaction said "no" :)
>>
>> (The fact that Cortex-A72 gets too far ahead of itself to take it as a 
>> synchronous external abort is a mild annoyance, but hey...)
> 
> Thanks for both your clarifications! Reading arm documentation is a skill on
> its own.

Yes it is.

> 
>>> The regmap is created through the following syscon device:
>>>
>>> 	avs_monitor: avs-monitor@7d5d2000 {
>>> 		compatible = "brcm,bcm2711-avs-monitor",
>>> 			     "syscon", "simple-mfd";
>>> 		reg = <0x7d5d2000 0xf00>;
>>>
>>> 		thermal: thermal {
>>> 			compatible = "brcm,bcm2711-thermal";
>>> 			#thermal-sensor-cells = <0>;
>>> 		};
>>> 	};
>>>
>>> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
>>> full of addresses that trigger this same error. Also note that as per Florian's
>>> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
>>> tell, at least 0x7d5d22b0 seems to be faulty too.
>>>
>>> Any ideas/comments? My guess is that those addresses are marked somehow as
>>> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
>>> the solution is to narrow the register range exposed by avs-monitor to whatever
>>> bcm2711-thermal needs (which is ATM a single 32bit register).
>>
>> When a peripheral decodes a region of address space, nobody says it has 
>> to accept accesses to *every* address in that space; registers may be 
>> sparsely populated, and although some devices might be "nice" and make 
>> unused areas behave as RAZ/WI, others may throw slave errors if you poke 
>> at the wrong places. As you note, in a TrustZone-aware device some 
>> registers may only exist in one or other of the Secure/Non-Secure 
>> address spaces.
>>
>> Even when there is a defined register at a given address, it still 
>> doesn't necessarily accept all possible types of access; it wouldn't be 
>> particularly friendly, but a device *could* have, say, some registers 
>> that support 32-bit accesses and others that only support 16-bit 
>> accesses, and thus throw slave errors if you do the wrong thing in the 
>> wrong place.
>>
>> It really all depends on the device itself.
> 
> All in all, assuming there is no special device quirk to apply, the feeling I'm
> getting is to just let the error be. As you hint, firmware has no blame here,
> and debugfs is a 'best effort, zero guarantees' interface after all.

We should probably fill a regmap_access_table to deny reading registers
for which there is no address decoding and possibly another one to deny
writing to the read-only registers.
Juerg Haefliger July 27, 2022, 8:05 a.m. UTC | #6
On Wed, 10 Feb 2021 14:59:45 -0800
Florian Fainelli <f.fainelli@gmail.com> wrote:

> On 2/10/2021 8:55 AM, Nicolas Saenz Julienne wrote:
> > Hi Robin,
> > 
> > On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:  
> >> On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:  
> >>> [ Add Robin, Catalin and Florian in case they want to chime in ]
> >>>
> >>> Hi Juerg, thanks for the report!
> >>>
> >>> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:  
> >>>> Trying to dump the BCM2711 registers kills the kernel:
> >>>>
> >>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
> >>>> 0-efc
> >>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
> >>>>
> >>>> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError  
> >>>
> >>> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
> >>> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
> >>> of Broadcom/RPi.  
> >>
> >> It's imp-def from the architecture's PoV, but the implementation in this 
> >> case is Cortex-A72, where 0x000002 means an attributable, containable 
> >> Slave Error:
> >>
> >> https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en
> >>
> >> In other words, the thing at the other end of an interconnect 
> >> transaction said "no" :)
> >>
> >> (The fact that Cortex-A72 gets too far ahead of itself to take it as a 
> >> synchronous external abort is a mild annoyance, but hey...)  
> > 
> > Thanks for both your clarifications! Reading arm documentation is a skill on
> > its own.  
> 
> Yes it is.
> 
> >   
> >>> The regmap is created through the following syscon device:
> >>>
> >>> 	avs_monitor: avs-monitor@7d5d2000 {
> >>> 		compatible = "brcm,bcm2711-avs-monitor",
> >>> 			     "syscon", "simple-mfd";
> >>> 		reg = <0x7d5d2000 0xf00>;
> >>>
> >>> 		thermal: thermal {
> >>> 			compatible = "brcm,bcm2711-thermal";
> >>> 			#thermal-sensor-cells = <0>;
> >>> 		};
> >>> 	};
> >>>
> >>> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
> >>> full of addresses that trigger this same error. Also note that as per Florian's
> >>> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
> >>> tell, at least 0x7d5d22b0 seems to be faulty too.
> >>>
> >>> Any ideas/comments? My guess is that those addresses are marked somehow as
> >>> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
> >>> the solution is to narrow the register range exposed by avs-monitor to whatever
> >>> bcm2711-thermal needs (which is ATM a single 32bit register).  
> >>
> >> When a peripheral decodes a region of address space, nobody says it has 
> >> to accept accesses to *every* address in that space; registers may be 
> >> sparsely populated, and although some devices might be "nice" and make 
> >> unused areas behave as RAZ/WI, others may throw slave errors if you poke 
> >> at the wrong places. As you note, in a TrustZone-aware device some 
> >> registers may only exist in one or other of the Secure/Non-Secure 
> >> address spaces.
> >>
> >> Even when there is a defined register at a given address, it still 
> >> doesn't necessarily accept all possible types of access; it wouldn't be 
> >> particularly friendly, but a device *could* have, say, some registers 
> >> that support 32-bit accesses and others that only support 16-bit 
> >> accesses, and thus throw slave errors if you do the wrong thing in the 
> >> wrong place.
> >>
> >> It really all depends on the device itself.  
> > 
> > All in all, assuming there is no special device quirk to apply, the feeling I'm
> > getting is to just let the error be. As you hint, firmware has no blame here,
> > and debugfs is a 'best effort, zero guarantees' interface after all.  
> 
> We should probably fill a regmap_access_table to deny reading registers
> for which there is no address decoding and possibly another one to deny
> writing to the read-only registers.


Below is a patch that adds a read access table but it seems wrong to include
'internal.h' and add the table in the thermal driver. Shouldn't this happen
in a higher layer, somehow between syscon and the thermal node?

...Juerg


diff --git a/drivers/thermal/broadcom/bcm2711_thermal.c b/drivers/thermal/broadcom/bcm2711_thermal.c
index 6e2ff710b2ec..a831c33f6d9a 100644
--- a/drivers/thermal/broadcom/bcm2711_thermal.c
+++ b/drivers/thermal/broadcom/bcm2711_thermal.c
@@ -21,6 +21,7 @@
 #include <linux/thermal.h>
 
 #include "../thermal_hwmon.h"
+#include "../../base/regmap/internal.h"
 
 #define AVS_RO_TEMP_STATUS             0x200
 #define AVS_RO_TEMP_STATUS_VALID_MSK   (BIT(16) | BIT(10))
@@ -67,6 +68,32 @@ static const struct of_device_id bcm2711_thermal_id_table[] = {
 };
 MODULE_DEVICE_TABLE(of, bcm2711_thermal_id_table);
 
+/* Readable register ranges.
+ * Ranges determined experimentally by reading every register. Non-readable
+ * register reads cause SError exceptions. */
+static const struct regmap_range bcm2711_thermal_rd_ranges[] = {
+       regmap_reg_range(0x000, 0x010),
+       regmap_reg_range(0x034, 0x044),
+       regmap_reg_range(0x068, 0x098),
+       regmap_reg_range(0x0ac, 0x0c8),
+       regmap_reg_range(0x100, 0x100),
+       regmap_reg_range(0x108, 0x108),
+       regmap_reg_range(0x110, 0x124),
+       regmap_reg_range(0x200, 0x2ac),
+       regmap_reg_range(0x2e0, 0x2e0),
+       regmap_reg_range(0x800, 0x810),
+       regmap_reg_range(0xd00, 0xd8c),
+       regmap_reg_range(0xdd0, 0xdd4),
+       regmap_reg_range(0xdf8, 0xe8c),
+       regmap_reg_range(0xed0, 0xed4),
+       regmap_reg_range(0xef8, 0xefc),
+};
+
+static const struct regmap_access_table bcm2711_thermal_rd_table = {
+       .yes_ranges     = bcm2711_thermal_rd_ranges,
+       .n_yes_ranges   = ARRAY_SIZE(bcm2711_thermal_rd_ranges),
+};
+
 static int bcm2711_thermal_probe(struct platform_device *pdev)
 {
        struct thermal_zone_device *thermal;
@@ -90,6 +117,7 @@ static int bcm2711_thermal_probe(struct platform_device *pdev)
                return ret;
        }
        priv->regmap = regmap;
+       priv->regmap->rd_table = &bcm2711_thermal_rd_table;
 
        thermal = devm_thermal_zone_of_sensor_register(dev, 0, priv,
                                                       &bcm2711_thermal_of_ops);
Florian Fainelli July 27, 2022, 9:51 p.m. UTC | #7
On 7/27/22 01:05, Juerg Haefliger wrote:
> On Wed, 10 Feb 2021 14:59:45 -0800
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> 
>> On 2/10/2021 8:55 AM, Nicolas Saenz Julienne wrote:
>>> Hi Robin,
>>>
>>> On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:  
>>>> On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:  
>>>>> [ Add Robin, Catalin and Florian in case they want to chime in ]
>>>>>
>>>>> Hi Juerg, thanks for the report!
>>>>>
>>>>> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:  
>>>>>> Trying to dump the BCM2711 registers kills the kernel:
>>>>>>
>>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
>>>>>> 0-efc
>>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
>>>>>>
>>>>>> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError  
>>>>>
>>>>> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
>>>>> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
>>>>> of Broadcom/RPi.  
>>>>
>>>> It's imp-def from the architecture's PoV, but the implementation in this 
>>>> case is Cortex-A72, where 0x000002 means an attributable, containable 
>>>> Slave Error:
>>>>
>>>> https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en
>>>>
>>>> In other words, the thing at the other end of an interconnect 
>>>> transaction said "no" :)
>>>>
>>>> (The fact that Cortex-A72 gets too far ahead of itself to take it as a 
>>>> synchronous external abort is a mild annoyance, but hey...)  
>>>
>>> Thanks for both your clarifications! Reading arm documentation is a skill on
>>> its own.  
>>
>> Yes it is.
>>
>>>   
>>>>> The regmap is created through the following syscon device:
>>>>>
>>>>> 	avs_monitor: avs-monitor@7d5d2000 {
>>>>> 		compatible = "brcm,bcm2711-avs-monitor",
>>>>> 			     "syscon", "simple-mfd";
>>>>> 		reg = <0x7d5d2000 0xf00>;
>>>>>
>>>>> 		thermal: thermal {
>>>>> 			compatible = "brcm,bcm2711-thermal";
>>>>> 			#thermal-sensor-cells = <0>;
>>>>> 		};
>>>>> 	};
>>>>>
>>>>> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
>>>>> full of addresses that trigger this same error. Also note that as per Florian's
>>>>> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
>>>>> tell, at least 0x7d5d22b0 seems to be faulty too.
>>>>>
>>>>> Any ideas/comments? My guess is that those addresses are marked somehow as
>>>>> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
>>>>> the solution is to narrow the register range exposed by avs-monitor to whatever
>>>>> bcm2711-thermal needs (which is ATM a single 32bit register).  
>>>>
>>>> When a peripheral decodes a region of address space, nobody says it has 
>>>> to accept accesses to *every* address in that space; registers may be 
>>>> sparsely populated, and although some devices might be "nice" and make 
>>>> unused areas behave as RAZ/WI, others may throw slave errors if you poke 
>>>> at the wrong places. As you note, in a TrustZone-aware device some 
>>>> registers may only exist in one or other of the Secure/Non-Secure 
>>>> address spaces.
>>>>
>>>> Even when there is a defined register at a given address, it still 
>>>> doesn't necessarily accept all possible types of access; it wouldn't be 
>>>> particularly friendly, but a device *could* have, say, some registers 
>>>> that support 32-bit accesses and others that only support 16-bit 
>>>> accesses, and thus throw slave errors if you do the wrong thing in the 
>>>> wrong place.
>>>>
>>>> It really all depends on the device itself.  
>>>
>>> All in all, assuming there is no special device quirk to apply, the feeling I'm
>>> getting is to just let the error be. As you hint, firmware has no blame here,
>>> and debugfs is a 'best effort, zero guarantees' interface after all.  
>>
>> We should probably fill a regmap_access_table to deny reading registers
>> for which there is no address decoding and possibly another one to deny
>> writing to the read-only registers.
> 
> 
> Below is a patch that adds a read access table but it seems wrong to include
> 'internal.h' and add the table in the thermal driver. Shouldn't this happen
> in a higher layer, somehow between syscon and the thermal node?

What is the purpose of doing doing this though that cannot already be done using devmem/devmem2 if the point is explore the address space?
Juerg Haefliger July 28, 2022, 9:06 a.m. UTC | #8
On Wed, 27 Jul 2022 14:51:24 -0700
Florian Fainelli <f.fainelli@gmail.com> wrote:

> On 7/27/22 01:05, Juerg Haefliger wrote:
> > On Wed, 10 Feb 2021 14:59:45 -0800
> > Florian Fainelli <f.fainelli@gmail.com> wrote:
> >   
> >> On 2/10/2021 8:55 AM, Nicolas Saenz Julienne wrote:  
> >>> Hi Robin,
> >>>
> >>> On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:    
> >>>> On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:    
> >>>>> [ Add Robin, Catalin and Florian in case they want to chime in ]
> >>>>>
> >>>>> Hi Juerg, thanks for the report!
> >>>>>
> >>>>> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:    
> >>>>>> Trying to dump the BCM2711 registers kills the kernel:
> >>>>>>
> >>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
> >>>>>> 0-efc
> >>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
> >>>>>>
> >>>>>> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError    
> >>>>>
> >>>>> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
> >>>>> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
> >>>>> of Broadcom/RPi.    
> >>>>
> >>>> It's imp-def from the architecture's PoV, but the implementation in this 
> >>>> case is Cortex-A72, where 0x000002 means an attributable, containable 
> >>>> Slave Error:
> >>>>
> >>>> https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en
> >>>>
> >>>> In other words, the thing at the other end of an interconnect 
> >>>> transaction said "no" :)
> >>>>
> >>>> (The fact that Cortex-A72 gets too far ahead of itself to take it as a 
> >>>> synchronous external abort is a mild annoyance, but hey...)    
> >>>
> >>> Thanks for both your clarifications! Reading arm documentation is a skill on
> >>> its own.    
> >>
> >> Yes it is.
> >>  
> >>>     
> >>>>> The regmap is created through the following syscon device:
> >>>>>
> >>>>> 	avs_monitor: avs-monitor@7d5d2000 {
> >>>>> 		compatible = "brcm,bcm2711-avs-monitor",
> >>>>> 			     "syscon", "simple-mfd";
> >>>>> 		reg = <0x7d5d2000 0xf00>;
> >>>>>
> >>>>> 		thermal: thermal {
> >>>>> 			compatible = "brcm,bcm2711-thermal";
> >>>>> 			#thermal-sensor-cells = <0>;
> >>>>> 		};
> >>>>> 	};
> >>>>>
> >>>>> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
> >>>>> full of addresses that trigger this same error. Also note that as per Florian's
> >>>>> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
> >>>>> tell, at least 0x7d5d22b0 seems to be faulty too.
> >>>>>
> >>>>> Any ideas/comments? My guess is that those addresses are marked somehow as
> >>>>> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
> >>>>> the solution is to narrow the register range exposed by avs-monitor to whatever
> >>>>> bcm2711-thermal needs (which is ATM a single 32bit register).    
> >>>>
> >>>> When a peripheral decodes a region of address space, nobody says it has 
> >>>> to accept accesses to *every* address in that space; registers may be 
> >>>> sparsely populated, and although some devices might be "nice" and make 
> >>>> unused areas behave as RAZ/WI, others may throw slave errors if you poke 
> >>>> at the wrong places. As you note, in a TrustZone-aware device some 
> >>>> registers may only exist in one or other of the Secure/Non-Secure 
> >>>> address spaces.
> >>>>
> >>>> Even when there is a defined register at a given address, it still 
> >>>> doesn't necessarily accept all possible types of access; it wouldn't be 
> >>>> particularly friendly, but a device *could* have, say, some registers 
> >>>> that support 32-bit accesses and others that only support 16-bit 
> >>>> accesses, and thus throw slave errors if you do the wrong thing in the 
> >>>> wrong place.
> >>>>
> >>>> It really all depends on the device itself.    
> >>>
> >>> All in all, assuming there is no special device quirk to apply, the feeling I'm
> >>> getting is to just let the error be. As you hint, firmware has no blame here,
> >>> and debugfs is a 'best effort, zero guarantees' interface after all.    
> >>
> >> We should probably fill a regmap_access_table to deny reading registers
> >> for which there is no address decoding and possibly another one to deny
> >> writing to the read-only registers.  
> > 
> > 
> > Below is a patch that adds a read access table but it seems wrong to include
> > 'internal.h' and add the table in the thermal driver. Shouldn't this happen
> > in a higher layer, somehow between syscon and the thermal node?  
> 
> What is the purpose of doing doing this though that cannot already be done using devmem/devmem2 if the point is explore the address space?

The goal is to prevent a kernel crash when doing
$ cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers

...Juerg
Florian Fainelli Aug. 1, 2022, 3:34 p.m. UTC | #9
On 7/28/2022 2:06 AM, Juerg Haefliger wrote:
> On Wed, 27 Jul 2022 14:51:24 -0700
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> 
>> On 7/27/22 01:05, Juerg Haefliger wrote:
>>> On Wed, 10 Feb 2021 14:59:45 -0800
>>> Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>    
>>>> On 2/10/2021 8:55 AM, Nicolas Saenz Julienne wrote:
>>>>> Hi Robin,
>>>>>
>>>>> On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:
>>>>>> On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:
>>>>>>> [ Add Robin, Catalin and Florian in case they want to chime in ]
>>>>>>>
>>>>>>> Hi Juerg, thanks for the report!
>>>>>>>
>>>>>>> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:
>>>>>>>> Trying to dump the BCM2711 registers kills the kernel:
>>>>>>>>
>>>>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
>>>>>>>> 0-efc
>>>>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
>>>>>>>>
>>>>>>>> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError
>>>>>>>
>>>>>>> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
>>>>>>> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
>>>>>>> of Broadcom/RPi.
>>>>>>
>>>>>> It's imp-def from the architecture's PoV, but the implementation in this
>>>>>> case is Cortex-A72, where 0x000002 means an attributable, containable
>>>>>> Slave Error:
>>>>>>
>>>>>> https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en
>>>>>>
>>>>>> In other words, the thing at the other end of an interconnect
>>>>>> transaction said "no" :)
>>>>>>
>>>>>> (The fact that Cortex-A72 gets too far ahead of itself to take it as a
>>>>>> synchronous external abort is a mild annoyance, but hey...)
>>>>>
>>>>> Thanks for both your clarifications! Reading arm documentation is a skill on
>>>>> its own.
>>>>
>>>> Yes it is.
>>>>   
>>>>>      
>>>>>>> The regmap is created through the following syscon device:
>>>>>>>
>>>>>>> 	avs_monitor: avs-monitor@7d5d2000 {
>>>>>>> 		compatible = "brcm,bcm2711-avs-monitor",
>>>>>>> 			     "syscon", "simple-mfd";
>>>>>>> 		reg = <0x7d5d2000 0xf00>;
>>>>>>>
>>>>>>> 		thermal: thermal {
>>>>>>> 			compatible = "brcm,bcm2711-thermal";
>>>>>>> 			#thermal-sensor-cells = <0>;
>>>>>>> 		};
>>>>>>> 	};
>>>>>>>
>>>>>>> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
>>>>>>> full of addresses that trigger this same error. Also note that as per Florian's
>>>>>>> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
>>>>>>> tell, at least 0x7d5d22b0 seems to be faulty too.
>>>>>>>
>>>>>>> Any ideas/comments? My guess is that those addresses are marked somehow as
>>>>>>> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
>>>>>>> the solution is to narrow the register range exposed by avs-monitor to whatever
>>>>>>> bcm2711-thermal needs (which is ATM a single 32bit register).
>>>>>>
>>>>>> When a peripheral decodes a region of address space, nobody says it has
>>>>>> to accept accesses to *every* address in that space; registers may be
>>>>>> sparsely populated, and although some devices might be "nice" and make
>>>>>> unused areas behave as RAZ/WI, others may throw slave errors if you poke
>>>>>> at the wrong places. As you note, in a TrustZone-aware device some
>>>>>> registers may only exist in one or other of the Secure/Non-Secure
>>>>>> address spaces.
>>>>>>
>>>>>> Even when there is a defined register at a given address, it still
>>>>>> doesn't necessarily accept all possible types of access; it wouldn't be
>>>>>> particularly friendly, but a device *could* have, say, some registers
>>>>>> that support 32-bit accesses and others that only support 16-bit
>>>>>> accesses, and thus throw slave errors if you do the wrong thing in the
>>>>>> wrong place.
>>>>>>
>>>>>> It really all depends on the device itself.
>>>>>
>>>>> All in all, assuming there is no special device quirk to apply, the feeling I'm
>>>>> getting is to just let the error be. As you hint, firmware has no blame here,
>>>>> and debugfs is a 'best effort, zero guarantees' interface after all.
>>>>
>>>> We should probably fill a regmap_access_table to deny reading registers
>>>> for which there is no address decoding and possibly another one to deny
>>>> writing to the read-only registers.
>>>
>>>
>>> Below is a patch that adds a read access table but it seems wrong to include
>>> 'internal.h' and add the table in the thermal driver. Shouldn't this happen
>>> in a higher layer, somehow between syscon and the thermal node?
>>
>> What is the purpose of doing doing this though that cannot already be done using devmem/devmem2 if the point is explore the address space?
> 
> The goal is to prevent a kernel crash when doing
> $ cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers

Fair enough, but that really does not scale across drivers nor across 
power management decisions being made to various drivers.

The thermal sensor is unlikely to ever be clock gated by the time Linux 
runs, but if you were to do the same thing for any other type of 
peripheral, chances are the same outcome would be produced.

So this really begs the question as to how to address this globally 
short of disabling regmap debugfs support which is likely what is 
happening in a production environment anyway.
diff mbox series

Patch

diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
index ff2ee87987c7..9465f5a2f3b8 100644
--- a/drivers/base/regmap/regmap-debugfs.c
+++ b/drivers/base/regmap/regmap-debugfs.c
@@ -229,6 +229,7 @@  static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
        if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
                count = PAGE_SIZE << (MAX_ORDER - 1);
 
+       printk("map=%px, from=%d, to=%d, count=%ld\n", map, from, to, count);
        buf = kmalloc(count, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;
@@ -253,7 +254,9 @@  static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
                        buf_pos += map->debugfs_reg_len + 2;
 
                        /* Format the value, write all X if we can't read */
+                       printk("map=%px, i=%d\n", map, i);
                        ret = regmap_read(map, i, &val);
+                       printk("ret=%ld, val=%x\n", ret, val);
                        if (ret == 0)
                                snprintf(buf + buf_pos, count - buf_pos,
                                         "%.*x", map->debugfs_val_len, val);