diff mbox series

Regression on Macchiatobin from the irqchip driver

Message ID 20240821165034.1af97bad@fedora-3.home (mailing list archive)
State New, archived
Headers show
Series Regression on Macchiatobin from the irqchip driver | expand

Commit Message

Maxime Chevallier Aug. 21, 2024, 2:50 p.m. UTC
Hi everyone,

I've been testing out some network series on the Macchiatobin (Armada
8k SoC) and I stumbled upon a crash at boot, that showed-up on the
latest net-next branch :

[    2.755698] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[    2.757592] mmcblk0: mmc0:0001 8GME4R 7.28 GiB
[    2.766033] Mem abort info:
[    2.766036]   ESR = 0x0000000096000004
[    2.774534]  mmcblk0: p1
[    2.777086]   EC = 0x25: DABT (current EL), IL = 32 bits
[    2.779893] mmcblk0boot0: mmc0:0001 8GME4R 4.00 MiB
[    2.784965]   SET = 0, FnV = 0
[    2.784969]   EA = 0, S1PTW = 0
[    2.784972]   FSC = 0x04: level 0 translation fault
[    2.784976] Data abort info:
[    2.790648] mmcblk0boot1: mmc0:0001 8GME4R 4.00 MiB
[    2.792943]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[    2.796867] mmcblk0rpmb: mmc0:0001 8GME4R 512 KiB, chardev (234:0)
[    2.801002]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    2.801006]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    2.830960] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000101a75000
[    2.837436] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000
[    2.844265] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[    2.850560] Modules linked in:
[    2.853631] CPU: 2 UID: 0 PID: 51 Comm: kworker/u18:2 Not tainted 6.10.0-12649-g25010bfdf8bb #10
[    2.862457] Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
[    2.868926] Workqueue: events_unbound deferred_probe_work_func
[    2.874800] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    2.881794] pc : msi_lib_irq_domain_select+0x28/0x58
[    2.886786] lr : irq_find_matching_fwspec+0xf0/0x120
[    2.891778] sp : ffff8000871f39e0
[    2.895107] x29: ffff8000871f39e0 x28: 0000000000000000 x27: ffff00013f7e11e8
[    2.902281] x26: 0000000000000004 x25: ffff00013f7e11d0 x24: ffff800086764dd0
[    2.909453] x23: ffff800081633b28 x22: ffff8000871f3a68 x21: 0000000000000001
[    2.916624] x20: ffff800086764df0 x19: ffff000101171000 x18: ffffffffffffffff
[    2.923797] x17: ffff000100de8a0c x16: ffff000100de8a00 x15: ffff000100da9cca
[    2.930969] x14: ffffffffffffffff x13: 0030354072656c6c x12: 6f72746e6f632d74
[    2.938141] x11: 7075727265746e69 x10: 0000000000036018 x9 : 0000000000000001
[    2.945314] x8 : ffff8000871f3ab8 x7 : 0000000000000000 x6 : 4d0c1e0cade3e5ec
[    2.952486] x5 : 6c65632d0c1e0c4d x4 : ffff00013f7e11e8 x3 : ffff000101171000
[    2.959659] x2 : 0000000000000004 x1 : 0000000000000000 x0 : 0000000000000000
[    2.966831] Call trace:
[    2.969288]  msi_lib_irq_domain_select+0x28/0x58
[    2.973928]  irq_find_matching_fwnode+0x4c/0x78
[    2.978484]  of_msi_get_domain+0x11c/0x138
[    2.982602]  mvebu_icu_subset_probe+0x5c/0x124
[    2.987068]  platform_probe+0x68/0xdc
[    2.990748]  really_probe+0xbc/0x2a4
[    2.994343]  __driver_probe_device+0x78/0x12c
[    2.998722]  driver_probe_device+0xdc/0x160
[    3.002926]  __device_attach_driver+0xb8/0x134
[    3.007392]  bus_for_each_drv+0x80/0xdc
[    3.011248]  __device_attach+0xa8/0x1b0
[    3.015103]  device_initial_probe+0x14/0x20
[    3.019307]  bus_probe_device+0xa8/0xac
[    3.023162]  deferred_probe_work_func+0x88/0xc0
[    3.027714]  process_one_work+0x150/0x294
[    3.031743]  worker_thread+0x2e4/0x3ec
[    3.035510]  kthread+0x118/0x11c
[    3.038756]  ret_from_fork+0x10/0x20
[    3.042353] Code: d65f03c0 b9400820 35ffffa0 f9404461 (b9400823) 
[    3.048473] ---[ end trace 0000000000000000 ]---

I bisected the bug and the crash appeared at :

fbdf14e90ce4 ("irqchip/irq-mvebu-sei: Switch to MSI parent")

I've briefly looked at it, and it seems the NULL pointer that's being
dereferenced here is the "ops" pointer in msi_lib_irq_domain_select [1]

I'm not very familiar with the irqchip subsystem, my best guess
is that this is being called for the ap_domain, in the irq-mvebu-sei
driver, which doesn't have any msi_parent_ops set [2].

By looking at the msi_lib_irq_domain_select() implementation however, I
notice that it appears to be expected that these ops can be NULL by
looking at the check in the return line :

	return ops && !!(ops->bus_select_mask & busmask);

However, the line above dereferences the ops pointer without prior
check :

	/* Handle pure domain searches */
	if (bus_token == ops->bus_select_token)
		return 1;

As I said, this area of the kernel isn't very familiar to me, but I got
my board to boot with the following patch :

I have zero confidence that this is the correct solution to the issue
so feel free to ditch that solution :) I'll gladly test any
patch for that on the MCBIN.

Let me know if you want me to run more tests.

Thanks,

Maxime

[1] : https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/irqchip/irq-msi-lib.c#L125
[2] : https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/irqchip/irq-mvebu-sei.c#L423

Comments

Thomas Gleixner Aug. 23, 2024, 9:21 a.m. UTC | #1
On Wed, Aug 21 2024 at 16:50, Maxime Chevallier wrote:
> By looking at the msi_lib_irq_domain_select() implementation however, I
> notice that it appears to be expected that these ops can be NULL by
> looking at the check in the return line :
>
> 	return ops && !!(ops->bus_select_mask & busmask);
>
> However, the line above dereferences the ops pointer without prior
> check :
>
> 	/* Handle pure domain searches */
> 	if (bus_token == ops->bus_select_token)
> 		return 1;

Oops.

> As I said, this area of the kernel isn't very familiar to me, but I got
> my board to boot with the following patch :
>
> --- a/drivers/irqchip/irq-msi-lib.c
> +++ b/drivers/irqchip/irq-msi-lib.c
> @@ -128,6 +128,9 @@ int msi_lib_irq_domain_select(struct irq_domain *d, struct irq_fwspec *fwspec,
>         const struct msi_parent_ops *ops = d->msi_parent_ops;
>         u32 busmask = BIT(bus_token);
>  
> +       if (!ops)
> +               return 0;
> +
>         if (fwspec->fwnode != d->fwnode || fwspec->param_count != 0)
>                 return 0;
>  
> @@ -135,6 +138,6 @@ int msi_lib_irq_domain_select(struct irq_domain *d, struct irq_fwspec *fwspec,
>         if (bus_token == ops->bus_select_token)
>                 return 1;
>  
> -       return ops && !!(ops->bus_select_mask & busmask);
> +       return !!(ops->bus_select_mask & busmask);
>
> ----------------------------
>
> I have zero confidence that this is the correct solution to the issue
> so feel free to ditch that solution :) I'll gladly test any
> patch for that on the MCBIN.

It obviously is the proper solution check after use is pretty pointless
as you demonstrated. Care to send a proper patch?

Thanks,

        tglx
Maxime Chevallier Aug. 23, 2024, 9:34 a.m. UTC | #2
Hello Thomas,

On Fri, 23 Aug 2024 11:21:16 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> It obviously is the proper solution check after use is pretty pointless
> as you demonstrated. Care to send a proper patch?

Sure :)

Thanks,

Maxime
diff mbox series

Patch

--- a/drivers/irqchip/irq-msi-lib.c
+++ b/drivers/irqchip/irq-msi-lib.c
@@ -128,6 +128,9 @@  int msi_lib_irq_domain_select(struct irq_domain *d, struct irq_fwspec *fwspec,
        const struct msi_parent_ops *ops = d->msi_parent_ops;
        u32 busmask = BIT(bus_token);
 
+       if (!ops)
+               return 0;
+
        if (fwspec->fwnode != d->fwnode || fwspec->param_count != 0)
                return 0;
 
@@ -135,6 +138,6 @@  int msi_lib_irq_domain_select(struct irq_domain *d, struct irq_fwspec *fwspec,
        if (bus_token == ops->bus_select_token)
                return 1;
 
-       return ops && !!(ops->bus_select_mask & busmask);
+       return !!(ops->bus_select_mask & busmask);

----------------------------