Message ID | 1232050347.5966.66.camel@localhost.localdomain (mailing list archive) |
---|---|
State | Accepted |
James Bottomley wrote:
> On Thu, 2009-01-15 at 14:54 -0500, James Bottomley wrote:
>> On Thu, 2009-01-15 at 14:22 -0500, Len Brown wrote:
>>> On Thu, 15 Jan 2009, James Bottomley wrote:
>>>
>>>> On Mon, 2009-01-12 at 18:48 -0600, James Bottomley wrote:
>>>>> On Mon, 2009-01-12 at 16:16 -0500, Len Brown wrote:
>>>>>>> This is on an IBM Maia system with the calgary IOMMU enabled. It's a
>>>>>>> fatal boot up panic.
>>>>>>>
>>>>>> James,
>>>>>> A guided bisect...
>>>>>> Please let me know which of these fail
>>>>>>
>>>>>> a3a798c88a14b35e5d4ca30716dbc9eb9a1ddfe2 is 2.6.29 at ACPI merge
>>>>>> efcb3cf7f00c3c424db012380a8a974c2676a3c8 is 2.6.29 before ACPI merge
>>>>> Didn't try these (being after the failure)
>>>>>
>>>>>> ec9f168fcc344d2ffec1c8c822076bf22dab5c33 is 2.6.28 with most ACPI
>>>>> This is the failing one.
>>>>>
>>>>>> e8443c358c34f3fe65236e24147ddf0cd0e61b08 is 2.6.28 plus just ACPICA
>>>>> This one boots fine.
>>>>>
>>>>>> Please test the "2.6.28+ACPICA" one first.
>>>>>> If it fails, we are close so you can skip the others above
>>>>>> and bisect between that and 2.6.28.
>>>>> I'll try bisecting between ec9f168fcc344d2ffec1c8c822076bf22dab5c33 and
>>>>> e8443c358c34f3fe65236e24147ddf0cd0e61b08.
>>>> OK, bisection complete. It's not actually coming from the ACPI tree but
>>>> from the PCI one (appropriate CC's added).
>>>>
>>>> The commit causing the boot panic is:
>>>>
>>>> commit e8c331e963c58b83db24b7d0e39e8c07f687dbc6
>>>> Author: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
>>>> Date:   Wed Dec 17 12:09:12 2008 +0900
>>>>
>>>>     PCI hotplug: introduce functions for ACPI slot detection
>>>>
>>>> I'm still not sure why, though
>>> Nothing jumped out at me in the patch.
>>> Does reverting e8c331e963c58b83db24b7d0e39e8c07f687dbc6
>>> make the boot crash go away?
>> Yes, but then it would ... the call sequence is through the reverted
>> code.
>
> It looks like acpi_pci_get_bridge_handle() is returning NULL, so this is
> the fix that works for me.
>

I'm sorry for troubling you, and thank you for your patience.

The patch seems to avoid the kernel panic, but I still don't know
why acpi_pci_get_bridge_handle() returns NULL here. I assumed
it should return non-NULL value here. So I'd like to investigate
it more.

Thanks,
Kenji Kaneshige

> James
>
> ---
>
> diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> index f09b101..803d9dd 100644
> --- a/drivers/pci/hotplug/acpiphp_glue.c
> +++ b/drivers/pci/hotplug/acpiphp_glue.c
> @@ -266,6 +266,8 @@ static int detect_ejectable_slots(struct pci_bus *pbus)
>  	int found = acpi_pci_detect_ejectable(pbus);
>  	if (!found) {
>  		acpi_handle bridge_handle = acpi_pci_get_bridge_handle(pbus);
> +		if (!bridge_handle)
> +			return 0;
>  		acpi_walk_namespace(ACPI_TYPE_DEVICE, bridge_handle, (u32)1,
>  				    is_pci_dock_device, (void *)&found, NULL);
>  	}
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2009-01-16 at 15:07 +0900, Kenji Kaneshige wrote:
> > It looks like acpi_pci_get_bridge_handle() is returning NULL, so this is
> > the fix that works for me.
> >
>
> I'm sorry for troubling you, and thank you for your patience.
>
> The patch seems to avoid the kernel panic, but I still don't know
> why acpi_pci_get_bridge_handle() returns NULL here. I assumed
> it should return non-NULL value here. So I'd like to investigate
> it more.

Sure, Len and I couldn't work out why it was returning NULL on this box
(other than that perhaps it doesn't have an ACPI entry). The two
offending busses which trigger this are the two internal ones (which
aren't hotplug). The layout of the box is:

sparkweed:~# lspci -t
-+-[0000:0c]---00.0
 +-[0000:0a]---00.0
 +-[0000:08]---00.0
 +-[0000:06]---00.0
 +-[0000:04]---00.0
 +-[0000:02]---00.0
 +-[0000:01]-+-00.0
 |           +-01.0
 |           +-01.1
 |           \-02.0
 \-[0000:00]-+-00.0
             +-01.0
             +-03.0
             +-03.1
             +-03.2
             +-0f.0
             +-0f.1
             \-0f.3
sparkweed:~# lspci
00:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
00:01.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
00:03.0 USB Controller: NEC Corporation USB (rev 43)
00:03.1 USB Controller: NEC Corporation USB (rev 43)
00:03.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
00:0f.0 Host bridge: Broadcom CSB6 South Bridge (rev a0)
00:0f.1 IDE interface: Broadcom CSB6 RAID/IDE Controller (rev a0)
00:0f.3 ISA bridge: Broadcom GCLE-2 Host Bridge
01:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
01:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
01:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
01:02.0 SCSI storage controller: Adaptec AIC-9410W SAS (Razor ASIC non-RAID) (rev 08)
02:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
04:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
06:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
08:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
0a:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
0c:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)

And when I annotate the problem, the two busses returning NULL are
0000:00 and 0000:01

James
> > It looks like acpi_pci_get_bridge_handle() is returning NULL, so this is
> > the fix that works for me.
> >
>
> I'm sorry for troubling you, and thank you for your patience.
>
> The patch seems to avoid the kernel panic, but I still don't know
> why acpi_pci_get_bridge_handle() returns NULL here. I assumed
> it should return non-NULL value here. So I'd like to investigate
> it more.
>

Kenji,
I'd like to push jejb's 1-liner upstream now.
It makes sense, and it prevents a boot panic regression
that I'd rather not have others experience in rc2.

I agree that it is important to hotplug for you to figure out
why this machine runs down that path in the first place,
and I'm sure that James will continue to work with you
on that after the 1-liner is in.

thanks
Len Brown, Intel Open Source Technology Center

> >
> > diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> > index f09b101..803d9dd 100644
> > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > @@ -266,6 +266,8 @@ static int detect_ejectable_slots(struct pci_bus *pbus)
> >  	int found = acpi_pci_detect_ejectable(pbus);
> >  	if (!found) {
> >  		acpi_handle bridge_handle = acpi_pci_get_bridge_handle(pbus);
> > +		if (!bridge_handle)
> > +			return 0;
> >  		acpi_walk_namespace(ACPI_TYPE_DEVICE, bridge_handle, (u32)1,
> >  				    is_pci_dock_device, (void *)&found, NULL);
> >  	}
On Friday, January 16, 2009 12:08 pm Len Brown wrote:
> > > It looks like acpi_pci_get_bridge_handle() is returning NULL, so this
> > > is the fix that works for me.
> >
> > I'm sorry for troubling you, and thank you for your patience.
> >
> > The patch seems to avoid the kernel panic, but I still don't know
> > why acpi_pci_get_bridge_handle() returns NULL here. I assumed
> > it should return non-NULL value here. So I'd like to investigate
> > it more.
>
> Kenji,
> I'd like to push jejb's 1-liner upstream now.
> It makes sense, and it prevents a boot panic regression
> that I'd rather not have others experience in rc2.
>
> I agree that it is important to hotplug for you to figure out
> why this machine runs down that path in the first place,
> and I'm sure that James will continue to work with you
> on that after the 1-liner is in.
>
> thanks
> Len Brown, Intel Open Source Technology Center
>
> > > diff --git a/drivers/pci/hotplug/acpiphp_glue.c
> > > b/drivers/pci/hotplug/acpiphp_glue.c index f09b101..803d9dd 100644
> > > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > > @@ -266,6 +266,8 @@ static int detect_ejectable_slots(struct pci_bus
> > > *pbus) int found = acpi_pci_detect_ejectable(pbus);
> > >  	if (!found) {
> > >  		acpi_handle bridge_handle = acpi_pci_get_bridge_handle(pbus);
> > > +		if (!bridge_handle)
> > > +			return 0;
> > >  		acpi_walk_namespace(ACPI_TYPE_DEVICE, bridge_handle, (u32)1,
> > >  				    is_pci_dock_device, (void *)&found, NULL);
> > >  	}

Agreed. It's funky that we're getting back a NULL here, but we definitely
don't want people's machines to crash while we figure out what's going
wrong...
James Bottomley wrote:
> On Fri, 2009-01-16 at 15:07 +0900, Kenji Kaneshige wrote:
>>> It looks like acpi_pci_get_bridge_handle() is returning NULL, so this is
>>> the fix that works for me.
>>>
>> I'm sorry for troubling you, and thank you for your patience.
>>
>> The patch seems to avoid the kernel panic, but I still don't know
>> why acpi_pci_get_bridge_handle() returns NULL here. I assumed
>> it should return non-NULL value here. So I'd like to investigate
>> it more.
>
> Sure, Len and I couldn't work out why it was returning NULL on this box
> (other than that perhaps it doesn't have an ACPI entry). The two
> offending busses which trigger this are the two internal ones (which
> aren't hotplug). The layout of the box is:
>
> sparkweed:~# lspci -t
> -+-[0000:0c]---00.0
>  +-[0000:0a]---00.0
>  +-[0000:08]---00.0
>  +-[0000:06]---00.0
>  +-[0000:04]---00.0
>  +-[0000:02]---00.0
>  +-[0000:01]-+-00.0
>  |           +-01.0
>  |           +-01.1
>  |           \-02.0
>  \-[0000:00]-+-00.0
>              +-01.0
>              +-03.0
>              +-03.1
>              +-03.2
>              +-0f.0
>              +-0f.1
>              \-0f.3
> sparkweed:~# lspci
> 00:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> 00:01.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY
> [Radeon 7000/VE]
> 00:03.0 USB Controller: NEC Corporation USB (rev 43)
> 00:03.1 USB Controller: NEC Corporation USB (rev 43)
> 00:03.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
> 00:0f.0 Host bridge: Broadcom CSB6 South Bridge (rev a0)
> 00:0f.1 IDE interface: Broadcom CSB6 RAID/IDE Controller (rev a0)
> 00:0f.3 ISA bridge: Broadcom GCLE-2 Host Bridge
> 01:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> 01:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> Gigabit Ethernet (rev 10)
> 01:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> Gigabit Ethernet (rev 10)
> 01:02.0 SCSI storage controller: Adaptec AIC-9410W SAS (Razor ASIC
> non-RAID) (rev 08)
> 02:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> 04:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> 06:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> 08:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> 0a:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> 0c:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
>
> And when I annotate the problem, the two busses returning NULL are
> 0000:00 and 0000:01
>

Thank you very much for the information. It seems there is
something special in the data structure of host bridge for
0000:00 and 0000:01.

I'm making a debug patch now and will send it to you as soon
as possible. I'm sorry to trouble you, but could you try it
later?

Thanks,
Kenji Kaneshige
Len Brown wrote:
>>> It looks like acpi_pci_get_bridge_handle() is returning NULL, so this is
>>> the fix that works for me.
>>>
>> I'm sorry for troubling you, and thank you for your patience.
>>
>> The patch seems to avoid the kernel panic, but I still don't know
>> why acpi_pci_get_bridge_handle() returns NULL here. I assumed
>> it should return non-NULL value here. So I'd like to investigate
>> it more.
>>
>
>
> Kenji,
> I'd like to push jejb's 1-liner upstream now.
> It makes sense, and it prevents a boot panic regression
> that I'd rather not have others experience in rc2.
>

Yes, I agree. Thank you for doing this.

Thanks,
Kenji Kaneshige

> I agree that it is important to hotplug for you to figure out
> why this machine runs down that path in the first place,
> and I'm sure that James will continue to work with you
> on that after the 1-liner is in.
>
> thanks
> Len Brown, Intel Open Source Technology Center
>
>
>>> diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
>>> index f09b101..803d9dd 100644
>>> --- a/drivers/pci/hotplug/acpiphp_glue.c
>>> +++ b/drivers/pci/hotplug/acpiphp_glue.c
>>> @@ -266,6 +266,8 @@ static int detect_ejectable_slots(struct pci_bus *pbus)
>>>  	int found = acpi_pci_detect_ejectable(pbus);
>>>  	if (!found) {
>>>  		acpi_handle bridge_handle = acpi_pci_get_bridge_handle(pbus);
>>> +		if (!bridge_handle)
>>> +			return 0;
>>>  		acpi_walk_namespace(ACPI_TYPE_DEVICE, bridge_handle, (u32)1,
>>>  				    is_pci_dock_device, (void *)&found, NULL);
>>>  	}
On Mon, 2009-01-19 at 10:10 +0900, Kenji Kaneshige wrote:
> James Bottomley wrote:
> > On Fri, 2009-01-16 at 15:07 +0900, Kenji Kaneshige wrote:
> >>> It looks like acpi_pci_get_bridge_handle() is returning NULL, so this is
> >>> the fix that works for me.
> >>>
> >> I'm sorry for troubling you, and thank you for your patience.
> >>
> >> The patch seems to avoid the kernel panic, but I still don't know
> >> why acpi_pci_get_bridge_handle() returns NULL here. I assumed
> >> it should return non-NULL value here. So I'd like to investigate
> >> it more.
> >
> > Sure, Len and I couldn't work out why it was returning NULL on this box
> > (other than that perhaps it doesn't have an ACPI entry). The two
> > offending busses which trigger this are the two internal ones (which
> > aren't hotplug). The layout of the box is:
> >
> > sparkweed:~# lspci -t
> > -+-[0000:0c]---00.0
> >  +-[0000:0a]---00.0
> >  +-[0000:08]---00.0
> >  +-[0000:06]---00.0
> >  +-[0000:04]---00.0
> >  +-[0000:02]---00.0
> >  +-[0000:01]-+-00.0
> >  |           +-01.0
> >  |           +-01.1
> >  |           \-02.0
> >  \-[0000:00]-+-00.0
> >              +-01.0
> >              +-03.0
> >              +-03.1
> >              +-03.2
> >              +-0f.0
> >              +-0f.1
> >              \-0f.3
> > sparkweed:~# lspci
> > 00:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> > 00:01.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY
> > [Radeon 7000/VE]
> > 00:03.0 USB Controller: NEC Corporation USB (rev 43)
> > 00:03.1 USB Controller: NEC Corporation USB (rev 43)
> > 00:03.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
> > 00:0f.0 Host bridge: Broadcom CSB6 South Bridge (rev a0)
> > 00:0f.1 IDE interface: Broadcom CSB6 RAID/IDE Controller (rev a0)
> > 00:0f.3 ISA bridge: Broadcom GCLE-2 Host Bridge
> > 01:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> > 01:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> > Gigabit Ethernet (rev 10)
> > 01:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> > Gigabit Ethernet (rev 10)
> > 01:02.0 SCSI storage controller: Adaptec AIC-9410W SAS (Razor ASIC
> > non-RAID) (rev 08)
> > 02:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> > 04:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> > 06:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> > 08:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> > 0a:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> > 0c:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
> >
> > And when I annotate the problem, the two busses returning NULL are
> > 0000:00 and 0000:01
> >
>
> Thank you very much for the information. It seems there is
> something special in the data structure of host bridge for
> 0000:00 and 0000:01.

Yes, Len speculates the non-hotplug buses are missing some ACPI entries.

> I'm making a debug patch now and will send it to you as soon
> as possible. I'm sorry to trouble you, but could you try it
> later?

Sure ... I'm travelling this week, but the machine is usually remotely
accessible.

James
diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
index f09b101..803d9dd 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -266,6 +266,8 @@ static int detect_ejectable_slots(struct pci_bus *pbus)
 	int found = acpi_pci_detect_ejectable(pbus);
 	if (!found) {
 		acpi_handle bridge_handle = acpi_pci_get_bridge_handle(pbus);
+		if (!bridge_handle)
+			return 0;
 		acpi_walk_namespace(ACPI_TYPE_DEVICE, bridge_handle, (u32)1,
 				    is_pci_dock_device, (void *)&found, NULL);
 	}
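
For readers who want to see the effect of the guard outside the kernel, here is a minimal user-space sketch of the same control flow. It is not the kernel code: `struct pci_bus`, `acpi_handle`, and every `mock_*` helper are hypothetical stand-ins invented for illustration, and the final `return found;` is an assumption, since the hunk above does not show the end of `detect_ejectable_slots()`. The `assert` in the mock walker stands in for the boot panic seen when a NULL handle reaches the namespace walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; NOT the real definitions. */
typedef void *acpi_handle;

struct pci_bus {
	int number;          /* bus number, e.g. 0x00 or 0x01 */
	acpi_handle bridge;  /* NULL models a bus with no ACPI bridge handle */
	int has_dock;        /* what the mock namespace walk would report */
};

/* Mock of acpi_pci_detect_ejectable(): pretend no ejectable slot is found,
 * so the fallback path that needs the bridge handle is always taken. */
static int mock_detect_ejectable(struct pci_bus *pbus)
{
	(void)pbus;
	return 0;
}

/* Mock of acpi_pci_get_bridge_handle(): may legitimately return NULL. */
static acpi_handle mock_get_bridge_handle(struct pci_bus *pbus)
{
	return pbus->bridge;
}

/* Mock of acpi_walk_namespace(): the assert stands in for the panic that
 * the thread reports when the walk starts from a NULL handle. */
static void mock_walk_namespace(acpi_handle start, struct pci_bus *pbus,
				int *found)
{
	assert(start != NULL);
	if (pbus->has_dock)
		*found = 1;
}

/* Same shape as the patched function: bail out early on a NULL handle. */
static int detect_ejectable_slots_sketch(struct pci_bus *pbus)
{
	int found = mock_detect_ejectable(pbus);

	if (!found) {
		acpi_handle bridge_handle = mock_get_bridge_handle(pbus);

		if (!bridge_handle)
			return 0;  /* the guard added by the patch */
		mock_walk_namespace(bridge_handle, pbus, &found);
	}
	return found;  /* assumed; not visible in the hunk */
}
```

Calling `detect_ejectable_slots_sketch()` on a bus whose `bridge` is NULL returns 0 without ever reaching the mock walker; removing the guard makes the same call trip the `assert`, which mirrors the behaviour reported for busses 0000:00 and 0000:01.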