Message ID | 20211210221642.869015045@linutronix.de (mailing list archive) |
---|---|
Headers | show |
Series | genirq/msi, PCI/MSI: Spring cleaning - Part 2 | expand |
On 23:18-20211210, Thomas Gleixner wrote: [...] > > It's also available from git: > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v3-part-2 [...] > --- > drivers/dma/ti/k3-udma-private.c | 6 > drivers/dma/ti/k3-udma.c | 14 - > drivers/irqchip/irq-ti-sci-inta.c | 2 > drivers/soc/ti/k3-ringacc.c | 6 > drivers/soc/ti/ti_sci_inta_msi.c | 22 -- > include/linux/soc/ti/ti_sci_inta_msi.h | 1 Also while testing on TI K3 platforms, I noticed: msi_device_data_release/msi_device_destroy_sysfs in am64xx-evm / j7200 [1] https://gist.github.com/nmenon/36899c7819681026cfe1ef185fb95f33#file-am64xx-evm-txt-L1018 [2] https://gist.github.com/nmenon/36899c7819681026cfe1ef185fb95f33#file-j7200-evm-txt-L1076 Which is not present in vanilla v5.16-rc4 v5.16-rc4: https://gist.github.com/nmenon/1aee3f0a7da47d5e9dcb7336b32a70cb msi-v3-part-2: https://gist.github.com/nmenon/36899c7819681026cfe1ef185fb95f33 (.config https://gist.github.com/nmenon/ec6f95303828abf16a64022d8e3a269f) Vs: next-20211208: https://gist.github.com/nmenon/f5ca3558bd5c1fbe62dc5ceb420b536e
On Mon, Dec 13 2021 at 12:29, Nishanth Menon wrote: > On 23:18-20211210, Thomas Gleixner wrote: > Also while testing on TI K3 platforms, I noticed: > > msi_device_data_release/msi_device_destroy_sysfs in am64xx-evm / j7200 The warning complains about a device being released with MSI descriptors still attached to the device. This was added by: 5b012cede0f7 ("device: Add device::msi_data pointer and struct msi_device_data") That's not a regression caused by this commit. The warning is just exposing an already existing problem in the iwlwifi driver, which seems to do: probe() setup_pci_msi[x]_interrupts() start_drv() if (try_to_load_firmware() == FAIL) device_release_driver() ... msi_device_data_release() WARN() Thanks, tglx
On 10:41-20211214, Thomas Gleixner wrote: > On Mon, Dec 13 2021 at 12:29, Nishanth Menon wrote: > > On 23:18-20211210, Thomas Gleixner wrote: > > Also while testing on TI K3 platforms, I noticed: > > > > msi_device_data_release/msi_device_destroy_sysfs in am64xx-evm / j7200 > > The warning complains about a device being released with MSI descriptors > still attached to the device. This was added by: > > 5b012cede0f7 ("device: Add device::msi_data pointer and struct msi_device_data") > > That's not a regression caused by this commit. The warning is just > exposing an already existing problem in the iwlwifi driver, which seems > to do: > > probe() > setup_pci_msi[x]_interrupts() > start_drv() > if (try_to_load_firmware() == FAIL) > device_release_driver() > ... > msi_device_data_release() > WARN() > Agreed that the warning is fine, the null pointer exception that follows [1] [2] it however does'nt look right and it can be trivially fixed with the following fixup for ee90787487bc ("genirq/msi: Provide msi_device_populate/destroy_sysfs()") below, with that the log looks like [3] - the warn is good, the null pointer exception and resultant crash could be avoided (not saying this is the best solution): diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c index ab5e83f41188..24edb870c66f 100644 --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -252,11 +252,14 @@ int msi_device_populate_sysfs(struct device *dev) */ void msi_device_destroy_sysfs(struct device *dev) { - const struct attribute_group **msi_irq_groups = dev->msi.data->attrs; + const struct attribute_group **msi_irq_groups; struct device_attribute *dev_attr; struct attribute **msi_attrs; int count = 0; + if (!dev->msi.data) + return; + msi_irq_groups = dev->msi.data->attrs; dev->msi.data->attrs = NULL; if (!msi_irq_groups) return; [1] https://gist.github.com/nmenon/36899c7819681026cfe1ef185fb95f33#file-am64xx-evm-txt-L1049 [2] https://gist.github.com/nmenon/36899c7819681026cfe1ef185fb95f33#file-j7200-evm-txt-L1111 [3] https://gist.github.com/nmenon/575afe7d04463026a7e420a76c2c1c5b https://gist.github.com/nmenon/575afe7d04463026a7e420a76c2c1c5b#file-am64xx-evm-txt-L1018 https://gist.github.com/nmenon/575afe7d04463026a7e420a76c2c1c5b#file-j7200-evm-txt-L1053
On Tue, Dec 14 2021 at 10:22, Nishanth Menon wrote: > On 10:41-20211214, Thomas Gleixner wrote: > Agreed that the warning is fine, the null pointer exception that follows > [1] [2] it however does'nt look right and it can be trivially fixed with the > following fixup for ee90787487bc ("genirq/msi: Provide > msi_device_populate/destroy_sysfs()") below, with that the log looks > like [3] - the warn is good, the null pointer exception and resultant > crash could be avoided (not saying this is the best solution): Aaargh. [ 13.478122] Call trace: [ 13.509042] msi_device_destroy_sysfs+0x18/0x88 [ 13.509058] msi_domain_free_irqs+0x34/0x58 [ 13.509064] pci_msi_teardown_msi_irqs+0x30/0x3c [ 13.509072] free_msi_irqs+0x78/0xd4 [ 13.509077] pci_disable_msix+0x138/0x164 [ 13.529930] pcim_release+0x70/0x238 [ 13.529942] devres_release_all+0x9c/0xfc [ 13.529951] device_release_driver_internal+0x1a0/0x244 [ 13.542725] device_release_driver+0x18/0x24 [ 13.542741] iwl_req_fw_callback+0x1a28/0x1ddc [iwlwifi] [ 13.552308] request_firmware_work_func+0x50/0x9c [ 13.552320] process_one_work+0x194/0x25c That's not a driver problem, that's an ordering issue vs. the devres muck. Let me go back to the drawing board. Sigh... Thanks, tglx
On Tue, Dec 14 2021 at 17:36, Thomas Gleixner wrote: > On Tue, Dec 14 2021 at 10:22, Nishanth Menon wrote: >> On 10:41-20211214, Thomas Gleixner wrote: > [ 13.478122] Call trace: > [ 13.509042] msi_device_destroy_sysfs+0x18/0x88 > [ 13.509058] msi_domain_free_irqs+0x34/0x58 > [ 13.509064] pci_msi_teardown_msi_irqs+0x30/0x3c > [ 13.509072] free_msi_irqs+0x78/0xd4 > [ 13.509077] pci_disable_msix+0x138/0x164 > [ 13.529930] pcim_release+0x70/0x238 > [ 13.529942] devres_release_all+0x9c/0xfc > [ 13.529951] device_release_driver_internal+0x1a0/0x244 > [ 13.542725] device_release_driver+0x18/0x24 > [ 13.542741] iwl_req_fw_callback+0x1a28/0x1ddc [iwlwifi] > [ 13.552308] request_firmware_work_func+0x50/0x9c > [ 13.552320] process_one_work+0x194/0x25c > > That's not a driver problem, that's an ordering issue vs. the devres > muck. Let me go back to the drawing board. Sigh... Which is pretty obvious why: pcim_enable_device() devres_alloc(pcim_release...); ... pci_irq_alloc() msi_setup_device_data() devres_alloc(msi_device_data_release, ...) and once the device is released: msi_device_data_release() ... pcim_release() pci_disable_msi[x]() Groan....
Nishanth, On Tue, Dec 14 2021 at 18:03, Thomas Gleixner wrote: > msi_device_data_release() > ... > pcim_release() > pci_disable_msi[x]() > > Groan.... I think I managed to distangle this. Can you please give: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4-part-2 and/or the full pile: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4-part-3 a test ride? Thanks, tglx
On 21:15-20211214, Thomas Gleixner wrote: > Nishanth, > > On Tue, Dec 14 2021 at 18:03, Thomas Gleixner wrote: > > msi_device_data_release() > > ... > > pcim_release() > > pci_disable_msi[x]() > > > > Groan.... > > I think I managed to distangle this. Can you please give: > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4-part-2 Umm.. I am not entirely sure what is going on.. but all kinds of weird corruption seems to occur with msi-v4-part-2 that does'nt seem to be present in v5.16-rc5. (I use NFS since ethernet in K3 platforms use inta/intr and dma that is impacted by this series). I will try and rebase your patches on v5.16-rc4 to be sure as well and report back later today once i get some time. [1] https://gist.github.com/nmenon/a66e022926c4c15313c45d44313d860c msi-v4-part-2 [2] https://gist.github.com/nmenon/43085664d69ad846d596e76a06ed0656 v5.16-rc5
Nishanth, On Tue, Dec 14 2021 at 14:56, Nishanth Menon wrote: > On 21:15-20211214, Thomas Gleixner wrote: >> I think I managed to distangle this. Can you please give: >> >> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4-part-2 > > > Umm.. I am not entirely sure what is going on.. but all kinds of weird > corruption seems to occur with msi-v4-part-2 that does'nt seem to be > present in v5.16-rc5. (I use NFS since ethernet in K3 platforms use > inta/intr and dma that is impacted by this series). > > I will try and rebase your patches on v5.16-rc4 to be sure as well and > report back later today once i get some time. > > [1] https://gist.github.com/nmenon/a66e022926c4c15313c45d44313d860c msi-v4-part-2 > [2] https://gist.github.com/nmenon/43085664d69ad846d596e76a06ed0656 v5.16-rc5 thanks for trying. I'll have a look again with brain awake tomorrow morning. Thanks, tglx
On Tue, Dec 14 2021 at 22:19, Thomas Gleixner wrote: > On Tue, Dec 14 2021 at 14:56, Nishanth Menon wrote: > > thanks for trying. I'll have a look again with brain awake tomorrow > morning. Morning was busy with other things, but I found what my sleepy brain managed to do wrong yesterday evening. Let me reintegrate the pile and I'll send you an update. Thanks, tglx
On Wed, Dec 15 2021 at 17:18, Thomas Gleixner wrote: > On Tue, Dec 14 2021 at 22:19, Thomas Gleixner wrote: >> On Tue, Dec 14 2021 at 14:56, Nishanth Menon wrote: >> >> thanks for trying. I'll have a look again with brain awake tomorrow >> morning. > > Morning was busy with other things, but I found what my sleepy brain > managed to do wrong yesterday evening. > > Let me reintegrate the pile and I'll send you an update. git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4.1-part-2 git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4.2-part-3 That should cure the problem.
On 17:35-20211215, Thomas Gleixner wrote: > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4.1-part-2 [...] > That should cure the problem. And it sure does. Thanks for looking closer and providing a fix. https://gist.github.com/nmenon/9862a1c31b17fd6dfe0a30c54d396187 (msi-v4.1-part-2) looks clean Also while I had detected pointer corruption in the previous v4 https://gist.github.com/nmenon/ce4d12f460db5cd511185c047d5d35d0 Running it again on v4.1 does indicate the fix is in place. https://gist.github.com/nmenon/3231fbb0faa1b9c827b40d1ae6160626 please feel free to add: Tested-by: Nishanth Menon <nm@ti.com>
Hi Thomas, On 17:35-20211215, Thomas Gleixner wrote: > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4.2-part-3 As you helped offline, summarizing the details on part3 of the series: I was seeing failure[1] of NFS(DMA) on all TI K3 platforms: [ 1.013258] ti-bcdma 485c0100.dma-controller: Number of rings: 68 [ 1.019963] ti-bcdma 485c0100.dma-controller: Failed to allocate IRQs -28 [ 1.026938] ti-bcdma 485c0100.dma-controller: Failed to allocate MSI interrupts Rationale as you explained: " -28 is ENOSPC, which is returned when the interrupt allocation in the MSI domain fails. Fix below. " Which turned out to be the fixup[2] you suggested and I confirm that fixes the problem for me. With the fixup in place: Tested-by: Nishanth Menon <nm@ti.com> for part 3 of the series as well. Thanks once again for your help. Hope we can roll in the fixes for part3. [1] https://gist.github.com/nmenon/5971ab27aa626c022e276cc946e4b6c3 [2] --- a/drivers/soc/ti/ti_sci_inta_msi.c +++ b/drivers/soc/ti/ti_sci_inta_msi.c @@ -68,6 +68,7 @@ static int ti_sci_inta_msi_alloc_descs(s int set, i, count = 0; memset(&msi_desc, 0, sizeof(msi_desc)); + msi_desc.nvec_used = 1; for (set = 0; set < res->sets; set++) { for (i = 0; i < res->desc[set].num; i++, count++) {
From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, December 15, 2021 8:36 AM > > On Wed, Dec 15 2021 at 17:18, Thomas Gleixner wrote: > > > On Tue, Dec 14 2021 at 22:19, Thomas Gleixner wrote: > >> On Tue, Dec 14 2021 at 14:56, Nishanth Menon wrote: > >> > >> thanks for trying. I'll have a look again with brain awake tomorrow > >> morning. > > > > Morning was busy with other things, but I found what my sleepy brain > > managed to do wrong yesterday evening. > > > > Let me reintegrate the pile and I'll send you an update. > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4.1-part-2 > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v4.2-part-3 > > That should cure the problem. Tested the msi-v4.2-part-3 tag in two different Azure/Hyper-V VMs. One is a Generation 1 VM that has legacy PCI devices and one is a Generation 2 VM with no legacy PCI devices. Tested hot add and remove of Mellanox CX-3 and CX-4 SR-IOV NIC virtual functions that are directly mapped into the VM. Also tested local NVMe devices directly mapped into one of the VMs. No issues encountered. So for Azure/Hyper-V specifically, Tested-by: Michael Kelley <mikelley@microsoft.com>
Nishanth, On Wed, Dec 15 2021 at 19:45, Nishanth Menon wrote: > On 17:35-20211215, Thomas Gleixner wrote: > Thanks once again for your help. Hope we can roll in the fixes for > part3. Sure, it's only the one-liner for ti sci. Got it folded already. Thanks for your help and testing! tglx