Message ID | 20200116133102.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | PCI/MSI: Avoid torn updates to MSI pairs | expand |
On Thu, Jan 16, 2020 at 1:31 PM Evan Green <evgreen@chromium.org> wrote: > > __pci_write_msi_msg() updates three registers in the device: address > high, address low, and data. On x86 systems, address low contains > CPU targeting info, and data contains the vector. The order of writes > is address, then data. > > This is problematic if an interrupt comes in after address has > been written, but before data is updated, and the SMP affinity of > the interrupt is changing. In this case, the interrupt targets the > wrong vector on the new CPU. > > This case is pretty easy to stumble into using xhci and CPU hotplugging. > Create a script that targets interrupts at a set of cores and then > offlines those cores. Put some stress on USB, and then watch xhci lose > an interrupt and die. > > Avoid this by disabling MSIs during the update. > > Signed-off-by: Evan Green <evgreen@chromium.org> Note to reviewers: I posted a v2 of this patch with some improvements here: https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/T/#u
[+cc Thomas, Marc, Christoph, Rajat] On Thu, Jan 16, 2020 at 01:31:28PM -0800, Evan Green wrote: > __pci_write_msi_msg() updates three registers in the device: address > high, address low, and data. On x86 systems, address low contains > CPU targeting info, and data contains the vector. The order of writes > is address, then data. > > This is problematic if an interrupt comes in after address has > been written, but before data is updated, and the SMP affinity of > the interrupt is changing. In this case, the interrupt targets the > wrong vector on the new CPU. > > This case is pretty easy to stumble into using xhci and CPU hotplugging. > Create a script that targets interrupts at a set of cores and then > offlines those cores. Put some stress on USB, and then watch xhci lose > an interrupt and die. > > Avoid this by disabling MSIs during the update. > > Signed-off-by: Evan Green <evgreen@chromium.org> > --- > > > Bjorn, > I was unsure whether disabling MSIs temporarily is actually an okay > thing to do. I considered using the mask bit, but got the impression > that not all devices support the mask bit. Let me know if this going to > cause problems or there's a better way. I can include the repro > script I used to cause mayhem if needed. I suspect this *is* a problem because I think disabling MSI doesn't disable interrupts; it just means the device will interrupt using INTx instead of MSI. And the driver is probably not prepared to handle INTx. PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both disabled, the Function requests servicing using INTx interrupts (if supported)." Maybe the IRQ guys have ideas about how to solve this? > --- > drivers/pci/msi.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index 6b43a5455c7af..97856ef862d68 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -328,7 +328,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) > u16 msgctl; > > pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl); > - msgctl &= ~PCI_MSI_FLAGS_QSIZE; > + msgctl &= ~(PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); > msgctl |= entry->msi_attrib.multiple << 4; > pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl); > > @@ -343,6 +343,9 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) > pci_write_config_word(dev, pos + PCI_MSI_DATA_32, > msg->data); > } > + > + msgctl |= PCI_MSI_FLAGS_ENABLE; > + pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl); > } > > skip: > -- > 2.24.1 >
On Wed, Jan 22, 2020 at 9:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > [+cc Thomas, Marc, Christoph, Rajat] > > On Thu, Jan 16, 2020 at 01:31:28PM -0800, Evan Green wrote: > > __pci_write_msi_msg() updates three registers in the device: address > > high, address low, and data. On x86 systems, address low contains > > CPU targeting info, and data contains the vector. The order of writes > > is address, then data. > > > > This is problematic if an interrupt comes in after address has > > been written, but before data is updated, and the SMP affinity of > > the interrupt is changing. In this case, the interrupt targets the > > wrong vector on the new CPU. > > > > This case is pretty easy to stumble into using xhci and CPU hotplugging. > > Create a script that targets interrupts at a set of cores and then > > offlines those cores. Put some stress on USB, and then watch xhci lose > > an interrupt and die. > > > > Avoid this by disabling MSIs during the update. > > > > Signed-off-by: Evan Green <evgreen@chromium.org> > > --- > > > > > > Bjorn, > > I was unsure whether disabling MSIs temporarily is actually an okay > > thing to do. I considered using the mask bit, but got the impression > > that not all devices support the mask bit. Let me know if this going to > > cause problems or there's a better way. I can include the repro > > script I used to cause mayhem if needed. > > I suspect this *is* a problem because I think disabling MSI doesn't > disable interrupts; it just means the device will interrupt using INTx > instead of MSI. And the driver is probably not prepared to handle > INTx. > > PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both > disabled, the Function requests servicing using INTx interrupts (if > supported)." > > Maybe the IRQ guys have ideas about how to solve this? > But don't we already do this in __pci_restore_msi_state(): pci_intx_for_msi(dev, 0); pci_msi_set_enable(dev, 0); arch_restore_msi_irqs(dev); I'd think if there were a chance for a line-based interrupt to get in and wedge itself, it would already be happening there. One other way you could avoid torn MSI writes would be to ensure that if you migrate IRQs across cores, you keep the same x86 vector number. That way the address portion would be updated, and data doesn't change, so there's no window. But that may not actually be feasible. -Evan
On 2020-01-22 17:28, Bjorn Helgaas wrote: > [+cc Thomas, Marc, Christoph, Rajat] > > On Thu, Jan 16, 2020 at 01:31:28PM -0800, Evan Green wrote: >> __pci_write_msi_msg() updates three registers in the device: address >> high, address low, and data. On x86 systems, address low contains >> CPU targeting info, and data contains the vector. The order of writes >> is address, then data. >> >> This is problematic if an interrupt comes in after address has >> been written, but before data is updated, and the SMP affinity of >> the interrupt is changing. In this case, the interrupt targets the >> wrong vector on the new CPU. >> >> This case is pretty easy to stumble into using xhci and CPU >> hotplugging. >> Create a script that targets interrupts at a set of cores and then >> offlines those cores. Put some stress on USB, and then watch xhci lose >> an interrupt and die. >> >> Avoid this by disabling MSIs during the update. >> >> Signed-off-by: Evan Green <evgreen@chromium.org> >> --- >> >> >> Bjorn, >> I was unsure whether disabling MSIs temporarily is actually an okay >> thing to do. I considered using the mask bit, but got the impression >> that not all devices support the mask bit. Let me know if this going >> to >> cause problems or there's a better way. I can include the repro >> script I used to cause mayhem if needed. > > I suspect this *is* a problem because I think disabling MSI doesn't > disable interrupts; it just means the device will interrupt using INTx > instead of MSI. And the driver is probably not prepared to handle > INTx. > > PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both > disabled, the Function requests servicing using INTx interrupts (if > supported)." > > Maybe the IRQ guys have ideas about how to solve this? Not from the top of my head. MSI-X should always support masking, so we could at least handle that case properly and not loose interrupts. Good ol' MSI is more tricky. Disabling MSI, as Bjorn pointed out, is just going to make the problem worse. There is also the problem that a number of drivers pick MSI instead of MSI-X. M.
Evan Green <evgreen@chromium.org> writes: > On Wed, Jan 22, 2020 at 9:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote: >> I suspect this *is* a problem because I think disabling MSI doesn't >> disable interrupts; it just means the device will interrupt using INTx >> instead of MSI. And the driver is probably not prepared to handle >> INTx. >> >> PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both >> disabled, the Function requests servicing using INTx interrupts (if >> supported)." Disabling MSI is not an option. Masking yes, but MSI does not have mandatory masking. We already attempt masking on migration, which covers only MSI-X reliably, but not all MSI incarnations. So I assume that problem happens on a MSI interrupt, right? >> Maybe the IRQ guys have ideas about how to solve this? Maybe :) > But don't we already do this in __pci_restore_msi_state(): > pci_intx_for_msi(dev, 0); > pci_msi_set_enable(dev, 0); > arch_restore_msi_irqs(dev); > > I'd think if there were a chance for a line-based interrupt to get in > and wedge itself, it would already be happening there. That's a completely different beast. It's used when resetting a device and for other stuff like virt state migration. That's not a model for affinity changes of a live device. > One other way you could avoid torn MSI writes would be to ensure that > if you migrate IRQs across cores, you keep the same x86 vector number. > That way the address portion would be updated, and data doesn't > change, so there's no window. But that may not actually be feasible. That's not possible simply because the x86 vector space is limited. If we would have to guarantee that then we'd end up with a max of ~220 interrupts per system. Sufficient for your notebook, but the big iron people would be not amused. The real critical path here is the CPU hotplug path. For regular migration between two online CPUs we use the 'migrate when the irq is actually serviced ' mechanism. That might have the same issue on misdesigned devices which are firing the next interrupt before the one on the flight is serviced, but I haven't seen any reports with that symptom yet. But before I dig deeper into this, please provide the output of 'lscpci -vvv' and 'cat /proc/interrupts' Thanks, tglx
On Wed, Jan 22, 2020 at 3:37 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > Evan Green <evgreen@chromium.org> writes: > > On Wed, Jan 22, 2020 at 9:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > >> I suspect this *is* a problem because I think disabling MSI doesn't > >> disable interrupts; it just means the device will interrupt using INTx > >> instead of MSI. And the driver is probably not prepared to handle > >> INTx. > >> > >> PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both > >> disabled, the Function requests servicing using INTx interrupts (if > >> supported)." > > Disabling MSI is not an option. Masking yes, but MSI does not have > mandatory masking. We already attempt masking on migration, which covers > only MSI-X reliably, but not all MSI incarnations. > > So I assume that problem happens on a MSI interrupt, right? > > >> Maybe the IRQ guys have ideas about how to solve this? > > Maybe :) > > > But don't we already do this in __pci_restore_msi_state(): > > pci_intx_for_msi(dev, 0); > > pci_msi_set_enable(dev, 0); > > arch_restore_msi_irqs(dev); > > > > I'd think if there were a chance for a line-based interrupt to get in > > and wedge itself, it would already be happening there. > > That's a completely different beast. It's used when resetting a device > and for other stuff like virt state migration. That's not a model for > affinity changes of a live device. Hm. Ok. > > > One other way you could avoid torn MSI writes would be to ensure that > > if you migrate IRQs across cores, you keep the same x86 vector number. > > That way the address portion would be updated, and data doesn't > > change, so there's no window. But that may not actually be feasible. > > That's not possible simply because the x86 vector space is limited. If > we would have to guarantee that then we'd end up with a max of ~220 > interrupts per system. Sufficient for your notebook, but the big iron > people would be not amused. Right, that occurred to me as well. The actual requirement isn't quite as restrictive. What you really need is the old vector to be registered on both the old CPU and the new CPU. Then once the interrupt is confirmed to have moved we could release both the old vector both CPUs, leaving only the new vector on the new CPU. In that world some SMP affinity transitions might fail, which is a bummer. To avoid that, you could first migrate to a vector that's available on both the source and destination CPUs, keeping affinity the same. Then change affinity in a separate step. Or alternatively, you could permanently designate a "transit" vector. If an interrupt fires on this vector, then we call all ISRs currently in transit between CPUs. You might end up calling ISRs that didn't actually need service, but at least that's better than missing edges. > > The real critical path here is the CPU hotplug path. > > For regular migration between two online CPUs we use the 'migrate when > the irq is actually serviced ' mechanism. That might have the same issue > on misdesigned devices which are firing the next interrupt before the > one on the flight is serviced, but I haven't seen any reports with that > symptom yet. > > But before I dig deeper into this, please provide the output of > > 'lscpci -vvv' and 'cat /proc/interrupts' > Here it is: https://pastebin.com/YyxBUvQ2 This is a Comet Lake system. It has 8 HT cores, but 4 of those cores have already been offlined. At the bottom of the paste I also included the script I used that causes a repro in a minute or two. I simply run this, then put some stress on USB. For me that stress was "join a Hangouts meeting", since that stressed both my USB webcam and USB ethernet. The script exits when xhci dies. -Evan
Evan Green <evgreen@chromium.org> writes: > On Wed, Jan 22, 2020 at 3:37 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> > One other way you could avoid torn MSI writes would be to ensure that >> > if you migrate IRQs across cores, you keep the same x86 vector number. >> > That way the address portion would be updated, and data doesn't >> > change, so there's no window. But that may not actually be feasible. >> >> That's not possible simply because the x86 vector space is limited. If >> we would have to guarantee that then we'd end up with a max of ~220 >> interrupts per system. Sufficient for your notebook, but the big iron >> people would be not amused. > > Right, that occurred to me as well. The actual requirement isn't quite > as restrictive. What you really need is the old vector to be > registered on both the old CPU and the new CPU. Then once the > interrupt is confirmed to have moved we could release both the old > vector both CPUs, leaving only the new vector on the new CPU. Sure, and how can you guarantee that without reserving the vector on all CPUs in the first place? If you don't do that then if the vector is not available affinity setting would fail every so often and it would pretty much prevent hotplug if a to be migrated vector is not available on at least one online CPU. > In that world some SMP affinity transitions might fail, which is a > bummer. To avoid that, you could first migrate to a vector that's > available on both the source and destination CPUs, keeping affinity > the same. Then change affinity in a separate step. Good luck with doing that at the end of the hotplug routine where the CPU is about to vanish. > Or alternatively, you could permanently designate a "transit" vector. > If an interrupt fires on this vector, then we call all ISRs currently > in transit between CPUs. You might end up calling ISRs that didn't > actually need service, but at least that's better than missing edges. I don't think we need that. While walking the dogs I thought about invoking a force migrated interrupt on the target CPU, but haven't thought it through yet. >> 'lscpci -vvv' and 'cat /proc/interrupts' > > Here it is: > https://pastebin.com/YyxBUvQ2 Hrm: Capabilities: [80] MSI-X: Enable+ Count=16 Masked- So this is weird. We mask it before moving it, so the tear issue should not happen on MSI-X. So the tearing might be just a red herring. Let me stare into the code a bit. Thanks, tglx
On Thu, Jan 23, 2020 at 12:42 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Evan Green <evgreen@chromium.org> writes: > > On Wed, Jan 22, 2020 at 3:37 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> > One other way you could avoid torn MSI writes would be to ensure that > >> > if you migrate IRQs across cores, you keep the same x86 vector number. > >> > That way the address portion would be updated, and data doesn't > >> > change, so there's no window. But that may not actually be feasible. > >> > >> That's not possible simply because the x86 vector space is limited. If > >> we would have to guarantee that then we'd end up with a max of ~220 > >> interrupts per system. Sufficient for your notebook, but the big iron > >> people would be not amused. > > > > Right, that occurred to me as well. The actual requirement isn't quite > > as restrictive. What you really need is the old vector to be > > registered on both the old CPU and the new CPU. Then once the > > interrupt is confirmed to have moved we could release both the old > > vector both CPUs, leaving only the new vector on the new CPU. > > Sure, and how can you guarantee that without reserving the vector on all > CPUs in the first place? If you don't do that then if the vector is not > available affinity setting would fail every so often and it would pretty > much prevent hotplug if a to be migrated vector is not available on at > least one online CPU. > > > In that world some SMP affinity transitions might fail, which is a > > bummer. To avoid that, you could first migrate to a vector that's > > available on both the source and destination CPUs, keeping affinity > > the same. Then change affinity in a separate step. > > Good luck with doing that at the end of the hotplug routine where the > CPU is about to vanish. > > > Or alternatively, you could permanently designate a "transit" vector. > > If an interrupt fires on this vector, then we call all ISRs currently > > in transit between CPUs. You might end up calling ISRs that didn't > > actually need service, but at least that's better than missing edges. > > I don't think we need that. While walking the dogs I thought about > invoking a force migrated interrupt on the target CPU, but haven't > thought it through yet. Yeah, I think the Intel folks did that in some tree of theirs too. > > >> 'lscpci -vvv' and 'cat /proc/interrupts' > > > > Here it is: > > https://pastebin.com/YyxBUvQ2 > > Hrm: > > Capabilities: [80] MSI-X: Enable+ Count=16 Masked- > > So this is weird. We mask it before moving it, so the tear issue should > not happen on MSI-X. So the tearing might be just a red herring. Mmm... sorry what? This is the complete entry for xhci: 00:14.0 USB controller: Intel Corporation Device 02ed (prog-if 30 [XHCI]) Subsystem: Intel Corporation Device 7270 Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 124 Region 0: Memory at d1200000 (64-bit, non-prefetchable) [size=64K] Capabilities: [70] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME- Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+ Address: 00000000fee10004 Data: 402a Capabilities: [90] Vendor Specific Information: Len=14 <?> Kernel driver in use: xhci_hcd > > Let me stare into the code a bit. Thanks, I appreciate the help. -Evan
Evan Green <evgreen@chromium.org> writes: > On Thu, Jan 23, 2020 at 12:42 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Hrm: >> >> Capabilities: [80] MSI-X: Enable+ Count=16 Masked- >> >> So this is weird. We mask it before moving it, so the tear issue should >> not happen on MSI-X. So the tearing might be just a red herring. > > Mmm... sorry what? This is the complete entry for xhci: > > 00:14.0 USB controller: Intel Corporation Device 02ed (prog-if 30 [XHCI]) > Subsystem: Intel Corporation Device 7270 > Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- > ParErr- Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >>TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Interrupt: pin A routed to IRQ 124 > Region 0: Memory at d1200000 (64-bit, non-prefetchable) [size=64K] > Capabilities: [70] Power Management version 2 > Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA > PME(D0-,D1-,D2-,D3hot+,D3cold+) > Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME- > Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+ > Address: 00000000fee10004 Data: 402a > Capabilities: [90] Vendor Specific Information: Len=14 <?> > Kernel driver in use: xhci_hcd Bah. I was looking at the WIFI for whatever reason. Thanks, tglx
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 6b43a5455c7af..97856ef862d68 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -328,7 +328,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) u16 msgctl; pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl); - msgctl &= ~PCI_MSI_FLAGS_QSIZE; + msgctl &= ~(PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); msgctl |= entry->msi_attrib.multiple << 4; pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl); @@ -343,6 +343,9 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) pci_write_config_word(dev, pos + PCI_MSI_DATA_32, msg->data); } + + msgctl |= PCI_MSI_FLAGS_ENABLE; + pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl); } skip:
__pci_write_msi_msg() updates three registers in the device: address high, address low, and data. On x86 systems, address low contains CPU targeting info, and data contains the vector. The order of writes is address, then data. This is problematic if an interrupt comes in after address has been written, but before data is updated, and the SMP affinity of the interrupt is changing. In this case, the interrupt targets the wrong vector on the new CPU. This case is pretty easy to stumble into using xhci and CPU hotplugging. Create a script that targets interrupts at a set of cores and then offlines those cores. Put some stress on USB, and then watch xhci lose an interrupt and die. Avoid this by disabling MSIs during the update. Signed-off-by: Evan Green <evgreen@chromium.org> --- Bjorn, I was unsure whether disabling MSIs temporarily is actually an okay thing to do. I considered using the mask bit, but got the impression that not all devices support the mask bit. Let me know if this going to cause problems or there's a better way. I can include the repro script I used to cause mayhem if needed. --- drivers/pci/msi.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)