Message ID | 20181027000028.21343-1-tpiepho@impinj.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to | Bjorn Helgaas |
Series | PCI: dwc: Fix interrupt race when handling MSI |
On 27/10/18 5:30 AM, Trent Piepho wrote: > This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status > after it is handled, not before"). > > This is a very real race that we observed quickly after switching from > 4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to > track it to the precise race and verify the fixed behavior, as will be > described below. > > This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI: > designware: Fix missing MSI IRQs") The discussion of that commit, > archived in patchwork [1], is informative and worth reading. > > The bug was re-added in the '8c934 commit this is reverting, which > appeared in the 4.14 kernel. > > Unfortunately, Synopsys appears to consider the operation of this PCI-e > controller secret. They provide no publicly available docs for it nor > allow the references manuals of SoCs using the controller to publish any > documentation of it. > > So I can not say certain this code is correctly using the controller's > features. However, extensive testing gives me high confidence in the > accuracy of what is described below. > > If an MSI is received by the PCI-e controller while the status register > bit for that MSI is set, the PCI-e controller will NOT generate another > interrupt. In addition, it will NOT queue or otherwise mark the > interrupt as "pending", which means it will NOT generate an interrupt > when the status bit is unmasked. > > This gives the following race scenario: > > 1. An MSI is received by, and the status bit for the MSI is set in, the > DWC PCI-e controller. > 2. dw_handle_msi_irq() calls a driver's registered interrupt handler > for the MSI received. > 3. At some point, the interrupt handler must decide, correctly, that > there is no more work to do and return. > 4. The hardware generates a new MSI. As the MSI's status bit is still > set, this new MSI is ignored. > 6. dw_handle_msi_irq() unsets the MSI status bit. > > The MSI received at point 4 will never be acted upon. It occurred after > the driver had finished checking the hardware status for interrupt > conditions to act on. Since the MSI status was masked, it does not > generated a new IRQ, neither when it was received nor when the MSI is > unmasked. > > It seems clear there is an unsolvable race here. > > After this patch, the sequence becomes as follows: > > 1. An MSI is received and the status bit for the MSI is set in the > DWC PCI-e controller. > 2. dw_handle_msi_irq() clears this MSI status bit. > 3. dw_handle_msi_irq() calls a driver's registered interrupt handler > for the MSI received. > 3. At some point, the interrupt handler must decide, correctly, that > there is no more work to do and return. > 4. The hardware generates a new MSI. This sets the MSI status bit and > triggers another interrupt in the interrupt controller(s) above the DWC > PCI-e controller. As the the dwc-pcie handler is not re-entrant, it is > not run again at this time. > 6. dw_handle_msi_irq() finishes. The MSI status bit remains set. > 7. The IRQ is re-raised and dw_handle_msi_irq() runs again. > 8. dw_handle_msi_irq() invokes the MSI's registered interrupt handler > again as the status bit was still set. > > The observant will notice that point 4 present the opportunity for the > SoC's interrupt controller to lose the interrupt in the same manner as > the bug in this driver. The driver for that interrupt controller will > be written to properly deal with this. 
In some cases the hardware > supports an EOI concept, where the 2nd IRQ is masked and internally > queued in the hardware, to be re-raised at EOI in step 7. In other > cases the IRQ will be unmasked and re-raised at step 4, but the kernel > will see the handler is INPROGRESS and not re-invoke it, and instead set > a PENDING flag, which causes the handler to re-run at step 7. > > [1] https://patchwork.kernel.org/patch/3333681/ > > Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") > Cc: stable@vger.kernel.org > Cc: Vignesh R <vigneshr@ti.com> > Cc: Faiz Abbas <faiz_abbas@ti.com> > Cc: Jingoo Han <jingoohan1@gmail.com> > Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com> > Cc: Joao Pinto <jpinto@synopsys.com> > Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Signed-off-by: Trent Piepho <tpiepho@impinj.com> > --- Thanks for the patch! I have observed similar race as described above using DWC controller on DRA7xx SoCs[2]. The proposed patch and correct handling of MSIs within controller driver helps overcome this issue. So, for this patch: Reviewed-by: Vignesh R <vigneshr@ti.com> [2] https://www.spinics.net/lists/linux-pci/msg68996.html > drivers/pci/controller/dwc/pcie-designware-host.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c > index 29a05759a294..9a3960c95ad3 100644 > --- a/drivers/pci/controller/dwc/pcie-designware-host.c > +++ b/drivers/pci/controller/dwc/pcie-designware-host.c > @@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp) > irq = irq_find_mapping(pp->irq_domain, > (i * MAX_MSI_IRQS_PER_CTRL) + > pos); > - generic_handle_irq(irq); > dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS + > (i * MSI_REG_CTRL_BLOCK_SIZE), > 4, 1 << pos); > + generic_handle_irq(irq); > pos++; > } > } >
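To make the ordering concrete, here is a minimal sketch of the inner dispatch loop as it behaves after the revert, paraphrasing the diff above rather than copying the driver verbatim (the surrounding loop over controllers, which reads PCIE_MSI_INTR0_STATUS into `val`, is omitted):

```c
/* Sketch of the post-revert ordering inside dw_handle_msi_irq(). */
for (pos = 0; pos < MAX_MSI_IRQS_PER_CTRL; pos++) {
	if (!(val & BIT(pos)))		/* val = pending bits read from   */
		continue;		/* PCIE_MSI_INTR0_STATUS           */

	irq = irq_find_mapping(pp->irq_domain,
			       (i * MAX_MSI_IRQS_PER_CTRL) + pos);

	/* Ack first: write-1-to-clear this vector's status bit, so an MSI
	 * arriving while the handler runs sets the bit again and re-raises
	 * the chained interrupt. */
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
				(i * MSI_REG_CTRL_BLOCK_SIZE),
			    4, BIT(pos));

	/* Only then dispatch to the driver's registered handler. */
	generic_handle_irq(irq);
}
```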
On 27/10/2018 01:00, Trent Piepho wrote: > This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status > after it is handled, not before"). > > This is a very real race that we observed quickly after switching from > 4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to > track it to the precise race and verify the fixed behavior, as will be > described below. > > This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI: > designware: Fix missing MSI IRQs") The discussion of that commit, > archived in patchwork [1], is informative and worth reading. > > The bug was re-added in the '8c934 commit this is reverting, which > appeared in the 4.14 kernel. > > Unfortunately, Synopsys appears to consider the operation of this PCI-e > controller secret. They provide no publicly available docs for it nor > allow the references manuals of SoCs using the controller to publish any > documentation of it. > > So I can not say certain this code is correctly using the controller's > features. However, extensive testing gives me high confidence in the > accuracy of what is described below. > > If an MSI is received by the PCI-e controller while the status register > bit for that MSI is set, the PCI-e controller will NOT generate another > interrupt. In addition, it will NOT queue or otherwise mark the > interrupt as "pending", which means it will NOT generate an interrupt > when the status bit is unmasked. > > This gives the following race scenario: > > 1. An MSI is received by, and the status bit for the MSI is set in, the > DWC PCI-e controller. > 2. dw_handle_msi_irq() calls a driver's registered interrupt handler > for the MSI received. > 3. At some point, the interrupt handler must decide, correctly, that > there is no more work to do and return. > 4. The hardware generates a new MSI. As the MSI's status bit is still > set, this new MSI is ignored. > 6. dw_handle_msi_irq() unsets the MSI status bit. > > The MSI received at point 4 will never be acted upon. It occurred after > the driver had finished checking the hardware status for interrupt > conditions to act on. Since the MSI status was masked, it does not > generated a new IRQ, neither when it was received nor when the MSI is > unmasked. > > It seems clear there is an unsolvable race here. > > After this patch, the sequence becomes as follows: > > 1. An MSI is received and the status bit for the MSI is set in the > DWC PCI-e controller. > 2. dw_handle_msi_irq() clears this MSI status bit. > 3. dw_handle_msi_irq() calls a driver's registered interrupt handler > for the MSI received. > 3. At some point, the interrupt handler must decide, correctly, that > there is no more work to do and return. > 4. The hardware generates a new MSI. This sets the MSI status bit and > triggers another interrupt in the interrupt controller(s) above the DWC > PCI-e controller. As the the dwc-pcie handler is not re-entrant, it is > not run again at this time. > 6. dw_handle_msi_irq() finishes. The MSI status bit remains set. > 7. The IRQ is re-raised and dw_handle_msi_irq() runs again. > 8. dw_handle_msi_irq() invokes the MSI's registered interrupt handler > again as the status bit was still set. > > The observant will notice that point 4 present the opportunity for the > SoC's interrupt controller to lose the interrupt in the same manner as > the bug in this driver. The driver for that interrupt controller will > be written to properly deal with this. 
In some cases the hardware > supports an EOI concept, where the 2nd IRQ is masked and internally > queued in the hardware, to be re-raised at EOI in step 7. In other > cases the IRQ will be unmasked and re-raised at step 4, but the kernel > will see the handler is INPROGRESS and not re-invoke it, and instead set > a PENDING flag, which causes the handler to re-run at step 7. > > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwIFAw&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=ZglJ52030-QSg4t06j182Uq9rsDcKNSB0P7q7Cx5Ow8&s=sHHaXL5syvT5Y0WGTVohW_3QqpQgnPTPOQ-HxAmvnb0&e= > > Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") > Cc: stable@vger.kernel.org > Cc: Vignesh R <vigneshr@ti.com> > Cc: Faiz Abbas <faiz_abbas@ti.com> > Cc: Jingoo Han <jingoohan1@gmail.com> > Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com> > Cc: Joao Pinto <jpinto@synopsys.com> > Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Signed-off-by: Trent Piepho <tpiepho@impinj.com> > --- > drivers/pci/controller/dwc/pcie-designware-host.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c > index 29a05759a294..9a3960c95ad3 100644 > --- a/drivers/pci/controller/dwc/pcie-designware-host.c > +++ b/drivers/pci/controller/dwc/pcie-designware-host.c > @@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp) > irq = irq_find_mapping(pp->irq_domain, > (i * MAX_MSI_IRQS_PER_CTRL) + > pos); > - generic_handle_irq(irq); > dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS + > (i * MSI_REG_CTRL_BLOCK_SIZE), > 4, 1 << pos); > + generic_handle_irq(irq); Makes sense. Acked-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com> > pos++; > } > } >
[CC Marc] On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: > This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status > after it is handled, not before"). > > This is a very real race that we observed quickly after switching from > 4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to > track it to the precise race and verify the fixed behavior, as will be > described below. > > This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI: > designware: Fix missing MSI IRQs") The discussion of that commit, > archived in patchwork [1], is informative and worth reading. > > The bug was re-added in the '8c934 commit this is reverting, which > appeared in the 4.14 kernel. > > Unfortunately, Synopsys appears to consider the operation of this PCI-e > controller secret. They provide no publicly available docs for it nor > allow the references manuals of SoCs using the controller to publish any > documentation of it. > > So I can not say certain this code is correctly using the controller's > features. However, extensive testing gives me high confidence in the > accuracy of what is described below. > > If an MSI is received by the PCI-e controller while the status register > bit for that MSI is set, the PCI-e controller will NOT generate another > interrupt. In addition, it will NOT queue or otherwise mark the > interrupt as "pending", which means it will NOT generate an interrupt > when the status bit is unmasked. > > This gives the following race scenario: > > 1. An MSI is received by, and the status bit for the MSI is set in, the > DWC PCI-e controller. > 2. dw_handle_msi_irq() calls a driver's registered interrupt handler > for the MSI received. > 3. At some point, the interrupt handler must decide, correctly, that > there is no more work to do and return. > 4. The hardware generates a new MSI. As the MSI's status bit is still > set, this new MSI is ignored. > 6. dw_handle_msi_irq() unsets the MSI status bit. > > The MSI received at point 4 will never be acted upon. It occurred after > the driver had finished checking the hardware status for interrupt > conditions to act on. Since the MSI status was masked, it does not > generated a new IRQ, neither when it was received nor when the MSI is > unmasked. > > It seems clear there is an unsolvable race here. > > After this patch, the sequence becomes as follows: > > 1. An MSI is received and the status bit for the MSI is set in the > DWC PCI-e controller. > 2. dw_handle_msi_irq() clears this MSI status bit. > 3. dw_handle_msi_irq() calls a driver's registered interrupt handler > for the MSI received. > 3. At some point, the interrupt handler must decide, correctly, that > there is no more work to do and return. > 4. The hardware generates a new MSI. This sets the MSI status bit and > triggers another interrupt in the interrupt controller(s) above the DWC > PCI-e controller. As the the dwc-pcie handler is not re-entrant, it is > not run again at this time. > 6. dw_handle_msi_irq() finishes. The MSI status bit remains set. > 7. The IRQ is re-raised and dw_handle_msi_irq() runs again. > 8. dw_handle_msi_irq() invokes the MSI's registered interrupt handler > again as the status bit was still set. Not sure why (5) is not used in your lists, I assume because you want to highlight the race condition with the jump from 4 to 6 (or maybe you do not like number 5 :), just curious). 
> The observant will notice that point 4 present the opportunity for the > SoC's interrupt controller to lose the interrupt in the same manner as > the bug in this driver. The driver for that interrupt controller will > be written to properly deal with this. In some cases the hardware > supports an EOI concept, where the 2nd IRQ is masked and internally > queued in the hardware, to be re-raised at EOI in step 7. In other > cases the IRQ will be unmasked and re-raised at step 4, but the kernel > will see the handler is INPROGRESS and not re-invoke it, and instead set > a PENDING flag, which causes the handler to re-run at step 7. > > [1] https://patchwork.kernel.org/patch/3333681/ > > Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") I have two questions: - Commit 8c934095fa2f has been in the kernel for a year and no regression was reported. It was assumed to fix a problem so before reverting it I want to make sure we are not breaking anything else. - Your reasoning seems correct but I would pick Marc's brain on this because I want to understand if what this patch does is what IRQ core expects it to do, especially in relation to the IRQ chaining you are mentioning. Lorenzo > Cc: stable@vger.kernel.org > Cc: Vignesh R <vigneshr@ti.com> > Cc: Faiz Abbas <faiz_abbas@ti.com> > Cc: Jingoo Han <jingoohan1@gmail.com> > Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com> > Cc: Joao Pinto <jpinto@synopsys.com> > Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Signed-off-by: Trent Piepho <tpiepho@impinj.com> > --- > drivers/pci/controller/dwc/pcie-designware-host.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c > index 29a05759a294..9a3960c95ad3 100644 > --- a/drivers/pci/controller/dwc/pcie-designware-host.c > +++ b/drivers/pci/controller/dwc/pcie-designware-host.c > @@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp) > irq = irq_find_mapping(pp->irq_domain, > (i * MAX_MSI_IRQS_PER_CTRL) + > pos); > - generic_handle_irq(irq); > dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS + > (i * MSI_REG_CTRL_BLOCK_SIZE), > 4, 1 << pos); > + generic_handle_irq(irq); > pos++; > } > } > -- > 2.14.4 >
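For contrast, the pre-revert ordering that produces the lost-MSI scenario described above looks roughly like the sketch below (an illustration, not the literal driver code), with the fatal window marked:

```c
/* Sketch of the ordering introduced by 8c934095fa2f, i.e. what is being
 * reverted. */
for (pos = 0; pos < MAX_MSI_IRQS_PER_CTRL; pos++) {
	if (!(val & BIT(pos)))
		continue;

	irq = irq_find_mapping(pp->irq_domain,
			       (i * MAX_MSI_IRQS_PER_CTRL) + pos);

	generic_handle_irq(irq);	/* steps 2-3: driver handler runs and
					 * decides there is no more work.
					 * step 4: a new MSI arriving here is
					 * ignored - its status bit is already
					 * set and nothing is queued.        */

	/* step 6: the write-1-to-clear discards the status bit, and with it
	 * the MSI from step 4, forever. */
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
				(i * MSI_REG_CTRL_BLOCK_SIZE),
			    4, BIT(pos));
}
```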
On 06/11/18 14:53, Lorenzo Pieralisi wrote: > [CC Marc] > > On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: >> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status >> after it is handled, not before"). >> >> This is a very real race that we observed quickly after switching from >> 4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to >> track it to the precise race and verify the fixed behavior, as will be >> described below. >> >> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI: >> designware: Fix missing MSI IRQs") The discussion of that commit, >> archived in patchwork [1], is informative and worth reading. >> >> The bug was re-added in the '8c934 commit this is reverting, which >> appeared in the 4.14 kernel. >> >> Unfortunately, Synopsys appears to consider the operation of this PCI-e >> controller secret. They provide no publicly available docs for it nor >> allow the references manuals of SoCs using the controller to publish any >> documentation of it. >> >> So I can not say certain this code is correctly using the controller's >> features. However, extensive testing gives me high confidence in the >> accuracy of what is described below. >> >> If an MSI is received by the PCI-e controller while the status register >> bit for that MSI is set, the PCI-e controller will NOT generate another >> interrupt. In addition, it will NOT queue or otherwise mark the >> interrupt as "pending", which means it will NOT generate an interrupt >> when the status bit is unmasked. >> >> This gives the following race scenario: >> >> 1. An MSI is received by, and the status bit for the MSI is set in, the >> DWC PCI-e controller. >> 2. dw_handle_msi_irq() calls a driver's registered interrupt handler >> for the MSI received. >> 3. At some point, the interrupt handler must decide, correctly, that >> there is no more work to do and return. >> 4. The hardware generates a new MSI. As the MSI's status bit is still >> set, this new MSI is ignored. >> 6. dw_handle_msi_irq() unsets the MSI status bit. >> >> The MSI received at point 4 will never be acted upon. It occurred after >> the driver had finished checking the hardware status for interrupt >> conditions to act on. Since the MSI status was masked, it does not >> generated a new IRQ, neither when it was received nor when the MSI is >> unmasked. >> >> It seems clear there is an unsolvable race here. >> >> After this patch, the sequence becomes as follows: >> >> 1. An MSI is received and the status bit for the MSI is set in the >> DWC PCI-e controller. >> 2. dw_handle_msi_irq() clears this MSI status bit. >> 3. dw_handle_msi_irq() calls a driver's registered interrupt handler >> for the MSI received. >> 3. At some point, the interrupt handler must decide, correctly, that >> there is no more work to do and return. >> 4. The hardware generates a new MSI. This sets the MSI status bit and >> triggers another interrupt in the interrupt controller(s) above the DWC >> PCI-e controller. As the the dwc-pcie handler is not re-entrant, it is >> not run again at this time. >> 6. dw_handle_msi_irq() finishes. The MSI status bit remains set. >> 7. The IRQ is re-raised and dw_handle_msi_irq() runs again. >> 8. dw_handle_msi_irq() invokes the MSI's registered interrupt handler >> again as the status bit was still set. > > Not sure why (5) is not used in your lists, I assume because you want > to highlight the race condition with the jump from 4 to 6 (or maybe > you do not like number 5 :), just curious). 
> >> The observant will notice that point 4 present the opportunity for the >> SoC's interrupt controller to lose the interrupt in the same manner as >> the bug in this driver. The driver for that interrupt controller will >> be written to properly deal with this. In some cases the hardware >> supports an EOI concept, where the 2nd IRQ is masked and internally >> queued in the hardware, to be re-raised at EOI in step 7. In other >> cases the IRQ will be unmasked and re-raised at step 4, but the kernel >> will see the handler is INPROGRESS and not re-invoke it, and instead set >> a PENDING flag, which causes the handler to re-run at step 7. >> >> [1] https://patchwork.kernel.org/patch/3333681/ >> >> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") > > I have two questions: > > - Commit 8c934095fa2f has been in the kernel for a year and no > regression was reported. It was assumed to fix a problem so before > reverting it I want to make sure we are not breaking anything else. > - Your reasoning seems correct but I would pick Marc's brain on this > because I want to understand if what this patch does is what IRQ core > expects it to do, especially in relation to the IRQ chaining you > are mentioning. It is hard to decide what the right solution is without understanding exactly what this particular write actually does. It seems to be some form of acknowledgement, but I'm only making an educated guess, and some of the defines suggest that there might be another register for that. What I'm interested in is the relationship this has with the mask/unmask callbacks, and whether masking the interrupt before acking it would help. Gustavo, can you help here? In any way, moving the action of acknowledging the interrupt to its right spot in the kernel (dw_pci_bottom_ack) would be a good start. Thanks, M.
On Tue, 2018-11-06 at 14:53 +0000, Lorenzo Pieralisi wrote: > > > Not sure why (5) is not used in your lists, I assume because you want > to highlight the race condition with the jump from 4 to 6 (or maybe > you do not like number 5 :), just curious). It's because I removed a step and I'm used to rst docs where it renumbers automatically. > > > The observant will notice that point 4 present the opportunity for the > > SoC's interrupt controller to lose the interrupt in the same manner as > > the bug in this driver. The driver for that interrupt controller will > > be written to properly deal with this. In some cases the hardware > > supports an EOI concept, where the 2nd IRQ is masked and internally > > queued in the hardware, to be re-raised at EOI in step 7. In other > > cases the IRQ will be unmasked and re-raised at step 4, but the kernel > > will see the handler is INPROGRESS and not re-invoke it, and instead set > > a PENDING flag, which causes the handler to re-run at step 7. > > > > [1] > > > > Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") > > I have two questions: > > - Commit 8c934095fa2f has been in the kernel for a year and no > regression was reported. It was assumed to fix a problem so before > reverting it I want to make sure we are not breaking anything else. There have been reports. Vignesh reported this some time ago. Also see https://lkml.org/lkml/2018/11/2/55, the author indicated to me that it was the caused by this problem after I pointed him to my patch. And I've reported it now too. We only updated to a kernel with the regression a few months ago. I think imx6/7 systems don't update their kernels that quickly. Usually not running a mainstream distro with automatic updates. And there are probably a lot of people using the NXP vendor kernel, which has a ton of unmerged patches, and so they don't see this regression yet. > - Your reasoning seems correct but I would pick Marc's brain on this > because I want to understand if what this patch does is what IRQ core > expects it to do, especially in relation to the IRQ chaining you > are mentioning. I've traced this through the ARM arch code and used our custom PCI device to generate interrupts in all the race windows. It does work as I say, though I can't say this it how it's supposed to work.
On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote: > > It is hard to decide what the right solution is without understanding > exactly what this particular write actually does. It seems to be some > form of acknowledgement, but I'm only making an educated guess, and some > of the defines suggest that there might be another register for that. Unfortunately, there are no docs for this controller. I've determined that it sets a bit in this register when an MSI is received. Once set, it acts as a mask and the controller will generate no interrupts when the same MSI is subsequently received. Writing a 1 to a bit clears that mask bit, obviously so that each bit can be cleared atomically vs a non-atomic RMW. The controller does not queue any MSIs received while the interrupt was masked. > > What I'm interested in is the relationship this has with the mask/unmask > callbacks, and whether masking the interrupt before acking it would help. > > Gustavo, can you help here? > > In any way, moving the action of acknowledging the interrupt to its > right spot in the kernel (dw_pci_bottom_ack) would be a good start. What about stable kernels that don't have the hierarchical API?
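The write-1-to-clear point is worth spelling out, since it is why a single store is safe where a read-modify-write would not be; the second half below is hypothetical, for contrast only (res/pos stand for the usual controller offset and vector index):

```c
/* W1C semantics: clearing one vector is a single store and cannot disturb
 * bits the hardware sets concurrently. */
dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS + res, 4, BIT(pos));

/* If the register were plain read/write instead (hypothetical), clearing
 * one bit would take a read-modify-write, and any bit the hardware set
 * between the read and the write would be silently lost: */
dw_pcie_rd_own_conf(pp, PCIE_MSI_INTR0_STATUS + res, 4, &val);
dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS + res, 4, val & ~BIT(pos));
```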
On Tue, Nov 06, 2018 at 07:40:18PM +0000, Trent Piepho wrote: > On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote: > > > > It is hard to decide what the right solution is without understanding > > exactly what this particular write actually does. It seems to be some > > form of acknowledgement, but I'm only making an educated guess, and some > > of the defines suggest that there might be another register for that. > > Unfortunately, there are no docs for this controller. I've determined > that it sets a bit in this register when an MSI is received. Once set, > it acts as a mask and the controller will generate no interrupts when > the same MSI is subsequently received. Writing a 1 to a bit clears > that mask bit, obviously so that each bit can be cleared atomically vs > a non-atomic RMW. The controller does not queue any MSIs received > while the interrupt was masked. I would like to understand if this is the expected behaviour across dwc controllers but to understand that, as Marc mentioned, we would need some input from Synopsys (if that's not customizable option). Side note: what's the register at offset PCIE_MSI_INTR0_MASK supposed to do ? It seems unused in the current kernel, with a macro name hinting at MSI masking so I am asking. > > What I'm interested in is the relationship this has with the mask/unmask > > callbacks, and whether masking the interrupt before acking it would help. > > > > > > Gustavo, can you help here? > > > > In any way, moving the action of acknowledging the interrupt to its > > right spot in the kernel (dw_pci_bottom_ack) would be a good start. > > What about stable kernels that don't have the hierarchical API? We will have to backport the patches. Lorenzo
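For reference, the register Lorenzo is asking about sits right next to the two the driver already touches; the offsets below are the defines from pcie-designware.h in this era, with each controller's register block repeating at MSI_REG_CTRL_BLOCK_SIZE intervals:

```c
/* drivers/pci/controller/dwc/pcie-designware.h (v4.19-era) */
#define PCIE_MSI_INTR0_ENABLE		0x828	/* used by the driver     */
#define PCIE_MSI_INTR0_MASK		0x82C	/* unused at this point   */
#define PCIE_MSI_INTR0_STATUS		0x830	/* W1C pending bits       */
#define MSI_REG_CTRL_BLOCK_SIZE		0xC	/* stride between blocks  */
```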
On 06/11/2018 16:00, Marc Zyngier wrote: > On 06/11/18 14:53, Lorenzo Pieralisi wrote: >> [CC Marc] >> >> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: >>> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status >>> after it is handled, not before"). >>> >>> This is a very real race that we observed quickly after switching from >>> 4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to >>> track it to the precise race and verify the fixed behavior, as will be >>> described below. >>> >>> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI: >>> designware: Fix missing MSI IRQs") The discussion of that commit, >>> archived in patchwork [1], is informative and worth reading. >>> >>> The bug was re-added in the '8c934 commit this is reverting, which >>> appeared in the 4.14 kernel. >>> >>> Unfortunately, Synopsys appears to consider the operation of this PCI-e >>> controller secret. They provide no publicly available docs for it nor >>> allow the references manuals of SoCs using the controller to publish any >>> documentation of it. >>> >>> So I can not say certain this code is correctly using the controller's >>> features. However, extensive testing gives me high confidence in the >>> accuracy of what is described below. >>> >>> If an MSI is received by the PCI-e controller while the status register >>> bit for that MSI is set, the PCI-e controller will NOT generate another >>> interrupt. In addition, it will NOT queue or otherwise mark the >>> interrupt as "pending", which means it will NOT generate an interrupt >>> when the status bit is unmasked. >>> >>> This gives the following race scenario: >>> >>> 1. An MSI is received by, and the status bit for the MSI is set in, the >>> DWC PCI-e controller. >>> 2. dw_handle_msi_irq() calls a driver's registered interrupt handler >>> for the MSI received. >>> 3. At some point, the interrupt handler must decide, correctly, that >>> there is no more work to do and return. >>> 4. The hardware generates a new MSI. As the MSI's status bit is still >>> set, this new MSI is ignored. >>> 6. dw_handle_msi_irq() unsets the MSI status bit. >>> >>> The MSI received at point 4 will never be acted upon. It occurred after >>> the driver had finished checking the hardware status for interrupt >>> conditions to act on. Since the MSI status was masked, it does not >>> generated a new IRQ, neither when it was received nor when the MSI is >>> unmasked. >>> >>> It seems clear there is an unsolvable race here. >>> >>> After this patch, the sequence becomes as follows: >>> >>> 1. An MSI is received and the status bit for the MSI is set in the >>> DWC PCI-e controller. >>> 2. dw_handle_msi_irq() clears this MSI status bit. >>> 3. dw_handle_msi_irq() calls a driver's registered interrupt handler >>> for the MSI received. >>> 3. At some point, the interrupt handler must decide, correctly, that >>> there is no more work to do and return. >>> 4. The hardware generates a new MSI. This sets the MSI status bit and >>> triggers another interrupt in the interrupt controller(s) above the DWC >>> PCI-e controller. As the the dwc-pcie handler is not re-entrant, it is >>> not run again at this time. >>> 6. dw_handle_msi_irq() finishes. The MSI status bit remains set. >>> 7. The IRQ is re-raised and dw_handle_msi_irq() runs again. >>> 8. dw_handle_msi_irq() invokes the MSI's registered interrupt handler >>> again as the status bit was still set. 
>> >> Not sure why (5) is not used in your lists, I assume because you want >> to highlight the race condition with the jump from 4 to 6 (or maybe >> you do not like number 5 :), just curious). >> >>> The observant will notice that point 4 present the opportunity for the >>> SoC's interrupt controller to lose the interrupt in the same manner as >>> the bug in this driver. The driver for that interrupt controller will >>> be written to properly deal with this. In some cases the hardware >>> supports an EOI concept, where the 2nd IRQ is masked and internally >>> queued in the hardware, to be re-raised at EOI in step 7. In other >>> cases the IRQ will be unmasked and re-raised at step 4, but the kernel >>> will see the handler is INPROGRESS and not re-invoke it, and instead set >>> a PENDING flag, which causes the handler to re-run at step 7. >>> >>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=3fxVS1aKAoLsOD0JiRgCC9pnyPfwwNVecSPqFznQ3Kc&s=X9ZYshlrXFwQbZJTzWC3ZC77lveB4ZlqfYqGJiSuqY8&e= >>> >>> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") >> >> I have two questions: >> >> - Commit 8c934095fa2f has been in the kernel for a year and no >> regression was reported. It was assumed to fix a problem so before >> reverting it I want to make sure we are not breaking anything else. >> - Your reasoning seems correct but I would pick Marc's brain on this >> because I want to understand if what this patch does is what IRQ core >> expects it to do, especially in relation to the IRQ chaining you >> are mentioning. > > It is hard to decide what the right solution is without understanding > exactly what this particular write actually does. It seems to be some > form of acknowledgement, but I'm only making an educated guess, and some > of the defines suggest that there might be another register for that. This status register indicates whether exists or not a MSI interrupt on that controller [0..7] to be handle. In theory, we should clear the interrupt flag only after the interrupt has actually handled (which can take some time to process on the worst case scenario). However, the Trent's patch allows to acknowledge the flag and handle the interrupt later, giving the opportunity to catch a possible new interrupt, which will be handle by a new call of this function. > > What I'm interested in is the relationship this has with the mask/unmask > callbacks, and whether masking the interrupt before acking it would help. Although there is the possibility of mask/unmask the interruptions on the controller, from what I've seen typically in other dw drivers this is not done. Probably we don't get much benefit from using it. Gustavo > > Gustavo, can you help here? > > In any way, moving the action of acknowledging the interrupt to its > right spot in the kernel (dw_pci_bottom_ack) would be a good start. > > Thanks, > > M. >
On 07/11/2018 11:07, Lorenzo Pieralisi wrote: > On Tue, Nov 06, 2018 at 07:40:18PM +0000, Trent Piepho wrote: >> On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote: >>> >>> It is hard to decide what the right solution is without understanding >>> exactly what this particular write actually does. It seems to be some >>> form of acknowledgement, but I'm only making an educated guess, and some >>> of the defines suggest that there might be another register for that. >> >> Unfortunately, there are no docs for this controller. I've determined >> that it sets a bit in this register when an MSI is received. Once set, >> it acts as a mask and the controller will generate no interrupts when >> the same MSI is subsequently received. Writing a 1 to a bit clears >> that mask bit, obviously so that each bit can be cleared atomically vs >> a non-atomic RMW. The controller does not queue any MSIs received >> while the interrupt was masked. > > I would like to understand if this is the expected behaviour across > dwc controllers but to understand that, as Marc mentioned, we would > need some input from Synopsys (if that's not customizable option). > > Side note: what's the register at offset PCIE_MSI_INTR0_MASK > supposed to do ? It seems unused in the current kernel, with > a macro name hinting at MSI masking so I am asking. That's the interrupt mask for the controller #0 offset. Currently it's not used. Gustavo > >>> What I'm interested in is the relationship this has with the mask/unmask >>> callbacks, and whether masking the interrupt before acking it would help. >> >> >>> >>> Gustavo, can you help here? >>> >>> In any way, moving the action of acknowledging the interrupt to its >>> right spot in the kernel (dw_pci_bottom_ack) would be a good start. >> >> What about stable kernels that don't have the hierarchical API? > > We will have to backport the patches. > > Lorenzo >
On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote: > On 06/11/2018 16:00, Marc Zyngier wrote: > > On 06/11/18 14:53, Lorenzo Pieralisi wrote: > > > On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: > > > > > > > > This gives the following race scenario: > > > > > > > > 1. An MSI is received by, and the status bit for the MSI is set in, the > > > > DWC PCI-e controller. > > > > 2. dw_handle_msi_irq() calls a driver's registered interrupt handler > > > > for the MSI received. > > > > 3. At some point, the interrupt handler must decide, correctly, that > > > > there is no more work to do and return. > > > > 4. The hardware generates a new MSI. As the MSI's status bit is still > > > > set, this new MSI is ignored. > > > > 6. dw_handle_msi_irq() unsets the MSI status bit. > > > > > > > > The MSI received at point 4 will never be acted upon. It occurred after > > > > the driver had finished checking the hardware status for interrupt > > > > conditions to act on. Since the MSI status was masked, it does not > > > > generated a new IRQ, neither when it was received nor when the MSI is > > > > unmasked. > > > > > This status register indicates whether exists or not a MSI interrupt on that > controller [0..7] to be handle. While the status for an MSI is set, no new interrupt will be triggered if another identical MSI is received, correct? > In theory, we should clear the interrupt flag only after the interrupt has > actually handled (which can take some time to process on the worst case scenario). But see above, there is a race if a new MSI arrives while still masked. I can see no possible way to solve this in software that does not involve unmasking the MSI before calling the handler. To leave the interrupt masked while calling the handler requires the hardware to queue an interrupt that arrives while masked. We have no docs, but the designware controller doesn't appear to do this in practice. > However, the Trent's patch allows to acknowledge the flag and handle the > interrupt later, giving the opportunity to catch a possible new interrupt, which > will be handle by a new call of this function. > > > > > What I'm interested in is the relationship this has with the mask/unmask > > callbacks, and whether masking the interrupt before acking it would help. > > Although there is the possibility of mask/unmask the interruptions on the > controller, from what I've seen typically in other dw drivers this is not done. > Probably we don't get much benefit from using it. > > Gustavo > > > > > Gustavo, can you help here? > > > > In any way, moving the action of acknowledging the interrupt to its > > right spot in the kernel (dw_pci_bottom_ack) would be a good start. > > > > Thanks, > > > > M. > > > >
On 06/11/18 19:40, Trent Piepho wrote: > On Tue, 2018-11-06 at 16:00 +0000, Marc Zyngier wrote: >> >> It is hard to decide what the right solution is without understanding >> exactly what this particular write actually does. It seems to be some >> form of acknowledgement, but I'm only making an educated guess, and some >> of the defines suggest that there might be another register for that. > > Unfortunately, there are no docs for this controller. I've determined > that it sets a bit in this register when an MSI is received. Once set, > it acts as a mask and the controller will generate no interrupts when > the same MSI is subsequently received. Writing a 1 to a bit clears > that mask bit, obviously so that each bit can be cleared atomically vs > a non-atomic RMW. The controller does not queue any MSIs received > while the interrupt was masked. I'm not sure this is really a mask, as the controller seems to have other ways of really masking the interrupt. It looks a lot more like either an ACK, but I need to understand how this interacts with the masking register. And I do not want to guess it. I want actual information from people who build this IP. >> What I'm interested in is the relationship this has with the mask/unmask >> callbacks, and whether masking the interrupt before acking it would help. > > >> >> Gustavo, can you help here? >> >> In any way, moving the action of acknowledging the interrupt to its >> right spot in the kernel (dw_pci_bottom_ack) would be a good start. > > What about stable kernels that don't have the hierarchical API? My goal is to fix mainline first. Once we have something that works on mainline, we can look at propagating the fix to other versions. But mainline always comes first. Thanks, M.
On 07/11/18 12:57, Gustavo Pimentel wrote: > On 06/11/2018 16:00, Marc Zyngier wrote: >> On 06/11/18 14:53, Lorenzo Pieralisi wrote: >>> [CC Marc] >>> >>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: >>>> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status >>>> after it is handled, not before"). >>>> >>>> This is a very real race that we observed quickly after switching from >>>> 4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to >>>> track it to the precise race and verify the fixed behavior, as will be >>>> described below. >>>> >>>> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI: >>>> designware: Fix missing MSI IRQs") The discussion of that commit, >>>> archived in patchwork [1], is informative and worth reading. >>>> >>>> The bug was re-added in the '8c934 commit this is reverting, which >>>> appeared in the 4.14 kernel. >>>> >>>> Unfortunately, Synopsys appears to consider the operation of this PCI-e >>>> controller secret. They provide no publicly available docs for it nor >>>> allow the references manuals of SoCs using the controller to publish any >>>> documentation of it. >>>> >>>> So I can not say certain this code is correctly using the controller's >>>> features. However, extensive testing gives me high confidence in the >>>> accuracy of what is described below. >>>> >>>> If an MSI is received by the PCI-e controller while the status register >>>> bit for that MSI is set, the PCI-e controller will NOT generate another >>>> interrupt. In addition, it will NOT queue or otherwise mark the >>>> interrupt as "pending", which means it will NOT generate an interrupt >>>> when the status bit is unmasked. >>>> >>>> This gives the following race scenario: >>>> >>>> 1. An MSI is received by, and the status bit for the MSI is set in, the >>>> DWC PCI-e controller. >>>> 2. dw_handle_msi_irq() calls a driver's registered interrupt handler >>>> for the MSI received. >>>> 3. At some point, the interrupt handler must decide, correctly, that >>>> there is no more work to do and return. >>>> 4. The hardware generates a new MSI. As the MSI's status bit is still >>>> set, this new MSI is ignored. >>>> 6. dw_handle_msi_irq() unsets the MSI status bit. >>>> >>>> The MSI received at point 4 will never be acted upon. It occurred after >>>> the driver had finished checking the hardware status for interrupt >>>> conditions to act on. Since the MSI status was masked, it does not >>>> generated a new IRQ, neither when it was received nor when the MSI is >>>> unmasked. >>>> >>>> It seems clear there is an unsolvable race here. >>>> >>>> After this patch, the sequence becomes as follows: >>>> >>>> 1. An MSI is received and the status bit for the MSI is set in the >>>> DWC PCI-e controller. >>>> 2. dw_handle_msi_irq() clears this MSI status bit. >>>> 3. dw_handle_msi_irq() calls a driver's registered interrupt handler >>>> for the MSI received. >>>> 3. At some point, the interrupt handler must decide, correctly, that >>>> there is no more work to do and return. >>>> 4. The hardware generates a new MSI. This sets the MSI status bit and >>>> triggers another interrupt in the interrupt controller(s) above the DWC >>>> PCI-e controller. As the the dwc-pcie handler is not re-entrant, it is >>>> not run again at this time. >>>> 6. dw_handle_msi_irq() finishes. The MSI status bit remains set. >>>> 7. The IRQ is re-raised and dw_handle_msi_irq() runs again. >>>> 8. 
dw_handle_msi_irq() invokes the MSI's registered interrupt handler >>>> again as the status bit was still set. >>> >>> Not sure why (5) is not used in your lists, I assume because you want >>> to highlight the race condition with the jump from 4 to 6 (or maybe >>> you do not like number 5 :), just curious). >>> >>>> The observant will notice that point 4 present the opportunity for the >>>> SoC's interrupt controller to lose the interrupt in the same manner as >>>> the bug in this driver. The driver for that interrupt controller will >>>> be written to properly deal with this. In some cases the hardware >>>> supports an EOI concept, where the 2nd IRQ is masked and internally >>>> queued in the hardware, to be re-raised at EOI in step 7. In other >>>> cases the IRQ will be unmasked and re-raised at step 4, but the kernel >>>> will see the handler is INPROGRESS and not re-invoke it, and instead set >>>> a PENDING flag, which causes the handler to re-run at step 7. >>>> >>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=3fxVS1aKAoLsOD0JiRgCC9pnyPfwwNVecSPqFznQ3Kc&s=X9ZYshlrXFwQbZJTzWC3ZC77lveB4ZlqfYqGJiSuqY8&e= >>>> >>>> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") >>> >>> I have two questions: >>> >>> - Commit 8c934095fa2f has been in the kernel for a year and no >>> regression was reported. It was assumed to fix a problem so before >>> reverting it I want to make sure we are not breaking anything else. >>> - Your reasoning seems correct but I would pick Marc's brain on this >>> because I want to understand if what this patch does is what IRQ core >>> expects it to do, especially in relation to the IRQ chaining you >>> are mentioning. >> >> It is hard to decide what the right solution is without understanding >> exactly what this particular write actually does. It seems to be some >> form of acknowledgement, but I'm only making an educated guess, and some >> of the defines suggest that there might be another register for that. > > This status register indicates whether exists or not a MSI interrupt on that > controller [0..7] to be handle. OK. What happen when the interrupt is masked? Does this interrupt appear in the status register? And what is the effect of writing to that register? > In theory, we should clear the interrupt flag only after the interrupt has > actually handled (which can take some time to process on the worst case scenario). At this stage, we do not care about performance, but correctness. If we loose interrupts, then the driver is not correct, and we need to address this first. > However, the Trent's patch allows to acknowledge the flag and handle the > interrupt later, giving the opportunity to catch a possible new interrupt, which > will be handle by a new call of this function. > >> >> What I'm interested in is the relationship this has with the mask/unmask >> callbacks, and whether masking the interrupt before acking it would help. > > Although there is the possibility of mask/unmask the interruptions on the > controller, from what I've seen typically in other dw drivers this is not done. > Probably we don't get much benefit from using it. I don't worry care about the drivers. I need to know what the HW does, and how this maps to the Linux interrupt subsystem. At the moment, this is just too blurry to be understandable. Thanks, M.
On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: > On 06/11/18 19:40, Trent Piepho wrote: > > > > What about stable kernels that don't have the hierarchical API? > > My goal is to fix mainline first. Once we have something that works on > mainline, we can look at propagating the fix to other versions. But > mainline always comes first. This is a regression that went into 4.14. Wouldn't the appropriate action for the stable series be to undo the regression? The hierarchical design might allow unmasking the irq to be placed in a better location. And maybe knowing how the hardware worked would allow for a design that maintained a separate pending and active state in hardware, with distinct ack and eoi controls, i.e. allowed an MSI to be queued while still masked. But that doesn't seem like the kind of enhancement that would be backported to stable.
On 07/11/18 20:17, Trent Piepho wrote: > On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: >> On 06/11/18 19:40, Trent Piepho wrote: >>> >>> What about stable kernels that don't have the hierarchical API? >> >> My goal is to fix mainline first. Once we have something that works on >> mainline, we can look at propagating the fix to other versions. But >> mainline always comes first. > > This is a regression that went into 4.14. Wouldn't the appropriate > action for the stable series be to undo the regression? This is not how stable works. Stable kernels *only* contain patches that are backported from mainline, and do not take standalone patch. Furthermore, your fix is to actually undo someone else's fix. Who is right? In the absence of any documentation, the answer is "nobody". > The hierarchical design might allow unmasking the irq to be placed in a > better location. And maybe knowing how the hardware worked would allow > for a design that maintained a separate pending and active state in > hardware, with distinct ack and eoi controls, i.e. allowed an MSI to be > queued while still masked. But that doesn't seem like the kind of > enhancement that would be backported to stable. Anything can be backported to stable once we understand the issue. At the moment, we're just playing games moving stuff around and hope nothing else will break. That's not a sustainable way of maintaining this driver. At the moment, the only patch I'm inclined to propose until we get an actual interrupt handling flow from Synopsys is to mark this driver as "BROKEN". Thanks, M.
On 07/11/2018 18:46, Marc Zyngier wrote: > On 07/11/18 12:57, Gustavo Pimentel wrote: >> On 06/11/2018 16:00, Marc Zyngier wrote: >>> On 06/11/18 14:53, Lorenzo Pieralisi wrote: >>>> [CC Marc] >>>> >>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: >>>>> This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status >>>>> after it is handled, not before"). >>>>> >>>>> This is a very real race that we observed quickly after switching from >>>>> 4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to >>>>> track it to the precise race and verify the fixed behavior, as will be >>>>> described below. >>>>> >>>>> This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI: >>>>> designware: Fix missing MSI IRQs") The discussion of that commit, >>>>> archived in patchwork [1], is informative and worth reading. >>>>> >>>>> The bug was re-added in the '8c934 commit this is reverting, which >>>>> appeared in the 4.14 kernel. >>>>> >>>>> Unfortunately, Synopsys appears to consider the operation of this PCI-e >>>>> controller secret. They provide no publicly available docs for it nor >>>>> allow the references manuals of SoCs using the controller to publish any >>>>> documentation of it. >>>>> >>>>> So I can not say certain this code is correctly using the controller's >>>>> features. However, extensive testing gives me high confidence in the >>>>> accuracy of what is described below. >>>>> >>>>> If an MSI is received by the PCI-e controller while the status register >>>>> bit for that MSI is set, the PCI-e controller will NOT generate another >>>>> interrupt. In addition, it will NOT queue or otherwise mark the >>>>> interrupt as "pending", which means it will NOT generate an interrupt >>>>> when the status bit is unmasked. >>>>> >>>>> This gives the following race scenario: >>>>> >>>>> 1. An MSI is received by, and the status bit for the MSI is set in, the >>>>> DWC PCI-e controller. >>>>> 2. dw_handle_msi_irq() calls a driver's registered interrupt handler >>>>> for the MSI received. >>>>> 3. At some point, the interrupt handler must decide, correctly, that >>>>> there is no more work to do and return. >>>>> 4. The hardware generates a new MSI. As the MSI's status bit is still >>>>> set, this new MSI is ignored. >>>>> 6. dw_handle_msi_irq() unsets the MSI status bit. >>>>> >>>>> The MSI received at point 4 will never be acted upon. It occurred after >>>>> the driver had finished checking the hardware status for interrupt >>>>> conditions to act on. Since the MSI status was masked, it does not >>>>> generated a new IRQ, neither when it was received nor when the MSI is >>>>> unmasked. >>>>> >>>>> It seems clear there is an unsolvable race here. >>>>> >>>>> After this patch, the sequence becomes as follows: >>>>> >>>>> 1. An MSI is received and the status bit for the MSI is set in the >>>>> DWC PCI-e controller. >>>>> 2. dw_handle_msi_irq() clears this MSI status bit. >>>>> 3. dw_handle_msi_irq() calls a driver's registered interrupt handler >>>>> for the MSI received. >>>>> 3. At some point, the interrupt handler must decide, correctly, that >>>>> there is no more work to do and return. >>>>> 4. The hardware generates a new MSI. This sets the MSI status bit and >>>>> triggers another interrupt in the interrupt controller(s) above the DWC >>>>> PCI-e controller. As the the dwc-pcie handler is not re-entrant, it is >>>>> not run again at this time. >>>>> 6. dw_handle_msi_irq() finishes. The MSI status bit remains set. >>>>> 7. 
The IRQ is re-raised and dw_handle_msi_irq() runs again. >>>>> 8. dw_handle_msi_irq() invokes the MSI's registered interrupt handler >>>>> again as the status bit was still set. >>>> >>>> Not sure why (5) is not used in your lists, I assume because you want >>>> to highlight the race condition with the jump from 4 to 6 (or maybe >>>> you do not like number 5 :), just curious). >>>> >>>>> The observant will notice that point 4 present the opportunity for the >>>>> SoC's interrupt controller to lose the interrupt in the same manner as >>>>> the bug in this driver. The driver for that interrupt controller will >>>>> be written to properly deal with this. In some cases the hardware >>>>> supports an EOI concept, where the 2nd IRQ is masked and internally >>>>> queued in the hardware, to be re-raised at EOI in step 7. In other >>>>> cases the IRQ will be unmasked and re-raised at step 4, but the kernel >>>>> will see the handler is INPROGRESS and not re-invoke it, and instead set >>>>> a PENDING flag, which causes the handler to re-run at step 7. >>>>> >>>>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.org_patch_3333681_&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=3fxVS1aKAoLsOD0JiRgCC9pnyPfwwNVecSPqFznQ3Kc&s=X9ZYshlrXFwQbZJTzWC3ZC77lveB4ZlqfYqGJiSuqY8&e= >>>>> >>>>> Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") >>>> >>>> I have two questions: >>>> >>>> - Commit 8c934095fa2f has been in the kernel for a year and no >>>> regression was reported. It was assumed to fix a problem so before >>>> reverting it I want to make sure we are not breaking anything else. >>>> - Your reasoning seems correct but I would pick Marc's brain on this >>>> because I want to understand if what this patch does is what IRQ core >>>> expects it to do, especially in relation to the IRQ chaining you >>>> are mentioning. >>> >>> It is hard to decide what the right solution is without understanding >>> exactly what this particular write actually does. It seems to be some >>> form of acknowledgement, but I'm only making an educated guess, and some >>> of the defines suggest that there might be another register for that. >> >> This status register indicates whether exists or not a MSI interrupt on that >> controller [0..7] to be handle. > > OK. What happen when the interrupt is masked? Does this interrupt appear > in the status register? And what is the effect of writing to that register? When an MSI is received for a masked interrupt, the corresponding status bit gets set in the interrupt status register but the controller will not signal it. As soon as the masked interrupt is unmasked and assuming the status bit is still set the controller will signal it. Have I explain this clearly? > >> In theory, we should clear the interrupt flag only after the interrupt has >> actually handled (which can take some time to process on the worst case scenario). > > At this stage, we do not care about performance, but correctness. If we > loose interrupts, then the driver is not correct, and we need to address > this first. Ok, Marc. Let see if I can summarize it. You are suggesting to change the register from PCIE_MSI_INTR0_ENABLE to PCIE_MSI_INTR0_MASK on dw_pci_bottom_mask() and dw_pci_bottom_unmask() and clearing the interrupt status bit inside of dw_pci_bottom_ack() instead of dw_handle_msi_irq(), right? 
> >> However, the Trent's patch allows to acknowledge the flag and handle the >> interrupt later, giving the opportunity to catch a possible new interrupt, which >> will be handle by a new call of this function. >> >>> >>> What I'm interested in is the relationship this has with the mask/unmask >>> callbacks, and whether masking the interrupt before acking it would help. >> >> Although there is the possibility of mask/unmask the interruptions on the >> controller, from what I've seen typically in other dw drivers this is not done. >> Probably we don't get much benefit from using it. > > I don't worry care about the drivers. I need to know what the HW does, > and how this maps to the Linux interrupt subsystem. At the moment, this > is just too blurry to be understandable. Ok, let's clarify this then. > > Thanks, > > M. >
On 07/11/2018 18:32, Trent Piepho wrote: > On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote: >> On 06/11/2018 16:00, Marc Zyngier wrote: >>> On 06/11/18 14:53, Lorenzo Pieralisi wrote: >>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: >>>>> >>>>> This gives the following race scenario: >>>>> >>>>> 1. An MSI is received by, and the status bit for the MSI is set in, the >>>>> DWC PCI-e controller. >>>>> 2. dw_handle_msi_irq() calls a driver's registered interrupt handler >>>>> for the MSI received. >>>>> 3. At some point, the interrupt handler must decide, correctly, that >>>>> there is no more work to do and return. >>>>> 4. The hardware generates a new MSI. As the MSI's status bit is still >>>>> set, this new MSI is ignored. >>>>> 6. dw_handle_msi_irq() unsets the MSI status bit. >>>>> >>>>> The MSI received at point 4 will never be acted upon. It occurred after >>>>> the driver had finished checking the hardware status for interrupt >>>>> conditions to act on. Since the MSI status was masked, it does not >>>>> generated a new IRQ, neither when it was received nor when the MSI is >>>>> unmasked. >>>>> > >> This status register indicates whether exists or not a MSI interrupt on that >> controller [0..7] to be handle. > > While the status for an MSI is set, no new interrupt will be triggered Yes > if another identical MSI is received, correct? You cannot receive another identical MSI till you acknowledge the current one (This is ensured by the PCI protocol, I guess). > >> In theory, we should clear the interrupt flag only after the interrupt has >> actually handled (which can take some time to process on the worst case scenario). > > But see above, there is a race if a new MSI arrives while still masked. > I can see no possible way to solve this in software that does not > involve unmasking the MSI before calling the handler. To leave the > interrupt masked while calling the handler requires the hardware to > queue an interrupt that arrives while masked. We have no docs, but the > designware controller doesn't appear to do this in practice. See my reply to Marc about the interrupt masking. Like you said, probably the solution pass through using interrupt mask/unmask register instead of interrupt enable/disable register. Can you do a quick test, since you can easily reproduce the issue? Can you change register offset on both functions dw_pci_bottom_mask() and dw_pci_bottom_unmask()? Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK. Thanks. Gustavo > >> However, the Trent's patch allows to acknowledge the flag and handle the >> interrupt later, giving the opportunity to catch a possible new interrupt, which >> will be handle by a new call of this function. >> >>> >>> What I'm interested in is the relationship this has with the mask/unmask >>> callbacks, and whether masking the interrupt before acking it would help. >> >> Although there is the possibility of mask/unmask the interruptions on the >> controller, from what I've seen typically in other dw drivers this is not done. >> Probably we don't get much benefit from using it. >> >> Gustavo >> >>> >>> Gustavo, can you help here? >>> >>> In any way, moving the action of acknowledging the interrupt to its >>> right spot in the kernel (dw_pci_bottom_ack) would be a good start. >>> >>> Thanks, >>> >>> M. >>> >>
On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: > On 07/11/18 20:17, Trent Piepho wrote: > > On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: > > > On 06/11/18 19:40, Trent Piepho wrote: > > > > > > > > What about stable kernels that don't have the hierarchical API? > > > > > > My goal is to fix mainline first. Once we have something that works on > > > mainline, we can look at propagating the fix to other versions. But > > > mainline always comes first. > > > > This is a regression that went into 4.14. Wouldn't the appropriate > > action for the stable series be to undo the regression? > > This is not how stable works. Stable kernels *only* contain patches that > are backported from mainline, and do not take standalone patch. > > Furthermore, your fix is to actually undo someone else's fix. Who is > right? In the absence of any documentation, the answer is "nobody". Little more history to this bug. The code was originally the way it is now, but this same bug was fixed in 2013 in https://patchwork.kernel.or g/patch/3333681/ Then that lasted four years until it was changed Aug 2017 in https://pa tchwork.kernel.org/patch/9893303/ That lasted just six months until someone tried to revert it, https://p atchwork.kernel.org/patch/9893303/ Seems pretty clear the way it is now is much worse than the way it was before, even if the previous design may have had another flaw. Though I've yet to see anyone point out something makes the previous design broken. Sub-optimal yes, but not broken. > Anything can be backported to stable once we understand the issue. At > the moment, we're just playing games moving stuff around and hope > nothing else will break. That's not a sustainable way of maintaining > this driver. At the moment, the only patch I'm inclined to propose until > we get an actual interrupt handling flow from Synopsys is to mark this > driver as "BROKEN". It feels like you're using this bug to hold designware hostage in a broken kernel, and me along with them. I don't have the documentation, no one does, there's no way for me to give you want you want. But I've got hardware that doesn't work in the mainline kernel.
On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote: > On 07/11/2018 18:32, Trent Piepho wrote: > > On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote: > > > On 06/11/2018 16:00, Marc Zyngier wrote: > > > > On 06/11/18 14:53, Lorenzo Pieralisi wrote: > > > > > On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: > > > > > > > > > > > > This gives the following race scenario: > > > > > > > > > > > > 1. An MSI is received by, and the status bit for the MSI is set in, the > > > > > > DWC PCI-e controller. > > > > > > 2. dw_handle_msi_irq() calls a driver's registered interrupt handler > > > > > > for the MSI received. > > > > > > 3. At some point, the interrupt handler must decide, correctly, that > > > > > > there is no more work to do and return. > > > > > > 4. The hardware generates a new MSI. As the MSI's status bit is still > > > > > > set, this new MSI is ignored. > > > > > > 6. dw_handle_msi_irq() unsets the MSI status bit. > > > > > > > > > > > > The MSI received at point 4 will never be acted upon. It occurred after > > > > > > the driver had finished checking the hardware status for interrupt > > > > > > conditions to act on. Since the MSI status was masked, it does not > > > > > > generated a new IRQ, neither when it was received nor when the MSI is > > > > > > unmasked. > > > > > > > > > This status register indicates whether exists or not a MSI interrupt on that > > > controller [0..7] to be handle. > > > > While the status for an MSI is set, no new interrupt will be triggered > > Yes > > > if another identical MSI is received, correct? > > You cannot receive another identical MSI till you acknowledge the current one > (This is ensured by the PCI protocol, I guess). I don't believe this is a requirement of PCI. We designed our hardware to not send another MSI until our hardware's interrupt status register is read, but we didn't have to do that. > > > In theory, we should clear the interrupt flag only after the interrupt has > > > actually handled (which can take some time to process on the worst case scenario). > > > > But see above, there is a race if a new MSI arrives while still masked. > > I can see no possible way to solve this in software that does not > > involve unmasking the MSI before calling the handler. To leave the > > interrupt masked while calling the handler requires the hardware to > > queue an interrupt that arrives while masked. We have no docs, but the > > designware controller doesn't appear to do this in practice. > > See my reply to Marc about the interrupt masking. Like you said, probably the > solution pass through using interrupt mask/unmask register instead of interrupt > enable/disable register. > > Can you do a quick test, since you can easily reproduce the issue? Can you > change register offset on both functions dw_pci_bottom_mask() and > dw_pci_bottom_unmask()? > > Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK. Of course MSI still need to be enabled to work at all, which happens once when the driver using the MSI registers a handler. Masking can be done via mask register after that. It is not so easy for me to test on the newest kernel, as imx7d does not work due to yet more bugs. I have to port a set of patches to each new kernel. A set that does not shrink due to holdups like this. 
I understand the new flow would look like this (assume the dw controller MSI interrupt output signal is connected to one of the ARM GIC interrupt lines; there could be different or more controllers above the dwc of course, but usually aren't):

1. MSI arrives, status bit is set in dwc, interrupt raised to GIC.
2. dwc handler runs
3. dwc handler sees status bit is set for a(n) MSI(s)
4. dwc handler sets mask for those MSIs
5. dwc handler clears status bit
6. dwc handler runs driver handler for the received MSI
7. ** a new MSI arrives, racing with 6 **
8. status bit becomes set again, but no interrupt is raised due to mask
9. dwc handler unmasks MSI, which raises the interrupt to GIC because of the new MSI received in 7.
10. The original GIC interrupt is EOI'ed.
11. The interrupt for the dwc is re-raised by the GIC due to 9, and we go back to 2.

It is very important that 5 be done before 6. Less so 4 before 5, but reversing the order there would allow re-raising even if the 2nd MSI arrived before the driver handler ran, which is not necessary.

I do not see a race in this design and it appears correct to me. But I also do not think there is any immediate improvement from the extra steps of masking and unmasking the MSI.

The reason is that the GIC interrupt above the dwc is non-reentrant. It remains masked (aka active [1]) during this entire process (1 to 10). This means every MSI is effectively already masked. So masking the active MSI(s) a 2nd time gains nothing besides preventing some extra edges for a masked interrupt going to the ARM GIC.

In theory, if the GIC interrupt handler were reentrant, then on receipt of a new MSI we could re-enter the dwc handler on a different CPU and run the new MSI (a different MSI!) at the same time as the original MSI handler is still running. The difference here is that by unmasking the interrupt in the GIC before the dwc handler is finished, masking an individual MSI in the dwc is no longer a second, redundant masking.

[1] When I say masked in GIC, I mean the interrupt is in the "active" or "active and pending" states. In these states the interrupt will not be raised to the CPU and can be considered masked.
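For readers following this in the driver, the ordering described above (clear the status bit before dispatching to the endpoint driver's handler) maps roughly onto the sketch below. It is a simplified, single-controller version using the mainline accessors and defines, not the literal dw_handle_msi_irq(); pp->irq_domain and the 32-vector limit are assumptions based on the driver of this era.

/* Simplified chained-handler sketch: one 32-vector MSI controller block. */
static irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
{
	irqreturn_t ret = IRQ_NONE;
	unsigned long status;
	unsigned int bit;
	u32 val;

	dw_pcie_rd_own_conf(pp, PCIE_MSI_INTR0_STATUS, 4, &val);
	if (!val)
		return ret;

	status = val;
	for_each_set_bit(bit, &status, 32) {
		/* Step 5 before step 6: clear the latched bit first, so an
		 * MSI arriving while the endpoint handler runs latches it
		 * again and re-raises the parent interrupt afterwards.
		 */
		dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS, 4, BIT(bit));
		generic_handle_irq(irq_find_mapping(pp->irq_domain, bit));
		ret = IRQ_HANDLED;
	}

	return ret;
}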
On Thu, Nov 08, 2018 at 07:49:52PM +0000, Trent Piepho wrote: > On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: > > On 07/11/18 20:17, Trent Piepho wrote: > > > On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: > > > > On 06/11/18 19:40, Trent Piepho wrote: > > > > > > > > > > What about stable kernels that don't have the hierarchical API? > > > > > > > > My goal is to fix mainline first. Once we have something that works on > > > > mainline, we can look at propagating the fix to other versions. But > > > > mainline always comes first. > > > > > > This is a regression that went into 4.14. Wouldn't the appropriate > > > action for the stable series be to undo the regression? > > > > This is not how stable works. Stable kernels *only* contain patches that > > are backported from mainline, and do not take standalone patch. > > > > Furthermore, your fix is to actually undo someone else's fix. Who is > > right? In the absence of any documentation, the answer is "nobody". > > Little more history to this bug. The code was originally the way it is > now, but this same bug was fixed in 2013 in https://patchwork.kernel.or > g/patch/3333681/ > > Then that lasted four years until it was changed Aug 2017 in https://pa > tchwork.kernel.org/patch/9893303/ > > That lasted just six months until someone tried to revert it, https://p > atchwork.kernel.org/patch/9893303/ The last link is the same as the previous one, unless I am missing something. > Seems pretty clear the way it is now is much worse than the way it was > before, even if the previous design may have had another flaw. Though > I've yet to see anyone point out something makes the previous design > broken. Sub-optimal yes, but not broken. The way I see it is: either the MSI handling works or it does not. AFAICS: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before") was fixing a bug, causing "timeouts on some wireless lan cards", we want to understand what the problem is, fix it once for all on all DWC based systems. > > Anything can be backported to stable once we understand the issue. At > > the moment, we're just playing games moving stuff around and hope > > nothing else will break. That's not a sustainable way of maintaining > > this driver. At the moment, the only patch I'm inclined to propose until > > we get an actual interrupt handling flow from Synopsys is to mark this > > driver as "BROKEN". > > It feels like you're using this bug to hold designware hostage in a > broken kernel, and me along with them. I don't have the documentation, > no one does, there's no way for me to give you want you want. But I've > got hardware that doesn't work in the mainline kernel. Nobody is holding anyone hostage here, it is a pretty normal patch discussion, given the controversial history of fixes you reported we are just trying to get the whole picture. There is a bug that ought to be fixed, you are doing the right thing with the feedback you are providing and DWC maintainers must provide the information you need to get to the bottom of this, once for all, that's as simple as that. Thanks, Lorenzo
On 08/11/18 19:49, Trent Piepho wrote: > On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: >> On 07/11/18 20:17, Trent Piepho wrote: >>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: >>>> On 06/11/18 19:40, Trent Piepho wrote: >>>>> >>>>> What about stable kernels that don't have the hierarchical API? >>>> >>>> My goal is to fix mainline first. Once we have something that works on >>>> mainline, we can look at propagating the fix to other versions. But >>>> mainline always comes first. >>> >>> This is a regression that went into 4.14. Wouldn't the appropriate >>> action for the stable series be to undo the regression? >> >> This is not how stable works. Stable kernels *only* contain patches that >> are backported from mainline, and do not take standalone patch. >> >> Furthermore, your fix is to actually undo someone else's fix. Who is >> right? In the absence of any documentation, the answer is "nobody". > > Little more history to this bug. The code was originally the way it is > now, but this same bug was fixed in 2013 in https://patchwork.kernel.or > g/patch/3333681/ > > Then that lasted four years until it was changed Aug 2017 in https://pa > tchwork.kernel.org/patch/9893303/ > > That lasted just six months until someone tried to revert it, https://p > atchwork.kernel.org/patch/9893303/ > > Seems pretty clear the way it is now is much worse than the way it was > before, even if the previous design may have had another flaw. Though > I've yet to see anyone point out something makes the previous design > broken. Sub-optimal yes, but not broken. This is not what was reported by the previous submitter. I guess they changed this for a reason, no? I'm prepared to admit this is a end-point driver bug, but we need to understand what is happening and stop changing this driver randomly. >> Anything can be backported to stable once we understand the issue. At >> the moment, we're just playing games moving stuff around and hope >> nothing else will break. That's not a sustainable way of maintaining >> this driver. At the moment, the only patch I'm inclined to propose until >> we get an actual interrupt handling flow from Synopsys is to mark this >> driver as "BROKEN". > > It feels like you're using this bug to hold designware hostage in a > broken kernel, and me along with them. I don't have the documentation, > no one does, there's no way for me to give you want you want. But I've > got hardware that doesn't work in the mainline kernel. I take it as a personal offence that I'd be holding anything or anyone hostage. I think I have a long enough track record working with the Linux kernel not to take any of this nonsense. What's my interest in keeping anything in this sorry state? Think about it for a minute. When I'm asking for documentation, I'm not asking you. I perfectly understood that you don't have any. We need Synopsys to step up and give us a simple description of what the MSI workflow is. Because no matter how you randomly mutate this code, it will still be broken until we get a clear answer from *Synopsys*. Gustavo, here's one simple ask. Please reply to this email with a step by step, register by register description of how an MSI must be handled on this HW. We do need it, and we need it now. Thanks, M.
On 09/11/18 3:43 PM, Lorenzo Pieralisi wrote: > On Thu, Nov 08, 2018 at 07:49:52PM +0000, Trent Piepho wrote: >> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: >>> On 07/11/18 20:17, Trent Piepho wrote: >>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: >>>>> On 06/11/18 19:40, Trent Piepho wrote: >>>>>> >>>>>> What about stable kernels that don't have the hierarchical API? >>>>> >>>>> My goal is to fix mainline first. Once we have something that works on >>>>> mainline, we can look at propagating the fix to other versions. But >>>>> mainline always comes first. >>>> >>>> This is a regression that went into 4.14. Wouldn't the appropriate >>>> action for the stable series be to undo the regression? >>> >>> This is not how stable works. Stable kernels *only* contain patches that >>> are backported from mainline, and do not take standalone patch. >>> >>> Furthermore, your fix is to actually undo someone else's fix. Who is >>> right? In the absence of any documentation, the answer is "nobody". >> >> Little more history to this bug. The code was originally the way it is >> now, but this same bug was fixed in 2013 in https://patchwork.kernel.or >> g/patch/3333681/ >> >> Then that lasted four years until it was changed Aug 2017 in https://pa >> tchwork.kernel.org/patch/9893303/ >> >> That lasted just six months until someone tried to revert it, https://p >> atchwork.kernel.org/patch/9893303/ > > The last link is the same as the previous one, unless I am missing > something. > >> Seems pretty clear the way it is now is much worse than the way it was >> before, even if the previous design may have had another flaw. Though >> I've yet to see anyone point out something makes the previous design >> broken. Sub-optimal yes, but not broken. > > The way I see it is: either the MSI handling works or it does not. > > AFAICS: > > 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, > not before") > > was fixing a bug, causing "timeouts on some wireless lan cards", we want > to understand what the problem is, fix it once for all on all DWC > based systems. > That issue was root caused to be due to a HW errata in dra7xx DWC wrapper which requires a special way of handling MSI interrupts at wrapper level. More info in this thread: https://www.spinics.net/lists/linux-pci/msg70462.html Unfortunately, commit 8c934095fa2f did not fix WLAN issue in longer tests and also broke PCIe USB cards. Therefore, it makes sense to revert 8c934095fa2f I am working on patches fix dra7xx wrapper for WLAN card issue. Regards Vignesh >>> Anything can be backported to stable once we understand the issue. At >>> the moment, we're just playing games moving stuff around and hope >>> nothing else will break. That's not a sustainable way of maintaining >>> this driver. At the moment, the only patch I'm inclined to propose until >>> we get an actual interrupt handling flow from Synopsys is to mark this >>> driver as "BROKEN". >> >> It feels like you're using this bug to hold designware hostage in a >> broken kernel, and me along with them. I don't have the documentation, >> no one does, there's no way for me to give you want you want. But I've >> got hardware that doesn't work in the mainline kernel. > > Nobody is holding anyone hostage here, it is a pretty normal patch > discussion, given the controversial history of fixes you reported > we are just trying to get the whole picture. 
> > There is a bug that ought to be fixed, you are doing the right thing > with the feedback you are providing and DWC maintainers must provide the > information you need to get to the bottom of this, once for all, that's > as simple as that. > > Thanks, > Lorenzo >
On Fri, 2018-11-09 at 11:34 +0000, Marc Zyngier wrote: > On 08/11/18 19:49, Trent Piepho wrote: > > On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: > > > > > Then that lasted four years until it was changed Aug 2017 in https://pa > > tchwork.kernel.org/patch/9893303/ > > > > That lasted just six months until someone tried to revert it, https://p > > atchwork.kernel.org/patch/9893303/ Should be https://patchwork.kernel.org/patch/10208879/ > > > > Seems pretty clear the way it is now is much worse than the way it was > > before, even if the previous design may have had another flaw. Though > > I've yet to see anyone point out something makes the previous design > > broken. Sub-optimal yes, but not broken. > > This is not what was reported by the previous submitter. I guess they > changed this for a reason, no? I'm prepared to admit this is a end-point > driver bug, but we need to understand what is happening and stop > changing this driver randomly. See Vignesh's recent message about the last change. It was a mistaken attempt to fix a problem, which it didn't fix, and I think we all agree it's not right. Reverting it is not "changing this driver randomly." And I take that as a personal offense. You imply I just applied patches randomly until something appeared to work. Maybe you think this is all over my head? Far from it. I traced every part of the interrupt path and thought through every race. Hamstrung by lack of docs, I still determined the behavior that was relevant through empirical analysis. I only discovered this was a recent regression and Vignesh's earlier attempt to revert it after I was done and was trying to determine how the code got this way in the first place. > > It feels like you're using this bug to hold designware hostage in a > > broken kernel, and me along with them. I don't have the documentation, > > no one does, there's no way for me to give you want you want. But I've > > got hardware that doesn't work in the mainline kernel. > > I take it as a personal offence that I'd be holding anything or anyone > hostage. I think I have a long enough track record working with the > Linux kernel not to take any of this nonsense. What's my interest in > keeping anything in this sorry state? Think about it for a minute. I'm sorry if you took it that way. I appreciate that there are still people who care about fixing things right and don't settle for whatever the easiest thing is that lets them say they're done, even if that just leaves time bombs for everyone who comes after. So I thank you for taking a stand. But I think it's clear that 8c934095fa2f was a mistake that causes serious bugs. That's not a random guess; it's well considered and well tested. Not reverting it now isn't helping anyone using stable kernels with this regression.
On Thu, Nov 08, 2018 at 08:51:38PM +0000, Trent Piepho wrote: > On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote: > > On 07/11/2018 18:32, Trent Piepho wrote: > > > On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote: > > > > On 06/11/2018 16:00, Marc Zyngier wrote: > > > > > On 06/11/18 14:53, Lorenzo Pieralisi wrote: > > > > > > On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: > > > > > > > > > > > > > > This gives the following race scenario: > > > > > > > > > > > > > > 1. An MSI is received by, and the status bit for the MSI is set in, the > > > > > > > DWC PCI-e controller. > > > > > > > 2. dw_handle_msi_irq() calls a driver's registered interrupt handler > > > > > > > for the MSI received. > > > > > > > 3. At some point, the interrupt handler must decide, correctly, that > > > > > > > there is no more work to do and return. > > > > > > > 4. The hardware generates a new MSI. As the MSI's status bit is still > > > > > > > set, this new MSI is ignored. > > > > > > > 6. dw_handle_msi_irq() unsets the MSI status bit. > > > > > > > > > > > > > > The MSI received at point 4 will never be acted upon. It occurred after > > > > > > > the driver had finished checking the hardware status for interrupt > > > > > > > conditions to act on. Since the MSI status was masked, it does not > > > > > > > generated a new IRQ, neither when it was received nor when the MSI is > > > > > > > unmasked. > > > > > > > > > > > This status register indicates whether exists or not a MSI interrupt on that > > > > controller [0..7] to be handle. > > > > > > While the status for an MSI is set, no new interrupt will be triggered > > > > Yes > > > > > if another identical MSI is received, correct? > > > > You cannot receive another identical MSI till you acknowledge the current one > > (This is ensured by the PCI protocol, I guess). > > I don't believe this is a requirement of PCI. We designed our hardware > to not send another MSI until our hardware's interrupt status register > is read, but we didn't have to do that. > > > > > In theory, we should clear the interrupt flag only after the interrupt has > > > > actually handled (which can take some time to process on the worst case scenario). > > > > > > But see above, there is a race if a new MSI arrives while still masked. > > > I can see no possible way to solve this in software that does not > > > involve unmasking the MSI before calling the handler. To leave the > > > interrupt masked while calling the handler requires the hardware to > > > queue an interrupt that arrives while masked. We have no docs, but the > > > designware controller doesn't appear to do this in practice. > > > > See my reply to Marc about the interrupt masking. Like you said, probably the > > solution pass through using interrupt mask/unmask register instead of interrupt > > enable/disable register. > > > > Can you do a quick test, since you can easily reproduce the issue? Can you > > change register offset on both functions dw_pci_bottom_mask() and > > dw_pci_bottom_unmask()? > > > > Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK. > > Of course MSI still need to be enabled to work at all, which happens > once when the driver using the MSI registers a handler. Masking can be > done via mask register after that. 
I want to understand from Synopsys maintainers what's the difference between the ENABLE and MASK registers and *why* the MSI IRQ masking is currently implemented through the PCIE_MSI_INTR0_ENABLE register instead of PCIE_MSI_INTR0_MASK. I am not happy to test and see what's the difference I want to know what's the difference as-per specifications. I need a definitive answer on this based on HW specifications before merging any code. > It is not so easy for me to test on the newest kernel, as imx7d does > not work due to yet more bugs. I have to port a set of patches to each > new kernel. A set that does not shrink due to holdups like this. > > I understand the new flow would look like this (assume dw controller > MSI interrupt output signal is connected to one of the ARM GIC > interrupt lines, there could be different or more controllers above the > dwc of course (but usually aren't)): > > 1. MSI arrives, status bit is set in dwc, interrupt raised to GIC. > 2. dwc handler runs > 3. dwc handler sees status bit is set for a(n) MSI(s) > 4. dwc handler sets mask for those MSIs > 5. dwc handler clears status bit > 6. dwc handler runs driver handler for the received MSI > 7. ** an new MSI arrives, racing with 6 ** > 8. status bit becomes set again, but no interrupt is raised due to mask > 9. dwc handler unmasks MSI, which raises the interrupt to GIC because > of new MSI received in 7. > 10. The original GIC interrupt is EOI'ed. > 11. The interrupt for the dwc is re-raised by the GIC due to 9, and we > go back to 2. > > It is very important that 5 be done before 6. Less so 4 before 5, but > reversing the order there would allow re-raising even if the 2nd MSI > arrived before the driver handler ran, which is not necessary. > > I do not see a race in this design and it appears correct to me. But, > I also do not think there is any immediate improvement due to extra > steps of masking and unmasking the MSI. > > The reason is that the GIC interrupt above the dwc is non-reentrant. > It remains masked (aka active[1]) during this entire process (1 to 10). > This means every MSI is effectively already masked. So masking the > active MSI(s) a 2nd time gains nothing besides preventing some extra > edges for a masked interrupt going to the ARM GIC. > > In theory, if the GIC interrupt handler was reentrant, then on receipt > of a new MSI we could re-enter the dwc handler on a different CPU and > run the new MSI (a different MSI!) at the same time as the original MSI > handler is still running. > > There difference here is that by unmasking in the interrupt in the GIC > before the dwc handler is finished, masking an individual MSI in the > dwc is no longer a 2nd redundant masking. There is not (and there must not be) anything in DWC HW that allows us assume its "interrupt controller" is cascaded to a GIC. It may correspond to what we are debugging but it is an assumption we must not make and its driver has to work regardless of what its interrupt controller is connected to, that's the basis of hierarchical IRQ chips management, at least from my narrow and (most likely) incomplete knowledge of the IRQ subsystem. So, this means that both the masking AND your revert must be merged in the mainline to make sure we are fixing MSI handling for good, that's how I sum up your report. 
Pending my request above, I am inclined to merge your revert, but I would appreciate testing (tags) from all DWC host bridge maintainers (I am happy to delay it till -rc3 or -rc4 to that extent). Do not take this as a lack of trust in your testing; it is just that the last thing we want is to have to revert the revert, given that it is going to stable kernels, and most of all we have to make sure this works for all DWC derived host bridges. I *also* expect the patch implementing the MSI masking, and I think that falls within Gustavo's remit. Thanks, Lorenzo
On 08/11/2018 20:51, Trent Piepho wrote: > On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote: >> On 07/11/2018 18:32, Trent Piepho wrote: >>> On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote: >>>> On 06/11/2018 16:00, Marc Zyngier wrote: >>>>> On 06/11/18 14:53, Lorenzo Pieralisi wrote: >>>>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: >>>>>>> >>>>>>> This gives the following race scenario: >>>>>>> >>>>>>> 1. An MSI is received by, and the status bit for the MSI is set in, the >>>>>>> DWC PCI-e controller. >>>>>>> 2. dw_handle_msi_irq() calls a driver's registered interrupt handler >>>>>>> for the MSI received. >>>>>>> 3. At some point, the interrupt handler must decide, correctly, that >>>>>>> there is no more work to do and return. >>>>>>> 4. The hardware generates a new MSI. As the MSI's status bit is still >>>>>>> set, this new MSI is ignored. >>>>>>> 6. dw_handle_msi_irq() unsets the MSI status bit. >>>>>>> >>>>>>> The MSI received at point 4 will never be acted upon. It occurred after >>>>>>> the driver had finished checking the hardware status for interrupt >>>>>>> conditions to act on. Since the MSI status was masked, it does not >>>>>>> generated a new IRQ, neither when it was received nor when the MSI is >>>>>>> unmasked. >>>>>>> >>>> This status register indicates whether exists or not a MSI interrupt on that >>>> controller [0..7] to be handle. >>> >>> While the status for an MSI is set, no new interrupt will be triggered >> >> Yes >> >>> if another identical MSI is received, correct? >> >> You cannot receive another identical MSI till you acknowledge the current one >> (This is ensured by the PCI protocol, I guess). > > I don't believe this is a requirement of PCI. We designed our hardware > to not send another MSI until our hardware's interrupt status register > is read, but we didn't have to do that. > >>>> In theory, we should clear the interrupt flag only after the interrupt has >>>> actually handled (which can take some time to process on the worst case scenario). >>> >>> But see above, there is a race if a new MSI arrives while still masked. >>> I can see no possible way to solve this in software that does not >>> involve unmasking the MSI before calling the handler. To leave the >>> interrupt masked while calling the handler requires the hardware to >>> queue an interrupt that arrives while masked. We have no docs, but the >>> designware controller doesn't appear to do this in practice. >> >> See my reply to Marc about the interrupt masking. Like you said, probably the >> solution pass through using interrupt mask/unmask register instead of interrupt >> enable/disable register. >> >> Can you do a quick test, since you can easily reproduce the issue? Can you >> change register offset on both functions dw_pci_bottom_mask() and >> dw_pci_bottom_unmask()? >> >> Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK. > > Of course MSI still need to be enabled to work at all, which happens > once when the driver using the MSI registers a handler. Masking can be > done via mask register after that. > Correct, I was asking to switch only on the functions mentioned that are called after the dw_pcie_setup_rc() that enables the interrupts. > It is not so easy for me to test on the newest kernel, as imx7d does > not work due to yet more bugs. I have to port a set of patches to each > new kernel. A set that does not shrink due to holdups like this. 
Ok, I've to try to replicate this scenario of loss of interruptions so that I can do something about it. Till now this never happen before. > > I understand the new flow would look like this (assume dw controller > MSI interrupt output signal is connected to one of the ARM GIC > interrupt lines, there could be different or more controllers above the > dwc of course (but usually aren't)): > > 1. MSI arrives, status bit is set in dwc, interrupt raised to GIC. > 2. dwc handler runs > 3. dwc handler sees status bit is set for a(n) MSI(s) > 4. dwc handler sets mask for those MSIs > 5. dwc handler clears status bit > 6. dwc handler runs driver handler for the received MSI > 7. ** an new MSI arrives, racing with 6 ** > 8. status bit becomes set again, but no interrupt is raised due to mask > 9. dwc handler unmasks MSI, which raises the interrupt to GIC because > of new MSI received in 7. > 10. The original GIC interrupt is EOI'ed. > 11. The interrupt for the dwc is re-raised by the GIC due to 9, and we > go back to 2. > > It is very important that 5 be done before 6. Less so 4 before 5, but > reversing the order there would allow re-raising even if the 2nd MSI > arrived before the driver handler ran, which is not necessary. > > I do not see a race in this design and it appears correct to me. But, > I also do not think there is any immediate improvement due to extra > steps of masking and unmasking the MSI. > > The reason is that the GIC interrupt above the dwc is non-reentrant. > It remains masked (aka active[1]) during this entire process (1 to 10). > This means every MSI is effectively already masked. So masking the > active MSI(s) a 2nd time gains nothing besides preventing some extra > edges for a masked interrupt going to the ARM GIC. > > In theory, if the GIC interrupt handler was reentrant, then on receipt > of a new MSI we could re-enter the dwc handler on a different CPU and > run the new MSI (a different MSI!) at the same time as the original MSI > handler is still running. > > There difference here is that by unmasking in the interrupt in the GIC > before the dwc handler is finished, masking an individual MSI in the > dwc is no longer a 2nd redundant masking. > > > [1] When I say masked in GIC, I mean the interrupt is in the "active" > or "active and pending" states. In these states the interrupt will not > be raised to the CPU and can be considered masked. >
On 09/11/2018 11:34, Marc Zyngier wrote: > On 08/11/18 19:49, Trent Piepho wrote: >> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: >>> On 07/11/18 20:17, Trent Piepho wrote: >>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: >>>>> On 06/11/18 19:40, Trent Piepho wrote: >>>>>> >>>>>> What about stable kernels that don't have the hierarchical API? >>>>> >>>>> My goal is to fix mainline first. Once we have something that works on >>>>> mainline, we can look at propagating the fix to other versions. But >>>>> mainline always comes first. >>>> >>>> This is a regression that went into 4.14. Wouldn't the appropriate >>>> action for the stable series be to undo the regression? >>> >>> This is not how stable works. Stable kernels *only* contain patches that >>> are backported from mainline, and do not take standalone patch. >>> >>> Furthermore, your fix is to actually undo someone else's fix. Who is >>> right? In the absence of any documentation, the answer is "nobody". >> >> Little more history to this bug. The code was originally the way it is >> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e= >> g/patch/3333681/ >> >> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e= >> tchwork.kernel.org/patch/9893303/ >> >> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e= >> atchwork.kernel.org/patch/9893303/ >> >> Seems pretty clear the way it is now is much worse than the way it was >> before, even if the previous design may have had another flaw. Though >> I've yet to see anyone point out something makes the previous design >> broken. Sub-optimal yes, but not broken. > > This is not what was reported by the previous submitter. I guess they > changed this for a reason, no? I'm prepared to admit this is a end-point > driver bug, but we need to understand what is happening and stop > changing this driver randomly. > >>> Anything can be backported to stable once we understand the issue. At >>> the moment, we're just playing games moving stuff around and hope >>> nothing else will break. That's not a sustainable way of maintaining >>> this driver. At the moment, the only patch I'm inclined to propose until >>> we get an actual interrupt handling flow from Synopsys is to mark this >>> driver as "BROKEN". >> >> It feels like you're using this bug to hold designware hostage in a >> broken kernel, and me along with them. I don't have the documentation, >> no one does, there's no way for me to give you want you want. But I've >> got hardware that doesn't work in the mainline kernel. > > I take it as a personal offence that I'd be holding anything or anyone > hostage. I think I have a long enough track record working with the > Linux kernel not to take any of this nonsense. What's my interest in > keeping anything in this sorry state? Think about it for a minute. 
> > When I'm asking for documentation, I'm not asking you. I perfectly > understood that you don't have any. We need Synopsys to step up and give > us a simple description of what the MSI workflow is. Because no matter > how you randomly mutate this code, it will still be broken until we get > a clear answer from *Synopsys*. > > Gustavo, here's one simple ask. Please reply to this email with a step > by step, register by register description of how an MSI must be handled > on this HW. We do need it, and we need it now.

Hi Marc, I thought I had responded to your question on Nov 8. However, I will repeat it and add some extra information now.

About the registers:

PCIE_MSI_INTR0_MASK
When an MSI is received for a masked interrupt, the corresponding status bit gets set in the interrupt status register but the controller will not signal it. As soon as the masked interrupt is unmasked, and assuming the status bit is still set, the controller will signal it.

PCIE_MSI_INTR0_ENABLE
Enables/disables each interrupt. When an MSI is received for a disabled interrupt, no status bit gets set in the MSI controller interrupt status register.

About these new callbacks:

The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added in Linux kernel v4.17 by the "PCI: dwc: Move MSI IRQs allocation to IRQ domains hierarchical API" patch [1], and based on what I've seen so far, no interrupt masking was being performed before this patch.

Based on the information provided, it's very likely that I should use the PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE in the dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return from the Linux Plumbers conference, I will test the system using the PCIE_MSI_INTR0_MASK register.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=7c5925afbc58c6d6b384e1dc051bb992969bf787

Is there anything else that has not been clarified yet?

Regards, Gustavo

> > Thanks, > > M. >
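For reference, the three per-controller registers discussed here are already named in the mainline driver. The offsets below are the mainline defines; the comments restate the semantics given above, with the write-one-to-clear behaviour of the status register being how the driver acknowledges a vector.

/* drivers/pci/controller/dwc/pcie-designware.h (first controller block) */
#define PCIE_MSI_INTR0_ENABLE	0x828	/* bit clear: vector disabled; an
					 * incoming MSI latches no status
					 * bit and is simply dropped */
#define PCIE_MSI_INTR0_MASK	0x82C	/* bit set: vector masked; status
					 * still latches and is signalled
					 * once the bit is cleared again */
#define PCIE_MSI_INTR0_STATUS	0x830	/* write 1 to a bit to clear it */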
On 12/11/2018 16:01, Lorenzo Pieralisi wrote: > On Thu, Nov 08, 2018 at 08:51:38PM +0000, Trent Piepho wrote: >> On Thu, 2018-11-08 at 11:46 +0000, Gustavo Pimentel wrote: >>> On 07/11/2018 18:32, Trent Piepho wrote: >>>> On Wed, 2018-11-07 at 12:57 +0000, Gustavo Pimentel wrote: >>>>> On 06/11/2018 16:00, Marc Zyngier wrote: >>>>>> On 06/11/18 14:53, Lorenzo Pieralisi wrote: >>>>>>> On Sat, Oct 27, 2018 at 12:00:57AM +0000, Trent Piepho wrote: >>>>>>>> >>>>>>>> This gives the following race scenario: >>>>>>>> >>>>>>>> 1. An MSI is received by, and the status bit for the MSI is set in, the >>>>>>>> DWC PCI-e controller. >>>>>>>> 2. dw_handle_msi_irq() calls a driver's registered interrupt handler >>>>>>>> for the MSI received. >>>>>>>> 3. At some point, the interrupt handler must decide, correctly, that >>>>>>>> there is no more work to do and return. >>>>>>>> 4. The hardware generates a new MSI. As the MSI's status bit is still >>>>>>>> set, this new MSI is ignored. >>>>>>>> 6. dw_handle_msi_irq() unsets the MSI status bit. >>>>>>>> >>>>>>>> The MSI received at point 4 will never be acted upon. It occurred after >>>>>>>> the driver had finished checking the hardware status for interrupt >>>>>>>> conditions to act on. Since the MSI status was masked, it does not >>>>>>>> generated a new IRQ, neither when it was received nor when the MSI is >>>>>>>> unmasked. >>>>>>>> >>>>> This status register indicates whether exists or not a MSI interrupt on that >>>>> controller [0..7] to be handle. >>>> >>>> While the status for an MSI is set, no new interrupt will be triggered >>> >>> Yes >>> >>>> if another identical MSI is received, correct? >>> >>> You cannot receive another identical MSI till you acknowledge the current one >>> (This is ensured by the PCI protocol, I guess). >> >> I don't believe this is a requirement of PCI. We designed our hardware >> to not send another MSI until our hardware's interrupt status register >> is read, but we didn't have to do that. >> >>>>> In theory, we should clear the interrupt flag only after the interrupt has >>>>> actually handled (which can take some time to process on the worst case scenario). >>>> >>>> But see above, there is a race if a new MSI arrives while still masked. >>>> I can see no possible way to solve this in software that does not >>>> involve unmasking the MSI before calling the handler. To leave the >>>> interrupt masked while calling the handler requires the hardware to >>>> queue an interrupt that arrives while masked. We have no docs, but the >>>> designware controller doesn't appear to do this in practice. >>> >>> See my reply to Marc about the interrupt masking. Like you said, probably the >>> solution pass through using interrupt mask/unmask register instead of interrupt >>> enable/disable register. >>> >>> Can you do a quick test, since you can easily reproduce the issue? Can you >>> change register offset on both functions dw_pci_bottom_mask() and >>> dw_pci_bottom_unmask()? >>> >>> Basically exchange the PCIE_MSI_INTR0_ENABLE register by PCIE_MSI_INTR0_MASK. >> >> Of course MSI still need to be enabled to work at all, which happens >> once when the driver using the MSI registers a handler. Masking can be >> done via mask register after that. > > I want to understand from Synopsys maintainers what's the difference > between the ENABLE and MASK registers and *why* the MSI IRQ masking is > currently implemented through the PCIE_MSI_INTR0_ENABLE register instead > of PCIE_MSI_INTR0_MASK. 
I've already reply to Marc with that information twice. > > I am not happy to test and see what's the difference I want to know > what's the difference as-per specifications. Of course. I only asked if it was possible to test the register change, because the interruption loss (I got the impression of that) was deterministic on that platform. Based on the information provided to Marc, the correct register should be PCIE_MSI_INTR0_MASK instead of PCIE_MSI_INTR0_ENABLE, however I must replicate this "interruption loss" (I'm still thinking how...) and test any possible fix before doing anything else. > > I need a definitive answer on this based on HW specifications before > merging any code. > >> It is not so easy for me to test on the newest kernel, as imx7d does >> not work due to yet more bugs. I have to port a set of patches to each >> new kernel. A set that does not shrink due to holdups like this. >> >> I understand the new flow would look like this (assume dw controller >> MSI interrupt output signal is connected to one of the ARM GIC >> interrupt lines, there could be different or more controllers above the >> dwc of course (but usually aren't)): >> >> 1. MSI arrives, status bit is set in dwc, interrupt raised to GIC. >> 2. dwc handler runs >> 3. dwc handler sees status bit is set for a(n) MSI(s) >> 4. dwc handler sets mask for those MSIs >> 5. dwc handler clears status bit >> 6. dwc handler runs driver handler for the received MSI >> 7. ** an new MSI arrives, racing with 6 ** >> 8. status bit becomes set again, but no interrupt is raised due to mask >> 9. dwc handler unmasks MSI, which raises the interrupt to GIC because >> of new MSI received in 7. >> 10. The original GIC interrupt is EOI'ed. >> 11. The interrupt for the dwc is re-raised by the GIC due to 9, and we >> go back to 2. >> >> It is very important that 5 be done before 6. Less so 4 before 5, but >> reversing the order there would allow re-raising even if the 2nd MSI >> arrived before the driver handler ran, which is not necessary. >> >> I do not see a race in this design and it appears correct to me. But, >> I also do not think there is any immediate improvement due to extra >> steps of masking and unmasking the MSI. >> >> The reason is that the GIC interrupt above the dwc is non-reentrant. >> It remains masked (aka active[1]) during this entire process (1 to 10). >> This means every MSI is effectively already masked. So masking the >> active MSI(s) a 2nd time gains nothing besides preventing some extra >> edges for a masked interrupt going to the ARM GIC. >> >> In theory, if the GIC interrupt handler was reentrant, then on receipt >> of a new MSI we could re-enter the dwc handler on a different CPU and >> run the new MSI (a different MSI!) at the same time as the original MSI >> handler is still running. >> >> There difference here is that by unmasking in the interrupt in the GIC >> before the dwc handler is finished, masking an individual MSI in the >> dwc is no longer a 2nd redundant masking. > > There is not (and there must not be) anything in DWC HW that allows > us assume its "interrupt controller" is cascaded to a GIC. It may > correspond to what we are debugging but it is an assumption we must not > make and its driver has to work regardless of what its interrupt > controller is connected to, that's the basis of hierarchical IRQ chips > management, at least from my narrow and (most likely) incomplete > knowledge of the IRQ subsystem. 
> > So, this means that both the masking AND your revert must be > merged in the mainline to make sure we are fixing MSI handling > for good, that's how I sum up your report. > > Pending my request above, I am inclined to merge your revert but I would > appreciate testing (tags) from all DWC host bridge maintainers (I am > happy to delay it till -rc3 or -rc4 to that extent), do not take this as > a lack of trust in your testing, it is just that the last thing we want > is asking for reverting the revert given that it is going to stable > kernels and most of all we have to make sure this works for all DWC > derived host bridges. Agreed. Any change here has a big impact on all DWC hosts and should be tested by all. > > I *also* expect the patch implementing the MSI masking and I think that > falls within Gustavo's remit. Yes. Regards, Gustavo > > Thanks, > Lorenzo >
On Tue, 2018-11-13 at 00:41 +0000, Gustavo Pimentel wrote: > On 09/11/2018 11:34, Marc Zyngier wrote: > > > > Gustavo, here's one simple ask. Please reply to this email with a step > > by step, register by register description of how an MSI must be handled > > on this HW. We do need it, and we need it now. > > Hi Marc, I thought that I did respond to your question on Nov 8. However I will > repeat it and add some extra information to it now. > > About the registers: > > PCIE_MSI_INTR0_MASK > When an MSI is received for a masked interrupt, the corresponding status bit > gets set in the interrupt status register but the controller will not signal it. > As soon as the masked interrupt is unmasked and assuming the status bit is still > set the controller will signal it. > > PCIE_MSI_INTR0_ENABLE > Enables/disables every interrupt. When an MSI is received from a disabled > interrupt, no status bit gets set in MSI controller interrupt status register. If the MSI is unmasked and the status bit is still set, *not* from a new MSI but rather from one that arrived before the MSI was masked, does the controller still signal it? I would suspect the answer is no: only a new MSI will cause the interrupt be signaled on unmask. And then only if the status bit has not been cleared before the unmask (as you stated already). But for that to be the case, there must be another bit of interrupt status (i.e., "pending") that we don't know about yet. The bit in the status register alone is not enough. This pending bit would need to be set on a 0->1 transition of the status bit. An MSI arriving while the status bit is already 1 does not trigger a new interrupt. > About this new callbacks: > > The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on > Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains > hierarchical API patch [1] and based on what I've seen so far, before this patch > there were no interrupt masking been performed. Yes. Or rather, the status bit being set effectively masks the interrupt. > Based on the information provided, its very likely that I should use the > PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the > dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return > from Linux plumbers conference, I will test the system using the > PCIE_MSI_INTR0_MASK register. Using enable to mask the interrupt is broken. It will allow an MSI to be lost if it arrives while the MSI in not enabled. It is impossible to prevent that from racing against the pci device driver's final check that no MSI-signaled condition in the hardware is present.
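To illustrate the difference in behaviour Trent describes, here is a small hypothetical helper sketched with the mainline accessors and defines; the function and its vec parameter exist only to mark where the window opens, and this is not proposed driver code.

/* Purely illustrative: contrast mask-by-disable with a real mask. */
static void illustrate_msi_mask_window(struct pcie_port *pp, unsigned int vec)
{
	u32 val;

	/* Masking by clearing the ENABLE bit: */
	dw_pcie_rd_own_conf(pp, PCIE_MSI_INTR0_ENABLE, 4, &val);
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_ENABLE, 4, val & ~BIT(vec));
	/* An MSI for vec arriving here latches no status bit and is lost;
	 * no final "any more work?" check in the endpoint driver can close
	 * this window from software.
	 */
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_ENABLE, 4, val | BIT(vec));

	/* Masking via the MASK register: */
	dw_pcie_rd_own_conf(pp, PCIE_MSI_INTR0_MASK, 4, &val);
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_MASK, 4, val | BIT(vec));
	/* An MSI for vec arriving here still sets PCIE_MSI_INTR0_STATUS;
	 * it is only deferred and is signalled once the vector is unmasked.
	 */
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_MASK, 4, val & ~BIT(vec));
}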
On Tue, Nov 13, 2018 at 01:18:41AM +0000, Trent Piepho wrote: [...] > > Based on the information provided, its very likely that I should use the > > PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the > > dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return > > from Linux plumbers conference, I will test the system using the > > PCIE_MSI_INTR0_MASK register. > > Using enable to mask the interrupt is broken. It will allow an MSI to > be lost if it arrives while the MSI in not enabled. It is impossible > to prevent that from racing against the pci device driver's final check > that no MSI-signaled condition in the hardware is present. That's exactly why I raised the point. I would like to understand if this means that this revert is not enough to fix the underlying issue or we need this revert AND correct mask/unmask operations. Thanks, Lorenzo
On Tue, 13 Nov 2018 00:41:24 +0000, Gustavo Pimentel <gustavo.pimentel@synopsys.com> wrote: > > On 09/11/2018 11:34, Marc Zyngier wrote: > > On 08/11/18 19:49, Trent Piepho wrote: > >> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: > >>> On 07/11/18 20:17, Trent Piepho wrote: > >>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: > >>>>> On 06/11/18 19:40, Trent Piepho wrote: > >>>>>> > >>>>>> What about stable kernels that don't have the hierarchical API? > >>>>> > >>>>> My goal is to fix mainline first. Once we have something that works on > >>>>> mainline, we can look at propagating the fix to other versions. But > >>>>> mainline always comes first. > >>>> > >>>> This is a regression that went into 4.14. Wouldn't the appropriate > >>>> action for the stable series be to undo the regression? > >>> > >>> This is not how stable works. Stable kernels *only* contain patches that > >>> are backported from mainline, and do not take standalone patch. > >>> > >>> Furthermore, your fix is to actually undo someone else's fix. Who is > >>> right? In the absence of any documentation, the answer is "nobody". > >> > >> Little more history to this bug. The code was originally the way it is > >> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e= > >> g/patch/3333681/ > >> > >> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e= > >> tchwork.kernel.org/patch/9893303/ > >> > >> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e= > >> atchwork.kernel.org/patch/9893303/ > >> > >> Seems pretty clear the way it is now is much worse than the way it was > >> before, even if the previous design may have had another flaw. Though > >> I've yet to see anyone point out something makes the previous design > >> broken. Sub-optimal yes, but not broken. > > > > This is not what was reported by the previous submitter. I guess they > > changed this for a reason, no? I'm prepared to admit this is a end-point > > driver bug, but we need to understand what is happening and stop > > changing this driver randomly. > > > >>> Anything can be backported to stable once we understand the issue. At > >>> the moment, we're just playing games moving stuff around and hope > >>> nothing else will break. That's not a sustainable way of maintaining > >>> this driver. At the moment, the only patch I'm inclined to propose until > >>> we get an actual interrupt handling flow from Synopsys is to mark this > >>> driver as "BROKEN". > >> > >> It feels like you're using this bug to hold designware hostage in a > >> broken kernel, and me along with them. I don't have the documentation, > >> no one does, there's no way for me to give you want you want. But I've > >> got hardware that doesn't work in the mainline kernel. > > > > I take it as a personal offence that I'd be holding anything or anyone > > hostage. 
> > I think I have a long enough track record working with the Linux kernel
> > not to take any of this nonsense. What's my interest in keeping anything
> > in this sorry state? Think about it for a minute.
> >
> > When I'm asking for documentation, I'm not asking you. I perfectly
> > understood that you don't have any. We need Synopsys to step up and give
> > us a simple description of what the MSI workflow is. Because no matter
> > how you randomly mutate this code, it will still be broken until we get
> > a clear answer from *Synopsys*.
> >
> > Gustavo, here's one simple ask. Please reply to this email with a
> > step-by-step, register-by-register description of how an MSI must be
> > handled on this HW. We do need it, and we need it now.
>
> Hi Marc, I thought that I did respond to your question on Nov 8. However,
> I will repeat it and add some extra information to it now.
>
> About the registers:
>
> PCIE_MSI_INTR0_MASK
> When an MSI is received for a masked interrupt, the corresponding status
> bit gets set in the interrupt status register but the controller will not
> signal it. As soon as the masked interrupt is unmasked and assuming the
> status bit is still set, the controller will signal it.

So this is *really* a mask.

> PCIE_MSI_INTR0_ENABLE
> Enables/disables every interrupt. When an MSI is received from a disabled
> interrupt, no status bit gets set in the MSI controller interrupt status
> register.

And that's not a mask at all. This allows the interrupt to be simply
lost. It is just a bit worrying that this is exactly what the DWC driver
is using, and it faces all kinds of interrupt loss when enabling/disabling
interrupts (which do have a mask/unmask semantic -- yes, the language is
confusing).

>
> About these new callbacks:
>
> The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added
> in Linux kernel v4.17 by the "PCI: dwc: Move MSI IRQs allocation to IRQ
> domains hierarchical API" patch [1] and, based on what I've seen so far,
> before this patch no interrupt masking was being performed.
>
> Based on the information provided, it's very likely that I should use the
> PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the
> dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I
> return from the Linux Plumbers Conference, I will test the system using
> the PCIE_MSI_INTR0_MASK register.

I'm at Plumbers as well, so feel free to grab me and we'll sort this out.
We've wasted enough time on this broken driver.

Thanks,
M.
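Based on Gustavo's register description above, mask/unmask callbacks driven
by PCIE_MSI_INTR0_MASK might look roughly like the sketch below. This is
only an illustration of the idea, not the in-tree implementation: the
pp->irq_mask[] cache and the pp->lock spinlock are assumed helpers, and the
register writes reuse dw_pcie_wr_own_conf() simply because that is the
accessor visible in the patch further down.

/*
 * Sketch only: dw_pci_bottom_mask()/dw_pci_bottom_unmask() driving
 * PCIE_MSI_INTR0_MASK rather than PCIE_MSI_INTR0_ENABLE.  irq_mask[]
 * and pp->lock are assumed fields, not necessarily in the driver today.
 */
static void dw_pci_bottom_mask(struct irq_data *d)
{
	struct pcie_port *pp = irq_data_get_irq_chip_data(d);
	unsigned int res, bit, ctrl;
	unsigned long flags;

	raw_spin_lock_irqsave(&pp->lock, flags);

	ctrl = d->hwirq / MAX_MSI_IRQS_PER_CTRL;
	res = ctrl * MSI_REG_CTRL_BLOCK_SIZE;
	bit = d->hwirq % MAX_MSI_IRQS_PER_CTRL;

	/* Setting the bit masks the vector; its status bit still latches. */
	pp->irq_mask[ctrl] |= BIT(bit);
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_MASK + res, 4,
			    pp->irq_mask[ctrl]);

	raw_spin_unlock_irqrestore(&pp->lock, flags);
}

static void dw_pci_bottom_unmask(struct irq_data *d)
{
	struct pcie_port *pp = irq_data_get_irq_chip_data(d);
	unsigned int res, bit, ctrl;
	unsigned long flags;

	raw_spin_lock_irqsave(&pp->lock, flags);

	ctrl = d->hwirq / MAX_MSI_IRQS_PER_CTRL;
	res = ctrl * MSI_REG_CTRL_BLOCK_SIZE;
	bit = d->hwirq % MAX_MSI_IRQS_PER_CTRL;

	/* Clearing the bit lets a latched, still-set status bit signal. */
	pp->irq_mask[ctrl] &= ~BIT(bit);
	dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_MASK + res, 4,
			    pp->irq_mask[ctrl]);

	raw_spin_unlock_irqrestore(&pp->lock, flags);
}

The difference from ENABLE-based masking is exactly the one described above:
a masked vector still latches its status bit, so an MSI that arrives while
masked is signalled on unmask instead of being silently dropped.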
On Tue, 2018-11-13 at 10:36 +0000, Lorenzo Pieralisi wrote:
> On Tue, Nov 13, 2018 at 01:18:41AM +0000, Trent Piepho wrote:
>
> [...]
>
> > > Based on the information provided, it's very likely that I should use
> > > the PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on
> > > the dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon
> > > as I return from the Linux Plumbers Conference, I will test the system
> > > using the PCIE_MSI_INTR0_MASK register.
> >
> > Using enable to mask the interrupt is broken. It will allow an MSI to
> > be lost if it arrives while the MSI is not enabled. It is impossible
> > to prevent that from racing against the PCI device driver's final check
> > that no MSI-signaled condition in the hardware is present.
>
> That's exactly why I raised the point. I would like to understand if
> this means that this revert is not enough to fix the underlying issue
> or we need this revert AND correct mask/unmask operations.

Looks like _another_ bug added to the driver, this time in 4.17.
On Mon, 2018-11-12 at 16:01 +0000, Lorenzo Pieralisi wrote:
> On Thu, Nov 08, 2018 at 08:51:38PM +0000, Trent Piepho wrote:
> >
> > The reason is that the GIC interrupt above the dwc is non-reentrant.
> > It remains masked (aka active[1]) during this entire process (1 to 10).
> > This means every MSI is effectively already masked. So masking the
> > active MSI(s) a 2nd time gains nothing besides preventing some extra
> > edges for a masked interrupt going to the ARM GIC.
> >
> > In theory, if the GIC interrupt handler was reentrant, then on receipt
> > of a new MSI we could re-enter the dwc handler on a different CPU and
> > run the new MSI (a different MSI!) at the same time as the original MSI
> > handler is still running.
> >
> > The difference here is that by unmasking the interrupt in the GIC
> > before the dwc handler is finished, masking an individual MSI in the
> > dwc is no longer a 2nd redundant masking.
>
> There is not (and there must not be) anything in DWC HW that allows
> us to assume its "interrupt controller" is cascaded to a GIC. It may
> correspond to what we are debugging but it is an assumption we must not
> make, and its driver has to work regardless of what its interrupt
> controller is connected to; that's the basis of hierarchical IRQ chips
> management, at least from my narrow and (most likely) incomplete
> knowledge of the IRQ subsystem.

If you mean we can't assume the dwc is cascaded from specifically a GIC,
totally agree. If you mean we can't assume the dwc is cascaded from
"something", then I disagree. The driver can't possibly be at the top
level. How would the CPU ever enter it?

What I think really matters here is whether the chain the dwc is cascaded
from allows the dwc to be called reentrantly or not. On my system, the
code base I inspected does not allow that. I don't know of any written
Linux interrupt policy that says this must be the case. If it is not the
case, then I see other points in the dwc driver that are potentially racy.
diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
index 29a05759a294..9a3960c95ad3 100644
--- a/drivers/pci/controller/dwc/pcie-designware-host.c
+++ b/drivers/pci/controller/dwc/pcie-designware-host.c
@@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
 			irq = irq_find_mapping(pp->irq_domain,
 					       (i * MAX_MSI_IRQS_PER_CTRL) +
 					       pos);
-			generic_handle_irq(irq);
 			dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
 						(i * MSI_REG_CTRL_BLOCK_SIZE),
 					    4, 1 << pos);
+			generic_handle_irq(irq);
 			pos++;
 		}
 	}
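For context, with this change applied the loop in dw_handle_msi_irq()
reads roughly as below. This is a simplified reconstruction for
illustration, not a verbatim copy of the kernel source:

/*
 * Simplified reconstruction of dw_handle_msi_irq() with the revert
 * applied -- illustration only, not a verbatim copy of the source.
 */
irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
{
	int i, pos, irq;
	u32 val, num_ctrls;
	irqreturn_t ret = IRQ_NONE;

	num_ctrls = pp->num_vectors / MAX_MSI_IRQS_PER_CTRL;

	for (i = 0; i < num_ctrls; i++) {
		dw_pcie_rd_own_conf(pp, PCIE_MSI_INTR0_STATUS +
					(i * MSI_REG_CTRL_BLOCK_SIZE),
				    4, &val);
		if (!val)
			continue;

		ret = IRQ_HANDLED;
		pos = 0;
		while ((pos = find_next_bit((unsigned long *)&val,
					    MAX_MSI_IRQS_PER_CTRL,
					    pos)) != MAX_MSI_IRQS_PER_CTRL) {
			irq = irq_find_mapping(pp->irq_domain,
					       (i * MAX_MSI_IRQS_PER_CTRL) +
					       pos);
			/*
			 * Ack first: any MSI that arrives from here on sets
			 * the status bit again and re-raises the parent
			 * interrupt after this handler returns.
			 */
			dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
						(i * MSI_REG_CTRL_BLOCK_SIZE),
					    4, 1 << pos);
			generic_handle_irq(irq);
			pos++;
		}
	}

	return ret;
}

The important property is that the status bit is already clear by the time
generic_handle_irq() runs, so an MSI raised at any later point re-latches
the bit and re-raises the parent interrupt once this handler returns.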
This reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status
after it is handled, not before").

This is a very real race that we observed quickly after switching from
4.13 to 4.16. Using a custom PCI-e endpoint and driver, I was able to
track it to the precise race and verify the fixed behavior, as will be
described below.

This bug was originally fixed in 2013, in commit ca1658921b63 ("PCI:
designware: Fix missing MSI IRQs"). The discussion of that commit,
archived in patchwork [1], is informative and worth reading.

The bug was re-added by commit 8c934095fa2f, which this patch reverts and
which appeared in the 4.14 kernel.

Unfortunately, Synopsys appears to consider the operation of this PCI-e
controller secret. They provide no publicly available docs for it, nor do
they allow the reference manuals of SoCs using the controller to publish
any documentation of it. So I cannot say for certain that this code uses
the controller's features correctly. However, extensive testing gives me
high confidence in the accuracy of what is described below.

If an MSI is received by the PCI-e controller while the status register
bit for that MSI is set, the PCI-e controller will NOT generate another
interrupt. In addition, it will NOT queue or otherwise mark the interrupt
as "pending", which means it will NOT generate an interrupt when the
status bit is unmasked.

This gives the following race scenario:

1. An MSI is received by, and the status bit for the MSI is set in, the
   DWC PCI-e controller.
2. dw_handle_msi_irq() calls a driver's registered interrupt handler for
   the MSI received.
3. At some point, the interrupt handler must decide, correctly, that
   there is no more work to do and return.
4. The hardware generates a new MSI. As the MSI's status bit is still
   set, this new MSI is ignored.
5. dw_handle_msi_irq() unsets the MSI status bit.

The MSI received at point 4 will never be acted upon. It occurred after
the driver had finished checking the hardware status for interrupt
conditions to act on. Since the MSI's status bit was still set, it did
not generate a new IRQ, neither when it was received nor when the status
bit was later cleared.

It seems clear there is an unsolvable race here.

After this patch, the sequence becomes as follows:

1. An MSI is received and the status bit for the MSI is set in the DWC
   PCI-e controller.
2. dw_handle_msi_irq() clears this MSI status bit.
3. dw_handle_msi_irq() calls a driver's registered interrupt handler for
   the MSI received.
4. At some point, the interrupt handler must decide, correctly, that
   there is no more work to do and return.
5. The hardware generates a new MSI. This sets the MSI status bit and
   triggers another interrupt in the interrupt controller(s) above the
   DWC PCI-e controller. As the dwc-pcie handler is not re-entrant, it is
   not run again at this time.
6. dw_handle_msi_irq() finishes. The MSI status bit remains set.
7. The IRQ is re-raised and dw_handle_msi_irq() runs again.
8. dw_handle_msi_irq() invokes the MSI's registered interrupt handler
   again, as the status bit was still set.

The observant will notice that point 5 presents the opportunity for the
SoC's interrupt controller to lose the interrupt in the same manner as
the bug in this driver. The driver for that interrupt controller will be
written to properly deal with this. In some cases the hardware supports
an EOI concept, where the 2nd IRQ is masked and internally queued in the
hardware, to be re-raised at EOI in step 7.
In other cases the IRQ will be unmasked and re-raised at step 5, but the
kernel will see that the handler is INPROGRESS and not re-invoke it,
instead setting a PENDING flag, which causes the handler to re-run at
step 7.

[1] https://patchwork.kernel.org/patch/3333681/

Fixes: 8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled, not before")
Cc: stable@vger.kernel.org
Cc: Vignesh R <vigneshr@ti.com>
Cc: Faiz Abbas <faiz_abbas@ti.com>
Cc: Jingoo Han <jingoohan1@gmail.com>
Cc: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Trent Piepho <tpiepho@impinj.com>
---
 drivers/pci/controller/dwc/pcie-designware-host.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)