Message ID | 0ee5faf5395cad8d29fb66e1ec444c8d882a4201.1735852688.git.lukas@wunner.de (mailing list archive) |
---|---|
State | Superseded |
Commit | bbefc2ff28b883c124cab7a436a7eb7d1fa51006 |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | [for-linus] PCI/bwctrl: Fix NULL pointer deref on unbind and bind | expand |
On Thu, Jan 02, 2025 at 10:20:35PM +0100, Lukas Wunner wrote: > The interrupt handler for bandwidth notifications, pcie_bwnotif_irq(), > dereferences a "data" pointer. > > On unbind, that pointer is set to NULL by pcie_bwnotif_remove(). However > the interrupt handler may still be invoked afterwards and will dereference > that NULL pointer. > > That's because the interrupt is requested using a devm_*() helper and the > driver core releases devm_*() resources *after* calling ->remove(). > > pcie_bwnotif_remove() does clear the Link Bandwidth Management Interrupt > Enable and Link Autonomous Bandwidth Interrupt Enable bits in the Link > Control Register, but that won't prevent execution of pcie_bwnotif_irq(): > The interrupt for bandwidth notifications may be shared with AER, DPC, > PME, and hotplug. So pcie_bwnotif_irq() may be executed as long as the > interrupt is requested. > > There's a similar race on bind: pcie_bwnotif_probe() requests the > interrupt when the "data" pointer still points to NULL. A NULL pointer > deref may thus likewise occur if AER, DPC, PME or hotplug raise an > interrupt in-between the bandwidth controller's call to devm_request_irq() > and assignment of the "data" pointer. > > Drop the devm_*() usage and reorder requesting of the interrupt to fix the > issue. > > While at it, drop a stray but harmless no_free_ptr() invocation when > assigning the "data" pointer in pcie_bwnotif_probe(). > > Evert reports a hang on shutdown of an ASUS ROG Strix SCAR 17 G733PYV. > The issue is no longer reproducible with the present commit. > > Evert found that attaching a USB-C monitor prevented the hang. The > machine contains an ASMedia USB 3.2 controller below a hotplug-capable > Root Port. So one possible explanation is that the controller gets > hot-removed on shutdown unless something is connected. And the ensuing > hotplug interrupt occurs exactly when the bandwidth controller is > unregistering. The precise cause could not be determined because the > screen had already turned black when the hang occurred. > > Fixes: 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller") > Reported-by: Evert Vorster <evorster@gmail.com> > Tested-by: Evert Vorster <evorster@gmail.com> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219629 > Signed-off-by: Lukas Wunner <lukas@wunner.de> Applied to pci/for-linus for v6.13, thanks, Evert and Lukas for working to hard to get this resolved! > --- > drivers/pci/pcie/bwctrl.c | 25 +++++++++++++++++-------- > 1 file changed, 17 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c > index b59cacc..7cd8211 100644 > --- a/drivers/pci/pcie/bwctrl.c > +++ b/drivers/pci/pcie/bwctrl.c > @@ -303,17 +303,18 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > if (ret) > return ret; > > - ret = devm_request_irq(&srv->device, srv->irq, pcie_bwnotif_irq, > - IRQF_SHARED, "PCIe bwctrl", srv); > + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > + port->link_bwctrl = data; > + > + ret = request_irq(srv->irq, pcie_bwnotif_irq, IRQF_SHARED, > + "PCIe bwctrl", srv); > if (ret) > - return ret; > + goto err_reset_data; > > - scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) { > - scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) { > - port->link_bwctrl = no_free_ptr(data); > + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > pcie_bwnotif_enable(srv); > - } > - } > > pci_dbg(port, "enabled with IRQ %d\n", srv->irq); > > @@ -323,6 +324,12 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > port->link_bwctrl->cdev = NULL; > > return 0; > + > +err_reset_data: > + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > + port->link_bwctrl = NULL; > + return ret; > } > > static void pcie_bwnotif_remove(struct pcie_device *srv) > @@ -333,6 +340,8 @@ static void pcie_bwnotif_remove(struct pcie_device *srv) > > pcie_bwnotif_disable(srv->port); > > + free_irq(srv->irq, srv); > + > scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > srv->port->link_bwctrl = NULL; > -- > 2.43.0 >
On Thu, 2 Jan 2025, Lukas Wunner wrote: > The interrupt handler for bandwidth notifications, pcie_bwnotif_irq(), > dereferences a "data" pointer. > > On unbind, that pointer is set to NULL by pcie_bwnotif_remove(). However > the interrupt handler may still be invoked afterwards and will dereference > that NULL pointer. > > That's because the interrupt is requested using a devm_*() helper and the > driver core releases devm_*() resources *after* calling ->remove(). > > pcie_bwnotif_remove() does clear the Link Bandwidth Management Interrupt > Enable and Link Autonomous Bandwidth Interrupt Enable bits in the Link > Control Register, but that won't prevent execution of pcie_bwnotif_irq(): > The interrupt for bandwidth notifications may be shared with AER, DPC, > PME, and hotplug. So pcie_bwnotif_irq() may be executed as long as the > interrupt is requested. > > There's a similar race on bind: pcie_bwnotif_probe() requests the > interrupt when the "data" pointer still points to NULL. A NULL pointer > deref may thus likewise occur if AER, DPC, PME or hotplug raise an > interrupt in-between the bandwidth controller's call to devm_request_irq() > and assignment of the "data" pointer. > > Drop the devm_*() usage and reorder requesting of the interrupt to fix the > issue. > > While at it, drop a stray but harmless no_free_ptr() invocation when > assigning the "data" pointer in pcie_bwnotif_probe(). > > Evert reports a hang on shutdown of an ASUS ROG Strix SCAR 17 G733PYV. > The issue is no longer reproducible with the present commit. > > Evert found that attaching a USB-C monitor prevented the hang. The > machine contains an ASMedia USB 3.2 controller below a hotplug-capable > Root Port. So one possible explanation is that the controller gets > hot-removed on shutdown unless something is connected. And the ensuing > hotplug interrupt occurs exactly when the bandwidth controller is > unregistering. The precise cause could not be determined because the > screen had already turned black when the hang occurred. > > Fixes: 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller") > Reported-by: Evert Vorster <evorster@gmail.com> > Tested-by: Evert Vorster <evorster@gmail.com> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219629 > Signed-off-by: Lukas Wunner <lukas@wunner.de> > --- > drivers/pci/pcie/bwctrl.c | 25 +++++++++++++++++-------- > 1 file changed, 17 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c > index b59cacc..7cd8211 100644 > --- a/drivers/pci/pcie/bwctrl.c > +++ b/drivers/pci/pcie/bwctrl.c > @@ -303,17 +303,18 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > if (ret) > return ret; > > - ret = devm_request_irq(&srv->device, srv->irq, pcie_bwnotif_irq, > - IRQF_SHARED, "PCIe bwctrl", srv); > + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > + port->link_bwctrl = data; > + > + ret = request_irq(srv->irq, pcie_bwnotif_irq, IRQF_SHARED, > + "PCIe bwctrl", srv); > if (ret) > - return ret; > + goto err_reset_data; > > - scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) { > - scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) { > - port->link_bwctrl = no_free_ptr(data); > + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > pcie_bwnotif_enable(srv); > - } > - } Hi Lukas, Indeed, it certainly didn't occur to me while arranging the code the way it is that there are other sources for the same irq. However, there is a reason those lines where within the same critical section (I also realized it's not documented anywhere): As bwctrl has two operating modes, one with BW notifications and the other without them, there are races when switching between those modes during probe wrt. call to lbms counting accessor, and I reused those rw semaphores to prevent those race (the race fixes were noted only in a history bullet of the bwctrl series). Do you think it would be possible to put also request_irq() inside that critical section to synchronize lbms count calls with the mode switch? (Also beware that with scoped_guards, a goto out of the critical section cannot be used in error handling and the NULL needs to be assigned inside it.) (I know Bjorn has taken this already in so if needed, we can consider making that change later on top of this as those races are much less severe issue than what this fix is correcting.) > pci_dbg(port, "enabled with IRQ %d\n", srv->irq); > > @@ -323,6 +324,12 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > port->link_bwctrl->cdev = NULL; > > return 0; > + > +err_reset_data: > + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > + port->link_bwctrl = NULL; > + return ret; > } > > static void pcie_bwnotif_remove(struct pcie_device *srv) > @@ -333,6 +340,8 @@ static void pcie_bwnotif_remove(struct pcie_device *srv) > > pcie_bwnotif_disable(srv->port); > > + free_irq(srv->irq, srv); > + > scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) > scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) > srv->port->link_bwctrl = NULL; It seems I didn't remember to fix the mode switch races here at all. So I think also here, pcie_bwnotif_disable(), free_irq() and NULL assignment, should be within the same critical section. Although this doesn't seem very big problemn compared with the probe side.
On Sun, Jan 05, 2025 at 06:54:24PM +0200, Ilpo Järvinen wrote: > Indeed, it certainly didn't occur to me while arranging the code the way > it is that there are other sources for the same irq. However, there is a > reason those lines where within the same critical section (I also realized > it's not documented anywhere): > > As bwctrl has two operating modes, one with BW notifications and the other > without them, there are races when switching between those modes during > probe wrt. call to lbms counting accessor, and I reused those rw > semaphores to prevent those race (the race fixes were noted only in a > history bullet of the bwctrl series). Could you add code comment(s) to document this? I've respun the patch, but of course yesterday was a holiday in Finland. So I'm hoping you get a chance to review the v2 patch today. It seems pcie_bwctrl_setspeed_rwsem is only needed because pcie_retrain_link() calls pcie_reset_lbms_count(), which would recursively acquire pcie_bwctrl_lbms_rwsem. There are only two callers of pcie_retrain_link(), so I'm wondering if the invocation of pcie_reset_lbms_count() can be moved to them, thus avoiding the recursive lock acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem. An alternative would be to have a __pcie_retrain_link() helper which doesn't call pcie_reset_lbms_count(). Right now there are no less than three locks used by bwctrl (the two global rwsem plus the per-port mutex). That doesn't look elegant and makes it difficult to reason about the code, so simplifying the locking would be desirable I think. I'm also wondering if the IRQ handler really needs to run in hardirq context. Is there a reason it can't run in thread context? Note that CONFIG_PREEMPT_RT=y (as well as the "threadirqs" command line option) cause the handler to be run in thread context, so it must work properly in that situation as well. Another oddity that caught my eye is the counting of the interrupts. It seems the only place where lbms_count is read is the pcie_failed_link_retrain() quirk, and it only cares about the count being non-zero. So this could be a bit in pci_dev->priv_flags that's accessed with set_bit() / test_bit() similar to pci_dev_assign_added() / pci_dev_is_added(). Are you planning on using the count for something else in the future? If not, using a flag would be simpler and more economical memory-wise. I'm also worried about the lbms_count overflowing. Because there's hardware which signals an interrupt before actually setting one of the two bits in the Link Status Register, I'm wondering if it would make sense to poll the register a couple of times in the irq handler. Obviously this is only an option if the handler is running in thread context. What was the maximum time you saw during testing that it took to set the LBMS bit belatedly? If you don't poll for the LBMS bit, then you definitely should clear it on unbind in case it contains a stale 1. Or probably clear it in any case. Thanks, Lukas
On Tue, 7 Jan 2025, Lukas Wunner wrote: > On Sun, Jan 05, 2025 at 06:54:24PM +0200, Ilpo Järvinen wrote: > > Indeed, it certainly didn't occur to me while arranging the code the way > > it is that there are other sources for the same irq. However, there is a > > reason those lines where within the same critical section (I also realized > > it's not documented anywhere): > > > > As bwctrl has two operating modes, one with BW notifications and the other > > without them, there are races when switching between those modes during > > probe wrt. call to lbms counting accessor, and I reused those rw > > semaphores to prevent those race (the race fixes were noted only in a > > history bullet of the bwctrl series). > > Could you add code comment(s) to document this? Sure, I'll do that once I've been able to clear the holiday-induced logjam on pdx86 maintainership front. :-) (For now, I added a bullet to my todo list to not forget it). > I've respun the patch, but of course yesterday was a holiday in Finland. > So I'm hoping you get a chance to review the v2 patch today. Done. > It seems pcie_bwctrl_setspeed_rwsem is only needed because > pcie_retrain_link() calls pcie_reset_lbms_count(), which > would recursively acquire pcie_bwctrl_lbms_rwsem. > > There are only two callers of pcie_retrain_link(), so I'm > wondering if the invocation of pcie_reset_lbms_count() > can be moved to them, thus avoiding the recursive lock > acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem. > > An alternative would be to have a __pcie_retrain_link() helper > which doesn't call pcie_reset_lbms_count(). > > Right now there are no less than three locks used by bwctrl > (the two global rwsem plus the per-port mutex). That doesn't > look elegant and makes it difficult to reason about the code, > so simplifying the locking would be desirable I think. I considered __pcie_retrain_link() variant but it felt like locking details that are internal to bwctrl would be leaking into elsewhere in the code so I had some level of dislike towards this solution, but I'm not strictly against it. It would seem most straightforward approach that wouldn't force moving LBMS reset to callers which feels even less elegant/obvious. I just previously chose to keep that to complexity internal to bwctrl but if you think adding __pcie_retrain_link() would be more elegant, we can certainly move to that direction. > I'm also wondering if the IRQ handler really needs to run in > hardirq context. Is there a reason it can't run in thread > context? Note that CONFIG_PREEMPT_RT=y (as well as the > "threadirqs" command line option) cause the handler to be run > in thread context, so it must work properly in that situation > as well. If thread context would work now, why was the fix in the commit 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked")) needed (the commit is from the bwnotif era)? What has changed since that fix? I'm open to our suggestion but that existence of that fix is keeping me back. I just don't understand why it would work now when it didn't back then. > Another oddity that caught my eye is the counting of the > interrupts. It seems the only place where lbms_count is read > is the pcie_failed_link_retrain() quirk, and it only cares > about the count being non-zero. So this could be a bit in > pci_dev->priv_flags that's accessed with set_bit() / test_bit() > similar to pci_dev_assign_added() / pci_dev_is_added(). > > Are you planning on using the count for something else in the > future? If not, using a flag would be simpler and more economical > memory-wise. Somebody requested having the count exposed. For troubleshooting HW problems (IIRC), it was privately asked from me when I posted one of the early versions of the bwctrl series (so quite long time ago). I've just not created that change yet to put it under sysfs. > I'm also worried about the lbms_count overflowing. Should I perhaps simply do pci_warn() if it happens? > Because there's hardware which signals an interrupt before actually > setting one of the two bits in the Link Status Register, I'm > wondering if it would make sense to poll the register a couple > of times in the irq handler. Obviously this is only an option > if the handler is running in thread context. What was the maximum > time you saw during testing that it took to set the LBMS bit belatedly? Is there some misunderstanding here between us because I don't think I've noticed delayed LBMS assertion? What I saw was the new Link Speed not yet updated when Link Training was already 0. In that case, the Link Status register was read inside the handler so I'd assume LBMS was set to actually trigger the interrupt, thus, not set belatedly. I only recall testing with reading the value again inside set speed functions and the Link Speed was always correct by then. I might have also tried polling it inside the handler but I'm sorry don't recall anymore if I did and what was the end result. > If you don't poll for the LBMS bit, then you definitely should clear > it on unbind in case it contains a stale 1. Or probably clear it in > any case.
On Tue, Jan 07, 2025 at 04:29:18PM +0200, Ilpo Järvinen wrote: > On Tue, 7 Jan 2025, Lukas Wunner wrote: > > It seems pcie_bwctrl_setspeed_rwsem is only needed because > > pcie_retrain_link() calls pcie_reset_lbms_count(), which > > would recursively acquire pcie_bwctrl_lbms_rwsem. > > > > There are only two callers of pcie_retrain_link(), so I'm > > wondering if the invocation of pcie_reset_lbms_count() > > can be moved to them, thus avoiding the recursive lock > > acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem. > > > > An alternative would be to have a __pcie_retrain_link() helper > > which doesn't call pcie_reset_lbms_count(). > > I considered __pcie_retrain_link() variant but it felt like locking > details that are internal to bwctrl would be leaking into elsewhere in the > code so I had some level of dislike towards this solution, but I'm not > strictly against it. That's a fair argument. It seems the reason you're acquiring pcie_bwctrl_lbms_rwsem in pcie_reset_lbms_count() is because you need to dereference port->link_bwctrl so that you can access port->link_bwctrl->lbms_count. If you get rid of lbms_count and instead use a flag in pci_dev->priv_flags, then it seems you won't need to acquire the lock and this problem solves itself. > > I'm also wondering if the IRQ handler really needs to run in > > hardirq context. Is there a reason it can't run in thread > > context? Note that CONFIG_PREEMPT_RT=y (as well as the > > "threadirqs" command line option) cause the handler to be run > > in thread context, so it must work properly in that situation > > as well. > > If thread context would work now, why was the fix in the commit > 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are > acked")) needed (the commit is from the bwnotif era)? What has changed > since that fix? Nothing has changed, I had forgotten about that commit. Basically you could move everything in pcie_bwnotif_irq() after clearing the interrupt into an IRQ thread, but that would just be the access to the atomic variable and the pcie_update_link_speed() call. That's not worth it because the overhead to wake the IRQ thread is bigger than just executing those things in the hardirq handler. So please ignore my comment. > > Another oddity that caught my eye is the counting of the > > interrupts. It seems the only place where lbms_count is read > > is the pcie_failed_link_retrain() quirk, and it only cares > > about the count being non-zero. So this could be a bit in > > pci_dev->priv_flags that's accessed with set_bit() / test_bit() > > similar to pci_dev_assign_added() / pci_dev_is_added(). > > > > Are you planning on using the count for something else in the > > future? If not, using a flag would be simpler and more economical > > memory-wise. > > Somebody requested having the count exposed. For troubleshooting HW > problems (IIRC), it was privately asked from me when I posted one of > the early versions of the bwctrl series (so quite long time ago). I've > just not created that change yet to put it under sysfs. There's a patch pending to add trace events support to native PCIe hotplug: https://lore.kernel.org/all/20241123113108.29722-1-xueshuai@linux.alibaba.com/ If that somebody thinks they need to know how often LBMS triggered, we could just add similar trace events for bandwidth notifications. That gives us not only the count but also the precise time when the bandwidth change happened, so it's arguably more useful for debugging. Trace points are patched in and out of the code path at runtime, so they have basically zero cost when not enabled (which would be the default). > > I'm also worried about the lbms_count overflowing. > > Should I perhaps simply do pci_warn() if it happens? I'd prefere getting rid of the counter altogether. :) > > Because there's hardware which signals an interrupt before actually > > setting one of the two bits in the Link Status Register, I'm > > wondering if it would make sense to poll the register a couple > > of times in the irq handler. Obviously this is only an option > > if the handler is running in thread context. What was the maximum > > time you saw during testing that it took to set the LBMS bit belatedly? > > Is there some misunderstanding here between us because I don't think I've > noticed delayed LBMS assertion? What I saw was the new Link Speed not yet > updated when Link Training was already 0. In that case, the Link Status > register was read inside the handler so I'd assume LBMS was set to > actually trigger the interrupt, thus, not set belatedly. Evert's laptop has BWMgmt+ ABWMgmt+ bits set on Root Port 00:02.1 in this lspci dump: https://bugzilla.kernel.org/attachment.cgi?id=307419&action=edit 00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode]) LnkSta: Speed 8GT/s, Width x4 TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+ How can it be that BWMgmt+ is set? I would have expected the bandwidth controller to clear that interrupt. I can only think of two explanations: Either BWMgmt+ was set but no interrupt was signaled. Or the interrupt was signaled and handled before BWMgmt+ was set. I'm guessing the latter is the case because /proc/irq/33/spurious indicates 1 unhandled interrupt. Back in March 2023 when you showed me your results with various Intel chipsets, I thought you mentioned that you witnessed this too-early interrupt situation a couple of times. But I may be misremembering. Thanks, Lukas
On Wed, 8 Jan 2025, Lukas Wunner wrote: > On Tue, Jan 07, 2025 at 04:29:18PM +0200, Ilpo Järvinen wrote: > > On Tue, 7 Jan 2025, Lukas Wunner wrote: > > > It seems pcie_bwctrl_setspeed_rwsem is only needed because > > > pcie_retrain_link() calls pcie_reset_lbms_count(), which > > > would recursively acquire pcie_bwctrl_lbms_rwsem. > > > > > > There are only two callers of pcie_retrain_link(), so I'm > > > wondering if the invocation of pcie_reset_lbms_count() > > > can be moved to them, thus avoiding the recursive lock > > > acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem. > > > > > > An alternative would be to have a __pcie_retrain_link() helper > > > which doesn't call pcie_reset_lbms_count(). > > > > I considered __pcie_retrain_link() variant but it felt like locking > > details that are internal to bwctrl would be leaking into elsewhere in the > > code so I had some level of dislike towards this solution, but I'm not > > strictly against it. > > That's a fair argument. > > It seems the reason you're acquiring pcie_bwctrl_lbms_rwsem in > pcie_reset_lbms_count() is because you need to dereference > port->link_bwctrl so that you can access port->link_bwctrl->lbms_count. > > If you get rid of lbms_count and instead use a flag in pci_dev->priv_flags, > then it seems you won't need to acquire the lock and this problem > solves itself. Agreed on both points. > > > I'm also wondering if the IRQ handler really needs to run in > > > hardirq context. Is there a reason it can't run in thread > > > context? Note that CONFIG_PREEMPT_RT=y (as well as the > > > "threadirqs" command line option) cause the handler to be run > > > in thread context, so it must work properly in that situation > > > as well. > > > > If thread context would work now, why was the fix in the commit > > 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are > > acked")) needed (the commit is from the bwnotif era)? What has changed > > since that fix? > > Nothing has changed, I had forgotten about that commit. > > Basically you could move everything in pcie_bwnotif_irq() after clearing > the interrupt into an IRQ thread, but that would just be the access to the > atomic variable and the pcie_update_link_speed() call. That's not worth it > because the overhead to wake the IRQ thread is bigger than just executing > those things in the hardirq handler. > > So please ignore my comment. > > > > > Another oddity that caught my eye is the counting of the > > > interrupts. It seems the only place where lbms_count is read > > > is the pcie_failed_link_retrain() quirk, and it only cares > > > about the count being non-zero. So this could be a bit in > > > pci_dev->priv_flags that's accessed with set_bit() / test_bit() > > > similar to pci_dev_assign_added() / pci_dev_is_added(). > > > > > > Are you planning on using the count for something else in the > > > future? If not, using a flag would be simpler and more economical > > > memory-wise. > > > > Somebody requested having the count exposed. For troubleshooting HW > > problems (IIRC), it was privately asked from me when I posted one of > > the early versions of the bwctrl series (so quite long time ago). I've > > just not created that change yet to put it under sysfs. > > There's a patch pending to add trace events support to native PCIe hotplug: > > https://lore.kernel.org/all/20241123113108.29722-1-xueshuai@linux.alibaba.com/ > > If that somebody thinks they need to know how often LBMS triggered, > we could just add similar trace events for bandwidth notifications. > That gives us not only the count but also the precise time when the > bandwidth change happened, so it's arguably more useful for debugging. > > Trace points are patched in and out of the code path at runtime, > so they have basically zero cost when not enabled (which would be the > default). > > > > > I'm also worried about the lbms_count overflowing. > > > > Should I perhaps simply do pci_warn() if it happens? > > I'd prefere getting rid of the counter altogether. :) Okay, it's a good suggestion. Trace events seem like a much better approach to the problem. > > > Because there's hardware which signals an interrupt before actually > > > setting one of the two bits in the Link Status Register, I'm > > > wondering if it would make sense to poll the register a couple > > > of times in the irq handler. Obviously this is only an option > > > if the handler is running in thread context. What was the maximum > > > time you saw during testing that it took to set the LBMS bit belatedly? > > > > Is there some misunderstanding here between us because I don't think I've > > noticed delayed LBMS assertion? What I saw was the new Link Speed not yet > > updated when Link Training was already 0. In that case, the Link Status > > register was read inside the handler so I'd assume LBMS was set to > > actually trigger the interrupt, thus, not set belatedly. > > Evert's laptop has BWMgmt+ ABWMgmt+ bits set on Root Port 00:02.1 > in this lspci dump: > > https://bugzilla.kernel.org/attachment.cgi?id=307419&action=edit > > 00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode]) > LnkSta: Speed 8GT/s, Width x4 > TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+ > > How can it be that BWMgmt+ is set? I would have expected the bandwidth > controller to clear that interrupt. I can only think of two explanations: > Either BWMgmt+ was set but no interrupt was signaled. Or the interrupt > was signaled and handled before BWMgmt+ was set. Either one of those, or third alternative that it's set more than once and no interrupt occurs on the second assertion (could be e.g. some race due to write-1-to-clear and reasserting LBMS). But, I've not seen this behavior before that report so I cannot answer to your question about how long it took for LBMS to get asserted. This misbehavior is a bit problematic because those same bits are used to identify if the interrupt belongs for bwctrl or not. So if they're not set by the time the handler runs, I don't have a way to identify I'd need to poll those bits in the first place (and it's also known the interrupt is shared with other stuff :-(). > I'm guessing the latter is the case because /proc/irq/33/spurious > indicates 1 unhandled interrupt. > > Back in March 2023 when you showed me your results with various Intel > chipsets, I thought you mentioned that you witnessed this too-early > interrupt situation a couple of times. But I may be misremembering. I've seen interrupts occur before the new Link Speed has been updated but those still had LT=1 and there was _another_ interrupt later. Handling that later interrupt read the new Link Speed and LT=0 so effectively there was just one extra interrupt. I assume it's because LBMS is simply asserted twice and the extra interrupt seemed harmless (no idea why HW ends up doing it though). The other case, which is a real problem, had LT=0 but no new link speed in the Current Link Speed field. So it is "premature" in a sense but seemingly the training had completed. In that case, there was no second interrupt so the handler could not acquire the new link speed. bwctrl does read the new link speed outside the handler by calling pcie_update_link_speed() which is mainly to handle misbehaviors where BW notifications are not coming at all (if the link speed changes outside of bwctrl setting speed, those changes won't be detected). It will still be racy though if LT is deasserted before the Current Link Speed is set (I've not seen it to fail to get the new speed but I've not tried to stress test it either).
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c index b59cacc..7cd8211 100644 --- a/drivers/pci/pcie/bwctrl.c +++ b/drivers/pci/pcie/bwctrl.c @@ -303,17 +303,18 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) if (ret) return ret; - ret = devm_request_irq(&srv->device, srv->irq, pcie_bwnotif_irq, - IRQF_SHARED, "PCIe bwctrl", srv); + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) + port->link_bwctrl = data; + + ret = request_irq(srv->irq, pcie_bwnotif_irq, IRQF_SHARED, + "PCIe bwctrl", srv); if (ret) - return ret; + goto err_reset_data; - scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) { - scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) { - port->link_bwctrl = no_free_ptr(data); + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) pcie_bwnotif_enable(srv); - } - } pci_dbg(port, "enabled with IRQ %d\n", srv->irq); @@ -323,6 +324,12 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) port->link_bwctrl->cdev = NULL; return 0; + +err_reset_data: + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) + scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) + port->link_bwctrl = NULL; + return ret; } static void pcie_bwnotif_remove(struct pcie_device *srv) @@ -333,6 +340,8 @@ static void pcie_bwnotif_remove(struct pcie_device *srv) pcie_bwnotif_disable(srv->port); + free_irq(srv->irq, srv); + scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) srv->port->link_bwctrl = NULL;