
[for-linus] PCI/bwctrl: Fix NULL pointer deref on unbind and bind

Message ID 0ee5faf5395cad8d29fb66e1ec444c8d882a4201.1735852688.git.lukas@wunner.de (mailing list archive)
State Superseded
Commit bbefc2ff28b883c124cab7a436a7eb7d1fa51006
Delegated to: Bjorn Helgaas

Commit Message

Lukas Wunner Jan. 2, 2025, 9:20 p.m. UTC
The interrupt handler for bandwidth notifications, pcie_bwnotif_irq(),
dereferences a "data" pointer.

On unbind, that pointer is set to NULL by pcie_bwnotif_remove().  However
the interrupt handler may still be invoked afterwards and will dereference
that NULL pointer.

That's because the interrupt is requested using a devm_*() helper and the
driver core releases devm_*() resources *after* calling ->remove().

pcie_bwnotif_remove() does clear the Link Bandwidth Management Interrupt
Enable and Link Autonomous Bandwidth Interrupt Enable bits in the Link
Control Register, but that won't prevent execution of pcie_bwnotif_irq():
The interrupt for bandwidth notifications may be shared with AER, DPC,
PME, and hotplug.  So pcie_bwnotif_irq() may be executed as long as the
interrupt is requested.

There's a similar race on bind:  pcie_bwnotif_probe() requests the
interrupt when the "data" pointer still points to NULL.  A NULL pointer
deref may thus likewise occur if AER, DPC, PME or hotplug raise an
interrupt in-between the bandwidth controller's call to devm_request_irq()
and assignment of the "data" pointer.

Drop the devm_*() usage and reorder requesting of the interrupt to fix the
issue.

While at it, drop a stray but harmless no_free_ptr() invocation when
assigning the "data" pointer in pcie_bwnotif_probe().

Evert reports a hang on shutdown of an ASUS ROG Strix SCAR 17 G733PYV.
The issue is no longer reproducible with the present commit.

Evert found that attaching a USB-C monitor prevented the hang.  The
machine contains an ASMedia USB 3.2 controller below a hotplug-capable
Root Port.  So one possible explanation is that the controller gets
hot-removed on shutdown unless something is connected.  And the ensuing
hotplug interrupt occurs exactly when the bandwidth controller is
unregistering.  The precise cause could not be determined because the
screen had already turned black when the hang occurred.

Fixes: 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller")
Reported-by: Evert Vorster <evorster@gmail.com>
Tested-by: Evert Vorster <evorster@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219629
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/pcie/bwctrl.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

Comments

Bjorn Helgaas Jan. 2, 2025, 11:22 p.m. UTC | #1
On Thu, Jan 02, 2025 at 10:20:35PM +0100, Lukas Wunner wrote:
> [...]

Applied to pci/for-linus for v6.13, thanks to Evert and Lukas for
working so hard to get this resolved!

> ---
>  drivers/pci/pcie/bwctrl.c | 25 +++++++++++++++++--------
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
> index b59cacc..7cd8211 100644
> --- a/drivers/pci/pcie/bwctrl.c
> +++ b/drivers/pci/pcie/bwctrl.c
> @@ -303,17 +303,18 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>  	if (ret)
>  		return ret;
>  
> -	ret = devm_request_irq(&srv->device, srv->irq, pcie_bwnotif_irq,
> -			       IRQF_SHARED, "PCIe bwctrl", srv);
> +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
> +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
> +			port->link_bwctrl = data;
> +
> +	ret = request_irq(srv->irq, pcie_bwnotif_irq, IRQF_SHARED,
> +			  "PCIe bwctrl", srv);
>  	if (ret)
> -		return ret;
> +		goto err_reset_data;
>  
> -	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
> -		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
> -			port->link_bwctrl = no_free_ptr(data);
> +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
> +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
>  			pcie_bwnotif_enable(srv);
> -		}
> -	}
>  
>  	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
>  
> @@ -323,6 +324,12 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>  		port->link_bwctrl->cdev = NULL;
>  
>  	return 0;
> +
> +err_reset_data:
> +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
> +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
> +			port->link_bwctrl = NULL;
> +	return ret;
>  }
>  
>  static void pcie_bwnotif_remove(struct pcie_device *srv)
> @@ -333,6 +340,8 @@ static void pcie_bwnotif_remove(struct pcie_device *srv)
>  
>  	pcie_bwnotif_disable(srv->port);
>  
> +	free_irq(srv->irq, srv);
> +
>  	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
>  		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
>  			srv->port->link_bwctrl = NULL;
> -- 
> 2.43.0
>
Ilpo Järvinen Jan. 5, 2025, 4:54 p.m. UTC | #2
On Thu, 2 Jan 2025, Lukas Wunner wrote:

> [...]
> diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
> index b59cacc..7cd8211 100644
> --- a/drivers/pci/pcie/bwctrl.c
> +++ b/drivers/pci/pcie/bwctrl.c
> @@ -303,17 +303,18 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>  	if (ret)
>  		return ret;
>  
> -	ret = devm_request_irq(&srv->device, srv->irq, pcie_bwnotif_irq,
> -			       IRQF_SHARED, "PCIe bwctrl", srv);
> +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
> +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
> +			port->link_bwctrl = data;
> +
> +	ret = request_irq(srv->irq, pcie_bwnotif_irq, IRQF_SHARED,
> +			  "PCIe bwctrl", srv);
>  	if (ret)
> -		return ret;
> +		goto err_reset_data;
>  
> -	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
> -		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
> -			port->link_bwctrl = no_free_ptr(data);
> +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
> +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
>  			pcie_bwnotif_enable(srv);
> -		}
> -	}

Hi Lukas,

Indeed, it certainly didn't occur to me while arranging the code the way
it is that there are other sources for the same irq. However, there is a
reason those lines were within the same critical section (I also realized
it's not documented anywhere):

As bwctrl has two operating modes, one with BW notifications and the other
without them, there are races when switching between those modes during
probe wrt. calls to the lbms counting accessor, and I reused those rw
semaphores to prevent those races (the race fixes were noted only in a
history bullet of the bwctrl series).

Do you think it would be possible to also put request_irq() inside that
critical section to synchronize lbms count calls with the mode switch?
(Also beware that with scoped_guards, a goto out of the critical section
cannot be used in error handling and the NULL needs to be assigned inside
it.)

(I know Bjorn has already taken this in, so if needed, we can consider
making that change later on top of this, as those races are a much less
severe issue than what this fix is correcting.)
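
As a rough sketch (hypothetical and untested, not a patch), the suggested
arrangement could look like this, with "data" standing for the locally
allocated bwctrl state from probe:

```c
/*
 * Hypothetical sketch: publish the "data" pointer and request the IRQ
 * inside one critical section, so lbms accessors never observe a
 * half-completed mode switch.  On failure, NULL is assigned in place
 * rather than via a goto, since a goto cannot be used to leave a
 * scoped_guard block.
 */
scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
		port->link_bwctrl = data;
		ret = request_irq(srv->irq, pcie_bwnotif_irq,
				  IRQF_SHARED, "PCIe bwctrl", srv);
		if (ret)
			port->link_bwctrl = NULL;
		else
			pcie_bwnotif_enable(srv);
	}
}
```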

>  	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
>  
> @@ -323,6 +324,12 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>  		port->link_bwctrl->cdev = NULL;
>  
>  	return 0;
> +
> +err_reset_data:
> +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
> +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
> +			port->link_bwctrl = NULL;
> +	return ret;
>  }
>  
>  static void pcie_bwnotif_remove(struct pcie_device *srv)
> @@ -333,6 +340,8 @@ static void pcie_bwnotif_remove(struct pcie_device *srv)
>  
>  	pcie_bwnotif_disable(srv->port);
>  
> +	free_irq(srv->irq, srv);
> +
>  	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
>  		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
>  			srv->port->link_bwctrl = NULL;

It seems I didn't remember to fix the mode switch races here at all.
So I think here too, pcie_bwnotif_disable(), free_irq() and the NULL
assignment should be within the same critical section. Although this
doesn't seem as big a problem as on the probe side.
Lukas Wunner Jan. 7, 2025, 4:29 a.m. UTC | #3
On Sun, Jan 05, 2025 at 06:54:24PM +0200, Ilpo Järvinen wrote:
> Indeed, it certainly didn't occur to me while arranging the code the way 
> it is that there are other sources for the same irq. However, there is a 
> reason those lines were within the same critical section (I also realized 
> it's not documented anywhere):
> 
> As bwctrl has two operating modes, one with BW notifications and the other 
> without them, there are races when switching between those modes during 
> probe wrt. call to lbms counting accessor, and I reused those rw 
> semaphores to prevent those races (the race fixes were noted only in a 
> history bullet of the bwctrl series).

Could you add code comment(s) to document this?

I've respun the patch, but of course yesterday was a holiday in Finland.
So I'm hoping you get a chance to review the v2 patch today.


It seems pcie_bwctrl_setspeed_rwsem is only needed because
pcie_retrain_link() calls pcie_reset_lbms_count(), which
would recursively acquire pcie_bwctrl_lbms_rwsem.

There are only two callers of pcie_retrain_link(), so I'm
wondering if the invocation of pcie_reset_lbms_count()
can be moved to them, thus avoiding the recursive lock
acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem.
An alternative would be to have a __pcie_retrain_link() helper
which doesn't call pcie_reset_lbms_count().
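
A sketch of that alternative (a hypothetical helper split, assuming the
current two-argument pcie_retrain_link() signature):

```c
/*
 * __pcie_retrain_link() would perform the bare retrain sequence without
 * touching the lbms count; the public wrapper keeps the reset, so
 * bwctrl-internal callers that already hold pcie_bwctrl_lbms_rwsem can
 * use the double-underscore variant without recursive lock acquisition.
 */
static int __pcie_retrain_link(struct pci_dev *link, bool use_lt)
{
	/* ... existing retrain sequence, minus the lbms reset ... */
	return 0;
}

int pcie_retrain_link(struct pci_dev *link, bool use_lt)
{
	int ret = __pcie_retrain_link(link, use_lt);

	if (!ret)
		pcie_reset_lbms_count(link);	/* acquires the lbms rwsem */
	return ret;
}
```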

Right now there are no less than three locks used by bwctrl
(the two global rwsem plus the per-port mutex).  That doesn't
look elegant and makes it difficult to reason about the code,
so simplifying the locking would be desirable I think.


I'm also wondering if the IRQ handler really needs to run in
hardirq context.  Is there a reason it can't run in thread
context?  Note that CONFIG_PREEMPT_RT=y (as well as the
"threadirqs" command line option) cause the handler to be run
in thread context, so it must work properly in that situation
as well.


Another oddity that caught my eye is the counting of the
interrupts.  It seems the only place where lbms_count is read
is the pcie_failed_link_retrain() quirk, and it only cares
about the count being non-zero.  So this could be a bit in
pci_dev->priv_flags that's accessed with set_bit() / test_bit()
similar to pci_dev_assign_added() / pci_dev_is_added().

Are you planning on using the count for something else in the
future?  If not, using a flag would be simpler and more economical
memory-wise.  I'm also worried about the lbms_count overflowing.
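
A minimal sketch of the flag-based approach, assuming a hypothetical
PCI_DEV_LBMS_SEEN bit number in pci_dev->priv_flags, mirroring the
pci_dev_assign_added() / pci_dev_is_added() pattern:

```c
/*
 * Hypothetical accessors: a single sticky bit replaces lbms_count.
 * set_bit()/test_bit() are atomic, so no rwsem is needed around them.
 */
static inline void pcie_set_lbms_seen(struct pci_dev *dev)
{
	set_bit(PCI_DEV_LBMS_SEEN, &dev->priv_flags);
}

static inline bool pcie_lbms_seen(struct pci_dev *dev)
{
	return test_bit(PCI_DEV_LBMS_SEEN, &dev->priv_flags);
}
```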


Because there's hardware which signals an interrupt before actually
setting one of the two bits in the Link Status Register, I'm
wondering if it would make sense to poll the register a couple
of times in the irq handler.  Obviously this is only an option
if the handler is running in thread context.  What was the maximum
time you saw during testing that it took to set the LBMS bit belatedly?

If you don't poll for the LBMS bit, then you definitely should clear
it on unbind in case it contains a stale 1.  Or probably clear it in
any case.

Thanks,

Lukas
Ilpo Järvinen Jan. 7, 2025, 2:29 p.m. UTC | #4
On Tue, 7 Jan 2025, Lukas Wunner wrote:

> On Sun, Jan 05, 2025 at 06:54:24PM +0200, Ilpo Järvinen wrote:
> > Indeed, it certainly didn't occur to me while arranging the code the way 
> > it is that there are other sources for the same irq. However, there is a 
> > reason those lines were within the same critical section (I also realized 
> > it's not documented anywhere):
> > 
> > As bwctrl has two operating modes, one with BW notifications and the other 
> > without them, there are races when switching between those modes during 
> > probe wrt. call to lbms counting accessor, and I reused those rw 
> > semaphores to prevent those races (the race fixes were noted only in a 
> > history bullet of the bwctrl series).
> 
> Could you add code comment(s) to document this?

Sure, I'll do that once I've been able to clear the holiday-induced logjam
on the pdx86 maintainership front. :-) (For now, I added a bullet to my
todo list so I don't forget it.)

> I've respun the patch, but of course yesterday was a holiday in Finland.
> So I'm hoping you get a chance to review the v2 patch today.

Done.

> It seems pcie_bwctrl_setspeed_rwsem is only needed because
> pcie_retrain_link() calls pcie_reset_lbms_count(), which
> would recursively acquire pcie_bwctrl_lbms_rwsem.
>
> There are only two callers of pcie_retrain_link(), so I'm
> wondering if the invocation of pcie_reset_lbms_count()
> can be moved to them, thus avoiding the recursive lock
> acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem.
>
> An alternative would be to have a __pcie_retrain_link() helper
> which doesn't call pcie_reset_lbms_count().
>
> Right now there are no less than three locks used by bwctrl
> (the two global rwsem plus the per-port mutex).  That doesn't
> look elegant and makes it difficult to reason about the code,
> so simplifying the locking would be desirable I think.

I considered the __pcie_retrain_link() variant, but it felt like locking 
details internal to bwctrl would be leaking elsewhere into the code, so I 
had some level of dislike towards this solution, but I'm not strictly 
against it.
strictly against it.

It would seem most straightforward approach that wouldn't force 
moving LBMS reset to callers which feels even less elegant/obvious.
I just previously chose to keep that to complexity internal to bwctrl
but if you think adding __pcie_retrain_link() would be more elegant,
we can certainly move to that direction.

> I'm also wondering if the IRQ handler really needs to run in
> hardirq context.  Is there a reason it can't run in thread
> context?  Note that CONFIG_PREEMPT_RT=y (as well as the
> "threadirqs" command line option) cause the handler to be run
> in thread context, so it must work properly in that situation
> as well.

If thread context would work now, why was the fix in commit 
3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are 
acked") needed (the commit is from the bwnotif era)? What has changed 
since that fix?

I'm open to your suggestion, but the existence of that fix is holding me 
back. I just don't understand why it would work now when it didn't back 
then.

> Another oddity that caught my eye is the counting of the
> interrupts.  It seems the only place where lbms_count is read
> is the pcie_failed_link_retrain() quirk, and it only cares
> about the count being non-zero.  So this could be a bit in
> pci_dev->priv_flags that's accessed with set_bit() / test_bit()
> similar to pci_dev_assign_added() / pci_dev_is_added().
> 
> Are you planning on using the count for something else in the
> future?  If not, using a flag would be simpler and more economical
> memory-wise.

Somebody requested having the count exposed for troubleshooting HW 
problems (IIRC); it was privately asked of me when I posted one of 
the early versions of the bwctrl series (so quite a long time ago). I've 
just not yet created the change to expose it under sysfs.

> I'm also worried about the lbms_count overflowing.

Should I perhaps simply do pci_warn() if it happens?

> Because there's hardware which signals an interrupt before actually
> setting one of the two bits in the Link Status Register, I'm
> wondering if it would make sense to poll the register a couple
> of times in the irq handler.  Obviously this is only an option
> if the handler is running in thread context.  What was the maximum
> time you saw during testing that it took to set the LBMS bit belatedly?

Is there some misunderstanding here between us because I don't think I've 
noticed delayed LBMS assertion? What I saw was the new Link Speed not yet 
updated when Link Training was already 0. In that case, the Link Status 
register was read inside the handler so I'd assume LBMS was set to 
actually trigger the interrupt, thus, not set belatedly.

I only recall testing with reading the value again inside set speed 
functions and the Link Speed was always correct by then. I might have also 
tried polling it inside the handler but I'm sorry don't recall anymore if 
I did and what was the end result.

> If you don't poll for the LBMS bit, then you definitely should clear
> it on unbind in case it contains a stale 1.  Or probably clear it in
> any case.
Lukas Wunner Jan. 8, 2025, 12:05 p.m. UTC | #5
On Tue, Jan 07, 2025 at 04:29:18PM +0200, Ilpo Järvinen wrote:
> On Tue, 7 Jan 2025, Lukas Wunner wrote:
> > It seems pcie_bwctrl_setspeed_rwsem is only needed because
> > pcie_retrain_link() calls pcie_reset_lbms_count(), which
> > would recursively acquire pcie_bwctrl_lbms_rwsem.
> >
> > There are only two callers of pcie_retrain_link(), so I'm
> > wondering if the invocation of pcie_reset_lbms_count()
> > can be moved to them, thus avoiding the recursive lock
> > acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem.
> >
> > An alternative would be to have a __pcie_retrain_link() helper
> > which doesn't call pcie_reset_lbms_count().
> 
> I considered __pcie_retrain_link() variant but it felt like locking 
> details that are internal to bwctrl would be leaking into elsewhere in the 
> code so I had some level of dislike towards this solution, but I'm not 
> strictly against it.

That's a fair argument.

It seems the reason you're acquiring pcie_bwctrl_lbms_rwsem in
pcie_reset_lbms_count() is because you need to dereference
port->link_bwctrl so that you can access port->link_bwctrl->lbms_count.

If you get rid of lbms_count and instead use a flag in pci_dev->priv_flags,
then it seems you won't need to acquire the lock and this problem
solves itself.


> > I'm also wondering if the IRQ handler really needs to run in
> > hardirq context.  Is there a reason it can't run in thread
> > context?  Note that CONFIG_PREEMPT_RT=y (as well as the
> > "threadirqs" command line option) cause the handler to be run
> > in thread context, so it must work properly in that situation
> > as well.
> 
> If thread context would work now, why was the fix in the commit 
> 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are 
> acked") needed (the commit is from the bwnotif era)? What has changed 
> since that fix?

Nothing has changed, I had forgotten about that commit.

Basically you could move everything in pcie_bwnotif_irq() after clearing
the interrupt into an IRQ thread, but that would just be the access to the
atomic variable and the pcie_update_link_speed() call.  That's not worth it
because the overhead to wake the IRQ thread is bigger than just executing
those things in the hardirq handler.

So please ignore my comment.


> > Another oddity that caught my eye is the counting of the
> > interrupts.  It seems the only place where lbms_count is read
> > is the pcie_failed_link_retrain() quirk, and it only cares
> > about the count being non-zero.  So this could be a bit in
> > pci_dev->priv_flags that's accessed with set_bit() / test_bit()
> > similar to pci_dev_assign_added() / pci_dev_is_added().
> > 
> > Are you planning on using the count for something else in the
> > future?  If not, using a flag would be simpler and more economical
> > memory-wise.
> 
> Somebody requested having the count exposed. For troubleshooting HW 
> problems (IIRC), it was privately asked from me when I posted one of 
> the early versions of the bwctrl series (so quite long time ago). I've
> just not created that change yet to put it under sysfs.

There's a patch pending to add trace events support to native PCIe hotplug:

https://lore.kernel.org/all/20241123113108.29722-1-xueshuai@linux.alibaba.com/

If that somebody thinks they need to know how often LBMS triggered,
we could just add similar trace events for bandwidth notifications.
That gives us not only the count but also the precise time when the
bandwidth change happened, so it's arguably more useful for debugging.

Trace points are patched in and out of the code path at runtime,
so they have basically zero cost when not enabled (which would be the
default).
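
For illustration, a bandwidth-notification trace event could be sketched
along these lines (the event name and fields are purely illustrative,
modeled on the TRACE_EVENT pattern; this would live in a trace header
with the usual TRACE_SYSTEM boilerplate):

```c
/*
 * Hypothetical trace event: records which port signaled a bandwidth
 * notification and the Link Status register value at that moment,
 * giving both a count and precise timing when tracing is enabled.
 */
TRACE_EVENT(pcie_bwnotif,
	TP_PROTO(struct pci_dev *port, u16 link_status),
	TP_ARGS(port, link_status),
	TP_STRUCT__entry(
		__string(name, pci_name(port))
		__field(u16, link_status)
	),
	TP_fast_assign(
		__assign_str(name);	/* older kernels: __assign_str(name, pci_name(port)) */
		__entry->link_status = link_status;
	),
	TP_printk("%s link_status=%#06x",
		  __get_str(name), __entry->link_status)
);
```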


> > I'm also worried about the lbms_count overflowing.
>
> Should I perhaps simply do pci_warn() if it happens?

I'd prefer getting rid of the counter altogether. :)


> > Because there's hardware which signals an interrupt before actually
> > setting one of the two bits in the Link Status Register, I'm
> > wondering if it would make sense to poll the register a couple
> > of times in the irq handler.  Obviously this is only an option
> > if the handler is running in thread context.  What was the maximum
> > time you saw during testing that it took to set the LBMS bit belatedly?
> 
> Is there some misunderstanding here between us because I don't think I've 
> noticed delayed LBMS assertion? What I saw was the new Link Speed not yet 
> updated when Link Training was already 0. In that case, the Link Status 
> register was read inside the handler so I'd assume LBMS was set to 
> actually trigger the interrupt, thus, not set belatedly.

Evert's laptop has BWMgmt+ ABWMgmt+ bits set on Root Port 00:02.1
in this lspci dump:

https://bugzilla.kernel.org/attachment.cgi?id=307419&action=edit

00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
                LnkSta: Speed 8GT/s, Width x4
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+

How can it be that BWMgmt+ is set?  I would have expected the bandwidth
controller to clear that interrupt.  I can only think of two explanations:
Either BWMgmt+ was set but no interrupt was signaled.  Or the interrupt
was signaled and handled before BWMgmt+ was set.

I'm guessing the latter is the case because /proc/irq/33/spurious
indicates 1 unhandled interrupt.

Back in March 2023 when you showed me your results with various Intel
chipsets, I thought you mentioned that you witnessed this too-early
interrupt situation a couple of times.  But I may be misremembering.

Thanks,

Lukas
Ilpo Järvinen Jan. 8, 2025, 12:55 p.m. UTC | #6
On Wed, 8 Jan 2025, Lukas Wunner wrote:
> On Tue, Jan 07, 2025 at 04:29:18PM +0200, Ilpo Järvinen wrote:
> > On Tue, 7 Jan 2025, Lukas Wunner wrote:
> > > It seems pcie_bwctrl_setspeed_rwsem is only needed because
> > > pcie_retrain_link() calls pcie_reset_lbms_count(), which
> > > would recursively acquire pcie_bwctrl_lbms_rwsem.
> > >
> > > There are only two callers of pcie_retrain_link(), so I'm
> > > wondering if the invocation of pcie_reset_lbms_count()
> > > can be moved to them, thus avoiding the recursive lock
> > > acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem.
> > >
> > > An alternative would be to have a __pcie_retrain_link() helper
> > > which doesn't call pcie_reset_lbms_count().
> > 
> > I considered __pcie_retrain_link() variant but it felt like locking 
> > details that are internal to bwctrl would be leaking into elsewhere in the 
> > code so I had some level of dislike towards this solution, but I'm not 
> > strictly against it.
> 
> That's a fair argument.
> 
> It seems the reason you're acquiring pcie_bwctrl_lbms_rwsem in
> pcie_reset_lbms_count() is because you need to dereference
> port->link_bwctrl so that you can access port->link_bwctrl->lbms_count.
> 
> If you get rid of lbms_count and instead use a flag in pci_dev->priv_flags,
> then it seems you won't need to acquire the lock and this problem
> solves itself.

Agreed on both points.

> > > I'm also wondering if the IRQ handler really needs to run in
> > > hardirq context.  Is there a reason it can't run in thread
> > > context?  Note that CONFIG_PREEMPT_RT=y (as well as the
> > > "threadirqs" command line option) cause the handler to be run
> > > in thread context, so it must work properly in that situation
> > > as well.
> > 
> > If thread context would work now, why was the fix in the commit 
> > 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are 
> > acked") needed (the commit is from the bwnotif era)? What has changed 
> > since that fix?
> 
> Nothing has changed, I had forgotten about that commit.
> 
> Basically you could move everything in pcie_bwnotif_irq() after clearing
> the interrupt into an IRQ thread, but that would just be the access to the
> atomic variable and the pcie_update_link_speed() call.  That's not worth it
> because the overhead to wake the IRQ thread is bigger than just executing
> those things in the hardirq handler.
> 
> So please ignore my comment.
> 
> 
> > > Another oddity that caught my eye is the counting of the
> > > interrupts.  It seems the only place where lbms_count is read
> > > is the pcie_failed_link_retrain() quirk, and it only cares
> > > about the count being non-zero.  So this could be a bit in
> > > pci_dev->priv_flags that's accessed with set_bit() / test_bit()
> > > similar to pci_dev_assign_added() / pci_dev_is_added().
> > > 
> > > Are you planning on using the count for something else in the
> > > future?  If not, using a flag would be simpler and more economical
> > > memory-wise.
> > 
> > Somebody requested having the count exposed. For troubleshooting HW 
> > problems (IIRC), it was privately asked of me when I posted one of 
> > the early versions of the bwctrl series (so quite long time ago). I've
> > just not created that change yet to put it under sysfs.
> 
> There's a patch pending to add trace events support to native PCIe hotplug:
> 
> https://lore.kernel.org/all/20241123113108.29722-1-xueshuai@linux.alibaba.com/
> 
> If that somebody thinks they need to know how often LBMS triggered,
> we could just add similar trace events for bandwidth notifications.
> That gives us not only the count but also the precise time when the
> bandwidth change happened, so it's arguably more useful for debugging.
> 
> Trace points are patched in and out of the code path at runtime,
> so they have basically zero cost when not enabled (which would be the
> default).
> 
> 
> > > I'm also worried about the lbms_count overflowing.
> >
> > Should I perhaps simply do pci_warn() if it happens?
> 
> I'd prefer getting rid of the counter altogether. :)

Okay, it's a good suggestion. Trace events seem like a much better 
approach to the problem.

> > > Because there's hardware which signals an interrupt before actually
> > > setting one of the two bits in the Link Status Register, I'm
> > > wondering if it would make sense to poll the register a couple
> > > of times in the irq handler.  Obviously this is only an option
> > > if the handler is running in thread context.  What was the maximum
> > > time you saw during testing that it took to set the LBMS bit belatedly?
> > 
> > Is there some misunderstanding here between us because I don't think I've 
> > noticed delayed LBMS assertion? What I saw was the new Link Speed not yet 
> > updated when Link Training was already 0. In that case, the Link Status 
> > register was read inside the handler so I'd assume LBMS was set to 
> > actually trigger the interrupt, thus, not set belatedly.
> 
> Evert's laptop has BWMgmt+ ABWMgmt+ bits set on Root Port 00:02.1
> in this lspci dump:
> 
> https://bugzilla.kernel.org/attachment.cgi?id=307419&action=edit
> 
> 00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
>                 LnkSta: Speed 8GT/s, Width x4
>                         TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
> 
> How can it be that BWMgmt+ is set?  I would have expected the bandwidth
> controller to clear that interrupt.  I can only think of two explanations:
> Either BWMgmt+ was set but no interrupt was signaled.  Or the interrupt
> was signaled and handled before BWMgmt+ was set.

Either one of those, or a third alternative: it's set more than once 
and no interrupt occurs on the second assertion (could be e.g. some race 
between the write-1-to-clear and LBMS being reasserted).

But I've not seen this behavior before that report, so I cannot answer 
your question about how long it took for LBMS to get asserted.

This misbehavior is a bit problematic because those same bits are used to 
identify whether the interrupt belongs to bwctrl or not. So if they're not 
set by the time the handler runs, I have no way to tell that I'd need 
to poll those bits in the first place (and it's also known the interrupt 
is shared with other stuff :-().

> I'm guessing the latter is the case because /proc/irq/33/spurious
> indicates 1 unhandled interrupt.
> 
> Back in March 2023 when you showed me your results with various Intel
> chipsets, I thought you mentioned that you witnessed this too-early
> interrupt situation a couple of times.  But I may be misremembering.

I've seen interrupts occur before the new Link Speed had been updated, but 
those still had LT=1 and there was _another_ interrupt later. Handling
that later interrupt read the new Link Speed with LT=0, so effectively 
there was just one extra interrupt. I assume it's because LBMS is simply 
asserted twice, and the extra interrupt seemed harmless (no idea why HW 
ends up doing it though).

The other case, which is a real problem, had LT=0 but no new link speed 
in the Current Link Speed field. So it is "premature" in a sense but 
seemingly the training had completed. In that case, there was no second 
interrupt so the handler could not acquire the new link speed.

bwctrl does read the new link speed outside the handler by calling 
pcie_update_link_speed(), mainly to handle misbehaviors where
BW notifications are not coming at all (if the link speed changes 
outside of bwctrl setting the speed, those changes won't be detected). It 
will still be racy, though, if LT is deasserted before the Current Link 
Speed is set (I've not seen it fail to get the new speed, but I've not 
tried to stress test it either).

Patch

diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index b59cacc..7cd8211 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -303,17 +303,18 @@  static int pcie_bwnotif_probe(struct pcie_device *srv)
 	if (ret)
 		return ret;
 
-	ret = devm_request_irq(&srv->device, srv->irq, pcie_bwnotif_irq,
-			       IRQF_SHARED, "PCIe bwctrl", srv);
+	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
+		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
+			port->link_bwctrl = data;
+
+	ret = request_irq(srv->irq, pcie_bwnotif_irq, IRQF_SHARED,
+			  "PCIe bwctrl", srv);
 	if (ret)
-		return ret;
+		goto err_reset_data;
 
-	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
-		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
-			port->link_bwctrl = no_free_ptr(data);
+	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
+		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
 			pcie_bwnotif_enable(srv);
-		}
-	}
 
 	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
 
@@ -323,6 +324,12 @@  static int pcie_bwnotif_probe(struct pcie_device *srv)
 		port->link_bwctrl->cdev = NULL;
 
 	return 0;
+
+err_reset_data:
+	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
+		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
+			port->link_bwctrl = NULL;
+	return ret;
 }
 
 static void pcie_bwnotif_remove(struct pcie_device *srv)
@@ -333,6 +340,8 @@  static void pcie_bwnotif_remove(struct pcie_device *srv)
 
 	pcie_bwnotif_disable(srv->port);
 
+	free_irq(srv->irq, srv);
+
 	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
 		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
 			srv->port->link_bwctrl = NULL;