Message ID | 20230804071757.383971-1-u.kleine-koenig@pengutronix.de (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: phy: Don't disable irqs on shutdown if WoL is enabled | expand |
On Fri, 4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote: > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks > WoL at least on PHYs covered by the marvell driver. So skip disabling > irqs on shutdown if WoL is enabled. > > While at it also explain the motivation that irqs are disabled at all. > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure") > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> What do we do with this one? It sounded like Russell was leaning towards a revert? FTR original report: https://lore.kernel.org/all/20230803181640.yzxsk2xphwryxww4@pengutronix.de/ > while I'm not sure that disabling interrupts is a good idea in general, > this change at least should fix the WoL case. Note that this change is > only compile tested as next doesn't boot on my test machine (because of > https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa) > and 6.1 (which is the other kernel I have running) doesn't know about > .wol_enabled. I don't want to delay this fix until I bisected this new > issue. > > Assuming this patch is eligible for backporting to stable, maybe point > out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers > to always call into ->suspend()"). Didn't try to backport that. > > Best regards > Uwe > > drivers/net/phy/phy_device.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > index 61921d4dbb13..6d1526bdd1d7 100644 > --- a/drivers/net/phy/phy_device.c > +++ b/drivers/net/phy/phy_device.c > @@ -3340,6 +3340,15 @@ static void phy_shutdown(struct device *dev) > if (phydev->state == PHY_READY || !phydev->attached_dev) > return; > > + /* Most phys signal WoL via the irq line. So for these irqs shouldn't be > + * disabled. > + */ > + if (phydev->wol_enabled) > + return; > + > + /* On shutdown disable irqs to prevent an irq storm on systems where the > + * irq line is shared by several devices. > + */ > phy_disable_interrupts(phydev); > } >
On 8/8/23 14:53, Jakub Kicinski wrote: > On Fri, 4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote: >> Most PHYs signal WoL using an interrupt. So disabling interrupts breaks >> WoL at least on PHYs covered by the marvell driver. So skip disabling >> irqs on shutdown if WoL is enabled. >> >> While at it also explain the motivation that irqs are disabled at all. >> >> Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure") >> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> > > What do we do with this one? It sounded like Russell was leaning > towards a revert? Yes, though I believe this will create a different kind of regression for what Iona was addressing initially. Then it becomes a choice of which regression do we consider to be the worst to handle until something better comes up. Russell what are your thoughts?
On Tue, Aug 08, 2023 at 02:59:41PM -0700, Florian Fainelli wrote: > On 8/8/23 14:53, Jakub Kicinski wrote: > > On Fri, 4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote: > > > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks > > > WoL at least on PHYs covered by the marvell driver. So skip disabling > > > irqs on shutdown if WoL is enabled. > > > > > > While at it also explain the motivation that irqs are disabled at all. > > > > > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure") > > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> > > > > What do we do with this one? It sounded like Russell was leaning > > towards a revert? > > Yes, though I believe this will create a different kind of regression for > what Iona was addressing initially. Then it becomes a choice of which > regression do we consider to be the worst to handle until something better > comes up. > > Russell what are your thoughts? In this situation where a fix for a problem is provided which then causes a regression by fixing that problem, I've seen Linus T state that it means the fix was incorrect. That seems entirely sensible. We are, of course, in the situation where reverting the commit restores the old behaviour and thus fixes a regression, but causes a regression for another user. If it is possible to quickly come up with a fix that avoids any regression to either use case, then that is obviously preferable. However, if that's not possible, then it seems going back to the original situation (i.o.w. reverting) is sensible. Now, the fact is that many PHYs do use their interrupts to signal that a wake-up happened, and disabling the IRQ from the PHY will prevent WoL from working. Other PHYs have a separate pin for this. Two recent examples are AR8035, which only has a single interrupt pin which covers all interrupts from that PHY, and AR8031 or AR8033 which have a separate WOL_INT pin which might be used - or the main interrupt pin. If we hibernate the system, then people how have configured WoL are going to expect it to work - but disabling the ability for the PHY to raise an interrupt will prevent it. So, clearly always disabling PHY interrupts can have a detrimental effect on the ability to wake a system up using WoL - where the PHY interrupt is used to signal WoL to the rest of the system. Now, if waking the system up from hibernation using WoL involves the PHY asserting its interrupt pin, then the system must be capable of dealing with the PHY asserting its interrupt while the system is booting. Remember that the way Linux hibernation works, that boot is just the same as a regular boot right up through the normal kernel initialisation. It is only towards the end that the kernel detects the signature in swap space, and then does the funky stuff to resume from the saved data. So, during that boot, the system has to cope with that interrupt having been asserted by the PHY hardware. Either system firmware has to recognise that was the wake-up event and deal with it (e.g. disabling the interrupt source) before passing control to the kernel, or the kernel has to be able to cope with that interrupt being stuck at active state until the PHY driver can deal with it. Obviously, if WoL is not enabled or supported, then disabling the PHY interrupt should be harmless - but that will have the effect of masking any issues that a platform may have until PHY based WoL has been enabled. Also, don't forget that we have this kexec thing - and the .shutdown methods will be called just before handing control to the new kernel. Uwe's patch solves the problem that he's experiencing - because it makes the interrupt disabling dependent on the WoL configuration. However, Ioana's problem would still remain - and enabling WoL on that platform will make it reappear - and thus it still needs to be properly fixed. If that problem is properly fixed, then we don't need to disable PHY interrupts, which means a revert would be the right approach. Honestly, I don't know what would be best - and I don't believe we've heard from Ioana about the problem that was trying to be addressed (e.g. exactly when it happened and why.) If I had to guess: - the PHY in question may be sharing an interrupt with another device - when that other device probes and claims its interrupt, an interrupt storm ensues - the interrupt layer disables the interrupt input, rendering both the PHY and other device unusable. I think I've covered all the possibilities, all the issues, outcomes, and the politics as far as Linus T would state. I'm also quite sure that there will be no way to satisfy everyone! Bearing in mind that it is holiday season, and we're at -rc5, I think we should give Ioana a bit more time to respond before we make a decision. Maybe a little over a week? If we don't hear anything, then I think following our established policy and reverting would be the correct way forward.
Hi Uwe, (I hope the threading won't be broken) On Fri, Aug 04, 2023 at 09:17:57AM +0200, Uwe Kleine-König wrote: > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks > WoL at least on PHYs covered by the marvell driver. So skip disabling > irqs on shutdown if WoL is enabled. > > While at it also explain the motivation that irqs are disabled at all. > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure") > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> > --- > Hello, > > while I'm not sure that disabling interrupts is a good idea in general, > this change at least should fix the WoL case. Note that this change is > only compile tested as next doesn't boot on my test machine (because of > https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa) > and 6.1 (which is the other kernel I have running) doesn't know about > .wol_enabled. I don't want to delay this fix until I bisected this new > issue. > > Assuming this patch is eligible for backporting to stable, maybe point > out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers > to always call into ->suspend()"). Didn't try to backport that. > > Best regards > Uwe > > drivers/net/phy/phy_device.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > index 61921d4dbb13..6d1526bdd1d7 100644 > --- a/drivers/net/phy/phy_device.c > +++ b/drivers/net/phy/phy_device.c > @@ -3340,6 +3340,15 @@ static void phy_shutdown(struct device *dev) > if (phydev->state == PHY_READY || !phydev->attached_dev) > return; > > + /* Most phys signal WoL via the irq line. So for these irqs shouldn't be > + * disabled. > + */ > + if (phydev->wol_enabled) > + return; > + > + /* On shutdown disable irqs to prevent an irq storm on systems where the > + * irq line is shared by several devices. > + */ > phy_disable_interrupts(phydev); > } > > -- > 2.40.1 > > I think the idea is not bad and something along these lines might be the way to go, but I don't think it works (as currently implemented, and tested by me, prints below). Upon a quick search, phydev->wol_enabled is only set from phy_suspend(), and phy_suspend() isn't invoked from the ethnl_set_wol() call stack. Confirmed this way: $ ethtool -s end0 wol g # &enet0 in the device tree $ reboot -f Rebooting. [ 288.682444] Qualcomm Atheros AR8031/AR8033 mdio@2d24000:02: phy_shutdown: wol_enabled 0 [ 288.690935] Qualcomm Atheros AR8031/AR8033 mdio@2d24000:01: phy_shutdown: wol_enabled 0 [ 288.736145] reboot: Restarting system This is on the arch/arm/boot/dts/nxp/ls/ls1021a-tsn.dts, the same board as the one which Ioana worked on (with the shared AR8031 PHY interrupts). Sure, it needs to be mentioned that WoL + shared PHY interrupts is not by any means an impossible combination, but it still won't work reliably. I guess that's ok temporarily, since WoL requires user opt-in, so in the default configuration, the LS1021A-TSN is not broken by this change (in its functional variant). pw-bot: cr
On Tue, Aug 08, 2023 at 11:56:56PM +0100, Russell King (Oracle) wrote: > On Tue, Aug 08, 2023 at 02:59:41PM -0700, Florian Fainelli wrote: > > On 8/8/23 14:53, Jakub Kicinski wrote: > > > On Fri, 4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote: > > > > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks > > > > WoL at least on PHYs covered by the marvell driver. So skip disabling > > > > irqs on shutdown if WoL is enabled. > > > > > > > > While at it also explain the motivation that irqs are disabled at all. > > > > > > > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure") > > > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> > > > > > > What do we do with this one? It sounded like Russell was leaning > > > towards a revert? > > > > Yes, though I believe this will create a different kind of regression for > > what Iona was addressing initially. Then it becomes a choice of which > > regression do we consider to be the worst to handle until something better > > comes up. > > > > Russell what are your thoughts? > > In this situation where a fix for a problem is provided which then > causes a regression by fixing that problem, I've seen Linus T state > that it means the fix was incorrect. That seems entirely sensible. > > We are, of course, in the situation where reverting the commit > restores the old behaviour and thus fixes a regression, but causes > a regression for another user. > > If it is possible to quickly come up with a fix that avoids any > regression to either use case, then that is obviously preferable. > However, if that's not possible, then it seems going back to the > original situation (i.o.w. reverting) is sensible. > > Now, the fact is that many PHYs do use their interrupts to signal > that a wake-up happened, and disabling the IRQ from the PHY will > prevent WoL from working. Other PHYs have a separate pin for this. > Two recent examples are AR8035, which only has a single interrupt > pin which covers all interrupts from that PHY, and AR8031 or AR8033 > which have a separate WOL_INT pin which might be used - or the > main interrupt pin. > > If we hibernate the system, then people how have configured WoL > are going to expect it to work - but disabling the ability for > the PHY to raise an interrupt will prevent it. > > So, clearly always disabling PHY interrupts can have a detrimental > effect on the ability to wake a system up using WoL - where the PHY > interrupt is used to signal WoL to the rest of the system. > > Now, if waking the system up from hibernation using WoL involves > the PHY asserting its interrupt pin, then the system must be > capable of dealing with the PHY asserting its interrupt while the > system is booting. Remember that the way Linux hibernation works, > that boot is just the same as a regular boot right up through the > normal kernel initialisation. It is only towards the end that the > kernel detects the signature in swap space, and then does the > funky stuff to resume from the saved data. > > So, during that boot, the system has to cope with that interrupt > having been asserted by the PHY hardware. Either system firmware > has to recognise that was the wake-up event and deal with it (e.g. > disabling the interrupt source) before passing control to the > kernel, or the kernel has to be able to cope with that interrupt > being stuck at active state until the PHY driver can deal with it. > > Obviously, if WoL is not enabled or supported, then disabling the > PHY interrupt should be harmless - but that will have the effect > of masking any issues that a platform may have until PHY based > WoL has been enabled. > > Also, don't forget that we have this kexec thing - and the > .shutdown methods will be called just before handing control to > the new kernel. > > Uwe's patch solves the problem that he's experiencing - because > it makes the interrupt disabling dependent on the WoL configuration. > > However, Ioana's problem would still remain - and enabling WoL on > that platform will make it reappear - and thus it still needs to be > properly fixed. > > If that problem is properly fixed, then we don't need to disable > PHY interrupts, which means a revert would be the right approach. Yes, my initial problem would still remain if WoL is enabled on any of the PHYs that have a shared IRQ line. The problem is that I find this combination - shared IRQ and WoL enabled - impossible to make it work. Maybe we could look into denying WoL from being enabled on PHYs which have a shared interrupt line. > > Honestly, I don't know what would be best - and I don't believe we've > heard from Ioana about the problem that was trying to be addressed > (e.g. exactly when it happened and why.) > > If I had to guess: > - the PHY in question may be sharing an interrupt with another device > - when that other device probes and claims its interrupt, an interrupt > storm ensues > - the interrupt layer disables the interrupt input, rendering both the > PHY and other device unusable. > That's a perfect summary of the problem that I was trying to fix. The board in question is a LS1021ATSN which has two AR8031 PHYs that share an interrupt line. In case only one of the PHYs is probed and there are pending interrupts on the PHY#2 an IRQ storm will happen since there is no entity to clear the interrupt from PHY#2's registers. PHY#1's driver will get stuck in .handle_interrupt() indefinitely. > I think I've covered all the possibilities, all the issues, outcomes, > and the politics as far as Linus T would state. I'm also quite sure > that there will be no way to satisfy everyone! > > Bearing in mind that it is holiday season, and we're at -rc5, I > think we should give Ioana a bit more time to respond before we > make a decision. Maybe a little over a week? If we don't hear anything, > then I think following our established policy and reverting would be > the correct way forward. >
On Wed, Aug 09, 2023 at 05:21:55PM +0300, Ioana Ciornei wrote: > That's a perfect summary of the problem that I was trying to fix. > > The board in question is a LS1021ATSN which has two AR8031 PHYs that > share an interrupt line. In case only one of the PHYs is probed and > there are pending interrupts on the PHY#2 an IRQ storm will happen > since there is no entity to clear the interrupt from PHY#2's registers. > PHY#1's driver will get stuck in .handle_interrupt() indefinitely. So I have two further questions: 1. Is WoL able to be supported on this hardware? 2. AR8031 has a seperate WOL_INT signal that can be used to wake up the system. Is this used in the hardware design? Thanks.
On 8/9/2023 6:58 AM, Vladimir Oltean wrote: > Hi Uwe, > > (I hope the threading won't be broken) > > On Fri, Aug 04, 2023 at 09:17:57AM +0200, Uwe Kleine-König wrote: >> Most PHYs signal WoL using an interrupt. So disabling interrupts breaks >> WoL at least on PHYs covered by the marvell driver. So skip disabling >> irqs on shutdown if WoL is enabled. >> >> While at it also explain the motivation that irqs are disabled at all. >> >> Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure") >> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> >> --- >> Hello, >> >> while I'm not sure that disabling interrupts is a good idea in general, >> this change at least should fix the WoL case. Note that this change is >> only compile tested as next doesn't boot on my test machine (because of >> https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa) >> and 6.1 (which is the other kernel I have running) doesn't know about >> .wol_enabled. I don't want to delay this fix until I bisected this new >> issue. >> >> Assuming this patch is eligible for backporting to stable, maybe point >> out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers >> to always call into ->suspend()"). Didn't try to backport that. >> >> Best regards >> Uwe >> >> drivers/net/phy/phy_device.c | 9 +++++++++ >> 1 file changed, 9 insertions(+) >> >> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c >> index 61921d4dbb13..6d1526bdd1d7 100644 >> --- a/drivers/net/phy/phy_device.c >> +++ b/drivers/net/phy/phy_device.c >> @@ -3340,6 +3340,15 @@ static void phy_shutdown(struct device *dev) >> if (phydev->state == PHY_READY || !phydev->attached_dev) >> return; >> >> + /* Most phys signal WoL via the irq line. So for these irqs shouldn't be >> + * disabled. >> + */ >> + if (phydev->wol_enabled) >> + return; >> + >> + /* On shutdown disable irqs to prevent an irq storm on systems where the >> + * irq line is shared by several devices. >> + */ >> phy_disable_interrupts(phydev); >> } >> >> -- >> 2.40.1 >> >> > > I think the idea is not bad and something along these lines might be the > way to go, but I don't think it works (as currently implemented, and > tested by me, prints below). > > Upon a quick search, phydev->wol_enabled is only set from phy_suspend(), > and phy_suspend() isn't invoked from the ethnl_set_wol() call stack. You are right, this was an ill advised suggestion from my side here. In principle however we need to have something like this in the shutdown routine to have this patch work: struct ethtool_wolinfo wol = { .cmd = ETHTOOL_GWOL }; struct net_device *netdev = phydev->attached_dev; int ret; phy_ethtool_get_wol(phydev, &wol); phydev->wol_enabled = wol.wolopts || (netdev && netdev->wol_enabled); if (!phydev->wol_enabled) return; this does make me wonder whether Uwe tested with a prior system suspend/resume cycle before shutting down?
On Wed, Aug 09, 2023 at 03:29:17PM +0100, Russell King (Oracle) wrote: > On Wed, Aug 09, 2023 at 05:21:55PM +0300, Ioana Ciornei wrote: > > That's a perfect summary of the problem that I was trying to fix. > > > > The board in question is a LS1021ATSN which has two AR8031 PHYs that > > share an interrupt line. In case only one of the PHYs is probed and > > there are pending interrupts on the PHY#2 an IRQ storm will happen > > since there is no entity to clear the interrupt from PHY#2's registers. > > PHY#1's driver will get stuck in .handle_interrupt() indefinitely. > > So I have two further questions: > 1. Is WoL able to be supported on this hardware? I don't know if anyone cares about WoL on the AR8031 PHYs from this board. Both of the PHYs are used in conjuction with 2 eTSEC controllers - which use the driver in drivers/net/ethernet/freescale/gianfar.c. The Ethernet controller does have WoL support, which means that WoL could still be supported on the board even though we would forbid WoL on the AR8031 PHYs. > 2. AR8031 has a seperate WOL_INT signal that can be used to wake up the > system. Is this used in the hardware design? No, WOL_INT is not connected.
On Wed, Aug 09, 2023 at 06:44:18PM +0300, Ioana Ciornei wrote: > On Wed, Aug 09, 2023 at 03:29:17PM +0100, Russell King (Oracle) wrote: > > On Wed, Aug 09, 2023 at 05:21:55PM +0300, Ioana Ciornei wrote: > > > That's a perfect summary of the problem that I was trying to fix. > > > > > > The board in question is a LS1021ATSN which has two AR8031 PHYs that > > > share an interrupt line. In case only one of the PHYs is probed and > > > there are pending interrupts on the PHY#2 an IRQ storm will happen > > > since there is no entity to clear the interrupt from PHY#2's registers. > > > PHY#1's driver will get stuck in .handle_interrupt() indefinitely. > > > > So I have two further questions: > > 1. Is WoL able to be supported on this hardware? > > I don't know if anyone cares about WoL on the AR8031 PHYs from this > board. > > Both of the PHYs are used in conjuction with 2 eTSEC controllers - which > use the driver in drivers/net/ethernet/freescale/gianfar.c. > > The Ethernet controller does have WoL support, which means that WoL could > still be supported on the board even though we would forbid WoL on the > AR8031 PHYs. > > > 2. AR8031 has a seperate WOL_INT signal that can be used to wake up the > > system. Is this used in the hardware design? > > No, WOL_INT is not connected. Okay, so we don't need to care about WoL for your hardware setup. Thinking about this, I wonder whether we could solve your issue by disabling interrupts when the PHY is probed, rather than disabling them on shutdown - something like this? (not build tested) diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 3e9909b30938..4d1a37487923 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev) goto out; } + phy_disable_interrupts(phydev); + /* Start out supporting everything. Eventually, * a controller will attach, and may modify one * or both of these values @@ -3333,16 +3335,6 @@ static int phy_remove(struct device *dev) return 0; } -static void phy_shutdown(struct device *dev) -{ - struct phy_device *phydev = to_phy_device(dev); - - if (phydev->state == PHY_READY || !phydev->attached_dev) - return; - - phy_disable_interrupts(phydev); -} - /** * phy_driver_register - register a phy_driver with the PHY layer * @new_driver: new phy_driver to register @@ -3376,7 +3368,6 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner) new_driver->mdiodrv.driver.bus = &mdio_bus_type; new_driver->mdiodrv.driver.probe = phy_probe; new_driver->mdiodrv.driver.remove = phy_remove; - new_driver->mdiodrv.driver.shutdown = phy_shutdown; new_driver->mdiodrv.driver.owner = owner; new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
> Thinking about this, I wonder whether we could solve your issue by > disabling interrupts when the PHY is probed, rather than disabling > them on shutdown - something like this? (not build tested) > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > index 3e9909b30938..4d1a37487923 100644 > --- a/drivers/net/phy/phy_device.c > +++ b/drivers/net/phy/phy_device.c > @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev) > goto out; > } > > + phy_disable_interrupts(phydev); > + > /* Start out supporting everything. Eventually, > * a controller will attach, and may modify one > * or both of these values At some point, the interrupt is going to be enabled again. And then the WoL interrupt will fire. I think some PHY drivers actually need to see that WoL interrupt. e.g. the marvell driver has this comment: static int m88e1318_set_wol(struct phy_device *phydev, struct ethtool_wolinfo *wol) { .... /* If WOL event happened once, the LED[2] interrupt pin * will not be cleared unless we reading the interrupt status * register. If interrupts are in use, the normal interrupt * handling will clear the WOL event. Clear the WOL event * before enabling it if !phy_interrupt_is_valid() */ So it seems like just probing the marvell PHY is not enough to clear the WoL interrupt. Can we be sure that the other PHY has reached a state it can handle and clear an interrupt when we come to enable the interrupt? I think not, especially in cases like NFS root, where the interface will be put into use as soon as it exists, maybe before the other interface has probed. Andrew
On 8/9/2023 9:21 AM, Andrew Lunn wrote: >> Thinking about this, I wonder whether we could solve your issue by >> disabling interrupts when the PHY is probed, rather than disabling >> them on shutdown - something like this? (not build tested) >> >> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c >> index 3e9909b30938..4d1a37487923 100644 >> --- a/drivers/net/phy/phy_device.c >> +++ b/drivers/net/phy/phy_device.c >> @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev) >> goto out; >> } >> >> + phy_disable_interrupts(phydev); >> + >> /* Start out supporting everything. Eventually, >> * a controller will attach, and may modify one >> * or both of these values > > At some point, the interrupt is going to be enabled again. And then > the WoL interrupt will fire. I think some PHY drivers actually need to > see that WoL interrupt. e.g. the marvell driver has this comment: > > static int m88e1318_set_wol(struct phy_device *phydev, > struct ethtool_wolinfo *wol) > { > .... > /* If WOL event happened once, the LED[2] interrupt pin > * will not be cleared unless we reading the interrupt status > * register. If interrupts are in use, the normal interrupt > * handling will clear the WOL event. Clear the WOL event > * before enabling it if !phy_interrupt_is_valid() > */ > > So it seems like just probing the marvell PHY is not enough to clear > the WoL interrupt. > > Can we be sure that the other PHY has reached a state it can handle > and clear an interrupt when we come to enable the interrupt? I think > not, especially in cases like NFS root, where the interface will be > put into use as soon as it exists, maybe before the other interface > has probed. Does it really make sense to have the PHY be interrupt driven for this specific board configuration if this causes so much hassle?
On Wed, Aug 09, 2023 at 06:21:58PM +0200, Andrew Lunn wrote: > > Thinking about this, I wonder whether we could solve your issue by > > disabling interrupts when the PHY is probed, rather than disabling > > them on shutdown - something like this? (not build tested) > > > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > > index 3e9909b30938..4d1a37487923 100644 > > --- a/drivers/net/phy/phy_device.c > > +++ b/drivers/net/phy/phy_device.c > > @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev) > > goto out; > > } > > > > + phy_disable_interrupts(phydev); > > + > > /* Start out supporting everything. Eventually, > > * a controller will attach, and may modify one > > * or both of these values > > At some point, the interrupt is going to be enabled again. And then > the WoL interrupt will fire. I think some PHY drivers actually need to > see that WoL interrupt. e.g. the marvell driver has this comment: > > static int m88e1318_set_wol(struct phy_device *phydev, > struct ethtool_wolinfo *wol) > { > .... > /* If WOL event happened once, the LED[2] interrupt pin > * will not be cleared unless we reading the interrupt status > * register. If interrupts are in use, the normal interrupt > * handling will clear the WOL event. Clear the WOL event > * before enabling it if !phy_interrupt_is_valid() > */ > > So it seems like just probing the marvell PHY is not enough to clear > the WoL interrupt. > > Can we be sure that the other PHY has reached a state it can handle > and clear an interrupt when we come to enable the interrupt? I think > not, especially in cases like NFS root, where the interface will be > put into use as soon as it exists, maybe before the other interface > has probed. I suppose the question to Ioana would be - are the two AR8031 PHYs on the same MDIO bus? If they are, then we're safe, because both will be probed consecutively (because they're using the same driver.) I know it would be desirable to have a generic solution to this, but I don't think that is sanely achievable if we have multiple different PHYs sharing an interrupt line over multiple different MDIO buses and multiple different PHY drivers. So, I'm suggesting we try to do a best-effort solution to solve Ioana's problem so that we can restore Uwe's WoL behaviour without causing Ioana's issue to regress. It does mean that if someone has a more weird setup (such as I describe in my paragraph above) it won't work, but then it also didn't used to work before Ioana's patch.
Hello Florian, On Wed, Aug 09, 2023 at 08:35:24AM -0700, Florian Fainelli wrote: > this does make me wonder whether Uwe tested with a prior system > suspend/resume cycle before shutting down? No, he didn't. That's why he wrote "Note that this change is only compile tested as next doesn't boot on my test machine (because of https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa) and 6.1 (which is the other kernel I have running) doesn't know about .wol_enabled. I don't want to delay this fix until I bisected this new issue." in the mail containing the patch. The issue in next is resolved, but I didn't come around to test this patch yet. Best regards Uwe
On Wed, Aug 09, 2023 at 08:15:27PM +0100, Russell King (Oracle) wrote: > On Wed, Aug 09, 2023 at 06:21:58PM +0200, Andrew Lunn wrote: > > > Thinking about this, I wonder whether we could solve your issue by > > > disabling interrupts when the PHY is probed, rather than disabling > > > them on shutdown - something like this? (not build tested) > > > > > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > > > index 3e9909b30938..4d1a37487923 100644 > > > --- a/drivers/net/phy/phy_device.c > > > +++ b/drivers/net/phy/phy_device.c > > > @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev) > > > goto out; > > > } > > > > > > + phy_disable_interrupts(phydev); > > > + > > > /* Start out supporting everything. Eventually, > > > * a controller will attach, and may modify one > > > * or both of these values > > > > At some point, the interrupt is going to be enabled again. And then > > the WoL interrupt will fire. I think some PHY drivers actually need to > > see that WoL interrupt. e.g. the marvell driver has this comment: > > > > static int m88e1318_set_wol(struct phy_device *phydev, > > struct ethtool_wolinfo *wol) > > { > > .... > > /* If WOL event happened once, the LED[2] interrupt pin > > * will not be cleared unless we reading the interrupt status > > * register. If interrupts are in use, the normal interrupt > > * handling will clear the WOL event. Clear the WOL event > > * before enabling it if !phy_interrupt_is_valid() > > */ > > > > So it seems like just probing the marvell PHY is not enough to clear > > the WoL interrupt. > > > > Can we be sure that the other PHY has reached a state it can handle > > and clear an interrupt when we come to enable the interrupt? I think > > not, especially in cases like NFS root, where the interface will be > > put into use as soon as it exists, maybe before the other interface > > has probed. > > I suppose the question to Ioana would be - are the two AR8031 PHYs on > the same MDIO bus? If they are, then we're safe, because both will be > probed consecutively (because they're using the same driver.) > Yes, the two AR8031 PHYs are on the same MDIO bus. I just tested your approach to disable the interrupts at phy_probe() and it seems to be working. I also tested with NFS boot using one of the PHYs and it's behaving ok. I think I'm ok with this approach as long as Uwe's problem is also fixed. I don't know how a wake-on-lan procedure works and if it matters if the WoL interrupt is lost before the PHY driver gets to know about it. Ioana
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 61921d4dbb13..6d1526bdd1d7 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -3340,6 +3340,15 @@ static void phy_shutdown(struct device *dev) if (phydev->state == PHY_READY || !phydev->attached_dev) return; + /* Most phys signal WoL via the irq line. So for these irqs shouldn't be + * disabled. + */ + if (phydev->wol_enabled) + return; + + /* On shutdown disable irqs to prevent an irq storm on systems where the + * irq line is shared by several devices. + */ phy_disable_interrupts(phydev); }
Most PHYs signal WoL using an interrupt. So disabling interrupts breaks WoL at least on PHYs covered by the marvell driver. So skip disabling irqs on shutdown if WoL is enabled. While at it also explain the motivation that irqs are disabled at all. Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> --- Hello, while I'm not sure that disabling interrupts is a good idea in general, this change at least should fix the WoL case. Note that this change is only compile tested as next doesn't boot on my test machine (because of https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa) and 6.1 (which is the other kernel I have running) doesn't know about .wol_enabled. I don't want to delay this fix until I bisected this new issue. Assuming this patch is eligible for backporting to stable, maybe point out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers to always call into ->suspend()"). Didn't try to backport that. Best regards Uwe drivers/net/phy/phy_device.c | 9 +++++++++ 1 file changed, 9 insertions(+)