diff mbox series

net: phy: Don't disable irqs on shutdown if WoL is enabled

Message ID 20230804071757.383971-1-u.kleine-koenig@pengutronix.de (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series net: phy: Don't disable irqs on shutdown if WoL is enabled | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1328 this patch: 1328
netdev/cc_maintainers warning 3 maintainers not CCed: pabeni@redhat.com edumazet@google.com davem@davemloft.net
netdev/build_clang success Errors and warnings before: 1351 this patch: 1351
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1351 this patch: 1351
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 15 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Uwe Kleine-König Aug. 4, 2023, 7:17 a.m. UTC
Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
WoL at least on PHYs covered by the marvell driver. So skip disabling
irqs on shutdown if WoL is enabled.

While at it also explain the motivation that irqs are disabled at all.

Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
---
Hello,

while I'm not sure that disabling interrupts is a good idea in general,
this change at least should fix the WoL case. Note that this change is
only compile tested as next doesn't boot on my test machine (because of
https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa)
and 6.1 (which is the other kernel I have running) doesn't know about
.wol_enabled. I don't want to delay this fix until I bisected this new
issue.

Assuming this patch is eligible for backporting to stable, maybe point
out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers
to always call into ->suspend()"). Didn't try to backport that.

Best regards
Uwe

 drivers/net/phy/phy_device.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Jakub Kicinski Aug. 8, 2023, 9:53 p.m. UTC | #1
On Fri,  4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote:
> Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
> WoL at least on PHYs covered by the marvell driver. So skip disabling
> irqs on shutdown if WoL is enabled.
> 
> While at it also explain the motivation that irqs are disabled at all.
> 
> Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

What do we do with this one? It sounded like Russell was leaning
towards a revert?

FTR original report:
https://lore.kernel.org/all/20230803181640.yzxsk2xphwryxww4@pengutronix.de/

> while I'm not sure that disabling interrupts is a good idea in general,
> this change at least should fix the WoL case. Note that this change is
> only compile tested as next doesn't boot on my test machine (because of
> https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa)
> and 6.1 (which is the other kernel I have running) doesn't know about
> .wol_enabled. I don't want to delay this fix until I bisected this new
> issue.
> 
> Assuming this patch is eligible for backporting to stable, maybe point
> out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers
> to always call into ->suspend()"). Didn't try to backport that.
> 
> Best regards
> Uwe
> 
>  drivers/net/phy/phy_device.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 61921d4dbb13..6d1526bdd1d7 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -3340,6 +3340,15 @@ static void phy_shutdown(struct device *dev)
>  	if (phydev->state == PHY_READY || !phydev->attached_dev)
>  		return;
>  
> +	/* Most phys signal WoL via the irq line. So for these irqs shouldn't be
> +	 * disabled.
> +	 */
> +	if (phydev->wol_enabled)
> +		return;
> +
> +	/* On shutdown disable irqs to prevent an irq storm on systems where the
> +	 * irq line is shared by several devices.
> +	 */
>  	phy_disable_interrupts(phydev);
>  }
>
Florian Fainelli Aug. 8, 2023, 9:59 p.m. UTC | #2
On 8/8/23 14:53, Jakub Kicinski wrote:
> On Fri,  4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote:
>> Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
>> WoL at least on PHYs covered by the marvell driver. So skip disabling
>> irqs on shutdown if WoL is enabled.
>>
>> While at it also explain the motivation that irqs are disabled at all.
>>
>> Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
>> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> 
> What do we do with this one? It sounded like Russell was leaning
> towards a revert?

Yes, though I believe this will create a different kind of regression 
for what Iona was addressing initially. Then it becomes a choice of 
which regression do we consider to be the worst to handle until 
something better comes up.

Russell what are your thoughts?
Russell King (Oracle) Aug. 8, 2023, 10:56 p.m. UTC | #3
On Tue, Aug 08, 2023 at 02:59:41PM -0700, Florian Fainelli wrote:
> On 8/8/23 14:53, Jakub Kicinski wrote:
> > On Fri,  4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote:
> > > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
> > > WoL at least on PHYs covered by the marvell driver. So skip disabling
> > > irqs on shutdown if WoL is enabled.
> > > 
> > > While at it also explain the motivation that irqs are disabled at all.
> > > 
> > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
> > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> > 
> > What do we do with this one? It sounded like Russell was leaning
> > towards a revert?
> 
> Yes, though I believe this will create a different kind of regression for
> what Iona was addressing initially. Then it becomes a choice of which
> regression do we consider to be the worst to handle until something better
> comes up.
> 
> Russell what are your thoughts?

In this situation where a fix for a problem is provided which then
causes a regression by fixing that problem, I've seen Linus T state
that it means the fix was incorrect. That seems entirely sensible.

We are, of course, in the situation where reverting the commit
restores the old behaviour and thus fixes a regression, but causes
a regression for another user.

If it is possible to quickly come up with a fix that avoids any
regression to either use case, then that is obviously preferable.
However, if that's not possible, then it seems going back to the
original situation (i.o.w. reverting) is sensible.

Now, the fact is that many PHYs do use their interrupts to signal
that a wake-up happened, and disabling the IRQ from the PHY will
prevent WoL from working. Other PHYs have a separate pin for this.
Two recent examples are AR8035, which only has a single interrupt
pin which covers all interrupts from that PHY, and AR8031 or AR8033
which have a separate WOL_INT pin which might be used - or the
main interrupt pin.

If we hibernate the system, then people how have configured WoL
are going to expect it to work - but disabling the ability for
the PHY to raise an interrupt will prevent it.

So, clearly always disabling PHY interrupts can have a detrimental
effect on the ability to wake a system up using WoL - where the PHY
interrupt is used to signal WoL to the rest of the system.

Now, if waking the system up from hibernation using WoL involves
the PHY asserting its interrupt pin, then the system must be
capable of dealing with the PHY asserting its interrupt while the
system is booting. Remember that the way Linux hibernation works,
that boot is just the same as a regular boot right up through the
normal kernel initialisation. It is only towards the end that the
kernel detects the signature in swap space, and then does the
funky stuff to resume from the saved data.

So, during that boot, the system has to cope with that interrupt
having been asserted by the PHY hardware. Either system firmware
has to recognise that was the wake-up event and deal with it (e.g.
disabling the interrupt source) before passing control to the
kernel, or the kernel has to be able to cope with that interrupt
being stuck at active state until the PHY driver can deal with it.

Obviously, if WoL is not enabled or supported, then disabling the
PHY interrupt should be harmless - but that will have the effect
of masking any issues that a platform may have until PHY based
WoL has been enabled.

Also, don't forget that we have this kexec thing - and the
.shutdown methods will be called just before handing control to
the new kernel.

Uwe's patch solves the problem that he's experiencing - because
it makes the interrupt disabling dependent on the WoL configuration.

However, Ioana's problem would still remain - and enabling WoL on
that platform will make it reappear - and thus it still needs to be
properly fixed.

If that problem is properly fixed, then we don't need to disable
PHY interrupts, which means a revert would be the right approach.

Honestly, I don't know what would be best - and I don't believe we've
heard from Ioana about the problem that was trying to be addressed
(e.g. exactly when it happened and why.)

If I had to guess:
- the PHY in question may be sharing an interrupt with another device
- when that other device probes and claims its interrupt, an interrupt
  storm ensues
- the interrupt layer disables the interrupt input, rendering both the
  PHY and other device unusable.

I think I've covered all the possibilities, all the issues, outcomes,
and the politics as far as Linus T would state. I'm also quite sure
that there will be no way to satisfy everyone!

Bearing in mind that it is holiday season, and we're at -rc5, I
think we should give Ioana a bit more time to respond before we
make a decision. Maybe a little over a week? If we don't hear anything,
then I think following our established policy and reverting would be
the correct way forward.
Vladimir Oltean Aug. 9, 2023, 1:58 p.m. UTC | #4
Hi Uwe,

(I hope the threading won't be broken)

On Fri, Aug 04, 2023 at 09:17:57AM +0200, Uwe Kleine-König wrote:
> Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
> WoL at least on PHYs covered by the marvell driver. So skip disabling
> irqs on shutdown if WoL is enabled.
> 
> While at it also explain the motivation that irqs are disabled at all.
> 
> Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> ---
> Hello,
> 
> while I'm not sure that disabling interrupts is a good idea in general,
> this change at least should fix the WoL case. Note that this change is
> only compile tested as next doesn't boot on my test machine (because of
> https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa)
> and 6.1 (which is the other kernel I have running) doesn't know about
> .wol_enabled. I don't want to delay this fix until I bisected this new
> issue.
> 
> Assuming this patch is eligible for backporting to stable, maybe point
> out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers
> to always call into ->suspend()"). Didn't try to backport that.
> 
> Best regards
> Uwe
> 
>  drivers/net/phy/phy_device.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 61921d4dbb13..6d1526bdd1d7 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -3340,6 +3340,15 @@ static void phy_shutdown(struct device *dev)
>  	if (phydev->state == PHY_READY || !phydev->attached_dev)
>  		return;
>  
> +	/* Most phys signal WoL via the irq line. So for these irqs shouldn't be
> +	 * disabled.
> +	 */
> +	if (phydev->wol_enabled)
> +		return;
> +
> +	/* On shutdown disable irqs to prevent an irq storm on systems where the
> +	 * irq line is shared by several devices.
> +	 */
>  	phy_disable_interrupts(phydev);
>  }
>  
> -- 
> 2.40.1
> 
> 

I think the idea is not bad and something along these lines might be the
way to go, but I don't think it works (as currently implemented, and
tested by me, prints below).

Upon a quick search, phydev->wol_enabled is only set from phy_suspend(),
and phy_suspend() isn't invoked from the ethnl_set_wol() call stack.

Confirmed this way:

$ ethtool -s end0 wol g # &enet0 in the device tree
$ reboot -f
Rebooting.
[  288.682444] Qualcomm Atheros AR8031/AR8033 mdio@2d24000:02: phy_shutdown: wol_enabled 0
[  288.690935] Qualcomm Atheros AR8031/AR8033 mdio@2d24000:01: phy_shutdown: wol_enabled 0
[  288.736145] reboot: Restarting system

This is on the arch/arm/boot/dts/nxp/ls/ls1021a-tsn.dts, the same board
as the one which Ioana worked on (with the shared AR8031 PHY interrupts).

Sure, it needs to be mentioned that WoL + shared PHY interrupts is not
by any means an impossible combination, but it still won't work reliably.
I guess that's ok temporarily, since WoL requires user opt-in, so in the
default configuration, the LS1021A-TSN is not broken by this change (in
its functional variant).

pw-bot: cr
Ioana Ciornei Aug. 9, 2023, 2:21 p.m. UTC | #5
On Tue, Aug 08, 2023 at 11:56:56PM +0100, Russell King (Oracle) wrote:
> On Tue, Aug 08, 2023 at 02:59:41PM -0700, Florian Fainelli wrote:
> > On 8/8/23 14:53, Jakub Kicinski wrote:
> > > On Fri,  4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote:
> > > > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
> > > > WoL at least on PHYs covered by the marvell driver. So skip disabling
> > > > irqs on shutdown if WoL is enabled.
> > > > 
> > > > While at it also explain the motivation that irqs are disabled at all.
> > > > 
> > > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
> > > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> > > 
> > > What do we do with this one? It sounded like Russell was leaning
> > > towards a revert?
> > 
> > Yes, though I believe this will create a different kind of regression for
> > what Iona was addressing initially. Then it becomes a choice of which
> > regression do we consider to be the worst to handle until something better
> > comes up.
> > 
> > Russell what are your thoughts?
> 
> In this situation where a fix for a problem is provided which then
> causes a regression by fixing that problem, I've seen Linus T state
> that it means the fix was incorrect. That seems entirely sensible.
> 
> We are, of course, in the situation where reverting the commit
> restores the old behaviour and thus fixes a regression, but causes
> a regression for another user.
> 
> If it is possible to quickly come up with a fix that avoids any
> regression to either use case, then that is obviously preferable.
> However, if that's not possible, then it seems going back to the
> original situation (i.o.w. reverting) is sensible.
> 
> Now, the fact is that many PHYs do use their interrupts to signal
> that a wake-up happened, and disabling the IRQ from the PHY will
> prevent WoL from working. Other PHYs have a separate pin for this.
> Two recent examples are AR8035, which only has a single interrupt
> pin which covers all interrupts from that PHY, and AR8031 or AR8033
> which have a separate WOL_INT pin which might be used - or the
> main interrupt pin.
> 
> If we hibernate the system, then people how have configured WoL
> are going to expect it to work - but disabling the ability for
> the PHY to raise an interrupt will prevent it.
> 
> So, clearly always disabling PHY interrupts can have a detrimental
> effect on the ability to wake a system up using WoL - where the PHY
> interrupt is used to signal WoL to the rest of the system.
> 
> Now, if waking the system up from hibernation using WoL involves
> the PHY asserting its interrupt pin, then the system must be
> capable of dealing with the PHY asserting its interrupt while the
> system is booting. Remember that the way Linux hibernation works,
> that boot is just the same as a regular boot right up through the
> normal kernel initialisation. It is only towards the end that the
> kernel detects the signature in swap space, and then does the
> funky stuff to resume from the saved data.
> 
> So, during that boot, the system has to cope with that interrupt
> having been asserted by the PHY hardware. Either system firmware
> has to recognise that was the wake-up event and deal with it (e.g.
> disabling the interrupt source) before passing control to the
> kernel, or the kernel has to be able to cope with that interrupt
> being stuck at active state until the PHY driver can deal with it.
> 
> Obviously, if WoL is not enabled or supported, then disabling the
> PHY interrupt should be harmless - but that will have the effect
> of masking any issues that a platform may have until PHY based
> WoL has been enabled.
> 
> Also, don't forget that we have this kexec thing - and the
> .shutdown methods will be called just before handing control to
> the new kernel.
> 
> Uwe's patch solves the problem that he's experiencing - because
> it makes the interrupt disabling dependent on the WoL configuration.
> 
> However, Ioana's problem would still remain - and enabling WoL on
> that platform will make it reappear - and thus it still needs to be
> properly fixed.
> 
> If that problem is properly fixed, then we don't need to disable
> PHY interrupts, which means a revert would be the right approach.

Yes, my initial problem would still remain if WoL is enabled on any of
the PHYs that have a shared IRQ line.

The problem is that I find this combination - shared IRQ and WoL enabled
- impossible to make it work. Maybe we could look into denying WoL from
being enabled on PHYs which have a shared interrupt line.

> 
> Honestly, I don't know what would be best - and I don't believe we've
> heard from Ioana about the problem that was trying to be addressed
> (e.g. exactly when it happened and why.)
> 
> If I had to guess:
> - the PHY in question may be sharing an interrupt with another device
> - when that other device probes and claims its interrupt, an interrupt
>   storm ensues
> - the interrupt layer disables the interrupt input, rendering both the
>   PHY and other device unusable.
> 

That's a perfect summary of the problem that I was trying to fix.

The board in question is a LS1021ATSN which has two AR8031 PHYs that
share an interrupt line. In case only one of the PHYs is probed and
there are pending interrupts on the PHY#2 an IRQ storm will happen
since there is no entity to clear the interrupt from PHY#2's registers.
PHY#1's driver will get stuck in .handle_interrupt() indefinitely.

> I think I've covered all the possibilities, all the issues, outcomes,
> and the politics as far as Linus T would state. I'm also quite sure
> that there will be no way to satisfy everyone!
> 
> Bearing in mind that it is holiday season, and we're at -rc5, I
> think we should give Ioana a bit more time to respond before we
> make a decision. Maybe a little over a week? If we don't hear anything,
> then I think following our established policy and reverting would be
> the correct way forward.
>
Russell King (Oracle) Aug. 9, 2023, 2:29 p.m. UTC | #6
On Wed, Aug 09, 2023 at 05:21:55PM +0300, Ioana Ciornei wrote:
> That's a perfect summary of the problem that I was trying to fix.
> 
> The board in question is a LS1021ATSN which has two AR8031 PHYs that
> share an interrupt line. In case only one of the PHYs is probed and
> there are pending interrupts on the PHY#2 an IRQ storm will happen
> since there is no entity to clear the interrupt from PHY#2's registers.
> PHY#1's driver will get stuck in .handle_interrupt() indefinitely.

So I have two further questions:
1. Is WoL able to be supported on this hardware?
2. AR8031 has a seperate WOL_INT signal that can be used to wake up the
   system. Is this used in the hardware design?

Thanks.
Florian Fainelli Aug. 9, 2023, 3:35 p.m. UTC | #7
On 8/9/2023 6:58 AM, Vladimir Oltean wrote:
> Hi Uwe,
> 
> (I hope the threading won't be broken)
> 
> On Fri, Aug 04, 2023 at 09:17:57AM +0200, Uwe Kleine-König wrote:
>> Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
>> WoL at least on PHYs covered by the marvell driver. So skip disabling
>> irqs on shutdown if WoL is enabled.
>>
>> While at it also explain the motivation that irqs are disabled at all.
>>
>> Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
>> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
>> ---
>> Hello,
>>
>> while I'm not sure that disabling interrupts is a good idea in general,
>> this change at least should fix the WoL case. Note that this change is
>> only compile tested as next doesn't boot on my test machine (because of
>> https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa)
>> and 6.1 (which is the other kernel I have running) doesn't know about
>> .wol_enabled. I don't want to delay this fix until I bisected this new
>> issue.
>>
>> Assuming this patch is eligible for backporting to stable, maybe point
>> out that it depends on v6.5-rc1~163^2~286^2~2 ("net: phy: Allow drivers
>> to always call into ->suspend()"). Didn't try to backport that.
>>
>> Best regards
>> Uwe
>>
>>   drivers/net/phy/phy_device.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 61921d4dbb13..6d1526bdd1d7 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -3340,6 +3340,15 @@ static void phy_shutdown(struct device *dev)
>>   	if (phydev->state == PHY_READY || !phydev->attached_dev)
>>   		return;
>>   
>> +	/* Most phys signal WoL via the irq line. So for these irqs shouldn't be
>> +	 * disabled.
>> +	 */
>> +	if (phydev->wol_enabled)
>> +		return;
>> +
>> +	/* On shutdown disable irqs to prevent an irq storm on systems where the
>> +	 * irq line is shared by several devices.
>> +	 */
>>   	phy_disable_interrupts(phydev);
>>   }
>>   
>> -- 
>> 2.40.1
>>
>>
> 
> I think the idea is not bad and something along these lines might be the
> way to go, but I don't think it works (as currently implemented, and
> tested by me, prints below).
> 
> Upon a quick search, phydev->wol_enabled is only set from phy_suspend(),
> and phy_suspend() isn't invoked from the ethnl_set_wol() call stack.

You are right, this was an ill advised suggestion from my side here. In 
principle however we need to have something like this in the shutdown 
routine to have this patch work:

        struct ethtool_wolinfo wol = { .cmd = ETHTOOL_GWOL };
        struct net_device *netdev = phydev->attached_dev;
        int ret;

        phy_ethtool_get_wol(phydev, &wol);
        phydev->wol_enabled = wol.wolopts || (netdev && 
netdev->wol_enabled);

	if (!phydev->wol_enabled)
		return;

this does make me wonder whether Uwe tested with a prior system 
suspend/resume cycle before shutting down?
Ioana Ciornei Aug. 9, 2023, 3:44 p.m. UTC | #8
On Wed, Aug 09, 2023 at 03:29:17PM +0100, Russell King (Oracle) wrote:
> On Wed, Aug 09, 2023 at 05:21:55PM +0300, Ioana Ciornei wrote:
> > That's a perfect summary of the problem that I was trying to fix.
> > 
> > The board in question is a LS1021ATSN which has two AR8031 PHYs that
> > share an interrupt line. In case only one of the PHYs is probed and
> > there are pending interrupts on the PHY#2 an IRQ storm will happen
> > since there is no entity to clear the interrupt from PHY#2's registers.
> > PHY#1's driver will get stuck in .handle_interrupt() indefinitely.
> 
> So I have two further questions:
> 1. Is WoL able to be supported on this hardware?

I don't know if anyone cares about WoL on the AR8031 PHYs from this
board.

Both of the PHYs are used in conjuction with 2 eTSEC controllers - which
use the driver in drivers/net/ethernet/freescale/gianfar.c.

The Ethernet controller does have WoL support, which means that WoL could
still be supported on the board even though we would forbid WoL on the
AR8031 PHYs.

> 2. AR8031 has a seperate WOL_INT signal that can be used to wake up the
>    system. Is this used in the hardware design?

No, WOL_INT is not connected.
Russell King (Oracle) Aug. 9, 2023, 4:01 p.m. UTC | #9
On Wed, Aug 09, 2023 at 06:44:18PM +0300, Ioana Ciornei wrote:
> On Wed, Aug 09, 2023 at 03:29:17PM +0100, Russell King (Oracle) wrote:
> > On Wed, Aug 09, 2023 at 05:21:55PM +0300, Ioana Ciornei wrote:
> > > That's a perfect summary of the problem that I was trying to fix.
> > > 
> > > The board in question is a LS1021ATSN which has two AR8031 PHYs that
> > > share an interrupt line. In case only one of the PHYs is probed and
> > > there are pending interrupts on the PHY#2 an IRQ storm will happen
> > > since there is no entity to clear the interrupt from PHY#2's registers.
> > > PHY#1's driver will get stuck in .handle_interrupt() indefinitely.
> > 
> > So I have two further questions:
> > 1. Is WoL able to be supported on this hardware?
> 
> I don't know if anyone cares about WoL on the AR8031 PHYs from this
> board.
> 
> Both of the PHYs are used in conjuction with 2 eTSEC controllers - which
> use the driver in drivers/net/ethernet/freescale/gianfar.c.
> 
> The Ethernet controller does have WoL support, which means that WoL could
> still be supported on the board even though we would forbid WoL on the
> AR8031 PHYs.
> 
> > 2. AR8031 has a seperate WOL_INT signal that can be used to wake up the
> >    system. Is this used in the hardware design?
> 
> No, WOL_INT is not connected.

Okay, so we don't need to care about WoL for your hardware setup.

Thinking about this, I wonder whether we could solve your issue by
disabling interrupts when the PHY is probed, rather than disabling
them on shutdown - something like this? (not build tested)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 3e9909b30938..4d1a37487923 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev)
 			goto out;
 	}
 
+        phy_disable_interrupts(phydev);
+
 	/* Start out supporting everything. Eventually,
 	 * a controller will attach, and may modify one
 	 * or both of these values
@@ -3333,16 +3335,6 @@ static int phy_remove(struct device *dev)
 	return 0;
 }
 
-static void phy_shutdown(struct device *dev)
-{
-	struct phy_device *phydev = to_phy_device(dev);
-
-	if (phydev->state == PHY_READY || !phydev->attached_dev)
-		return;
-
-	phy_disable_interrupts(phydev);
-}
-
 /**
  * phy_driver_register - register a phy_driver with the PHY layer
  * @new_driver: new phy_driver to register
@@ -3376,7 +3368,6 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
 	new_driver->mdiodrv.driver.bus = &mdio_bus_type;
 	new_driver->mdiodrv.driver.probe = phy_probe;
 	new_driver->mdiodrv.driver.remove = phy_remove;
-	new_driver->mdiodrv.driver.shutdown = phy_shutdown;
 	new_driver->mdiodrv.driver.owner = owner;
 	new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
Andrew Lunn Aug. 9, 2023, 4:21 p.m. UTC | #10
> Thinking about this, I wonder whether we could solve your issue by
> disabling interrupts when the PHY is probed, rather than disabling
> them on shutdown - something like this? (not build tested)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 3e9909b30938..4d1a37487923 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev)
>  			goto out;
>  	}
>  
> +        phy_disable_interrupts(phydev);
> +
>  	/* Start out supporting everything. Eventually,
>  	 * a controller will attach, and may modify one
>  	 * or both of these values

At some point, the interrupt is going to be enabled again. And then
the WoL interrupt will fire. I think some PHY drivers actually need to
see that WoL interrupt. e.g. the marvell driver has this comment:

static int m88e1318_set_wol(struct phy_device *phydev,
                            struct ethtool_wolinfo *wol)
{
....
                /* If WOL event happened once, the LED[2] interrupt pin
                 * will not be cleared unless we reading the interrupt status
                 * register. If interrupts are in use, the normal interrupt
                 * handling will clear the WOL event. Clear the WOL event
                 * before enabling it if !phy_interrupt_is_valid()
                 */

So it seems like just probing the marvell PHY is not enough to clear
the WoL interrupt.

Can we be sure that the other PHY has reached a state it can handle
and clear an interrupt when we come to enable the interrupt? I think
not, especially in cases like NFS root, where the interface will be
put into use as soon as it exists, maybe before the other interface
has probed.

	Andrew
Florian Fainelli Aug. 9, 2023, 5:04 p.m. UTC | #11
On 8/9/2023 9:21 AM, Andrew Lunn wrote:
>> Thinking about this, I wonder whether we could solve your issue by
>> disabling interrupts when the PHY is probed, rather than disabling
>> them on shutdown - something like this? (not build tested)
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 3e9909b30938..4d1a37487923 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev)
>>   			goto out;
>>   	}
>>   
>> +        phy_disable_interrupts(phydev);
>> +
>>   	/* Start out supporting everything. Eventually,
>>   	 * a controller will attach, and may modify one
>>   	 * or both of these values
> 
> At some point, the interrupt is going to be enabled again. And then
> the WoL interrupt will fire. I think some PHY drivers actually need to
> see that WoL interrupt. e.g. the marvell driver has this comment:
> 
> static int m88e1318_set_wol(struct phy_device *phydev,
>                              struct ethtool_wolinfo *wol)
> {
> ....
>                  /* If WOL event happened once, the LED[2] interrupt pin
>                   * will not be cleared unless we reading the interrupt status
>                   * register. If interrupts are in use, the normal interrupt
>                   * handling will clear the WOL event. Clear the WOL event
>                   * before enabling it if !phy_interrupt_is_valid()
>                   */
> 
> So it seems like just probing the marvell PHY is not enough to clear
> the WoL interrupt.
> 
> Can we be sure that the other PHY has reached a state it can handle
> and clear an interrupt when we come to enable the interrupt? I think
> not, especially in cases like NFS root, where the interface will be
> put into use as soon as it exists, maybe before the other interface
> has probed.

Does it really make sense to have the PHY be interrupt driven for this 
specific board configuration if this causes so much hassle?
Russell King (Oracle) Aug. 9, 2023, 7:15 p.m. UTC | #12
On Wed, Aug 09, 2023 at 06:21:58PM +0200, Andrew Lunn wrote:
> > Thinking about this, I wonder whether we could solve your issue by
> > disabling interrupts when the PHY is probed, rather than disabling
> > them on shutdown - something like this? (not build tested)
> > 
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index 3e9909b30938..4d1a37487923 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev)
> >  			goto out;
> >  	}
> >  
> > +        phy_disable_interrupts(phydev);
> > +
> >  	/* Start out supporting everything. Eventually,
> >  	 * a controller will attach, and may modify one
> >  	 * or both of these values
> 
> At some point, the interrupt is going to be enabled again. And then
> the WoL interrupt will fire. I think some PHY drivers actually need to
> see that WoL interrupt. e.g. the marvell driver has this comment:
> 
> static int m88e1318_set_wol(struct phy_device *phydev,
>                             struct ethtool_wolinfo *wol)
> {
> ....
>                 /* If WOL event happened once, the LED[2] interrupt pin
>                  * will not be cleared unless we reading the interrupt status
>                  * register. If interrupts are in use, the normal interrupt
>                  * handling will clear the WOL event. Clear the WOL event
>                  * before enabling it if !phy_interrupt_is_valid()
>                  */
> 
> So it seems like just probing the marvell PHY is not enough to clear
> the WoL interrupt.
> 
> Can we be sure that the other PHY has reached a state it can handle
> and clear an interrupt when we come to enable the interrupt? I think
> not, especially in cases like NFS root, where the interface will be
> put into use as soon as it exists, maybe before the other interface
> has probed.

I suppose the question to Ioana would be - are the two AR8031 PHYs on
the same MDIO bus? If they are, then we're safe, because both will be
probed consecutively (because they're using the same driver.)

I know it would be desirable to have a generic solution to this, but
I don't think that is sanely achievable if we have multiple different
PHYs sharing an interrupt line over multiple different MDIO buses
and multiple different PHY drivers.

So, I'm suggesting we try to do a best-effort solution to solve Ioana's
problem so that we can restore Uwe's WoL behaviour without causing
Ioana's issue to regress. It does mean that if someone has a more weird
setup (such as I describe in my paragraph above) it won't work, but
then it also didn't used to work before Ioana's patch.
Uwe Kleine-König Aug. 10, 2023, 6:32 a.m. UTC | #13
Hello Florian,

On Wed, Aug 09, 2023 at 08:35:24AM -0700, Florian Fainelli wrote:
> this does make me wonder whether Uwe tested with a prior system
> suspend/resume cycle before shutting down?

No, he didn't. That's why he wrote "Note that this change is
only compile tested as next doesn't boot on my test machine (because of
https://git.kernel.org/linus/b3574f579ece24439c90e9a179742c61205fbcfa)
and 6.1 (which is the other kernel I have running) doesn't know about
.wol_enabled. I don't want to delay this fix until I bisected this new
issue." in the mail containing the patch.

The issue in next is resolved, but I didn't come around to test this
patch yet.

Best regards
Uwe
Ioana Ciornei Aug. 10, 2023, 11:01 a.m. UTC | #14
On Wed, Aug 09, 2023 at 08:15:27PM +0100, Russell King (Oracle) wrote:
> On Wed, Aug 09, 2023 at 06:21:58PM +0200, Andrew Lunn wrote:
> > > Thinking about this, I wonder whether we could solve your issue by
> > > disabling interrupts when the PHY is probed, rather than disabling
> > > them on shutdown - something like this? (not build tested)
> > > 
> > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > > index 3e9909b30938..4d1a37487923 100644
> > > --- a/drivers/net/phy/phy_device.c
> > > +++ b/drivers/net/phy/phy_device.c
> > > @@ -3216,6 +3216,8 @@ static int phy_probe(struct device *dev)
> > >  			goto out;
> > >  	}
> > >  
> > > +        phy_disable_interrupts(phydev);
> > > +
> > >  	/* Start out supporting everything. Eventually,
> > >  	 * a controller will attach, and may modify one
> > >  	 * or both of these values
> > 
> > At some point, the interrupt is going to be enabled again. And then
> > the WoL interrupt will fire. I think some PHY drivers actually need to
> > see that WoL interrupt. e.g. the marvell driver has this comment:
> > 
> > static int m88e1318_set_wol(struct phy_device *phydev,
> >                             struct ethtool_wolinfo *wol)
> > {
> > ....
> >                 /* If WOL event happened once, the LED[2] interrupt pin
> >                  * will not be cleared unless we reading the interrupt status
> >                  * register. If interrupts are in use, the normal interrupt
> >                  * handling will clear the WOL event. Clear the WOL event
> >                  * before enabling it if !phy_interrupt_is_valid()
> >                  */
> > 
> > So it seems like just probing the marvell PHY is not enough to clear
> > the WoL interrupt.
> > 
> > Can we be sure that the other PHY has reached a state it can handle
> > and clear an interrupt when we come to enable the interrupt? I think
> > not, especially in cases like NFS root, where the interface will be
> > put into use as soon as it exists, maybe before the other interface
> > has probed.
> 
> I suppose the question to Ioana would be - are the two AR8031 PHYs on
> the same MDIO bus? If they are, then we're safe, because both will be
> probed consecutively (because they're using the same driver.)
> 

Yes, the two AR8031 PHYs are on the same MDIO bus.

I just tested your approach to disable the interrupts at phy_probe() and
it seems to be working. I also tested with NFS boot using one of the
PHYs and it's behaving ok.

I think I'm ok with this approach as long as Uwe's problem is also
fixed. I don't know how a wake-on-lan procedure works and if it matters
if the WoL interrupt is lost before the PHY driver gets to know about
it.

Ioana
diff mbox series

Patch

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 61921d4dbb13..6d1526bdd1d7 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -3340,6 +3340,15 @@  static void phy_shutdown(struct device *dev)
 	if (phydev->state == PHY_READY || !phydev->attached_dev)
 		return;
 
+	/* Most phys signal WoL via the irq line. So for these irqs shouldn't be
+	 * disabled.
+	 */
+	if (phydev->wol_enabled)
+		return;
+
+	/* On shutdown disable irqs to prevent an irq storm on systems where the
+	 * irq line is shared by several devices.
+	 */
 	phy_disable_interrupts(phydev);
 }