mbox series

[RFC,0/5] net: stmmac: fix resume failures due to RX clock

Message ID Z8B4tVd4nLUKXdQ4@shell.armlinux.org.uk (mailing list archive)
Headers show
Series net: stmmac: fix resume failures due to RX clock | expand

Message

Russell King (Oracle) Feb. 27, 2025, 2:37 p.m. UTC
Hi,

This series is likely dependent on the "net: stmmac: cleanup transmit
clock setting" series which was submitted earlier today.

stmmac has a long history of failing to resume due to lack of receive
clock. NVidia have reported that as a result of the EEE changes, they
see a greater chance of resume failure with those patches applied than
before.

The issue is that the DesignWare core requires that the receive clock
is running in order to complete software reset, which causes
stmmac_reset() and stmmac_hw_setup() to fail.

There are several things that are wrong:

1. Calling phylink_start() early can result in a call to mac_link_up()
   which will set TE and RE bits before stmmac_hw_setup() has been
   called. This is evident in the debug logs that NVidia sent while
   debugging the problem.

   This is something I have pointed out in the past, but ithas been
   claimed to be necessary to do things this way to have the PHY
   receive clock running. Enabling RE before DMA is setup is against
   the DesignWare databook documentation.

2. Enabling LPI clock-stop at the PHY (as the driver has done prior
   to my patch set) allows the PHY to stop its receive clock when the
   link enters low-power mode. This means the completion of reset is
   dependent on the current EEE state, which is indeterminable, but
   is likely to be in low power mode on resume.

We solve (1) by moving the call to phylink_resume() later. This patch
on its own probably causes regressions as it may make it more likely
that the link will be in low power state, or maybe the PHY driver does
not respect the PHY_F_RXC_ALWAYS_ON flag - this needs to be tested on
as many different hardware setups that use this driver as possible,
and any issues addressed *without* moving phylink_resume() back.
If we need some way to resume the PHY early, then we need to work out
some way to do that with phylib without calling phy_start() early.

(2) is fixed by introducing phylink_prepare_resume(), which will
disable receive clock-stop in LPI mode at the PHY, and we will restore
the clock-stop setting in phylink_resume(). It is possible that this
solves some of the reason for the early placement of phylink_resume().

phylink_prepare_resume() also provides a convenient site should (1)
need further work.

 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 27 +++++++--------
 drivers/net/phy/phylink.c                         | 40 ++++++++++++++++++++++-
 include/linux/phylink.h                           |  1 +
 3 files changed, 51 insertions(+), 17 deletions(-)

Comments

Jon Hunter March 6, 2025, 11:30 a.m. UTC | #1
Hi Russell,

On 27/02/2025 14:37, Russell King (Oracle) wrote:
> Hi,
> 
> This series is likely dependent on the "net: stmmac: cleanup transmit
> clock setting" series which was submitted earlier today.

I tested this series without the above on top of mainline and I still 
saw some issues with suspend. However, when testing this on top of -next 
(which has the referenced series) it works like a charm. So yes it does 
appear to be dependent indeed.

I have tested this on Tegra186, Tegra194 and Tegra234 with -next and all 
are working fine. So with that feel free to add my ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!
Jon
Russell King (Oracle) March 6, 2025, 3:44 p.m. UTC | #2
On Thu, Mar 06, 2025 at 11:30:53AM +0000, Jon Hunter wrote:
> Hi Russell,
> 
> On 27/02/2025 14:37, Russell King (Oracle) wrote:
> > Hi,
> > 
> > This series is likely dependent on the "net: stmmac: cleanup transmit
> > clock setting" series which was submitted earlier today.
> 
> I tested this series without the above on top of mainline and I still saw
> some issues with suspend. However, when testing this on top of -next (which
> has the referenced series) it works like a charm. So yes it does appear to
> be dependent indeed.
> 
> I have tested this on Tegra186, Tegra194 and Tegra234 with -next and all are
> working fine. So with that feel free to add my ...
> 
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Hi Jon,

I came up with an alternative approach which should make this safer -
for example, if the PHY remains linked with the partner over an
ifdown or module remove/re-insert.

Please see v2 of "net: stmmac: approach 2 to solve EEE LPI reset
issues" which replaces this series.

https://lore.kernel.org/r/Z8m-CRucPxDW5zZK@shell.armlinux.org.uk

Thanks.