[RFC,net-next,0/7] net: phy: avoid race when erroring stopping PHY

Message ID ZPsDdqt1RrXB+aTO@shell.armlinux.org.uk (mailing list archive)

Message

Russell King (Oracle) Sept. 8, 2023, 11:20 a.m. UTC
This series addresses a problem reported by Jijie Shao where the PHY
state machine can race with phy_stop() leading to an incorrect state.

The issue centres around phy_state_machine() dropping the phydev->lock
mutex briefly, which allows phy_stop() to get in half-way through the
state machine, and when the state machine resumes, it overwrites
phydev->state with a value incompatible with a stopped PHY. This causes
a subsequent phy_start() to issue a warning.
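
As an illustration only (this is not the phylib code, and every toy_*
name below is made up for the example), the interleaving can be modelled
single-threaded in C. A callback stands in for the thread that calls
phy_stop() while the lock is dropped, and a bool stands in for the
phydev->lock mutex:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy subset of PHY states; hypothetical names, not the kernel's enum. */
enum toy_state { TOY_HALTED, TOY_NOLINK, TOY_RUNNING };

struct toy_phy {
	bool locked;		/* stands in for phydev->lock */
	enum toy_state state;	/* stands in for phydev->state */
};

static void toy_lock(struct toy_phy *p)   { assert(!p->locked); p->locked = true; }
static void toy_unlock(struct toy_phy *p) { assert(p->locked);  p->locked = false; }

/* Models phy_stop(): take the lock and mark the PHY halted. */
static void toy_stop(struct toy_phy *p)
{
	toy_lock(p);
	p->state = TOY_HALTED;
	toy_unlock(p);
}

/* Racy step: the next state is computed, the lock is dropped briefly,
 * and the (now possibly stale) value is written back afterwards.
 * "window" runs while the lock is dropped. */
static void toy_step_racy(struct toy_phy *p, void (*window)(struct toy_phy *))
{
	toy_lock(p);
	enum toy_state next = (p->state == TOY_RUNNING) ? TOY_NOLINK : p->state;
	toy_unlock(p);
	if (window)
		window(p);	/* the stop gets in half-way through */
	toy_lock(p);
	p->state = next;	/* overwrites what the stop wrote */
	toy_unlock(p);
}

/* Step with no unlocked window: decision and update share one lock
 * hold, so the stop can only run before or after the step. */
static void toy_step_locked(struct toy_phy *p, void (*window)(struct toy_phy *))
{
	toy_lock(p);
	if (p->state == TOY_RUNNING)
		p->state = TOY_NOLINK;
	toy_unlock(p);
	if (window)
		window(p);
}

/* Run each variant once, with the stop arriving at the window. */
static enum toy_state toy_run_racy(void)
{
	struct toy_phy p = { false, TOY_RUNNING };
	toy_step_racy(&p, toy_stop);
	return p.state;
}

static enum toy_state toy_run_locked(void)
{
	struct toy_phy p = { false, TOY_RUNNING };
	toy_step_locked(&p, toy_stop);
	return p.state;
}
```

In this model toy_run_racy() leaves the PHY in TOY_NOLINK even though it
was stopped, whereas toy_run_locked() leaves it in TOY_HALTED.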

We address this firstly by using versions of functions that do not take
the lock, moving them into the locked region. The only function that
this can't be done with is phy_suspend(), which needs to call into the
driver without taking the lock.

For phy_suspend(), we split the state machine into two parts - the
initial part which runs under the phydev->lock, and the second part
which runs without the lock.

We finish off by using the split state machine in phy_stop() which
removes another unnecessary unlock-lock sequence from phylib.
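
A minimal sketch of the split described above, assuming a simplified
model rather than the code in this series: the locked part only decides
what work is needed and returns it, and the unlocked part carries that
work out. All demo_* names, the two-value work enum and the bool used as
a lock are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_state { DEMO_RUNNING, DEMO_HALTED };
enum demo_work  { DEMO_WORK_NONE, DEMO_WORK_SUSPEND };

struct demo_phy {
	bool locked;			/* stands in for phydev->lock */
	enum demo_state state;		/* stands in for phydev->state */
	bool suspended;
	bool suspend_ran_unlocked;	/* records how suspend was called */
};

static void demo_lock(struct demo_phy *p)   { assert(!p->locked); p->locked = true; }
static void demo_unlock(struct demo_phy *p) { assert(p->locked);  p->locked = false; }

/* Stands in for phy_suspend(): must be called without the lock. */
static void demo_suspend(struct demo_phy *p)
{
	p->suspend_ran_unlocked = !p->locked;
	p->suspended = true;
}

/* Part 1: runs under the lock and only decides what to do next. */
static enum demo_work demo_machine_locked(struct demo_phy *p)
{
	assert(p->locked);
	if (p->state == DEMO_HALTED && !p->suspended)
		return DEMO_WORK_SUSPEND;
	return DEMO_WORK_NONE;
}

/* Part 2: runs without the lock and performs the deferred work. */
static void demo_machine_post_work(struct demo_phy *p, enum demo_work work)
{
	assert(!p->locked);
	if (work == DEMO_WORK_SUSPEND)
		demo_suspend(p);
}

/* The periodic state machine: one lock hold, then the unlocked tail. */
static void demo_state_machine(struct demo_phy *p)
{
	enum demo_work work;

	demo_lock(p);
	work = demo_machine_locked(p);
	demo_unlock(p);
	demo_machine_post_work(p, work);
}

/* Stop reuses both parts, so the state change and the decision happen
 * in a single lock hold with no extra unlock-lock sequence. */
static void demo_stop(struct demo_phy *p)
{
	enum demo_work work;

	demo_lock(p);
	p->state = DEMO_HALTED;
	work = demo_machine_locked(p);
	demo_unlock(p);
	demo_machine_post_work(p, work);
}

/* Returns true if a stop halts, suspends outside the lock, and unlocks. */
static bool demo_stop_example(void)
{
	struct demo_phy p = { false, DEMO_RUNNING, false, false };

	demo_stop(&p);
	return p.state == DEMO_HALTED && p.suspended &&
	       p.suspend_ran_unlocked && !p.locked;
}
```

Returning the pending work from the locked part means nothing written to
the state after the unlock depends on a value read before it.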

 drivers/net/phy/phy.c | 204 +++++++++++++++++++++++++++-----------------------
 1 file changed, 110 insertions(+), 94 deletions(-)

Comments

Russell King (Oracle) Sept. 11, 2023, 8:54 a.m. UTC | #1
Hi,

It would be good if Jijie Shao could test these patches and provide a
tested-by as appropriate.

Thanks.

On Fri, Sep 08, 2023 at 12:20:22PM +0100, Russell King (Oracle) wrote:
> This series addresses a problem reported by Jijie Shao where the PHY
> state machine can race with phy_stop() leading to an incorrect state.
> 
> The issue centres around phy_state_machine() dropping the phydev->lock
> mutex briefly, which allows phy_stop() to get in half-way through the
> state machine, and when the state machine resumes, it overwrites
> phydev->state with a value incompatible with a stopped PHY. This causes
> a subsequent phy_start() to issue a warning.
> 
> We address this firstly by using versions of functions that do not take
> the lock, moving them into the locked region. The only function that
> this can't be done with is phy_suspend(), which needs to call into the
> driver without taking the lock.
> 
> For phy_suspend(), we split the state machine into two parts - the
> initial part which runs under the phydev->lock, and the second part
> which runs without the lock.
> 
> We finish off by using the split state machine in phy_stop() which
> removes another unnecessary unlock-lock sequence from phylib.
> 
>  drivers/net/phy/phy.c | 204 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 110 insertions(+), 94 deletions(-)
Jijie Shao Sept. 12, 2023, 6:35 a.m. UTC | #2
on 2023/9/11 16:54, Russell King (Oracle) wrote:
> Hi,
>
> It would be good if Jijie Shao could test these patches and provide a
> tested-by as appropriate.
>
> Thanks.
>
> On Fri, Sep 08, 2023 at 12:20:22PM +0100, Russell King (Oracle) wrote:
>> This series addresses a problem reported by Jijie Shao where the PHY
>> state machine can race with phy_stop() leading to an incorrect state.
>>
Hi Russell,

Sorry for the late reply, and thanks for your patches. They work in our
case. Note that our device does not support resuming from suspend, so
the suspend case was not tested.

Tested-by: Jijie Shao <shaojijie@huawei.com>