[net] net: phy: Fix deadlocking in phy_error() invocation

Message ID	20230818125449.32061-1-fancer.lancer@gmail.com (mailing list archive)
State	Accepted
Commit	a0e026e7b37e997f4fa3fcaa714e5484f3ce9e75
Delegated to:	Netdev Maintainers
Headers	show Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 712C563B4 for <netdev@vger.kernel.org>; Fri, 18 Aug 2023 12:54:59 +0000 (UTC) From: Serge Semin <fancer.lancer@gmail.com> To: Andrew Lunn <andrew@lunn.ch>, Heiner Kallweit <hkallweit1@gmail.com>, Russell King <linux@armlinux.org.uk>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Francesco Dolcini <francesco.dolcini@toradex.com> Cc: Serge Semin <fancer.lancer@gmail.com>, netdev@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH net] net: phy: Fix deadlocking in phy_error() invocation Date: Fri, 18 Aug 2023 15:54:45 +0300 Message-ID: <20230818125449.32061-1-fancer.lancer@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[net] net: phy: Fix deadlocking in phy_error() invocation \| expand [net] net: phy: Fix deadlocking in phy_error() invocation

Context	Check	Description
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for net
netdev/fixes_present	success	Fixes tag present in non-next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1328 this patch: 1328
netdev/cc_maintainers	success	CCed 9 of 9 maintainers
netdev/build_clang	success	Errors and warnings before: 1351 this patch: 1351
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1351 this patch: 1351
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 31 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Serge Semin Aug. 18, 2023, 12:54 p.m. UTC

Since commit 91a7cda1f4b8 ("net: phy: Fix race condition on link status
change") all the phy_error() method invocations have been causing the
nested-mutex-lock deadlock because it's normally done in the PHY-driver
threaded IRQ handlers which since that change have been called with the
phydev->lock mutex held. Here is the calls thread:

IRQ: phy_interrupt()
     +-> mutex_lock(&phydev->lock); <--------------------+
         drv->handle_interrupt()                         | Deadlock due
         +-> ERROR: phy_error()                          + to the nested
                    +-> phy_process_error()              | mutex lock
                        +-> mutex_lock(&phydev->lock); <-+
                            phydev->state = PHY_ERROR;
                            mutex_unlock(&phydev->lock);
         mutex_unlock(&phydev->lock);

The problem can be easily reproduced just by calling phy_error() from any
PHY-device threaded interrupt handler. Fix it by dropping the phydev->lock
mutex lock from the phy_process_error() method and printing a nasty error
message to the system log if the mutex isn't held in the caller execution
context.

Note for the fix to work correctly in the PHY-subsystem itself the
phydev->lock mutex locking must be added to the phy_error_precise()
function.

Link: https://lore.kernel.org/netdev/20230816180944.19262-1-fancer.lancer@gmail.com
Fixes: 91a7cda1f4b8 ("net: phy: Fix race condition on link status change")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
---
 drivers/net/phy/phy.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Andrew Lunn Aug. 18, 2023, 1:07 p.m. UTC | #1

On Fri, Aug 18, 2023 at 03:54:45PM +0300, Serge Semin wrote:
>  static void phy_process_error(struct phy_device *phydev)
>  {
> -	mutex_lock(&phydev->lock);
> +	/* phydev->lock must be held for the state change to be safe */
> +	if (!mutex_is_locked(&phydev->lock))
> +		phydev_err(phydev, "PHY-device data unsafe context\n");
> +
>  	phydev->state = PHY_ERROR;
> -	mutex_unlock(&phydev->lock);
>  
>  	phy_trigger_machine(phydev);
>  }

Thanks for the patch Serge. It looks like a good implementation of
what i suggested. But thinking about it further, if the error ever
appears in somebodies kernel log, there is probably not enough
information to actually fix it. There is no call path. So i think it
should actually use WARN_ON_ONCE() so we get a stack trace.

Sorry for changing my mind.

    Andrew

---
pw-bot: cr

Serge Semin Aug. 18, 2023, 2:27 p.m. UTC | #2

On Fri, Aug 18, 2023 at 03:07:49PM +0200, Andrew Lunn wrote:
> On Fri, Aug 18, 2023 at 03:54:45PM +0300, Serge Semin wrote:
> >  static void phy_process_error(struct phy_device *phydev)
> >  {
> > -	mutex_lock(&phydev->lock);
> > +	/* phydev->lock must be held for the state change to be safe */
> > +	if (!mutex_is_locked(&phydev->lock))
> > +		phydev_err(phydev, "PHY-device data unsafe context\n");
> > +
> >  	phydev->state = PHY_ERROR;
> > -	mutex_unlock(&phydev->lock);
> >  
> >  	phy_trigger_machine(phydev);
> >  }
> 
> Thanks for the patch Serge. It looks like a good implementation of
> what i suggested. But thinking about it further, if the error ever
> appears in somebodies kernel log, there is probably not enough
> information to actually fix it. There is no call path. So i think it
> should actually use WARN_ON_ONCE() so we get a stack trace.

A trace is already printed by means of WARN()/WARN_ON()
in the phy_process_error() method callers:
phy_error_precise()
and
phy_error()
Wouldn't it be too much to print it twice in a row?

We can redefine phy_error_precise() and phy_process_error() functions
to something like this:

static void phy_process_error(struct phy_device *phydev,
			      const void *func, int err)
{
	if (__ONCE_LITE_IF(!mutex_is_locked(&phydev->lock)))
		WARN(1, "PHY-device data unsafe context\n");
	else if (func)
		WARN(1, "%pS: returned: %d\n", func, err);
	else
		WARN_ON(1);

	phydev->state = PHY_ERROR;

	phy_trigger_machine(phydev);
}

static void phy_error_precise(struct phy_device *phydev,
			      const void *func, int err)
{
        mutex_lock(&phydev->lock);
        phy_process_error(phydev, func, err);
        mutex_unlock(&phydev->lock);
}

void phy_error(struct phy_device *phydev)
{
	phy_process_error(phydev, NULL, 0);
}
EXPORT_SYMBOL(phy_error);

Though in such implementation phy_error_precise() looks redundant. We
can freely move its body to the single user - phy_state_machine()
function.

Note a positive side effect of this implementation is that potentially
phy_error() can be converted to accepting a function pointer caused
the error (phy_read(), phy_write(), etc). Alternatively if the
conversion would look too bulky, phy_error_preciseI() could be just
EXPORT_SYMBOL()-ed with the PHY-device mutex locking being moved to
phy_state_machine().

> 
> Sorry for changing my mind.

No worries.

-Serge(y)

> 
>     Andrew
> 
> ---
> pw-bot: cr

Andrew Lunn Aug. 18, 2023, 3:32 p.m. UTC | #3

> A trace is already printed by means of WARN()/WARN_ON()
> in the phy_process_error() method callers:
> phy_error_precise()
> and
> phy_error()
> Wouldn't it be too much to print it twice in a row?

Ah, good point. I missed that.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

pw-bot: new

patchwork-bot+netdevbpf@kernel.org Aug. 19, 2023, 6:34 p.m. UTC | #4

Hello:

This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:

On Fri, 18 Aug 2023 15:54:45 +0300 you wrote:
> Since commit 91a7cda1f4b8 ("net: phy: Fix race condition on link status
> change") all the phy_error() method invocations have been causing the
> nested-mutex-lock deadlock because it's normally done in the PHY-driver
> threaded IRQ handlers which since that change have been called with the
> phydev->lock mutex held. Here is the calls thread:
> 
> IRQ: phy_interrupt()
>      +-> mutex_lock(&phydev->lock); <--------------------+
>          drv->handle_interrupt()                         | Deadlock due
>          +-> ERROR: phy_error()                          + to the nested
>                     +-> phy_process_error()              | mutex lock
>                         +-> mutex_lock(&phydev->lock); <-+
>                             phydev->state = PHY_ERROR;
>                             mutex_unlock(&phydev->lock);
>          mutex_unlock(&phydev->lock);
> 
> [...]

Here is the summary with links:
  - [net] net: phy: Fix deadlocking in phy_error() invocation
    https://git.kernel.org/netdev/net/c/a0e026e7b37e

You are awesome, thank you!

[net] net: phy: Fix deadlocking in phy_error() invocation

Checks

Commit Message

Comments

Patch