diff mbox series

[RFC,net-next,1/3] net: phy: don't bind genphy in phy_attach_direct if the specific driver defers probe

Message ID 20210901225053.1205571-2-vladimir.oltean@nxp.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Make the PHY library stop being so greedy when binding the generic PHY driver | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 6 maintainers not CCed: heikki.krogerus@linux.intel.com saravanak@google.com rajatja@google.com rdunlap@infradead.org andriy.shevchenko@linux.intel.com tglx@linutronix.de
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 17237 this patch: 17237
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 62 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 16733 this patch: 16733
netdev/header_inline success Link

Commit Message

Vladimir Oltean Sept. 1, 2021, 10:50 p.m. UTC
There are systems where the PHY driver might get its probe deferred due
to a missing supplier, like an interrupt-parent, gpio, clock or whatever.

If the phy_attach_direct call happens right in between probe attempts,
the PHY library is greedy and assumes that a specific driver will never
appear, so it just binds the generic PHY driver.

In certain cases this is the wrong choice, because some PHYs simply need
the specific driver. The specific PHY driver was going to probe, given
enough time, but this doesn't seem to matter to phy_attach_direct.

To solve this, make phy_attach_direct check whether a specific PHY
driver is pending or not, and if it is, just defer the probing of the
MAC that's connecting to us a bit more too.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/base/dd.c            | 21 +++++++++++++++++++--
 drivers/net/phy/phy_device.c |  8 ++++++++
 include/linux/device.h       |  1 +
 3 files changed, 28 insertions(+), 2 deletions(-)

Comments

Greg Kroah-Hartman Sept. 2, 2021, 5:43 a.m. UTC | #1
On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> There are systems where the PHY driver might get its probe deferred due
> to a missing supplier, like an interrupt-parent, gpio, clock or whatever.
> 
> If the phy_attach_direct call happens right in between probe attempts,
> the PHY library is greedy and assumes that a specific driver will never
> appear, so it just binds the generic PHY driver.
> 
> In certain cases this is the wrong choice, because some PHYs simply need
> the specific driver. The specific PHY driver was going to probe, given
> enough time, but this doesn't seem to matter to phy_attach_direct.
> 
> To solve this, make phy_attach_direct check whether a specific PHY
> driver is pending or not, and if it is, just defer the probing of the
> MAC that's connecting to us a bit more too.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  drivers/base/dd.c            | 21 +++++++++++++++++++--
>  drivers/net/phy/phy_device.c |  8 ++++++++
>  include/linux/device.h       |  1 +
>  3 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 1c379d20812a..b22073b0acd2 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -128,13 +128,30 @@ static void deferred_probe_work_func(struct work_struct *work)
>  }
>  static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
>  
> +static bool __device_pending_probe(struct device *dev)
> +{
> +	return !list_empty(&dev->p->deferred_probe);
> +}
> +
> +bool device_pending_probe(struct device *dev)
> +{
> +	bool pending;
> +
> +	mutex_lock(&deferred_probe_mutex);
> +	pending = __device_pending_probe(dev);
> +	mutex_unlock(&deferred_probe_mutex);
> +
> +	return pending;
> +}
> +EXPORT_SYMBOL_GPL(device_pending_probe);
> +
>  void driver_deferred_probe_add(struct device *dev)
>  {
>  	if (!dev->can_match)
>  		return;
>  
>  	mutex_lock(&deferred_probe_mutex);
> -	if (list_empty(&dev->p->deferred_probe)) {
> +	if (!__device_pending_probe(dev)) {
>  		dev_dbg(dev, "Added to deferred list\n");
>  		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
>  	}
> @@ -144,7 +161,7 @@ void driver_deferred_probe_add(struct device *dev)
>  void driver_deferred_probe_del(struct device *dev)
>  {
>  	mutex_lock(&deferred_probe_mutex);
> -	if (!list_empty(&dev->p->deferred_probe)) {
> +	if (__device_pending_probe(dev)) {
>  		dev_dbg(dev, "Removed from deferred list\n");
>  		list_del_init(&dev->p->deferred_probe);
>  		__device_set_deferred_probe_reason(dev, NULL);
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 52310df121de..2c22a32f0a1c 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
>  
>  	/* Assume that if there is no driver, that it doesn't
>  	 * exist, and we should use the genphy driver.
> +	 * The exception is during probing, when the PHY driver might have
> +	 * attempted a probe but has requested deferral. Since there might be
> +	 * MAC drivers which also attach to the PHY during probe time, try
> +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> +	 * probing until then.

Wait, no, this should not be a "special" thing, and why would the list
of deferred probe show this?

If a bus wants to have this type of "generic vs. specific" logic, then
it needs to handle it in the bus logic itself as that does NOT fit into
the normal driver model at all.  Don't try to get a "hint" of this by
messing with the probe function list.

thanks,

greg k-h
Vladimir Oltean Sept. 2, 2021, 10:11 a.m. UTC | #2
On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote:
> Wait, no, this should not be a "special" thing, and why would the list
> of deferred probe show this?

Why as in why would it work/do what I want, or as in why would you want to do that?

> If a bus wants to have this type of "generic vs. specific" logic, then
> it needs to handle it in the bus logic itself as that does NOT fit into
> the normal driver model at all.  Don't try to get a "hint" of this by
> messing with the probe function list.

Where and how? Do you have an example?
Greg Kroah-Hartman Sept. 2, 2021, 10:37 a.m. UTC | #3
On Thu, Sep 02, 2021 at 01:11:50PM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote:
> > Wait, no, this should not be a "special" thing, and why would the list
> > of deferred probe show this?
> 
> Why as in why would it work/do what I want, or as in why would you want to do that?

Both!  :)

> > If a bus wants to have this type of "generic vs. specific" logic, then
> > it needs to handle it in the bus logic itself as that does NOT fit into
> > the normal driver model at all.  Don't try to get a "hint" of this by
> > messing with the probe function list.
> 
> Where and how? Do you have an example?

No I do not, sorry, most busses do not do this for obvious ordering /
loading / we are not that crazy reasons.

What is causing this all to suddenly break?  The devlink stuff?

thanks,

greg k-h
Vladimir Oltean Sept. 2, 2021, 11:17 a.m. UTC | #4
On Thu, Sep 02, 2021 at 12:37:34PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Sep 02, 2021 at 01:11:50PM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote:
> > > Wait, no, this should not be a "special" thing, and why would the list
> > > of deferred probe show this?
> >
> > Why as in why would it work/do what I want, or as in why would you want to do that?
>
> Both!  :)

So first: why would it work.
You seem to have a misconception that I am "messing with the probe
function list".
I am not, I am just exporting the information whether the device had a
driver which returned -EPROBE_DEFER during probe, or not. For that I am
looking at the presence of this device on the deferred_probe_pending_list.

driver_probe_device
-> if (ret == -EPROBE_DEFER || ret == EPROBE_DEFER) driver_deferred_probe_add(dev);
   -> list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);

driver_bound
-> driver_deferred_probe_del
   -> list_del_init(&dev->p->deferred_probe);

So the presence of "dev" inside deferred_probe_pending_list means
precisely that a driver is pending to be bound.

Second: why would I want to do that.
In the case of PHY devices, the driver binding process starts here:

phy_device_register
-> device_add

It begins synchronously, but may not finish due to probe deferral.
So after device_add finishes, phydev->drv might be NULL due to 2 reasons:

1. -EPROBE_DEFER triggered by "somebody", either by the PHY driver probe
   function itself, or by third parties (like device_links_check_suppliers
   happening to notice that before even calling the driver's probe fn).
   Anyway, the distinction between these 2 is pretty much irrelevant.

2. There genuinely was no driver loaded in the system for this PHY. Note
   that the way things are written, the Generic PHY driver will not
   match on any device in phy_bus_match(). It is bound manually, separately.

The PHY library is absolutely happy to work with a headless chicken, a
phydev with a NULL phydev->drv. Just search for "if (!phydev->drv)"
inside drivers/net/phy/phy.c and drivers/net/phy/phy_device.c.

However, the phydev walking with a NULL drv can only last for so long.
An Ethernet port will soon need that PHY device, and will attach to it.
There are many code paths, all ending in phy_attach_direct.
However, when an Ethernet port decides to attach to a PHY device is
completely asynchronous to the lifetime of the PHY device itself.
This moment is where a driver is really needed, and if none is present,
the generic one is force-bound.

My patch only distinguishes between case 1 and 2 for which phydev->drv
might be NULL. It avoids force-binding the generic PHY when a specific
PHY driver was found, but did not finish binding due to probe deferral.

> > > If a bus wants to have this type of "generic vs. specific" logic, then
> > > it needs to handle it in the bus logic itself as that does NOT fit into
> > > the normal driver model at all.  Don't try to get a "hint" of this by
> > > messing with the probe function list.
> >
> > Where and how? Do you have an example?
>
> No I do not, sorry, most busses do not do this for obvious ordering /
> loading / we are not that crazy reasons.
>
> What is causing this all to suddenly break?  The devlink stuff?

There was a report related to fw_devlink indeed, however strictly
speaking, I wouldn't say it is the cause of all this. It is pretty
uncommon for a PHY device to defer probing I think, hence the bad
assumptions made around it.
Rafael J. Wysocki Sept. 2, 2021, 2:37 p.m. UTC | #5
On Thu, Sep 2, 2021 at 7:43 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > There are systems where the PHY driver might get its probe deferred due
> > to a missing supplier, like an interrupt-parent, gpio, clock or whatever.
> >
> > If the phy_attach_direct call happens right in between probe attempts,
> > the PHY library is greedy and assumes that a specific driver will never
> > appear, so it just binds the generic PHY driver.
> >
> > In certain cases this is the wrong choice, because some PHYs simply need
> > the specific driver. The specific PHY driver was going to probe, given
> > enough time, but this doesn't seem to matter to phy_attach_direct.
> >
> > To solve this, make phy_attach_direct check whether a specific PHY
> > driver is pending or not, and if it is, just defer the probing of the
> > MAC that's connecting to us a bit more too.
> >
> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > ---
> >  drivers/base/dd.c            | 21 +++++++++++++++++++--
> >  drivers/net/phy/phy_device.c |  8 ++++++++
> >  include/linux/device.h       |  1 +
> >  3 files changed, 28 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index 1c379d20812a..b22073b0acd2 100644
> > --- a/drivers/base/dd.c
> > +++ b/drivers/base/dd.c
> > @@ -128,13 +128,30 @@ static void deferred_probe_work_func(struct work_struct *work)
> >  }
> >  static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
> >
> > +static bool __device_pending_probe(struct device *dev)
> > +{
> > +     return !list_empty(&dev->p->deferred_probe);
> > +}
> > +
> > +bool device_pending_probe(struct device *dev)
> > +{
> > +     bool pending;
> > +
> > +     mutex_lock(&deferred_probe_mutex);
> > +     pending = __device_pending_probe(dev);
> > +     mutex_unlock(&deferred_probe_mutex);
> > +
> > +     return pending;
> > +}
> > +EXPORT_SYMBOL_GPL(device_pending_probe);
> > +
> >  void driver_deferred_probe_add(struct device *dev)
> >  {
> >       if (!dev->can_match)
> >               return;
> >
> >       mutex_lock(&deferred_probe_mutex);
> > -     if (list_empty(&dev->p->deferred_probe)) {
> > +     if (!__device_pending_probe(dev)) {
> >               dev_dbg(dev, "Added to deferred list\n");
> >               list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
> >       }
> > @@ -144,7 +161,7 @@ void driver_deferred_probe_add(struct device *dev)
> >  void driver_deferred_probe_del(struct device *dev)
> >  {
> >       mutex_lock(&deferred_probe_mutex);
> > -     if (!list_empty(&dev->p->deferred_probe)) {
> > +     if (__device_pending_probe(dev)) {
> >               dev_dbg(dev, "Removed from deferred list\n");
> >               list_del_init(&dev->p->deferred_probe);
> >               __device_set_deferred_probe_reason(dev, NULL);
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index 52310df121de..2c22a32f0a1c 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> >
> >       /* Assume that if there is no driver, that it doesn't
> >        * exist, and we should use the genphy driver.
> > +      * The exception is during probing, when the PHY driver might have
> > +      * attempted a probe but has requested deferral. Since there might be
> > +      * MAC drivers which also attach to the PHY during probe time, try
> > +      * harder to bind the specific PHY driver, and defer the MAC driver's
> > +      * probing until then.
>
> Wait, no, this should not be a "special" thing, and why would the list
> of deferred probe show this?
>
> If a bus wants to have this type of "generic vs. specific" logic, then
> it needs to handle it in the bus logic itself as that does NOT fit into
> the normal driver model at all.

Well, I think that this is a general issue and it appears to me to be
present in the driver core too, at least to some extent.

Namely, if there are two drivers matching the same device and the
first one's ->probe() returns -EPROBE_DEFER, that will be converted to
EPROBE_DEFER by really_probe(), so driver_probe_device() will pass it
to __device_attach_driver() which then will return 0.  This
bus_for_each_drv() will call __device_attach_driver()  for the second
matching driver even though the first one may still probe successfully
later.

To me, this really is a variant of "if a driver has failed to probe,
try another one" which phy_attach_direct() appears to be doing and in
both cases the probing of the "alternative" is premature if the
probing of the original driver has been deferred.

> Don't try to get a "hint" of this by messing with the probe function list.

I agree that this doesn't look particularly clean, but then I'm
wondering how to address this cleanly.
Russell King (Oracle) Sept. 2, 2021, 6:50 p.m. UTC | #6
On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 52310df121de..2c22a32f0a1c 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
>  
>  	/* Assume that if there is no driver, that it doesn't
>  	 * exist, and we should use the genphy driver.
> +	 * The exception is during probing, when the PHY driver might have
> +	 * attempted a probe but has requested deferral. Since there might be
> +	 * MAC drivers which also attach to the PHY during probe time, try
> +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> +	 * probing until then.
>  	 */
>  	if (!d->driver) {
> +		if (device_pending_probe(d))
> +			return -EPROBE_DEFER;

Something else that concerns me here.

As noted, many network drivers attempt to attach their PHY when the
device is brought up, and not during their probe function.

Taking a driver at random:

drivers/net/ethernet/renesas/sh_eth.c

sh_eth_phy_init() calls of_phy_connect() or phy_connect(), which
ultimately calls phy_attach_direct() and propagates the error code
via an error pointer.

sh_eth_phy_init() propagates the error code to its caller,
sh_eth_phy_start(). This is called from sh_eth_open(), which
probagates the error code. This is called from .ndo_open... and it's
highly likely -EPROBE_DEFER will end up being returned to userspace
through either netlink or netdev ioctls.

Since EPROBE_DEFER is not an error number that we export to
userspace, this should basically never be exposed to userspace, yet
we have a path that it _could_ be exposed if the above condition
is true.

If device_pending_probe() returns true e.g. during initial boot up
while modules are being loaded - maybe the phy driver doesn't have
all the resources it needs because of some other module that hasn't
finished initialising - then we have a window where this will be
exposed to userspace.

So, do we need to fix all the network drivers to do something if
their .ndo_open method encounters this? If so, what? Sleep a bit
and try again? How many times to retry? Convert the error code into
something else, causing userspace to fail where it worked before? If
so which error code?

I think this needs to be thought through a bit better. In this case,
I feel that throwing -EPROBE_DEFER to solve one problem with one
subsystem can result in new problems elsewhere.

We did have an idea at one point about reserving some flag bits in
phydev->dev_flags for phylib use, but I don't think that happened.
If this is the direction we want to go, I think we need to have a
flag in dev_flags so that callers opt-in to the new behaviour whereas
callers such as from .ndo_open keep the old behaviour - because they
just aren't setup to handle an -EPROBE_DEFER return from these
functions.
Vladimir Oltean Sept. 2, 2021, 7:23 p.m. UTC | #7
On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index 52310df121de..2c22a32f0a1c 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> >  
> >  	/* Assume that if there is no driver, that it doesn't
> >  	 * exist, and we should use the genphy driver.
> > +	 * The exception is during probing, when the PHY driver might have
> > +	 * attempted a probe but has requested deferral. Since there might be
> > +	 * MAC drivers which also attach to the PHY during probe time, try
> > +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> > +	 * probing until then.
> >  	 */
> >  	if (!d->driver) {
> > +		if (device_pending_probe(d))
> > +			return -EPROBE_DEFER;
> 
> Something else that concerns me here.
> 
> As noted, many network drivers attempt to attach their PHY when the
> device is brought up, and not during their probe function.
> 
> Taking a driver at random:
> 
> drivers/net/ethernet/renesas/sh_eth.c
> 
> sh_eth_phy_init() calls of_phy_connect() or phy_connect(), which
> ultimately calls phy_attach_direct() and propagates the error code
> via an error pointer.
> 
> sh_eth_phy_init() propagates the error code to its caller,
> sh_eth_phy_start(). This is called from sh_eth_open(), which
> probagates the error code. This is called from .ndo_open... and it's
> highly likely -EPROBE_DEFER will end up being returned to userspace
> through either netlink or netdev ioctls.
> 
> Since EPROBE_DEFER is not an error number that we export to
> userspace, this should basically never be exposed to userspace, yet
> we have a path that it _could_ be exposed if the above condition
> is true.
> 
> If device_pending_probe() returns true e.g. during initial boot up
> while modules are being loaded - maybe the phy driver doesn't have
> all the resources it needs because of some other module that hasn't
> finished initialising - then we have a window where this will be
> exposed to userspace.
> 
> So, do we need to fix all the network drivers to do something if
> their .ndo_open method encounters this? If so, what? Sleep a bit
> and try again? How many times to retry? Convert the error code into
> something else, causing userspace to fail where it worked before? If
> so which error code?

It depends what is the outcome you're going for.
If there's a PHY driver pending, I would do something to wait for that
if I could, it would be silly for the PHY driver to be loading but the
PHY to still be bound to genphy.

I feel that connecting to the PHY from the probe path is the overall
cleaner way to go since it deals with this automatically, but due to the
sheer volume of drivers that connect from .ndo_open, modifying them in
bulk is out of the question. Something sensible needs to happen with
them too, and 'genphy is what you get' might be just that, which is
basically what is happening without these patches. On that note, I don't
know whether there is any objective advantage to connecting to the PHY
at .ndo_open time.

> 
> I think this needs to be thought through a bit better. In this case,
> I feel that throwing -EPROBE_DEFER to solve one problem with one
> subsystem can result in new problems elsewhere.
> 
> We did have an idea at one point about reserving some flag bits in
> phydev->dev_flags for phylib use, but I don't think that happened.
> If this is the direction we want to go, I think we need to have a
> flag in dev_flags so that callers opt-in to the new behaviour whereas
> callers such as from .ndo_open keep the old behaviour - because they
> just aren't setup to handle an -EPROBE_DEFER return from these
> functions.

Or that, yes. I hadn't actually thought about using PHY flags, but I
suppose callers which already can cope with EPROBE_DEFER (they connect
from probe) can opt into that.
Andrew Lunn Sept. 2, 2021, 7:51 p.m. UTC | #8
On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index 52310df121de..2c22a32f0a1c 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> >  
> >  	/* Assume that if there is no driver, that it doesn't
> >  	 * exist, and we should use the genphy driver.
> > +	 * The exception is during probing, when the PHY driver might have
> > +	 * attempted a probe but has requested deferral. Since there might be
> > +	 * MAC drivers which also attach to the PHY during probe time, try
> > +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> > +	 * probing until then.
> >  	 */
> >  	if (!d->driver) {
> > +		if (device_pending_probe(d))
> > +			return -EPROBE_DEFER;
> 
> Something else that concerns me here.
> 
> As noted, many network drivers attempt to attach their PHY when the
> device is brought up, and not during their probe function.

Yes, this is going to be a problem. I agree it is too late to return
-EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the
device is still on the deferred list, otherwise use genphy. And maybe
a timeout and return -ENODEV, which is not 100% correct, we know the
device exists, we just cannot drive it.

Can we tell we are in the context of a driver probe? Or do we need to
add a parameter to the various phy_attach API calls to let the core
know if this is probe or open?

This is more likely to be a problem with NFS root, with the kernel
bringing up an interface as soon as its registered. userspace bringing
up interfaces is generally much later, and udev tends to wait around
until there are no more driver load requests before the boot
continues.

	Andrew
Florian Fainelli Sept. 2, 2021, 8:33 p.m. UTC | #9
On 9/2/2021 12:51 PM, Andrew Lunn wrote:
> On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
>> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
>>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>>> index 52310df121de..2c22a32f0a1c 100644
>>> --- a/drivers/net/phy/phy_device.c
>>> +++ b/drivers/net/phy/phy_device.c
>>> @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
>>>   
>>>   	/* Assume that if there is no driver, that it doesn't
>>>   	 * exist, and we should use the genphy driver.
>>> +	 * The exception is during probing, when the PHY driver might have
>>> +	 * attempted a probe but has requested deferral. Since there might be
>>> +	 * MAC drivers which also attach to the PHY during probe time, try
>>> +	 * harder to bind the specific PHY driver, and defer the MAC driver's
>>> +	 * probing until then.
>>>   	 */
>>>   	if (!d->driver) {
>>> +		if (device_pending_probe(d))
>>> +			return -EPROBE_DEFER;
>>
>> Something else that concerns me here.
>>
>> As noted, many network drivers attempt to attach their PHY when the
>> device is brought up, and not during their probe function.
> 
> Yes, this is going to be a problem. I agree it is too late to return
> -EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the
> device is still on the deferred list, otherwise use genphy. And maybe
> a timeout and return -ENODEV, which is not 100% correct, we know the
> device exists, we just cannot drive it.

Is it really going to be a problem though? The two cases where this will 
matter is if we use IP auto-configuration within the kernel, which this 
patchset ought to be helping with, if we are already in user-space and 
the PHY is connected at .ndo_open() time, there is a whole lot of things 
that did happen prior to getting there, such as udevd using modaliases 
in order to load every possible module we might, so I am debating 
whether we will really see a probe deferral at all.

> 
> Can we tell we are in the context of a driver probe? Or do we need to
> add a parameter to the various phy_attach API calls to let the core
> know if this is probe or open?

Actually we do the RTNL lock will be held during ndo_open and it won't 
during driver probe.

> 
> This is more likely to be a problem with NFS root, with the kernel
> bringing up an interface as soon as its registered. userspace bringing
> up interfaces is generally much later, and udev tends to wait around
> until there are no more driver load requests before the boot
> continues.

See my point above, with Vladimir's change, we should have fw_devlink do 
its job such that by the time the network interface is needed for IP 
auto-configuration, all of its depending resources should also be ready, 
would not they?
Russell King (Oracle) Sept. 2, 2021, 9:33 p.m. UTC | #10
On Thu, Sep 02, 2021 at 01:33:57PM -0700, Florian Fainelli wrote:
> On 9/2/2021 12:51 PM, Andrew Lunn wrote:
> > On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > > > index 52310df121de..2c22a32f0a1c 100644
> > > > --- a/drivers/net/phy/phy_device.c
> > > > +++ b/drivers/net/phy/phy_device.c
> > > > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> > > >   	/* Assume that if there is no driver, that it doesn't
> > > >   	 * exist, and we should use the genphy driver.
> > > > +	 * The exception is during probing, when the PHY driver might have
> > > > +	 * attempted a probe but has requested deferral. Since there might be
> > > > +	 * MAC drivers which also attach to the PHY during probe time, try
> > > > +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> > > > +	 * probing until then.
> > > >   	 */
> > > >   	if (!d->driver) {
> > > > +		if (device_pending_probe(d))
> > > > +			return -EPROBE_DEFER;
> > > 
> > > Something else that concerns me here.
> > > 
> > > As noted, many network drivers attempt to attach their PHY when the
> > > device is brought up, and not during their probe function.
> > 
> > Yes, this is going to be a problem. I agree it is too late to return
> > -EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the
> > device is still on the deferred list, otherwise use genphy. And maybe
> > a timeout and return -ENODEV, which is not 100% correct, we know the
> > device exists, we just cannot drive it.
> 
> Is it really going to be a problem though? The two cases where this will
> matter is if we use IP auto-configuration within the kernel, which this
> patchset ought to be helping with

There is no handling of EPROBE_DEFER in the IP auto-configuration
code while trying to bring up interfaces:

        for_each_netdev(&init_net, dev) {
                if (ic_is_init_dev(dev)) {
...
                        oflags = dev->flags;
                        if (dev_change_flags(dev, oflags | IFF_UP, NULL) < 0) {
                                pr_err("IP-Config: Failed to open %s\n",
                                       dev->name);
                                continue;
                        }

So, the only way this could be reliable is if we can guarantee that
all deferred probes will have been retried by the time we get here.
Do we have that guarantee?

> if we are already in user-space and the
> PHY is connected at .ndo_open() time, there is a whole lot of things that
> did happen prior to getting there, such as udevd using modaliases in order
> to load every possible module we might, so I am debating whether we will
> really see a probe deferral at all.

As can be seen from my recent posts which show on Debian Buster that
interfaces are attempted to be brought up while e.g. mv88e6xxx is still
probing, we can't make any guarantees that things have "settled" by the
time userspace attempts to bring up the network interfaces.

I may have more on why that is happening... I won't post it here, I'll
post to the other thread.

> > Can we tell we are in the context of a driver probe? Or do we need to
> > add a parameter to the various phy_attach API calls to let the core
> > know if this is probe or open?
> 
> Actually we do the RTNL lock will be held during ndo_open and it won't
> during driver probe.

That's probably an unreliable indicator. DPAA2 has weirdness in the
way it can dynamically create and destroy network interfaces, which
does lead to problems with the rtnl lock. I've been carrying a patch
from NXP for this for almost two years now, which NXP still haven't
submitted:

http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4

... and I've no idea why that patch never made mainline. I need it
to avoid the stated deadlock on SolidRun Honeycomb platforms when
creating additional network interfaces for the SFP cages in userspace.
Vladimir Oltean Sept. 2, 2021, 9:39 p.m. UTC | #11
On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> That's probably an unreliable indicator. DPAA2 has weirdness in the
> way it can dynamically create and destroy network interfaces, which
> does lead to problems with the rtnl lock. I've been carrying a patch
> from NXP for this for almost two years now, which NXP still haven't
> submitted:
> 
> http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> 
> ... and I've no idea why that patch never made mainline. I need it
> to avoid the stated deadlock on SolidRun Honeycomb platforms when
> creating additional network interfaces for the SFP cages in userspace.

Ah, nice, I've copied that broken logic for the dpaa2-switch too:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065

So why don't you send the patch? I can send it too if you want to, one
for the switch and one for the DPNI driver.
Russell King (Oracle) Sept. 2, 2021, 10:24 p.m. UTC | #12
On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> > That's probably an unreliable indicator. DPAA2 has weirdness in the
> > way it can dynamically create and destroy network interfaces, which
> > does lead to problems with the rtnl lock. I've been carrying a patch
> > from NXP for this for almost two years now, which NXP still haven't
> > submitted:
> > 
> > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> > 
> > ... and I've no idea why that patch never made mainline. I need it
> > to avoid the stated deadlock on SolidRun Honeycomb platforms when
> > creating additional network interfaces for the SFP cages in userspace.
> 
> Ah, nice, I've copied that broken logic for the dpaa2-switch too:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065
> 
> So why don't you send the patch? I can send it too if you want to, one
> for the switch and one for the DPNI driver.

Sorry, I mis-stated. NXP did submit that exact patch, but it's actually
incorrect for the reason I stated when it was sent:

https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/

I did miss the rtnl_lock() around phylink_disconnect_phy() in the
description of the race, which goes someway towards hiding it, but
there is still a race between phylink_destroy() and another thread
calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed:

static int
dpaa2_eth_get_link_ksettings(struct net_device *net_dev,
                             struct ethtool_link_ksettings *link_settings)
{
        struct dpaa2_eth_priv *priv = netdev_priv(net_dev);

        if (dpaa2_eth_is_type_phy(priv))
                return phylink_ethtool_ksettings_get(priv->mac->phylink,
                                                     link_settings);

which dereferences priv->mac and priv->mac->phylink, vs:

static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg)
{
...
        if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) {
                dpaa2_eth_set_mac_addr(netdev_priv(net_dev));
                dpaa2_eth_update_tx_fqids(priv);

                if (dpaa2_eth_has_mac(priv))
                        dpaa2_eth_disconnect_mac(priv);
                else
                        dpaa2_eth_connect_mac(priv);
        }

static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv)
{
        if (dpaa2_eth_is_type_phy(priv))
                dpaa2_mac_disconnect(priv->mac);

        if (!dpaa2_eth_has_mac(priv))
                return;

        dpaa2_mac_close(priv->mac);
        kfree(priv->mac);		<== potential use after free bug by
        priv->mac = NULL;		<== dpaa2_eth_get_link_ksettings()
}

void dpaa2_mac_disconnect(struct dpaa2_mac *mac)
{
        if (!mac->phylink)
                return;

        phylink_disconnect_phy(mac->phylink);
        phylink_destroy(mac->phylink);	<== another use-after-free bug via
					    dpaa2_eth_get_link_ksettings()
        dpaa2_pcs_destroy(mac);
}

Note that phylink_destroy() is documented as:

 * Note: the rtnl lock must not be held when calling this function.

because it calls sfp_bus_del_upstream(), which will take the rtnl lock
itself. An alternative solution would be to remove the rtnl locking
from sfp_bus_del_upstream(), but then force _everyone_ to take the
rtnl lock before calling phylink_destroy() - meaning a larger block of
code ends up executing under the lock than is really necessary.

However, as I stated in my review of the patch "As I've already stated,
the phylink is not designed to be created and destroyed on a published
network device." That still remains true today, and it seems that the
issue has never been fixed in DPAA2 despite having been pointed out.
Vladimir Oltean Sept. 2, 2021, 10:45 p.m. UTC | #13
On Thu, Sep 02, 2021 at 11:24:39PM +0100, Russell King (Oracle) wrote:
> On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> > > That's probably an unreliable indicator. DPAA2 has weirdness in the
> > > way it can dynamically create and destroy network interfaces, which
> > > does lead to problems with the rtnl lock. I've been carrying a patch
> > > from NXP for this for almost two years now, which NXP still haven't
> > > submitted:
> > >
> > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> > >
> > > ... and I've no idea why that patch never made mainline. I need it
> > > to avoid the stated deadlock on SolidRun Honeycomb platforms when
> > > creating additional network interfaces for the SFP cages in userspace.
> >
> > Ah, nice, I've copied that broken logic for the dpaa2-switch too:
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065
> >
> > So why don't you send the patch? I can send it too if you want to, one
> > for the switch and one for the DPNI driver.
>
> Sorry, I mis-stated. NXP did submit that exact patch, but it's actually
> incorrect for the reason I stated when it was sent:
>
> https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/

So why are you carrying it then?

> I did miss the rtnl_lock() around phylink_disconnect_phy() in the
> description of the race, which goes someway towards hiding it, but
> there is still a race between phylink_destroy() and another thread
> calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed:
>
> static int
> dpaa2_eth_get_link_ksettings(struct net_device *net_dev,
>                              struct ethtool_link_ksettings *link_settings)
> {
>         struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
>
>         if (dpaa2_eth_is_type_phy(priv))
>                 return phylink_ethtool_ksettings_get(priv->mac->phylink,
>                                                      link_settings);
>
> which dereferences priv->mac and priv->mac->phylink, vs:
>
> static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg)
> {
> ...
>         if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) {
>                 dpaa2_eth_set_mac_addr(netdev_priv(net_dev));
>                 dpaa2_eth_update_tx_fqids(priv);
>
>                 if (dpaa2_eth_has_mac(priv))
>                         dpaa2_eth_disconnect_mac(priv);
>                 else
>                         dpaa2_eth_connect_mac(priv);
>         }
>
> static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv)
> {
>         if (dpaa2_eth_is_type_phy(priv))
>                 dpaa2_mac_disconnect(priv->mac);
>
>         if (!dpaa2_eth_has_mac(priv))
>                 return;
>
>         dpaa2_mac_close(priv->mac);
>         kfree(priv->mac);		<== potential use after free bug by
>         priv->mac = NULL;		<== dpaa2_eth_get_link_ksettings()
> }

Okay, so this needs to stay under the rtnetlink mutex, to serialize with
dpaa2_eth_get_link_ksettings which is already under the rtnetlink mutex.
So the way in which rtnl_lock is taken right now is actually fine in a way.

>
> void dpaa2_mac_disconnect(struct dpaa2_mac *mac)
> {
>         if (!mac->phylink)
>                 return;
>
>         phylink_disconnect_phy(mac->phylink);
>         phylink_destroy(mac->phylink);	<== another use-after-free bug via
> 					    dpaa2_eth_get_link_ksettings()
>         dpaa2_pcs_destroy(mac);
> }
>
> Note that phylink_destroy() is documented as:
>
>  * Note: the rtnl lock must not be held when calling this function.
>
> because it calls sfp_bus_del_upstream(), which will take the rtnl lock
> itself. An alternative solution would be to remove the rtnl locking
> from sfp_bus_del_upstream(), but then force _everyone_ to take the
> rtnl lock before calling phylink_destroy() - meaning a larger block of
> code ends up executing under the lock than is really necessary.

So phylink_destroy has exactly 20 call sites, it is not that bad?

And as for "larger block than necessary" - doesn't the dpaa2 prolonged
usage count as necessary?

> However, as I stated in my review of the patch "As I've already stated,
> the phylink is not designed to be created and destroyed on a published
> network device." That still remains true today, and it seems that the
> issue has never been fixed in DPAA2 despite having been pointed out.

So what would you do, exactly, to "fix" the issue that a DPNI can
connect and disconnect at runtime from a DPMAC?

Also, "X is not designed to Y" doesn't really say much, given a bit of
will power. Linux was not designed to run on non-i386 either.

Any other issues besides needing to take rtnl_mutex top-level when
calling phylink_destroy? Since phylink_disconnect_phy needs it anyway,
and phylink_destroy ends up calling sfp_bus_del_upstream which takes the
same mutex again, and drivers that connect/disconnect at probe/remove
time end up calling both in a row, I don't think there is much of an
issue to speak of, or that the rework would be that difficult.
Andrew Lunn Sept. 2, 2021, 11:02 p.m. UTC | #14
> > Note that phylink_destroy() is documented as:
> >
> >  * Note: the rtnl lock must not be held when calling this function.
> >

...

> 
> Any other issues besides needing to take rtnl_mutex top-level when
> calling phylink_destroy?

We should try to keep phylink_create and phylink_destroy symmetrical:

/**
 * phylink_create() - create a phylink instance
 * @config: a pointer to the target &struct phylink_config
 * @fwnode: a pointer to a &struct fwnode_handle describing the network
 *      interface
 * @iface: the desired link mode defined by &typedef phy_interface_t
 * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC.
 *
 * Create a new phylink instance, and parse the link parameters found in @np.
 * This will parse in-band modes, fixed-link or SFP configuration.
 *
 * Note: the rtnl lock must not be held when calling this function.

Having different locking requirements will catch people out.

Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such
a macro.

    Andrew
Vladimir Oltean Sept. 2, 2021, 11:26 p.m. UTC | #15
On Fri, Sep 03, 2021 at 01:02:06AM +0200, Andrew Lunn wrote:
> We should try to keep phylink_create and phylink_destroy symmetrical:
> 
> /**
>  * phylink_create() - create a phylink instance
>  * @config: a pointer to the target &struct phylink_config
>  * @fwnode: a pointer to a &struct fwnode_handle describing the network
>  *      interface
>  * @iface: the desired link mode defined by &typedef phy_interface_t
>  * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC.
>  *
>  * Create a new phylink instance, and parse the link parameters found in @np.
>  * This will parse in-band modes, fixed-link or SFP configuration.
>  *
>  * Note: the rtnl lock must not be held when calling this function.
> 
> Having different locking requirements will catch people out.
> 
> Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such
> a macro.

In this case, the easiest might be to just take a different mutex in
dpaa2 which serializes all places that access the priv->mac references.
I don't know exactly why the SFP bus needs the rtnl_mutex, I've removed
those locks and will see what fails tomorrow, but I don't think dpaa2
has a good enough justification to take the rtnl_mutex just so that it
can connect and disconnect to the MAC freely at runtime.
Russell King (Oracle) Sept. 3, 2021, 12:04 a.m. UTC | #16
On Fri, Sep 03, 2021 at 02:26:07AM +0300, Vladimir Oltean wrote:
> On Fri, Sep 03, 2021 at 01:02:06AM +0200, Andrew Lunn wrote:
> > We should try to keep phylink_create and phylink_destroy symmetrical:
> > 
> > /**
> >  * phylink_create() - create a phylink instance
> >  * @config: a pointer to the target &struct phylink_config
> >  * @fwnode: a pointer to a &struct fwnode_handle describing the network
> >  *      interface
> >  * @iface: the desired link mode defined by &typedef phy_interface_t
> >  * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC.
> >  *
> >  * Create a new phylink instance, and parse the link parameters found in @np.
> >  * This will parse in-band modes, fixed-link or SFP configuration.
> >  *
> >  * Note: the rtnl lock must not be held when calling this function.
> > 
> > Having different locking requirements will catch people out.
> > 
> > Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such
> > a macro.
> 
> In this case, the easiest might be to just take a different mutex in
> dpaa2 which serializes all places that access the priv->mac references.
> I don't know exactly why the SFP bus needs the rtnl_mutex, I've removed
> those locks and will see what fails tomorrow, but I don't think dpaa2
> has a good enough justification to take the rtnl_mutex just so that it
> can connect and disconnect to the MAC freely at runtime.

It needs it to ensure that the sfp-bus code is safe. sfp-bus code
sits between phylink and the sfp stuff, and will be called from
either side. It can't have its own lock, because that gives lockdep
splats.

Removing a lock and then running the kernel is a down right stupid
way to test to see if a lock is necessary.

That approach is like having built a iron bridge, covered it in paint,
then you remove most the bolts, and then test to see whether it's safe
for vehicles to travel over it by riding your bicycle across it and
declaring it safe.

Sorry, but if you think "remove lock, run kernel, if it works fine
the lock is unnecessary" is a valid approach, then you've just
disqualified yourself from discussing this topic any further.
Locking is done by knowing the code and code analysis, not by
playing "does the code fail if I remove it" games. I am utterly
shocked that you think that this is a valid approach.
Ioana Ciornei Sept. 3, 2021, 9:27 a.m. UTC | #17
On Thu, Sep 02, 2021 at 11:24:39PM +0100, Russell King (Oracle) wrote:
> On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> > > That's probably an unreliable indicator. DPAA2 has weirdness in the
> > > way it can dynamically create and destroy network interfaces, which
> > > does lead to problems with the rtnl lock. I've been carrying a patch
> > > from NXP for this for almost two years now, which NXP still haven't
> > > submitted:
> > > 
> > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> > > 
> > > ... and I've no idea why that patch never made mainline. I need it
> > > to avoid the stated deadlock on SolidRun Honeycomb platforms when
> > > creating additional network interfaces for the SFP cages in userspace.
> > 
> > Ah, nice, I've copied that broken logic for the dpaa2-switch too:
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065
> > 
> > So why don't you send the patch? I can send it too if you want to, one
> > for the switch and one for the DPNI driver.
> 
> Sorry, I mis-stated. NXP did submit that exact patch, but it's actually
> incorrect for the reason I stated when it was sent:
> 
> https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/
> 
> I did miss the rtnl_lock() around phylink_disconnect_phy() in the
> description of the race, which goes someway towards hiding it, but
> there is still a race between phylink_destroy() and another thread
> calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed:
> 
> static int
> dpaa2_eth_get_link_ksettings(struct net_device *net_dev,
>                              struct ethtool_link_ksettings *link_settings)
> {
>         struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
> 
>         if (dpaa2_eth_is_type_phy(priv))
>                 return phylink_ethtool_ksettings_get(priv->mac->phylink,
>                                                      link_settings);
> 
> which dereferences priv->mac and priv->mac->phylink, vs:
> 
> static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg)
> {
> ...
>         if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) {
>                 dpaa2_eth_set_mac_addr(netdev_priv(net_dev));
>                 dpaa2_eth_update_tx_fqids(priv);
> 
>                 if (dpaa2_eth_has_mac(priv))
>                         dpaa2_eth_disconnect_mac(priv);
>                 else
>                         dpaa2_eth_connect_mac(priv);
>         }
> 
> static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv)
> {
>         if (dpaa2_eth_is_type_phy(priv))
>                 dpaa2_mac_disconnect(priv->mac);
> 
>         if (!dpaa2_eth_has_mac(priv))
>                 return;
> 
>         dpaa2_mac_close(priv->mac);
>         kfree(priv->mac);		<== potential use after free bug by
>         priv->mac = NULL;		<== dpaa2_eth_get_link_ksettings()
> }
> 
> void dpaa2_mac_disconnect(struct dpaa2_mac *mac)
> {
>         if (!mac->phylink)
>                 return;
> 
>         phylink_disconnect_phy(mac->phylink);
>         phylink_destroy(mac->phylink);	<== another use-after-free bug via
> 					    dpaa2_eth_get_link_ksettings()
>         dpaa2_pcs_destroy(mac);
> }
> 
> Note that phylink_destroy() is documented as:
> 
>  * Note: the rtnl lock must not be held when calling this function.
> 
> because it calls sfp_bus_del_upstream(), which will take the rtnl lock
> itself. An alternative solution would be to remove the rtnl locking
> from sfp_bus_del_upstream(), but then force _everyone_ to take the
> rtnl lock before calling phylink_destroy() - meaning a larger block of
> code ends up executing under the lock than is really necessary.
> 
> However, as I stated in my review of the patch "As I've already stated,
> the phylink is not designed to be created and destroyed on a published
> network device." That still remains true today, and it seems that the
> issue has never been fixed in DPAA2 despite having been pointed out.
> 

My attempt to fix this issue was that patch that you just pointed at.
Taking your feedback into account (that phylink is not designed to be
created and destroyed on a published networking device) I really do not
know what other viable solution to send out.

The alternative here would have been to just have a different driver for
the MAC side (probing on dpmac objects) that creates the phylink
instance at probe time and then is just used by the dpaa2-eth driver
when it connects to a dpmac. This way no phylink is created/destroyed
dynamically.

This was the architecture of my initial attempt at supporting phylink in
DPAA2.
https://patchwork.ozlabs.org/project/netdev/patch/1560470153-26155-5-git-send-email-ioana.ciornei@nxp.com/

If you have any suggestion on how I should go about fixing this, please
let me know.

Ioana
Vladimir Oltean Sept. 3, 2021, 8:48 p.m. UTC | #18
On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote:
> Removing a lock and then running the kernel is a down right stupid
> way to test to see if a lock is necessary.
> 
> That approach is like having built a iron bridge, covered it in paint,
> then you remove most the bolts, and then test to see whether it's safe
> for vehicles to travel over it by riding your bicycle across it and
> declaring it safe.
> 
> Sorry, but if you think "remove lock, run kernel, if it works fine
> the lock is unnecessary" is a valid approach, then you've just
> disqualified yourself from discussing this topic any further.
> Locking is done by knowing the code and code analysis, not by
> playing "does the code fail if I remove it" games. I am utterly
> shocked that you think that this is a valid approach.

... and this is exactly why you will no longer get any attention from me
on this topic. Good luck.
Russell King (Oracle) Sept. 3, 2021, 10:06 p.m. UTC | #19
On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote:
> On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote:
> > Removing a lock and then running the kernel is a down right stupid
> > way to test to see if a lock is necessary.
> > 
> > That approach is like having built a iron bridge, covered it in paint,
> > then you remove most the bolts, and then test to see whether it's safe
> > for vehicles to travel over it by riding your bicycle across it and
> > declaring it safe.
> > 
> > Sorry, but if you think "remove lock, run kernel, if it works fine
> > the lock is unnecessary" is a valid approach, then you've just
> > disqualified yourself from discussing this topic any further.
> > Locking is done by knowing the code and code analysis, not by
> > playing "does the code fail if I remove it" games. I am utterly
> > shocked that you think that this is a valid approach.
> 
> ... and this is exactly why you will no longer get any attention from me
> on this topic. Good luck.

Good, because your approach to this to me reads as "I don't think you
know what the hell you're doing so I'm going to remove a lock to test
whether it is needed." Effectively, that action is an insult towards
me as the author of that code.

And as I said, if you think that's a valid approach, then quite frankly
I don't want you touching my code, because you clearly don't know what
you're doing as you aren't willing to put the necessary effort in to
understanding the code.

Removing a lock and running the kernel is _never_ a valid way to see
whether the lock is required or not. The only way is via code analysis.

I wonder whether you'd take the same approach with filesystems or
memory management code. Why don't you try removing some locks from
those subsystems and see how long your filesystems last?

You could have asked why the lock was necessary, and I would have
described it. That would have been the civil approach. Maybe even
put forward a hypothesis why you think the lock isn't necessary, but
no, you decide that the best way to go about this is to remove the
lock and see whether the kernel breaks.

It may shock you to know that those of us who have been working on
the kernel for almost 30 years and have seen the evolution of the
kernel from uniprocessor to SMP, have had to debug race conditions
caused by a lack of locking know very well that you can have what
seems to be a functioning kernel despite missing locks - and such a
kernel can last quite a long time and only show up the race quite
rarely. This is exactly why "lets remove the lock and see if it
breaks" is a completely invalid approach. I'm sorry that you don't
seem to realise just how stupid a suggestion that was.

I can tell you now: removing the locks you proposed will not show an
immediate problem, but by removing those locks you will definitely
open up race conditions between driver binding events on the SFP
side and network usage on the netdev side which will only occur
rarely.

And just because they only happen rarely is not a justification to
remove locks, no matter how inconvenient those locks may be.
Vladimir Oltean Sept. 4, 2021, 9:59 p.m. UTC | #20
[ again, trimming the CC list, because I assume most people don't care,
  and if they do, the mailing lists are there for that ]

On Fri, Sep 03, 2021 at 11:06:23PM +0100, Russell King (Oracle) wrote:
> On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote:
> > On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote:
> > > Removing a lock and then running the kernel is a down right stupid
> > > way to test to see if a lock is necessary.
> > > 
> > > That approach is like having built a iron bridge, covered it in paint,
> > > then you remove most the bolts, and then test to see whether it's safe
> > > for vehicles to travel over it by riding your bicycle across it and
> > > declaring it safe.
> > > 
> > > Sorry, but if you think "remove lock, run kernel, if it works fine
> > > the lock is unnecessary" is a valid approach, then you've just
> > > disqualified yourself from discussing this topic any further.
> > > Locking is done by knowing the code and code analysis, not by
> > > playing "does the code fail if I remove it" games. I am utterly
> > > shocked that you think that this is a valid approach.
> > 
> > ... and this is exactly why you will no longer get any attention from me
> > on this topic. Good luck.
> 
> Good, because your approach to this to me reads as "I don't think you
> know what the hell you're doing so I'm going to remove a lock to test
> whether it is needed." Effectively, that action is an insult towards
> me as the author of that code.

The reason why you aren't getting any of my attention is your attitude,
in case it was not clear.

You've transformed a few words I said and which were entirely
reasonable, "I don't know exactly why the SFP bus needs the rtnl_mutex,
I've removed those locks and will see what fails tomorrow", into a soap
opera based on something I did not say.

> And as I said, if you think that's a valid approach, then quite frankly
> I don't want you touching my code, because you clearly don't know what
> you're doing as you aren't willing to put the necessary effort in to
> understanding the code.
> 
> Removing a lock and running the kernel is _never_ a valid way to see
> whether the lock is required or not. The only way is via code analysis.

It is a completely valid approach for a simple reason: if there was an
obvious reason why the SFP bus code would have needed serialization
through the rtnetlink mutex, I could have found out by looking at all
the failed assertions and said to myself "oh, yeah, right, of course",
instead of spending several hours looking at the code, at which point I
would have had fewer chances of figuring out anyway.

Effectively, with no assertions failing except those from the phylink
upstream SFP ops (which aren't really an indication that any data
structures are protected, no comments, etc), things are much less
obvious. If I knew from the get-go it would come to this, I would have
asked, rest assured. I just did not want to ask an obvious question,
I was more or less thinking out loud about what I am going to do next.

> I wonder whether you'd take the same approach with filesystems or
> memory management code. Why don't you try removing some locks from
> those subsystems and see how long your filesystems last?

This is a completely irrelevant and wrong argument, of course there are
sandboxes in which incompetent people can do insane things without doing
any damage, even if the subsystems they are interested in are filesystems
and memory management. It brings exactly nothing to the discussion.

> You could have asked why the lock was necessary, and I would have
> described it. That would have been the civil approach. Maybe even
> put forward a hypothesis why you think the lock isn't necessary, but
> no, you decide that the best way to go about this is to remove the
> lock and see whether the kernel breaks.
> 
> It may shock you to know that those of us who have been working on
> the kernel for almost 30 years and have seen the evolution of the
> kernel from uniprocessor to SMP, have had to debug race conditions
> caused by a lack of locking know very well that you can have what
> seems to be a functioning kernel despite missing locks - and such a
> kernel can last quite a long time and only show up the race quite
> rarely. This is exactly why "lets remove the lock and see if it
> breaks" is a completely invalid approach. I'm sorry that you don't
> seem to realise just how stupid a suggestion that was.
> 
> I can tell you now: removing the locks you proposed will not show an
> immediate problem, but by removing those locks you will definitely
> open up race conditions between driver binding events on the SFP
> side and network usage on the netdev side which will only occur
> rarely.
> 
> And just because they only happen rarely is not a justification to
> remove locks, no matter how inconvenient those locks may be.

So I really wasn't going to do it, since I have absolutely no stake in
this, but I happened to be on a plane today for several hours with
literally nothing better to do, so I went through the phylink_create and
phylink_destroy code path, with the intention of seeing whether there is
something it does which fundamentally needs to be serialized by the
rtnetlink mutex.

If the mere idea of me removing a lock was insulting to you, I've no
idea what atrocity this might even compare to. But suffice to say, I
spent several hours and it is not obvious at all, based on code analysis
as you wish, why it must be the rtnl_lock and not any other mutex taken
by both the SFP module driver and the SFP upstream consumer (phylink),
with the same semantics except not the mega-bloated rtnetlink mutex.

These are my notes from the plane, it is a single pass (the second pass
will most likely not happen), again it is purely based on code analysis
as you requested, non-expert of course because it is the first time I
look at the details or even study the code paths, and I haven't even run
the code without the rtnetlink protection as I originally intended.

phylink_register_sfp
-> bus = sfp_bus_find_fwnode(fwnode)
   -> fwnode_property_get_reference_args(fwnode)
   -> bus = sfp_bus_get(fwnode)
      -> mutex_lock(&sfp_mutex)
      -> search for fwnode in sfp->fwnode of sfp_buses list # side note, the iterator in this function should have been named "bus", not "sfp", for consistency
         -> if found, kref_get(&sfp->kref)
         -> else allocate new sfp bus with this sfp->fwnode, and kref_init
      -> mutex_unlock(&sfp_mutex)
   -> fwnode_handle_put(fwnode)
-> pl->sfp_bus = bus
-> sfp_bus_add_upstream(bus, pl)
   -> rtnl_lock()
   -> kref_get(bus->kref) <- why? this increments from 1 to 2. Indicative of possibly concurrent code
   -> bus->upstream = pl
   -> if (bus->sfp) <- this code path does not populate bus->sfp, so unless code is running concurrently (?!) branch is not taken
      -> sfp_register_bus(bus)
   -> rtnl_unlock()
   -> if (ret) => sfp_bus_put(bus) <= on error this decrements the kref back from 2 to 1
      -> kref_put_mutex(&bus->kref, sfp_bus_release, &sfp_mutex)
-> sfp_bus_put(bus)
   -> on error, drops the kref from 1 to 0 and frees the bus under the sfp_mutex
   -> on normal path, drops the kref from 2 to 1

Ok, why would bus->sfp be non-NULL (how would the sfp_register_bus possibly be triggered by this function)?
sfp->bus is set from:

sfp_unregister_socket(bus)
-> rtnl_lock
-> if (bus->upstream_ops) sfp_unregister_bus(bus)
-> sfp_socket_clear(bus)
   -> bus->sfp = NULL
-> rtnl_unlock
-> sfp_bus_put(bus)

sfp_register_socket(dev, sfp, ops)
-> bus = sfp_bus_get(dev->fwnode)
-> rtnl_lock
-> bus->sfp_dev = dev;
-> bus->sfp = sfp;
-> bus->socket_ops = ops;
-> if (bus->upstream_ops) => sfp_register_bus(bus);
-> rtnl_unlock
-> on error => sfp_bus_put(bus)
-> return bus

Who calls sfp_register_socket and sfp_unregister_socket?

sfp_probe (the driver for the cage)
-> sfp->sfp_bus = sfp_register_socket(sfp->dev, sfp, &sfp_module_ops)

sfp_remove
-> sfp_unregister_socket(sfp->sfp_bus)

So sfp_register_bus can be called either by phylink_register_sfp(the upstream side) or sfp_probe(the cage side). They are serialized by the rtnl_mutex.
The bus is only registered when both sides are ready:
phylink_register_sfp registers the sfp_bus if sfp_probe was already called first
sfp_probe registers the sfp_bus if phylink_register_sfp was already called first

Finally, what does sfp_register_bus do?

sfp_register_bus(bus)
-> bus->upstream_ops->link_down(bus->upstream)
   -> phylink_sfp_link_down(pl)
      -> phylink_run_resolve_and_disable(pl, PHYLINK_DISABLE_LINK)
         -> appears to be patched by commit 87454b6edc1b ("net: phylink: avoid resolving link state too early")
            to do nothing when called from phylink_register_sfp, since phylink_create calls:
            __set_bit(PHYLINK_DISABLE_STOPPED, &pl->phylink_disable_state);
            before calling phylink_register_sfp
         -> probably does something only if phylink_start(pl) was called, and sfp_register_socket is called afterwards.
            In that case, it will do this:
            -> queue_work(system_power_efficient_wq, &pl->resolve);
            -> flush_work(&pl->resolve);
            So it will practically call phylink_resolve and wait for it to finish.
-> if bus->phydev exists by now => bus->upstream_ops->connect_phy(bus->upstream, bus->phydev)
   -> phylink_sfp_connect_phy(pl, phy)
      -> phylink_sfp_config
         -> does not seem to do anything requiring rtnl_mutex:
            it mostly seems to call phylink_validate and sfp_select_interface, which are stateless, and
            modify pl->supported, pl->link_config, pl->cur_link_an_mode, pl->link_port.
      -> It also calls phylink_mac_initial_config, but not from the phylink_create code path, because it checks PHYLINK_DISABLE_STOPPED again, as discussed
         -> phylink_mac_initial_config again appears to be stateless from phylink's perspective
   -> phylink_attach_phy
      -> phy_attach_direct
         # does not require rtnl_mutex, see the plethora of drivers which call variants of this function without holding this lock.
         # This is literally the premise that led to the discussion, which Florian pointed out: that you can differentiate between
         # drivers that connect to the PHY in .ndo_open from those that do so at probe time by looking at whether the rtnetlink mutex
         # is taken (which it will be from .ndo_open)
   -> phylink_bringup_phy
      -> phylink_validate, stateless
      -> populates phy->phylink and phy->phy_link_change, but not under mutex_lock(&phy->lock), which is probably fine because there are no concurrent readers
      -> mutex_lock(&phy->lock)
      -> mutex_lock(&pl->state_mutex) # serialize with phylink_disconnect_phy, phylink_ethtool_ksettings_set, phylink_ethtool_set_pauseparam, phylink_resolve, phylink_phy_change
      -> change pl->phydev, pl->phy_state, pl->supported, pl->link_config, phy->advertising
      -> mutex_unlock(&pl->state_mutex)
      -> mutex_unlock(&phy_lock)
      -> phy_request_interrupt
         -> request_threaded_irq
         -> phy_enable_interrupts
            -> calls driver, does not need rtnl_mutex
-> bus->registered = true;
-> bus->socket_ops->attach(bus->sfp);
   -> sfp_attach(sfp)
      -> sfp_sm_event(sfp, SFP_E_DEV_ATTACH)
         -> It must be said that all callers of this function are taking the rtnl_mutex, so there must be something going on in here.
         -> mutex_lock(&sfp->sm_mutex);
         -> sfp_sm_device
            -> Tracks the upstream's state as per the comment.
            -> changes sfp->sm_dev_state. I think the upstream state is changed on the attach event only if the sfp->sm_dev_state
               was SFP_DEV_DETACHED, otherwise nothing happens. The new upstream state appears to transition to SFP_DEV_DOWN.
         -> sfp_sm_module
            -> As per the comment, tracks the insert/remove state of the module, probes the on-board EEPROM, and sets up the power level.
            -> On SFP_E_DEV_ATTACH, it only does something if it was in the SFP_MOD_WAITDEV state.
               -> As per the comment, report the module insertion to the upstream device
                  -> sfp_module_insert(sfp->sfp_bus, &sfp->id);
                     -> bus->upstream_ops->module_insert(bus->upstream, bus)
                        -> phylink_sfp_module_insert
                           -> sets pl->sfp_port and pl->sfp_may_have_phy and pl->sfp_may_have_phy
                           -> optionally calls phylink_sfp_config, which as discussed above does not seem to need the rtnetlink mutex's protection
               -> transitions sfp->sm_mod_state into SFP_MOD_HPOWER and falls through into that state's handler
                  -> sfp_sm_mod_hpower
                     -> a bunch of i2c_reads and i2c_writes, probably do not need rtnl_mutex
                  -> transitions into SFP_MOD_WAITPWR and sets up a timer
         -> sfp_sm_main
            -> we should be entering with sfp->sm_dev_state == SFP_DEV_DOWN and god knows what sfp->sm_mod_state, because the module
               has its own life independent of the upstream attaching or not. Most interesting plausible module state is SFP_MOD_WAITDEV,
               which just became SFP_MOD_WAITPWR. The main state machins is probably in sfp->sm_state == SFP_S_DOWN too.
            -> Because sfp->sm_state is SFP_S_DOWN, the branch under "Some events are global" does probably not get executed at all.
            -> The SFP_S_DOWN handler under the "The main state machine" comment probably exists early too, because sfp->sm_dev_state
               is SFP_DEV_DOWN and not SFP_DEV_UP (freshly attached upstream)
            -> function exits doing nothing?!
-> if (bus->started) => bus->socket_ops->start
   # Since bus->started is set by sfp_upstream_start, which is called at the end of phylink_start, it means this piece of code is not meant to execute
   # from phylink_create's calling context, but from the other caller of sfp_register_bus: sfp_probe. Logically phylink_start comes after phylink_create.
   # So ignore.
-> bus->upstream_ops->attach(bus->upstream, bus)
   -> phylink_sfp_attach(pl, bus)
      -> pl->netdev->sfp_bus = bus
         # This is about the first thing touching the netdev I've seen, but since it's a simple pointer assignment and not anything
         # operating on complex/non-atomic data structures, not sure if this is the reason for rtnl_lock?

And sfp_unregister_bus, called by sfp_bus_del_upstream:
-> bus->upstream_ops->detach(bus->upstream, bus)
  -> phylink_sfp_detach
     -> pl->netdev->sfp_bus = NULL
-> bus->socket_ops->stop(bus->sfp)
   -> sfp_stop(sfp)
      -> sfp_sm_event(sfp, SFP_E_DEV_DOWN);
         -> sfp_sm_device
            -> let's say sfp->sm_dev_state was SFP_DEV_UP, with SFP_E_DEV_DOWN it becomes SFP_DEV_DOWN.
         -> sfp_sm_module
            # I won't go through the state machines again, if something in the teardown path needs
            # rtnetlink protection whereas in the setup path it did not, oh well...

So these are my notes, now please hit me hard, because I don't know, and
I won't look any further into it. Why does phylink_{create,destroy}
require serialization through the rtnetlink mutex?
Russell King (Oracle) Sept. 4, 2021, 11:25 p.m. UTC | #21
On Sun, Sep 05, 2021 at 12:59:05AM +0300, Vladimir Oltean wrote:
> [ again, trimming the CC list, because I assume most people don't care,
>   and if they do, the mailing lists are there for that ]
> 
> On Fri, Sep 03, 2021 at 11:06:23PM +0100, Russell King (Oracle) wrote:
> > On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote:
> > > On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote:
> > > > Removing a lock and then running the kernel is a down right stupid
> > > > way to test to see if a lock is necessary.
> > > > 
> > > > That approach is like having built a iron bridge, covered it in paint,
> > > > then you remove most the bolts, and then test to see whether it's safe
> > > > for vehicles to travel over it by riding your bicycle across it and
> > > > declaring it safe.
> > > > 
> > > > Sorry, but if you think "remove lock, run kernel, if it works fine
> > > > the lock is unnecessary" is a valid approach, then you've just
> > > > disqualified yourself from discussing this topic any further.
> > > > Locking is done by knowing the code and code analysis, not by
> > > > playing "does the code fail if I remove it" games. I am utterly
> > > > shocked that you think that this is a valid approach.
> > > 
> > > ... and this is exactly why you will no longer get any attention from me
> > > on this topic. Good luck.
> > 
> > Good, because your approach to this to me reads as "I don't think you
> > know what the hell you're doing so I'm going to remove a lock to test
> > whether it is needed." Effectively, that action is an insult towards
> > me as the author of that code.
> 
> The reason why you aren't getting any of my attention is your attitude,
> in case it was not clear.
> 
> You've transformed a few words I said and which were entirely
> reasonable, "I don't know exactly why the SFP bus needs the rtnl_mutex,
> I've removed those locks and will see what fails tomorrow", into a soap
> opera based on something I did not say.

You really don't understand, do you.

I will say again: you can not remove a lock and then run-time test
to see whether that lock is required. It doesn't just work like that,
and the more you persist to assert that it does, the more stupid you
look to all those who have had years of kernel programming experience.
So please continue...

> > And as I said, if you think that's a valid approach, then quite frankly
> > I don't want you touching my code, because you clearly don't know what
> > you're doing as you aren't willing to put the necessary effort in to
> > understanding the code.
> > 
> > Removing a lock and running the kernel is _never_ a valid way to see
> > whether the lock is required or not. The only way is via code analysis.
> 
> It is a completely valid approach for a simple reason: if there was an
> obvious reason why the SFP bus code would have needed serialization
> through the rtnetlink mutex, I could have found out by looking at all
> the failed assertions and said to myself "oh, yeah, right, of course",
> instead of spending several hours looking at the code, at which point I
> would have had fewer chances of figuring out anyway.

If we want to answer the question of "why rtnl_mutex" then one first
has to understand the locking strategy and why I ended up there. It
is _not_ simple.

> > I wonder whether you'd take the same approach with filesystems or
> > memory management code. Why don't you try removing some locks from
> > those subsystems and see how long your filesystems last?
> 
> This is a completely irrelevant and wrong argument, of course there are
> sandboxes in which incompetent people can do insane things without doing
> any damage, even if the subsystems they are interested in are filesystems
> and memory management. It brings exactly nothing to the discussion.

It is entirely relevant - it is about your approach to testing whether
a lock is necessary or not. Your stated approach is "lets remove the
lock and then run the kernel and see if anything breaks." I assert
that approach is completely invalid.

> If the mere idea of me removing a lock was insulting to you, I've no
> idea what atrocity this might even compare to. But suffice to say, I
> spent several hours and it is not obvious at all, based on code analysis
> as you wish, why it must be the rtnl_lock and not any other mutex taken
> by both the SFP module driver and the SFP upstream consumer (phylink),
> with the same semantics except not the mega-bloated rtnetlink mutex.
> 
> These are my notes from the plane, it is a single pass (the second pass
> will most likely not happen), again it is purely based on code analysis
> as you requested, non-expert of course because it is the first time I
> look at the details or even study the code paths, and I haven't even run
> the code without the rtnetlink protection as I originally intended.
> 
> phylink_register_sfp
> -> bus = sfp_bus_find_fwnode(fwnode)
>    -> fwnode_property_get_reference_args(fwnode)
>    -> bus = sfp_bus_get(fwnode)
>       -> mutex_lock(&sfp_mutex)
>       -> search for fwnode in sfp->fwnode of sfp_buses list # side note, the iterator in this function should have been named "bus", not "sfp", for consistency
>          -> if found, kref_get(&sfp->kref)
>          -> else allocate new sfp bus with this sfp->fwnode, and kref_init
>       -> mutex_unlock(&sfp_mutex)
>    -> fwnode_handle_put(fwnode)
> -> pl->sfp_bus = bus
> -> sfp_bus_add_upstream(bus, pl)
>    -> rtnl_lock()
>    -> kref_get(bus->kref) <- why? this increments from 1 to 2. Indicative of possibly concurrent code
>    -> bus->upstream = pl
>    -> if (bus->sfp) <- this code path does not populate bus->sfp, so unless code is running concurrently (?!) branch is not taken
>       -> sfp_register_bus(bus)
>    -> rtnl_unlock()
>    -> if (ret) => sfp_bus_put(bus) <= on error this decrements the kref back from 2 to 1
>       -> kref_put_mutex(&bus->kref, sfp_bus_release, &sfp_mutex)
> -> sfp_bus_put(bus)
>    -> on error, drops the kref from 1 to 0 and frees the bus under the sfp_mutex
>    -> on normal path, drops the kref from 2 to 1

First question "why? this increments from 1 to 2. Indicative of possibly
concurrent code" - you appear to have answered that already in two lines
immediately above.

In the case of a pre-existing bus being found, then the krefs will be
one higher than the numerical values you have given above.

> Ok, why would bus->sfp be non-NULL (how would the sfp_register_bus possibly be triggered by this function)?

You've already answered that above. "else allocate new sfp bus with this
sfp->fwnode, and kref_init". In that case, bus->sfp will be NULL because
the socket hasn't been registered.

> sfp->bus is set from:
> 
> sfp_unregister_socket(bus)
> -> rtnl_lock
> -> if (bus->upstream_ops) sfp_unregister_bus(bus)
> -> sfp_socket_clear(bus)
>    -> bus->sfp = NULL
> -> rtnl_unlock
> -> sfp_bus_put(bus)
> 
> sfp_register_socket(dev, sfp, ops)
> -> bus = sfp_bus_get(dev->fwnode)
> -> rtnl_lock
> -> bus->sfp_dev = dev;
> -> bus->sfp = sfp;
> -> bus->socket_ops = ops;
> -> if (bus->upstream_ops) => sfp_register_bus(bus);
> -> rtnl_unlock
> -> on error => sfp_bus_put(bus)
> -> return bus
> 
> Who calls sfp_register_socket and sfp_unregister_socket?
> 
> sfp_probe (the driver for the cage)
> -> sfp->sfp_bus = sfp_register_socket(sfp->dev, sfp, &sfp_module_ops)
> 
> sfp_remove
> -> sfp_unregister_socket(sfp->sfp_bus)
> 
> So sfp_register_bus can be called either by phylink_register_sfp(the upstream side) or sfp_probe(the cage side). They are serialized by the rtnl_mutex.

So here you have established the need for serialisation. However, I
don't think you have completely grasped it fully.

Not only do these two need to be serialised, but also the calls
through sfp_bus, to prevent bus->sfp, bus->socket_ops,
bus->upstream_ops, or bus->upstream changing beneath us.

Sure, bus->sfp, bus->socket_ops isn't going to change except when the
SFP cage is being removed once setup - but these may be dereferenced
by a call from the network side. The same is true of calls going the
other way.

So, we now have a concrete reason why we need serialisation here,
agreed?

Let's take a moment, and assume the sfp-bus layer uses its own private
mutex to achieve this, which would be taken whenever either side calls
one of the interface functions so that dereferences of bus->sfp,
bus->socket_ops, bus->upstream_ops and bus->upstream are all safe.

sfp_get_module_info() and sfp_get_module_eeprom() are called from
ethtool operations. So, lockdep will see rtnl taken first, then our
private mutex. As soon as any two locks nest, it creates an immediate
nesting rule for these two locks to avoid an AB-BA deadlock. We must
always take our private mutex before rtnl, otherwise we have the
possibility of an AB-BA deadlock.

The next part of the puzzle is how we add and remove PHYs.

Pick any ethtool implementation that dereferences the net device
"phydev" member, for example linkstate_get_sqi(). This happens to
take the phydev->lock, but that is not important - the important
point is that netdev->phydev must be a valid phydev or NULL and
must not change while the ethtool call is being processed. Which
lock guarantees that? It's the rtnl lock.

So, to safely change netdev->phydev on a published or running net
device, we must be holding the rtnl lock.

Okay, now lets go back to the sfp_bus layer, and lets consider the
case where a PHY is being removed - and continue to assume that we
are using our private locks in that code. The SFP cage code has
called sfp_remove_phy(), which takes our lock and then calls
through to the disconnect_phy method.

The disconnect_phy() needs to take the rtnl lock to safely remove the
phydev from the network device... but we've taken our private lock.

So, we end up with two paths, one which takes the locks in the order
AB and another which takes them in order BA. Lockdep will spot that
and will complain.

What ways can that be solved?

- One can fall back and just use the rtnl lock.
- One could refcount the structures on both sides, and adding code
  to handle the case where one side or the other goes away - but
  even with that, it's still unsafe.

  Consider sfp_get_module_eeprom(). This will sleep while i2c is
  read (many network drivers will sleep here under the rtnl lock.)
  The SFP cage module gets removed mid-call. There's absolutely
  nothing to prevent that happening. We don't get a look in while
  the i2c adapter is sleeping to abort that. Maybe the SFP cage
  gets removed. We now have lost its code, so when the i2c adapter
  returns, we get a kernel oops because the code we were going to
  execute on function return has been removed.

As soon as you start thinking "we can add a lock here instead of rtnl"
then things start getting really difficult because of netdev holding
rtnl when making some calls through to the SFP cage code, and rtnl
needing to be held when changing the phydev in the network interface
side.

It isn't nice, I know. I wish it wasn't that way, and we could have
finer grained locking, but I don't see any possibilities to avoid the
AB-BA deadlock problem without introducing even more code and/or
creating bugs in the process of doing so.


Now, if you take your approach of "lets remove the rtnl lock and see
whether anything breaks" I can tell you now - you likely won't notice
anything break from a few hundred boots. However, removing the lock
_provably_ opens a race between threads loading or removing the SFP
cage code and actions happening in the netdev layer.

This is why your approach is invalid. You can not prove a negative.
You can not prove that a lock isn't needed by removing it. Computing
does not work that way.

I don't write code to "work 99% of the time". I write code to try to
achieve reliable operation, and that means having the necessary locks
in place to avoid races and prevent kernel oops.

One of the things that having been involved in Linux for so long
teaches you is that a race, no matter how rare, will get found.
Vladimir Oltean Sept. 5, 2021, 12:41 a.m. UTC | #22
On Sun, Sep 05, 2021 at 12:25:40AM +0100, Russell King (Oracle) wrote:
> On Sun, Sep 05, 2021 at 12:59:05AM +0300, Vladimir Oltean wrote:
> > [ again, trimming the CC list, because I assume most people don't care,
> >   and if they do, the mailing lists are there for that ]
> > 
> > On Fri, Sep 03, 2021 at 11:06:23PM +0100, Russell King (Oracle) wrote:
> > > On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote:
> > > > On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote:
> > > > > Removing a lock and then running the kernel is a down right stupid
> > > > > way to test to see if a lock is necessary.
> > > > > 
> > > > > That approach is like having built a iron bridge, covered it in paint,
> > > > > then you remove most the bolts, and then test to see whether it's safe
> > > > > for vehicles to travel over it by riding your bicycle across it and
> > > > > declaring it safe.
> > > > > 
> > > > > Sorry, but if you think "remove lock, run kernel, if it works fine
> > > > > the lock is unnecessary" is a valid approach, then you've just
> > > > > disqualified yourself from discussing this topic any further.
> > > > > Locking is done by knowing the code and code analysis, not by
> > > > > playing "does the code fail if I remove it" games. I am utterly
> > > > > shocked that you think that this is a valid approach.
> > > > 
> > > > ... and this is exactly why you will no longer get any attention from me
> > > > on this topic. Good luck.
> > > 
> > > Good, because your approach to this to me reads as "I don't think you
> > > know what the hell you're doing so I'm going to remove a lock to test
> > > whether it is needed." Effectively, that action is an insult towards
> > > me as the author of that code.
> > 
> > The reason why you aren't getting any of my attention is your attitude,
> > in case it was not clear.
> > 
> > You've transformed a few words I said and which were entirely
> > reasonable, "I don't know exactly why the SFP bus needs the rtnl_mutex,
> > I've removed those locks and will see what fails tomorrow", into a soap
> > opera based on something I did not say.
> 
> You really don't understand, do you.
> 
> I will say again: you can not remove a lock and then run-time test
> to see whether that lock is required. It doesn't just work like that,
> and the more you persist to assert that it does, the more stupid you
> look to all those who have had years of kernel programming experience.
> So please continue...
> 
> > > And as I said, if you think that's a valid approach, then quite frankly
> > > I don't want you touching my code, because you clearly don't know what
> > > you're doing as you aren't willing to put the necessary effort in to
> > > understanding the code.
> > > 
> > > Removing a lock and running the kernel is _never_ a valid way to see
> > > whether the lock is required or not. The only way is via code analysis.
> > 
> > It is a completely valid approach for a simple reason: if there was an
> > obvious reason why the SFP bus code would have needed serialization
> > through the rtnetlink mutex, I could have found out by looking at all
> > the failed assertions and said to myself "oh, yeah, right, of course",
> > instead of spending several hours looking at the code, at which point I
> > would have had fewer chances of figuring out anyway.
> 
> If we want to answer the question of "why rtnl_mutex" then one first
> has to understand the locking strategy and why I ended up there. It
> is _not_ simple.
> 
> > > I wonder whether you'd take the same approach with filesystems or
> > > memory management code. Why don't you try removing some locks from
> > > those subsystems and see how long your filesystems last?
> > 
> > This is a completely irrelevant and wrong argument, of course there are
> > sandboxes in which incompetent people can do insane things without doing
> > any damage, even if the subsystems they are interested in are filesystems
> > and memory management. It brings exactly nothing to the discussion.
> 
> It is entirely relevant - it is about your approach to testing whether
> a lock is necessary or not. Your stated approach is "lets remove the
> lock and then run the kernel and see if anything breaks." I assert
> that approach is completely invalid.
> 
> > If the mere idea of me removing a lock was insulting to you, I've no
> > idea what atrocity this might even compare to. But suffice to say, I
> > spent several hours and it is not obvious at all, based on code analysis
> > as you wish, why it must be the rtnl_lock and not any other mutex taken
> > by both the SFP module driver and the SFP upstream consumer (phylink),
> > with the same semantics except not the mega-bloated rtnetlink mutex.
> > 
> > These are my notes from the plane, it is a single pass (the second pass
> > will most likely not happen), again it is purely based on code analysis
> > as you requested, non-expert of course because it is the first time I
> > look at the details or even study the code paths, and I haven't even run
> > the code without the rtnetlink protection as I originally intended.
> > 
> > phylink_register_sfp
> > -> bus = sfp_bus_find_fwnode(fwnode)
> >    -> fwnode_property_get_reference_args(fwnode)
> >    -> bus = sfp_bus_get(fwnode)
> >       -> mutex_lock(&sfp_mutex)
> >       -> search for fwnode in sfp->fwnode of sfp_buses list # side note, the iterator in this function should have been named "bus", not "sfp", for consistency
> >          -> if found, kref_get(&sfp->kref)
> >          -> else allocate new sfp bus with this sfp->fwnode, and kref_init
> >       -> mutex_unlock(&sfp_mutex)
> >    -> fwnode_handle_put(fwnode)
> > -> pl->sfp_bus = bus
> > -> sfp_bus_add_upstream(bus, pl)
> >    -> rtnl_lock()
> >    -> kref_get(bus->kref) <- why? this increments from 1 to 2. Indicative of possibly concurrent code
> >    -> bus->upstream = pl
> >    -> if (bus->sfp) <- this code path does not populate bus->sfp, so unless code is running concurrently (?!) branch is not taken
> >       -> sfp_register_bus(bus)
> >    -> rtnl_unlock()
> >    -> if (ret) => sfp_bus_put(bus) <= on error this decrements the kref back from 2 to 1
> >       -> kref_put_mutex(&bus->kref, sfp_bus_release, &sfp_mutex)
> > -> sfp_bus_put(bus)
> >    -> on error, drops the kref from 1 to 0 and frees the bus under the sfp_mutex
> >    -> on normal path, drops the kref from 2 to 1
> 
> First question "why? this increments from 1 to 2. Indicative of possibly
> concurrent code" - you appear to have answered that already in two lines
> immediately above.
> 
> In the case of a pre-existing bus being found, then the krefs will be
> one higher than the numerical values you have given above.
> 
> > Ok, why would bus->sfp be non-NULL (how would the sfp_register_bus possibly be triggered by this function)?
> 
> You've already answered that above. "else allocate new sfp bus with this
> sfp->fwnode, and kref_init". In that case, bus->sfp will be NULL because
> the socket hasn't been registered.
> 
> > sfp->bus is set from:
> > 
> > sfp_unregister_socket(bus)
> > -> rtnl_lock
> > -> if (bus->upstream_ops) sfp_unregister_bus(bus)
> > -> sfp_socket_clear(bus)
> >    -> bus->sfp = NULL
> > -> rtnl_unlock
> > -> sfp_bus_put(bus)
> > 
> > sfp_register_socket(dev, sfp, ops)
> > -> bus = sfp_bus_get(dev->fwnode)
> > -> rtnl_lock
> > -> bus->sfp_dev = dev;
> > -> bus->sfp = sfp;
> > -> bus->socket_ops = ops;
> > -> if (bus->upstream_ops) => sfp_register_bus(bus);
> > -> rtnl_unlock
> > -> on error => sfp_bus_put(bus)
> > -> return bus
> > 
> > Who calls sfp_register_socket and sfp_unregister_socket?
> > 
> > sfp_probe (the driver for the cage)
> > -> sfp->sfp_bus = sfp_register_socket(sfp->dev, sfp, &sfp_module_ops)
> > 
> > sfp_remove
> > -> sfp_unregister_socket(sfp->sfp_bus)
> > 
> > So sfp_register_bus can be called either by phylink_register_sfp(the upstream side) or sfp_probe(the cage side). They are serialized by the rtnl_mutex.
> 
> So here you have established the need for serialisation. However, I
> don't think you have completely grasped it fully.
> 
> Not only do these two need to be serialised, but also the calls
> through sfp_bus, to prevent bus->sfp, bus->socket_ops,
> bus->upstream_ops, or bus->upstream changing beneath us.
> 
> Sure, bus->sfp, bus->socket_ops isn't going to change except when the
> SFP cage is being removed once setup - but these may be dereferenced
> by a call from the network side. The same is true of calls going the
> other way.
> 
> So, we now have a concrete reason why we need serialisation here,
> agreed?
> 
> Let's take a moment, and assume the sfp-bus layer uses its own private
> mutex to achieve this, which would be taken whenever either side calls
> one of the interface functions so that dereferences of bus->sfp,
> bus->socket_ops, bus->upstream_ops and bus->upstream are all safe.
> 
> sfp_get_module_info() and sfp_get_module_eeprom() are called from
> ethtool operations. So, lockdep will see rtnl taken first, then our
> private mutex. As soon as any two locks nest, it creates an immediate
> nesting rule for these two locks to avoid an AB-BA deadlock. We must
> always take our private mutex before rtnl, otherwise we have the
> possibility of an AB-BA deadlock.
> 
> The next part of the puzzle is how we add and remove PHYs.
> 
> Pick any ethtool implementation that dereferences the net device
> "phydev" member, for example linkstate_get_sqi(). This happens to
> take the phydev->lock, but that is not important - the important
> point is that netdev->phydev must be a valid phydev or NULL and
> must not change while the ethtool call is being processed. Which
> lock guarantees that? It's the rtnl lock.
> 
> So, to safely change netdev->phydev on a published or running net
> device, we must be holding the rtnl lock.
> 
> Okay, now lets go back to the sfp_bus layer, and lets consider the
> case where a PHY is being removed - and continue to assume that we
> are using our private locks in that code. The SFP cage code has
> called sfp_remove_phy(), which takes our lock and then calls
> through to the disconnect_phy method.
> 
> The disconnect_phy() needs to take the rtnl lock to safely remove the
> phydev from the network device... but we've taken our private lock.
> 
> So, we end up with two paths, one which takes the locks in the order
> AB and another which takes them in order BA. Lockdep will spot that
> and will complain.
> 
> What ways can that be solved?
> 
> - One can fall back and just use the rtnl lock.
> - One could refcount the structures on both sides, and adding code
>   to handle the case where one side or the other goes away - but
>   even with that, it's still unsafe.
> 
>   Consider sfp_get_module_eeprom(). This will sleep while i2c is
>   read (many network drivers will sleep here under the rtnl lock.)
>   The SFP cage module gets removed mid-call. There's absolutely
>   nothing to prevent that happening. We don't get a look in while
>   the i2c adapter is sleeping to abort that. Maybe the SFP cage
>   gets removed. We now have lost its code, so when the i2c adapter
>   returns, we get a kernel oops because the code we were going to
>   execute on function return has been removed.
> 
> As soon as you start thinking "we can add a lock here instead of rtnl"
> then things start getting really difficult because of netdev holding
> rtnl when making some calls through to the SFP cage code, and rtnl
> needing to be held when changing the phydev in the network interface
> side.
> 
> It isn't nice, I know. I wish it wasn't that way, and we could have
> finer grained locking, but I don't see any possibilities to avoid the
> AB-BA deadlock problem without introducing even more code and/or
> creating bugs in the process of doing so.
> 
> 
> Now, if you take your approach of "lets remove the rtnl lock and see
> whether anything breaks" I can tell you now - you likely won't notice
> anything break from a few hundred boots. However, removing the lock
> _provably_ opens a race between threads loading or removing the SFP
> cage code and actions happening in the netdev layer.
> 
> This is why your approach is invalid. You can not prove a negative.
> You can not prove that a lock isn't needed by removing it. Computing
> does not work that way.
> 
> I don't write code to "work 99% of the time". I write code to try to
> achieve reliable operation, and that means having the necessary locks
> in place to avoid races and prevent kernel oops.
> 
> One of the things that having been involved in Linux for so long
> teaches you is that a race, no matter how rare, will get found.

I haven't fully digested everything you've said, but if I were to
summarize the basic idea, it is that
(a) the rtnetlink mutex fundamentally serializes accesses to
    netdev->phydev on a registered interface, which renders my comment
    about phy_attach_direct not needing rtnl_lock invalid, since phylink
    needs to deal with PHYs appearing and disappearing on SFP modules
    while the netdev attached to the sfp upstream is already registered.
    This would mean that if we were to replace the rtnl mutex with
    something else, it'd have to be a netdev->phy_lock which is held at
    all times when netdev->phydev is accessed. Of course, that would be
    its own discussion, which functions should take this lock on their
    own and which should expect it to already be held.
(b) the sfp upstream and module drivers need to synchronize on a shared
    lock that is taken top-level by both parties, and the rtnetlink
    mutex is currently used for that. Could the sfp bus not have exactly
    that, a shared mutex (a pointer), created by whomever creates the
    sfp bus first, and destroyed by whomever puts the last reference on
    the sfp bus? This would act as the top-level mutex, and netdev->phy_lock
    would be accessed from within the individual functions that need to
    touch that (probably just the upstream side). The sfp_mutex lock
    inside sfp_bus_get could probably act as preliminary serialization
    until the shared lock is fully published and held by the side that
    created the lock in the first place.

The entire reason I'm asking is because I got the impression that you
are not fully satisfied with the only apparent solution to the dpaa2
deadlock problem, which is to force all callers of phylink_{create,destroy}
to take the rtnl mutex top-level, because it would not be needed for
most of the operations, and it would make the design even harder to
untangle later on. If you are satisfied with that solution, my bad.
diff mbox series

Patch

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 1c379d20812a..b22073b0acd2 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -128,13 +128,30 @@  static void deferred_probe_work_func(struct work_struct *work)
 }
 static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
 
+static bool __device_pending_probe(struct device *dev)
+{
+	return !list_empty(&dev->p->deferred_probe);
+}
+
+bool device_pending_probe(struct device *dev)
+{
+	bool pending;
+
+	mutex_lock(&deferred_probe_mutex);
+	pending = __device_pending_probe(dev);
+	mutex_unlock(&deferred_probe_mutex);
+
+	return pending;
+}
+EXPORT_SYMBOL_GPL(device_pending_probe);
+
 void driver_deferred_probe_add(struct device *dev)
 {
 	if (!dev->can_match)
 		return;
 
 	mutex_lock(&deferred_probe_mutex);
-	if (list_empty(&dev->p->deferred_probe)) {
+	if (!__device_pending_probe(dev)) {
 		dev_dbg(dev, "Added to deferred list\n");
 		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
 	}
@@ -144,7 +161,7 @@  void driver_deferred_probe_add(struct device *dev)
 void driver_deferred_probe_del(struct device *dev)
 {
 	mutex_lock(&deferred_probe_mutex);
-	if (!list_empty(&dev->p->deferred_probe)) {
+	if (__device_pending_probe(dev)) {
 		dev_dbg(dev, "Removed from deferred list\n");
 		list_del_init(&dev->p->deferred_probe);
 		__device_set_deferred_probe_reason(dev, NULL);
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 52310df121de..2c22a32f0a1c 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1386,8 +1386,16 @@  int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
 
 	/* Assume that if there is no driver, that it doesn't
 	 * exist, and we should use the genphy driver.
+	 * The exception is during probing, when the PHY driver might have
+	 * attempted a probe but has requested deferral. Since there might be
+	 * MAC drivers which also attach to the PHY during probe time, try
+	 * harder to bind the specific PHY driver, and defer the MAC driver's
+	 * probing until then.
 	 */
 	if (!d->driver) {
+		if (device_pending_probe(d))
+			return -EPROBE_DEFER;
+
 		if (phydev->is_c45)
 			d->driver = &genphy_c45_driver.mdiodrv.driver;
 		else
diff --git a/include/linux/device.h b/include/linux/device.h
index e270cb740b9e..505e77715789 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -889,6 +889,7 @@  int __must_check driver_attach(struct device_driver *drv);
 void device_initial_probe(struct device *dev);
 int __must_check device_reprobe(struct device *dev);
 
+bool device_pending_probe(struct device *dev);
 bool device_is_bound(struct device *dev);
 
 /*