Message ID | 20210901225053.1205571-2-vladimir.oltean@nxp.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Make the PHY library stop being so greedy when binding the generic PHY driver | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | warning | 6 maintainers not CCed: heikki.krogerus@linux.intel.com saravanak@google.com rajatja@google.com rdunlap@infradead.org andriy.shevchenko@linux.intel.com tglx@linutronix.de |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 17237 this patch: 17237 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 62 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 16733 this patch: 16733 |
netdev/header_inline | success | Link |
On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote: > There are systems where the PHY driver might get its probe deferred due > to a missing supplier, like an interrupt-parent, gpio, clock or whatever. > > If the phy_attach_direct call happens right in between probe attempts, > the PHY library is greedy and assumes that a specific driver will never > appear, so it just binds the generic PHY driver. > > In certain cases this is the wrong choice, because some PHYs simply need > the specific driver. The specific PHY driver was going to probe, given > enough time, but this doesn't seem to matter to phy_attach_direct. > > To solve this, make phy_attach_direct check whether a specific PHY > driver is pending or not, and if it is, just defer the probing of the > MAC that's connecting to us a bit more too. > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > --- > drivers/base/dd.c | 21 +++++++++++++++++++-- > drivers/net/phy/phy_device.c | 8 ++++++++ > include/linux/device.h | 1 + > 3 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index 1c379d20812a..b22073b0acd2 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -128,13 +128,30 @@ static void deferred_probe_work_func(struct work_struct *work) > } > static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func); > > +static bool __device_pending_probe(struct device *dev) > +{ > + return !list_empty(&dev->p->deferred_probe); > +} > + > +bool device_pending_probe(struct device *dev) > +{ > + bool pending; > + > + mutex_lock(&deferred_probe_mutex); > + pending = __device_pending_probe(dev); > + mutex_unlock(&deferred_probe_mutex); > + > + return pending; > +} > +EXPORT_SYMBOL_GPL(device_pending_probe); > + > void driver_deferred_probe_add(struct device *dev) > { > if (!dev->can_match) > return; > > mutex_lock(&deferred_probe_mutex); > - if (list_empty(&dev->p->deferred_probe)) { > + if (!__device_pending_probe(dev)) { > dev_dbg(dev, "Added to deferred list\n"); > list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list); > } > @@ -144,7 +161,7 @@ void driver_deferred_probe_add(struct device *dev) > void driver_deferred_probe_del(struct device *dev) > { > mutex_lock(&deferred_probe_mutex); > - if (!list_empty(&dev->p->deferred_probe)) { > + if (__device_pending_probe(dev)) { > dev_dbg(dev, "Removed from deferred list\n"); > list_del_init(&dev->p->deferred_probe); > __device_set_deferred_probe_reason(dev, NULL); > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > index 52310df121de..2c22a32f0a1c 100644 > --- a/drivers/net/phy/phy_device.c > +++ b/drivers/net/phy/phy_device.c > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, > > /* Assume that if there is no driver, that it doesn't > * exist, and we should use the genphy driver. > + * The exception is during probing, when the PHY driver might have > + * attempted a probe but has requested deferral. Since there might be > + * MAC drivers which also attach to the PHY during probe time, try > + * harder to bind the specific PHY driver, and defer the MAC driver's > + * probing until then. Wait, no, this should not be a "special" thing, and why would the list of deferred probe show this? If a bus wants to have this type of "generic vs. specific" logic, then it needs to handle it in the bus logic itself as that does NOT fit into the normal driver model at all. Don't try to get a "hint" of this by messing with the probe function list. thanks, greg k-h
On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote: > Wait, no, this should not be a "special" thing, and why would the list > of deferred probe show this? Why as in why would it work/do what I want, or as in why would you want to do that? > If a bus wants to have this type of "generic vs. specific" logic, then > it needs to handle it in the bus logic itself as that does NOT fit into > the normal driver model at all. Don't try to get a "hint" of this by > messing with the probe function list. Where and how? Do you have an example?
On Thu, Sep 02, 2021 at 01:11:50PM +0300, Vladimir Oltean wrote: > On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote: > > Wait, no, this should not be a "special" thing, and why would the list > > of deferred probe show this? > > Why as in why would it work/do what I want, or as in why would you want to do that? Both! :) > > If a bus wants to have this type of "generic vs. specific" logic, then > > it needs to handle it in the bus logic itself as that does NOT fit into > > the normal driver model at all. Don't try to get a "hint" of this by > > messing with the probe function list. > > Where and how? Do you have an example? No I do not, sorry, most busses do not do this for obvious ordering / loading / we are not that crazy reasons. What is causing this all to suddenly break? The devlink stuff? thanks, greg k-h
On Thu, Sep 02, 2021 at 12:37:34PM +0200, Greg Kroah-Hartman wrote: > On Thu, Sep 02, 2021 at 01:11:50PM +0300, Vladimir Oltean wrote: > > On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote: > > > Wait, no, this should not be a "special" thing, and why would the list > > > of deferred probe show this? > > > > Why as in why would it work/do what I want, or as in why would you want to do that? > > Both! :) So first: why would it work. You seem to have a misconception that I am "messing with the probe function list". I am not, I am just exporting the information whether the device had a driver which returned -EPROBE_DEFER during probe, or not. For that I am looking at the presence of this device on the deferred_probe_pending_list. driver_probe_device -> if (ret == -EPROBE_DEFER || ret == EPROBE_DEFER) driver_deferred_probe_add(dev); -> list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list); driver_bound -> driver_deferred_probe_del -> list_del_init(&dev->p->deferred_probe); So the presence of "dev" inside deferred_probe_pending_list means precisely that a driver is pending to be bound. Second: why would I want to do that. In the case of PHY devices, the driver binding process starts here: phy_device_register -> device_add It begins synchronously, but may not finish due to probe deferral. So after device_add finishes, phydev->drv might be NULL due to 2 reasons: 1. -EPROBE_DEFER triggered by "somebody", either by the PHY driver probe function itself, or by third parties (like device_links_check_suppliers happening to notice that before even calling the driver's probe fn). Anyway, the distinction between these 2 is pretty much irrelevant. 2. There genuinely was no driver loaded in the system for this PHY. Note that the way things are written, the Generic PHY driver will not match on any device in phy_bus_match(). It is bound manually, separately. The PHY library is absolutely happy to work with a headless chicken, a phydev with a NULL phydev->drv. Just search for "if (!phydev->drv)" inside drivers/net/phy/phy.c and drivers/net/phy/phy_device.c. However, the phydev walking with a NULL drv can only last for so long. An Ethernet port will soon need that PHY device, and will attach to it. There are many code paths, all ending in phy_attach_direct. However, when an Ethernet port decides to attach to a PHY device is completely asynchronous to the lifetime of the PHY device itself. This moment is where a driver is really needed, and if none is present, the generic one is force-bound. My patch only distinguishes between case 1 and 2 for which phydev->drv might be NULL. It avoids force-binding the generic PHY when a specific PHY driver was found, but did not finish binding due to probe deferral. > > > If a bus wants to have this type of "generic vs. specific" logic, then > > > it needs to handle it in the bus logic itself as that does NOT fit into > > > the normal driver model at all. Don't try to get a "hint" of this by > > > messing with the probe function list. > > > > Where and how? Do you have an example? > > No I do not, sorry, most busses do not do this for obvious ordering / > loading / we are not that crazy reasons. > > What is causing this all to suddenly break? The devlink stuff? There was a report related to fw_devlink indeed, however strictly speaking, I wouldn't say it is the cause of all this. It is pretty uncommon for a PHY device to defer probing I think, hence the bad assumptions made around it.
On Thu, Sep 2, 2021 at 7:43 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote: > > There are systems where the PHY driver might get its probe deferred due > > to a missing supplier, like an interrupt-parent, gpio, clock or whatever. > > > > If the phy_attach_direct call happens right in between probe attempts, > > the PHY library is greedy and assumes that a specific driver will never > > appear, so it just binds the generic PHY driver. > > > > In certain cases this is the wrong choice, because some PHYs simply need > > the specific driver. The specific PHY driver was going to probe, given > > enough time, but this doesn't seem to matter to phy_attach_direct. > > > > To solve this, make phy_attach_direct check whether a specific PHY > > driver is pending or not, and if it is, just defer the probing of the > > MAC that's connecting to us a bit more too. > > > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > > --- > > drivers/base/dd.c | 21 +++++++++++++++++++-- > > drivers/net/phy/phy_device.c | 8 ++++++++ > > include/linux/device.h | 1 + > > 3 files changed, 28 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > > index 1c379d20812a..b22073b0acd2 100644 > > --- a/drivers/base/dd.c > > +++ b/drivers/base/dd.c > > @@ -128,13 +128,30 @@ static void deferred_probe_work_func(struct work_struct *work) > > } > > static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func); > > > > +static bool __device_pending_probe(struct device *dev) > > +{ > > + return !list_empty(&dev->p->deferred_probe); > > +} > > + > > +bool device_pending_probe(struct device *dev) > > +{ > > + bool pending; > > + > > + mutex_lock(&deferred_probe_mutex); > > + pending = __device_pending_probe(dev); > > + mutex_unlock(&deferred_probe_mutex); > > + > > + return pending; > > +} > > +EXPORT_SYMBOL_GPL(device_pending_probe); > > + > > void driver_deferred_probe_add(struct device *dev) > > { > > if (!dev->can_match) > > return; > > > > mutex_lock(&deferred_probe_mutex); > > - if (list_empty(&dev->p->deferred_probe)) { > > + if (!__device_pending_probe(dev)) { > > dev_dbg(dev, "Added to deferred list\n"); > > list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list); > > } > > @@ -144,7 +161,7 @@ void driver_deferred_probe_add(struct device *dev) > > void driver_deferred_probe_del(struct device *dev) > > { > > mutex_lock(&deferred_probe_mutex); > > - if (!list_empty(&dev->p->deferred_probe)) { > > + if (__device_pending_probe(dev)) { > > dev_dbg(dev, "Removed from deferred list\n"); > > list_del_init(&dev->p->deferred_probe); > > __device_set_deferred_probe_reason(dev, NULL); > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > > index 52310df121de..2c22a32f0a1c 100644 > > --- a/drivers/net/phy/phy_device.c > > +++ b/drivers/net/phy/phy_device.c > > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, > > > > /* Assume that if there is no driver, that it doesn't > > * exist, and we should use the genphy driver. > > + * The exception is during probing, when the PHY driver might have > > + * attempted a probe but has requested deferral. Since there might be > > + * MAC drivers which also attach to the PHY during probe time, try > > + * harder to bind the specific PHY driver, and defer the MAC driver's > > + * probing until then. > > Wait, no, this should not be a "special" thing, and why would the list > of deferred probe show this? > > If a bus wants to have this type of "generic vs. specific" logic, then > it needs to handle it in the bus logic itself as that does NOT fit into > the normal driver model at all. Well, I think that this is a general issue and it appears to me to be present in the driver core too, at least to some extent. Namely, if there are two drivers matching the same device and the first one's ->probe() returns -EPROBE_DEFER, that will be converted to EPROBE_DEFER by really_probe(), so driver_probe_device() will pass it to __device_attach_driver() which then will return 0. This bus_for_each_drv() will call __device_attach_driver() for the second matching driver even though the first one may still probe successfully later. To me, this really is a variant of "if a driver has failed to probe, try another one" which phy_attach_direct() appears to be doing and in both cases the probing of the "alternative" is premature if the probing of the original driver has been deferred. > Don't try to get a "hint" of this by messing with the probe function list. I agree that this doesn't look particularly clean, but then I'm wondering how to address this cleanly.
On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote: > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > index 52310df121de..2c22a32f0a1c 100644 > --- a/drivers/net/phy/phy_device.c > +++ b/drivers/net/phy/phy_device.c > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, > > /* Assume that if there is no driver, that it doesn't > * exist, and we should use the genphy driver. > + * The exception is during probing, when the PHY driver might have > + * attempted a probe but has requested deferral. Since there might be > + * MAC drivers which also attach to the PHY during probe time, try > + * harder to bind the specific PHY driver, and defer the MAC driver's > + * probing until then. > */ > if (!d->driver) { > + if (device_pending_probe(d)) > + return -EPROBE_DEFER; Something else that concerns me here. As noted, many network drivers attempt to attach their PHY when the device is brought up, and not during their probe function. Taking a driver at random: drivers/net/ethernet/renesas/sh_eth.c sh_eth_phy_init() calls of_phy_connect() or phy_connect(), which ultimately calls phy_attach_direct() and propagates the error code via an error pointer. sh_eth_phy_init() propagates the error code to its caller, sh_eth_phy_start(). This is called from sh_eth_open(), which probagates the error code. This is called from .ndo_open... and it's highly likely -EPROBE_DEFER will end up being returned to userspace through either netlink or netdev ioctls. Since EPROBE_DEFER is not an error number that we export to userspace, this should basically never be exposed to userspace, yet we have a path that it _could_ be exposed if the above condition is true. If device_pending_probe() returns true e.g. during initial boot up while modules are being loaded - maybe the phy driver doesn't have all the resources it needs because of some other module that hasn't finished initialising - then we have a window where this will be exposed to userspace. So, do we need to fix all the network drivers to do something if their .ndo_open method encounters this? If so, what? Sleep a bit and try again? How many times to retry? Convert the error code into something else, causing userspace to fail where it worked before? If so which error code? I think this needs to be thought through a bit better. In this case, I feel that throwing -EPROBE_DEFER to solve one problem with one subsystem can result in new problems elsewhere. We did have an idea at one point about reserving some flag bits in phydev->dev_flags for phylib use, but I don't think that happened. If this is the direction we want to go, I think we need to have a flag in dev_flags so that callers opt-in to the new behaviour whereas callers such as from .ndo_open keep the old behaviour - because they just aren't setup to handle an -EPROBE_DEFER return from these functions.
On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote: > On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote: > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > > index 52310df121de..2c22a32f0a1c 100644 > > --- a/drivers/net/phy/phy_device.c > > +++ b/drivers/net/phy/phy_device.c > > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, > > > > /* Assume that if there is no driver, that it doesn't > > * exist, and we should use the genphy driver. > > + * The exception is during probing, when the PHY driver might have > > + * attempted a probe but has requested deferral. Since there might be > > + * MAC drivers which also attach to the PHY during probe time, try > > + * harder to bind the specific PHY driver, and defer the MAC driver's > > + * probing until then. > > */ > > if (!d->driver) { > > + if (device_pending_probe(d)) > > + return -EPROBE_DEFER; > > Something else that concerns me here. > > As noted, many network drivers attempt to attach their PHY when the > device is brought up, and not during their probe function. > > Taking a driver at random: > > drivers/net/ethernet/renesas/sh_eth.c > > sh_eth_phy_init() calls of_phy_connect() or phy_connect(), which > ultimately calls phy_attach_direct() and propagates the error code > via an error pointer. > > sh_eth_phy_init() propagates the error code to its caller, > sh_eth_phy_start(). This is called from sh_eth_open(), which > probagates the error code. This is called from .ndo_open... and it's > highly likely -EPROBE_DEFER will end up being returned to userspace > through either netlink or netdev ioctls. > > Since EPROBE_DEFER is not an error number that we export to > userspace, this should basically never be exposed to userspace, yet > we have a path that it _could_ be exposed if the above condition > is true. > > If device_pending_probe() returns true e.g. during initial boot up > while modules are being loaded - maybe the phy driver doesn't have > all the resources it needs because of some other module that hasn't > finished initialising - then we have a window where this will be > exposed to userspace. > > So, do we need to fix all the network drivers to do something if > their .ndo_open method encounters this? If so, what? Sleep a bit > and try again? How many times to retry? Convert the error code into > something else, causing userspace to fail where it worked before? If > so which error code? It depends what is the outcome you're going for. If there's a PHY driver pending, I would do something to wait for that if I could, it would be silly for the PHY driver to be loading but the PHY to still be bound to genphy. I feel that connecting to the PHY from the probe path is the overall cleaner way to go since it deals with this automatically, but due to the sheer volume of drivers that connect from .ndo_open, modifying them in bulk is out of the question. Something sensible needs to happen with them too, and 'genphy is what you get' might be just that, which is basically what is happening without these patches. On that note, I don't know whether there is any objective advantage to connecting to the PHY at .ndo_open time. > > I think this needs to be thought through a bit better. In this case, > I feel that throwing -EPROBE_DEFER to solve one problem with one > subsystem can result in new problems elsewhere. > > We did have an idea at one point about reserving some flag bits in > phydev->dev_flags for phylib use, but I don't think that happened. > If this is the direction we want to go, I think we need to have a > flag in dev_flags so that callers opt-in to the new behaviour whereas > callers such as from .ndo_open keep the old behaviour - because they > just aren't setup to handle an -EPROBE_DEFER return from these > functions. Or that, yes. I hadn't actually thought about using PHY flags, but I suppose callers which already can cope with EPROBE_DEFER (they connect from probe) can opt into that.
On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote: > On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote: > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > > index 52310df121de..2c22a32f0a1c 100644 > > --- a/drivers/net/phy/phy_device.c > > +++ b/drivers/net/phy/phy_device.c > > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, > > > > /* Assume that if there is no driver, that it doesn't > > * exist, and we should use the genphy driver. > > + * The exception is during probing, when the PHY driver might have > > + * attempted a probe but has requested deferral. Since there might be > > + * MAC drivers which also attach to the PHY during probe time, try > > + * harder to bind the specific PHY driver, and defer the MAC driver's > > + * probing until then. > > */ > > if (!d->driver) { > > + if (device_pending_probe(d)) > > + return -EPROBE_DEFER; > > Something else that concerns me here. > > As noted, many network drivers attempt to attach their PHY when the > device is brought up, and not during their probe function. Yes, this is going to be a problem. I agree it is too late to return -EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the device is still on the deferred list, otherwise use genphy. And maybe a timeout and return -ENODEV, which is not 100% correct, we know the device exists, we just cannot drive it. Can we tell we are in the context of a driver probe? Or do we need to add a parameter to the various phy_attach API calls to let the core know if this is probe or open? This is more likely to be a problem with NFS root, with the kernel bringing up an interface as soon as its registered. userspace bringing up interfaces is generally much later, and udev tends to wait around until there are no more driver load requests before the boot continues. Andrew
On 9/2/2021 12:51 PM, Andrew Lunn wrote: > On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote: >> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote: >>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c >>> index 52310df121de..2c22a32f0a1c 100644 >>> --- a/drivers/net/phy/phy_device.c >>> +++ b/drivers/net/phy/phy_device.c >>> @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, >>> >>> /* Assume that if there is no driver, that it doesn't >>> * exist, and we should use the genphy driver. >>> + * The exception is during probing, when the PHY driver might have >>> + * attempted a probe but has requested deferral. Since there might be >>> + * MAC drivers which also attach to the PHY during probe time, try >>> + * harder to bind the specific PHY driver, and defer the MAC driver's >>> + * probing until then. >>> */ >>> if (!d->driver) { >>> + if (device_pending_probe(d)) >>> + return -EPROBE_DEFER; >> >> Something else that concerns me here. >> >> As noted, many network drivers attempt to attach their PHY when the >> device is brought up, and not during their probe function. > > Yes, this is going to be a problem. I agree it is too late to return > -EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the > device is still on the deferred list, otherwise use genphy. And maybe > a timeout and return -ENODEV, which is not 100% correct, we know the > device exists, we just cannot drive it. Is it really going to be a problem though? The two cases where this will matter is if we use IP auto-configuration within the kernel, which this patchset ought to be helping with, if we are already in user-space and the PHY is connected at .ndo_open() time, there is a whole lot of things that did happen prior to getting there, such as udevd using modaliases in order to load every possible module we might, so I am debating whether we will really see a probe deferral at all. > > Can we tell we are in the context of a driver probe? Or do we need to > add a parameter to the various phy_attach API calls to let the core > know if this is probe or open? Actually we do the RTNL lock will be held during ndo_open and it won't during driver probe. > > This is more likely to be a problem with NFS root, with the kernel > bringing up an interface as soon as its registered. userspace bringing > up interfaces is generally much later, and udev tends to wait around > until there are no more driver load requests before the boot > continues. See my point above, with Vladimir's change, we should have fw_devlink do its job such that by the time the network interface is needed for IP auto-configuration, all of its depending resources should also be ready, would not they?
On Thu, Sep 02, 2021 at 01:33:57PM -0700, Florian Fainelli wrote: > On 9/2/2021 12:51 PM, Andrew Lunn wrote: > > On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote: > > > On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote: > > > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > > > > index 52310df121de..2c22a32f0a1c 100644 > > > > --- a/drivers/net/phy/phy_device.c > > > > +++ b/drivers/net/phy/phy_device.c > > > > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, > > > > /* Assume that if there is no driver, that it doesn't > > > > * exist, and we should use the genphy driver. > > > > + * The exception is during probing, when the PHY driver might have > > > > + * attempted a probe but has requested deferral. Since there might be > > > > + * MAC drivers which also attach to the PHY during probe time, try > > > > + * harder to bind the specific PHY driver, and defer the MAC driver's > > > > + * probing until then. > > > > */ > > > > if (!d->driver) { > > > > + if (device_pending_probe(d)) > > > > + return -EPROBE_DEFER; > > > > > > Something else that concerns me here. > > > > > > As noted, many network drivers attempt to attach their PHY when the > > > device is brought up, and not during their probe function. > > > > Yes, this is going to be a problem. I agree it is too late to return > > -EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the > > device is still on the deferred list, otherwise use genphy. And maybe > > a timeout and return -ENODEV, which is not 100% correct, we know the > > device exists, we just cannot drive it. > > Is it really going to be a problem though? The two cases where this will > matter is if we use IP auto-configuration within the kernel, which this > patchset ought to be helping with There is no handling of EPROBE_DEFER in the IP auto-configuration code while trying to bring up interfaces: for_each_netdev(&init_net, dev) { if (ic_is_init_dev(dev)) { ... oflags = dev->flags; if (dev_change_flags(dev, oflags | IFF_UP, NULL) < 0) { pr_err("IP-Config: Failed to open %s\n", dev->name); continue; } So, the only way this could be reliable is if we can guarantee that all deferred probes will have been retried by the time we get here. Do we have that guarantee? > if we are already in user-space and the > PHY is connected at .ndo_open() time, there is a whole lot of things that > did happen prior to getting there, such as udevd using modaliases in order > to load every possible module we might, so I am debating whether we will > really see a probe deferral at all. As can be seen from my recent posts which show on Debian Buster that interfaces are attempted to be brought up while e.g. mv88e6xxx is still probing, we can't make any guarantees that things have "settled" by the time userspace attempts to bring up the network interfaces. I may have more on why that is happening... I won't post it here, I'll post to the other thread. > > Can we tell we are in the context of a driver probe? Or do we need to > > add a parameter to the various phy_attach API calls to let the core > > know if this is probe or open? > > Actually we do the RTNL lock will be held during ndo_open and it won't > during driver probe. That's probably an unreliable indicator. DPAA2 has weirdness in the way it can dynamically create and destroy network interfaces, which does lead to problems with the rtnl lock. I've been carrying a patch from NXP for this for almost two years now, which NXP still haven't submitted: http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4 ... and I've no idea why that patch never made mainline. I need it to avoid the stated deadlock on SolidRun Honeycomb platforms when creating additional network interfaces for the SFP cages in userspace.
On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote: > That's probably an unreliable indicator. DPAA2 has weirdness in the > way it can dynamically create and destroy network interfaces, which > does lead to problems with the rtnl lock. I've been carrying a patch > from NXP for this for almost two years now, which NXP still haven't > submitted: > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4 > > ... and I've no idea why that patch never made mainline. I need it > to avoid the stated deadlock on SolidRun Honeycomb platforms when > creating additional network interfaces for the SFP cages in userspace. Ah, nice, I've copied that broken logic for the dpaa2-switch too: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065 So why don't you send the patch? I can send it too if you want to, one for the switch and one for the DPNI driver.
On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote: > On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote: > > That's probably an unreliable indicator. DPAA2 has weirdness in the > > way it can dynamically create and destroy network interfaces, which > > does lead to problems with the rtnl lock. I've been carrying a patch > > from NXP for this for almost two years now, which NXP still haven't > > submitted: > > > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4 > > > > ... and I've no idea why that patch never made mainline. I need it > > to avoid the stated deadlock on SolidRun Honeycomb platforms when > > creating additional network interfaces for the SFP cages in userspace. > > Ah, nice, I've copied that broken logic for the dpaa2-switch too: > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065 > > So why don't you send the patch? I can send it too if you want to, one > for the switch and one for the DPNI driver. Sorry, I mis-stated. NXP did submit that exact patch, but it's actually incorrect for the reason I stated when it was sent: https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/ I did miss the rtnl_lock() around phylink_disconnect_phy() in the description of the race, which goes someway towards hiding it, but there is still a race between phylink_destroy() and another thread calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed: static int dpaa2_eth_get_link_ksettings(struct net_device *net_dev, struct ethtool_link_ksettings *link_settings) { struct dpaa2_eth_priv *priv = netdev_priv(net_dev); if (dpaa2_eth_is_type_phy(priv)) return phylink_ethtool_ksettings_get(priv->mac->phylink, link_settings); which dereferences priv->mac and priv->mac->phylink, vs: static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg) { ... if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) { dpaa2_eth_set_mac_addr(netdev_priv(net_dev)); dpaa2_eth_update_tx_fqids(priv); if (dpaa2_eth_has_mac(priv)) dpaa2_eth_disconnect_mac(priv); else dpaa2_eth_connect_mac(priv); } static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv) { if (dpaa2_eth_is_type_phy(priv)) dpaa2_mac_disconnect(priv->mac); if (!dpaa2_eth_has_mac(priv)) return; dpaa2_mac_close(priv->mac); kfree(priv->mac); <== potential use after free bug by priv->mac = NULL; <== dpaa2_eth_get_link_ksettings() } void dpaa2_mac_disconnect(struct dpaa2_mac *mac) { if (!mac->phylink) return; phylink_disconnect_phy(mac->phylink); phylink_destroy(mac->phylink); <== another use-after-free bug via dpaa2_eth_get_link_ksettings() dpaa2_pcs_destroy(mac); } Note that phylink_destroy() is documented as: * Note: the rtnl lock must not be held when calling this function. because it calls sfp_bus_del_upstream(), which will take the rtnl lock itself. An alternative solution would be to remove the rtnl locking from sfp_bus_del_upstream(), but then force _everyone_ to take the rtnl lock before calling phylink_destroy() - meaning a larger block of code ends up executing under the lock than is really necessary. However, as I stated in my review of the patch "As I've already stated, the phylink is not designed to be created and destroyed on a published network device." That still remains true today, and it seems that the issue has never been fixed in DPAA2 despite having been pointed out.
On Thu, Sep 02, 2021 at 11:24:39PM +0100, Russell King (Oracle) wrote: > On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote: > > On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote: > > > That's probably an unreliable indicator. DPAA2 has weirdness in the > > > way it can dynamically create and destroy network interfaces, which > > > does lead to problems with the rtnl lock. I've been carrying a patch > > > from NXP for this for almost two years now, which NXP still haven't > > > submitted: > > > > > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4 > > > > > > ... and I've no idea why that patch never made mainline. I need it > > > to avoid the stated deadlock on SolidRun Honeycomb platforms when > > > creating additional network interfaces for the SFP cages in userspace. > > > > Ah, nice, I've copied that broken logic for the dpaa2-switch too: > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065 > > > > So why don't you send the patch? I can send it too if you want to, one > > for the switch and one for the DPNI driver. > > Sorry, I mis-stated. NXP did submit that exact patch, but it's actually > incorrect for the reason I stated when it was sent: > > https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/ So why are you carrying it then? > I did miss the rtnl_lock() around phylink_disconnect_phy() in the > description of the race, which goes someway towards hiding it, but > there is still a race between phylink_destroy() and another thread > calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed: > > static int > dpaa2_eth_get_link_ksettings(struct net_device *net_dev, > struct ethtool_link_ksettings *link_settings) > { > struct dpaa2_eth_priv *priv = netdev_priv(net_dev); > > if (dpaa2_eth_is_type_phy(priv)) > return phylink_ethtool_ksettings_get(priv->mac->phylink, > link_settings); > > which dereferences priv->mac and priv->mac->phylink, vs: > > static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg) > { > ... > if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) { > dpaa2_eth_set_mac_addr(netdev_priv(net_dev)); > dpaa2_eth_update_tx_fqids(priv); > > if (dpaa2_eth_has_mac(priv)) > dpaa2_eth_disconnect_mac(priv); > else > dpaa2_eth_connect_mac(priv); > } > > static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv) > { > if (dpaa2_eth_is_type_phy(priv)) > dpaa2_mac_disconnect(priv->mac); > > if (!dpaa2_eth_has_mac(priv)) > return; > > dpaa2_mac_close(priv->mac); > kfree(priv->mac); <== potential use after free bug by > priv->mac = NULL; <== dpaa2_eth_get_link_ksettings() > } Okay, so this needs to stay under the rtnetlink mutex, to serialize with dpaa2_eth_get_link_ksettings which is already under the rtnetlink mutex. So the way in which rtnl_lock is taken right now is actually fine in a way. > > void dpaa2_mac_disconnect(struct dpaa2_mac *mac) > { > if (!mac->phylink) > return; > > phylink_disconnect_phy(mac->phylink); > phylink_destroy(mac->phylink); <== another use-after-free bug via > dpaa2_eth_get_link_ksettings() > dpaa2_pcs_destroy(mac); > } > > Note that phylink_destroy() is documented as: > > * Note: the rtnl lock must not be held when calling this function. > > because it calls sfp_bus_del_upstream(), which will take the rtnl lock > itself. An alternative solution would be to remove the rtnl locking > from sfp_bus_del_upstream(), but then force _everyone_ to take the > rtnl lock before calling phylink_destroy() - meaning a larger block of > code ends up executing under the lock than is really necessary. So phylink_destroy has exactly 20 call sites, it is not that bad? And as for "larger block than necessary" - doesn't the dpaa2 prolonged usage count as necessary? > However, as I stated in my review of the patch "As I've already stated, > the phylink is not designed to be created and destroyed on a published > network device." That still remains true today, and it seems that the > issue has never been fixed in DPAA2 despite having been pointed out. So what would you do, exactly, to "fix" the issue that a DPNI can connect and disconnect at runtime from a DPMAC? Also, "X is not designed to Y" doesn't really say much, given a bit of will power. Linux was not designed to run on non-i386 either. Any other issues besides needing to take rtnl_mutex top-level when calling phylink_destroy? Since phylink_disconnect_phy needs it anyway, and phylink_destroy ends up calling sfp_bus_del_upstream which takes the same mutex again, and drivers that connect/disconnect at probe/remove time end up calling both in a row, I don't think there is much of an issue to speak of, or that the rework would be that difficult.
> > Note that phylink_destroy() is documented as: > > > > * Note: the rtnl lock must not be held when calling this function. > > ... > > Any other issues besides needing to take rtnl_mutex top-level when > calling phylink_destroy? We should try to keep phylink_create and phylink_destroy symmetrical: /** * phylink_create() - create a phylink instance * @config: a pointer to the target &struct phylink_config * @fwnode: a pointer to a &struct fwnode_handle describing the network * interface * @iface: the desired link mode defined by &typedef phy_interface_t * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC. * * Create a new phylink instance, and parse the link parameters found in @np. * This will parse in-band modes, fixed-link or SFP configuration. * * Note: the rtnl lock must not be held when calling this function. Having different locking requirements will catch people out. Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such a macro. Andrew
On Fri, Sep 03, 2021 at 01:02:06AM +0200, Andrew Lunn wrote: > We should try to keep phylink_create and phylink_destroy symmetrical: > > /** > * phylink_create() - create a phylink instance > * @config: a pointer to the target &struct phylink_config > * @fwnode: a pointer to a &struct fwnode_handle describing the network > * interface > * @iface: the desired link mode defined by &typedef phy_interface_t > * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC. > * > * Create a new phylink instance, and parse the link parameters found in @np. > * This will parse in-band modes, fixed-link or SFP configuration. > * > * Note: the rtnl lock must not be held when calling this function. > > Having different locking requirements will catch people out. > > Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such > a macro. In this case, the easiest might be to just take a different mutex in dpaa2 which serializes all places that access the priv->mac references. I don't know exactly why the SFP bus needs the rtnl_mutex, I've removed those locks and will see what fails tomorrow, but I don't think dpaa2 has a good enough justification to take the rtnl_mutex just so that it can connect and disconnect to the MAC freely at runtime.
On Fri, Sep 03, 2021 at 02:26:07AM +0300, Vladimir Oltean wrote: > On Fri, Sep 03, 2021 at 01:02:06AM +0200, Andrew Lunn wrote: > > We should try to keep phylink_create and phylink_destroy symmetrical: > > > > /** > > * phylink_create() - create a phylink instance > > * @config: a pointer to the target &struct phylink_config > > * @fwnode: a pointer to a &struct fwnode_handle describing the network > > * interface > > * @iface: the desired link mode defined by &typedef phy_interface_t > > * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC. > > * > > * Create a new phylink instance, and parse the link parameters found in @np. > > * This will parse in-band modes, fixed-link or SFP configuration. > > * > > * Note: the rtnl lock must not be held when calling this function. > > > > Having different locking requirements will catch people out. > > > > Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such > > a macro. > > In this case, the easiest might be to just take a different mutex in > dpaa2 which serializes all places that access the priv->mac references. > I don't know exactly why the SFP bus needs the rtnl_mutex, I've removed > those locks and will see what fails tomorrow, but I don't think dpaa2 > has a good enough justification to take the rtnl_mutex just so that it > can connect and disconnect to the MAC freely at runtime. It needs it to ensure that the sfp-bus code is safe. sfp-bus code sits between phylink and the sfp stuff, and will be called from either side. It can't have its own lock, because that gives lockdep splats. Removing a lock and then running the kernel is a down right stupid way to test to see if a lock is necessary. That approach is like having built a iron bridge, covered it in paint, then you remove most the bolts, and then test to see whether it's safe for vehicles to travel over it by riding your bicycle across it and declaring it safe. Sorry, but if you think "remove lock, run kernel, if it works fine the lock is unnecessary" is a valid approach, then you've just disqualified yourself from discussing this topic any further. Locking is done by knowing the code and code analysis, not by playing "does the code fail if I remove it" games. I am utterly shocked that you think that this is a valid approach.
On Thu, Sep 02, 2021 at 11:24:39PM +0100, Russell King (Oracle) wrote: > On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote: > > On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote: > > > That's probably an unreliable indicator. DPAA2 has weirdness in the > > > way it can dynamically create and destroy network interfaces, which > > > does lead to problems with the rtnl lock. I've been carrying a patch > > > from NXP for this for almost two years now, which NXP still haven't > > > submitted: > > > > > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4 > > > > > > ... and I've no idea why that patch never made mainline. I need it > > > to avoid the stated deadlock on SolidRun Honeycomb platforms when > > > creating additional network interfaces for the SFP cages in userspace. > > > > Ah, nice, I've copied that broken logic for the dpaa2-switch too: > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065 > > > > So why don't you send the patch? I can send it too if you want to, one > > for the switch and one for the DPNI driver. > > Sorry, I mis-stated. NXP did submit that exact patch, but it's actually > incorrect for the reason I stated when it was sent: > > https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/ > > I did miss the rtnl_lock() around phylink_disconnect_phy() in the > description of the race, which goes someway towards hiding it, but > there is still a race between phylink_destroy() and another thread > calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed: > > static int > dpaa2_eth_get_link_ksettings(struct net_device *net_dev, > struct ethtool_link_ksettings *link_settings) > { > struct dpaa2_eth_priv *priv = netdev_priv(net_dev); > > if (dpaa2_eth_is_type_phy(priv)) > return phylink_ethtool_ksettings_get(priv->mac->phylink, > link_settings); > > which dereferences priv->mac and priv->mac->phylink, vs: > > static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg) > { > ... > if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) { > dpaa2_eth_set_mac_addr(netdev_priv(net_dev)); > dpaa2_eth_update_tx_fqids(priv); > > if (dpaa2_eth_has_mac(priv)) > dpaa2_eth_disconnect_mac(priv); > else > dpaa2_eth_connect_mac(priv); > } > > static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv) > { > if (dpaa2_eth_is_type_phy(priv)) > dpaa2_mac_disconnect(priv->mac); > > if (!dpaa2_eth_has_mac(priv)) > return; > > dpaa2_mac_close(priv->mac); > kfree(priv->mac); <== potential use after free bug by > priv->mac = NULL; <== dpaa2_eth_get_link_ksettings() > } > > void dpaa2_mac_disconnect(struct dpaa2_mac *mac) > { > if (!mac->phylink) > return; > > phylink_disconnect_phy(mac->phylink); > phylink_destroy(mac->phylink); <== another use-after-free bug via > dpaa2_eth_get_link_ksettings() > dpaa2_pcs_destroy(mac); > } > > Note that phylink_destroy() is documented as: > > * Note: the rtnl lock must not be held when calling this function. > > because it calls sfp_bus_del_upstream(), which will take the rtnl lock > itself. An alternative solution would be to remove the rtnl locking > from sfp_bus_del_upstream(), but then force _everyone_ to take the > rtnl lock before calling phylink_destroy() - meaning a larger block of > code ends up executing under the lock than is really necessary. > > However, as I stated in my review of the patch "As I've already stated, > the phylink is not designed to be created and destroyed on a published > network device." That still remains true today, and it seems that the > issue has never been fixed in DPAA2 despite having been pointed out. > My attempt to fix this issue was that patch that you just pointed at. Taking your feedback into account (that phylink is not designed to be created and destroyed on a published networking device) I really do not know what other viable solution to send out. The alternative here would have been to just have a different driver for the MAC side (probing on dpmac objects) that creates the phylink instance at probe time and then is just used by the dpaa2-eth driver when it connects to a dpmac. This way no phylink is created/destroyed dynamically. This was the architecture of my initial attempt at supporting phylink in DPAA2. https://patchwork.ozlabs.org/project/netdev/patch/1560470153-26155-5-git-send-email-ioana.ciornei@nxp.com/ If you have any suggestion on how I should go about fixing this, please let me know. Ioana
On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote: > Removing a lock and then running the kernel is a down right stupid > way to test to see if a lock is necessary. > > That approach is like having built a iron bridge, covered it in paint, > then you remove most the bolts, and then test to see whether it's safe > for vehicles to travel over it by riding your bicycle across it and > declaring it safe. > > Sorry, but if you think "remove lock, run kernel, if it works fine > the lock is unnecessary" is a valid approach, then you've just > disqualified yourself from discussing this topic any further. > Locking is done by knowing the code and code analysis, not by > playing "does the code fail if I remove it" games. I am utterly > shocked that you think that this is a valid approach. ... and this is exactly why you will no longer get any attention from me on this topic. Good luck.
On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote: > On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote: > > Removing a lock and then running the kernel is a down right stupid > > way to test to see if a lock is necessary. > > > > That approach is like having built a iron bridge, covered it in paint, > > then you remove most the bolts, and then test to see whether it's safe > > for vehicles to travel over it by riding your bicycle across it and > > declaring it safe. > > > > Sorry, but if you think "remove lock, run kernel, if it works fine > > the lock is unnecessary" is a valid approach, then you've just > > disqualified yourself from discussing this topic any further. > > Locking is done by knowing the code and code analysis, not by > > playing "does the code fail if I remove it" games. I am utterly > > shocked that you think that this is a valid approach. > > ... and this is exactly why you will no longer get any attention from me > on this topic. Good luck. Good, because your approach to this to me reads as "I don't think you know what the hell you're doing so I'm going to remove a lock to test whether it is needed." Effectively, that action is an insult towards me as the author of that code. And as I said, if you think that's a valid approach, then quite frankly I don't want you touching my code, because you clearly don't know what you're doing as you aren't willing to put the necessary effort in to understanding the code. Removing a lock and running the kernel is _never_ a valid way to see whether the lock is required or not. The only way is via code analysis. I wonder whether you'd take the same approach with filesystems or memory management code. Why don't you try removing some locks from those subsystems and see how long your filesystems last? You could have asked why the lock was necessary, and I would have described it. That would have been the civil approach. Maybe even put forward a hypothesis why you think the lock isn't necessary, but no, you decide that the best way to go about this is to remove the lock and see whether the kernel breaks. It may shock you to know that those of us who have been working on the kernel for almost 30 years and have seen the evolution of the kernel from uniprocessor to SMP, have had to debug race conditions caused by a lack of locking know very well that you can have what seems to be a functioning kernel despite missing locks - and such a kernel can last quite a long time and only show up the race quite rarely. This is exactly why "lets remove the lock and see if it breaks" is a completely invalid approach. I'm sorry that you don't seem to realise just how stupid a suggestion that was. I can tell you now: removing the locks you proposed will not show an immediate problem, but by removing those locks you will definitely open up race conditions between driver binding events on the SFP side and network usage on the netdev side which will only occur rarely. And just because they only happen rarely is not a justification to remove locks, no matter how inconvenient those locks may be.
[ again, trimming the CC list, because I assume most people don't care, and if they do, the mailing lists are there for that ] On Fri, Sep 03, 2021 at 11:06:23PM +0100, Russell King (Oracle) wrote: > On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote: > > On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote: > > > Removing a lock and then running the kernel is a down right stupid > > > way to test to see if a lock is necessary. > > > > > > That approach is like having built a iron bridge, covered it in paint, > > > then you remove most the bolts, and then test to see whether it's safe > > > for vehicles to travel over it by riding your bicycle across it and > > > declaring it safe. > > > > > > Sorry, but if you think "remove lock, run kernel, if it works fine > > > the lock is unnecessary" is a valid approach, then you've just > > > disqualified yourself from discussing this topic any further. > > > Locking is done by knowing the code and code analysis, not by > > > playing "does the code fail if I remove it" games. I am utterly > > > shocked that you think that this is a valid approach. > > > > ... and this is exactly why you will no longer get any attention from me > > on this topic. Good luck. > > Good, because your approach to this to me reads as "I don't think you > know what the hell you're doing so I'm going to remove a lock to test > whether it is needed." Effectively, that action is an insult towards > me as the author of that code. The reason why you aren't getting any of my attention is your attitude, in case it was not clear. You've transformed a few words I said and which were entirely reasonable, "I don't know exactly why the SFP bus needs the rtnl_mutex, I've removed those locks and will see what fails tomorrow", into a soap opera based on something I did not say. > And as I said, if you think that's a valid approach, then quite frankly > I don't want you touching my code, because you clearly don't know what > you're doing as you aren't willing to put the necessary effort in to > understanding the code. > > Removing a lock and running the kernel is _never_ a valid way to see > whether the lock is required or not. The only way is via code analysis. It is a completely valid approach for a simple reason: if there was an obvious reason why the SFP bus code would have needed serialization through the rtnetlink mutex, I could have found out by looking at all the failed assertions and said to myself "oh, yeah, right, of course", instead of spending several hours looking at the code, at which point I would have had fewer chances of figuring out anyway. Effectively, with no assertions failing except those from the phylink upstream SFP ops (which aren't really an indication that any data structures are protected, no comments, etc), things are much less obvious. If I knew from the get-go it would come to this, I would have asked, rest assured. I just did not want to ask an obvious question, I was more or less thinking out loud about what I am going to do next. > I wonder whether you'd take the same approach with filesystems or > memory management code. Why don't you try removing some locks from > those subsystems and see how long your filesystems last? This is a completely irrelevant and wrong argument, of course there are sandboxes in which incompetent people can do insane things without doing any damage, even if the subsystems they are interested in are filesystems and memory management. It brings exactly nothing to the discussion. > You could have asked why the lock was necessary, and I would have > described it. That would have been the civil approach. Maybe even > put forward a hypothesis why you think the lock isn't necessary, but > no, you decide that the best way to go about this is to remove the > lock and see whether the kernel breaks. > > It may shock you to know that those of us who have been working on > the kernel for almost 30 years and have seen the evolution of the > kernel from uniprocessor to SMP, have had to debug race conditions > caused by a lack of locking know very well that you can have what > seems to be a functioning kernel despite missing locks - and such a > kernel can last quite a long time and only show up the race quite > rarely. This is exactly why "lets remove the lock and see if it > breaks" is a completely invalid approach. I'm sorry that you don't > seem to realise just how stupid a suggestion that was. > > I can tell you now: removing the locks you proposed will not show an > immediate problem, but by removing those locks you will definitely > open up race conditions between driver binding events on the SFP > side and network usage on the netdev side which will only occur > rarely. > > And just because they only happen rarely is not a justification to > remove locks, no matter how inconvenient those locks may be. So I really wasn't going to do it, since I have absolutely no stake in this, but I happened to be on a plane today for several hours with literally nothing better to do, so I went through the phylink_create and phylink_destroy code path, with the intention of seeing whether there is something it does which fundamentally needs to be serialized by the rtnetlink mutex. If the mere idea of me removing a lock was insulting to you, I've no idea what atrocity this might even compare to. But suffice to say, I spent several hours and it is not obvious at all, based on code analysis as you wish, why it must be the rtnl_lock and not any other mutex taken by both the SFP module driver and the SFP upstream consumer (phylink), with the same semantics except not the mega-bloated rtnetlink mutex. These are my notes from the plane, it is a single pass (the second pass will most likely not happen), again it is purely based on code analysis as you requested, non-expert of course because it is the first time I look at the details or even study the code paths, and I haven't even run the code without the rtnetlink protection as I originally intended. phylink_register_sfp -> bus = sfp_bus_find_fwnode(fwnode) -> fwnode_property_get_reference_args(fwnode) -> bus = sfp_bus_get(fwnode) -> mutex_lock(&sfp_mutex) -> search for fwnode in sfp->fwnode of sfp_buses list # side note, the iterator in this function should have been named "bus", not "sfp", for consistency -> if found, kref_get(&sfp->kref) -> else allocate new sfp bus with this sfp->fwnode, and kref_init -> mutex_unlock(&sfp_mutex) -> fwnode_handle_put(fwnode) -> pl->sfp_bus = bus -> sfp_bus_add_upstream(bus, pl) -> rtnl_lock() -> kref_get(bus->kref) <- why? this increments from 1 to 2. Indicative of possibly concurrent code -> bus->upstream = pl -> if (bus->sfp) <- this code path does not populate bus->sfp, so unless code is running concurrently (?!) branch is not taken -> sfp_register_bus(bus) -> rtnl_unlock() -> if (ret) => sfp_bus_put(bus) <= on error this decrements the kref back from 2 to 1 -> kref_put_mutex(&bus->kref, sfp_bus_release, &sfp_mutex) -> sfp_bus_put(bus) -> on error, drops the kref from 1 to 0 and frees the bus under the sfp_mutex -> on normal path, drops the kref from 2 to 1 Ok, why would bus->sfp be non-NULL (how would the sfp_register_bus possibly be triggered by this function)? sfp->bus is set from: sfp_unregister_socket(bus) -> rtnl_lock -> if (bus->upstream_ops) sfp_unregister_bus(bus) -> sfp_socket_clear(bus) -> bus->sfp = NULL -> rtnl_unlock -> sfp_bus_put(bus) sfp_register_socket(dev, sfp, ops) -> bus = sfp_bus_get(dev->fwnode) -> rtnl_lock -> bus->sfp_dev = dev; -> bus->sfp = sfp; -> bus->socket_ops = ops; -> if (bus->upstream_ops) => sfp_register_bus(bus); -> rtnl_unlock -> on error => sfp_bus_put(bus) -> return bus Who calls sfp_register_socket and sfp_unregister_socket? sfp_probe (the driver for the cage) -> sfp->sfp_bus = sfp_register_socket(sfp->dev, sfp, &sfp_module_ops) sfp_remove -> sfp_unregister_socket(sfp->sfp_bus) So sfp_register_bus can be called either by phylink_register_sfp(the upstream side) or sfp_probe(the cage side). They are serialized by the rtnl_mutex. The bus is only registered when both sides are ready: phylink_register_sfp registers the sfp_bus if sfp_probe was already called first sfp_probe registers the sfp_bus if phylink_register_sfp was already called first Finally, what does sfp_register_bus do? sfp_register_bus(bus) -> bus->upstream_ops->link_down(bus->upstream) -> phylink_sfp_link_down(pl) -> phylink_run_resolve_and_disable(pl, PHYLINK_DISABLE_LINK) -> appears to be patched by commit 87454b6edc1b ("net: phylink: avoid resolving link state too early") to do nothing when called from phylink_register_sfp, since phylink_create calls: __set_bit(PHYLINK_DISABLE_STOPPED, &pl->phylink_disable_state); before calling phylink_register_sfp -> probably does something only if phylink_start(pl) was called, and sfp_register_socket is called afterwards. In that case, it will do this: -> queue_work(system_power_efficient_wq, &pl->resolve); -> flush_work(&pl->resolve); So it will practically call phylink_resolve and wait for it to finish. -> if bus->phydev exists by now => bus->upstream_ops->connect_phy(bus->upstream, bus->phydev) -> phylink_sfp_connect_phy(pl, phy) -> phylink_sfp_config -> does not seem to do anything requiring rtnl_mutex: it mostly seems to call phylink_validate and sfp_select_interface, which are stateless, and modify pl->supported, pl->link_config, pl->cur_link_an_mode, pl->link_port. -> It also calls phylink_mac_initial_config, but not from the phylink_create code path, because it checks PHYLINK_DISABLE_STOPPED again, as discussed -> phylink_mac_initial_config again appears to be stateless from phylink's perspective -> phylink_attach_phy -> phy_attach_direct # does not require rtnl_mutex, see the plethora of drivers which call variants of this function without holding this lock. # This is literally the premise that led to the discussion, which Florian pointed out: that you can differentiate between # drivers that connect to the PHY in .ndo_open from those that do so at probe time by looking at whether the rtnetlink mutex # is taken (which it will be from .ndo_open) -> phylink_bringup_phy -> phylink_validate, stateless -> populates phy->phylink and phy->phy_link_change, but not under mutex_lock(&phy->lock), which is probably fine because there are no concurrent readers -> mutex_lock(&phy->lock) -> mutex_lock(&pl->state_mutex) # serialize with phylink_disconnect_phy, phylink_ethtool_ksettings_set, phylink_ethtool_set_pauseparam, phylink_resolve, phylink_phy_change -> change pl->phydev, pl->phy_state, pl->supported, pl->link_config, phy->advertising -> mutex_unlock(&pl->state_mutex) -> mutex_unlock(&phy_lock) -> phy_request_interrupt -> request_threaded_irq -> phy_enable_interrupts -> calls driver, does not need rtnl_mutex -> bus->registered = true; -> bus->socket_ops->attach(bus->sfp); -> sfp_attach(sfp) -> sfp_sm_event(sfp, SFP_E_DEV_ATTACH) -> It must be said that all callers of this function are taking the rtnl_mutex, so there must be something going on in here. -> mutex_lock(&sfp->sm_mutex); -> sfp_sm_device -> Tracks the upstream's state as per the comment. -> changes sfp->sm_dev_state. I think the upstream state is changed on the attach event only if the sfp->sm_dev_state was SFP_DEV_DETACHED, otherwise nothing happens. The new upstream state appears to transition to SFP_DEV_DOWN. -> sfp_sm_module -> As per the comment, tracks the insert/remove state of the module, probes the on-board EEPROM, and sets up the power level. -> On SFP_E_DEV_ATTACH, it only does something if it was in the SFP_MOD_WAITDEV state. -> As per the comment, report the module insertion to the upstream device -> sfp_module_insert(sfp->sfp_bus, &sfp->id); -> bus->upstream_ops->module_insert(bus->upstream, bus) -> phylink_sfp_module_insert -> sets pl->sfp_port and pl->sfp_may_have_phy and pl->sfp_may_have_phy -> optionally calls phylink_sfp_config, which as discussed above does not seem to need the rtnetlink mutex's protection -> transitions sfp->sm_mod_state into SFP_MOD_HPOWER and falls through into that state's handler -> sfp_sm_mod_hpower -> a bunch of i2c_reads and i2c_writes, probably do not need rtnl_mutex -> transitions into SFP_MOD_WAITPWR and sets up a timer -> sfp_sm_main -> we should be entering with sfp->sm_dev_state == SFP_DEV_DOWN and god knows what sfp->sm_mod_state, because the module has its own life independent of the upstream attaching or not. Most interesting plausible module state is SFP_MOD_WAITDEV, which just became SFP_MOD_WAITPWR. The main state machins is probably in sfp->sm_state == SFP_S_DOWN too. -> Because sfp->sm_state is SFP_S_DOWN, the branch under "Some events are global" does probably not get executed at all. -> The SFP_S_DOWN handler under the "The main state machine" comment probably exists early too, because sfp->sm_dev_state is SFP_DEV_DOWN and not SFP_DEV_UP (freshly attached upstream) -> function exits doing nothing?! -> if (bus->started) => bus->socket_ops->start # Since bus->started is set by sfp_upstream_start, which is called at the end of phylink_start, it means this piece of code is not meant to execute # from phylink_create's calling context, but from the other caller of sfp_register_bus: sfp_probe. Logically phylink_start comes after phylink_create. # So ignore. -> bus->upstream_ops->attach(bus->upstream, bus) -> phylink_sfp_attach(pl, bus) -> pl->netdev->sfp_bus = bus # This is about the first thing touching the netdev I've seen, but since it's a simple pointer assignment and not anything # operating on complex/non-atomic data structures, not sure if this is the reason for rtnl_lock? And sfp_unregister_bus, called by sfp_bus_del_upstream: -> bus->upstream_ops->detach(bus->upstream, bus) -> phylink_sfp_detach -> pl->netdev->sfp_bus = NULL -> bus->socket_ops->stop(bus->sfp) -> sfp_stop(sfp) -> sfp_sm_event(sfp, SFP_E_DEV_DOWN); -> sfp_sm_device -> let's say sfp->sm_dev_state was SFP_DEV_UP, with SFP_E_DEV_DOWN it becomes SFP_DEV_DOWN. -> sfp_sm_module # I won't go through the state machines again, if something in the teardown path needs # rtnetlink protection whereas in the setup path it did not, oh well... So these are my notes, now please hit me hard, because I don't know, and I won't look any further into it. Why does phylink_{create,destroy} require serialization through the rtnetlink mutex?
On Sun, Sep 05, 2021 at 12:59:05AM +0300, Vladimir Oltean wrote: > [ again, trimming the CC list, because I assume most people don't care, > and if they do, the mailing lists are there for that ] > > On Fri, Sep 03, 2021 at 11:06:23PM +0100, Russell King (Oracle) wrote: > > On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote: > > > On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote: > > > > Removing a lock and then running the kernel is a down right stupid > > > > way to test to see if a lock is necessary. > > > > > > > > That approach is like having built a iron bridge, covered it in paint, > > > > then you remove most the bolts, and then test to see whether it's safe > > > > for vehicles to travel over it by riding your bicycle across it and > > > > declaring it safe. > > > > > > > > Sorry, but if you think "remove lock, run kernel, if it works fine > > > > the lock is unnecessary" is a valid approach, then you've just > > > > disqualified yourself from discussing this topic any further. > > > > Locking is done by knowing the code and code analysis, not by > > > > playing "does the code fail if I remove it" games. I am utterly > > > > shocked that you think that this is a valid approach. > > > > > > ... and this is exactly why you will no longer get any attention from me > > > on this topic. Good luck. > > > > Good, because your approach to this to me reads as "I don't think you > > know what the hell you're doing so I'm going to remove a lock to test > > whether it is needed." Effectively, that action is an insult towards > > me as the author of that code. > > The reason why you aren't getting any of my attention is your attitude, > in case it was not clear. > > You've transformed a few words I said and which were entirely > reasonable, "I don't know exactly why the SFP bus needs the rtnl_mutex, > I've removed those locks and will see what fails tomorrow", into a soap > opera based on something I did not say. You really don't understand, do you. I will say again: you can not remove a lock and then run-time test to see whether that lock is required. It doesn't just work like that, and the more you persist to assert that it does, the more stupid you look to all those who have had years of kernel programming experience. So please continue... > > And as I said, if you think that's a valid approach, then quite frankly > > I don't want you touching my code, because you clearly don't know what > > you're doing as you aren't willing to put the necessary effort in to > > understanding the code. > > > > Removing a lock and running the kernel is _never_ a valid way to see > > whether the lock is required or not. The only way is via code analysis. > > It is a completely valid approach for a simple reason: if there was an > obvious reason why the SFP bus code would have needed serialization > through the rtnetlink mutex, I could have found out by looking at all > the failed assertions and said to myself "oh, yeah, right, of course", > instead of spending several hours looking at the code, at which point I > would have had fewer chances of figuring out anyway. If we want to answer the question of "why rtnl_mutex" then one first has to understand the locking strategy and why I ended up there. It is _not_ simple. > > I wonder whether you'd take the same approach with filesystems or > > memory management code. Why don't you try removing some locks from > > those subsystems and see how long your filesystems last? > > This is a completely irrelevant and wrong argument, of course there are > sandboxes in which incompetent people can do insane things without doing > any damage, even if the subsystems they are interested in are filesystems > and memory management. It brings exactly nothing to the discussion. It is entirely relevant - it is about your approach to testing whether a lock is necessary or not. Your stated approach is "lets remove the lock and then run the kernel and see if anything breaks." I assert that approach is completely invalid. > If the mere idea of me removing a lock was insulting to you, I've no > idea what atrocity this might even compare to. But suffice to say, I > spent several hours and it is not obvious at all, based on code analysis > as you wish, why it must be the rtnl_lock and not any other mutex taken > by both the SFP module driver and the SFP upstream consumer (phylink), > with the same semantics except not the mega-bloated rtnetlink mutex. > > These are my notes from the plane, it is a single pass (the second pass > will most likely not happen), again it is purely based on code analysis > as you requested, non-expert of course because it is the first time I > look at the details or even study the code paths, and I haven't even run > the code without the rtnetlink protection as I originally intended. > > phylink_register_sfp > -> bus = sfp_bus_find_fwnode(fwnode) > -> fwnode_property_get_reference_args(fwnode) > -> bus = sfp_bus_get(fwnode) > -> mutex_lock(&sfp_mutex) > -> search for fwnode in sfp->fwnode of sfp_buses list # side note, the iterator in this function should have been named "bus", not "sfp", for consistency > -> if found, kref_get(&sfp->kref) > -> else allocate new sfp bus with this sfp->fwnode, and kref_init > -> mutex_unlock(&sfp_mutex) > -> fwnode_handle_put(fwnode) > -> pl->sfp_bus = bus > -> sfp_bus_add_upstream(bus, pl) > -> rtnl_lock() > -> kref_get(bus->kref) <- why? this increments from 1 to 2. Indicative of possibly concurrent code > -> bus->upstream = pl > -> if (bus->sfp) <- this code path does not populate bus->sfp, so unless code is running concurrently (?!) branch is not taken > -> sfp_register_bus(bus) > -> rtnl_unlock() > -> if (ret) => sfp_bus_put(bus) <= on error this decrements the kref back from 2 to 1 > -> kref_put_mutex(&bus->kref, sfp_bus_release, &sfp_mutex) > -> sfp_bus_put(bus) > -> on error, drops the kref from 1 to 0 and frees the bus under the sfp_mutex > -> on normal path, drops the kref from 2 to 1 First question "why? this increments from 1 to 2. Indicative of possibly concurrent code" - you appear to have answered that already in two lines immediately above. In the case of a pre-existing bus being found, then the krefs will be one higher than the numerical values you have given above. > Ok, why would bus->sfp be non-NULL (how would the sfp_register_bus possibly be triggered by this function)? You've already answered that above. "else allocate new sfp bus with this sfp->fwnode, and kref_init". In that case, bus->sfp will be NULL because the socket hasn't been registered. > sfp->bus is set from: > > sfp_unregister_socket(bus) > -> rtnl_lock > -> if (bus->upstream_ops) sfp_unregister_bus(bus) > -> sfp_socket_clear(bus) > -> bus->sfp = NULL > -> rtnl_unlock > -> sfp_bus_put(bus) > > sfp_register_socket(dev, sfp, ops) > -> bus = sfp_bus_get(dev->fwnode) > -> rtnl_lock > -> bus->sfp_dev = dev; > -> bus->sfp = sfp; > -> bus->socket_ops = ops; > -> if (bus->upstream_ops) => sfp_register_bus(bus); > -> rtnl_unlock > -> on error => sfp_bus_put(bus) > -> return bus > > Who calls sfp_register_socket and sfp_unregister_socket? > > sfp_probe (the driver for the cage) > -> sfp->sfp_bus = sfp_register_socket(sfp->dev, sfp, &sfp_module_ops) > > sfp_remove > -> sfp_unregister_socket(sfp->sfp_bus) > > So sfp_register_bus can be called either by phylink_register_sfp(the upstream side) or sfp_probe(the cage side). They are serialized by the rtnl_mutex. So here you have established the need for serialisation. However, I don't think you have completely grasped it fully. Not only do these two need to be serialised, but also the calls through sfp_bus, to prevent bus->sfp, bus->socket_ops, bus->upstream_ops, or bus->upstream changing beneath us. Sure, bus->sfp, bus->socket_ops isn't going to change except when the SFP cage is being removed once setup - but these may be dereferenced by a call from the network side. The same is true of calls going the other way. So, we now have a concrete reason why we need serialisation here, agreed? Let's take a moment, and assume the sfp-bus layer uses its own private mutex to achieve this, which would be taken whenever either side calls one of the interface functions so that dereferences of bus->sfp, bus->socket_ops, bus->upstream_ops and bus->upstream are all safe. sfp_get_module_info() and sfp_get_module_eeprom() are called from ethtool operations. So, lockdep will see rtnl taken first, then our private mutex. As soon as any two locks nest, it creates an immediate nesting rule for these two locks to avoid an AB-BA deadlock. We must always take our private mutex before rtnl, otherwise we have the possibility of an AB-BA deadlock. The next part of the puzzle is how we add and remove PHYs. Pick any ethtool implementation that dereferences the net device "phydev" member, for example linkstate_get_sqi(). This happens to take the phydev->lock, but that is not important - the important point is that netdev->phydev must be a valid phydev or NULL and must not change while the ethtool call is being processed. Which lock guarantees that? It's the rtnl lock. So, to safely change netdev->phydev on a published or running net device, we must be holding the rtnl lock. Okay, now lets go back to the sfp_bus layer, and lets consider the case where a PHY is being removed - and continue to assume that we are using our private locks in that code. The SFP cage code has called sfp_remove_phy(), which takes our lock and then calls through to the disconnect_phy method. The disconnect_phy() needs to take the rtnl lock to safely remove the phydev from the network device... but we've taken our private lock. So, we end up with two paths, one which takes the locks in the order AB and another which takes them in order BA. Lockdep will spot that and will complain. What ways can that be solved? - One can fall back and just use the rtnl lock. - One could refcount the structures on both sides, and adding code to handle the case where one side or the other goes away - but even with that, it's still unsafe. Consider sfp_get_module_eeprom(). This will sleep while i2c is read (many network drivers will sleep here under the rtnl lock.) The SFP cage module gets removed mid-call. There's absolutely nothing to prevent that happening. We don't get a look in while the i2c adapter is sleeping to abort that. Maybe the SFP cage gets removed. We now have lost its code, so when the i2c adapter returns, we get a kernel oops because the code we were going to execute on function return has been removed. As soon as you start thinking "we can add a lock here instead of rtnl" then things start getting really difficult because of netdev holding rtnl when making some calls through to the SFP cage code, and rtnl needing to be held when changing the phydev in the network interface side. It isn't nice, I know. I wish it wasn't that way, and we could have finer grained locking, but I don't see any possibilities to avoid the AB-BA deadlock problem without introducing even more code and/or creating bugs in the process of doing so. Now, if you take your approach of "lets remove the rtnl lock and see whether anything breaks" I can tell you now - you likely won't notice anything break from a few hundred boots. However, removing the lock _provably_ opens a race between threads loading or removing the SFP cage code and actions happening in the netdev layer. This is why your approach is invalid. You can not prove a negative. You can not prove that a lock isn't needed by removing it. Computing does not work that way. I don't write code to "work 99% of the time". I write code to try to achieve reliable operation, and that means having the necessary locks in place to avoid races and prevent kernel oops. One of the things that having been involved in Linux for so long teaches you is that a race, no matter how rare, will get found.
On Sun, Sep 05, 2021 at 12:25:40AM +0100, Russell King (Oracle) wrote: > On Sun, Sep 05, 2021 at 12:59:05AM +0300, Vladimir Oltean wrote: > > [ again, trimming the CC list, because I assume most people don't care, > > and if they do, the mailing lists are there for that ] > > > > On Fri, Sep 03, 2021 at 11:06:23PM +0100, Russell King (Oracle) wrote: > > > On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote: > > > > On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote: > > > > > Removing a lock and then running the kernel is a down right stupid > > > > > way to test to see if a lock is necessary. > > > > > > > > > > That approach is like having built a iron bridge, covered it in paint, > > > > > then you remove most the bolts, and then test to see whether it's safe > > > > > for vehicles to travel over it by riding your bicycle across it and > > > > > declaring it safe. > > > > > > > > > > Sorry, but if you think "remove lock, run kernel, if it works fine > > > > > the lock is unnecessary" is a valid approach, then you've just > > > > > disqualified yourself from discussing this topic any further. > > > > > Locking is done by knowing the code and code analysis, not by > > > > > playing "does the code fail if I remove it" games. I am utterly > > > > > shocked that you think that this is a valid approach. > > > > > > > > ... and this is exactly why you will no longer get any attention from me > > > > on this topic. Good luck. > > > > > > Good, because your approach to this to me reads as "I don't think you > > > know what the hell you're doing so I'm going to remove a lock to test > > > whether it is needed." Effectively, that action is an insult towards > > > me as the author of that code. > > > > The reason why you aren't getting any of my attention is your attitude, > > in case it was not clear. > > > > You've transformed a few words I said and which were entirely > > reasonable, "I don't know exactly why the SFP bus needs the rtnl_mutex, > > I've removed those locks and will see what fails tomorrow", into a soap > > opera based on something I did not say. > > You really don't understand, do you. > > I will say again: you can not remove a lock and then run-time test > to see whether that lock is required. It doesn't just work like that, > and the more you persist to assert that it does, the more stupid you > look to all those who have had years of kernel programming experience. > So please continue... > > > > And as I said, if you think that's a valid approach, then quite frankly > > > I don't want you touching my code, because you clearly don't know what > > > you're doing as you aren't willing to put the necessary effort in to > > > understanding the code. > > > > > > Removing a lock and running the kernel is _never_ a valid way to see > > > whether the lock is required or not. The only way is via code analysis. > > > > It is a completely valid approach for a simple reason: if there was an > > obvious reason why the SFP bus code would have needed serialization > > through the rtnetlink mutex, I could have found out by looking at all > > the failed assertions and said to myself "oh, yeah, right, of course", > > instead of spending several hours looking at the code, at which point I > > would have had fewer chances of figuring out anyway. > > If we want to answer the question of "why rtnl_mutex" then one first > has to understand the locking strategy and why I ended up there. It > is _not_ simple. > > > > I wonder whether you'd take the same approach with filesystems or > > > memory management code. Why don't you try removing some locks from > > > those subsystems and see how long your filesystems last? > > > > This is a completely irrelevant and wrong argument, of course there are > > sandboxes in which incompetent people can do insane things without doing > > any damage, even if the subsystems they are interested in are filesystems > > and memory management. It brings exactly nothing to the discussion. > > It is entirely relevant - it is about your approach to testing whether > a lock is necessary or not. Your stated approach is "lets remove the > lock and then run the kernel and see if anything breaks." I assert > that approach is completely invalid. > > > If the mere idea of me removing a lock was insulting to you, I've no > > idea what atrocity this might even compare to. But suffice to say, I > > spent several hours and it is not obvious at all, based on code analysis > > as you wish, why it must be the rtnl_lock and not any other mutex taken > > by both the SFP module driver and the SFP upstream consumer (phylink), > > with the same semantics except not the mega-bloated rtnetlink mutex. > > > > These are my notes from the plane, it is a single pass (the second pass > > will most likely not happen), again it is purely based on code analysis > > as you requested, non-expert of course because it is the first time I > > look at the details or even study the code paths, and I haven't even run > > the code without the rtnetlink protection as I originally intended. > > > > phylink_register_sfp > > -> bus = sfp_bus_find_fwnode(fwnode) > > -> fwnode_property_get_reference_args(fwnode) > > -> bus = sfp_bus_get(fwnode) > > -> mutex_lock(&sfp_mutex) > > -> search for fwnode in sfp->fwnode of sfp_buses list # side note, the iterator in this function should have been named "bus", not "sfp", for consistency > > -> if found, kref_get(&sfp->kref) > > -> else allocate new sfp bus with this sfp->fwnode, and kref_init > > -> mutex_unlock(&sfp_mutex) > > -> fwnode_handle_put(fwnode) > > -> pl->sfp_bus = bus > > -> sfp_bus_add_upstream(bus, pl) > > -> rtnl_lock() > > -> kref_get(bus->kref) <- why? this increments from 1 to 2. Indicative of possibly concurrent code > > -> bus->upstream = pl > > -> if (bus->sfp) <- this code path does not populate bus->sfp, so unless code is running concurrently (?!) branch is not taken > > -> sfp_register_bus(bus) > > -> rtnl_unlock() > > -> if (ret) => sfp_bus_put(bus) <= on error this decrements the kref back from 2 to 1 > > -> kref_put_mutex(&bus->kref, sfp_bus_release, &sfp_mutex) > > -> sfp_bus_put(bus) > > -> on error, drops the kref from 1 to 0 and frees the bus under the sfp_mutex > > -> on normal path, drops the kref from 2 to 1 > > First question "why? this increments from 1 to 2. Indicative of possibly > concurrent code" - you appear to have answered that already in two lines > immediately above. > > In the case of a pre-existing bus being found, then the krefs will be > one higher than the numerical values you have given above. > > > Ok, why would bus->sfp be non-NULL (how would the sfp_register_bus possibly be triggered by this function)? > > You've already answered that above. "else allocate new sfp bus with this > sfp->fwnode, and kref_init". In that case, bus->sfp will be NULL because > the socket hasn't been registered. > > > sfp->bus is set from: > > > > sfp_unregister_socket(bus) > > -> rtnl_lock > > -> if (bus->upstream_ops) sfp_unregister_bus(bus) > > -> sfp_socket_clear(bus) > > -> bus->sfp = NULL > > -> rtnl_unlock > > -> sfp_bus_put(bus) > > > > sfp_register_socket(dev, sfp, ops) > > -> bus = sfp_bus_get(dev->fwnode) > > -> rtnl_lock > > -> bus->sfp_dev = dev; > > -> bus->sfp = sfp; > > -> bus->socket_ops = ops; > > -> if (bus->upstream_ops) => sfp_register_bus(bus); > > -> rtnl_unlock > > -> on error => sfp_bus_put(bus) > > -> return bus > > > > Who calls sfp_register_socket and sfp_unregister_socket? > > > > sfp_probe (the driver for the cage) > > -> sfp->sfp_bus = sfp_register_socket(sfp->dev, sfp, &sfp_module_ops) > > > > sfp_remove > > -> sfp_unregister_socket(sfp->sfp_bus) > > > > So sfp_register_bus can be called either by phylink_register_sfp(the upstream side) or sfp_probe(the cage side). They are serialized by the rtnl_mutex. > > So here you have established the need for serialisation. However, I > don't think you have completely grasped it fully. > > Not only do these two need to be serialised, but also the calls > through sfp_bus, to prevent bus->sfp, bus->socket_ops, > bus->upstream_ops, or bus->upstream changing beneath us. > > Sure, bus->sfp, bus->socket_ops isn't going to change except when the > SFP cage is being removed once setup - but these may be dereferenced > by a call from the network side. The same is true of calls going the > other way. > > So, we now have a concrete reason why we need serialisation here, > agreed? > > Let's take a moment, and assume the sfp-bus layer uses its own private > mutex to achieve this, which would be taken whenever either side calls > one of the interface functions so that dereferences of bus->sfp, > bus->socket_ops, bus->upstream_ops and bus->upstream are all safe. > > sfp_get_module_info() and sfp_get_module_eeprom() are called from > ethtool operations. So, lockdep will see rtnl taken first, then our > private mutex. As soon as any two locks nest, it creates an immediate > nesting rule for these two locks to avoid an AB-BA deadlock. We must > always take our private mutex before rtnl, otherwise we have the > possibility of an AB-BA deadlock. > > The next part of the puzzle is how we add and remove PHYs. > > Pick any ethtool implementation that dereferences the net device > "phydev" member, for example linkstate_get_sqi(). This happens to > take the phydev->lock, but that is not important - the important > point is that netdev->phydev must be a valid phydev or NULL and > must not change while the ethtool call is being processed. Which > lock guarantees that? It's the rtnl lock. > > So, to safely change netdev->phydev on a published or running net > device, we must be holding the rtnl lock. > > Okay, now lets go back to the sfp_bus layer, and lets consider the > case where a PHY is being removed - and continue to assume that we > are using our private locks in that code. The SFP cage code has > called sfp_remove_phy(), which takes our lock and then calls > through to the disconnect_phy method. > > The disconnect_phy() needs to take the rtnl lock to safely remove the > phydev from the network device... but we've taken our private lock. > > So, we end up with two paths, one which takes the locks in the order > AB and another which takes them in order BA. Lockdep will spot that > and will complain. > > What ways can that be solved? > > - One can fall back and just use the rtnl lock. > - One could refcount the structures on both sides, and adding code > to handle the case where one side or the other goes away - but > even with that, it's still unsafe. > > Consider sfp_get_module_eeprom(). This will sleep while i2c is > read (many network drivers will sleep here under the rtnl lock.) > The SFP cage module gets removed mid-call. There's absolutely > nothing to prevent that happening. We don't get a look in while > the i2c adapter is sleeping to abort that. Maybe the SFP cage > gets removed. We now have lost its code, so when the i2c adapter > returns, we get a kernel oops because the code we were going to > execute on function return has been removed. > > As soon as you start thinking "we can add a lock here instead of rtnl" > then things start getting really difficult because of netdev holding > rtnl when making some calls through to the SFP cage code, and rtnl > needing to be held when changing the phydev in the network interface > side. > > It isn't nice, I know. I wish it wasn't that way, and we could have > finer grained locking, but I don't see any possibilities to avoid the > AB-BA deadlock problem without introducing even more code and/or > creating bugs in the process of doing so. > > > Now, if you take your approach of "lets remove the rtnl lock and see > whether anything breaks" I can tell you now - you likely won't notice > anything break from a few hundred boots. However, removing the lock > _provably_ opens a race between threads loading or removing the SFP > cage code and actions happening in the netdev layer. > > This is why your approach is invalid. You can not prove a negative. > You can not prove that a lock isn't needed by removing it. Computing > does not work that way. > > I don't write code to "work 99% of the time". I write code to try to > achieve reliable operation, and that means having the necessary locks > in place to avoid races and prevent kernel oops. > > One of the things that having been involved in Linux for so long > teaches you is that a race, no matter how rare, will get found. I haven't fully digested everything you've said, but if I were to summarize the basic idea, it is that (a) the rtnetlink mutex fundamentally serializes accesses to netdev->phydev on a registered interface, which renders my comment about phy_attach_direct not needing rtnl_lock invalid, since phylink needs to deal with PHYs appearing and disappearing on SFP modules while the netdev attached to the sfp upstream is already registered. This would mean that if we were to replace the rtnl mutex with something else, it'd have to be a netdev->phy_lock which is held at all times when netdev->phydev is accessed. Of course, that would be its own discussion, which functions should take this lock on their own and which should expect it to already be held. (b) the sfp upstream and module drivers need to synchronize on a shared lock that is taken top-level by both parties, and the rtnetlink mutex is currently used for that. Could the sfp bus not have exactly that, a shared mutex (a pointer), created by whomever creates the sfp bus first, and destroyed by whomever puts the last reference on the sfp bus? This would act as the top-level mutex, and netdev->phy_lock would be accessed from within the individual functions that need to touch that (probably just the upstream side). The sfp_mutex lock inside sfp_bus_get could probably act as preliminary serialization until the shared lock is fully published and held by the side that created the lock in the first place. The entire reason I'm asking is because I got the impression that you are not fully satisfied with the only apparent solution to the dpaa2 deadlock problem, which is to force all callers of phylink_{create,destroy} to take the rtnl mutex top-level, because it would not be needed for most of the operations, and it would make the design even harder to untangle later on. If you are satisfied with that solution, my bad.
diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 1c379d20812a..b22073b0acd2 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -128,13 +128,30 @@ static void deferred_probe_work_func(struct work_struct *work) } static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func); +static bool __device_pending_probe(struct device *dev) +{ + return !list_empty(&dev->p->deferred_probe); +} + +bool device_pending_probe(struct device *dev) +{ + bool pending; + + mutex_lock(&deferred_probe_mutex); + pending = __device_pending_probe(dev); + mutex_unlock(&deferred_probe_mutex); + + return pending; +} +EXPORT_SYMBOL_GPL(device_pending_probe); + void driver_deferred_probe_add(struct device *dev) { if (!dev->can_match) return; mutex_lock(&deferred_probe_mutex); - if (list_empty(&dev->p->deferred_probe)) { + if (!__device_pending_probe(dev)) { dev_dbg(dev, "Added to deferred list\n"); list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list); } @@ -144,7 +161,7 @@ void driver_deferred_probe_add(struct device *dev) void driver_deferred_probe_del(struct device *dev) { mutex_lock(&deferred_probe_mutex); - if (!list_empty(&dev->p->deferred_probe)) { + if (__device_pending_probe(dev)) { dev_dbg(dev, "Removed from deferred list\n"); list_del_init(&dev->p->deferred_probe); __device_set_deferred_probe_reason(dev, NULL); diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 52310df121de..2c22a32f0a1c 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, /* Assume that if there is no driver, that it doesn't * exist, and we should use the genphy driver. + * The exception is during probing, when the PHY driver might have + * attempted a probe but has requested deferral. Since there might be + * MAC drivers which also attach to the PHY during probe time, try + * harder to bind the specific PHY driver, and defer the MAC driver's + * probing until then. */ if (!d->driver) { + if (device_pending_probe(d)) + return -EPROBE_DEFER; + if (phydev->is_c45) d->driver = &genphy_c45_driver.mdiodrv.driver; else diff --git a/include/linux/device.h b/include/linux/device.h index e270cb740b9e..505e77715789 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -889,6 +889,7 @@ int __must_check driver_attach(struct device_driver *drv); void device_initial_probe(struct device *dev); int __must_check device_reprobe(struct device *dev); +bool device_pending_probe(struct device *dev); bool device_is_bound(struct device *dev); /*
There are systems where the PHY driver might get its probe deferred due to a missing supplier, like an interrupt-parent, gpio, clock or whatever. If the phy_attach_direct call happens right in between probe attempts, the PHY library is greedy and assumes that a specific driver will never appear, so it just binds the generic PHY driver. In certain cases this is the wrong choice, because some PHYs simply need the specific driver. The specific PHY driver was going to probe, given enough time, but this doesn't seem to matter to phy_attach_direct. To solve this, make phy_attach_direct check whether a specific PHY driver is pending or not, and if it is, just defer the probing of the MAC that's connecting to us a bit more too. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> --- drivers/base/dd.c | 21 +++++++++++++++++++-- drivers/net/phy/phy_device.c | 8 ++++++++ include/linux/device.h | 1 + 3 files changed, 28 insertions(+), 2 deletions(-)