Message ID | 174481734008.986682.1350602067856870465.stgit@ahduyck-xeon-server.home.arpa (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: phylink: Fix issue w/ BMC link flap | expand |
On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > From: Alexander Duyck <alexanderduyck@fb.com> > > This change is meant to address the fact that there are link imbalances > introduced when using phylink on a system with a BMC. Specifically there > are two issues. > > The first issue is that if we lose link after the first call to > phylink_start but before it gets to the phylink_resolve we will end up with > the phylink interface assuming the link was always down and not calling > phylink_link_down resulting in a stuck interface. That is intentional. phylink strictly orders .mac_link_down and .mac_link_up, and starts from an initial position that the link _will_ be considered to be down. So, it is intentional that .mac_link_down will _never_ be called after phylink_start(). > The second issue is that when a BMC is present we are currently forcing the > link down. This results in us bouncing the link for a fraction of a second > and that will result in dropped packets for the BMC. ... but you don't explain how that happens. > The third issue is just an extra "Link Down" message that is seen when > calling phylink_resume. This is addressed by identifying that the link > isn't balanced and just not displaying the down message in such a case. Hmm, this one is an error, but is not as simple as "don't print the message" as it results in a violation of the rule I mentioned above. We need phylink_suspend() to record the state of the link at that point, and avoid calling phylink_link_down() if the link was down prior to suspend.
On Wed, Apr 16, 2025 at 05:01:09PM +0100, Russell King (Oracle) wrote: > On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > > From: Alexander Duyck <alexanderduyck@fb.com> > > > > This change is meant to address the fact that there are link imbalances > > introduced when using phylink on a system with a BMC. Specifically there > > are two issues. > > > > The first issue is that if we lose link after the first call to > > phylink_start but before it gets to the phylink_resolve we will end up with > > the phylink interface assuming the link was always down and not calling > > phylink_link_down resulting in a stuck interface. > > That is intentional. > > phylink strictly orders .mac_link_down and .mac_link_up, and starts from > an initial position that the link _will_ be considered to be down. So, > it is intentional that .mac_link_down will _never_ be called after > phylink_start(). > > > The second issue is that when a BMC is present we are currently forcing the > > link down. This results in us bouncing the link for a fraction of a second > > and that will result in dropped packets for the BMC. > > ... but you don't explain how that happens. > > > The third issue is just an extra "Link Down" message that is seen when > > calling phylink_resume. This is addressed by identifying that the link > > isn't balanced and just not displaying the down message in such a case. > > Hmm, this one is an error, but is not as simple as "don't print the > message" as it results in a violation of the rule I mentioned above. > We need phylink_suspend() to record the state of the link at that > point, and avoid calling phylink_link_down() if the link was down > prior to suspend. Okay, confirmed on nvidia Jetson Xavier NX: [ 11.838132] dwc-eth-dwmac 2490000.ethernet eth0: Adding VLAN ID 0 is not supported [ 15.299757] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx LAN cable was unplugged: [ 50.436587] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down Then the system was suspended using rtcwake for 3 seconds: [ 54.736849] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found [ 54.736898] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported [ 54.741078] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down This shouldn't happen. With the patch I posted, this second "Link is Down" message is not printed, and .mac_link_down() will not be called.
On Wed, Apr 16, 2025 at 9:01 AM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > > From: Alexander Duyck <alexanderduyck@fb.com> > > > > This change is meant to address the fact that there are link imbalances > > introduced when using phylink on a system with a BMC. Specifically there > > are two issues. > > > > The first issue is that if we lose link after the first call to > > phylink_start but before it gets to the phylink_resolve we will end up with > > the phylink interface assuming the link was always down and not calling > > phylink_link_down resulting in a stuck interface. > > That is intentional. > > phylink strictly orders .mac_link_down and .mac_link_up, and starts from > an initial position that the link _will_ be considered to be down. So, > it is intentional that .mac_link_down will _never_ be called after > phylink_start(). Well the issue is that with a BMC present the link may be up before we even start using phylink. So if the link is lost while we are going through phylink_start we will end up in an in-between state where the link is physically down, but the MAC is still configured as though the link is up. This will be problematic as the MAC should essentially be discarding frames for transmit if the link is down to avoid blocking internal Tx FIFOs. Everything for our driver has to be a light touch as bringing down the BMC link for any reason other than the physical link being lost is essentially a criteria for failure as the BMC is the most essential link on the system. So, for example the code I have in the driver for handling a major config is currently checking in mac_prepare to verify if we already have link based on the requested interface mode and FEC settings and if we do the mac_config and pcs_config steps read an internal state flag and do nothing, then we just go through to mac_finish where we write the necessary bits to make sure the PCS/PMA and PMA/PMD connections are enabled which essentially becomes a no-op if the link is already enabled. > > The second issue is that when a BMC is present we are currently forcing the > > link down. This results in us bouncing the link for a fraction of a second > > and that will result in dropped packets for the BMC. > > ... but you don't explain how that happens. It was right there in the patch. It was the lines I removed: @@ -2596,16 +2600,6 @@ void phylink_resume(struct phylink *pl) if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) { /* Wake-on-Lan enabled, MAC handling */ - /* Call mac_link_down() so we keep the overall state balanced. - * Do this under the state_mutex lock for consistency. This - * will cause a "Link Down" message to be printed during - * resume, which is harmless - the true link state will be - * printed when we run a resolve. - */ - mutex_lock(&pl->state_mutex); - phylink_link_down(pl); - mutex_unlock(&pl->state_mutex); - /* Re-apply the link parameters so that all the settings get * restored to the MAC. */ From a BMC perspective this forcing of the link down even if it is for a fraction of a second is unacceptable as it can break up ssh sessions to the BMC, especially if somebody is doing a bunch of configuration changes on the NIC as it results in dropped frames. When compared to the firmware based NIC approaches such as Broadcom and Nvidia/Mellanox it is a huge negative as the BMC link is static with those NICs and doesn't bounce no matter if the interface is being configured up/down, the driver proved/removed ect. The issue, even with your recent patch, is that it will still force the link down if the link was previously up. That is the piece I need to avoid to prevent the BMC from losing link. Ideally what I need is to have a check of the current link state and then sync back up rather than force the phylink state on the MAC and then clean things up after the fact. > > The third issue is just an extra "Link Down" message that is seen when > > calling phylink_resume. This is addressed by identifying that the link > > isn't balanced and just not displaying the down message in such a case. > > Hmm, this one is an error, but is not as simple as "don't print the > message" as it results in a violation of the rule I mentioned above. > We need phylink_suspend() to record the state of the link at that > point, and avoid calling phylink_link_down() if the link was down > prior to suspend. Looking at your fix it doesn't resolve my core issue. All it does is address the "third issue". In my case the link is up and needs to stay up when I resume. The fact that we are forcing the link down causes our BMC to drop connection which makes it difficult to remotely manage the interface as changes to the interface such as a simple ifconfig down; ifconfig up are enough to cause the BMC to drop packets and potentially the ssh session if we start losing enough of them.
On Wed, Apr 16, 2025 at 12:03:05PM -0700, Alexander Duyck wrote: > On Wed, Apr 16, 2025 at 9:01 AM Russell King (Oracle) > <linux@armlinux.org.uk> wrote: > > > > On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > > > From: Alexander Duyck <alexanderduyck@fb.com> > > > > > > This change is meant to address the fact that there are link imbalances > > > introduced when using phylink on a system with a BMC. Specifically there > > > are two issues. > > > > > > The first issue is that if we lose link after the first call to > > > phylink_start but before it gets to the phylink_resolve we will end up with > > > the phylink interface assuming the link was always down and not calling > > > phylink_link_down resulting in a stuck interface. > > > > That is intentional. > > > > phylink strictly orders .mac_link_down and .mac_link_up, and starts from > > an initial position that the link _will_ be considered to be down. So, > > it is intentional that .mac_link_down will _never_ be called after > > phylink_start(). > > Well the issue is that with a BMC present the link may be up before we > even start using phylink. So if the link is lost while we are going > through phylink_start we will end up in an in-between state where the > link is physically down, but the MAC is still configured as though the > link is up. This will be problematic as the MAC should essentially be > discarding frames for transmit if the link is down to avoid blocking > internal Tx FIFOs. So, when a Linux network driver probes, it starts out in administrative state *DOWN*. When the administrator configures the network driver, .ndo_open is called, which is expected to configure the network adapter. Part of that process is to call phylink_start() as one of the last steps, which detects whether the link is up or not. If the link is up, then phylink will call netif_carrier_on() and .mac_link_up(). This tells the core networking layers that the network interface is now ready to start sending packets, and it will begin queueing packets for the network driver to process - not before. Prior to .ndo_open being called, the networking layers do not expect traffic from the network device no matter what physical state the media link is in. If .ndo_open fails, the same applies - no traffic is expected to be passed to the core network layers from the network layers because as far as the network stack is concerned, the interface is still administratively down. Thus, the fact that your BMC thinks that the link is up is irrelevant. So, start off in a state that packet activity is suspended even if the link is up at probe time. Only start packet activity (reception and transmission) once .mac_link_up() has been called. Stop that activity when .mac_link_down() is subsequently called. There have been lots of NICs out there where the physical link doesn't follow the adminstrative state of the network interface. This is not a problem. It may be desirable that it does, but a desire is not the same as "it must". > > > The second issue is that when a BMC is present we are currently forcing the > > > link down. This results in us bouncing the link for a fraction of a second > > > and that will result in dropped packets for the BMC. > > > > ... but you don't explain how that happens. > > It was right there in the patch. It was the lines I removed: ... thus further breaking phylink guarantees. Sorry, but no. > The issue, even with your recent patch, is that it will still force > the link down if the link was previously up. That is the piece I need > to avoid to prevent the BMC from losing link. Ideally what I need is > to have a check of the current link state and then sync back up rather > than force the phylink state on the MAC and then clean things up after > the fact. So don't force the link, just stop packet activity. As stated above, nothing requires that the physical link is forced down just because .mac_link_down() has been called. This makes me wonder what happens to your BMC with your ideas if userspace takes down the network interface. It sounds like all hell breaks loose because you've taken the link down, and that's seen as a critical failure... So taking down a network interface becomes a critical failure - yet it's a *normal* userspace operation.
On Wed, Apr 16, 2025 at 08:19:38PM +0100, Russell King (Oracle) wrote: > So, when a Linux network driver probes, it starts out in administrative > state *DOWN*. When the administrator configures the network driver, > .ndo_open is called, which is expected to configure the network adapter. > > Part of that process is to call phylink_start() as one of the last > steps, which detects whether the link is up or not. If the link is up, > then phylink will call netif_carrier_on() and .mac_link_up(). This > tells the core networking layers that the network interface is now > ready to start sending packets, and it will begin queueing packets for > the network driver to process - not before. > > Prior to .ndo_open being called, the networking layers do not expect > traffic from the network device no matter what physical state the > media link is in. If .ndo_open fails, the same applies - no traffic is > expected to be passed to the core network layers from the network > layers because as far as the network stack is concerned, the interface > is still administratively down. > > Thus, the fact that your BMC thinks that the link is up is irrelevant. > > So, start off in a state that packet activity is suspended even if the > link is up at probe time. Only start packet activity (reception and > transmission) once .mac_link_up() has been called. Stop that activity > when .mac_link_down() is subsequently called. > > There have been lots of NICs out there where the physical link doesn't > follow the adminstrative state of the network interface. This is not a > problem. It may be desirable that it does, but a desire is not the same > as "it must". Let me be crystal clear on this. Phylink has a contract with all existing users. That contract is: Initial state: link down. Driver calls phylink_start() in its .ndo_open method. Phylink does configuration of the PHY and link according to the chosen link parameters by calling into the MAC, PCS, and phylib as appropriate. If the link is then discovered to be up (it might have been already up before phylink_start() was called), phylink will call the various components such as PCS and MAC to inform them that the link is now up. This will mean calling the .mac_link_up() method. Otherwise (if the link is discovered to be down when the interface is brought up) no call to either .mac_link_up() nor .mac_link_down() will be made. If the link _subsequently_ goes down, then phylink deals with that and calls .mac_link_down() - only if .mac_link_up() was previously called (that's one of the bugs you discovered, that on resume it gets called anyway. I've submitted a fix for that contract breach, which only affects a very small number of drivers - stmmac, ucc_geth and your fbnic out of 22 total ethernet users plus however many DSA users we have.) Only if .mac_link_down() has been called, if the link subsequently comes back up, then the same process happens as before resulting in .mac_link_up() being called. If the interface is taken down, then .mac_link_down() will be called if and only if .mac_link_up() had been called. The ordering of .mac_link_up() / .mac_link_down() is a strict contract term with phylink users. The reason for this contract: phylink users may have ordering requirements. For example, on mac_link_down(), they may wait for packet activity to stop, and then place the MAC in reset. If called without a previous .mac_link_up call, the wait stage may time out due to the MAC being in reset. (Ocelot may suffer with this.) Another example is fs_enet which also relies on this strict ordering as described above. There could be others - there are some that call into firmware on calls to .mac_link_up() / .mac_link_down() and misordering those depends on what the firmware is doing, which we have no visibility of. As I stated, this is the contract that phylink gave to users, and the contract still stands, and can't be broken to behave differently (e.g. calling .mac_link_down() after phylink_start() without an intervening call to .mac_link_up()) otherwise existing users will break. Bugs that go against that contract will be fixed, but the contract will not be intentionally broken.
On Wed, Apr 16, 2025 at 1:05 PM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > On Wed, Apr 16, 2025 at 08:19:38PM +0100, Russell King (Oracle) wrote: > > So, when a Linux network driver probes, it starts out in administrative > > state *DOWN*. When the administrator configures the network driver, > > .ndo_open is called, which is expected to configure the network adapter. > > > > Part of that process is to call phylink_start() as one of the last > > steps, which detects whether the link is up or not. If the link is up, > > then phylink will call netif_carrier_on() and .mac_link_up(). This > > tells the core networking layers that the network interface is now > > ready to start sending packets, and it will begin queueing packets for > > the network driver to process - not before. > > > > Prior to .ndo_open being called, the networking layers do not expect > > traffic from the network device no matter what physical state the > > media link is in. If .ndo_open fails, the same applies - no traffic is > > expected to be passed to the core network layers from the network > > layers because as far as the network stack is concerned, the interface > > is still administratively down. > > > > Thus, the fact that your BMC thinks that the link is up is irrelevant. > > > > So, start off in a state that packet activity is suspended even if the > > link is up at probe time. Only start packet activity (reception and > > transmission) once .mac_link_up() has been called. Stop that activity > > when .mac_link_down() is subsequently called. > > > > There have been lots of NICs out there where the physical link doesn't > > follow the adminstrative state of the network interface. This is not a > > problem. It may be desirable that it does, but a desire is not the same > > as "it must". > > Let me be crystal clear on this. > > Phylink has a contract with all existing users. That contract is: > > Initial state: link down. > > Driver calls phylink_start() in its .ndo_open method. > > Phylink does configuration of the PHY and link according to the > chosen link parameters by calling into the MAC, PCS, and phylib as > appropriate. > > If the link is then discovered to be up (it might have been already > up before phylink_start() was called), phylink will call the various > components such as PCS and MAC to inform them that the link is now up. > This will mean calling the .mac_link_up() method. Otherwise (if the > link is discovered to be down when the interface is brought up) no > call to either .mac_link_up() nor .mac_link_down() will be made. > > If the link _subsequently_ goes down, then phylink deals with that > and calls .mac_link_down() - only if .mac_link_up() was previously > called (that's one of the bugs you discovered, that on resume it > gets called anyway. I've submitted a fix for that contract breach, > which only affects a very small number of drivers - stmmac, ucc_geth > and your fbnic out of 22 total ethernet users plus however many DSA > users we have.) > > Only if .mac_link_down() has been called, if the link subsequently > comes back up, then the same process happens as before resulting in > .mac_link_up() being called. > > If the interface is taken down, then .mac_link_down() will be called > if and only if .mac_link_up() had been called. > > The ordering of .mac_link_up() / .mac_link_down() is a strict > contract term with phylink users. > > The reason for this contract: phylink users may have ordering > requirements. > > For example, on mac_link_down(), they may wait for packet activity to > stop, and then place the MAC in reset. If called without a previous > .mac_link_up call, the wait stage may time out due to the MAC being > in reset. (Ocelot may suffer with this.) > > Another example is fs_enet which also relies on this strict ordering > as described above. > > There could be others - there are some that call into firmware on > calls to .mac_link_up() / .mac_link_down() and misordering those > depends on what the firmware is doing, which we have no visibility > of. > > As I stated, this is the contract that phylink gave to users, and > the contract still stands, and can't be broken to behave differently > (e.g. calling .mac_link_down() after phylink_start() without an > intervening call to .mac_link_up()) otherwise existing users will > break. Bugs that go against that contract will be fixed, but the > contract will not be intentionally broken. The issue is as that stands the contract is inherently broken if a BMC is present. 1. There is still the link loss during the phylink_start issue which will essentially leave the MAC up if the link fails. This will cause the Tx FIFO on the NIC to essentially be left to hang without flushing out the stale packets that may have queued up when the link dropped. This is one of the reasons why the mac_link_down call still needs to propagate all the way back down to the hardware. 2. We already have precedent for the link being up when WOL is in use. One concern I would have with your patch is if it will impact that or not. I suspected part of the mac_link_down is related to cleaning up something related to the link setup for WOL configuring the link for some speed. 3. While it may be a part of the contract, isn't there some way we can "renegotiate the terms"? The fact is in the case of our driver we are essentially doing a hand-off from FW to the OS for the link when we call ndo_open. When we are done in ndo_stop we hand ownership back over to the firmware. Those hand-offs need to be free of link bounces. This pattern looks very much like the WOL setup with the only exception being trying to avoid these link bounces. I'm just wondering if there isn't some way we can add a phylink_config value indicating that there is a BMC present on the link so we can handle those cases. The general idea would be to update my patch so instead of pl->link_balanced being just set to false in suspend and at creation time we could essentially just set it to "pl->link_balanced = !pl->config.mac_bmc_handoff". The fact is the code with the diff I provided did everything I needed. I could load/unload and ifconfig up/down the driver all day and it didn't drop a packet. As it stood the definition of the mac_config code more or less called out that we weren't supposed to change things unless we needed to and I had followed that. I suspect the overall change should be small, smaller now following your fix of the message issue, and it shouldn't impact anything other than our driver if I add the config flag checks.
diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index 942ce114dabd..2b9ab343942e 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -45,6 +45,7 @@ struct phylink { struct phylink_pcs *pcs; struct device *dev; unsigned int old_link_state:1; + unsigned int link_balanced:1; unsigned long phylink_disable_state; /* bitmask of disables */ struct phy_device *phydev; @@ -1553,7 +1554,8 @@ static void phylink_link_down(struct phylink *pl) pl->mac_ops->mac_link_down(pl->config, pl->act_link_an_mode, pl->cur_interface); - phylink_info(pl, "Link is Down\n"); + if (pl->link_balanced) + phylink_info(pl, "Link is Down\n"); } static bool phylink_link_is_up(struct phylink *pl) @@ -1658,12 +1660,13 @@ static void phylink_resolve(struct work_struct *w) if (pl->major_config_failed) link_state.link = false; - if (link_state.link != cur_link_state) { + if (link_state.link != cur_link_state || !pl->link_balanced) { pl->old_link_state = link_state.link; if (!link_state.link) phylink_link_down(pl); else phylink_link_up(pl, link_state); + pl->link_balanced = true; } if (!link_state.link && retrigger) { pl->link_failed = false; @@ -2546,6 +2549,7 @@ void phylink_suspend(struct phylink *pl, bool mac_wol) netif_carrier_off(pl->netdev); else pl->old_link_state = false; + pl->link_balanced = false; /* We do not call mac_link_down() here as we want the * link to remain up to receive the WoL packets. @@ -2596,16 +2600,6 @@ void phylink_resume(struct phylink *pl) if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) { /* Wake-on-Lan enabled, MAC handling */ - /* Call mac_link_down() so we keep the overall state balanced. - * Do this under the state_mutex lock for consistency. This - * will cause a "Link Down" message to be printed during - * resume, which is harmless - the true link state will be - * printed when we run a resolve. - */ - mutex_lock(&pl->state_mutex); - phylink_link_down(pl); - mutex_unlock(&pl->state_mutex); - /* Re-apply the link parameters so that all the settings get * restored to the MAC. */