Message ID | ae53177a7b68964b2a988934a09f74a4931b862d.1728438951.git.daniel@makrotopia.org (mailing list archive) |
---|---|
State | Rejected |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] net: phy: populate host_interfaces when attaching PHY | expand |
On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote: > Use bitmask of interfaces supported by the MAC for the PHY to choose > from if the declared interface mode is among those using a single pair > of SerDes lanes. > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which > results in half-duplex being supported in case the MAC supports that. > Without this change, 2500Base-T PHYs will always operate in 2500Base-X > mode with rate-matching, which is not only wasteful in terms of energy > consumption, but also limits the supported interface modes to > full-duplex only. We've had a similar patch before, and it's been NAK'd. The problem is that supplying the host_interfaces for built-in PHYs means that the hardware strapping for the PHY interface mode becomes useless, as does the DT property specifying it - and thus we may end up selecting a mode that both the MAC and PHY support, but the hardware design doesn't (e.g. signals aren't connected, signal speed to fast.) For example, take a board designed to use RXAUI and the host supports 10GBASE-R. The first problem is, RXAUI is not listed in the SFP interface list because it's not usable over a SFP cage. So, the host_interfaces excludes that, and thus the PHY thinks that's not supported. It looks at the mask and sees only 10GBASE-R, and decides to use that instead with rate matching. The MAC doesn't have support for flow control, and thus can't use rate matching. Not only have the electrical charateristics been violated by selecting a faster interface than the hardware was designed for, but we now have rate matching being used when it shouldn't be.
Hi Russell, On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote: > On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote: > > Use bitmask of interfaces supported by the MAC for the PHY to choose > > from if the declared interface mode is among those using a single pair > > of SerDes lanes. > > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which > > results in half-duplex being supported in case the MAC supports that. > > Without this change, 2500Base-T PHYs will always operate in 2500Base-X > > mode with rate-matching, which is not only wasteful in terms of energy > > consumption, but also limits the supported interface modes to > > full-duplex only. > > We've had a similar patch before, and it's been NAK'd. The problem is > that supplying the host_interfaces for built-in PHYs means that the > hardware strapping for the PHY interface mode becomes useless, as does > the DT property specifying it - and thus we may end up selecting a > mode that both the MAC and PHY support, but the hardware design > doesn't (e.g. signals aren't connected, signal speed to fast.) > > For example, take a board designed to use RXAUI and the host supports > 10GBASE-R. The first problem is, RXAUI is not listed in the SFP > interface list because it's not usable over a SFP cage. I thought about that, also boards configured for RGMII but both MAC and PHY supporting SGMII or even 2500Base-X would be such a case. In order to make sure we don't switch to link modes not supported by the design I check if the interface mode configured in DT is among those suitable for use with an SFP (ie. using a single pair of SerDes lanes): if (test_bit(pl->link_interface, phylink_sfp_interfaces)) phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces, pl->config->supported_interfaces); Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so cases in which those modes are configured in DT are already excluded. > So, the > host_interfaces excludes that, and thus the PHY thinks that's not > supported. It looks at the mask and sees only 10GBASE-R, and > decides to use that instead with rate matching. The MAC doesn't have > support for flow control, and thus can't use rate matching. > > Not only have the electrical charateristics been violated by selecting > a faster interface than the hardware was designed for, but we now have > rate matching being used when it shouldn't be. As we are also using using rate matching right now in cases when it should not (and thereby inhibiting support for half-duplex modes), I suppose the only good solution would be to allow a set of interface modes in DT instead of only a single one. Or, as that is the only really relevant case, we can be more strict on the condition and additional modes to be added, ie. check if both PHY and MAC support both 2500Base-X and SGMII, and only add SGMII in case 2500Base-X is selected in DT. I have never seen designs on which SGMII and 2500Base-X would both be supported by the SoC but use a different set of pins. Also, as 2500Base-X is 2.5x as fast as SGMII, it's safe to assume that a board which has been designed for 2500Base-X would also be fine using SGMII. Let me know of either of the above would be acceptable.
On Wed, Oct 09, 2024 at 12:52:13PM +0100, Daniel Golle wrote: > Hi Russell, > > On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote: > > On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote: > > > Use bitmask of interfaces supported by the MAC for the PHY to choose > > > from if the declared interface mode is among those using a single pair > > > of SerDes lanes. > > > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which > > > results in half-duplex being supported in case the MAC supports that. > > > Without this change, 2500Base-T PHYs will always operate in 2500Base-X > > > mode with rate-matching, which is not only wasteful in terms of energy > > > consumption, but also limits the supported interface modes to > > > full-duplex only. > > > > We've had a similar patch before, and it's been NAK'd. The problem is > > that supplying the host_interfaces for built-in PHYs means that the > > hardware strapping for the PHY interface mode becomes useless, as does > > the DT property specifying it - and thus we may end up selecting a > > mode that both the MAC and PHY support, but the hardware design > > doesn't (e.g. signals aren't connected, signal speed to fast.) > > > > For example, take a board designed to use RXAUI and the host supports > > 10GBASE-R. The first problem is, RXAUI is not listed in the SFP > > interface list because it's not usable over a SFP cage. > > I thought about that, also boards configured for RGMII but both MAC > and PHY supporting SGMII or even 2500Base-X would be such a case. > In order to make sure we don't switch to link modes not supported > by the design I check if the interface mode configured in DT is > among those suitable for use with an SFP (ie. using a single pair > of SerDes lanes): > if (test_bit(pl->link_interface, phylink_sfp_interfaces)) > phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces, > pl->config->supported_interfaces); > > Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so > cases in which those modes are configured in DT are already excluded. This still won't work. There are drivers (boo, hiss, stmmac crap which is resistant to cleanup and fixing but there's others too) that don't do the phylink interface switching. For example, stmmac sets the mode specified in DT and also if there is a Synopsys XPCS, then the supported interfaces also gets USXGMII, 10GKR, XLGMII, 10GBASER, SGMII, 1000BASEX and 2500BASEX. If DT says 10GBASER, then the PHY must not switch to USXGMII, but if an 88x3310 were to be thrown in, the PHY driver _would_ decide to use USXGMII against the DT configuration. phydev->host_interfaces is there to allow PHYs on SFPs be properly configured according to the host interface, where there is no DT description for the module. It is not meant for built-in PHYs. > > So, the > > host_interfaces excludes that, and thus the PHY thinks that's not > > supported. It looks at the mask and sees only 10GBASE-R, and > > decides to use that instead with rate matching. The MAC doesn't have > > support for flow control, and thus can't use rate matching. > > > > Not only have the electrical charateristics been violated by selecting > > a faster interface than the hardware was designed for, but we now have > > rate matching being used when it shouldn't be. > > As we are also using using rate matching right now in cases when it > should not (and thereby inhibiting support for half-duplex modes), I > suppose the only good solution would be to allow a set of interface > modes in DT instead of only a single one. Two issues... why was the PHY configured via firmware to use rate matching if that brings with it this restriction, and it's possible not to? Second, aqr107_get_rate_matching() is rather basic based on what people want. It doesn't actually ask the PHY what its going to do. I know there's a bunch of VEND1_GLOBAL_CFG registers that give the serdes mode and rate adaption for each speed, and these can be set not only by firmware configuration, but changed by the host. So, aqr107_get_rate_matching() should work out whether rate matching will be used for the interface mode by scanning these registers. Something like: u16 cfg_regs[] = { VEND1_GLOBAL_CFG_10M, VEND1_GLOBAL_CFG_100M, VEND1_GLOBAL_CFG_1G, VEND1_GLOBAL_CFG_2_5G, VEND1_GLOBAL_CFG_5G, VEND1_GLOBAL_CFG_10G }; int i, val; u8 mode; switch (interface) { case PHY_INTERFACE_MODE_10GBASER: mode = VEND1_GLOBAL_CFG_SERDES_MODE_XFI; break; case PHY_INTERFACE_MODE_2500BASEX: mode = VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII; break; case PHY_INTERFACE_MODE_5GBASER: mode = VEND1_GLOBAL_CFG_SERDES_MODE_XFI5G; break; default: return RATE_MATCH_NONE; } /* If any speed corresponds to the interface mode and uses pause rate * matching, indicate that this interface mode uses pause rate * matching. */ for (i = 0; i < ARRAY_SIZE(cfg_regs); i++) { val = phy_read_mmd(phydev, MDIO_MMD_VEND1, cfg_regs[i]); if (val < 0) return val; if (FIELD_GET(VEND1_GLOBAL_CFG_SERDES_MODE, val) == mode) { if (FIELD_GET(VEND1_GLOBAL_CFG_RATE_ADAPT, val) == VEND1_GLOBAL_CFG_RATE_ADAPT_PAUSE) return RATE_MATCH_PAUSE; } } return RATE_MATCH_NONE; Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only, I'm not sure that is correct. I didn't contribute this support, and I don't have any platforms that support this, and I don't have any experience of it. What I do have is the data sheet for 88x3310, and that doesn't mention any restriction such as "only full duplex is supported in rate matching mode". It is true that to use pause frames, the MAC/PCS must be in full-duplex mode, but if the PHY supports half-duplex on the media to full-duplex on the MAC side link, then why should phylink restrict this to be full-duplex only? I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is not correct - or maybe not versatile enough.
On Wed, Oct 09, 2024 at 06:34:07PM +0100, Russell King (Oracle) wrote: > On Wed, Oct 09, 2024 at 12:52:13PM +0100, Daniel Golle wrote: > > Hi Russell, > > > > On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote: > > > On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote: > > > > Use bitmask of interfaces supported by the MAC for the PHY to choose > > > > from if the declared interface mode is among those using a single pair > > > > of SerDes lanes. > > > > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which > > > > results in half-duplex being supported in case the MAC supports that. > > > > Without this change, 2500Base-T PHYs will always operate in 2500Base-X > > > > mode with rate-matching, which is not only wasteful in terms of energy > > > > consumption, but also limits the supported interface modes to > > > > full-duplex only. > > > > > > We've had a similar patch before, and it's been NAK'd. The problem is > > > that supplying the host_interfaces for built-in PHYs means that the > > > hardware strapping for the PHY interface mode becomes useless, as does > > > the DT property specifying it - and thus we may end up selecting a > > > mode that both the MAC and PHY support, but the hardware design > > > doesn't (e.g. signals aren't connected, signal speed to fast.) > > > > > > For example, take a board designed to use RXAUI and the host supports > > > 10GBASE-R. The first problem is, RXAUI is not listed in the SFP > > > interface list because it's not usable over a SFP cage. > > > > I thought about that, also boards configured for RGMII but both MAC > > and PHY supporting SGMII or even 2500Base-X would be such a case. > > In order to make sure we don't switch to link modes not supported > > by the design I check if the interface mode configured in DT is > > among those suitable for use with an SFP (ie. using a single pair > > of SerDes lanes): > > if (test_bit(pl->link_interface, phylink_sfp_interfaces)) > > phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces, > > pl->config->supported_interfaces); > > > > Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so > > cases in which those modes are configured in DT are already excluded. > > This still won't work. There are drivers (boo, hiss, stmmac crap which > is resistant to cleanup and fixing but there's others too) that don't > do the phylink interface switching. Ok, what a mess. So multiple interfaces modes will have to be declared in DT for those boards which actually support that. What you be open to something like that: phy-mode = "2500base-x", "sgmii"; > > For example, stmmac sets the mode specified in DT and also if there > is a Synopsys XPCS, then the supported interfaces also gets USXGMII, > 10GKR, XLGMII, 10GBASER, SGMII, 1000BASEX and 2500BASEX. If DT says > 10GBASER, then the PHY must not switch to USXGMII, but if an > 88x3310 were to be thrown in, the PHY driver _would_ decide to use > USXGMII against the DT configuration. > > phydev->host_interfaces is there to allow PHYs on SFPs be properly > configured according to the host interface, where there is no DT > description for the module. It is not meant for built-in PHYs. Imho allowing several available interface modes can still be advantageous also for built-in PHYs. Measurable power savings (~100mW on MT7986!) and not needing rate matching are both desireable things. The question is just how would we get there... > > As we are also using using rate matching right now in cases when it > > should not (and thereby inhibiting support for half-duplex modes), I > > suppose the only good solution would be to allow a set of interface > > modes in DT instead of only a single one. > > Two issues... why was the PHY configured via firmware to use rate > matching if that brings with it this restriction, and it's possible > not to? Looking at drivers/net/phy/realtek.c you see in rtl822xb_config_init() ... has_2500 = test_bit(PHY_INTERFACE_MODE_2500BASEX, phydev->host_interfaces) || phydev->interface == PHY_INTERFACE_MODE_2500BASEX; has_sgmii = test_bit(PHY_INTERFACE_MODE_SGMII, phydev->host_interfaces) || phydev->interface == PHY_INTERFACE_MODE_SGMII; /* fill in possible interfaces */ __assign_bit(PHY_INTERFACE_MODE_2500BASEX, phydev->possible_interfaces, has_2500); __assign_bit(PHY_INTERFACE_MODE_SGMII, phydev->possible_interfaces, has_sgmii); if (!has_2500 && !has_sgmii) return 0; /* determine SerDes option mode */ if (has_2500 && !has_sgmii) { mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX; phydev->rate_matching = RATE_MATCH_PAUSE; } else { mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX_SGMII; phydev->rate_matching = RATE_MATCH_NONE; } ... So rate-matching is NOT configured in firmware, but rather it is selected in cases where we (seemingly) don't have any other option. It may not be a good choice (imho it never is), but rather just a last resort in case we otherwise can't support lower speeds at all. > > Second, aqr107_get_rate_matching() is rather basic based on what people > want. It doesn't actually ask the PHY what its going to do. I know > there's a bunch of VEND1_GLOBAL_CFG registers that give the serdes > mode and rate adaption for each speed, and these can be set not only > by firmware configuration, but changed by the host. > > So, aqr107_get_rate_matching() should work out whether rate matching > will be used for the interface mode by scanning these registers. > [...] Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode and perform rate-matching for lower speeds. > > > Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only, > I'm not sure that is correct. I didn't contribute this support, and I > don't have any platforms that support this, and I don't have any > experience of it. Afaik enforcing half-duplex via rate-maching with pause-frames is supported by all the 2.5G PHYs I've seen up to now. > > What I do have is the data sheet for 88x3310, and that doesn't mention > any restriction such as "only full duplex is supported in rate matching > mode". Yep, and I suppose it should work just fine. The same applies for RealTek and MaxLinear PHYs. I've tested it. > > It is true that to use pause frames, the MAC/PCS must be in full-duplex > mode, but if the PHY supports half-duplex on the media to full-duplex > on the MAC side link, then why should phylink restrict this to be > full-duplex only? There is no reason for that imho. phylink.c states /* Although a duplex-matching phy might exist, we * conservatively remove these modes because the MAC * will not be aware of the half-duplex nature of the * link. */ Afaik, practially all rate-matching PHYs which do support half-duplex modes on the TP interface can perform duplex-matching as well. > > I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is > not correct - or maybe not versatile enough. I agree. Never the less, why use rate matching at all if we don't have to? It's obviously inefficient and wasteful, having the MAC follow the PHY speed is preferrable in every case imho.
On Wed, Oct 09, 2024 at 07:52:42PM +0100, Daniel Golle wrote: > On Wed, Oct 09, 2024 at 06:34:07PM +0100, Russell King (Oracle) wrote: > > On Wed, Oct 09, 2024 at 12:52:13PM +0100, Daniel Golle wrote: > > > Hi Russell, > > > > > > On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote: > > > > On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote: > > > > > Use bitmask of interfaces supported by the MAC for the PHY to choose > > > > > from if the declared interface mode is among those using a single pair > > > > > of SerDes lanes. > > > > > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which > > > > > results in half-duplex being supported in case the MAC supports that. > > > > > Without this change, 2500Base-T PHYs will always operate in 2500Base-X > > > > > mode with rate-matching, which is not only wasteful in terms of energy > > > > > consumption, but also limits the supported interface modes to > > > > > full-duplex only. > > > > > > > > We've had a similar patch before, and it's been NAK'd. The problem is > > > > that supplying the host_interfaces for built-in PHYs means that the > > > > hardware strapping for the PHY interface mode becomes useless, as does > > > > the DT property specifying it - and thus we may end up selecting a > > > > mode that both the MAC and PHY support, but the hardware design > > > > doesn't (e.g. signals aren't connected, signal speed to fast.) > > > > > > > > For example, take a board designed to use RXAUI and the host supports > > > > 10GBASE-R. The first problem is, RXAUI is not listed in the SFP > > > > interface list because it's not usable over a SFP cage. > > > > > > I thought about that, also boards configured for RGMII but both MAC > > > and PHY supporting SGMII or even 2500Base-X would be such a case. > > > In order to make sure we don't switch to link modes not supported > > > by the design I check if the interface mode configured in DT is > > > among those suitable for use with an SFP (ie. using a single pair > > > of SerDes lanes): > > > if (test_bit(pl->link_interface, phylink_sfp_interfaces)) > > > phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces, > > > pl->config->supported_interfaces); > > > > > > Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so > > > cases in which those modes are configured in DT are already excluded. > > > > This still won't work. There are drivers (boo, hiss, stmmac crap which > > is resistant to cleanup and fixing but there's others too) that don't > > do the phylink interface switching. > > Ok, what a mess. So multiple interfaces modes will have to be declared > in DT for those boards which actually support that. What you be open > to something like that: > > phy-mode = "2500base-x", "sgmii"; I think we've tried going down that route before and it wasn't well received. > > For example, stmmac sets the mode specified in DT and also if there > > is a Synopsys XPCS, then the supported interfaces also gets USXGMII, > > 10GKR, XLGMII, 10GBASER, SGMII, 1000BASEX and 2500BASEX. If DT says > > 10GBASER, then the PHY must not switch to USXGMII, but if an > > 88x3310 were to be thrown in, the PHY driver _would_ decide to use > > USXGMII against the DT configuration. > > > > phydev->host_interfaces is there to allow PHYs on SFPs be properly > > configured according to the host interface, where there is no DT > > description for the module. It is not meant for built-in PHYs. > > Imho allowing several available interface modes can still be > advantageous also for built-in PHYs. Measurable power savings (~100mW on > MT7986!) and not needing rate matching are both desireable things. > > The question is just how would we get there... We already do. Let me take an example that I use. Macchiatobin, which you've probably heard lots about. The PHY there is the 88x3310, which is configured to run in MAC mode 0 by hardware strapping. MAC mode 0 means the 88x3310 will switch between 10GBASE-R, 5GBASE-R, 2500BASE-X and SGMII on its host interface. The MAC is Marvell's PP2, which supports all of those, and being one of the original MACs that phylink was developed against, is coded properly such that it fully works with phylink dynamically changing the interface mode. The interface mode given in DT is just a "guide" because the 88x3310 does no more than verify that the interface mode that it is bound with is one it supports. However, every time the link comes up, providing it is not operating in rate matching mode (which the PP2 doesn't support) it will change its MAC facing interface appropriately. Another board uses the 88x3310 with XAUI, and if I remember correctly, the PHY is strapped for XAUI with rate matching mode. It's connected to an 88e6390x DSA switch port 9, which supports RXAUI, XAUI, SGMII, 1000BASEX, 2500BASEX and maybe other stuff too. So, what we do is in DT, we specify the maximum mode, and rely on the hardware being correctly strapped on the PHY to configure how the MAC side interface will be used. Now, the thing with that second board is... if we use your original suggestion, then we end up filling the host_interfaces with just 2500BASEX, 1000BASEX and SGMII. That will lead mv3310_select_mactype() to select MV_V2_33X0_PORT_CTRL_MACTYPE_10GBASER for which the PHY will attempt to use the modes I listed above for Macchiatobin on its MAC interface which is wrong. Let me repeat the point. phydev->host_interfaces is there to allow a PHY driver to go off and make its own decisions about the interface mode group that it should use _ignoring_ what's being asked of it when the MAC binds to it. It should be empty for built-in setups where this should not be used, and we have precedent on Macchiatobin that interface switching by the PHY is permitted even in that situation. > > > As we are also using using rate matching right now in cases when it > > > should not (and thereby inhibiting support for half-duplex modes), I > > > suppose the only good solution would be to allow a set of interface > > > modes in DT instead of only a single one. > > > > Two issues... why was the PHY configured via firmware to use rate > > matching if that brings with it this restriction, and it's possible > > not to? > > Looking at drivers/net/phy/realtek.c you see in rtl822xb_config_init() > ... > has_2500 = test_bit(PHY_INTERFACE_MODE_2500BASEX, > phydev->host_interfaces) || > phydev->interface == PHY_INTERFACE_MODE_2500BASEX; > > has_sgmii = test_bit(PHY_INTERFACE_MODE_SGMII, > phydev->host_interfaces) || > phydev->interface == PHY_INTERFACE_MODE_SGMII; > > /* fill in possible interfaces */ > __assign_bit(PHY_INTERFACE_MODE_2500BASEX, phydev->possible_interfaces, > has_2500); > __assign_bit(PHY_INTERFACE_MODE_SGMII, phydev->possible_interfaces, > has_sgmii); > > if (!has_2500 && !has_sgmii) > return 0; > > /* determine SerDes option mode */ > if (has_2500 && !has_sgmii) { > mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX; > phydev->rate_matching = RATE_MATCH_PAUSE; > } else { > mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX_SGMII; > phydev->rate_matching = RATE_MATCH_NONE; > } > ... > > So rate-matching is NOT configured in firmware, but rather it is > selected in cases where we (seemingly) don't have any other option. It > may not be a good choice (imho it never is), but rather just a last > resort in case we otherwise can't support lower speeds at all. > > > > > Second, aqr107_get_rate_matching() is rather basic based on what people > > want. It doesn't actually ask the PHY what its going to do. I know > > there's a bunch of VEND1_GLOBAL_CFG registers that give the serdes > > mode and rate adaption for each speed, and these can be set not only > > by firmware configuration, but changed by the host. > > > > So, aqr107_get_rate_matching() should work out whether rate matching > > will be used for the interface mode by scanning these registers. > > [...] > > Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode > and perform rate-matching for lower speeds. They don't _have_ to. To use a SFP with an AQR PHY on it in a clearfog platform (which only supports SGMII, 1000BASE-X and 2500BASE-X, I hacked the aquantia driver to do this: phy_set_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, MDIO_CTRL1_LPOWER); mdelay(10); phy_write_mmd(phydev, MDIO_MMD_VEND1, 0x31a, 2); phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_10M, VEND1_GLOBAL_CFG_SGMII_AN | VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_100M, VEND1_GLOBAL_CFG_SGMII_AN | VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_1G, VEND1_GLOBAL_CFG_SGMII_AN | VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_2_5G, VEND1_GLOBAL_CFG_SGMII_AN | VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII); phy_clear_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, MDIO_CTRL1_LPOWER); which disables rate matching and causes it to switch interfaces. Moreover, aqr107_read_status() reads the interface status and sets the MAC interface accordingly, so the driver does support dynamically changing the MAC interface just like 88x3310. If all use cases of this PHY only ever did rate matching, then there would be no point to that code being there! > > Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only, > > I'm not sure that is correct. I didn't contribute this support, and I > > don't have any platforms that support this, and I don't have any > > experience of it. > > Afaik enforcing half-duplex via rate-maching with pause-frames is > supported by all the 2.5G PHYs I've seen up to now. I'm sorry, I don't understand your sentence, because it just flies wildly against everything that's been said. First, pause-frames require full duplex on the link on which pause frames are to be sent and received. That's fundamental. Second, I'm not sure what "enforcing half-duplex" has to do with rate-matching with pause-frames. Third, the 88x3310 without MACSEC in rate matching mode requires the MAC to pace itself. No support for pause frames at all - you only get pause frames with the 88x3310P which has MACSEC. This is a 10M to 10G PHY multi-rate PHY. So... your comment makes no sense to me, sorry. > > What I do have is the data sheet for 88x3310, and that doesn't mention > > any restriction such as "only full duplex is supported in rate matching > > mode". > > Yep, and I suppose it should work just fine. The same applies for > RealTek and MaxLinear PHYs. I've tested it. > > > It is true that to use pause frames, the MAC/PCS must be in full-duplex > > mode, but if the PHY supports half-duplex on the media to full-duplex > > on the MAC side link, then why should phylink restrict this to be > > full-duplex only? > > There is no reason for that imho. phylink.c states > /* Although a duplex-matching phy might exist, we > * conservatively remove these modes because the MAC > * will not be aware of the half-duplex nature of the > * link. > */ > > Afaik, practially all rate-matching PHYs which do support half-duplex > modes on the TP interface can perform duplex-matching as well. So we should remove that restriction! > > I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is > > not correct - or maybe not versatile enough. > > I agree. Never the less, why use rate matching at all if we don't have > to? It's obviously inefficient and wasteful, having the MAC follow the > PHY speed is preferrable in every case imho. As I say, I believe this is a matter of how the Aquantia firmware is commissioned - I believe the defaults for all of the VEND1_GLOBAL_CFG* registers come from the firmware that the PHY loads. So much like the pin strapping of 88x3310 determines its set of MAC modes, which are decided by the board designer, Aquantia firmware determines the PHY behaviour. The problem here is we need to be mindful of the existing implementations, not only those where the MAC/PHY combination would get stuff wrong, but also of those MACs where multiple different interfaces are supported through hardware strapping which is not software determinable, but are not software configurable (like DSA switches.) Then there's those who hacked phylink into their use without properly implementing it (e.g. where mac_config() is entirely empty, and thus support no reconfiguration of the interface yet support multiple different interfaces at driver initialisation time.) Yes, some of these situations can be said to be buggy. Feel free to spend a few years hacking on stmmac to try and fix that god almighty mess. I've given up with stmmac now after trying several attempts at cleaning the mess up, and ending up in a very non-productive place with it.
On Wed, Oct 09, 2024 at 09:12:06PM +0100, Russell King (Oracle) wrote: > On Wed, Oct 09, 2024 at 07:52:42PM +0100, Daniel Golle wrote: > > [...] > > Imho allowing several available interface modes can still be > > advantageous also for built-in PHYs. Measurable power savings (~100mW on > > MT7986!) and not needing rate matching are both desireable things. > > > > The question is just how would we get there... > > We already do. Let me take an example that I use. Macchiatobin, which > you've probably heard lots about. > > The PHY there is the 88x3310, which is configured to run in MAC mode 0 > by hardware strapping. MAC mode 0 means the 88x3310 will switch between > 10GBASE-R, 5GBASE-R, 2500BASE-X and SGMII on its host interface. > > The MAC is Marvell's PP2, which supports all of those, and being one of > the original MACs that phylink was developed against, is coded properly > such that it fully works with phylink dynamically changing the interface > mode. > > The interface mode given in DT is just a "guide" because the 88x3310 > does no more than verify that the interface mode that it is bound with > is one it supports. However, every time the link comes up, providing > it is not operating in rate matching mode (which the PP2 doesn't > support) it will change its MAC facing interface appropriately. Unfortunately, and apparently different from the 88x3310, there is no hardware strapping for this to be decided with RealTek's PHYs, but using rate-matching or interface-mode-switching is decided by the driver. > > Another board uses the 88x3310 with XAUI, and if I remember correctly, > the PHY is strapped for XAUI with rate matching mode. It's connected > to an 88e6390x DSA switch port 9, which supports RXAUI, XAUI, SGMII, > 1000BASEX, 2500BASEX and maybe other stuff too. > > So, what we do is in DT, we specify the maximum mode, and rely on the > hardware being correctly strapped on the PHY to configure how the > MAC side interface will be used. Does that mean the realtek.c PHY driver (and maybe others as well) should just assume that the MAC must also support SGMII mode in case 2500Base-X is supported? I was under the impression that there might be MAC out there which support only 2500Base-X, or maybe 2500Base-X and 1000Base-X, but not SGMII. And on those we would always need to use 2500Base-X with rate matching. But maybe I'm wrong. > > Now, the thing with that second board is... if we use your original > suggestion, then we end up filling the host_interfaces with just > 2500BASEX, 1000BASEX and SGMII. That will lead mv3310_select_mactype() > to select MV_V2_33X0_PORT_CTRL_MACTYPE_10GBASER for which the PHY will > attempt to use the modes I listed above for Macchiatobin on its MAC > interface which is wrong. > > Let me repeat the point. phydev->host_interfaces is there to allow a > PHY driver to go off and make its own decisions about the interface > mode group that it should use _ignoring_ what's being asked of it > when the MAC binds to it. It should be empty for built-in setups > where this should not be used, and we have precedent on Macchiatobin > that interface switching by the PHY is permitted even in that > situation. ... because it is decided before Linux even starts, and despite in armada-8040-mcbin.dts is stated. phy-mode = "10gbase-r"; The same approach would not work for those RealTek PHYs and boards using them. In some cases the bootloader sets up the PHY, and usually there rate matching is used -- performance is not the goal there, but simplicity. In other cases we rely entirely on the Linux PHY driver to set things up. How would the PHY driver know whether it should setup rate matching mode or interface switching mode? The SFP case is clear, it's using host_interfaces. But in the built-in case, as of today, it always ends up in fixed interface mode with rate matching. > > > So, aqr107_get_rate_matching() should work out whether rate matching > > > will be used for the interface mode by scanning these registers. > > > [...] > > > > Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode > > and perform rate-matching for lower speeds. > > They don't _have_ to. To use a SFP with an AQR PHY on it in a clearfog > platform (which only supports SGMII, 1000BASE-X and 2500BASE-X, I > hacked the aquantia driver to do this: > > phy_set_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, MDIO_CTRL1_LPOWER); > mdelay(10); > phy_write_mmd(phydev, MDIO_MMD_VEND1, 0x31a, 2); > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_10M, > VEND1_GLOBAL_CFG_SGMII_AN | > VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_100M, > VEND1_GLOBAL_CFG_SGMII_AN | > VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_1G, > VEND1_GLOBAL_CFG_SGMII_AN | > VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_2_5G, > VEND1_GLOBAL_CFG_SGMII_AN | > VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII); > phy_clear_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, > MDIO_CTRL1_LPOWER); > > which disables rate matching and causes it to switch interfaces. That's pretty cool, and when connected to a MAC which support both SGMII and 2500Base-X, and a driver which can switch between the two, it should be the default imho. > > Moreover, aqr107_read_status() reads the interface status and sets > the MAC interface accordingly, so the driver does support dynamically > changing the MAC interface just like 88x3310. If all use cases of > this PHY only ever did rate matching, then there would be no point > to that code being there! So, just like for mcbin, the driver here replies on the PHY being configured (by strapping or by the bootloader) to either use a fixed interface mode with rate matching, or perform interface mode switching. What if we had to decide this in the driver (like it is the case with RealTek PHYs)? Let me summarize: - We can't just assume that every MAC which supports 2500Base-X also supports SGMII. It might support only 2500Base-X, or 2500Base-X and 1000Base-X, or the driver might not support switching between interface modes. - We can't rely on the PHY being pre-configured by the bootloader to either rate maching or interface-switching mode. - There are no strapping options for this, the only thing which is configured by strapping is the address. > > > > Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only, > > > I'm not sure that is correct. I didn't contribute this support, and I > > > don't have any platforms that support this, and I don't have any > > > experience of it. > > > > Afaik enforcing half-duplex via rate-maching with pause-frames is > > supported by all the 2.5G PHYs I've seen up to now. > > I'm sorry, I don't understand your sentence, because it just flies > wildly against everything that's been said. > > First, pause-frames require full duplex on the link on which pause > frames are to be sent and received. That's fundamental. I meant pause-frames on the host-side interface of the PHY, which are used to perform rate and duplex matching for the remote-side interface. I hope that makes more sense now. > > Second, I'm not sure what "enforcing half-duplex" has to do with > rate-matching with pause-frames. It can be used as a method to preventing the MAC from sending to the PHY while receiving. > > Third, the 88x3310 without MACSEC in rate matching mode requires the > MAC to pace itself. No support for pause frames at all - you only > get pause frames with the 88x3310P which has MACSEC. This is a 10M > to 10G PHY multi-rate PHY. > > So... your comment makes no sense to me, sorry. You misunderstood me. I didn't mean pause frames on the remote-side of the PHY. I meant pause frames as in RATE_MATCH_PAUSE. > > > > What I do have is the data sheet for 88x3310, and that doesn't mention > > > any restriction such as "only full duplex is supported in rate matching > > > mode". > > > > Yep, and I suppose it should work just fine. The same applies for > > RealTek and MaxLinear PHYs. I've tested it. > > > > > It is true that to use pause frames, the MAC/PCS must be in full-duplex > > > mode, but if the PHY supports half-duplex on the media to full-duplex > > > on the MAC side link, then why should phylink restrict this to be > > > full-duplex only? > > > > There is no reason for that imho. phylink.c states > > /* Although a duplex-matching phy might exist, we > > * conservatively remove these modes because the MAC > > * will not be aware of the half-duplex nature of the > > * link. > > */ > > > > Afaik, practially all rate-matching PHYs which do support half-duplex > > modes on the TP interface can perform duplex-matching as well. > > So we should remove that restriction! Absolutely. That will solve at least half of the problem. It still leaves us with a SerDes clock running 2.5x faster than it would have to, PHY and MAC consuming more energy than they would have to and TX performance being slightly worse (visible with iperf3 --bidir at least with some PHYs). But at least the link would come up. > > > > I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is > > > not correct - or maybe not versatile enough. > > > > I agree. Never the less, why use rate matching at all if we don't have > > to? It's obviously inefficient and wasteful, having the MAC follow the > > PHY speed is preferrable in every case imho. > > As I say, I believe this is a matter of how the Aquantia firmware is > commissioned - I believe the defaults for all of the VEND1_GLOBAL_CFG* > registers come from the firmware that the PHY loads. I must admit I don't care much about the Aquantia story in the picture here. Simply because the rate-matching implementation they got seems to work rather well, and there anyway aren't many low-cost mass-produced devices having Aquantia 2.5GBit/s PHYs. I know exactly one device (Ubiquity UniFi 6 LR v1) which isn't produced in that version any more -- later versions replaced the rather costly Aquantia PHY with a low-cost RealTek PHY. There are millions of devices with Intel/MaxLinear GPY211, and probably even by magnitudes more with RealTek RTL8221B. Having all those devices perform rate-matching and hence running the SerDes clock at 3.125 GHz instead of 1.250 GHz is not just a significant waste of electricity, but also just means wrose performance, esp. on 1000 MBit/s links (which happen to still be the most common). > > So much like the pin strapping of 88x3310 determines its set of MAC > modes, which are decided by the board designer, Aquantia firmware > determines the PHY behaviour. > > The problem here is we need to be mindful of the existing > implementations, not only those where the MAC/PHY combination would > get stuff wrong, but also of those MACs where multiple different > interfaces are supported through hardware strapping which is not > software determinable, but are not software configurable (like DSA > switches.) Then there's those who hacked phylink into their use > without properly implementing it (e.g. where mac_config() is entirely > empty, and thus support no reconfiguration of the interface yet > support multiple different interfaces at driver initialisation time.) > > Yes, some of these situations can be said to be buggy. Feel free to > spend a few years hacking on stmmac to try and fix that god almighty > mess. I've given up with stmmac now after trying several attempts at > cleaning the mess up, and ending up in a very non-productive place > with it. Sounds dreadful. However, just because one driver is a crazy mess it should mean that all drivers have to perform worse than they could. Can you name a few boards which rely on that particularly smelly code? I might get some of those and get a feel of how bad things are, and maybe I will spend a few years trying to improve things there.
On Wed, Oct 09, 2024 at 10:31:51PM +0100, Daniel Golle wrote: > On Wed, Oct 09, 2024 at 09:12:06PM +0100, Russell King (Oracle) wrote: > > On Wed, Oct 09, 2024 at 07:52:42PM +0100, Daniel Golle wrote: > > > [...] > > > Imho allowing several available interface modes can still be > > > advantageous also for built-in PHYs. Measurable power savings (~100mW on > > > MT7986!) and not needing rate matching are both desireable things. > > > > > > The question is just how would we get there... > > > > We already do. Let me take an example that I use. Macchiatobin, which > > you've probably heard lots about. > > > > The PHY there is the 88x3310, which is configured to run in MAC mode 0 > > by hardware strapping. MAC mode 0 means the 88x3310 will switch between > > 10GBASE-R, 5GBASE-R, 2500BASE-X and SGMII on its host interface. > > > > The MAC is Marvell's PP2, which supports all of those, and being one of > > the original MACs that phylink was developed against, is coded properly > > such that it fully works with phylink dynamically changing the interface > > mode. > > > > The interface mode given in DT is just a "guide" because the 88x3310 > > does no more than verify that the interface mode that it is bound with > > is one it supports. However, every time the link comes up, providing > > it is not operating in rate matching mode (which the PP2 doesn't > > support) it will change its MAC facing interface appropriately. > > Unfortunately, and apparently different from the 88x3310, there is no > hardware strapping for this to be decided with RealTek's PHYs, but > using rate-matching or interface-mode-switching is decided by the > driver. No, but there is firmware, and firmware provisioning determines the PHY behaviour. That's the equivalent of hardware pin strapping on Aquantia PHYs. I think you need to review a previous thread on the Aquantia PHYs and firmware setting stupid interface modes, and trying to fix the stuff up (and there in you will find the discussions about the firmware provisioning that determines how the MAC interface on these PHYs behave. https://lore.kernel.org/lkml/20221115230207.2e77pifwruzkexbr@skbuf/T/ You'll also find other useful stuff relevant to this discussion too. This is the lore link for the phy-modes discussion: https://lore.kernel.org/all/20211123164027.15618-1-kabel@kernel.org/T/ > > Another board uses the 88x3310 with XAUI, and if I remember correctly, > > the PHY is strapped for XAUI with rate matching mode. It's connected > > to an 88e6390x DSA switch port 9, which supports RXAUI, XAUI, SGMII, > > 1000BASEX, 2500BASEX and maybe other stuff too. > > > > So, what we do is in DT, we specify the maximum mode, and rely on the > > hardware being correctly strapped on the PHY to configure how the > > MAC side interface will be used. > > Does that mean the realtek.c PHY driver (and maybe others as well) should > just assume that the MAC must also support SGMII mode in case 2500Base-X > is supported? > > I was under the impression that there might be MAC out there which support > only 2500Base-X, or maybe 2500Base-X and 1000Base-X, but not SGMII. And on > those we would always need to use 2500Base-X with rate matching. > But maybe I'm wrong. The problem I'm trying to point out is that doing what you're wanting has a high chance of causing _regressions_, causing setups that work today to stop working. We can't allow that to happen, sorry. > > Now, the thing with that second board is... if we use your original > > suggestion, then we end up filling the host_interfaces with just > > 2500BASEX, 1000BASEX and SGMII. That will lead mv3310_select_mactype() > > to select MV_V2_33X0_PORT_CTRL_MACTYPE_10GBASER for which the PHY will > > attempt to use the modes I listed above for Macchiatobin on its MAC > > interface which is wrong. > > > > Let me repeat the point. phydev->host_interfaces is there to allow a > > PHY driver to go off and make its own decisions about the interface > > mode group that it should use _ignoring_ what's being asked of it > > when the MAC binds to it. It should be empty for built-in setups > > where this should not be used, and we have precedent on Macchiatobin > > that interface switching by the PHY is permitted even in that > > situation. > > ... because it is decided before Linux even starts, and despite in > armada-8040-mcbin.dts is stated. > phy-mode = "10gbase-r"; You may notice in the discussions I've linked above, phy-mode is the "initial" MAC mode for these PHYs... Note that this interface switching mechanism was introduced early on with the 88x3310 PHY, before any documentation on it was available, and all that was known at the time is that the PHY switched between different MAC facing interface modes depending on the negotiated speed. It needed to be supported, and what we have came out of that. Legacy is important, due to the "no regressions" rule that we have in kernel development - we can't go breaking already working setups. > The same approach would not work for those RealTek PHYs and boards > using them. In some cases the bootloader sets up the PHY, and usually > there rate matching is used -- performance is not the goal there, but > simplicity. In other cases we rely entirely on the Linux PHY driver > to set things up. How would the PHY driver know whether it should > setup rate matching mode or interface switching mode? In the case of the Realtek driver, it decides to override whatever came before in a defined way. E.g. rtl822xb_config_init(). > The SFP case is clear, it's using host_interfaces. But in the built-in > case, as of today, it always ends up in fixed interface mode with > rate matching. "always rate matching" no. Fixed interface mode, yes. If rtl822xb_config_init() sees phydev->interface is set to SGMII, then it uses 2500BASEX_SGMII mode without rate matching - and the advertisement will be limited to 1G and below which will effectively prevent the PHY using 2.5G mode - which is fine in that case. > > > > So, aqr107_get_rate_matching() should work out whether rate matching > > > > will be used for the interface mode by scanning these registers. > > > > [...] > > > > > > Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode > > > and perform rate-matching for lower speeds. > > > > They don't _have_ to. To use a SFP with an AQR PHY on it in a clearfog > > platform (which only supports SGMII, 1000BASE-X and 2500BASE-X, I > > hacked the aquantia driver to do this: > > > > phy_set_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, MDIO_CTRL1_LPOWER); > > mdelay(10); > > phy_write_mmd(phydev, MDIO_MMD_VEND1, 0x31a, 2); > > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_10M, > > VEND1_GLOBAL_CFG_SGMII_AN | > > VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); > > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_100M, > > VEND1_GLOBAL_CFG_SGMII_AN | > > VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); > > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_1G, > > VEND1_GLOBAL_CFG_SGMII_AN | > > VEND1_GLOBAL_CFG_SERDES_MODE_SGMII); > > phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_2_5G, > > VEND1_GLOBAL_CFG_SGMII_AN | > > VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII); > > phy_clear_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, > > MDIO_CTRL1_LPOWER); > > > > which disables rate matching and causes it to switch interfaces. > > That's pretty cool, and when connected to a MAC which support both > SGMII and 2500Base-X, and a driver which can switch between the two, > it should be the default imho. > > > > > Moreover, aqr107_read_status() reads the interface status and sets > > the MAC interface accordingly, so the driver does support dynamically > > changing the MAC interface just like 88x3310. If all use cases of > > this PHY only ever did rate matching, then there would be no point > > to that code being there! > > So, just like for mcbin, the driver here replies on the PHY being > configured (by strapping or by the bootloader) to either use > a fixed interface mode with rate matching, or perform interface mode > switching. What if we had to decide this in the driver (like it is > the case with RealTek PHYs)? Then we have a problem, because simply re-using the host_interfaces for built-in PHYs is _going_ _to_ cause problems. As I've said several times now... if all phylink users implemented the phylink methods such that they can at run time switch between all the interface modes in phylink's supported_interfaces (or only set interface modes they were prepared to switch between), then that would be a good first step forward. Unfortunately, for some, such as the 88e6390x, we can't tell exactly which interface mode is pinstrapped, and we can't change the interface mode. So, we have to live with the fact that ->supported_interfaces is less than accurate. > Let me summarize: > - We can't just assume that every MAC which supports 2500Base-X also > supports SGMII. It might support only 2500Base-X, or 2500Base-X and > 1000Base-X, or the driver might not support switching between > interface modes. > - We can't rely on the PHY being pre-configured by the bootloader to > either rate maching or interface-switching mode. > - There are no strapping options for this, the only thing which is > configured by strapping is the address. Right, so the only _safe_ thing to do is to assume that: 1. On existing PHY drivers which do not do any kind of interface switching, retrofitting interface switching of any kind is unsafe. 2. On brand new PHY drivers which have no prior history, there can not be any regressions, so implementing interface switching from the very start is safe. The only way out of this is by inventing something new for existing drivers that says "you can adopt a different behaviour" and that must be a per-platform switch - in other words, coming from the platform's firmware definition in some way. > > > > Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only, > > > > I'm not sure that is correct. I didn't contribute this support, and I > > > > don't have any platforms that support this, and I don't have any > > > > experience of it. > > > > > > Afaik enforcing half-duplex via rate-maching with pause-frames is > > > supported by all the 2.5G PHYs I've seen up to now. > > > > I'm sorry, I don't understand your sentence, because it just flies > > wildly against everything that's been said. > > > > First, pause-frames require full duplex on the link on which pause > > frames are to be sent and received. That's fundamental. > > I meant pause-frames on the host-side interface of the PHY, which > are used to perform rate and duplex matching for the remote-side > interface. I hope that makes more sense now. > > > > > Second, I'm not sure what "enforcing half-duplex" has to do with > > rate-matching with pause-frames. > > It can be used as a method to preventing the MAC from sending > to the PHY while receiving. > > > > > Third, the 88x3310 without MACSEC in rate matching mode requires the > > MAC to pace itself. No support for pause frames at all - you only > > get pause frames with the 88x3310P which has MACSEC. This is a 10M > > to 10G PHY multi-rate PHY. > > > > So... your comment makes no sense to me, sorry. > > You misunderstood me. I didn't mean pause frames on the remote-side of > the PHY. I meant pause frames as in RATE_MATCH_PAUSE. You also misunderstand me. I wasn't talking about remote-side of the PHY either! I'm also none the wiser exactly what you meant by "Afaik enforcing half-duplex via rate-maching with pause-frames is supported by all the 2.5G PHYs I've seen up to now." When rate matching with pause frames is being used, the host-side interface _must_ be in full duplex mode, because that is the only time when pause frames are permitted to be used by a MAC. I think we agree on that. What I'm having a hard time understanding is *enforcing* half-duplex. I think what you're trying to say is not "enforcing" the use of half-duplex somewhere, but pause frames over the host link to prevent transmission while receiving. It's likely not that simple - the PHY probably has buffering of the transmit frame (consider the difference in the time it takes to send a frame at 10G speeds vs a potential media speed of 10M.) What the PHY will do is send pause frames when that buffer gets full (e.g. after receiving one packet to be transmitted) until it has enough space to accept another transmit packet. If the link is in half-duplex mode, then the PHY will need to do the receive detection, collision handling, and retries of the pending frame itself because there will be no way to signal back to the MAC "that last frame you sent me, could you resend because it collided". > > > > What I do have is the data sheet for 88x3310, and that doesn't mention > > > > any restriction such as "only full duplex is supported in rate matching > > > > mode". > > > > > > Yep, and I suppose it should work just fine. The same applies for > > > RealTek and MaxLinear PHYs. I've tested it. > > > > > > > It is true that to use pause frames, the MAC/PCS must be in full-duplex > > > > mode, but if the PHY supports half-duplex on the media to full-duplex > > > > on the MAC side link, then why should phylink restrict this to be > > > > full-duplex only? > > > > > > There is no reason for that imho. phylink.c states > > > /* Although a duplex-matching phy might exist, we > > > * conservatively remove these modes because the MAC > > > * will not be aware of the half-duplex nature of the > > > * link. > > > */ > > > > > > Afaik, practially all rate-matching PHYs which do support half-duplex > > > modes on the TP interface can perform duplex-matching as well. > > > > So we should remove that restriction! > > Absolutely. That will solve at least half of the problem. It still > leaves us with a SerDes clock running 2.5x faster than it would have to, > PHY and MAC consuming more energy than they would have to and TX > performance being slightly worse (visible with iperf3 --bidir at least > with some PHYs). But at least the link would come up. It also means we move forward! > > > > I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is > > > > not correct - or maybe not versatile enough. > > > > > > I agree. Never the less, why use rate matching at all if we don't have > > > to? It's obviously inefficient and wasteful, having the MAC follow the > > > PHY speed is preferrable in every case imho. > > > > As I say, I believe this is a matter of how the Aquantia firmware is > > commissioned - I believe the defaults for all of the VEND1_GLOBAL_CFG* > > registers come from the firmware that the PHY loads. > > I must admit I don't care much about the Aquantia story in the > picture here. Simply because the rate-matching implementation they > got seems to work rather well, and there anyway aren't many low-cost > mass-produced devices having Aquantia 2.5GBit/s PHYs. I know exactly > one device (Ubiquity UniFi 6 LR v1) which isn't produced in that > version any more -- later versions replaced the rather costly > Aquantia PHY with a low-cost RealTek PHY. > > There are millions of devices with Intel/MaxLinear GPY211, and > probably even by magnitudes more with RealTek RTL8221B. Having all > those devices perform rate-matching and hence running the SerDes > clock at 3.125 GHz instead of 1.250 GHz is not just a significant > waste of electricity, but also just means wrose performance, esp. on > 1000 MBit/s links (which happen to still be the most common). > > > > > So much like the pin strapping of 88x3310 determines its set of MAC > > modes, which are decided by the board designer, Aquantia firmware > > determines the PHY behaviour. > > > > The problem here is we need to be mindful of the existing > > implementations, not only those where the MAC/PHY combination would > > get stuff wrong, but also of those MACs where multiple different > > interfaces are supported through hardware strapping which is not > > software determinable, but are not software configurable (like DSA > > switches.) Then there's those who hacked phylink into their use > > without properly implementing it (e.g. where mac_config() is entirely > > empty, and thus support no reconfiguration of the interface yet > > support multiple different interfaces at driver initialisation time.) > > > > Yes, some of these situations can be said to be buggy. Feel free to > > spend a few years hacking on stmmac to try and fix that god almighty > > mess. I've given up with stmmac now after trying several attempts at > > cleaning the mess up, and ending up in a very non-productive place > > with it. > > Sounds dreadful. However, just because one driver is a crazy mess it > should mean that all drivers have to perform worse than they could. > Can you name a few boards which rely on that particularly smelly > code? I might get some of those and get a feel of how bad things are, > and maybe I will spend a few years trying to improve things there. Well, when you realise that stmmac's internal PCS implementation entirely bypasses phylink yet the driver uses phylink... I had patches, now junked. I tried some simpler cleanups before trying to fix that, which got massively hung up on an insanely trivial question about whether "%d" or "%u" should be used to print the speed in a mac_link_up method... and resulted in me junking the entire set of stmmac patches I had. I've given up with stmmac. You may notice that my recent XPCS patch series only _minimally_ touched the stmmac code. It could've gone further, but I really want to touch stmmac as little as possible now. I've decided it's a hopeless case. In fact, I've asked that the driver backs out its use of phylink... which has been ignored. So... at this point, as phylink maintainer, my only option is to ignore stmmac for any more major changes and leave the updates to whoever cares about stmmac. I don't like reaching that point with a driver, but I see no other option with it given the history..
On Thu, Oct 10, 2024 at 12:23:18AM +0100, Russell King (Oracle) wrote: > On Wed, Oct 09, 2024 at 10:31:51PM +0100, Daniel Golle wrote: > > On Wed, Oct 09, 2024 at 09:12:06PM +0100, Russell King (Oracle) wrote: > > > [...] > > > The interface mode given in DT is just a "guide" because the 88x3310 > > > does no more than verify that the interface mode that it is bound with > > > is one it supports. However, every time the link comes up, providing > > > it is not operating in rate matching mode (which the PP2 doesn't > > > support) it will change its MAC facing interface appropriately. > > > > Unfortunately, and apparently different from the 88x3310, there is no > > hardware strapping for this to be decided with RealTek's PHYs, but > > using rate-matching or interface-mode-switching is decided by the > > driver. > > No, but there is firmware, and firmware provisioning determines the > PHY behaviour. That's the equivalent of hardware pin strapping on > Aquantia PHYs. Yes, but there is broken firmware, or firmware which doesn't care and always just expected the Linux driver to take care of that. In many cases I've seen, U-Boot, for example, only initializes the PHY and sets those registers in case it is configured to load Linux via TFTP, or interrupted by the user who may chose to do that. In the default case (loading Linux from flash) the PHY is not even powered-up in some cases. In case you mean firmware as in what Linux is presented in DT, then yes, I agree. But we'll need define new bindings for that then. > > I think you need to review a previous thread on the Aquantia PHYs > and firmware setting stupid interface modes, and trying to fix the > stuff up (and there in you will find the discussions about the > firmware provisioning that determines how the MAC interface on these > PHYs behave. > > https://lore.kernel.org/lkml/20221115230207.2e77pifwruzkexbr@skbuf/T/ > > You'll also find other useful stuff relevant to this discussion too. > This is the lore link for the phy-modes discussion: > > https://lore.kernel.org/all/20211123164027.15618-1-kabel@kernel.org/T/ Thank you a lot for digging up those conversions. I've spent some hours reading and understanding all of that, and well, things are complicated. To my surprise, introducing an array of phy-connection-type or phy-mode in DT has been positively received by the DT folks, but the idea was then later on discarded by netdev participants of the debate. Imho that would still be the best solution, but I understand the argument that having to list all supported modes can be both tedious and hard to review, and hence error-prone. > > > So, what we do is in DT, we specify the maximum mode, and rely on the > > > hardware being correctly strapped on the PHY to configure how the > > > MAC side interface will be used. > > > > Does that mean the realtek.c PHY driver (and maybe others as well) should > > just assume that the MAC must also support SGMII mode in case 2500Base-X > > is supported? > > > > I was under the impression that there might be MAC out there which support > > only 2500Base-X, or maybe 2500Base-X and 1000Base-X, but not SGMII. And on > > those we would always need to use 2500Base-X with rate matching. > > But maybe I'm wrong. > > The problem I'm trying to point out is that doing what you're wanting > has a high chance of causing _regressions_, causing setups that work > today to stop working. We can't allow that to happen, sorry. Of course :) > Note that this interface switching mechanism was introduced early on > with the 88x3310 PHY, before any documentation on it was available, > and all that was known at the time is that the PHY switched between > different MAC facing interface modes depending on the negotiated > speed. It needed to be supported, and what we have came out of that. > Legacy is important, due to the "no regressions" rule that we have > in kernel development - we can't go breaking already working setups. What about marking Ethernet drivers which are capable of interface mode switching? Right now there isn't one "correct" thing to do for PHY drivers, which is bad, as people may look into any driver as a reference for the development of future drivers. So why not introduce a MAC capability bit? Even dedicated for switching between two specific modes (SGMII and 2500Base-X), to avoid any ambiguitities or unnecessary feature-creep. Typically switching between modes within the same PCS is more easy to support, switching from 10:8 to 66:64 coding can be more involved, and require resets which affect other links, so that's something worth avoiding unless we really need it (in case the users inserts a different SFP module it would be really needed, in case the external link goes from 5 Gbit/s to 2.5 Gbit/s it might be worth avoiding having to switch from 5GBase-R to 2500Base-X) It wouldn't be the first time something like that would be done, however I have full understanding that any reminder of that whole legacy_pre_march2020 episode may trigger post-traumatic stress in netdev maintainers. > > > The same approach would not work for those RealTek PHYs and boards > > using them. In some cases the bootloader sets up the PHY, and usually > > there rate matching is used -- performance is not the goal there, but > > simplicity. In other cases we rely entirely on the Linux PHY driver > > to set things up. How would the PHY driver know whether it should > > setup rate matching mode or interface switching mode? > > In the case of the Realtek driver, it decides to override whatever came > before in a defined way. E.g. rtl822xb_config_init(). Yes, and right now it would decide to always use 2500Base-X on a host capable of 2500Base-X, and hence make use of RATE_MATCH_PAUSE for 1000M/100M/10M links, even though using SGMII would be better in terms of energy consumption and performance. > > > The SFP case is clear, it's using host_interfaces. But in the built-in > > case, as of today, it always ends up in fixed interface mode with > > rate matching. > > "always rate matching" no. Fixed interface mode, yes. If > rtl822xb_config_init() sees phydev->interface is set to SGMII, then > it uses 2500BASEX_SGMII mode without rate matching - and the > advertisement will be limited to 1G and below which will effectively > prevent the PHY using 2.5G mode - which is fine in that case. While that case might be relevant for SFP modules, in pracise, why would anyone use a more expensive 2.5Gbit/s PHY on a board which only supports up to 1Gbit/s -- that doesn't happen in the real world, because 1Gbit/s SGMII PHYs are ubiquitous and much cheaper than faster ones. > > Let me summarize: > > - We can't just assume that every MAC which supports 2500Base-X also > > supports SGMII. It might support only 2500Base-X, or 2500Base-X and > > 1000Base-X, or the driver might not support switching between > > interface modes. > > - We can't rely on the PHY being pre-configured by the bootloader to > > either rate maching or interface-switching mode. > > - There are no strapping options for this, the only thing which is > > configured by strapping is the address. > > Right, so the only _safe_ thing to do is to assume that: > > 1. On existing PHY drivers which do not do any kind of interface > switching, retrofitting interface switching of any kind is unsafe. It's important to note that for the RealTek driver specifically we have done that in OpenWrt from day 1 -- simply because the rate-matching performed, especially by early versions of the PHY, is too bad. So we always made the driver switch interface to SGMII in case of link speeds slower than 2500M. Now, with the backport of upstream commits replacing our downstream patches, users started to complain that the issue of bad PHY performance in 1 Gbit/s mode has returned, which is the whole reason why I started to work on this issue. I understand, however, there may of course be other users of those RealTek 2.5G PHYs, even Rockchip with stmmac maybe, and that would break if we assume the MAC can support switching between 2500Base-X and SGMII, so users of those boards will have to live with the bad performance of the rate-matching performed by the PHY unless someone fixes the stmmac driver... > > 2. On brand new PHY drivers which have no prior history, there can > not be any regressions, so implementing interface switching from > the very start is safe. > > The only way out of this is by inventing something new for existing > drivers that says "you can adopt a different behaviour" and that > must be a per-platform switch - in other words, coming from the > platform's firmware definition in some way. Why would it not just be the MAC driver which indicates that it can support switching to lower-speed interface modes it also supports? Do you really believe there are boards which are electrically unfit for performing SGMII on the traces intended for 2500Base-X? If by 'firmware' you mean 'device tree' then we are back on square one, and we would need several phy-connection-type aka. phy-mode listed there. After having read all the threads from 2021 you have provided links for, I believe that maybe an additional property which lists the interface modes to be used *optionally* and in addition to the primary (ie. fastest) mode stated in phy-mode or phy-connection-type could be a way out of this. It would still end up being potentially a longer list of interface modes, but reality is complex! Looking at other corners of DT it would still be rather simple and human readable (in contrast: take a look at inhumanly long arrays of gpio-line-names where even the order matters, for example...) Yet, it would still be a partial violation of Device Tree rules because we are (also) describing driver capabilities there then. What if, let's say one day stmmac *will* support interface mode switching? Should we update each and every board's device tree? Hence, unless there is a really good reason to believe that a board which works fine with 2500Base-X would not work equally well with SGMII, given that both the MAC (-driver) and PHY (-driver) support both modes, I don't see a need to burden firmware with having to list additional phy-modes. Of course, I'm always only talking about allowing switching to slower interface modes, and always only about modes which use a single pair differential lanes (ie. sgmii, 1000base-x, 2500base-x, 5gbase-r, 10gbase-r, ...). > > > > Afaik, practially all rate-matching PHYs which do support half-duplex > > > > modes on the TP interface can perform duplex-matching as well. > > > > > > So we should remove that restriction! > > > > Absolutely. That will solve at least half of the problem. It still > > leaves us with a SerDes clock running 2.5x faster than it would have to, > > PHY and MAC consuming more energy than they would have to and TX > > performance being slightly worse (visible with iperf3 --bidir at least > > with some PHYs). But at least the link would come up. > > It also means we move forward! ACK. Patch posted.
On Thu, Oct 10, 2024 at 03:07:30PM +0100, Daniel Golle wrote: > > Note that this interface switching mechanism was introduced early on > > with the 88x3310 PHY, before any documentation on it was available, > > and all that was known at the time is that the PHY switched between > > different MAC facing interface modes depending on the negotiated > > speed. It needed to be supported, and what we have came out of that. > > Legacy is important, due to the "no regressions" rule that we have > > in kernel development - we can't go breaking already working setups. > > What about marking Ethernet drivers which are capable of interface > mode switching? Right now there isn't one "correct" thing to do for > PHY drivers, which is bad, as people may look into any driver as > a reference for the development of future drivers. > > So why not introduce a MAC capability bit? Even dedicated for switching > between two specific modes (SGMII and 2500Base-X), to avoid any > ambiguitities or unnecessary feature-creep. They already have a perfectly good way to do this today. They can look at DT and set just one mode in the ->supported_interfaces bitmap if they don't support interface switching! The MAC drivers are already responsible for parsing the phy-mode from firmware, and it's that driver that also knows whether it knows if it supports interface switching or not. So I don't see any need for additional capability bits. > Typically switching between modes within the same PCS is more easy to > support, switching from 10:8 to 66:64 coding can be more involved, > and require resets which affect other links, so that's something > worth avoiding unless we really need it (in case the users inserts > a different SFP module it would be really needed, in case the external > link goes from 5 Gbit/s to 2.5 Gbit/s it might be worth avoiding > having to switch from 5GBase-R to 2500Base-X) > > It wouldn't be the first time something like that would be done, however > I have full understanding that any reminder of that whole > legacy_pre_march2020 episode may trigger post-traumatic stress in netdev > maintainers. Yes, and it's still continuing to cause problems today. I've just replied to someone working on the macb driver who was proposing to make use state->speed in his mac_config() method, despite modern phylink always passing SPEED_UNKNOWN for that. I guess if I didn't reply, he'd find out the hard way (which is why I made the change to phylink_mac_config() to cause testing failures if people try that.) > > > The SFP case is clear, it's using host_interfaces. But in the built-in > > > case, as of today, it always ends up in fixed interface mode with > > > rate matching. > > > > "always rate matching" no. Fixed interface mode, yes. If > > rtl822xb_config_init() sees phydev->interface is set to SGMII, then > > it uses 2500BASEX_SGMII mode without rate matching - and the > > advertisement will be limited to 1G and below which will effectively > > prevent the PHY using 2.5G mode - which is fine in that case. > > While that case might be relevant for SFP modules, in pracise, why would > anyone use a more expensive 2.5Gbit/s PHY on a board which only supports > up to 1Gbit/s -- that doesn't happen in the real world, because 1Gbit/s > SGMII PHYs are ubiquitous and much cheaper than faster ones. I'm merely stating the logic that is there today. > > > Let me summarize: > > > - We can't just assume that every MAC which supports 2500Base-X also > > > supports SGMII. It might support only 2500Base-X, or 2500Base-X and > > > 1000Base-X, or the driver might not support switching between > > > interface modes. > > > - We can't rely on the PHY being pre-configured by the bootloader to > > > either rate maching or interface-switching mode. > > > - There are no strapping options for this, the only thing which is > > > configured by strapping is the address. > > > > Right, so the only _safe_ thing to do is to assume that: > > > > 1. On existing PHY drivers which do not do any kind of interface > > switching, retrofitting interface switching of any kind is unsafe. > > It's important to note that for the RealTek driver specifically we > have done that in OpenWrt from day 1 -- simply because the rate-matching > performed, especially by early versions of the PHY, is too bad. So we > always made the driver switch interface to SGMII in case of link speeds > slower than 2500M. > Now, with the backport of upstream commits replacing our downstream > patches, users started to complain that the issue of bad PHY performance > in 1 Gbit/s mode has returned, which is the whole reason why I started > to work on this issue. > > I understand, however, there may of course be other users of those > RealTek 2.5G PHYs, even Rockchip with stmmac maybe, and that would > break if we assume the MAC can support switching between 2500Base-X > and SGMII, so users of those boards will have to live with the bad > performance of the rate-matching performed by the PHY unless someone > fixes the stmmac driver... If OpenWRT's switching predates July 2017, then maybe they should've been more pro-active at getting their patches upstream? Unfortunately, in July 2017, there was nothing in mainline supporting 2500base*, and nothing doing any interface switching until I added the Marvell 88x3310 driver which was where _I_ proposed interface switching being added to phylib. > > 2. On brand new PHY drivers which have no prior history, there can > > not be any regressions, so implementing interface switching from > > the very start is safe. > > > > The only way out of this is by inventing something new for existing > > drivers that says "you can adopt a different behaviour" and that > > must be a per-platform switch - in other words, coming from the > > platform's firmware definition in some way. > > Why would it not just be the MAC driver which indicates that it can > support switching to lower-speed interface modes it also supports? > Do you really believe there are boards which are electrically > unfit for performing SGMII on the traces intended for 2500Base-X? For a single-lane serdes, what you say is true. However, it is not universal across all interface modes. As I stated in a previous discussion, if we have e.g. four lanes of XAUI between a PHY and MAC, and both ends support XAUI, RXAUI, 10GBASE-R, 5GBASE-R, 2500BASE-X, and SGMII, it does not necessarily follow that the platform can support 10GBASE-R and 5GBASE-R over a single lane because the signalling rate is so much higher. Just because the overall speed is the same or lower does not automatically mean that it can be used. > If by 'firmware' you mean 'device tree' then we are back on square > one, and we would need several phy-connection-type aka. phy-mode > listed there. > > After having read all the threads from 2021 you have provided links for, > I believe that maybe an additional property which lists the interface > modes to be used *optionally* and in addition to the primary (ie. > fastest) mode stated in phy-mode or phy-connection-type could be a way > out of this. It would still end up being potentially a longer list of > interface modes, but reality is complex! Looking at other corners of DT > it would still be rather simple and human readable (in contrast: take a > look at inhumanly long arrays of gpio-line-names where even the order > matters, for example...) > > Yet, it would still be a partial violation of Device Tree rules because > we are (also) describing driver capabilities there then. What if, let's > say one day stmmac *will* support interface mode switching? Should we > update each and every board's device tree? Well, I guess we need people that adopt phylink to actually implement it properly rather than just slapping it into their driver in a way that "works for them". :) This is a battle that I've been trying for years with, but programmers are lazy individuals who don't want to (a) read API documentation, (b) implement things properly. Anyway, I'm out of time right now to continue replying to this conversation (it's taken over an hour so far to put what I've said together, and I now have a meeting... so reached the end of what I can do right now... so close to the end of your email too! Alas... > > Hence, unless there is a really good reason to believe that a board which > works fine with 2500Base-X would not work equally well with SGMII, given > that both the MAC (-driver) and PHY (-driver) support both modes, I don't > see a need to burden firmware with having to list additional phy-modes. > > Of course, I'm always only talking about allowing switching to slower > interface modes, and always only about modes which use a single pair > differential lanes (ie. sgmii, 1000base-x, 2500base-x, 5gbase-r, > 10gbase-r, ...). > > > > > > Afaik, practially all rate-matching PHYs which do support half-duplex > > > > > modes on the TP interface can perform duplex-matching as well. > > > > > > > > So we should remove that restriction! > > > > > > Absolutely. That will solve at least half of the problem. It still > > > leaves us with a SerDes clock running 2.5x faster than it would have to, > > > PHY and MAC consuming more energy than they would have to and TX > > > performance being slightly worse (visible with iperf3 --bidir at least > > > with some PHYs). But at least the link would come up. > > > > It also means we move forward! > > ACK. Patch posted. >
On Thu, Oct 10, 2024 at 03:58:36PM +0100, Russell King (Oracle) wrote: > On Thu, Oct 10, 2024 at 03:07:30PM +0100, Daniel Golle wrote: > > > Note that this interface switching mechanism was introduced early on > > > with the 88x3310 PHY, before any documentation on it was available, > > > and all that was known at the time is that the PHY switched between > > > different MAC facing interface modes depending on the negotiated > > > speed. It needed to be supported, and what we have came out of that. > > > Legacy is important, due to the "no regressions" rule that we have > > > in kernel development - we can't go breaking already working setups. > > > > What about marking Ethernet drivers which are capable of interface > > mode switching? Right now there isn't one "correct" thing to do for > > PHY drivers, which is bad, as people may look into any driver as > > a reference for the development of future drivers. > > > > So why not introduce a MAC capability bit? Even dedicated for switching > > between two specific modes (SGMII and 2500Base-X), to avoid any > > ambiguitities or unnecessary feature-creep. > > They already have a perfectly good way to do this today. They can look > at DT and set just one mode in the ->supported_interfaces bitmap if > they don't support interface switching! Some drivers seem to support multiple interface modes, but don't support switching between them if asked to do so by the PHY. That was the potential cause for regressions you named in case we would populate host_interfaces also for on-board PHYs (independently of how careful or selective we would do that, which is another story). > The MAC drivers are already > responsible for parsing the phy-mode from firmware, and it's that > driver that also knows whether it knows if it supports interface > switching or not. So I don't see any need for additional capability > bits. I agree that the MAC driver should know whether it can do PHY-requested mode switches, however, at this point phylink doesn't. We may want to let phylink know that the MAC (-driver) is fit for such acrobatic moves, instead of just assuming that in PHY drivers by historic precedence, no? Let me reiterate over the current situation: - The mxl-gpy.c depends on the PHY being pre-provisioned by early boot firmware to decide whether it should do interface mode switching or perform rate matching. Some (but not all) board vendors correctly provision the PHY (ie. to perform interface mode switching), others just set it to rate-matching because it's easier to support when using the Ethernet connection **within the bootloader**. After all, this is not so much of a problem because while still being a potential waste of power (clocks running 2.5x faster even if not needed to) the rate matching is at least implemented in a way which works well and doesn't hurt too much beyond wasting ~100 milliwatts. - Marvell PHYs use pin strapping to decide that. - Aquantia seems to rely on provisioned firmware. - The RealTek driver currently decides whether to use interface mode switching or rate matching based on the selected (main) interface mode as well as phydev->host_interfaces. * host_interfaces is not set unless the PHY is used on a SFP module * the driver hence ends up always using a fixed interface mode for on-board PHYs. => This is bad because we end up using the (bad) rate matching instead of interface mode switching (good) even though that would be perfectly supported by both MAC and PHY. - We currently don't have a way for phylink to know whether a MAC can perform interface mode switching - Populating host_interfaces also for on-board PHYs was rejected because it might introduce regressions due to MAC drivers not being able to perform interface mode switching and relying on the current behaviour. => We may hence need a way for phylink to know whether an Ethernet MAC can perform interface mode switching or not. > > I understand, however, there may of course be other users of those > > RealTek 2.5G PHYs, even Rockchip with stmmac maybe, and that would > > break if we assume the MAC can support switching between 2500Base-X > > and SGMII, so users of those boards will have to live with the bad > > performance of the rate-matching performed by the PHY unless someone > > fixes the stmmac driver... > > If OpenWRT's switching predates July 2017, then maybe they should've > been more pro-active at getting their patches upstream? No, and there were no consumer-grade devices with 2.5G PHYs back then. The first started to show up around 2020. The first to be supported by OpenWrt was the before mentioned Ubiquiti UniFi 6 LR AP, sporting an Aquanria AQR112 2.5G PHY connected via 2500Base-X, and always performing rate matching. The RealTek 2.5G PHY driver creeped in via their PCIe Ethernet NIC drivers, and in the beginning (2020) only supported the PHYs built-into those PCIe NIC chips. Most setup of the built-in PHY is anyway done by the NIC driver, so nobody bothered to implement config_init() until we started to see devices using the RTL8226 (later renamed to RTL8221B) as standalone PHY ICs connected to router SoCs around 2021/22. Soon people contributed OpenWrt support for those boards, initially just not understanding why the PHY would only work with 2500Base-T links but not with 1000Base-T, 100Base-TX, ... At some point Alexander Couzens and me were working on sorting out proper support for those PHYs, including interface mode switching, and Alexander implemented the config_init() and update_interface() functions. I initially tried to upstream the patch Alexander Couzens had written for that in 2023, and back then we assumed that interface mode switching would always be performed in case 2500Base-X was the main mode. However, I gave up at some point because I was asked for detailed definition of the PHY registers and bits, which, frankly speaking, just doesn't exist, not even in RealTek's under-NDA datasheet which merely lists some magic sequences of register writes. https://patchwork.kernel.org/comment/25331309/ In December 2023 Marek Behún then modified the patch and sent it upstream once again. It was part of a bigger series improving the driver, and he clearly stated that: "All this is done so that we can support another 2.5G copper SFP module" https://patchwork.kernel.org/project/netdevbpf/cover/20231220155518.15692-1-kabel@kernel.org/ This series (luckily without anyone asking for non-existent register definitions this time) ended up in mainline Linux. As OpenWrt generally tries to be as close to upstream as possible the downstream patches for the RealTek PHY drivers have then been replaced by backports of the corresponding commits in mainline Linux earlier this year. This has resulted in multiple reports of degraded performance, and it took a while until I understood that Marek's version of the config_init function only sets the PHY to perform interface mode switching if that is indicated by host_interfaces -- and built-in PHYs will *always* use a fixed interface mode now. We could of course just introduce a downstream patch changing that again, e.g. by letting the driver assume that SGMII is also supported in case 2500Base-X is supported -- but that may not be true for all the Ethernet MACs out there. Some might not support SGMII at all, or not be ready to perform interface mode switching. Hence, and we may need a way for phylink or the phy driver to destinguish whether interface mode switching is supported by the Ethernet MAC or not. > > Unfortunately, in July 2017, there was nothing in mainline supporting > 2500base*, and nothing doing any interface switching until I added > the Marvell 88x3310 driver which was where _I_ proposed interface > switching being added to phylib. Yes, and that has worked well. We should have a way to make use of it also on boards with on-board PHYs which are at the same time also used on boards which rely on Ethernet drivers which do not support SGMII, or don't support switching the interface mode. > > > > 2. On brand new PHY drivers which have no prior history, there can > > > not be any regressions, so implementing interface switching from > > > the very start is safe. > > > > > > The only way out of this is by inventing something new for existing > > > drivers that says "you can adopt a different behaviour" and that > > > must be a per-platform switch - in other words, coming from the > > > platform's firmware definition in some way. > > > > Why would it not just be the MAC driver which indicates that it can > > support switching to lower-speed interface modes it also supports? > > Do you really believe there are boards which are electrically > > unfit for performing SGMII on the traces intended for 2500Base-X? > > For a single-lane serdes, what you say is true. However, it is not > universal across all interface modes. Of course not, but for single-lane serdes it could (and, if possible, should!) be done. > > As I stated in a previous discussion, if we have e.g. four lanes of > XAUI between a PHY and MAC, and both ends support XAUI, RXAUI, > 10GBASE-R, 5GBASE-R, 2500BASE-X, and SGMII, it does not necessarily > follow that the platform can support 10GBASE-R and 5GBASE-R over > a single lane because the signalling rate is so much higher. Just > because the overall speed is the same or lower does not automatically > mean that it can be used. > > > If by 'firmware' you mean 'device tree' then we are back on square > > one, and we would need several phy-connection-type aka. phy-mode > > listed there. > > > > After having read all the threads from 2021 you have provided links for, > > I believe that maybe an additional property which lists the interface > > modes to be used *optionally* and in addition to the primary (ie. > > fastest) mode stated in phy-mode or phy-connection-type could be a way > > out of this. It would still end up being potentially a longer list of > > interface modes, but reality is complex! Looking at other corners of DT > > it would still be rather simple and human readable (in contrast: take a > > look at inhumanly long arrays of gpio-line-names where even the order > > matters, for example...) > > > > Yet, it would still be a partial violation of Device Tree rules because > > we are (also) describing driver capabilities there then. What if, let's > > say one day stmmac *will* support interface mode switching? Should we > > update each and every board's device tree? > > Well, I guess we need people that adopt phylink to actually implement > it properly rather than just slapping it into their driver in a way > that "works for them". :) +1 > > This is a battle that I've been trying for years with, but programmers > are lazy individuals who don't want to (a) read API documentation, (b) > implement things properly. > > Anyway, I'm out of time right now to continue replying to this > conversation (it's taken over an hour so far to put what I've said > together, and I now have a meeting... so reached the end of what I can > do right now... so close to the end of your email too! Alas... Thank you for putting all this time into it. I think the problem itself is simple and well understood by both of us. It was already well understood in 2021. There are many possible solutions to chose from: 1) List multiple phy-connection-type (aka. phy-mode) in DT, populate host_interfaces accordingly. 2) Let the MAC indicate that it supports mode switching, so phylink would populate host_interfaces accordingly. 3) Assume that MAC supports mode switching to slower, also supported, single-lane SerDes modes (and risk breakage because of bad drivers), and populate host_interfaces accordingly. 4) Change the PHY driver to assume SGMII and interface mode switching is always supported in case 2500Base-X is the mode declared in DT. 5) Live with bad performance of RealTek's rate matching (not acceptable for OpenWrt) 6) Throw all devices with cheapo PHYs or bad boot firmware in the bin (bad for the environment, and consumer wallets) 7) Downstream hacks (current situation in OpenWrt and vendor firmware) Did I forget any? 3) and 4) may obviously cause regressions. So, from what I can see, only 1) and 2) would allow us to do anything else than 7) from OpenWrt's perspective. I totally understand if Linux netdev maintainers chose 5) or 6), however, for OpenWrt the result in that case would be to move the existing downstream patches from "pending" to "hack", which means that we will not attempt to send them upstream.
> I initially tried to upstream the patch Alexander Couzens had written > for that in 2023, and back then we assumed that interface mode switching > would always be performed in case 2500Base-X was the main mode. > However, I gave up at some point because I was asked for detailed > definition of the PHY registers and bits, which, frankly speaking, just > doesn't exist, not even in RealTek's under-NDA datasheet which merely > lists some magic sequences of register writes. > > https://patchwork.kernel.org/comment/25331309/ It is fine to document that these are magic values written into magic registers and nobody knows what they are doing, if there is no documentation. I just like to push people to actually check if there is documentation, otherwise there will be more and more examples of this magic. I've been keeping an eye out for 'magic' registers which are actually part of 802.3, and pushing back that these are least should be fully documented. And it does happen, particularly in vendor submitted code :-( But you said the patch got merged eventually, so it seems like it worked out in the end. Andrew
diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index 4309317de3d1..5d043c47a727 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -2111,6 +2111,13 @@ int phylink_fwnode_phy_connect(struct phylink *pl, pl->link_config.interface = pl->link_interface; } + /* Assume SerDes interface modes share the same lanes and allow + * the PHY to switch between the + */ + if (test_bit(pl->link_interface, phylink_sfp_interfaces)) + phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces, + pl->config->supported_interfaces); + if (pl->config->mac_requires_rxc) flags |= PHY_F_RXC_ALWAYS_ON;
Use bitmask of interfaces supported by the MAC for the PHY to choose from if the declared interface mode is among those using a single pair of SerDes lanes. This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which results in half-duplex being supported in case the MAC supports that. Without this change, 2500Base-T PHYs will always operate in 2500Base-X mode with rate-matching, which is not only wasteful in terms of energy consumption, but also limits the supported interface modes to full-duplex only. Signed-off-by: Daniel Golle <daniel@makrotopia.org> --- drivers/net/phy/phylink.c | 7 +++++++ 1 file changed, 7 insertions(+)