diff mbox series

[net-next] net: phy: populate host_interfaces when attaching PHY

Message ID ae53177a7b68964b2a988934a09f74a4931b862d.1728438951.git.daniel@makrotopia.org (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Headers show
Series [net-next] net: phy: populate host_interfaces when attaching PHY | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6 this patch: 6
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 6 this patch: 6
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5 this patch: 5
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 19 this patch: 19
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-10-10--09-00 (tests: 775)

Commit Message

Daniel Golle Oct. 9, 2024, 1:57 a.m. UTC
Use bitmask of interfaces supported by the MAC for the PHY to choose
from if the declared interface mode is among those using a single pair
of SerDes lanes.
This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which
results in half-duplex being supported in case the MAC supports that.
Without this change, 2500Base-T PHYs will always operate in 2500Base-X
mode with rate-matching, which is not only wasteful in terms of energy
consumption, but also limits the supported interface modes to
full-duplex only.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
 drivers/net/phy/phylink.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Russell King (Oracle) Oct. 9, 2024, 9:01 a.m. UTC | #1
On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote:
> Use bitmask of interfaces supported by the MAC for the PHY to choose
> from if the declared interface mode is among those using a single pair
> of SerDes lanes.
> This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which
> results in half-duplex being supported in case the MAC supports that.
> Without this change, 2500Base-T PHYs will always operate in 2500Base-X
> mode with rate-matching, which is not only wasteful in terms of energy
> consumption, but also limits the supported interface modes to
> full-duplex only.

We've had a similar patch before, and it's been NAK'd. The problem is
that supplying the host_interfaces for built-in PHYs means that the
hardware strapping for the PHY interface mode becomes useless, as does
the DT property specifying it - and thus we may end up selecting a
mode that both the MAC and PHY support, but the hardware design
doesn't (e.g. signals aren't connected, signal speed to fast.)

For example, take a board designed to use RXAUI and the host supports
10GBASE-R. The first problem is, RXAUI is not listed in the SFP
interface list because it's not usable over a SFP cage. So, the
host_interfaces excludes that, and thus the PHY thinks that's not
supported. It looks at the mask and sees only 10GBASE-R, and
decides to use that instead with rate matching. The MAC doesn't have
support for flow control, and thus can't use rate matching.

Not only have the electrical charateristics been violated by selecting
a faster interface than the hardware was designed for, but we now have
rate matching being used when it shouldn't be.
Daniel Golle Oct. 9, 2024, 11:52 a.m. UTC | #2
Hi Russell,

On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote:
> On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote:
> > Use bitmask of interfaces supported by the MAC for the PHY to choose
> > from if the declared interface mode is among those using a single pair
> > of SerDes lanes.
> > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which
> > results in half-duplex being supported in case the MAC supports that.
> > Without this change, 2500Base-T PHYs will always operate in 2500Base-X
> > mode with rate-matching, which is not only wasteful in terms of energy
> > consumption, but also limits the supported interface modes to
> > full-duplex only.
> 
> We've had a similar patch before, and it's been NAK'd. The problem is
> that supplying the host_interfaces for built-in PHYs means that the
> hardware strapping for the PHY interface mode becomes useless, as does
> the DT property specifying it - and thus we may end up selecting a
> mode that both the MAC and PHY support, but the hardware design
> doesn't (e.g. signals aren't connected, signal speed to fast.)
> 
> For example, take a board designed to use RXAUI and the host supports
> 10GBASE-R. The first problem is, RXAUI is not listed in the SFP
> interface list because it's not usable over a SFP cage.

I thought about that, also boards configured for RGMII but both MAC
and PHY supporting SGMII or even 2500Base-X would be such a case.
In order to make sure we don't switch to link modes not supported
by the design I check if the interface mode configured in DT is
among those suitable for use with an SFP (ie. using a single pair
of SerDes lanes):
if (test_bit(pl->link_interface, phylink_sfp_interfaces))
	phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces,
			  pl->config->supported_interfaces);

Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so
cases in which those modes are configured in DT are already excluded.

> So, the
> host_interfaces excludes that, and thus the PHY thinks that's not
> supported. It looks at the mask and sees only 10GBASE-R, and
> decides to use that instead with rate matching. The MAC doesn't have
> support for flow control, and thus can't use rate matching.
> 
> Not only have the electrical charateristics been violated by selecting
> a faster interface than the hardware was designed for, but we now have
> rate matching being used when it shouldn't be.

As we are also using using rate matching right now in cases when it
should not (and thereby inhibiting support for half-duplex modes), I
suppose the only good solution would be to allow a set of interface
modes in DT instead of only a single one.

Or, as that is the only really relevant case, we can be more strict
on the condition and additional modes to be added, ie. check if both
PHY and MAC support both 2500Base-X and SGMII, and only add SGMII
in case 2500Base-X is selected in DT.

I have never seen designs on which SGMII and 2500Base-X would both
be supported by the SoC but use a different set of pins. Also, as
2500Base-X is 2.5x as fast as SGMII, it's safe to assume that a
board which has been designed for 2500Base-X would also be fine
using SGMII.

Let me know of either of the above would be acceptable.
Russell King (Oracle) Oct. 9, 2024, 5:34 p.m. UTC | #3
On Wed, Oct 09, 2024 at 12:52:13PM +0100, Daniel Golle wrote:
> Hi Russell,
> 
> On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote:
> > On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote:
> > > Use bitmask of interfaces supported by the MAC for the PHY to choose
> > > from if the declared interface mode is among those using a single pair
> > > of SerDes lanes.
> > > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which
> > > results in half-duplex being supported in case the MAC supports that.
> > > Without this change, 2500Base-T PHYs will always operate in 2500Base-X
> > > mode with rate-matching, which is not only wasteful in terms of energy
> > > consumption, but also limits the supported interface modes to
> > > full-duplex only.
> > 
> > We've had a similar patch before, and it's been NAK'd. The problem is
> > that supplying the host_interfaces for built-in PHYs means that the
> > hardware strapping for the PHY interface mode becomes useless, as does
> > the DT property specifying it - and thus we may end up selecting a
> > mode that both the MAC and PHY support, but the hardware design
> > doesn't (e.g. signals aren't connected, signal speed to fast.)
> > 
> > For example, take a board designed to use RXAUI and the host supports
> > 10GBASE-R. The first problem is, RXAUI is not listed in the SFP
> > interface list because it's not usable over a SFP cage.
> 
> I thought about that, also boards configured for RGMII but both MAC
> and PHY supporting SGMII or even 2500Base-X would be such a case.
> In order to make sure we don't switch to link modes not supported
> by the design I check if the interface mode configured in DT is
> among those suitable for use with an SFP (ie. using a single pair
> of SerDes lanes):
> if (test_bit(pl->link_interface, phylink_sfp_interfaces))
> 	phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces,
> 			  pl->config->supported_interfaces);
> 
> Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so
> cases in which those modes are configured in DT are already excluded.

This still won't work. There are drivers (boo, hiss, stmmac crap which
is resistant to cleanup and fixing but there's others too) that don't
do the phylink interface switching.

For example, stmmac sets the mode specified in DT and also if there
is a Synopsys XPCS, then the supported interfaces also gets USXGMII,
10GKR, XLGMII, 10GBASER, SGMII, 1000BASEX and 2500BASEX. If DT says
10GBASER, then the PHY must not switch to USXGMII, but if an
88x3310 were to be thrown in, the PHY driver _would_ decide to use
USXGMII against the DT configuration.

phydev->host_interfaces is there to allow PHYs on SFPs be properly
configured according to the host interface, where there is no DT
description for the module. It is not meant for built-in PHYs.

> > So, the
> > host_interfaces excludes that, and thus the PHY thinks that's not
> > supported. It looks at the mask and sees only 10GBASE-R, and
> > decides to use that instead with rate matching. The MAC doesn't have
> > support for flow control, and thus can't use rate matching.
> > 
> > Not only have the electrical charateristics been violated by selecting
> > a faster interface than the hardware was designed for, but we now have
> > rate matching being used when it shouldn't be.
> 
> As we are also using using rate matching right now in cases when it
> should not (and thereby inhibiting support for half-duplex modes), I
> suppose the only good solution would be to allow a set of interface
> modes in DT instead of only a single one.

Two issues... why was the PHY configured via firmware to use rate
matching if that brings with it this restriction, and it's possible
not to?

Second, aqr107_get_rate_matching() is rather basic based on what people
want. It doesn't actually ask the PHY what its going to do. I know
there's a bunch of VEND1_GLOBAL_CFG registers that give the serdes
mode and rate adaption for each speed, and these can be set not only
by firmware configuration, but changed by the host.

So, aqr107_get_rate_matching() should work out whether rate matching
will be used for the interface mode by scanning these registers.

Something like:

	u16 cfg_regs[] = {
		VEND1_GLOBAL_CFG_10M,
		VEND1_GLOBAL_CFG_100M,
		VEND1_GLOBAL_CFG_1G,
		VEND1_GLOBAL_CFG_2_5G,
		VEND1_GLOBAL_CFG_5G,
		VEND1_GLOBAL_CFG_10G
	};
	int i, val;
	u8 mode;

	switch (interface) {
	case PHY_INTERFACE_MODE_10GBASER:
		mode = VEND1_GLOBAL_CFG_SERDES_MODE_XFI;
		break;
	
	case PHY_INTERFACE_MODE_2500BASEX:
		mode = VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII;
		break;

	case PHY_INTERFACE_MODE_5GBASER:
		mode = VEND1_GLOBAL_CFG_SERDES_MODE_XFI5G;
		break;

	default:
		return RATE_MATCH_NONE;
	}

	/* If any speed corresponds to the interface mode and uses pause rate
	 * matching, indicate that this interface mode uses pause rate
	 * matching.
	 */
	for (i = 0; i < ARRAY_SIZE(cfg_regs); i++) {
		val = phy_read_mmd(phydev, MDIO_MMD_VEND1, cfg_regs[i]);
		if (val < 0)
			return val;

		if (FIELD_GET(VEND1_GLOBAL_CFG_SERDES_MODE, val) == mode) {
			if (FIELD_GET(VEND1_GLOBAL_CFG_RATE_ADAPT, val) ==
			    VEND1_GLOBAL_CFG_RATE_ADAPT_PAUSE)
				return RATE_MATCH_PAUSE;
		}
	}

	return RATE_MATCH_NONE;

Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only,
I'm not sure that is correct. I didn't contribute this support, and I
don't have any platforms that support this, and I don't have any
experience of it.

What I do have is the data sheet for 88x3310, and that doesn't mention
any restriction such as "only full duplex is supported in rate matching
mode".

It is true that to use pause frames, the MAC/PCS must be in full-duplex
mode, but if the PHY supports half-duplex on the media to full-duplex
on the MAC side link, then why should phylink restrict this to be
full-duplex only?

I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is
not correct - or maybe not versatile enough.
Daniel Golle Oct. 9, 2024, 6:52 p.m. UTC | #4
On Wed, Oct 09, 2024 at 06:34:07PM +0100, Russell King (Oracle) wrote:
> On Wed, Oct 09, 2024 at 12:52:13PM +0100, Daniel Golle wrote:
> > Hi Russell,
> > 
> > On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote:
> > > On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote:
> > > > Use bitmask of interfaces supported by the MAC for the PHY to choose
> > > > from if the declared interface mode is among those using a single pair
> > > > of SerDes lanes.
> > > > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which
> > > > results in half-duplex being supported in case the MAC supports that.
> > > > Without this change, 2500Base-T PHYs will always operate in 2500Base-X
> > > > mode with rate-matching, which is not only wasteful in terms of energy
> > > > consumption, but also limits the supported interface modes to
> > > > full-duplex only.
> > > 
> > > We've had a similar patch before, and it's been NAK'd. The problem is
> > > that supplying the host_interfaces for built-in PHYs means that the
> > > hardware strapping for the PHY interface mode becomes useless, as does
> > > the DT property specifying it - and thus we may end up selecting a
> > > mode that both the MAC and PHY support, but the hardware design
> > > doesn't (e.g. signals aren't connected, signal speed to fast.)
> > > 
> > > For example, take a board designed to use RXAUI and the host supports
> > > 10GBASE-R. The first problem is, RXAUI is not listed in the SFP
> > > interface list because it's not usable over a SFP cage.
> > 
> > I thought about that, also boards configured for RGMII but both MAC
> > and PHY supporting SGMII or even 2500Base-X would be such a case.
> > In order to make sure we don't switch to link modes not supported
> > by the design I check if the interface mode configured in DT is
> > among those suitable for use with an SFP (ie. using a single pair
> > of SerDes lanes):
> > if (test_bit(pl->link_interface, phylink_sfp_interfaces))
> > 	phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces,
> > 			  pl->config->supported_interfaces);
> > 
> > Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so
> > cases in which those modes are configured in DT are already excluded.
> 
> This still won't work. There are drivers (boo, hiss, stmmac crap which
> is resistant to cleanup and fixing but there's others too) that don't
> do the phylink interface switching.

Ok, what a mess. So multiple interfaces modes will have to be declared
in DT for those boards which actually support that. What you be open
to something like that:

phy-mode = "2500base-x", "sgmii";

> 
> For example, stmmac sets the mode specified in DT and also if there
> is a Synopsys XPCS, then the supported interfaces also gets USXGMII,
> 10GKR, XLGMII, 10GBASER, SGMII, 1000BASEX and 2500BASEX. If DT says
> 10GBASER, then the PHY must not switch to USXGMII, but if an
> 88x3310 were to be thrown in, the PHY driver _would_ decide to use
> USXGMII against the DT configuration.
> 
> phydev->host_interfaces is there to allow PHYs on SFPs be properly
> configured according to the host interface, where there is no DT
> description for the module. It is not meant for built-in PHYs.

Imho allowing several available interface modes can still be
advantageous also for built-in PHYs. Measurable power savings (~100mW on
MT7986!) and not needing rate matching are both desireable things.

The question is just how would we get there...

> > As we are also using using rate matching right now in cases when it
> > should not (and thereby inhibiting support for half-duplex modes), I
> > suppose the only good solution would be to allow a set of interface
> > modes in DT instead of only a single one.
> 
> Two issues... why was the PHY configured via firmware to use rate
> matching if that brings with it this restriction, and it's possible
> not to?

Looking at drivers/net/phy/realtek.c you see in rtl822xb_config_init()
...
        has_2500 = test_bit(PHY_INTERFACE_MODE_2500BASEX,
                            phydev->host_interfaces) ||
                   phydev->interface == PHY_INTERFACE_MODE_2500BASEX;

        has_sgmii = test_bit(PHY_INTERFACE_MODE_SGMII,
                             phydev->host_interfaces) ||
                    phydev->interface == PHY_INTERFACE_MODE_SGMII;

        /* fill in possible interfaces */
        __assign_bit(PHY_INTERFACE_MODE_2500BASEX, phydev->possible_interfaces,
                     has_2500);
        __assign_bit(PHY_INTERFACE_MODE_SGMII, phydev->possible_interfaces,
                     has_sgmii);

        if (!has_2500 && !has_sgmii)
                return 0;

        /* determine SerDes option mode */
        if (has_2500 && !has_sgmii) {
                mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX;
                phydev->rate_matching = RATE_MATCH_PAUSE;
        } else {
                mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX_SGMII;
                phydev->rate_matching = RATE_MATCH_NONE;
        }
...

So rate-matching is NOT configured in firmware, but rather it is
selected in cases where we (seemingly) don't have any other option. It
may not be a good choice (imho it never is), but rather just a last
resort in case we otherwise can't support lower speeds at all.

> 
> Second, aqr107_get_rate_matching() is rather basic based on what people
> want. It doesn't actually ask the PHY what its going to do. I know
> there's a bunch of VEND1_GLOBAL_CFG registers that give the serdes
> mode and rate adaption for each speed, and these can be set not only
> by firmware configuration, but changed by the host.
> 
> So, aqr107_get_rate_matching() should work out whether rate matching
> will be used for the interface mode by scanning these registers.
> [...]

Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode
and perform rate-matching for lower speeds.

> 
> 
> Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only,
> I'm not sure that is correct. I didn't contribute this support, and I
> don't have any platforms that support this, and I don't have any
> experience of it.

Afaik enforcing half-duplex via rate-maching with pause-frames is
supported by all the 2.5G PHYs I've seen up to now.

> 
> What I do have is the data sheet for 88x3310, and that doesn't mention
> any restriction such as "only full duplex is supported in rate matching
> mode".

Yep, and I suppose it should work just fine. The same applies for
RealTek and MaxLinear PHYs. I've tested it.

> 
> It is true that to use pause frames, the MAC/PCS must be in full-duplex
> mode, but if the PHY supports half-duplex on the media to full-duplex
> on the MAC side link, then why should phylink restrict this to be
> full-duplex only?

There is no reason for that imho. phylink.c states
/* Although a duplex-matching phy might exist, we
 * conservatively remove these modes because the MAC
 * will not be aware of the half-duplex nature of the
 * link.
 */

Afaik, practially all rate-matching PHYs which do support half-duplex
modes on the TP interface can perform duplex-matching as well.

> 
> I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is
> not correct - or maybe not versatile enough.

I agree. Never the less, why use rate matching at all if we don't have
to? It's obviously inefficient and wasteful, having the MAC follow the
PHY speed is preferrable in every case imho.
Russell King (Oracle) Oct. 9, 2024, 8:12 p.m. UTC | #5
On Wed, Oct 09, 2024 at 07:52:42PM +0100, Daniel Golle wrote:
> On Wed, Oct 09, 2024 at 06:34:07PM +0100, Russell King (Oracle) wrote:
> > On Wed, Oct 09, 2024 at 12:52:13PM +0100, Daniel Golle wrote:
> > > Hi Russell,
> > > 
> > > On Wed, Oct 09, 2024 at 10:01:09AM +0100, Russell King (Oracle) wrote:
> > > > On Wed, Oct 09, 2024 at 02:57:03AM +0100, Daniel Golle wrote:
> > > > > Use bitmask of interfaces supported by the MAC for the PHY to choose
> > > > > from if the declared interface mode is among those using a single pair
> > > > > of SerDes lanes.
> > > > > This will allow 2500Base-T PHYs to switch to SGMII on most hosts, which
> > > > > results in half-duplex being supported in case the MAC supports that.
> > > > > Without this change, 2500Base-T PHYs will always operate in 2500Base-X
> > > > > mode with rate-matching, which is not only wasteful in terms of energy
> > > > > consumption, but also limits the supported interface modes to
> > > > > full-duplex only.
> > > > 
> > > > We've had a similar patch before, and it's been NAK'd. The problem is
> > > > that supplying the host_interfaces for built-in PHYs means that the
> > > > hardware strapping for the PHY interface mode becomes useless, as does
> > > > the DT property specifying it - and thus we may end up selecting a
> > > > mode that both the MAC and PHY support, but the hardware design
> > > > doesn't (e.g. signals aren't connected, signal speed to fast.)
> > > > 
> > > > For example, take a board designed to use RXAUI and the host supports
> > > > 10GBASE-R. The first problem is, RXAUI is not listed in the SFP
> > > > interface list because it's not usable over a SFP cage.
> > > 
> > > I thought about that, also boards configured for RGMII but both MAC
> > > and PHY supporting SGMII or even 2500Base-X would be such a case.
> > > In order to make sure we don't switch to link modes not supported
> > > by the design I check if the interface mode configured in DT is
> > > among those suitable for use with an SFP (ie. using a single pair
> > > of SerDes lanes):
> > > if (test_bit(pl->link_interface, phylink_sfp_interfaces))
> > > 	phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces,
> > > 			  pl->config->supported_interfaces);
> > > 
> > > Neither RXAUI nor RGMII modes are among phylink_sfp_interfaces, so
> > > cases in which those modes are configured in DT are already excluded.
> > 
> > This still won't work. There are drivers (boo, hiss, stmmac crap which
> > is resistant to cleanup and fixing but there's others too) that don't
> > do the phylink interface switching.
> 
> Ok, what a mess. So multiple interfaces modes will have to be declared
> in DT for those boards which actually support that. What you be open
> to something like that:
> 
> phy-mode = "2500base-x", "sgmii";

I think we've tried going down that route before and it wasn't well
received.

> > For example, stmmac sets the mode specified in DT and also if there
> > is a Synopsys XPCS, then the supported interfaces also gets USXGMII,
> > 10GKR, XLGMII, 10GBASER, SGMII, 1000BASEX and 2500BASEX. If DT says
> > 10GBASER, then the PHY must not switch to USXGMII, but if an
> > 88x3310 were to be thrown in, the PHY driver _would_ decide to use
> > USXGMII against the DT configuration.
> > 
> > phydev->host_interfaces is there to allow PHYs on SFPs be properly
> > configured according to the host interface, where there is no DT
> > description for the module. It is not meant for built-in PHYs.
> 
> Imho allowing several available interface modes can still be
> advantageous also for built-in PHYs. Measurable power savings (~100mW on
> MT7986!) and not needing rate matching are both desireable things.
> 
> The question is just how would we get there...

We already do. Let me take an example that I use. Macchiatobin, which
you've probably heard lots about.

The PHY there is the 88x3310, which is configured to run in MAC mode 0
by hardware strapping. MAC mode 0 means the 88x3310 will switch between
10GBASE-R, 5GBASE-R, 2500BASE-X and SGMII on its host interface.

The MAC is Marvell's PP2, which supports all of those, and being one of
the original MACs that phylink was developed against, is coded properly
such that it fully works with phylink dynamically changing the interface
mode.

The interface mode given in DT is just a "guide" because the 88x3310
does no more than verify that the interface mode that it is bound with
is one it supports. However, every time the link comes up, providing
it is not operating in rate matching mode (which the PP2 doesn't
support) it will change its MAC facing interface appropriately.

Another board uses the 88x3310 with XAUI, and if I remember correctly,
the PHY is strapped for XAUI with rate matching mode. It's connected
to an 88e6390x DSA switch port 9, which supports RXAUI, XAUI, SGMII,
1000BASEX, 2500BASEX and maybe other stuff too.

So, what we do is in DT, we specify the maximum mode, and rely on the
hardware being correctly strapped on the PHY to configure how the
MAC side interface will be used.

Now, the thing with that second board is... if we use your original
suggestion, then we end up filling the host_interfaces with just
2500BASEX, 1000BASEX and SGMII. That will lead mv3310_select_mactype()
to select MV_V2_33X0_PORT_CTRL_MACTYPE_10GBASER for which the PHY will
attempt to use the modes I listed above for Macchiatobin on its MAC
interface which is wrong.

Let me repeat the point. phydev->host_interfaces is there to allow a
PHY driver to go off and make its own decisions about the interface
mode group that it should use _ignoring_ what's being asked of it
when the MAC binds to it. It should be empty for built-in setups
where this should not be used, and we have precedent on Macchiatobin
that interface switching by the PHY is permitted even in that
situation.


> > > As we are also using using rate matching right now in cases when it
> > > should not (and thereby inhibiting support for half-duplex modes), I
> > > suppose the only good solution would be to allow a set of interface
> > > modes in DT instead of only a single one.
> > 
> > Two issues... why was the PHY configured via firmware to use rate
> > matching if that brings with it this restriction, and it's possible
> > not to?
> 
> Looking at drivers/net/phy/realtek.c you see in rtl822xb_config_init()
> ...
>         has_2500 = test_bit(PHY_INTERFACE_MODE_2500BASEX,
>                             phydev->host_interfaces) ||
>                    phydev->interface == PHY_INTERFACE_MODE_2500BASEX;
> 
>         has_sgmii = test_bit(PHY_INTERFACE_MODE_SGMII,
>                              phydev->host_interfaces) ||
>                     phydev->interface == PHY_INTERFACE_MODE_SGMII;
> 
>         /* fill in possible interfaces */
>         __assign_bit(PHY_INTERFACE_MODE_2500BASEX, phydev->possible_interfaces,
>                      has_2500);
>         __assign_bit(PHY_INTERFACE_MODE_SGMII, phydev->possible_interfaces,
>                      has_sgmii);
> 
>         if (!has_2500 && !has_sgmii)
>                 return 0;
> 
>         /* determine SerDes option mode */
>         if (has_2500 && !has_sgmii) {
>                 mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX;
>                 phydev->rate_matching = RATE_MATCH_PAUSE;
>         } else {
>                 mode = RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX_SGMII;
>                 phydev->rate_matching = RATE_MATCH_NONE;
>         }
> ...
> 
> So rate-matching is NOT configured in firmware, but rather it is
> selected in cases where we (seemingly) don't have any other option. It
> may not be a good choice (imho it never is), but rather just a last
> resort in case we otherwise can't support lower speeds at all.
> 
> > 
> > Second, aqr107_get_rate_matching() is rather basic based on what people
> > want. It doesn't actually ask the PHY what its going to do. I know
> > there's a bunch of VEND1_GLOBAL_CFG registers that give the serdes
> > mode and rate adaption for each speed, and these can be set not only
> > by firmware configuration, but changed by the host.
> > 
> > So, aqr107_get_rate_matching() should work out whether rate matching
> > will be used for the interface mode by scanning these registers.
> > [...]
> 
> Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode
> and perform rate-matching for lower speeds.

They don't _have_ to. To use a SFP with an AQR PHY on it in a clearfog
platform (which only supports SGMII, 1000BASE-X and 2500BASE-X, I
hacked the aquantia driver to do this:

        phy_set_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, MDIO_CTRL1_LPOWER);
        mdelay(10);
        phy_write_mmd(phydev, MDIO_MMD_VEND1, 0x31a, 2);
        phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_10M,
                      VEND1_GLOBAL_CFG_SGMII_AN |
                      VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
        phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_100M,
                      VEND1_GLOBAL_CFG_SGMII_AN |
                      VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
        phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_1G,
                      VEND1_GLOBAL_CFG_SGMII_AN |
                      VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
        phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_2_5G,
                      VEND1_GLOBAL_CFG_SGMII_AN |
                      VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII);
        phy_clear_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1,
                           MDIO_CTRL1_LPOWER);

which disables rate matching and causes it to switch interfaces.

Moreover, aqr107_read_status() reads the interface status and sets
the MAC interface accordingly, so the driver does support dynamically
changing the MAC interface just like 88x3310. If all use cases of
this PHY only ever did rate matching, then there would be no point
to that code being there!

> > Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only,
> > I'm not sure that is correct. I didn't contribute this support, and I
> > don't have any platforms that support this, and I don't have any
> > experience of it.
> 
> Afaik enforcing half-duplex via rate-maching with pause-frames is
> supported by all the 2.5G PHYs I've seen up to now.

I'm sorry, I don't understand your sentence, because it just flies
wildly against everything that's been said.

First, pause-frames require full duplex on the link on which pause
frames are to be sent and received. That's fundamental.

Second, I'm not sure what "enforcing half-duplex" has to do with
rate-matching with pause-frames.

Third, the 88x3310 without MACSEC in rate matching mode requires the
MAC to pace itself. No support for pause frames at all - you only
get pause frames with the 88x3310P which has MACSEC. This is a 10M
to 10G PHY multi-rate PHY.

So... your comment makes no sense to me, sorry.

> > What I do have is the data sheet for 88x3310, and that doesn't mention
> > any restriction such as "only full duplex is supported in rate matching
> > mode".
> 
> Yep, and I suppose it should work just fine. The same applies for
> RealTek and MaxLinear PHYs. I've tested it.
> 
> > It is true that to use pause frames, the MAC/PCS must be in full-duplex
> > mode, but if the PHY supports half-duplex on the media to full-duplex
> > on the MAC side link, then why should phylink restrict this to be
> > full-duplex only?
> 
> There is no reason for that imho. phylink.c states
> /* Although a duplex-matching phy might exist, we
>  * conservatively remove these modes because the MAC
>  * will not be aware of the half-duplex nature of the
>  * link.
>  */
> 
> Afaik, practially all rate-matching PHYs which do support half-duplex
> modes on the TP interface can perform duplex-matching as well.

So we should remove that restriction!

> > I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is
> > not correct - or maybe not versatile enough.
> 
> I agree. Never the less, why use rate matching at all if we don't have
> to? It's obviously inefficient and wasteful, having the MAC follow the
> PHY speed is preferrable in every case imho.

As I say, I believe this is a matter of how the Aquantia firmware is
commissioned - I believe the defaults for all of the VEND1_GLOBAL_CFG*
registers come from the firmware that the PHY loads.

So much like the pin strapping of 88x3310 determines its set of MAC
modes, which are decided by the board designer, Aquantia firmware
determines the PHY behaviour.

The problem here is we need to be mindful of the existing
implementations, not only those where the MAC/PHY combination would
get stuff wrong, but also of those MACs where multiple different
interfaces are supported through hardware strapping which is not
software determinable, but are not software configurable (like DSA
switches.) Then there's those who hacked phylink into their use
without properly implementing it (e.g. where mac_config() is entirely
empty, and thus support no reconfiguration of the interface yet
support multiple different interfaces at driver initialisation time.)

Yes, some of these situations can be said to be buggy. Feel free to
spend a few years hacking on stmmac to try and fix that god almighty
mess. I've given up with stmmac now after trying several attempts at
cleaning the mess up, and ending up in a very non-productive place
with it.
Daniel Golle Oct. 9, 2024, 9:31 p.m. UTC | #6
On Wed, Oct 09, 2024 at 09:12:06PM +0100, Russell King (Oracle) wrote:
> On Wed, Oct 09, 2024 at 07:52:42PM +0100, Daniel Golle wrote:
> > [...]
> > Imho allowing several available interface modes can still be
> > advantageous also for built-in PHYs. Measurable power savings (~100mW on
> > MT7986!) and not needing rate matching are both desireable things.
> > 
> > The question is just how would we get there...
> 
> We already do. Let me take an example that I use. Macchiatobin, which
> you've probably heard lots about.
> 
> The PHY there is the 88x3310, which is configured to run in MAC mode 0
> by hardware strapping. MAC mode 0 means the 88x3310 will switch between
> 10GBASE-R, 5GBASE-R, 2500BASE-X and SGMII on its host interface.
> 
> The MAC is Marvell's PP2, which supports all of those, and being one of
> the original MACs that phylink was developed against, is coded properly
> such that it fully works with phylink dynamically changing the interface
> mode.
> 
> The interface mode given in DT is just a "guide" because the 88x3310
> does no more than verify that the interface mode that it is bound with
> is one it supports. However, every time the link comes up, providing
> it is not operating in rate matching mode (which the PP2 doesn't
> support) it will change its MAC facing interface appropriately.

Unfortunately, and apparently different from the 88x3310, there is no
hardware strapping for this to be decided with RealTek's PHYs, but
using rate-matching or interface-mode-switching is decided by the
driver.

> 
> Another board uses the 88x3310 with XAUI, and if I remember correctly,
> the PHY is strapped for XAUI with rate matching mode. It's connected
> to an 88e6390x DSA switch port 9, which supports RXAUI, XAUI, SGMII,
> 1000BASEX, 2500BASEX and maybe other stuff too.
> 
> So, what we do is in DT, we specify the maximum mode, and rely on the
> hardware being correctly strapped on the PHY to configure how the
> MAC side interface will be used.

Does that mean the realtek.c PHY driver (and maybe others as well) should
just assume that the MAC must also support SGMII mode in case 2500Base-X
is supported?

I was under the impression that there might be MAC out there which support
only 2500Base-X, or maybe 2500Base-X and 1000Base-X, but not SGMII. And on
those we would always need to use 2500Base-X with rate matching.
But maybe I'm wrong.

> 
> Now, the thing with that second board is... if we use your original
> suggestion, then we end up filling the host_interfaces with just
> 2500BASEX, 1000BASEX and SGMII. That will lead mv3310_select_mactype()
> to select MV_V2_33X0_PORT_CTRL_MACTYPE_10GBASER for which the PHY will
> attempt to use the modes I listed above for Macchiatobin on its MAC
> interface which is wrong.
> 
> Let me repeat the point. phydev->host_interfaces is there to allow a
> PHY driver to go off and make its own decisions about the interface
> mode group that it should use _ignoring_ what's being asked of it
> when the MAC binds to it. It should be empty for built-in setups
> where this should not be used, and we have precedent on Macchiatobin
> that interface switching by the PHY is permitted even in that
> situation.

... because it is decided before Linux even starts, and despite in
armada-8040-mcbin.dts is stated.
        phy-mode = "10gbase-r";

The same approach would not work for those RealTek PHYs and boards
using them. In some cases the bootloader sets up the PHY, and usually
there rate matching is used -- performance is not the goal there, but
simplicity. In other cases we rely entirely on the Linux PHY driver
to set things up. How would the PHY driver know whether it should
setup rate matching mode or interface switching mode?

The SFP case is clear, it's using host_interfaces. But in the built-in
case, as of today, it always ends up in fixed interface mode with
rate matching.

> > > So, aqr107_get_rate_matching() should work out whether rate matching
> > > will be used for the interface mode by scanning these registers.
> > > [...]
> > 
> > Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode
> > and perform rate-matching for lower speeds.
> 
> They don't _have_ to. To use a SFP with an AQR PHY on it in a clearfog
> platform (which only supports SGMII, 1000BASE-X and 2500BASE-X, I
> hacked the aquantia driver to do this:
> 
>         phy_set_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, MDIO_CTRL1_LPOWER);
>         mdelay(10);
>         phy_write_mmd(phydev, MDIO_MMD_VEND1, 0x31a, 2);
>         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_10M,
>                       VEND1_GLOBAL_CFG_SGMII_AN |
>                       VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
>         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_100M,
>                       VEND1_GLOBAL_CFG_SGMII_AN |
>                       VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
>         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_1G,
>                       VEND1_GLOBAL_CFG_SGMII_AN |
>                       VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
>         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_2_5G,
>                       VEND1_GLOBAL_CFG_SGMII_AN |
>                       VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII);
>         phy_clear_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1,
>                            MDIO_CTRL1_LPOWER);
> 
> which disables rate matching and causes it to switch interfaces.

That's pretty cool, and when connected to a MAC which support both
SGMII and 2500Base-X, and a driver which can switch between the two,
it should be the default imho.

> 
> Moreover, aqr107_read_status() reads the interface status and sets
> the MAC interface accordingly, so the driver does support dynamically
> changing the MAC interface just like 88x3310. If all use cases of
> this PHY only ever did rate matching, then there would be no point
> to that code being there!

So, just like for mcbin, the driver here replies on the PHY being
configured (by strapping or by the bootloader) to either use
a fixed interface mode with rate matching, or perform interface mode
switching. What if we had to decide this in the driver (like it is
the case with RealTek PHYs)?

Let me summarize:
 - We can't just assume that every MAC which supports 2500Base-X also
   supports SGMII. It might support only 2500Base-X, or 2500Base-X and
   1000Base-X, or the driver might not support switching between
   interface modes.
 - We can't rely on the PHY being pre-configured by the bootloader to
   either rate maching or interface-switching mode.
 - There are no strapping options for this, the only thing which is
   configured by strapping is the address.

> 
> > > Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only,
> > > I'm not sure that is correct. I didn't contribute this support, and I
> > > don't have any platforms that support this, and I don't have any
> > > experience of it.
> > 
> > Afaik enforcing half-duplex via rate-maching with pause-frames is
> > supported by all the 2.5G PHYs I've seen up to now.
> 
> I'm sorry, I don't understand your sentence, because it just flies
> wildly against everything that's been said.
> 
> First, pause-frames require full duplex on the link on which pause
> frames are to be sent and received. That's fundamental.

I meant pause-frames on the host-side interface of the PHY, which
are used to perform rate and duplex matching for the remote-side
interface. I hope that makes more sense now.

> 
> Second, I'm not sure what "enforcing half-duplex" has to do with
> rate-matching with pause-frames.

It can be used as a method to preventing the MAC from sending
to the PHY while receiving.

> 
> Third, the 88x3310 without MACSEC in rate matching mode requires the
> MAC to pace itself. No support for pause frames at all - you only
> get pause frames with the 88x3310P which has MACSEC. This is a 10M
> to 10G PHY multi-rate PHY.
> 
> So... your comment makes no sense to me, sorry.

You misunderstood me. I didn't mean pause frames on the remote-side of
the PHY. I meant pause frames as in RATE_MATCH_PAUSE.

> 
> > > What I do have is the data sheet for 88x3310, and that doesn't mention
> > > any restriction such as "only full duplex is supported in rate matching
> > > mode".
> > 
> > Yep, and I suppose it should work just fine. The same applies for
> > RealTek and MaxLinear PHYs. I've tested it.
> > 
> > > It is true that to use pause frames, the MAC/PCS must be in full-duplex
> > > mode, but if the PHY supports half-duplex on the media to full-duplex
> > > on the MAC side link, then why should phylink restrict this to be
> > > full-duplex only?
> > 
> > There is no reason for that imho. phylink.c states
> > /* Although a duplex-matching phy might exist, we
> >  * conservatively remove these modes because the MAC
> >  * will not be aware of the half-duplex nature of the
> >  * link.
> >  */
> > 
> > Afaik, practially all rate-matching PHYs which do support half-duplex
> > modes on the TP interface can perform duplex-matching as well.
> 
> So we should remove that restriction!

Absolutely. That will solve at least half of the problem. It still
leaves us with a SerDes clock running 2.5x faster than it would have to,
PHY and MAC consuming more energy than they would have to and TX
performance being slightly worse (visible with iperf3 --bidir at least
with some PHYs). But at least the link would come up.

> 
> > > I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is
> > > not correct - or maybe not versatile enough.
> > 
> > I agree. Never the less, why use rate matching at all if we don't have
> > to? It's obviously inefficient and wasteful, having the MAC follow the
> > PHY speed is preferrable in every case imho.
> 
> As I say, I believe this is a matter of how the Aquantia firmware is
> commissioned - I believe the defaults for all of the VEND1_GLOBAL_CFG*
> registers come from the firmware that the PHY loads.

I must admit I don't care much about the Aquantia story in the
picture here. Simply because the rate-matching implementation they
got seems to work rather well, and there anyway aren't many low-cost
mass-produced devices having Aquantia 2.5GBit/s PHYs. I know exactly
one device (Ubiquity UniFi 6 LR v1) which isn't produced in that
version any more -- later versions replaced the rather costly
Aquantia PHY with a low-cost RealTek PHY.

There are millions of devices with Intel/MaxLinear GPY211, and
probably even by magnitudes more with RealTek RTL8221B. Having all
those devices perform rate-matching and hence running the SerDes
clock at 3.125 GHz instead of 1.250 GHz is not just a significant
waste of electricity, but also just means wrose performance, esp. on
1000 MBit/s links (which happen to still be the most common).

> 
> So much like the pin strapping of 88x3310 determines its set of MAC
> modes, which are decided by the board designer, Aquantia firmware
> determines the PHY behaviour.
> 
> The problem here is we need to be mindful of the existing
> implementations, not only those where the MAC/PHY combination would
> get stuff wrong, but also of those MACs where multiple different
> interfaces are supported through hardware strapping which is not
> software determinable, but are not software configurable (like DSA
> switches.) Then there's those who hacked phylink into their use
> without properly implementing it (e.g. where mac_config() is entirely
> empty, and thus support no reconfiguration of the interface yet
> support multiple different interfaces at driver initialisation time.)
> 
> Yes, some of these situations can be said to be buggy. Feel free to
> spend a few years hacking on stmmac to try and fix that god almighty
> mess. I've given up with stmmac now after trying several attempts at
> cleaning the mess up, and ending up in a very non-productive place
> with it.

Sounds dreadful. However, just because one driver is a crazy mess it
should mean that all drivers have to perform worse than they could.
Can you name a few boards which rely on that particularly smelly
code? I might get some of those and get a feel of how bad things are,
and maybe I will spend a few years trying to improve things there.
Russell King (Oracle) Oct. 9, 2024, 11:23 p.m. UTC | #7
On Wed, Oct 09, 2024 at 10:31:51PM +0100, Daniel Golle wrote:
> On Wed, Oct 09, 2024 at 09:12:06PM +0100, Russell King (Oracle) wrote:
> > On Wed, Oct 09, 2024 at 07:52:42PM +0100, Daniel Golle wrote:
> > > [...]
> > > Imho allowing several available interface modes can still be
> > > advantageous also for built-in PHYs. Measurable power savings (~100mW on
> > > MT7986!) and not needing rate matching are both desireable things.
> > > 
> > > The question is just how would we get there...
> > 
> > We already do. Let me take an example that I use. Macchiatobin, which
> > you've probably heard lots about.
> > 
> > The PHY there is the 88x3310, which is configured to run in MAC mode 0
> > by hardware strapping. MAC mode 0 means the 88x3310 will switch between
> > 10GBASE-R, 5GBASE-R, 2500BASE-X and SGMII on its host interface.
> > 
> > The MAC is Marvell's PP2, which supports all of those, and being one of
> > the original MACs that phylink was developed against, is coded properly
> > such that it fully works with phylink dynamically changing the interface
> > mode.
> > 
> > The interface mode given in DT is just a "guide" because the 88x3310
> > does no more than verify that the interface mode that it is bound with
> > is one it supports. However, every time the link comes up, providing
> > it is not operating in rate matching mode (which the PP2 doesn't
> > support) it will change its MAC facing interface appropriately.
> 
> Unfortunately, and apparently different from the 88x3310, there is no
> hardware strapping for this to be decided with RealTek's PHYs, but
> using rate-matching or interface-mode-switching is decided by the
> driver.

No, but there is firmware, and firmware provisioning determines the
PHY behaviour. That's the equivalent of hardware pin strapping on
Aquantia PHYs.

I think you need to review a previous thread on the Aquantia PHYs
and firmware setting stupid interface modes, and trying to fix the
stuff up (and there in you will find the discussions about the
firmware provisioning that determines how the MAC interface on these
PHYs behave.

https://lore.kernel.org/lkml/20221115230207.2e77pifwruzkexbr@skbuf/T/

You'll also find other useful stuff relevant to this discussion too.
This is the lore link for the phy-modes discussion:

https://lore.kernel.org/all/20211123164027.15618-1-kabel@kernel.org/T/

> > Another board uses the 88x3310 with XAUI, and if I remember correctly,
> > the PHY is strapped for XAUI with rate matching mode. It's connected
> > to an 88e6390x DSA switch port 9, which supports RXAUI, XAUI, SGMII,
> > 1000BASEX, 2500BASEX and maybe other stuff too.
> > 
> > So, what we do is in DT, we specify the maximum mode, and rely on the
> > hardware being correctly strapped on the PHY to configure how the
> > MAC side interface will be used.
> 
> Does that mean the realtek.c PHY driver (and maybe others as well) should
> just assume that the MAC must also support SGMII mode in case 2500Base-X
> is supported?
> 
> I was under the impression that there might be MAC out there which support
> only 2500Base-X, or maybe 2500Base-X and 1000Base-X, but not SGMII. And on
> those we would always need to use 2500Base-X with rate matching.
> But maybe I'm wrong.

The problem I'm trying to point out is that doing what you're wanting
has a high chance of causing _regressions_, causing setups that work
today to stop working. We can't allow that to happen, sorry.

> > Now, the thing with that second board is... if we use your original
> > suggestion, then we end up filling the host_interfaces with just
> > 2500BASEX, 1000BASEX and SGMII. That will lead mv3310_select_mactype()
> > to select MV_V2_33X0_PORT_CTRL_MACTYPE_10GBASER for which the PHY will
> > attempt to use the modes I listed above for Macchiatobin on its MAC
> > interface which is wrong.
> > 
> > Let me repeat the point. phydev->host_interfaces is there to allow a
> > PHY driver to go off and make its own decisions about the interface
> > mode group that it should use _ignoring_ what's being asked of it
> > when the MAC binds to it. It should be empty for built-in setups
> > where this should not be used, and we have precedent on Macchiatobin
> > that interface switching by the PHY is permitted even in that
> > situation.
> 
> ... because it is decided before Linux even starts, and despite in
> armada-8040-mcbin.dts is stated.
>         phy-mode = "10gbase-r";

You may notice in the discussions I've linked above, phy-mode is the
"initial" MAC mode for these PHYs...

Note that this interface switching mechanism was introduced early on
with the 88x3310 PHY, before any documentation on it was available,
and all that was known at the time is that the PHY switched between
different MAC facing interface modes depending on the negotiated
speed. It needed to be supported, and what we have came out of that.
Legacy is important, due to the "no regressions" rule that we have
in kernel development - we can't go breaking already working setups.

> The same approach would not work for those RealTek PHYs and boards
> using them. In some cases the bootloader sets up the PHY, and usually
> there rate matching is used -- performance is not the goal there, but
> simplicity. In other cases we rely entirely on the Linux PHY driver
> to set things up. How would the PHY driver know whether it should
> setup rate matching mode or interface switching mode?

In the case of the Realtek driver, it decides to override whatever came
before in a defined way. E.g. rtl822xb_config_init().

> The SFP case is clear, it's using host_interfaces. But in the built-in
> case, as of today, it always ends up in fixed interface mode with
> rate matching.

"always rate matching" no. Fixed interface mode, yes. If
rtl822xb_config_init() sees phydev->interface is set to SGMII, then
it uses 2500BASEX_SGMII mode without rate matching - and the
advertisement will be limited to 1G and below which will effectively
prevent the PHY using 2.5G mode - which is fine in that case.

> > > > So, aqr107_get_rate_matching() should work out whether rate matching
> > > > will be used for the interface mode by scanning these registers.
> > > > [...]
> > > 
> > > Afaik Aquantia 2.5G PHYs always work only with a fixed interface mode
> > > and perform rate-matching for lower speeds.
> > 
> > They don't _have_ to. To use a SFP with an AQR PHY on it in a clearfog
> > platform (which only supports SGMII, 1000BASE-X and 2500BASE-X, I
> > hacked the aquantia driver to do this:
> > 
> >         phy_set_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1, MDIO_CTRL1_LPOWER);
> >         mdelay(10);
> >         phy_write_mmd(phydev, MDIO_MMD_VEND1, 0x31a, 2);
> >         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_10M,
> >                       VEND1_GLOBAL_CFG_SGMII_AN |
> >                       VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
> >         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_100M,
> >                       VEND1_GLOBAL_CFG_SGMII_AN |
> >                       VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
> >         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_1G,
> >                       VEND1_GLOBAL_CFG_SGMII_AN |
> >                       VEND1_GLOBAL_CFG_SERDES_MODE_SGMII);
> >         phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CFG_2_5G,
> >                       VEND1_GLOBAL_CFG_SGMII_AN |
> >                       VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII);
> >         phy_clear_bits_mmd(phydev, MDIO_MMD_VEND1, MDIO_CTRL1,
> >                            MDIO_CTRL1_LPOWER);
> > 
> > which disables rate matching and causes it to switch interfaces.
> 
> That's pretty cool, and when connected to a MAC which support both
> SGMII and 2500Base-X, and a driver which can switch between the two,
> it should be the default imho.
> 
> > 
> > Moreover, aqr107_read_status() reads the interface status and sets
> > the MAC interface accordingly, so the driver does support dynamically
> > changing the MAC interface just like 88x3310. If all use cases of
> > this PHY only ever did rate matching, then there would be no point
> > to that code being there!
> 
> So, just like for mcbin, the driver here replies on the PHY being
> configured (by strapping or by the bootloader) to either use
> a fixed interface mode with rate matching, or perform interface mode
> switching. What if we had to decide this in the driver (like it is
> the case with RealTek PHYs)?

Then we have a problem, because simply re-using the host_interfaces
for built-in PHYs is _going_ _to_ cause problems.

As I've said several times now... if all phylink users implemented the
phylink methods such that they can at run time switch between all the
interface modes in phylink's supported_interfaces (or only set
interface modes they were prepared to switch between), then that would
be a good first step forward. Unfortunately, for some, such as the
88e6390x, we can't tell exactly which interface mode is pinstrapped,
and we can't change the interface mode. So, we have to live with the
fact that ->supported_interfaces is less than accurate.

> Let me summarize:
>  - We can't just assume that every MAC which supports 2500Base-X also
>    supports SGMII. It might support only 2500Base-X, or 2500Base-X and
>    1000Base-X, or the driver might not support switching between
>    interface modes.
>  - We can't rely on the PHY being pre-configured by the bootloader to
>    either rate maching or interface-switching mode.
>  - There are no strapping options for this, the only thing which is
>    configured by strapping is the address.

Right, so the only _safe_ thing to do is to assume that:

1. On existing PHY drivers which do not do any kind of interface
switching, retrofitting interface switching of any kind is unsafe.

2. On brand new PHY drivers which have no prior history, there can
not be any regressions, so implementing interface switching from
the very start is safe.

The only way out of this is by inventing something new for existing
drivers that says "you can adopt a different behaviour" and that
must be a per-platform switch - in other words, coming from the
platform's firmware definition in some way.

> > > > Now, while phylink restricts RATE_MATCH_PAUSE to being full-duplex only,
> > > > I'm not sure that is correct. I didn't contribute this support, and I
> > > > don't have any platforms that support this, and I don't have any
> > > > experience of it.
> > > 
> > > Afaik enforcing half-duplex via rate-maching with pause-frames is
> > > supported by all the 2.5G PHYs I've seen up to now.
> > 
> > I'm sorry, I don't understand your sentence, because it just flies
> > wildly against everything that's been said.
> > 
> > First, pause-frames require full duplex on the link on which pause
> > frames are to be sent and received. That's fundamental.
> 
> I meant pause-frames on the host-side interface of the PHY, which
> are used to perform rate and duplex matching for the remote-side
> interface. I hope that makes more sense now.
> 
> > 
> > Second, I'm not sure what "enforcing half-duplex" has to do with
> > rate-matching with pause-frames.
> 
> It can be used as a method to preventing the MAC from sending
> to the PHY while receiving.
> 
> > 
> > Third, the 88x3310 without MACSEC in rate matching mode requires the
> > MAC to pace itself. No support for pause frames at all - you only
> > get pause frames with the 88x3310P which has MACSEC. This is a 10M
> > to 10G PHY multi-rate PHY.
> > 
> > So... your comment makes no sense to me, sorry.
> 
> You misunderstood me. I didn't mean pause frames on the remote-side of
> the PHY. I meant pause frames as in RATE_MATCH_PAUSE.

You also misunderstand me. I wasn't talking about remote-side of the
PHY either! I'm also none the wiser exactly what you meant by "Afaik
enforcing half-duplex via rate-maching with pause-frames is supported
by all the 2.5G PHYs I've seen up to now."

When rate matching with pause frames is being used, the host-side
interface _must_ be in full duplex mode, because that is the only
time when pause frames are permitted to be used by a MAC. I think
we agree on that.

What I'm having a hard time understanding is *enforcing* half-duplex.
I think what you're trying to say is not "enforcing" the use of
half-duplex somewhere, but pause frames over the host link to
prevent transmission while receiving.

It's likely not that simple - the PHY probably has buffering of
the transmit frame (consider the difference in the time it takes
to send a frame at 10G speeds vs a potential media speed of 10M.)
What the PHY will do is send pause frames when that buffer gets
full (e.g. after receiving one packet to be transmitted) until
it has enough space to accept another transmit packet. If the
link is in half-duplex mode, then the PHY will need to do the
receive detection, collision handling, and retries of the pending
frame itself because there will be no way to signal back to the
MAC "that last frame you sent me, could you resend because it
collided".

> > > > What I do have is the data sheet for 88x3310, and that doesn't mention
> > > > any restriction such as "only full duplex is supported in rate matching
> > > > mode".
> > > 
> > > Yep, and I suppose it should work just fine. The same applies for
> > > RealTek and MaxLinear PHYs. I've tested it.
> > > 
> > > > It is true that to use pause frames, the MAC/PCS must be in full-duplex
> > > > mode, but if the PHY supports half-duplex on the media to full-duplex
> > > > on the MAC side link, then why should phylink restrict this to be
> > > > full-duplex only?
> > > 
> > > There is no reason for that imho. phylink.c states
> > > /* Although a duplex-matching phy might exist, we
> > >  * conservatively remove these modes because the MAC
> > >  * will not be aware of the half-duplex nature of the
> > >  * link.
> > >  */
> > > 
> > > Afaik, practially all rate-matching PHYs which do support half-duplex
> > > modes on the TP interface can perform duplex-matching as well.
> > 
> > So we should remove that restriction!
> 
> Absolutely. That will solve at least half of the problem. It still
> leaves us with a SerDes clock running 2.5x faster than it would have to,
> PHY and MAC consuming more energy than they would have to and TX
> performance being slightly worse (visible with iperf3 --bidir at least
> with some PHYs). But at least the link would come up.

It also means we move forward!

> > > > I suspect phylink_get_capabilities() handling for RATE_MATCH_PAUSE is
> > > > not correct - or maybe not versatile enough.
> > > 
> > > I agree. Never the less, why use rate matching at all if we don't have
> > > to? It's obviously inefficient and wasteful, having the MAC follow the
> > > PHY speed is preferrable in every case imho.
> > 
> > As I say, I believe this is a matter of how the Aquantia firmware is
> > commissioned - I believe the defaults for all of the VEND1_GLOBAL_CFG*
> > registers come from the firmware that the PHY loads.
> 
> I must admit I don't care much about the Aquantia story in the
> picture here. Simply because the rate-matching implementation they
> got seems to work rather well, and there anyway aren't many low-cost
> mass-produced devices having Aquantia 2.5GBit/s PHYs. I know exactly
> one device (Ubiquity UniFi 6 LR v1) which isn't produced in that
> version any more -- later versions replaced the rather costly
> Aquantia PHY with a low-cost RealTek PHY.
> 
> There are millions of devices with Intel/MaxLinear GPY211, and
> probably even by magnitudes more with RealTek RTL8221B. Having all
> those devices perform rate-matching and hence running the SerDes
> clock at 3.125 GHz instead of 1.250 GHz is not just a significant
> waste of electricity, but also just means wrose performance, esp. on
> 1000 MBit/s links (which happen to still be the most common).
> 
> > 
> > So much like the pin strapping of 88x3310 determines its set of MAC
> > modes, which are decided by the board designer, Aquantia firmware
> > determines the PHY behaviour.
> > 
> > The problem here is we need to be mindful of the existing
> > implementations, not only those where the MAC/PHY combination would
> > get stuff wrong, but also of those MACs where multiple different
> > interfaces are supported through hardware strapping which is not
> > software determinable, but are not software configurable (like DSA
> > switches.) Then there's those who hacked phylink into their use
> > without properly implementing it (e.g. where mac_config() is entirely
> > empty, and thus support no reconfiguration of the interface yet
> > support multiple different interfaces at driver initialisation time.)
> > 
> > Yes, some of these situations can be said to be buggy. Feel free to
> > spend a few years hacking on stmmac to try and fix that god almighty
> > mess. I've given up with stmmac now after trying several attempts at
> > cleaning the mess up, and ending up in a very non-productive place
> > with it.
> 
> Sounds dreadful. However, just because one driver is a crazy mess it
> should mean that all drivers have to perform worse than they could.
> Can you name a few boards which rely on that particularly smelly
> code? I might get some of those and get a feel of how bad things are,
> and maybe I will spend a few years trying to improve things there.

Well, when you realise that stmmac's internal PCS implementation
entirely bypasses phylink yet the driver uses phylink... I had
patches, now junked.

I tried some simpler cleanups before trying to fix that, which got
massively hung up on an insanely trivial question about whether
"%d" or "%u" should be used to print the speed in a mac_link_up
method... and resulted in me junking the entire set of stmmac
patches I had.

I've given up with stmmac. You may notice that my recent XPCS
patch series only _minimally_ touched the stmmac code. It could've
gone further, but I really want to touch stmmac as little as
possible now. I've decided it's a hopeless case. In fact, I've
asked that the driver backs out its use of phylink... which has
been ignored. So... at this point, as phylink maintainer, my
only option is to ignore stmmac for any more major changes and
leave the updates to whoever cares about stmmac. I don't like
reaching that point with a driver, but I see no other option
with it given the history..
Daniel Golle Oct. 10, 2024, 2:07 p.m. UTC | #8
On Thu, Oct 10, 2024 at 12:23:18AM +0100, Russell King (Oracle) wrote:
> On Wed, Oct 09, 2024 at 10:31:51PM +0100, Daniel Golle wrote:
> > On Wed, Oct 09, 2024 at 09:12:06PM +0100, Russell King (Oracle) wrote:
> > > [...]
> > > The interface mode given in DT is just a "guide" because the 88x3310
> > > does no more than verify that the interface mode that it is bound with
> > > is one it supports. However, every time the link comes up, providing
> > > it is not operating in rate matching mode (which the PP2 doesn't
> > > support) it will change its MAC facing interface appropriately.
> > 
> > Unfortunately, and apparently different from the 88x3310, there is no
> > hardware strapping for this to be decided with RealTek's PHYs, but
> > using rate-matching or interface-mode-switching is decided by the
> > driver.
> 
> No, but there is firmware, and firmware provisioning determines the
> PHY behaviour. That's the equivalent of hardware pin strapping on
> Aquantia PHYs.

Yes, but there is broken firmware, or firmware which doesn't care
and always just expected the Linux driver to take care of that.
In many cases I've seen, U-Boot, for example, only initializes the
PHY and sets those registers in case it is configured to load
Linux via TFTP, or interrupted by the user who may chose to do that.
In the default case (loading Linux from flash) the PHY is not even
powered-up in some cases.

In case you mean firmware as in what Linux is presented in DT, then
yes, I agree. But we'll need define new bindings for that then.

> 
> I think you need to review a previous thread on the Aquantia PHYs
> and firmware setting stupid interface modes, and trying to fix the
> stuff up (and there in you will find the discussions about the
> firmware provisioning that determines how the MAC interface on these
> PHYs behave.
> 
> https://lore.kernel.org/lkml/20221115230207.2e77pifwruzkexbr@skbuf/T/
> 
> You'll also find other useful stuff relevant to this discussion too.
> This is the lore link for the phy-modes discussion:
> 
> https://lore.kernel.org/all/20211123164027.15618-1-kabel@kernel.org/T/

Thank you a lot for digging up those conversions. I've spent some hours
reading and understanding all of that, and well, things are complicated.

To my surprise, introducing an array of phy-connection-type or phy-mode
in DT has been positively received by the DT folks, but the idea was
then later on discarded by netdev participants of the debate. Imho that
would still be the best solution, but I understand the argument that
having to list all supported modes can be both tedious and hard to
review, and hence error-prone.

> > > So, what we do is in DT, we specify the maximum mode, and rely on the
> > > hardware being correctly strapped on the PHY to configure how the
> > > MAC side interface will be used.
> > 
> > Does that mean the realtek.c PHY driver (and maybe others as well) should
> > just assume that the MAC must also support SGMII mode in case 2500Base-X
> > is supported?
> > 
> > I was under the impression that there might be MAC out there which support
> > only 2500Base-X, or maybe 2500Base-X and 1000Base-X, but not SGMII. And on
> > those we would always need to use 2500Base-X with rate matching.
> > But maybe I'm wrong.
> 
> The problem I'm trying to point out is that doing what you're wanting
> has a high chance of causing _regressions_, causing setups that work
> today to stop working. We can't allow that to happen, sorry.

Of course :)

> Note that this interface switching mechanism was introduced early on
> with the 88x3310 PHY, before any documentation on it was available,
> and all that was known at the time is that the PHY switched between
> different MAC facing interface modes depending on the negotiated
> speed. It needed to be supported, and what we have came out of that.
> Legacy is important, due to the "no regressions" rule that we have
> in kernel development - we can't go breaking already working setups.

What about marking Ethernet drivers which are capable of interface
mode switching? Right now there isn't one "correct" thing to do for
PHY drivers, which is bad, as people may look into any driver as
a reference for the development of future drivers.

So why not introduce a MAC capability bit? Even dedicated for switching
between two specific modes (SGMII and 2500Base-X), to avoid any
ambiguitities or unnecessary feature-creep.

Typically switching between modes within the same PCS is more easy to
support, switching from 10:8 to 66:64 coding can be more involved,
and require resets which affect other links, so that's something
worth avoiding unless we really need it (in case the users inserts
a different SFP module it would be really needed, in case the external
link goes from 5 Gbit/s to 2.5 Gbit/s it might be worth avoiding
having to switch from 5GBase-R to 2500Base-X)

It wouldn't be the first time something like that would be done, however
I have full understanding that any reminder of that whole
legacy_pre_march2020 episode may trigger post-traumatic stress in netdev
maintainers.

> 
> > The same approach would not work for those RealTek PHYs and boards
> > using them. In some cases the bootloader sets up the PHY, and usually
> > there rate matching is used -- performance is not the goal there, but
> > simplicity. In other cases we rely entirely on the Linux PHY driver
> > to set things up. How would the PHY driver know whether it should
> > setup rate matching mode or interface switching mode?
> 
> In the case of the Realtek driver, it decides to override whatever came
> before in a defined way. E.g. rtl822xb_config_init().

Yes, and right now it would decide to always use 2500Base-X on a host
capable of 2500Base-X, and hence make use of RATE_MATCH_PAUSE for
1000M/100M/10M links, even though using SGMII would be better in terms
of energy consumption and performance.

> 
> > The SFP case is clear, it's using host_interfaces. But in the built-in
> > case, as of today, it always ends up in fixed interface mode with
> > rate matching.
> 
> "always rate matching" no. Fixed interface mode, yes. If
> rtl822xb_config_init() sees phydev->interface is set to SGMII, then
> it uses 2500BASEX_SGMII mode without rate matching - and the
> advertisement will be limited to 1G and below which will effectively
> prevent the PHY using 2.5G mode - which is fine in that case.

While that case might be relevant for SFP modules, in pracise, why would
anyone use a more expensive 2.5Gbit/s PHY on a board which only supports
up to 1Gbit/s -- that doesn't happen in the real world, because 1Gbit/s
SGMII PHYs are ubiquitous and much cheaper than faster ones.

> > Let me summarize:
> >  - We can't just assume that every MAC which supports 2500Base-X also
> >    supports SGMII. It might support only 2500Base-X, or 2500Base-X and
> >    1000Base-X, or the driver might not support switching between
> >    interface modes.
> >  - We can't rely on the PHY being pre-configured by the bootloader to
> >    either rate maching or interface-switching mode.
> >  - There are no strapping options for this, the only thing which is
> >    configured by strapping is the address.
> 
> Right, so the only _safe_ thing to do is to assume that:
> 
> 1. On existing PHY drivers which do not do any kind of interface
> switching, retrofitting interface switching of any kind is unsafe.

It's important to note that for the RealTek driver specifically we
have done that in OpenWrt from day 1 -- simply because the rate-matching
performed, especially by early versions of the PHY, is too bad. So we
always made the driver switch interface to SGMII in case of link speeds
slower than 2500M.
Now, with the backport of upstream commits replacing our downstream
patches, users started to complain that the issue of bad PHY performance
in 1 Gbit/s mode has returned, which is the whole reason why I started
to work on this issue.

I understand, however, there may of course be other users of those
RealTek 2.5G PHYs, even Rockchip with stmmac maybe, and that would
break if we assume the MAC can support switching between 2500Base-X
and SGMII, so users of those boards will have to live with the bad
performance of the rate-matching performed by the PHY unless someone
fixes the stmmac driver...

> 
> 2. On brand new PHY drivers which have no prior history, there can
> not be any regressions, so implementing interface switching from
> the very start is safe.
> 
> The only way out of this is by inventing something new for existing
> drivers that says "you can adopt a different behaviour" and that
> must be a per-platform switch - in other words, coming from the
> platform's firmware definition in some way.

Why would it not just be the MAC driver which indicates that it can
support switching to lower-speed interface modes it also supports?
Do you really believe there are boards which are electrically
unfit for performing SGMII on the traces intended for 2500Base-X?

If by 'firmware' you mean 'device tree' then we are back on square
one, and we would need several phy-connection-type aka. phy-mode
listed there.

After having read all the threads from 2021 you have provided links for,
I believe that maybe an additional property which lists the interface
modes to be used *optionally* and in addition to the primary (ie.
fastest) mode stated in phy-mode or phy-connection-type could be a way
out of this. It would still end up being potentially a longer list of
interface modes, but reality is complex! Looking at other corners of DT
it would still be rather simple and human readable (in contrast: take a
look at inhumanly long arrays of gpio-line-names where even the order
matters, for example...)

Yet, it would still be a partial violation of Device Tree rules because
we are (also) describing driver capabilities there then. What if, let's
say one day stmmac *will* support interface mode switching? Should we
update each and every board's device tree?

Hence, unless there is a really good reason to believe that a board which
works fine with 2500Base-X would not work equally well with SGMII, given
that both the MAC (-driver) and PHY (-driver) support both modes, I don't
see a need to burden firmware with having to list additional phy-modes.

Of course, I'm always only talking about allowing switching to slower
interface modes, and always only about modes which use a single pair
differential lanes (ie. sgmii, 1000base-x, 2500base-x, 5gbase-r,
10gbase-r, ...). 

> > > > Afaik, practially all rate-matching PHYs which do support half-duplex
> > > > modes on the TP interface can perform duplex-matching as well.
> > > 
> > > So we should remove that restriction!
> > 
> > Absolutely. That will solve at least half of the problem. It still
> > leaves us with a SerDes clock running 2.5x faster than it would have to,
> > PHY and MAC consuming more energy than they would have to and TX
> > performance being slightly worse (visible with iperf3 --bidir at least
> > with some PHYs). But at least the link would come up.
> 
> It also means we move forward!

ACK. Patch posted.
Russell King (Oracle) Oct. 10, 2024, 2:58 p.m. UTC | #9
On Thu, Oct 10, 2024 at 03:07:30PM +0100, Daniel Golle wrote:
> > Note that this interface switching mechanism was introduced early on
> > with the 88x3310 PHY, before any documentation on it was available,
> > and all that was known at the time is that the PHY switched between
> > different MAC facing interface modes depending on the negotiated
> > speed. It needed to be supported, and what we have came out of that.
> > Legacy is important, due to the "no regressions" rule that we have
> > in kernel development - we can't go breaking already working setups.
> 
> What about marking Ethernet drivers which are capable of interface
> mode switching? Right now there isn't one "correct" thing to do for
> PHY drivers, which is bad, as people may look into any driver as
> a reference for the development of future drivers.
> 
> So why not introduce a MAC capability bit? Even dedicated for switching
> between two specific modes (SGMII and 2500Base-X), to avoid any
> ambiguitities or unnecessary feature-creep.

They already have a perfectly good way to do this today. They can look
at DT and set just one mode in the ->supported_interfaces bitmap if
they don't support interface switching! The MAC drivers are already
responsible for parsing the phy-mode from firmware, and it's that
driver that also knows whether it knows if it supports interface
switching or not. So I don't see any need for additional capability
bits.

> Typically switching between modes within the same PCS is more easy to
> support, switching from 10:8 to 66:64 coding can be more involved,
> and require resets which affect other links, so that's something
> worth avoiding unless we really need it (in case the users inserts
> a different SFP module it would be really needed, in case the external
> link goes from 5 Gbit/s to 2.5 Gbit/s it might be worth avoiding
> having to switch from 5GBase-R to 2500Base-X)
> 
> It wouldn't be the first time something like that would be done, however
> I have full understanding that any reminder of that whole
> legacy_pre_march2020 episode may trigger post-traumatic stress in netdev
> maintainers.

Yes, and it's still continuing to cause problems today. I've just
replied to someone working on the macb driver who was proposing to
make use state->speed in his mac_config() method, despite modern
phylink always passing SPEED_UNKNOWN for that. I guess if I didn't
reply, he'd find out the hard way (which is why I made the change
to phylink_mac_config() to cause testing failures if people try
that.)

> > > The SFP case is clear, it's using host_interfaces. But in the built-in
> > > case, as of today, it always ends up in fixed interface mode with
> > > rate matching.
> > 
> > "always rate matching" no. Fixed interface mode, yes. If
> > rtl822xb_config_init() sees phydev->interface is set to SGMII, then
> > it uses 2500BASEX_SGMII mode without rate matching - and the
> > advertisement will be limited to 1G and below which will effectively
> > prevent the PHY using 2.5G mode - which is fine in that case.
> 
> While that case might be relevant for SFP modules, in pracise, why would
> anyone use a more expensive 2.5Gbit/s PHY on a board which only supports
> up to 1Gbit/s -- that doesn't happen in the real world, because 1Gbit/s
> SGMII PHYs are ubiquitous and much cheaper than faster ones.

I'm merely stating the logic that is there today.

> > > Let me summarize:
> > >  - We can't just assume that every MAC which supports 2500Base-X also
> > >    supports SGMII. It might support only 2500Base-X, or 2500Base-X and
> > >    1000Base-X, or the driver might not support switching between
> > >    interface modes.
> > >  - We can't rely on the PHY being pre-configured by the bootloader to
> > >    either rate maching or interface-switching mode.
> > >  - There are no strapping options for this, the only thing which is
> > >    configured by strapping is the address.
> > 
> > Right, so the only _safe_ thing to do is to assume that:
> > 
> > 1. On existing PHY drivers which do not do any kind of interface
> > switching, retrofitting interface switching of any kind is unsafe.
> 
> It's important to note that for the RealTek driver specifically we
> have done that in OpenWrt from day 1 -- simply because the rate-matching
> performed, especially by early versions of the PHY, is too bad. So we
> always made the driver switch interface to SGMII in case of link speeds
> slower than 2500M.
> Now, with the backport of upstream commits replacing our downstream
> patches, users started to complain that the issue of bad PHY performance
> in 1 Gbit/s mode has returned, which is the whole reason why I started
> to work on this issue.
> 
> I understand, however, there may of course be other users of those
> RealTek 2.5G PHYs, even Rockchip with stmmac maybe, and that would
> break if we assume the MAC can support switching between 2500Base-X
> and SGMII, so users of those boards will have to live with the bad
> performance of the rate-matching performed by the PHY unless someone
> fixes the stmmac driver...

If OpenWRT's switching predates July 2017, then maybe they should've
been more pro-active at getting their patches upstream?

Unfortunately, in July 2017, there was nothing in mainline supporting
2500base*, and nothing doing any interface switching until I added
the Marvell 88x3310 driver which was where _I_ proposed interface
switching being added to phylib.

> > 2. On brand new PHY drivers which have no prior history, there can
> > not be any regressions, so implementing interface switching from
> > the very start is safe.
> > 
> > The only way out of this is by inventing something new for existing
> > drivers that says "you can adopt a different behaviour" and that
> > must be a per-platform switch - in other words, coming from the
> > platform's firmware definition in some way.
> 
> Why would it not just be the MAC driver which indicates that it can
> support switching to lower-speed interface modes it also supports?
> Do you really believe there are boards which are electrically
> unfit for performing SGMII on the traces intended for 2500Base-X?

For a single-lane serdes, what you say is true. However, it is not
universal across all interface modes.

As I stated in a previous discussion, if we have e.g. four lanes of
XAUI between a PHY and MAC, and both ends support XAUI, RXAUI,
10GBASE-R, 5GBASE-R, 2500BASE-X, and SGMII, it does not necessarily
follow that the platform can support 10GBASE-R and 5GBASE-R over
a single lane because the signalling rate is so much higher. Just
because the overall speed is the same or lower does not automatically
mean that it can be used.

> If by 'firmware' you mean 'device tree' then we are back on square
> one, and we would need several phy-connection-type aka. phy-mode
> listed there.
> 
> After having read all the threads from 2021 you have provided links for,
> I believe that maybe an additional property which lists the interface
> modes to be used *optionally* and in addition to the primary (ie.
> fastest) mode stated in phy-mode or phy-connection-type could be a way
> out of this. It would still end up being potentially a longer list of
> interface modes, but reality is complex! Looking at other corners of DT
> it would still be rather simple and human readable (in contrast: take a
> look at inhumanly long arrays of gpio-line-names where even the order
> matters, for example...)
> 
> Yet, it would still be a partial violation of Device Tree rules because
> we are (also) describing driver capabilities there then. What if, let's
> say one day stmmac *will* support interface mode switching? Should we
> update each and every board's device tree?

Well, I guess we need people that adopt phylink to actually implement
it properly rather than just slapping it into their driver in a way
that "works for them". :)

This is a battle that I've been trying for years with, but programmers
are lazy individuals who don't want to (a) read API documentation, (b)
implement things properly.

Anyway, I'm out of time right now to continue replying to this
conversation (it's taken over an hour so far to put what I've said
together, and I now have a meeting... so reached the end of what I can
do right now... so close to the end of your email too! Alas...

> 
> Hence, unless there is a really good reason to believe that a board which
> works fine with 2500Base-X would not work equally well with SGMII, given
> that both the MAC (-driver) and PHY (-driver) support both modes, I don't
> see a need to burden firmware with having to list additional phy-modes.
> 
> Of course, I'm always only talking about allowing switching to slower
> interface modes, and always only about modes which use a single pair
> differential lanes (ie. sgmii, 1000base-x, 2500base-x, 5gbase-r,
> 10gbase-r, ...). 
> 
> > > > > Afaik, practially all rate-matching PHYs which do support half-duplex
> > > > > modes on the TP interface can perform duplex-matching as well.
> > > > 
> > > > So we should remove that restriction!
> > > 
> > > Absolutely. That will solve at least half of the problem. It still
> > > leaves us with a SerDes clock running 2.5x faster than it would have to,
> > > PHY and MAC consuming more energy than they would have to and TX
> > > performance being slightly worse (visible with iperf3 --bidir at least
> > > with some PHYs). But at least the link would come up.
> > 
> > It also means we move forward!
> 
> ACK. Patch posted.
>
Daniel Golle Oct. 10, 2024, 4:28 p.m. UTC | #10
On Thu, Oct 10, 2024 at 03:58:36PM +0100, Russell King (Oracle) wrote:
> On Thu, Oct 10, 2024 at 03:07:30PM +0100, Daniel Golle wrote:
> > > Note that this interface switching mechanism was introduced early on
> > > with the 88x3310 PHY, before any documentation on it was available,
> > > and all that was known at the time is that the PHY switched between
> > > different MAC facing interface modes depending on the negotiated
> > > speed. It needed to be supported, and what we have came out of that.
> > > Legacy is important, due to the "no regressions" rule that we have
> > > in kernel development - we can't go breaking already working setups.
> > 
> > What about marking Ethernet drivers which are capable of interface
> > mode switching? Right now there isn't one "correct" thing to do for
> > PHY drivers, which is bad, as people may look into any driver as
> > a reference for the development of future drivers.
> > 
> > So why not introduce a MAC capability bit? Even dedicated for switching
> > between two specific modes (SGMII and 2500Base-X), to avoid any
> > ambiguitities or unnecessary feature-creep.
> 
> They already have a perfectly good way to do this today. They can look
> at DT and set just one mode in the ->supported_interfaces bitmap if
> they don't support interface switching!

Some drivers seem to support multiple interface modes, but don't support
switching between them if asked to do so by the PHY. That was the
potential cause for regressions you named in case we would populate
host_interfaces also for on-board PHYs (independently of how careful or
selective we would do that, which is another story).

> The MAC drivers are already
> responsible for parsing the phy-mode from firmware, and it's that
> driver that also knows whether it knows if it supports interface
> switching or not. So I don't see any need for additional capability
> bits.

I agree that the MAC driver should know whether it can do PHY-requested
mode switches, however, at this point phylink doesn't. We may want to
let phylink know that the MAC (-driver) is fit for such acrobatic moves,
instead of just assuming that in PHY drivers by historic precedence, no?

Let me reiterate over the current situation:

 - The mxl-gpy.c depends on the PHY being pre-provisioned by early boot
   firmware to decide whether it should do interface mode switching or
   perform rate matching. Some (but not all) board vendors correctly
   provision the PHY (ie. to perform interface mode switching), others
   just set it to rate-matching because it's easier to support when
   using the Ethernet connection **within the bootloader**.
   After all, this is not so much of a problem because while still being
   a potential waste of power (clocks running 2.5x faster even if not
   needed to) the rate matching is at least implemented in a way which
   works well and doesn't hurt too much beyond wasting ~100 milliwatts.

 - Marvell PHYs use pin strapping to decide that.

 - Aquantia seems to rely on provisioned firmware.

 - The RealTek driver currently decides whether to use interface mode
   switching or rate matching based on the selected (main) interface
   mode as well as phydev->host_interfaces.
   * host_interfaces is not set unless the PHY is used on a SFP module
   * the driver hence ends up always using a fixed interface mode for
     on-board PHYs.
     => This is bad because we end up using the (bad) rate matching
        instead of interface mode switching (good) even though that
        would be perfectly supported by both MAC and PHY.
        - We currently don't have a way for phylink to know whether
          a MAC can perform interface mode switching
        - Populating host_interfaces also for on-board PHYs was
          rejected because it might introduce regressions due to MAC
          drivers not being able to perform interface mode switching
          and relying on the current behaviour.
          => We may hence need a way for phylink to know whether an
             Ethernet MAC can perform interface mode switching or not.

> > I understand, however, there may of course be other users of those
> > RealTek 2.5G PHYs, even Rockchip with stmmac maybe, and that would
> > break if we assume the MAC can support switching between 2500Base-X
> > and SGMII, so users of those boards will have to live with the bad
> > performance of the rate-matching performed by the PHY unless someone
> > fixes the stmmac driver...
> 
> If OpenWRT's switching predates July 2017, then maybe they should've
> been more pro-active at getting their patches upstream?

No, and there were no consumer-grade devices with 2.5G PHYs back then.
The first started to show up around 2020. The first to be supported by
OpenWrt was the before mentioned Ubiquiti UniFi 6 LR AP, sporting an
Aquanria AQR112 2.5G PHY connected via 2500Base-X, and always performing
rate matching.

The RealTek 2.5G PHY driver creeped in via their PCIe Ethernet NIC
drivers, and in the beginning (2020) only supported the PHYs built-into
those PCIe NIC chips. Most setup of the built-in PHY is anyway done by
the NIC driver, so nobody bothered to implement config_init() until we
started to see devices using the RTL8226 (later renamed to RTL8221B) as
standalone PHY ICs connected to router SoCs around 2021/22. Soon people
contributed OpenWrt support for those boards, initially just not
understanding why the PHY would only work with 2500Base-T links but not
with 1000Base-T, 100Base-TX, ...

At some point Alexander Couzens and me were working on sorting out
proper support for those PHYs, including interface mode switching, and
Alexander implemented the config_init() and update_interface()
functions.

I initially tried to upstream the patch Alexander Couzens had written
for that in 2023, and back then we assumed that interface mode switching
would always be performed in case 2500Base-X was the main mode.
However, I gave up at some point because I was asked for detailed
definition of the PHY registers and bits, which, frankly speaking, just
doesn't exist, not even in RealTek's under-NDA datasheet which merely
lists some magic sequences of register writes.

https://patchwork.kernel.org/comment/25331309/

In December 2023 Marek Behún then modified the patch and sent it
upstream once again. It was part of a bigger series improving the
driver, and he clearly stated that:
"All this is done so that we can support another 2.5G copper SFP module"

https://patchwork.kernel.org/project/netdevbpf/cover/20231220155518.15692-1-kabel@kernel.org/

This series (luckily without anyone asking for non-existent register
definitions this time) ended up in mainline Linux.

As OpenWrt generally tries to be as close to upstream as possible the
downstream patches for the RealTek PHY drivers have then been replaced
by backports of the corresponding commits in mainline Linux earlier this
year.

This has resulted in multiple reports of degraded performance, and it
took a while until I understood that Marek's version of the config_init
function only sets the PHY to perform interface mode switching if that
is indicated by host_interfaces -- and built-in PHYs will *always* use a
fixed interface mode now.

We could of course just introduce a downstream patch changing that
again, e.g. by letting the driver assume that SGMII is also supported in
case 2500Base-X is supported -- but that may not be true for all the
Ethernet MACs out there. Some might not support SGMII at all, or not be
ready to perform interface mode switching.

Hence, and we may need a way for phylink or the phy driver to
destinguish whether interface mode switching is supported by the
Ethernet MAC or not.

> 
> Unfortunately, in July 2017, there was nothing in mainline supporting
> 2500base*, and nothing doing any interface switching until I added
> the Marvell 88x3310 driver which was where _I_ proposed interface
> switching being added to phylib.

Yes, and that has worked well.
We should have a way to make use of it also on boards with on-board PHYs
which are at the same time also used on boards which rely on Ethernet
drivers which do not support SGMII, or don't support switching the
interface mode.

> 
> > > 2. On brand new PHY drivers which have no prior history, there can
> > > not be any regressions, so implementing interface switching from
> > > the very start is safe.
> > > 
> > > The only way out of this is by inventing something new for existing
> > > drivers that says "you can adopt a different behaviour" and that
> > > must be a per-platform switch - in other words, coming from the
> > > platform's firmware definition in some way.
> > 
> > Why would it not just be the MAC driver which indicates that it can
> > support switching to lower-speed interface modes it also supports?
> > Do you really believe there are boards which are electrically
> > unfit for performing SGMII on the traces intended for 2500Base-X?
> 
> For a single-lane serdes, what you say is true. However, it is not
> universal across all interface modes.

Of course not, but for single-lane serdes it could (and, if possible,
should!) be done.

> 
> As I stated in a previous discussion, if we have e.g. four lanes of
> XAUI between a PHY and MAC, and both ends support XAUI, RXAUI,
> 10GBASE-R, 5GBASE-R, 2500BASE-X, and SGMII, it does not necessarily
> follow that the platform can support 10GBASE-R and 5GBASE-R over
> a single lane because the signalling rate is so much higher. Just
> because the overall speed is the same or lower does not automatically
> mean that it can be used.
> 
> > If by 'firmware' you mean 'device tree' then we are back on square
> > one, and we would need several phy-connection-type aka. phy-mode
> > listed there.
> > 
> > After having read all the threads from 2021 you have provided links for,
> > I believe that maybe an additional property which lists the interface
> > modes to be used *optionally* and in addition to the primary (ie.
> > fastest) mode stated in phy-mode or phy-connection-type could be a way
> > out of this. It would still end up being potentially a longer list of
> > interface modes, but reality is complex! Looking at other corners of DT
> > it would still be rather simple and human readable (in contrast: take a
> > look at inhumanly long arrays of gpio-line-names where even the order
> > matters, for example...)
> > 
> > Yet, it would still be a partial violation of Device Tree rules because
> > we are (also) describing driver capabilities there then. What if, let's
> > say one day stmmac *will* support interface mode switching? Should we
> > update each and every board's device tree?
> 
> Well, I guess we need people that adopt phylink to actually implement
> it properly rather than just slapping it into their driver in a way
> that "works for them". :)

+1

> 
> This is a battle that I've been trying for years with, but programmers
> are lazy individuals who don't want to (a) read API documentation, (b)
> implement things properly.
> 
> Anyway, I'm out of time right now to continue replying to this
> conversation (it's taken over an hour so far to put what I've said
> together, and I now have a meeting... so reached the end of what I can
> do right now... so close to the end of your email too! Alas...

Thank you for putting all this time into it. I think the problem itself
is simple and well understood by both of us. It was already well
understood in 2021. There are many possible solutions to chose from:

1) List multiple phy-connection-type (aka. phy-mode) in DT, populate
   host_interfaces accordingly.

2) Let the MAC indicate that it supports mode switching, so phylink would
   populate host_interfaces accordingly.

3) Assume that MAC supports mode switching to slower, also supported,
   single-lane SerDes modes (and risk breakage because of bad drivers),
   and populate host_interfaces accordingly.

4) Change the PHY driver to assume SGMII and interface mode switching is
   always supported in case 2500Base-X is the mode declared in DT.

5) Live with bad performance of RealTek's rate matching (not acceptable
   for OpenWrt)

6) Throw all devices with cheapo PHYs or bad boot firmware in the bin
   (bad for the environment, and consumer wallets)

7) Downstream hacks (current situation in OpenWrt and vendor firmware)

Did I forget any?

3) and 4) may obviously cause regressions.
So, from what I can see, only 1) and 2) would allow us to do anything
else than 7) from OpenWrt's perspective.

I totally understand if Linux netdev maintainers chose 5) or 6),
however, for OpenWrt the result in that case would be to move the
existing downstream patches from "pending" to "hack", which means that
we will not attempt to send them upstream.
Andrew Lunn Oct. 10, 2024, 4:58 p.m. UTC | #11
> I initially tried to upstream the patch Alexander Couzens had written
> for that in 2023, and back then we assumed that interface mode switching
> would always be performed in case 2500Base-X was the main mode.
> However, I gave up at some point because I was asked for detailed
> definition of the PHY registers and bits, which, frankly speaking, just
> doesn't exist, not even in RealTek's under-NDA datasheet which merely
> lists some magic sequences of register writes.
> 
> https://patchwork.kernel.org/comment/25331309/

It is fine to document that these are magic values written into magic
registers and nobody knows what they are doing, if there is no
documentation. I just like to push people to actually check if there
is documentation, otherwise there will be more and more examples of
this magic.

I've been keeping an eye out for 'magic' registers which are actually
part of 802.3, and pushing back that these are least should be fully
documented. And it does happen, particularly in vendor submitted code
:-(

But you said the patch got merged eventually, so it seems like it
worked out in the end.

	Andrew
diff mbox series

Patch

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 4309317de3d1..5d043c47a727 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -2111,6 +2111,13 @@  int phylink_fwnode_phy_connect(struct phylink *pl,
 		pl->link_config.interface = pl->link_interface;
 	}
 
+	/* Assume SerDes interface modes share the same lanes and allow
+	 * the PHY to switch between the
+	 */
+	if (test_bit(pl->link_interface, phylink_sfp_interfaces))
+		phy_interface_and(phy_dev->host_interfaces, phylink_sfp_interfaces,
+				  pl->config->supported_interfaces);
+
 	if (pl->config->mac_requires_rxc)
 		flags |= PHY_F_RXC_ALWAYS_ON;