[net-next,v6] net: macb: Fix several edge cases in validate

Message ID	20211112190400.1937855-1-sean.anderson@seco.com (mailing list archive)
State	Accepted
Delegated to:	Netdev Maintainers
Headers	show Return-Path: <netdev-owner@kernel.org> From: Sean Anderson <sean.anderson@seco.com> To: netdev@vger.kernel.org, "David S . Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org> Cc: Claudiu Beznea <claudiu.beznea@microchip.com>, Parshuram Thombare <pthombar@cadence.com>, Antoine Tenart <atenart@kernel.org>, Nicolas Ferre <nicolas.ferre@microchip.com>, Milind Parab <mparab@cadence.com>, Russell King <linux@armlinux.org.uk>, Sean Anderson <sean.anderson@seco.com> Subject: [net-next PATCH v6] net: macb: Fix several edge cases in validate Date: Fri, 12 Nov 2021 14:04:00 -0500 Message-Id: <20211112190400.1937855-1-sean.anderson@seco.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain MIME-Version: 1.0 Precedence: bulk
Series	[net-next,v6] net: macb: Fix several edge cases in validate \| expand [net-next,v6] net: macb: Fix several edge cases in validate

Context	Check	Description
netdev/tree_selection	success	Clearly marked for net-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Single patches do not need cover letters
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	fail	Errors and warnings before: 138 this patch: 37
netdev/cc_maintainers	success	CCed 6 of 6 maintainers
netdev/build_clang	success	Errors and warnings before: 0 this patch: 0
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 25 this patch: 25
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 134 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Sean Anderson Nov. 12, 2021, 7:04 p.m. UTC

There were several cases where validate() would return bogus supported
modes with unusual combinations of interfaces and capabilities. For
example, if state->interface was 10GBASER and the macb had HIGH_SPEED
and PCS but not GIGABIT MODE, then 10/100 modes would be set anyway. In
another case, SGMII could be enabled even if the mac was not a GEM
(despite this being checked for later on in mac_config()). These
inconsistencies make it difficult to refactor this function cleanly.

There is still the open question of what exactly the requirements for
SGMII and 10GBASER are, and what SGMII actually supports. If someone
from Cadence (or anyone else with access to the GEM/MACB datasheet)
could comment on this, it would be greatly appreciated. In particular,
what is supported by Cadence vs. vendor extension/limitation?

To address this, the current logic is split into three parts. First, we
determine what we support, then we eliminate unsupported interfaces, and
finally we set the appropriate link modes. There is still some cruft
related to NA, but this can be removed in a future patch.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
---

Changes in v6:
- Fix condition for have_1g (thanks Parshuram)

Changes in v5:
- Refactor, taking into account Russell's suggestions

Changes in v4:
- Drop cleanup patch
- Refactor to just address logic issues

Changes in v3:
- Order bugfix patch first

Changes in v2:
- New

 drivers/net/ethernet/cadence/macb_main.c | 108 +++++++++++++++--------
 1 file changed, 71 insertions(+), 37 deletions(-)

Parshuram Raju Thombare Nov. 15, 2021, 1:18 p.m. UTC | #1

>There were several cases where validate() would return bogus supported
>modes with unusual combinations of interfaces and capabilities. For
>example, if state->interface was 10GBASER and the macb had HIGH_SPEED
>and PCS but not GIGABIT MODE, then 10/100 modes would be set anyway. In
>another case, SGMII could be enabled even if the mac was not a GEM
>(despite this being checked for later on in mac_config()). These
>inconsistencies make it difficult to refactor this function cleanly.
>
>There is still the open question of what exactly the requirements for
>SGMII and 10GBASER are, and what SGMII actually supports. If someone
>from Cadence (or anyone else with access to the GEM/MACB datasheet)
>could comment on this, it would be greatly appreciated. In particular,
>what is supported by Cadence vs. vendor extension/limitation?
>
>To address this, the current logic is split into three parts. First, we
>determine what we support, then we eliminate unsupported interfaces, and
>finally we set the appropriate link modes. There is still some cruft
>related to NA, but this can be removed in a future patch.
>
>Signed-off-by: Sean Anderson <sean.anderson@seco.com>

Reviewed-by: Parshuram Thombare <pthombar@cadence.com>

>---
>
>Changes in v6:
>- Fix condition for have_1g (thanks Parshuram)
>
>Changes in v5:
>- Refactor, taking into account Russell's suggestions
>
>Changes in v4:
>- Drop cleanup patch
>- Refactor to just address logic issues
>
>Changes in v3:
>- Order bugfix patch first
>
>Changes in v2:
>- New
>
> drivers/net/ethernet/cadence/macb_main.c | 108 +++++++++++++++--------
> 1 file changed, 71 insertions(+), 37 deletions(-)

Regards,
Parshuram Thombare

Russell King (Oracle) Nov. 15, 2021, 4:44 p.m. UTC | #2

On Fri, Nov 12, 2021 at 02:04:00PM -0500, Sean Anderson wrote:
> There were several cases where validate() would return bogus supported
> modes with unusual combinations of interfaces and capabilities. For
> example, if state->interface was 10GBASER and the macb had HIGH_SPEED
> and PCS but not GIGABIT MODE, then 10/100 modes would be set anyway. In
> another case, SGMII could be enabled even if the mac was not a GEM
> (despite this being checked for later on in mac_config()). These
> inconsistencies make it difficult to refactor this function cleanly.
> 
> There is still the open question of what exactly the requirements for
> SGMII and 10GBASER are, and what SGMII actually supports. If someone
> from Cadence (or anyone else with access to the GEM/MACB datasheet)
> could comment on this, it would be greatly appreciated. In particular,
> what is supported by Cadence vs. vendor extension/limitation?
> 
> To address this, the current logic is split into three parts. First, we
> determine what we support, then we eliminate unsupported interfaces, and
> finally we set the appropriate link modes. There is still some cruft
> related to NA, but this can be removed in a future patch.
> 
> Signed-off-by: Sean Anderson <sean.anderson@seco.com>

Thanks - this looks good to me.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

I've added it to my tree as I have patches that follow on which
entirely eliminate macb_validate(), replacing it with the generic
implementation that was merged into net-next today.

Jakub Kicinski Nov. 16, 2021, 1:35 a.m. UTC | #3

On Mon, 15 Nov 2021 16:44:38 +0000 Russell King (Oracle) wrote:
> On Fri, Nov 12, 2021 at 02:04:00PM -0500, Sean Anderson wrote:
> > There were several cases where validate() would return bogus supported
> > modes with unusual combinations of interfaces and capabilities. For
> > example, if state->interface was 10GBASER and the macb had HIGH_SPEED
> > and PCS but not GIGABIT MODE, then 10/100 modes would be set anyway. In
> > another case, SGMII could be enabled even if the mac was not a GEM
> > (despite this being checked for later on in mac_config()). These
> > inconsistencies make it difficult to refactor this function cleanly.
> > 
> > There is still the open question of what exactly the requirements for
> > SGMII and 10GBASER are, and what SGMII actually supports. If someone
> > from Cadence (or anyone else with access to the GEM/MACB datasheet)
> > could comment on this, it would be greatly appreciated. In particular,
> > what is supported by Cadence vs. vendor extension/limitation?
> > 
> > To address this, the current logic is split into three parts. First, we
> > determine what we support, then we eliminate unsupported interfaces, and
> > finally we set the appropriate link modes. There is still some cruft
> > related to NA, but this can be removed in a future patch.
> > 
> > Signed-off-by: Sean Anderson <sean.anderson@seco.com>  
> 
> Thanks - this looks good to me.
> 
> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Applied to net-next, thanks!

Sean Anderson Nov. 16, 2021, 10:56 p.m. UTC | #4

Hi Russell,

On 11/15/21 11:44 AM, Russell King (Oracle) wrote:
>
> Thanks - this looks good to me.
>
> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
>
> I've added it to my tree as I have patches that follow on which
> entirely eliminate macb_validate(), replacing it with the generic
> implementation that was merged into net-next today.

I have a few questions/comments about your tree (and pl in general).
This is not particularly relevant to the above patch, but this is as
good a place as any to ask.

What is the intent for the supported link modes in validate()? The docs
say

> Note that the PHY may be able to transform from one connection
> technology to another, so, eg, don't clear 1000BaseX just
> because the MAC is unable to BaseX mode. This is more about
> clearing unsupported speeds and duplex settings. The port modes
> should not be cleared; phylink_set_port_modes() will help with this.

But this is not how validate() has been/is currently implemented in many
drivers. In 34ae2c09d46a ("net: phylink: add generic validate
implementation"), it appears you are hewing closer to the documented
purpose (e.g. MAC_1000FD selects all the full-duplex 1G link modes).
Should new code try to stick to the above documentation?

Of course, the above leaves me quite confused about where the best place
is to let the PCS have a say about what things are supported, and (as
discussed below) whether it can support such a thing. The general
perspective taken in existing drivers seems to be that PCSs are
integrated with the MAC. This is in contrast to the IEEE specs, which
take the pespective that the PCS is a part of the PHY. It's unclear to
me what stance the above documentation takes.

Consider the Xilinx 1G PCS. This PCS supports 1000BASE-X and SGMII, but
only at full duplex. This caveat already rules out a completely
bitmap-based solution (as phylink_get_linkmodes thinks that both of
those interfaces are always full-duplex). There are also config options
which (either as a feature or a consequence) disable SPEED_10 SGMII or
autonegotiation (although I don't plan on supporting either of those).
The "easiest" solution is simply to provide two callbacks like

	void pcs_set_modes(struct phylink_pcs *pcs, ulong *supported,
			   phy_interface_t interface);
	bool pcs_mode_supported(struct phylink_pcs *pcs,
				phy_interface_t interface, int speed,
				int duplex);

perhaps with some generic substitutes. The former would need to be
called from mac_validate, and the latter from mac_select_pcs/
mac_prepare. This design is rather averse to genericization, so perhaps
you have some suggestion?

On the subject of PCS selection, mac_select_pcs should supply the whole
state. This is because the interface alone is insufficient to determine
which PCS to select. For example, a PCS which supports full duplex but
not half duplex should not be selected if the config specifies half
duplex. Additionally, it should also support a selection of "no pcs".
Otherwise MACs which (optionally!) have PCSs will fail to configure. We
should not fail when no PCS has yet been selected or when there is no
PCS at all in some hardware configuration.  Further, why do we have this
callback in the first place? Why don't we have drivers just do this in
prepare()?

--Sean

Russell King (Oracle) Nov. 17, 2021, 12:22 a.m. UTC | #5

On Tue, Nov 16, 2021 at 05:56:43PM -0500, Sean Anderson wrote:
> Hi Russell,
> 
> I have a few questions/comments about your tree (and pl in general).
> This is not particularly relevant to the above patch, but this is as
> good a place as any to ask.
> 
> What is the intent for the supported link modes in validate()? The docs
> say

The _link_ modes describe what gets reported to userspace via the
ethtool APIs, and therefore what appears in ethtool as the supported
and advertised capabilities for the media, whatever the "media" is
defined to be.

Generally, the "media" is what the user gets to play with to connect
two network interfaces together - so twisted pair, fibre, direct-attach
cable, etc.

> > Note that the PHY may be able to transform from one connection
> > technology to another, so, eg, don't clear 1000BaseX just
> > because the MAC is unable to BaseX mode. This is more about
> > clearing unsupported speeds and duplex settings. The port modes
> > should not be cleared; phylink_set_port_modes() will help with this.
> 
> But this is not how validate() has been/is currently implemented in many
> drivers. In 34ae2c09d46a ("net: phylink: add generic validate
> implementation"), it appears you are hewing closer to the documented
> purpose (e.g. MAC_1000FD selects all the full-duplex 1G link modes).
> Should new code try to stick to the above documentation?

I try to encourage new code to stick to this - and this is one of the
motivations behind moving to this new model, so people don't make
these kinds of mistakes.

In the case of nothing between the MAC and the media performing any
kind of speed conversion, the MAC itself doesn't have much to do with
which ethtool link modes are supported - and here's why.

Take a gigabit capable MAC that is connected via SGMII to a PHY that
supports both conventional twisted-pair media and fiber. If the
twisted-pair port is in use at 1G speeds, then we're using 1000base-T.
If the fiber port is being used, then we're using 1000base-X. The
protocol between the PHY and MAC makes no difference to what link
modes are supported.

A more extreme case could be: a 10G MAC connected to a backplane PHY
via 10G BASE-KR which is then connected to a PHY that connects to
conventional twisted-pair media.

Or a multi-speed PHY where it switches between SGMII, 2500BASE-X,
5GBASE-R, 10GBASE-R depending on the results of negotiation on the
twisted-pair media. The MAC supports operating at 10M, 100M, 1G,
2.5G, 5G, and 10G speeds, and can select between PCS that support
SGMII, 2500BASE-X, 5GBASE-R and 10GBASE-R. However, ultimately for
userspace, what matters is the media capabilities - the base-T*
ethtool link modes. 2500base-X in this situation doesn't come up
unless the PHY offers 2500base-X on the media.

The same PHY might be embedded within a SFP module, and that SFP
module might be plugged into a cage where the MAC is unable to
support the faster speeds - maybe it is only capable of up to
2.5G speeds. In which case, the system supports up to 2500BASE-T.

So you can see, the MAC side has little relevance for link modes
except for defining the speeds and duplexes that can be supported.
The type of media (-T, -X, -*R) is of no concern at this stage.

It is of little concern at the PCS except when the PCS is the
endpoint for connecting to the media (like it is in _some_ 802.3z
connections.) I say "some" because its entirely possible to use
1000base-X to talk to a PHY that connects to 1000base-T media
(and there are SFPs around that will do this by default.)

> Of course, the above leaves me quite confused about where the best place
> is to let the PCS have a say about what things are supported, and (as
> discussed below) whether it can support such a thing. The general
> perspective taken in existing drivers seems to be that PCSs are
> integrated with the MAC. This is in contrast to the IEEE specs, which
> take the pespective that the PCS is a part of the PHY. It's unclear to
> me what stance the above documentation takes.

Things can get quite complex, and I have to say the IEEE specs give
a simplified view. When you have a SGMII link to a PHY that then
connects to twisted pair media, you actually have multiple PCS:

                                 PHY
                      /----------------------\
MAC -- PCS ---<SGMII>--- PCS --- PCS --- PMA ---- media
     (sgmii)           (sgmii)   (1000baseT)

This can be seen in PHYs such as Marvell 88E151x, where the fiber
interface is re-used for SGMII, and if you read the datasheet and/or
read the fiber page registers, you find that this is being used for
the fiber side. So the PHY can be thought of two separate PHYs
back-to-back. Remember that the PCS for 1000BASE-X (and SGMII) is
different from the PCS for 1000BASE-T in IEEE802.3.

The point I'm making here is that the capability of the link between
the MAC and the PHY doesn't define what the media link modes are. It
may define the speeds and duplexes that can be supported, and that
restricts the available link modes, but it doesn't define which
media "technologies" can be supported.

Hence, for example, the validate() implementation masking out
1000base-X but leaving 1000base-T on a *GMII link is pretty silly,
because whether one or the other is supported depends ultimately
what the *GMII link ends up being connected to.

> Consider the Xilinx 1G PCS. This PCS supports 1000BASE-X and SGMII, but
> only at full duplex. This caveat already rules out a completely
> bitmap-based solution (as phylink_get_linkmodes thinks that both of
> those interfaces are always full-duplex).

I don't see why you say this rules out a bitmap solution. You say that
it only supports full-duplex, and that is catered for in the current
solution: MAC_10 for example is actually MAC_10HD | MAC_10FD - which
allows one to specify that only MAC_10FD is supported and not MAC_10HD
in such a scenario.

Hmm. Also note that the validate() callback is not going away -
phylink_generic_validate() is a generic implementation of this that
gets rid of a lot of duplication and variability of implementation
that really shouldn't be there.

There are cases where the generic implementation will not be suitable,
and for this phylink_get_linkmodes() can be called directly, or I'd
even consider making phylink_caps_to_linkmodes() available if it is
useful. Or one can do it the "old way" that we currently have.

> There are also config options
> which (either as a feature or a consequence) disable SPEED_10 SGMII or
> autonegotiation (although I don't plan on supporting either of those).
> The "easiest" solution is simply to provide two callbacks like
> 
> 	void pcs_set_modes(struct phylink_pcs *pcs, ulong *supported,
> 			   phy_interface_t interface);
> 	bool pcs_mode_supported(struct phylink_pcs *pcs,
> 				phy_interface_t interface, int speed,
> 				int duplex);
> 
> perhaps with some generic substitutes. The former would need to be
> called from mac_validate, and the latter from mac_select_pcs/
> mac_prepare. This design is rather averse to genericization, so perhaps
> you have some suggestion?

I don't have a good answer for you at the moment - the PCS support
is something that has been recently added and is still quite young,
so these are the kinds of issues I'd expect to crop up.

> On the subject of PCS selection, mac_select_pcs should supply the whole
> state.

That may seem like a good thing to ask for, but not even phylink
knows what the full state is when calling the validation function,
nor when calling mac_select_pcs.

Let's take an example of the Marvell 88X3310 multi-speed PHY, which
supports 10G, 5G, 2.5G, 1G, 100M and 10M on copper, and 1G and 100M
on fiber, and can do all of that while connected to a single serdes
connection back to the MAC. As I've said above, it does this by
switching its MAC connection under its internal firmware between
10000Base-R, 5000Base-R, 2500Base-X, and SGMII. This PHY has been
found to be used in platforms, and discovered to also be in SFP
modules. Beyond programming the overall "host interface" mode, we
don't get a choice in which mode the PHY picks - that is determined
by the results of which interface comes up and autonegotiation on
that interface.

So, if the PHY decides to link on copper at 2500BASE-T, then we end
up with the MAC link operating at 2500BASE-X, and there's nothing
we can do about that.

The only way to restrict this is to know ahead of time what the
capabilities of the MAC and PCSes are, and to restrict the link
modes that phylib gives us in both the "supported" and "advertising"
fields, so the PHY will be programmed to e.g. not support 2500BASE-T
on copper if 2500BASE-X is not supported by the PCS, or 2.5G speeds
are not supported by the MAC.

This isn't something one can do when trying to bring the link up,
it's something that needs to be done when we are "putting the system
together" - in other words, when we are binding the PHY into the
link setup.

Now, this is quite horrible right now, because for PHYs like this,
phylink just asks the MAC's validate function "give me everything
you can support" when working this out - which won't be sufficient
going forward. With some of the changes you've prompted - making
more use of the supported_interfaces bitmap, and with further
adaption of phylib to also provide that information, we can start to
work out which interface modes the PHY _could_ select, and we can then
query the validate() function for what is possible for each of those
interface modes, and use that to bound the PHY capabilities. However,
at the moment, we just don't have that information available from
phylib.

> This is because the interface alone is insufficient to determine
> which PCS to select. For example, a PCS which supports full duplex but
> not half duplex should not be selected if the config specifies half
> duplex. Additionally, it should also support a selection of "no pcs".

Right now, "no pcs" is really not an option I'm afraid. The presence
of a PCS changes the phylink behaviour slightly . This is one of my
bug-bears. The evolution of phylink has meant that we need to keep
compatibility with how phylink used to work before we split the PCS
support - and we detect that by whether there is a PCS to determine
whether we need to operate with that compatibility. It probably was
a mistake to do that in hind sight.

If we can find a way to identify the old vs new drivers that doesn't
rely on the presence of a PCS, then we should be able to fix this to
allow the PCS to "vanish" in certain modes, but I do question whether
there would be any realistic implementations using it. If we have a
PHY connected to a serdes lane back to a set of PCS to support
different protocols on the serdes, then under what scenario would we
select "no pcs" - doesn't "no pcs" in that situation mean "we don't
know what protocol to drive the serdes link" ?

> Otherwise MACs which (optionally!) have PCSs will fail to configure. We
> should not fail when no PCS has yet been selected or when there is no
> PCS at all in some hardware configuration.  Further, why do we have this
> callback in the first place? Why don't we have drivers just do this in
> prepare()?

I added mac_select_pcs() because finding out that something isn't
supported in mac_prepare() is way too late - as explained above
where I talked about binding the PHY into the link setup. E.g. if
the "system" as a whole can't operate at 2.5G speeds, then we should
not allow the PHY to advertise 2500BASE-T. It is no good advertising
2500BASE-T, then having the PHY negotiate 2500BASE-T, select 2500BASE-X,
and then have mac_prepare() decide that can't be supported. The link
won't come up, and there's nothing that can be sensibly done. The
user sees the PHY indicating link, the link partner indicates link,
but the link is non-functional. That isn't a good user experience.

Whereas, if we know ahead of time that 2.5G can't be supported, we can
remove 2500BASE-T from the advertisement, and the PHY will instead
negotiate a slower speed - resulting in a working link, albiet slower.

I hope that explains why it is so important not to error out in
mac_prepare() because something wasn't properly handled in the
validate() step.

Russell King (Oracle) Nov. 18, 2021, 3:34 p.m. UTC | #6

On Wed, Nov 17, 2021 at 12:22:26AM +0000, Russell King (Oracle) wrote:
> On Tue, Nov 16, 2021 at 05:56:43PM -0500, Sean Anderson wrote:
> > Hi Russell,
> > 
> > I have a few questions/comments about your tree (and pl in general).
> > This is not particularly relevant to the above patch, but this is as
> > good a place as any to ask.
> > 
> > What is the intent for the supported link modes in validate()? The docs
> > say
> 
> The _link_ modes describe what gets reported to userspace via the
> ethtool APIs, and therefore what appears in ethtool as the supported
> and advertised capabilities for the media, whatever the "media" is
> defined to be.
> 
> Generally, the "media" is what the user gets to play with to connect
> two network interfaces together - so twisted pair, fibre, direct-attach
> cable, etc.
> 
> > > Note that the PHY may be able to transform from one connection
> > > technology to another, so, eg, don't clear 1000BaseX just
> > > because the MAC is unable to BaseX mode. This is more about
> > > clearing unsupported speeds and duplex settings. The port modes
> > > should not be cleared; phylink_set_port_modes() will help with this.
> > 
> > But this is not how validate() has been/is currently implemented in many
> > drivers. In 34ae2c09d46a ("net: phylink: add generic validate
> > implementation"), it appears you are hewing closer to the documented
> > purpose (e.g. MAC_1000FD selects all the full-duplex 1G link modes).
> > Should new code try to stick to the above documentation?
> 
> I try to encourage new code to stick to this - and this is one of the
> motivations behind moving to this new model, so people don't make
> these kinds of mistakes.
> 
> In the case of nothing between the MAC and the media performing any
> kind of speed conversion, the MAC itself doesn't have much to do with
> which ethtool link modes are supported - and here's why.
> 
> Take a gigabit capable MAC that is connected via SGMII to a PHY that
> supports both conventional twisted-pair media and fiber. If the
> twisted-pair port is in use at 1G speeds, then we're using 1000base-T.
> If the fiber port is being used, then we're using 1000base-X. The
> protocol between the PHY and MAC makes no difference to what link
> modes are supported.
> 
> A more extreme case could be: a 10G MAC connected to a backplane PHY
> via 10G BASE-KR which is then connected to a PHY that connects to
> conventional twisted-pair media.
> 
> Or a multi-speed PHY where it switches between SGMII, 2500BASE-X,
> 5GBASE-R, 10GBASE-R depending on the results of negotiation on the
> twisted-pair media. The MAC supports operating at 10M, 100M, 1G,
> 2.5G, 5G, and 10G speeds, and can select between PCS that support
> SGMII, 2500BASE-X, 5GBASE-R and 10GBASE-R. However, ultimately for
> userspace, what matters is the media capabilities - the base-T*
> ethtool link modes. 2500base-X in this situation doesn't come up
> unless the PHY offers 2500base-X on the media.
> 
> The same PHY might be embedded within a SFP module, and that SFP
> module might be plugged into a cage where the MAC is unable to
> support the faster speeds - maybe it is only capable of up to
> 2.5G speeds. In which case, the system supports up to 2500BASE-T.
> 
> So you can see, the MAC side has little relevance for link modes
> except for defining the speeds and duplexes that can be supported.
> The type of media (-T, -X, -*R) is of no concern at this stage.
> 
> It is of little concern at the PCS except when the PCS is the
> endpoint for connecting to the media (like it is in _some_ 802.3z
> connections.) I say "some" because its entirely possible to use
> 1000base-X to talk to a PHY that connects to 1000base-T media
> (and there are SFPs around that will do this by default.)
> 
> > Of course, the above leaves me quite confused about where the best place
> > is to let the PCS have a say about what things are supported, and (as
> > discussed below) whether it can support such a thing. The general
> > perspective taken in existing drivers seems to be that PCSs are
> > integrated with the MAC. This is in contrast to the IEEE specs, which
> > take the pespective that the PCS is a part of the PHY. It's unclear to
> > me what stance the above documentation takes.
> 
> Things can get quite complex, and I have to say the IEEE specs give
> a simplified view. When you have a SGMII link to a PHY that then
> connects to twisted pair media, you actually have multiple PCS:
> 
>                                  PHY
>                       /----------------------\
> MAC -- PCS ---<SGMII>--- PCS --- PCS --- PMA ---- media
>      (sgmii)           (sgmii)   (1000baseT)
> 
> This can be seen in PHYs such as Marvell 88E151x, where the fiber
> interface is re-used for SGMII, and if you read the datasheet and/or
> read the fiber page registers, you find that this is being used for
> the fiber side. So the PHY can be thought of two separate PHYs
> back-to-back. Remember that the PCS for 1000BASE-X (and SGMII) is
> different from the PCS for 1000BASE-T in IEEE802.3.
> 
> The point I'm making here is that the capability of the link between
> the MAC and the PHY doesn't define what the media link modes are. It
> may define the speeds and duplexes that can be supported, and that
> restricts the available link modes, but it doesn't define which
> media "technologies" can be supported.
> 
> Hence, for example, the validate() implementation masking out
> 1000base-X but leaving 1000base-T on a *GMII link is pretty silly,
> because whether one or the other is supported depends ultimately
> what the *GMII link ends up being connected to.
> 
> > Consider the Xilinx 1G PCS. This PCS supports 1000BASE-X and SGMII, but
> > only at full duplex. This caveat already rules out a completely
> > bitmap-based solution (as phylink_get_linkmodes thinks that both of
> > those interfaces are always full-duplex).
> 
> I don't see why you say this rules out a bitmap solution. You say that
> it only supports full-duplex, and that is catered for in the current
> solution: MAC_10 for example is actually MAC_10HD | MAC_10FD - which
> allows one to specify that only MAC_10FD is supported and not MAC_10HD
> in such a scenario.
> 
> Hmm. Also note that the validate() callback is not going away -
> phylink_generic_validate() is a generic implementation of this that
> gets rid of a lot of duplication and variability of implementation
> that really shouldn't be there.
> 
> There are cases where the generic implementation will not be suitable,
> and for this phylink_get_linkmodes() can be called directly, or I'd
> even consider making phylink_caps_to_linkmodes() available if it is
> useful. Or one can do it the "old way" that we currently have.
> 
> > There are also config options
> > which (either as a feature or a consequence) disable SPEED_10 SGMII or
> > autonegotiation (although I don't plan on supporting either of those).
> > The "easiest" solution is simply to provide two callbacks like
> > 
> > 	void pcs_set_modes(struct phylink_pcs *pcs, ulong *supported,
> > 			   phy_interface_t interface);
> > 	bool pcs_mode_supported(struct phylink_pcs *pcs,
> > 				phy_interface_t interface, int speed,
> > 				int duplex);
> > 
> > perhaps with some generic substitutes. The former would need to be
> > called from mac_validate, and the latter from mac_select_pcs/
> > mac_prepare. This design is rather averse to genericization, so perhaps
> > you have some suggestion?
> 
> I don't have a good answer for you at the moment - the PCS support
> is something that has been recently added and is still quite young,
> so these are the kinds of issues I'd expect to crop up.
> 
> > On the subject of PCS selection, mac_select_pcs should supply the whole
> > state.
> 
> That may seem like a good thing to ask for, but not even phylink
> knows what the full state is when calling the validation function,
> nor when calling mac_select_pcs.
> 
> Let's take an example of the Marvell 88X3310 multi-speed PHY, which
> supports 10G, 5G, 2.5G, 1G, 100M and 10M on copper, and 1G and 100M
> on fiber, and can do all of that while connected to a single serdes
> connection back to the MAC. As I've said above, it does this by
> switching its MAC connection under its internal firmware between
> 10000Base-R, 5000Base-R, 2500Base-X, and SGMII. This PHY has been
> found to be used in platforms, and discovered to also be in SFP
> modules. Beyond programming the overall "host interface" mode, we
> don't get a choice in which mode the PHY picks - that is determined
> by the results of which interface comes up and autonegotiation on
> that interface.
> 
> So, if the PHY decides to link on copper at 2500BASE-T, then we end
> up with the MAC link operating at 2500BASE-X, and there's nothing
> we can do about that.
> 
> The only way to restrict this is to know ahead of time what the
> capabilities of the MAC and PCSes are, and to restrict the link
> modes that phylib gives us in both the "supported" and "advertising"
> fields, so the PHY will be programmed to e.g. not support 2500BASE-T
> on copper if 2500BASE-X is not supported by the PCS, or 2.5G speeds
> are not supported by the MAC.
> 
> This isn't something one can do when trying to bring the link up,
> it's something that needs to be done when we are "putting the system
> together" - in other words, when we are binding the PHY into the
> link setup.
> 
> Now, this is quite horrible right now, because for PHYs like this,
> phylink just asks the MAC's validate function "give me everything
> you can support" when working this out - which won't be sufficient
> going forward. With some of the changes you've prompted - making
> more use of the supported_interfaces bitmap, and with further
> adaption of phylib to also provide that information, we can start to
> work out which interface modes the PHY _could_ select, and we can then
> query the validate() function for what is possible for each of those
> interface modes, and use that to bound the PHY capabilities. However,
> at the moment, we just don't have that information available from
> phylib.
> 
> > This is because the interface alone is insufficient to determine
> > which PCS to select. For example, a PCS which supports full duplex but
> > not half duplex should not be selected if the config specifies half
> > duplex. Additionally, it should also support a selection of "no pcs".
> 
> Right now, "no pcs" is really not an option I'm afraid. The presence
> of a PCS changes the phylink behaviour slightly . This is one of my
> bug-bears. The evolution of phylink has meant that we need to keep
> compatibility with how phylink used to work before we split the PCS
> support - and we detect that by whether there is a PCS to determine
> whether we need to operate with that compatibility. It probably was
> a mistake to do that in hind sight.
> 
> If we can find a way to identify the old vs new drivers that doesn't
> rely on the presence of a PCS, then we should be able to fix this to
> allow the PCS to "vanish" in certain modes, but I do question whether
> there would be any realistic implementations using it. If we have a
> PHY connected to a serdes lane back to a set of PCS to support
> different protocols on the serdes, then under what scenario would we
> select "no pcs" - doesn't "no pcs" in that situation mean "we don't
> know what protocol to drive the serdes link" ?
> 
> > Otherwise MACs which (optionally!) have PCSs will fail to configure. We
> > should not fail when no PCS has yet been selected or when there is no
> > PCS at all in some hardware configuration.  Further, why do we have this
> > callback in the first place? Why don't we have drivers just do this in
> > prepare()?
> 
> I added mac_select_pcs() because finding out that something isn't
> supported in mac_prepare() is way too late - as explained above
> where I talked about binding the PHY into the link setup. E.g. if
> the "system" as a whole can't operate at 2.5G speeds, then we should
> not allow the PHY to advertise 2500BASE-T. It is no good advertising
> 2500BASE-T, then having the PHY negotiate 2500BASE-T, select 2500BASE-X,
> and then have mac_prepare() decide that can't be supported. The link
> won't come up, and there's nothing that can be sensibly done. The
> user sees the PHY indicating link, the link partner indicates link,
> but the link is non-functional. That isn't a good user experience.
> 
> Whereas, if we know ahead of time that 2.5G can't be supported, we can
> remove 2500BASE-T from the advertisement, and the PHY will instead
> negotiate a slower speed - resulting in a working link, albiet slower.
> 
> I hope that explains why it is so important not to error out in
> mac_prepare() because something wasn't properly handled in the
> validate() step.

What I haven't described in the above (it was rather late when I was
writing that email!) is where we need to head with a PHY that does
rate adaption - and yet again an example of such a PHY is the 88X3310.
This is actually a good example for many of the issues right now.

If the 88X3310 is configured to have a host interface that only
supports 10GBASE-R, then rate adaption within the PHY is activated,
meaning the PHY is able to operate at, e.g. 10BASE-T while the host
MAC operates at 10GBASE-R. There are some provisos to this though:

1) If the 88X3310 supports MACSEC, then it has internal MACs that
   are able to generate pause frames, and pause frames will be sent
   over the 10GBASE-R link as necessary to control the rate at which
   the MAC sends packets.

2) If the 88X3310 does not support MACSEC, then it is expected that
   the MAC is paced according to the link speed to avoid overflowing
   the 88X3310 internal FIFOs (what happens when the internal FIFOs
   overflow is not known.) There are no pause frames generated.
   (This is the case on Macchiatobin boards if we configured the PHY
   for 10GBASE-R rate-adaption mode.)

We have no "real" support for rate adaption at either phylib or phylink
level - phylib has no way to tell us whether rate adaption is enabled
on the PHY, nor does it have a way to tell us if we either need to pace
the MAC or whether to expect pause frames from the PHY.

If we have a PHY in rate adaption mode, the current behaviour will be
that mac_link_up() and pcs_link_up() will be passed the negotiated
media parameters as "speed", "duplex" and any flow control information,
which will confuse PCS and MAC drivers at the moment, because it isn't
something they expect to happen. What I mean is, if we are using
PHY_INTERFACE_MODE_10GBASER, then most people will expect "speed" to be
SPEED_10000, but with a rate adapting PHY it may not be.

In order to properly support this, we need to update the documentation
at the very least to say that what gets passed to mac_link_up() for
"speed" and "duplex" are the media negotiated parameters. Then we need
to have a think about how to handle flow control, and this is where
extending phylib to tell us whether the PHY supports rate adaption
becomes important. Flow control on the MAC needs to be enabled if (the
PHY has rate adaption disabled but the media negotiated flow control)
or (the PHY has rate adaption enabled _and_ the PHY is capable of
issuing flow control frames - presumably the PHY will respond itself
to flow control) or (the PHY has rate adaption enabled and the media
negotiated flow control but the PHY is not capable of issuing flow
control frames).

Then there's the issue of implementing transmission pacing in any MAC
driver that wants to be usable with a rate adapting PHY.

Lastly, there's the issue of the "speed" and "duplex" parameters passed
to pcs_link_up(), which I'm currently thinking should be the interface
parameters and not the media parameters. In other words, if it's a
10GBASE-R connection between the PHY and PCS, we really should not be
passing the media negotiated speed there.

So, to sum up, rate adaption isn't something that is well supported in
the kernel - it's possible to bodge around phylib and phylink to make
it work, but this really needs to be handled properly.


Rate adaption is fairly low priority at the moment as it is in a
minority, although it seems we are seeing more systems that have PHYs
with this feature.

So, I hope these two emails have provides some useful insights.

Sean Anderson Nov. 18, 2021, 8:20 p.m. UTC | #7

Hi Russell,

On 11/18/21 10:34 AM, Russell King (Oracle) wrote:
> On Wed, Nov 17, 2021 at 12:22:26AM +0000, Russell King (Oracle) wrote:
>> On Tue, Nov 16, 2021 at 05:56:43PM -0500, Sean Anderson wrote:
>> > Hi Russell,
>> >
>> > I have a few questions/comments about your tree (and pl in general).
>> > This is not particularly relevant to the above patch, but this is as
>> > good a place as any to ask.
>> >
>> > What is the intent for the supported link modes in validate()? The docs
>> > say
>>
>> The _link_ modes describe what gets reported to userspace via the
>> ethtool APIs, and therefore what appears in ethtool as the supported
>> and advertised capabilities for the media, whatever the "media" is
>> defined to be.
>>
>> Generally, the "media" is what the user gets to play with to connect
>> two network interfaces together - so twisted pair, fibre, direct-attach
>> cable, etc.
>>
>> > > Note that the PHY may be able to transform from one connection
>> > > technology to another, so, eg, don't clear 1000BaseX just
>> > > because the MAC is unable to BaseX mode. This is more about
>> > > clearing unsupported speeds and duplex settings. The port modes
>> > > should not be cleared; phylink_set_port_modes() will help with this.
>> >
>> > But this is not how validate() has been/is currently implemented in many
>> > drivers. In 34ae2c09d46a ("net: phylink: add generic validate
>> > implementation"), it appears you are hewing closer to the documented
>> > purpose (e.g. MAC_1000FD selects all the full-duplex 1G link modes).
>> > Should new code try to stick to the above documentation?
>>
>> I try to encourage new code to stick to this - and this is one of the
>> motivations behind moving to this new model, so people don't make
>> these kinds of mistakes.
>>
>> In the case of nothing between the MAC and the media performing any
>> kind of speed conversion, the MAC itself doesn't have much to do with
>> which ethtool link modes are supported - and here's why.
>>
>> Take a gigabit capable MAC that is connected via SGMII to a PHY that
>> supports both conventional twisted-pair media and fiber. If the
>> twisted-pair port is in use at 1G speeds, then we're using 1000base-T.
>> If the fiber port is being used, then we're using 1000base-X. The
>> protocol between the PHY and MAC makes no difference to what link
>> modes are supported.
>>
>> A more extreme case could be: a 10G MAC connected to a backplane PHY
>> via 10G BASE-KR which is then connected to a PHY that connects to
>> conventional twisted-pair media.
>>
>> Or a multi-speed PHY where it switches between SGMII, 2500BASE-X,
>> 5GBASE-R, 10GBASE-R depending on the results of negotiation on the
>> twisted-pair media. The MAC supports operating at 10M, 100M, 1G,
>> 2.5G, 5G, and 10G speeds, and can select between PCS that support
>> SGMII, 2500BASE-X, 5GBASE-R and 10GBASE-R. However, ultimately for
>> userspace, what matters is the media capabilities - the base-T*
>> ethtool link modes. 2500base-X in this situation doesn't come up
>> unless the PHY offers 2500base-X on the media.
>>
>> The same PHY might be embedded within a SFP module, and that SFP
>> module might be plugged into a cage where the MAC is unable to
>> support the faster speeds - maybe it is only capable of up to
>> 2.5G speeds. In which case, the system supports up to 2500BASE-T.
>>
>> So you can see, the MAC side has little relevance for link modes
>> except for defining the speeds and duplexes that can be supported.
>> The type of media (-T, -X, -*R) is of no concern at this stage.
>>
>> It is of little concern at the PCS except when the PCS is the
>> endpoint for connecting to the media (like it is in _some_ 802.3z
>> connections.) I say "some" because its entirely possible to use
>> 1000base-X to talk to a PHY that connects to 1000base-T media
>> (and there are SFPs around that will do this by default.)

Of course, since 1000BASE-X is not an electrical specification, this is
really more like using 1000BASE-CX to 1000BASE-T :)

>> > Of course, the above leaves me quite confused about where the best place
>> > is to let the PCS have a say about what things are supported, and (as
>> > discussed below) whether it can support such a thing. The general
>> > perspective taken in existing drivers seems to be that PCSs are
>> > integrated with the MAC. This is in contrast to the IEEE specs, which
>> > take the pespective that the PCS is a part of the PHY. It's unclear to
>> > me what stance the above documentation takes.
>>
>> Things can get quite complex, and I have to say the IEEE specs give
>> a simplified view. When you have a SGMII link to a PHY that then
>> connects to twisted pair media, you actually have multiple PCS:
>>
>>                                  PHY
>>                       /----------------------\
>> MAC -- PCS ---<SGMII>--- PCS --- PCS --- PMA ---- media
>>      (sgmii)           (sgmii)   (1000baseT)
>>
>> This can be seen in PHYs such as Marvell 88E151x, where the fiber
>> interface is re-used for SGMII, and if you read the datasheet and/or
>> read the fiber page registers, you find that this is being used for
>> the fiber side. So the PHY can be thought of two separate PHYs
>> back-to-back. Remember that the PCS for 1000BASE-X (and SGMII) is
>> different from the PCS for 1000BASE-T in IEEE802.3.

Right and this is a bit of the source of the confusion. There are
different levels/layers of PHYs all with their own PCS/PMA/PMD stack.
Depending on what perspective you take at the time, some of these can be
subsumed into each other.

>> The point I'm making here is that the capability of the link between
>> the MAC and the PHY doesn't define what the media link modes are. It
>> may define the speeds and duplexes that can be supported, and that
>> restricts the available link modes, but it doesn't define which
>> media "technologies" can be supported.

Right. IMO there is a lot of conflation of this concept in the current
net subsystem. Realistically, the MAC should only be concerned with the
phy interface mode, and perhaps duplex and speed. But! This should be
the interface mode needed to talk to *the next stage in the signal
path*. That is, if the MAC has GMII output and needs a separate PCS to
talk 1000BASE-X or SGMII, it should only report GMII. And then the PCS
can say what kind of interface it supports. However, the current model
assumes that the PCS is tightly integrated, so these sorts of things are
not modeled well. I don't know whether the above change would be
feasable at all. Ideally, validate() would talk about interfaces modes
and not link modes.

>> Hence, for example, the validate() implementation masking out
>> 1000base-X but leaving 1000base-T on a *GMII link is pretty silly,
>> because whether one or the other is supported depends ultimately
>> what the *GMII link ends up being connected to.
>>
>> > Consider the Xilinx 1G PCS. This PCS supports 1000BASE-X and SGMII, but
>> > only at full duplex. This caveat already rules out a completely
>> > bitmap-based solution (as phylink_get_linkmodes thinks that both of
>> > those interfaces are always full-duplex).
>>
>> I don't see why you say this rules out a bitmap solution. You say that
>> it only supports full-duplex, and that is catered for in the current
>> solution: MAC_10 for example is actually MAC_10HD | MAC_10FD - which
>> allows one to specify that only MAC_10FD is supported and not MAC_10HD
>> in such a scenario.

Say that you are a MAC with an integrated PCS (so you handle everything
in the MAC driver). You support GMII full and half duplex, but your PCS
only supports 1000BASE-X with full duplex. The naïve bitmap is

supported_interfaces = PHY_INTERFACE_GMII | PHY_INTERFACE_1000BASEX;
mac_capabilities = MAC_10 | MAC_100 | MAC_1000;

but this will report 1000BASE-X as supporting both full and half duplex.
So you still need a custom validate() in order to report the correct link
modes.

The tricky part comes in a scenario where the exact MAC is determined at
runtime, such as the MACB+Xilinx PCS configuration.

>> Hmm. Also note that the validate() callback is not going away -
>> phylink_generic_validate() is a generic implementation of this that
>> gets rid of a lot of duplication and variability of implementation
>> that really shouldn't be there.
>>
>> There are cases where the generic implementation will not be suitable,
>> and for this phylink_get_linkmodes() can be called directly, or I'd
>> even consider making phylink_caps_to_linkmodes() available if it is
>> useful. Or one can do it the "old way" that we currently have.
>>
>> > There are also config options
>> > which (either as a feature or a consequence) disable SPEED_10 SGMII or
>> > autonegotiation (although I don't plan on supporting either of those).
>> > The "easiest" solution is simply to provide two callbacks like
>> >
>> > 	void pcs_set_modes(struct phylink_pcs *pcs, ulong *supported,
>> > 			   phy_interface_t interface);
>> > 	bool pcs_mode_supported(struct phylink_pcs *pcs,
>> > 				phy_interface_t interface, int speed,
>> > 				int duplex);
>> >
>> > perhaps with some generic substitutes. The former would need to be
>> > called from mac_validate, and the latter from mac_select_pcs/
>> > mac_prepare. This design is rather averse to genericization, so perhaps
>> > you have some suggestion?
>>
>> I don't have a good answer for you at the moment - the PCS support
>> is something that has been recently added and is still quite young,
>> so these are the kinds of issues I'd expect to crop up.
>>
>> > On the subject of PCS selection, mac_select_pcs should supply the whole
>> > state.
>>
>> That may seem like a good thing to ask for, but not even phylink
>> knows what the full state is when calling the validation function,
>> nor when calling mac_select_pcs.
>>
>> Let's take an example of the Marvell 88X3310 multi-speed PHY, which
>> supports 10G, 5G, 2.5G, 1G, 100M and 10M on copper, and 1G and 100M
>> on fiber, and can do all of that while connected to a single serdes
>> connection back to the MAC. As I've said above, it does this by
>> switching its MAC connection under its internal firmware between
>> 10000Base-R, 5000Base-R, 2500Base-X, and SGMII. This PHY has been
>> found to be used in platforms, and discovered to also be in SFP
>> modules. Beyond programming the overall "host interface" mode, we
>> don't get a choice in which mode the PHY picks - that is determined
>> by the results of which interface comes up and autonegotiation on
>> that interface.
>>
>> So, if the PHY decides to link on copper at 2500BASE-T, then we end
>> up with the MAC link operating at 2500BASE-X, and there's nothing
>> we can do about that.
>>
>> The only way to restrict this is to know ahead of time what the
>> capabilities of the MAC and PCSes are, and to restrict the link
>> modes that phylib gives us in both the "supported" and "advertising"
>> fields, so the PHY will be programmed to e.g. not support 2500BASE-T
>> on copper if 2500BASE-X is not supported by the PCS, or 2.5G speeds
>> are not supported by the MAC.
>>
>> This isn't something one can do when trying to bring the link up,
>> it's something that needs to be done when we are "putting the system
>> together" - in other words, when we are binding the PHY into the
>> link setup.
>>
>> Now, this is quite horrible right now, because for PHYs like this,
>> phylink just asks the MAC's validate function "give me everything
>> you can support" when working this out - which won't be sufficient
>> going forward. With some of the changes you've prompted - making
>> more use of the supported_interfaces bitmap, and with further
>> adaption of phylib to also provide that information, we can start to
>> work out which interface modes the PHY _could_ select, and we can then
>> query the validate() function for what is possible for each of those
>> interface modes, and use that to bound the PHY capabilities. However,
>> at the moment, we just don't have that information available from
>> phylib.
>>
>> > This is because the interface alone is insufficient to determine
>> > which PCS to select. For example, a PCS which supports full duplex but
>> > not half duplex should not be selected if the config specifies half
>> > duplex. Additionally, it should also support a selection of "no pcs".
>>
>> Right now, "no pcs" is really not an option I'm afraid. The presence
>> of a PCS changes the phylink behaviour slightly . This is one of my
>> bug-bears. The evolution of phylink has meant that we need to keep
>> compatibility with how phylink used to work before we split the PCS
>> support - and we detect that by whether there is a PCS to determine
>> whether we need to operate with that compatibility. It probably was
>> a mistake to do that in hind sight.

Of course it's an option :)

Consider the MACB driver. It has two PCSs in some configurations, and
thus is a natural target for implementing the select_pcs callback.
However, in some other configurations, it has no PCSs at all. The only
way to implement select_pcs in the current design is to have two sets of
phylink_mac_ops: one with select_pcs populated, and one with it set to
NULL.

The correct check is probably something like

	if (pl->mac_ops->mac_select_pcs) {
		pcs = pl->mac_ops->mac_select_pcs(pl->config, state->interface);
		if (!pcs && pl->pcs)
			phylink_err(pl, "mac_select_pcs unexpectedly failed\n");
		else if (pcs)
			phylink_set_pcs(pl, pcs);
	}

>> If we can find a way to identify the old vs new drivers that doesn't
>> rely on the presence of a PCS, then we should be able to fix this to
>> allow the PCS to "vanish" in certain modes, but I do question whether
>> there would be any realistic implementations using it. If we have a
>> PHY connected to a serdes lane back to a set of PCS to support
>> different protocols on the serdes, then under what scenario would we
>> select "no pcs" - doesn't "no pcs" in that situation mean "we don't
>> know what protocol to drive the serdes link" ?
>>
>> > Otherwise MACs which (optionally!) have PCSs will fail to configure. We
>> > should not fail when no PCS has yet been selected or when there is no
>> > PCS at all in some hardware configuration.  Further, why do we have this
>> > callback in the first place? Why don't we have drivers just do this in
>> > prepare()?
>>
>> I added mac_select_pcs() because finding out that something isn't
>> supported in mac_prepare() is way too late - as explained above
>> where I talked about binding the PHY into the link setup. E.g. if
>> the "system" as a whole can't operate at 2.5G speeds, then we should
>> not allow the PHY to advertise 2500BASE-T. It is no good advertising
>> 2500BASE-T, then having the PHY negotiate 2500BASE-T, select 2500BASE-X,
>> and then have mac_prepare() decide that can't be supported. The link
>> won't come up, and there's nothing that can be sensibly done. The
>> user sees the PHY indicating link, the link partner indicates link,
>> but the link is non-functional. That isn't a good user experience.
>>
>> Whereas, if we know ahead of time that 2.5G can't be supported, we can
>> remove 2500BASE-T from the advertisement, and the PHY will instead
>> negotiate a slower speed - resulting in a working link, albiet slower.

AIUI it's a bug in the driver to advertise something in validate() which
it can't support. So we don't necessarily need a separate callback.

>> I hope that explains why it is so important not to error out in
>> mac_prepare() because something wasn't properly handled in the
>> validate() step.

Specifically, "I don't need a PCS for this mode" should be a valid
response. I agree that "I can't select a PCS" doesn't make sense.

> What I haven't described in the above (it was rather late when I was
> writing that email!) is where we need to head with a PHY that does
> rate adaption - and yet again an example of such a PHY is the 88X3310.
> This is actually a good example for many of the issues right now.
>
> If the 88X3310 is configured to have a host interface that only
> supports 10GBASE-R, then rate adaption within the PHY is activated,
> meaning the PHY is able to operate at, e.g. 10BASE-T while the host
> MAC operates at 10GBASE-R. There are some provisos to this though:
>
> 1) If the 88X3310 supports MACSEC, then it has internal MACs that
>     are able to generate pause frames, and pause frames will be sent
>     over the 10GBASE-R link as necessary to control the rate at which
>     the MAC sends packets.
>
> 2) If the 88X3310 does not support MACSEC, then it is expected that
>     the MAC is paced according to the link speed to avoid overflowing
>     the 88X3310 internal FIFOs (what happens when the internal FIFOs
>     overflow is not known.) There are no pause frames generated.
>     (This is the case on Macchiatobin boards if we configured the PHY
>     for 10GBASE-R rate-adaption mode.)

Well if it just drops/corrupts packets then it just looks like a
lossy/congested link. And the upper layers of the network stack already
expect that sort of thing (though perhaps not optimally).

> We have no "real" support for rate adaption at either phylib or phylink
> level - phylib has no way to tell us whether rate adaption is enabled
> on the PHY, nor does it have a way to tell us if we either need to pace
> the MAC or whether to expect pause frames from the PHY.
>
> If we have a PHY in rate adaption mode, the current behaviour will be
> that mac_link_up() and pcs_link_up() will be passed the negotiated
> media parameters as "speed", "duplex" and any flow control information,
> which will confuse PCS and MAC drivers at the moment, because it isn't
> something they expect to happen. What I mean is, if we are using
> PHY_INTERFACE_MODE_10GBASER, then most people will expect "speed" to be
> SPEED_10000, but with a rate adapting PHY it may not be.
>
> In order to properly support this, we need to update the documentation
> at the very least to say that what gets passed to mac_link_up() for
> "speed" and "duplex" are the media negotiated parameters. Then we need
> to have a think about how to handle flow control, and this is where
> extending phylib to tell us whether the PHY supports rate adaption
> becomes important. Flow control on the MAC needs to be enabled if (the
> PHY has rate adaption disabled but the media negotiated flow control)
> or (the PHY has rate adaption enabled _and_ the PHY is capable of
> issuing flow control frames - presumably the PHY will respond itself
> to flow control) or (the PHY has rate adaption enabled and the media
> negotiated flow control but the PHY is not capable of issuing flow
> control frames).
>
> Then there's the issue of implementing transmission pacing in any MAC
> driver that wants to be usable with a rate adapting PHY.
>
> Lastly, there's the issue of the "speed" and "duplex" parameters passed
> to pcs_link_up(), which I'm currently thinking should be the interface
> parameters and not the media parameters. In other words, if it's a
> 10GBASE-R connection between the PHY and PCS, we really should not be
> passing the media negotiated speed there.

Right.

> So, to sum up, rate adaption isn't something that is well supported in
> the kernel - it's possible to bodge around phylib and phylink to make
> it work, but this really needs to be handled properly.
>
>
> Rate adaption is fairly low priority at the moment as it is in a
> minority, although it seems we are seeing more systems that have PHYs
> with this feature.
>
> So, I hope these two emails have provides some useful insights.

They have, thanks.

--Sean

Russell King (Oracle) Nov. 19, 2021, 6:56 p.m. UTC | #8

Hi Sean,

On Thu, Nov 18, 2021 at 03:20:18PM -0500, Sean Anderson wrote:
> Hi Russell,
> 
> On 11/18/21 10:34 AM, Russell King (Oracle) wrote:
> > On Wed, Nov 17, 2021 at 12:22:26AM +0000, Russell King (Oracle) wrote:
> > > It is of little concern at the PCS except when the PCS is the
> > > endpoint for connecting to the media (like it is in _some_ 802.3z
> > > connections.) I say "some" because its entirely possible to use
> > > 1000base-X to talk to a PHY that connects to 1000base-T media
> > > (and there are SFPs around that will do this by default.)
> 
> Of course, since 1000BASE-X is not an electrical specification, this is
> really more like using 1000BASE-CX to 1000BASE-T :)

I'd disagree with that statement for the following reason:

With reference to the IEEE 802.3 1000BASE-X sublayers that are
described in diagram 36-1, section 38 and section 39.

Practically, a SoC implements a 1000BASE-X PCS, and a serdes block
that produces and receives the appropriate waveform on a balanced
pair, one for transmit and one for receive.

These balanced pairs can be routed to a 1000BASE-SX transceiver,
1000BASE-LX transceiver, or a 1000BASE-CX cable. One very likely
scenario where this can be seen practically is with a SFP cage.
The balanced pairs are routed to the SFP cage, and the medium is
entirely dependent on what kind of SFP is plugged in - whether
it is a LX, SX or CX compliant module.

Section 39.3 clearly defines 1000BASE-CX in terms of TP2 and TP3.
Section 38.2.1 is similar except for 1000BASE-SX/LX. Both states
that the electrical specification at points TP1 and TP4 are
electrical, but are not compliance points, and their parameters
are unspecified by IEEE 802.3.

How, this point (TP1/TP4) (which is about the division point
between the PMA and PMD) can also be connected to a PHY or DSA
switch. This doesn't change what TP1/TP4 is.

So, while it may be tempting to call the 1000BASE-X serdes
connection between the PMA and whatever it's connected to
1000BASE-CX, it is incorrect to do so unless it really does
consist of a jumper cable, as called out by section 39.1.

It may be that the electrical specifications of 1000BASE-CX at
TP2 and TP3 are the same as what can be measured at TP1 and TP4,
but that doesn't in my mind make it 1000BASE-CX.

Also note that 1000BASE-CX has a minimum operating distance of
0.1m, which is probably hard to achieve on many PCBs.

> > > Things can get quite complex, and I have to say the IEEE specs give
> > > a simplified view. When you have a SGMII link to a PHY that then
> > > connects to twisted pair media, you actually have multiple PCS:
> > > 
> > >                                  PHY
> > >                       /----------------------\
> > > MAC -- PCS ---<SGMII>--- PCS --- PCS --- PMA ---- media
> > >      (sgmii)           (sgmii)   (1000baseT)
> > > 
> > > This can be seen in PHYs such as Marvell 88E151x, where the fiber
> > > interface is re-used for SGMII, and if you read the datasheet and/or
> > > read the fiber page registers, you find that this is being used for
> > > the fiber side. So the PHY can be thought of two separate PHYs
> > > back-to-back. Remember that the PCS for 1000BASE-X (and SGMII) is
> > > different from the PCS for 1000BASE-T in IEEE802.3.
> 
> Right and this is a bit of the source of the confusion. There are
> different levels/layers of PHYs all with their own PCS/PMA/PMD stack.
> Depending on what perspective you take at the time, some of these can be
> subsumed into each other.

Right. One thing I didn't mention in the above is that such a PHY
looks after its MAC-facing PCS - there is generally no need for
management software to touch that PCS, since it is generally
configured by the hardware strapping of the PHY itself. Sometimes
(such as Marek's situation where he wishes to switch the operating
mode of the PHY from 10GBASE-R with rate adaption to USXGMII) it
may be required to reconfigure the PHY, but in terms of overall
link management, this PCS tends to be transparent.

Therefore, currently, there is no need to model this PCS, and
trying to do so until there is a need would add extra unnecessary
complexity - put another way, it would be over-design.

> > > The point I'm making here is that the capability of the link between
> > > the MAC and the PHY doesn't define what the media link modes are. It
> > > may define the speeds and duplexes that can be supported, and that
> > > restricts the available link modes, but it doesn't define which
> > > media "technologies" can be supported.
> 
> Right. IMO there is a lot of conflation of this concept in the current
> net subsystem. Realistically, the MAC should only be concerned with the
> phy interface mode, and perhaps duplex and speed. But! This should be
> the interface mode needed to talk to *the next stage in the signal
> path*.

You may be surprised that I agree with you, and this is the direction
I'm trying to move phylink towards. Before March 2020, phylink didn't
even consider there to be a separate PCS, so the separated PCS with
its own "ops" structure is really rather new and young.

However, we have to take all the current users along with us and keep
them working while we make changes - we can't just break everything
to implement an entirely new model. Consequently, change takes time,
effort and care.

> That is, if the MAC has GMII output and needs a separate PCS to
> talk 1000BASE-X or SGMII, it should only report GMII. And then the PCS
> can say what kind of interface it supports. However, the current model
> assumes that the PCS is tightly integrated, so these sorts of things are
> not modeled well. I don't know whether the above change would be
> feasable at all. Ideally, validate() would talk about interfaces modes
> and not link modes.

It is tightly integrated at the moment precisely because phylink has
its origins in the Marvell platforms of about September 2015, where
there was no separately identifyable PCS block despite it having
gigabit serdes and being capable of SGMII, 1000BASE-X and 2500BASE-X
on that serdes, as well as RGMII etc. Even in RGMII mode, everything
most of the configuration is hidden behind one register.

I do believe that the hardware does indeed have a PCS block - it's
just that Marvell decided to have one register that controls both the
MAC and PCS as if they are a single piece of hardware.

> Say that you are a MAC with an integrated PCS (so you handle everything
> in the MAC driver). You support GMII full and half duplex, but your PCS
> only supports 1000BASE-X with full duplex. The naïve bitmap is
> 
> supported_interfaces = PHY_INTERFACE_GMII | PHY_INTERFACE_1000BASEX;
> mac_capabilities = MAC_10 | MAC_100 | MAC_1000;
> 
> but this will report 1000BASE-X as supporting both full and half duplex.
> So you still need a custom validate() in order to report the correct link
> modes.

First, let me say that I've now been through all phylink users in the
kernel, converting them, and there are seven cases where the generic
helper can't be used. The validate() method isn't going away (at least
not just yet) which allows unexpected situations to still be handled.
The current state of this conversion is as follows.

In terms of network drivers:

mvneta and mvpp2: the generic helper is used but needs to be wrapped
  since these specify that negotiation must not be disabled in
  "1000BASE-X" mode, which includes the up-clocked 2500BASE-X.

stmmac: when there is no PCS attached, there is no clues from the
  driver which PHY interface modes are supported. I am in discussion
  with Jose about this.

In terms of DSA drivers:

bcm_sf2: pause mode support appears to be dependent on the interface
  mode.

  I have loads of questions about this driver in general, and have had
  for some time. E.g. bcm_sf2_sw_mac_config() makes it look like
  although RGMII is supported, RGMII_RXID and RGMII_ID are not.
  However, it has been established that the RGMII* modes defines the
  delays on the PHY end of the link, so this should be of no
  consequence on the MAC end... I regard this driver as needing
  attention, and probably a poor choice to draw any conclusions from.

hellcreek: this can probably be converted - it looks like it only ever
  contains copper PHYs which operate at a single fixed speed. phylib
  should restrict the link modes to baseT modes based on the PHYs,
  so specifying MAC_100FD or MAC_1000FD depending on
  hellcreek->pdata->is_100_mbits gets us the MAC capabilities.
  However, it is unknown which PHY interface mode is used in this
  driver, consequently I can't populate the supported_interfaces
  without assistance from the author.

mt7530: apparently, autoneg is not supported on TRGMII nor 802.3z
  interface modes on these switches. I don't yet know enough about
  TRGMII to comment on that, but it's a little surprising for
  802.3z (aka 1000BASE-X) because it means it restricts its use
  with fiber.

xrs700x: absolutely no information on which interface modes may be
  supported - the driver makes no attempt to check the interface
  mode anywhere in its code - it will accept anything we throw at
  it. I haven't researched to find any public information on it
  that may throw light on what it supports yet.

On the flip side:

- six ethernet drivers convert over to using phylink_generic_validate
directly, as does macb in its current form (although I haven't sent
those patches due to the open questions we have on it.)

- eight DSA drivers convert over to using phylink_generic_validate.

So we have fourteen drivers which lend themselves to this new model
really well, substantially reducing the amount of code that needs to
be looked after.

Now, I know this doesn't match your model, but it is a step closer
towards it. Why do I say this?

I have entirely eliminated the basex helper that writes to
state->interface in the validate() callback. This may not seem to be
significant, but it is - this is the reason why we have been passing
in the "state" and "supported" parameters in and requiring the
implementations to do the bitmap AND operations, rather than just
returning a bitmap of supported ethtool linkmodes.

The second thing that becomes possible through this change is we can
now find out from the MAC's validate() function what is supported for
a particular set of PHY interface modes. So, for example:
1) with the 88X3310 PHY operating in SGMII, 2500BASE-X, 5GBASE-R and
   10GBASE-R modes, we can ask for just these interface modes, rather
   than requesting everything that the PHY supports.
2) with a SFP, we can ask what is supported only for the PHY interface
   modes that would be appropriate to the cage and/or the SFP plugged
   in.

This makes it more practical to think about how to move forward towards
splitting the PCS part of validation from the MAC validate() callback.
Once we have a solution for that, then we can likely think about how to
get rid of the MAC dealing with ethtool linkmodes at all.

Rather than rush into this immediately, I want to let all these drivers
settle after this change - give time for any bugs that I may have
introduced to be shaken out and fixed. I'm not perfect, but I've given
this my best shot - and since I can't test beyond a build-test of this,
it's probably going to require a full kernel release cycle before we
really know whether anything has been broken.

> The tricky part comes in a scenario where the exact MAC is determined at
> runtime, such as the MACB+Xilinx PCS configuration.

We don't model multiple MACs - but we do have situations where we have
multiple MACs already. For example, Marvell mvpp2 has multiple MACs -
one for <=2.5G (GMAC) and another for >2.5G (XLG) and we deal with
that all in the driver. At this stage, I don't think we should begin to
start discussing any kind of model for this, otherwise I fear the
discussion will get too complex - let's concentrate on one issue at a
time.

> > > Right now, "no pcs" is really not an option I'm afraid. The presence
> > > of a PCS changes the phylink behaviour slightly . This is one of my
> > > bug-bears. The evolution of phylink has meant that we need to keep
> > > compatibility with how phylink used to work before we split the PCS
> > > support - and we detect that by whether there is a PCS to determine
> > > whether we need to operate with that compatibility. It probably was
> > > a mistake to do that in hind sight.
> 
> Of course it's an option :)

I'm saying that with phylink as it is, phylink does _not_ support there
being no PCS in a non-legacy driver, since the presence of a PCS is a
flag that the driver is non-legacy. It changes the circumstances in
which the mac_config() method is called.

If we want the PCS to become optional, then we need phylink to deal with
detecting non-legacy drivers some other way (e.g. a legacy-mode flag in
struct phylink_config), or we need to eliminate the legacy drivers by
either converting them (which is always my preference) or deleting them.

> > > I added mac_select_pcs() because finding out that something isn't
> > > supported in mac_prepare() is way too late - as explained above
> > > where I talked about binding the PHY into the link setup. E.g. if
> > > the "system" as a whole can't operate at 2.5G speeds, then we should
> > > not allow the PHY to advertise 2500BASE-T. It is no good advertising
> > > 2500BASE-T, then having the PHY negotiate 2500BASE-T, select 2500BASE-X,
> > > and then have mac_prepare() decide that can't be supported. The link
> > > won't come up, and there's nothing that can be sensibly done. The
> > > user sees the PHY indicating link, the link partner indicates link,
> > > but the link is non-functional. That isn't a good user experience.
> > > 
> > > Whereas, if we know ahead of time that 2.5G can't be supported, we can
> > > remove 2500BASE-T from the advertisement, and the PHY will instead
> > > negotiate a slower speed - resulting in a working link, albiet slower.
> 
> AIUI it's a bug in the driver to advertise something in validate() which
> it can't support.

Right now, it's a bug in the driver if validate() allows bits in the
supported mask or advertising mask to be set that the MAC or the PCS
are unable to support - and that goes _just_ for speeds, duplexes,
pause modes, and autoneg. It does not apply to media technologies,
nor port modes.

Restricting the media technologies is the responsibility of the PHY
when a PHY is present (so all those drivers that decide to only
allow the ethtool 1000baseX_Full bit when in 1000BASE-X mode, but
otherwise allow 1000baseT_Full are themselves buggy.)

In the case of:

MAC --- PCS --- non-SFP Media

Then we will currently get all the linkmodes advertised - unless
something restricts them, and currently with stmmac + xpcs, the
xpcs driver restricts the link modes. That said, if we have
stmmac + xpcs + phy, restricting to 10/100/1000BASE-T is pretty
silly when the PHY could be something like a Marvell 88E151x which
supports automedia between copper and fiber (and thus supports
1000BASE-X).

What constitutes the ethtool link modes really comes down to being
the responsibility of the device that's connecting to the media
coupled with the speed/duplex/pause capabilities of the blocks
leading back to the system. Even the "Autoneg" ethtool bit is
irrelevant for preceeding blocks if there's a PHY on the end that
performs autonegotiation - but it becomes more relevant when the
PCS the last modelled block before the media.

> > > I hope that explains why it is so important not to error out in
> > > mac_prepare() because something wasn't properly handled in the
> > > validate() step.
> 
> Specifically, "I don't need a PCS for this mode" should be a valid
> response. I agree that "I can't select a PCS" doesn't make sense.

If we are going to involve the PCS inside phylink as suggested above,
it may actually become a problem not to have a PCS, even when there
isn't one in the strict sense according to IEEE 802.3. It's a bit like
the model we've had for fixed-links prior to phylink - which was to
emulate a PHY in software that phylib can talk to complete with its
own set of MDIO registers.

> > What I haven't described in the above (it was rather late when I was
> > writing that email!) is where we need to head with a PHY that does
> > rate adaption - and yet again an example of such a PHY is the 88X3310.
> > This is actually a good example for many of the issues right now.
> > 
> > If the 88X3310 is configured to have a host interface that only
> > supports 10GBASE-R, then rate adaption within the PHY is activated,
> > meaning the PHY is able to operate at, e.g. 10BASE-T while the host
> > MAC operates at 10GBASE-R. There are some provisos to this though:
> > 
> > 1) If the 88X3310 supports MACSEC, then it has internal MACs that
> >     are able to generate pause frames, and pause frames will be sent
> >     over the 10GBASE-R link as necessary to control the rate at which
> >     the MAC sends packets.
> > 
> > 2) If the 88X3310 does not support MACSEC, then it is expected that
> >     the MAC is paced according to the link speed to avoid overflowing
> >     the 88X3310 internal FIFOs (what happens when the internal FIFOs
> >     overflow is not known.) There are no pause frames generated.
> >     (This is the case on Macchiatobin boards if we configured the PHY
> >     for 10GBASE-R rate-adaption mode.)
> 
> Well if it just drops/corrupts packets then it just looks like a
> lossy/congested link. And the upper layers of the network stack already
> expect that sort of thing (though perhaps not optimally).

I don't think you've thought this through. If you stuff the transmit
FIFO and you end up with a partial packet in the FIFO that the PHY
starts to transmit, the packet will not end up hitting the wire in
a complete form, and _every_ node on the network that receives that
packet is going to flag an error and increment its error counters.
I'm not saying this is what the PHY will do - I don't really know
what it will do if the pacing requirement isn't satisfied. However,
as it doesn't contain a MAC internally, it likely doesn't check the
checksum on the received data. Does it implement store-and-forward,
we don't know.

I wonder what your thoughts are if you do "ip -s li" and you see errors
reported on the receive side of your network interface? Would you
assume it's one particular machine on your network causing it, and
would you then seek to remove it from your network, or would you
suspect a cabling issue or problem with a network switch?

Russell King (Oracle) Nov. 22, 2021, 7:15 p.m. UTC | #9

Hi Sean,

Today's update...

On Fri, Nov 19, 2021 at 06:56:09PM +0000, Russell King (Oracle) wrote:
> > Say that you are a MAC with an integrated PCS (so you handle everything
> > in the MAC driver). You support GMII full and half duplex, but your PCS
> > only supports 1000BASE-X with full duplex. The naïve bitmap is
> > 
> > supported_interfaces = PHY_INTERFACE_GMII | PHY_INTERFACE_1000BASEX;
> > mac_capabilities = MAC_10 | MAC_100 | MAC_1000;
> > 
> > but this will report 1000BASE-X as supporting both full and half duplex.
> > So you still need a custom validate() in order to report the correct link
> > modes.
> 
> First, let me say that I've now been through all phylink users in the
> kernel, converting them, and there are seven cases where the generic
> helper can't be used. The validate() method isn't going away (at least
> not just yet) which allows unexpected situations to still be handled.
> The current state of this conversion is as follows.
> 
> In terms of network drivers:
> 
> mvneta and mvpp2: the generic helper is used but needs to be wrapped
>   since these specify that negotiation must not be disabled in
>   "1000BASE-X" mode, which includes the up-clocked 2500BASE-X.

Over the weekend, I've been able to eliminate these two with the
mac_select_pcs() method, and adding a pcs_validate() callback to the
PCS operations. The updated patches are in the net-queue branch.

> > > > Right now, "no pcs" is really not an option I'm afraid. The presence
> > > > of a PCS changes the phylink behaviour slightly . This is one of my
> > > > bug-bears. The evolution of phylink has meant that we need to keep
> > > > compatibility with how phylink used to work before we split the PCS
> > > > support - and we detect that by whether there is a PCS to determine
> > > > whether we need to operate with that compatibility. It probably was
> > > > a mistake to do that in hind sight.
> > 
> > Of course it's an option :)
> 
> I'm saying that with phylink as it is, phylink does _not_ support there
> being no PCS in a non-legacy driver, since the presence of a PCS is a
> flag that the driver is non-legacy. It changes the circumstances in
> which the mac_config() method is called.
> 
> If we want the PCS to become optional, then we need phylink to deal with
> detecting non-legacy drivers some other way (e.g. a legacy-mode flag in
> struct phylink_config), or we need to eliminate the legacy drivers by
> either converting them (which is always my preference) or deleting them.

... and over the weekend, I've implemented a flag "legacy_pre_march2020"
which is used to flag which drivers predate the changes to add PCS
support, allowing phylink to know whether they are legacy drivers or
not irrespective of whether they use a PCS.

So, I've updated the "mac_select_pcs" patch to allow NULL as a valid
return, and use error-pointers to indicate failure.

Then there's a following patch that adds a "pcs_validate" method which
gives the PCS an opportunity to have its say in the MAC validation
without the MAC having code to call the PCS - this change allowed mvneta
and mvpp2 to use phylink_generic_validate().

So, this is another step towards the end goal.

I'm planning to get lkp chew on the "legacy_pre_march2020" before I send
those out to netdev (which will be the next batch of patches), and then
will be the patches that add the ground work to allow DSA drivers to
become properly non-legacy.

Please have a look at my net-queue - are you happier with the
mac_select_pcs method now?

Once we've got this settled, and remaining drivers converted, my plan is
to kill off the old .validate method entirely. At that point,
"mac_capabilities" then becomes exactly that - the capabilities of the
MAC block irrespective of anything else.

Given bcm_sf2, I may need to add a method which allows the MAC to modify
those capabilities depending on the interface - bcm_sf2's pause frame
capabilities appear to be dependent on the interface mode. I'm tempted
to reverse the logic I have in bcm_sf2 in these patches - set
MAC_ASYM_PAUSE | MAC_SYM_PAUSE in mac_capabilities, and exclude them for
interface types that don't support pause frames.

Sean Anderson Nov. 22, 2021, 9:06 p.m. UTC | #10

Hi Russell,

On 11/22/21 2:15 PM, Russell King (Oracle) wrote:
> Hi Sean,
>
> Today's update...
>
> On Fri, Nov 19, 2021 at 06:56:09PM +0000, Russell King (Oracle) wrote:
>> > Say that you are a MAC with an integrated PCS (so you handle everything
>> > in the MAC driver). You support GMII full and half duplex, but your PCS
>> > only supports 1000BASE-X with full duplex. The naïve bitmap is
>> >
>> > supported_interfaces = PHY_INTERFACE_GMII | PHY_INTERFACE_1000BASEX;
>> > mac_capabilities = MAC_10 | MAC_100 | MAC_1000;
>> >
>> > but this will report 1000BASE-X as supporting both full and half duplex.
>> > So you still need a custom validate() in order to report the correct link
>> > modes.
>>
>> First, let me say that I've now been through all phylink users in the
>> kernel, converting them, and there are seven cases where the generic
>> helper can't be used. The validate() method isn't going away (at least
>> not just yet) which allows unexpected situations to still be handled.
>> The current state of this conversion is as follows.
>>
>> In terms of network drivers:
>>
>> mvneta and mvpp2: the generic helper is used but needs to be wrapped
>>   since these specify that negotiation must not be disabled in
>>   "1000BASE-X" mode, which includes the up-clocked 2500BASE-X.
>
> Over the weekend, I've been able to eliminate these two with the
> mac_select_pcs() method, and adding a pcs_validate() callback to the
> PCS operations. The updated patches are in the net-queue branch.

That looks good. It looks effectively the same as what I'd prototyped.

>> > > > Right now, "no pcs" is really not an option I'm afraid. The presence
>> > > > of a PCS changes the phylink behaviour slightly . This is one of my
>> > > > bug-bears. The evolution of phylink has meant that we need to keep
>> > > > compatibility with how phylink used to work before we split the PCS
>> > > > support - and we detect that by whether there is a PCS to determine
>> > > > whether we need to operate with that compatibility. It probably was
>> > > > a mistake to do that in hind sight.
>> >
>> > Of course it's an option :)
>>
>> I'm saying that with phylink as it is, phylink does _not_ support there
>> being no PCS in a non-legacy driver, since the presence of a PCS is a
>> flag that the driver is non-legacy. It changes the circumstances in
>> which the mac_config() method is called.
>>
>> If we want the PCS to become optional, then we need phylink to deal with
>> detecting non-legacy drivers some other way (e.g. a legacy-mode flag in
>> struct phylink_config), or we need to eliminate the legacy drivers by
>> either converting them (which is always my preference) or deleting them.
>
> ... and over the weekend, I've implemented a flag "legacy_pre_march2020"
> which is used to flag which drivers predate the changes to add PCS
> support, allowing phylink to know whether they are legacy drivers or
> not irrespective of whether they use a PCS.

Thanks for clarifying this. It was always a little unclear to me exactly
what the implications of unregistering a PCS were.

> So, I've updated the "mac_select_pcs" patch to allow NULL as a valid
> return, and use error-pointers to indicate failure.
>
> Then there's a following patch that adds a "pcs_validate" method which
> gives the PCS an opportunity to have its say in the MAC validation
> without the MAC having code to call the PCS - this change allowed mvneta
> and mvpp2 to use phylink_generic_validate().
>
> So, this is another step towards the end goal.
>
> I'm planning to get lkp chew on the "legacy_pre_march2020" before I send
> those out to netdev (which will be the next batch of patches), and then
> will be the patches that add the ground work to allow DSA drivers to
> become properly non-legacy.
>
> Please have a look at my net-queue - are you happier with the
> mac_select_pcs method now?

Yes. This looks like it can be implemented reasonably. I'm going to hold
off on working on Xilinx PCS for now and wait until your changes have
made their way upstream.

> Once we've got this settled, and remaining drivers converted, my plan is
> to kill off the old .validate method entirely. At that point,
> "mac_capabilities" then becomes exactly that - the capabilities of the
> MAC block irrespective of anything else.
>
> Given bcm_sf2, I may need to add a method which allows the MAC to modify
> those capabilities depending on the interface - bcm_sf2's pause frame
> capabilities appear to be dependent on the interface mode. I'm tempted
> to reverse the logic I have in bcm_sf2 in these patches - set
> MAC_ASYM_PAUSE | MAC_SYM_PAUSE in mac_capabilities, and exclude them for
> interface types that don't support pause frames.

Yeah. I like how phylink_get_linkmodes allows for modifying the
capabilities if necessary.

Thanks for reworking this stuff. I think this will make it a lot easier
to get my changes upstreamed :)

--Sean

[net-next,v6] net: macb: Fix several edge cases in validate

Checks

Commit Message

Comments

Patch