[net-next,v3,07/47] net: phy: Add support for rate adaptation

Message ID 20220715215954.1449214-8-sean.anderson@seco.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series net: dpaa: Convert to phylink

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count fail Series longer than 15 patches (and no cover letter)
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 425 this patch: 425
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang fail Errors and warnings before: 302 this patch: 304
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 410 this patch: 410
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 95 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Sean Anderson July 15, 2022, 9:59 p.m. UTC
This adds general support for rate adaptation to the phy subsystem. The
general idea is that the phy interface runs at one speed, and the MAC
throttles the rate at which it sends packets to the link speed. There's a
good overview of several techniques for achieving this at [1]. This patch
adds support for three: pause-frame based (such as in Aquantia phys),
CRS-based (such as in 10PASS-TS and 2BASE-TL), and open-loop-based (such as
in 10GBASE-W).
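
To give a feel for the open-loop case, here is a rough arithmetic sketch (not part of this patch). It assumes the MAC paces itself by appending idle bytes after each frame so that its average rate on the phy interface does not exceed the link speed. The helper name and its parameters are hypothetical, and frame_bytes is assumed to already include preamble and the normal inter-packet gap.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of extra idle bytes to append after a frame of
 * @frame_bytes so that an interface running at @iface_mbps averages no more
 * than @link_mbps. Derived from (frame + extra) / iface >= frame / link.
 */
uint64_t open_loop_extra_idle_bytes(uint64_t frame_bytes,
				    uint32_t iface_mbps, uint32_t link_mbps)
{
	uint64_t excess;

	/* Link down, or link at least as fast as the interface: no pacing. */
	if (link_mbps == 0 || link_mbps >= iface_mbps)
		return 0;

	excess = frame_bytes * (uint64_t)(iface_mbps - link_mbps);
	return (excess + link_mbps - 1) / link_mbps;	/* round up */
}
```

For a 1538-byte slot on a 10000 Mbit/s interface feeding a 1000 Mbit/s link, this asks for nine further slots' worth of idle, i.e. the interface is busy one tenth of the time.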

This patch makes a few assumptions and a few non-assumptions about the
types of rate adaptation available. First, it assumes that different phys
may use different forms of rate adaptation. Second, it assumes that phys
can use rate adaptation for any of their supported link speeds (e.g. if a
phy supports 10BASE-T and XGMII, then it can adapt XGMII to 10BASE-T).
Third, it does not assume that all interface modes will use the same form
of rate adaptation. Fourth, it does not assume that all phy devices will
support rate adaptation (even though some do). Relaxing or strengthening
these (non-)assumptions could result in a different API. For example, if all
interface modes were assumed to use the same form of rate adaptation, then
a bitmask of interface modes supporting rate adaptation would suffice.

We need support for rate adaptation for two reasons. First, the phy
consumer needs to know if the phy will perform rate adaptation in order to
program the correct advertising. An unaware consumer will only program
support for link modes at the phy interface mode's native speed. This will
cause autonegotiation to fail if the link partner only advertises support
for lower speed link modes.
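
As an illustration only, the advertising decision could look roughly like the sketch below. The enum mirrors the one added by this patch; may_advertise_speed() and its parameters are hypothetical and exist only for this example. It relies on the assumption stated above that an adapting interface can be adapted to every slower link speed the phy supports.

```c
#include <stdbool.h>

/* Mirrors the enum added by this patch. */
enum rate_adaptation {
	RATE_ADAPT_NONE = 0,
	RATE_ADAPT_PAUSE,
	RATE_ADAPT_CRS,
	RATE_ADAPT_OPEN_LOOP,
};

/*
 * Hypothetical consumer-side helper: may a link mode running at @link_mbps
 * be advertised when the phy interface natively runs at @iface_mbps?
 */
bool may_advertise_speed(unsigned int iface_mbps, unsigned int link_mbps,
			 enum rate_adaptation ra)
{
	/* The interface's native speed never needs adaptation. */
	if (link_mbps == iface_mbps)
		return true;

	/* Slower link modes only work if the phy adapts the rate. */
	return link_mbps < iface_mbps && ra != RATE_ADAPT_NONE;
}
```

An unaware consumer behaves as if ra were always RATE_ADAPT_NONE, which is why it ends up advertising only the native-speed link modes.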

Second, to reduce packet loss it may be desirable to throttle packet
throughput. In past discussions [2-4], this behavior has been
controversial. It is the opinion of several developers that it is the
responsibility of the system integrator or end user to set the link
settings appropriately for rate adaptation. In particular, it was argued
that it is difficult to determine whether a particular phy has rate
adaptation enabled, and it is simpler to keep such determinations out of
the kernel. Another criticism is that packet loss may happen anyway, such
as if a faster link is used with a switch or repeater that does not support
pause frames.

I believe that our current approach is limiting, especially when
considering that rate adaptation (in two forms) has made it into IEEE
standards. In general, when we have appropriate information we should set
sensible defaults. To consider a contrasting example, we enable pause
frames by default for switches which autonegotiate for them. When it's the
phy itself generating these frames, we don't even have to autonegotiate to
know that we should enable pause frames.
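
A minimal sketch of such a default, for illustration (default_rx_pause() is a hypothetical helper, not something this patch adds): when the phy itself throttles the MAC with pause frames, the MAC has to honor received pause frames no matter what was negotiated with the link partner.

```c
#include <stdbool.h>

/* Mirrors the enum added by this patch. */
enum rate_adaptation {
	RATE_ADAPT_NONE = 0,
	RATE_ADAPT_PAUSE,
	RATE_ADAPT_CRS,
	RATE_ADAPT_OPEN_LOOP,
};

/*
 * Hypothetical helper: should the MAC act on received pause frames by
 * default? @negotiated_rx_pause is the result of pause autonegotiation.
 */
bool default_rx_pause(enum rate_adaptation ra, bool negotiated_rx_pause)
{
	/* The phy generates the pause frames itself; no negotiation needed. */
	if (ra == RATE_ADAPT_PAUSE)
		return true;

	/* Otherwise fall back to whatever autonegotiation resolved. */
	return negotiated_rx_pause;
}
```

This only covers the default; userspace could still override the result through ethtool.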

Our current approach also encourages workarounds, such as commit
73a21fa817f0 ("dpaa_eth: support all modes with rate adapting PHYs").
These workarounds are fine for phylib drivers, but phylink drivers cannot
use this approach (since there is no direct access to the phy). Note that
even when we determine (e.g.) the pause settings based on whether rate
adaptation is enabled, they can still be overridden by userspace (using
ethtool). It might be prudent to allow disabling of rate adaptation
generally in ethtool as well.

[1] https://www.ieee802.org/3/efm/baseline/marris_1_0302.pdf
[2] https://lore.kernel.org/netdev/1579701573-6609-1-git-send-email-madalin.bucur@oss.nxp.com/
[3] https://lore.kernel.org/netdev/1580137671-22081-1-git-send-email-madalin.bucur@oss.nxp.com/
[4] https://lore.kernel.org/netdev/20200116181933.32765-1-olteanv@gmail.com/

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
---

Changes in v3:
- New

 drivers/net/phy/phy.c | 21 +++++++++++++++++++++
 include/linux/phy.h   | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

Comments

Andrew Lunn July 16, 2022, 7:39 p.m. UTC | #1
>  drivers/net/phy/phy.c | 21 +++++++++++++++++++++
>  include/linux/phy.h   | 38 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 59 insertions(+)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 8d3ee3a6495b..cf4a8b055a42 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -114,6 +114,27 @@ void phy_print_status(struct phy_device *phydev)
>  }
>  EXPORT_SYMBOL(phy_print_status);
>  
> +/**
> + * phy_get_rate_adaptation - determine if rate adaptation is supported
> + * @phydev: The phy device to return rate adaptation for
> + * @iface: The interface mode to use
> + *
> + * This determines the type of rate adaptation (if any) that @phy supports
> + * using @iface. @iface may be %PHY_INTERFACE_MODE_NA to determine if any
> + * interface supports rate adaptation.
> + *
> + * Return: The type of rate adaptation @phy supports for @iface, or
> + *         %RATE_ADAPT_NONE.
> + */
> +enum rate_adaptation phy_get_rate_adaptation(struct phy_device *phydev,
> +					     phy_interface_t iface)
> +{
> +	if (phydev->drv->get_rate_adaptation)
> +		return phydev->drv->get_rate_adaptation(phydev, iface);

It is normal that any call into the driver is performed with the
phydev->lock held.

>  #define PHY_INIT_TIMEOUT	100000
>  #define PHY_FORCE_TIMEOUT	10
> @@ -570,6 +588,7 @@ struct macsec_ops;
>   * @lp_advertising: Current link partner advertised linkmodes
>   * @eee_broken_modes: Energy efficient ethernet modes which should be prohibited
>   * @autoneg: Flag autoneg being used
> + * @rate_adaptation: Current rate adaptation mode
>   * @link: Current link state
>   * @autoneg_complete: Flag auto negotiation of the link has completed
>   * @mdix: Current crossover
> @@ -637,6 +656,8 @@ struct phy_device {
>  	unsigned irq_suspended:1;
>  	unsigned irq_rerun:1;
>  
> +	enum rate_adaptation rate_adaptation;

It is not clear what the locking is on this member. Is it only safe to
access it during the adjust_link callback, when it is guaranteed that
the phydev->lock is held, so the value is consistent? Or is the MAC
allowed to access this at other times?

	Andrew
Sean Anderson July 16, 2022, 9:55 p.m. UTC | #2
On 7/16/22 3:39 PM, Andrew Lunn wrote:
>>   drivers/net/phy/phy.c | 21 +++++++++++++++++++++
>>   include/linux/phy.h   | 38 ++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 59 insertions(+)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index 8d3ee3a6495b..cf4a8b055a42 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -114,6 +114,27 @@ void phy_print_status(struct phy_device *phydev)
>>   }
>>   EXPORT_SYMBOL(phy_print_status);
>>   
>> +/**
>> + * phy_get_rate_adaptation - determine if rate adaptation is supported
>> + * @phydev: The phy device to return rate adaptation for
>> + * @iface: The interface mode to use
>> + *
>> + * This determines the type of rate adaptation (if any) that @phy supports
>> + * using @iface. @iface may be %PHY_INTERFACE_MODE_NA to determine if any
>> + * interface supports rate adaptation.
>> + *
>> + * Return: The type of rate adaptation @phy supports for @iface, or
>> + *         %RATE_ADAPT_NONE.
>> + */
>> +enum rate_adaptation phy_get_rate_adaptation(struct phy_device *phydev,
>> +					     phy_interface_t iface)
>> +{
>> +	if (phydev->drv->get_rate_adaptation)
>> +		return phydev->drv->get_rate_adaptation(phydev, iface);
> 
> It is normal that any call into the driver is performed with the
> phydev->lock held.

Ah, so like phy_ethtool_get_strings.
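
For illustration, the locking pattern under discussion could be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not code from this series: struct mock_mutex stands in for the kernel mutex behind phydev->lock, phy_interface_t is simplified to int, and demo_drv/bare_drv are made-up drivers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the enum added by this patch. */
enum rate_adaptation {
	RATE_ADAPT_NONE = 0,
	RATE_ADAPT_PAUSE,
	RATE_ADAPT_CRS,
	RATE_ADAPT_OPEN_LOOP,
};

/* Userspace stand-in for the kernel mutex (assumption for this sketch). */
struct mock_mutex { bool held; };

static void mock_lock(struct mock_mutex *m)   { assert(!m->held); m->held = true; }
static void mock_unlock(struct mock_mutex *m) { assert(m->held); m->held = false; }

struct phy_device;

struct phy_driver {
	/* iface is an int here; the patch uses phy_interface_t. */
	enum rate_adaptation (*get_rate_adaptation)(struct phy_device *phydev,
						    int iface);
};

struct phy_device {
	struct mock_mutex lock;
	const struct phy_driver *drv;
};

/* Take the lock around the call into the driver, release it afterwards. */
enum rate_adaptation phy_get_rate_adaptation(struct phy_device *phydev,
					     int iface)
{
	enum rate_adaptation ret = RATE_ADAPT_NONE;

	if (phydev->drv->get_rate_adaptation) {
		mock_lock(&phydev->lock);
		ret = phydev->drv->get_rate_adaptation(phydev, iface);
		mock_unlock(&phydev->lock);
	}
	return ret;
}

/* Made-up driver callback that insists on being called under the lock. */
static enum rate_adaptation demo_get_rate_adaptation(struct phy_device *phydev,
						      int iface)
{
	(void)iface;
	assert(phydev->lock.held);
	return RATE_ADAPT_PAUSE;
}

const struct phy_driver demo_drv = { .get_rate_adaptation = demo_get_rate_adaptation };
const struct phy_driver bare_drv = { .get_rate_adaptation = NULL };
```

A driver without the callback (bare_drv) never touches the lock and simply yields RATE_ADAPT_NONE.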

>>   #define PHY_INIT_TIMEOUT	100000
>>   #define PHY_FORCE_TIMEOUT	10
>> @@ -570,6 +588,7 @@ struct macsec_ops;
>>    * @lp_advertising: Current link partner advertised linkmodes
>>    * @eee_broken_modes: Energy efficient ethernet modes which should be prohibited
>>    * @autoneg: Flag autoneg being used
>> + * @rate_adaptation: Current rate adaptation mode
>>    * @link: Current link state
>>    * @autoneg_complete: Flag auto negotiation of the link has completed
>>    * @mdix: Current crossover
>> @@ -637,6 +656,8 @@ struct phy_device {
>>   	unsigned irq_suspended:1;
>>   	unsigned irq_rerun:1;
>>   
>> +	enum rate_adaptation rate_adaptation;
> 
> It is not clear what the locking is on this member. Is it only safe to
> access it during the adjust_link callback, when it is guaranteed that
> the phydev->lock is held, so the value is consistent? Or is the MAC
> allowed to access this at other times?

The former. My intention is that this has the same access as link/interface/speed/duplex.

--Sean

Patch

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 8d3ee3a6495b..cf4a8b055a42 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -114,6 +114,27 @@  void phy_print_status(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_print_status);
 
+/**
+ * phy_get_rate_adaptation - determine if rate adaptation is supported
+ * @phydev: The phy device to return rate adaptation for
+ * @iface: The interface mode to use
+ *
+ * This determines the type of rate adaptation (if any) that @phydev supports
+ * using @iface. @iface may be %PHY_INTERFACE_MODE_NA to determine if any
+ * interface supports rate adaptation.
+ *
+ * Return: The type of rate adaptation @phydev supports for @iface, or
+ *         %RATE_ADAPT_NONE.
+ */
+enum rate_adaptation phy_get_rate_adaptation(struct phy_device *phydev,
+					     phy_interface_t iface)
+{
+	if (phydev->drv->get_rate_adaptation)
+		return phydev->drv->get_rate_adaptation(phydev, iface);
+	return RATE_ADAPT_NONE;
+}
+EXPORT_SYMBOL_GPL(phy_get_rate_adaptation);
+
 /**
  * phy_config_interrupt - configure the PHY device for the requested interrupts
  * @phydev: the phy_device struct
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 81ce76c3e799..e983711f6c8b 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -276,6 +276,24 @@  static inline const char *phy_modes(phy_interface_t interface)
 	}
 }
 
+/**
+ * enum rate_adaptation - methods of rate adaptation
+ * @RATE_ADAPT_NONE: No rate adaptation performed.
+ * @RATE_ADAPT_PAUSE: The phy sends pause frames to throttle the MAC.
+ * @RATE_ADAPT_CRS: The phy asserts CRS to prevent the MAC from transmitting.
+ * @RATE_ADAPT_OPEN_LOOP: The MAC is programmed with a sufficiently-large IPG.
+ *
+ * These are used to throttle the rate of data on the phy interface when the
+ * native speed of the interface is higher than the link speed. These should
+ * not be used for phy interfaces which natively support multiple speeds (e.g.
+ * MII or SGMII).
+ */
+enum rate_adaptation {
+	RATE_ADAPT_NONE = 0,
+	RATE_ADAPT_PAUSE,
+	RATE_ADAPT_CRS,
+	RATE_ADAPT_OPEN_LOOP,
+};
 
 #define PHY_INIT_TIMEOUT	100000
 #define PHY_FORCE_TIMEOUT	10
@@ -570,6 +588,7 @@  struct macsec_ops;
  * @lp_advertising: Current link partner advertised linkmodes
  * @eee_broken_modes: Energy efficient ethernet modes which should be prohibited
  * @autoneg: Flag autoneg being used
+ * @rate_adaptation: Current rate adaptation mode
  * @link: Current link state
  * @autoneg_complete: Flag auto negotiation of the link has completed
  * @mdix: Current crossover
@@ -637,6 +656,8 @@  struct phy_device {
 	unsigned irq_suspended:1;
 	unsigned irq_rerun:1;
 
+	enum rate_adaptation rate_adaptation;
+
 	enum phy_state state;
 
 	u32 dev_flags;
@@ -801,6 +822,21 @@  struct phy_driver {
 	 */
 	int (*get_features)(struct phy_device *phydev);
 
+	/**
+	 * @get_rate_adaptation: Get the supported type of rate adaptation for a
+	 * particular phy interface. This is used by phy consumers to determine
+	 * whether to advertise lower-speed modes for that interface. It is
+	 * assumed that if a rate adaptation mode is supported on an interface,
+	 * then that interface's rate can be adapted to all slower link speeds
+	 * supported by the phy. If iface is %PHY_INTERFACE_MODE_NA, and the phy
+	 * supports any kind of rate adaptation for any interface, then it must
+	 * return that rate adaptation mode (preferring %RATE_ADAPT_PAUSE to
+	 * %RATE_ADAPT_CRS). If the interface is not supported, this should
+	 * return %RATE_ADAPT_NONE.
+	 */
+	enum rate_adaptation (*get_rate_adaptation)(struct phy_device *phydev,
+						    phy_interface_t iface);
+
 	/* PHY Power Management */
 	/** @suspend: Suspend the hardware, saving state if needed */
 	int (*suspend)(struct phy_device *phydev);
@@ -1681,6 +1717,8 @@  int phy_disable_interrupts(struct phy_device *phydev);
 void phy_request_interrupt(struct phy_device *phydev);
 void phy_free_interrupt(struct phy_device *phydev);
 void phy_print_status(struct phy_device *phydev);
+enum rate_adaptation phy_get_rate_adaptation(struct phy_device *phydev,
+					     phy_interface_t iface);
 void phy_set_max_speed(struct phy_device *phydev, u32 max_speed);
 void phy_remove_link_mode(struct phy_device *phydev, u32 link_mode);
 void phy_advertise_supported(struct phy_device *phydev);