diff mbox series

[RFC,net-next,2/7] net: ethtool: add support for Frame Preemption and MAC Merge layer

Message ID 20220816222920.1952936-3-vladimir.oltean@nxp.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series 802.1Q Frame Preemption and 802.3 MAC Merge support via ethtool | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 241204 this patch: 241374
netdev/cc_maintainers warning 5 maintainers not CCed: chenhao288@hisilicon.com moyufeng@huawei.com idosch@nvidia.com gustavoars@kernel.org huangguangbin2@huawei.com
netdev/build_clang fail Errors and warnings before: 620 this patch: 623
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 257166 this patch: 257336
netdev/checkpatch warning CHECK: No space is necessary after a cast WARNING: 'preemptable' may be misspelled - perhaps 'preemptible'? WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns
netdev/kdoc fail Errors and warnings before: 0 this patch: 13
netdev/source_inline success Was 0 now: 0

Commit Message

Vladimir Oltean Aug. 16, 2022, 10:29 p.m. UTC
Frame preemption (IEEE 802.1Q-2018 clause 6.7.2) and the MAC merge
sublayer (IEEE 802.3-2018 clause 99) are 2 specifications, part of the
Time Sensitive Networking enhancements, which work together to minimize
latency caused by frame interference. The overall goal of TSN is for
normal traffic and traffic belonging to real-time control processes to
be able to cohabitate on the same L2 network and not bother each other
too much.

They achieve this (partly) by introducing the concept of preemptable
traffic, i.e. Ethernet frames that have an altered Start-Of-Frame
delimiter, which can be fragmented and reassembled at L2 on a link-local
basis. The non-preemptable frames are called express traffic, and they
can preempt preemptable frames, therefore having lower latency, which
can matter at lower (100 Mbps) link speeds, or at high MTUs (jumbo
frames around 9K). Preemption is not recursive, i.e. a PT frame cannot
preempt another PT frame. Preemption also does not depend upon priority,
or otherwise said, an ET frame with prio 0 will still preempt a PT frame
with prio 7.

In terms of implementation, the specs talk about the fact that the
express traffic is handled by an express MAC (eMAC) and the preemptable
traffic by a preemptable MAC (pMAC), and these MACs are multiplexed on
the same MII by a MAC merge layer.

On RX, packets go to the eMAC or to the pMAC based on their SFD (the
definition of which was generalized to SMD, Start-of-mPacket-Delimiter,
where mPacket is essentially an Ethernet frame fragment, or a complete
frame). On TX, the eMAC/pMAC classification decision is taken by the
802.1Q spec, based on packet priority (each of the 8 priority values may
have an admin-status of preemptable or express).

I have modeled both the Ethernet part of the spec and the queuing part
as separate netlink messages, with separate ethtool command sets
intended for them. I am slightly flexible as to where to place the FP
settings; there were previous discussions about placing them in tc.
At the moment I haven't received a good enough argument to move them out
of ethtool, so here they are.
https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/linux/ethtool.h              |  59 ++++++
 include/uapi/linux/ethtool.h         |  15 ++
 include/uapi/linux/ethtool_netlink.h |  82 ++++++++
 net/ethtool/Makefile                 |   3 +-
 net/ethtool/fp.c                     | 295 +++++++++++++++++++++++++++
 net/ethtool/mm.c                     | 228 +++++++++++++++++++++
 net/ethtool/netlink.c                |  38 ++++
 net/ethtool/netlink.h                |   8 +
 8 files changed, 727 insertions(+), 1 deletion(-)
 create mode 100644 net/ethtool/fp.c
 create mode 100644 net/ethtool/mm.c

Comments

Jakub Kicinski Aug. 17, 2022, 3:22 a.m. UTC | #1
On Wed, 17 Aug 2022 01:29:15 +0300 Vladimir Oltean wrote:
> +/**
> + * struct ethtool_mm_state - 802.3 MAC merge layer state
> + */
> +struct ethtool_mm_state {
> +	u32 verify_time;
> +	enum ethtool_mm_verify_status verify_status;
> +	bool supported;
> +	bool enabled;
> +	bool active;

The enabled vs active piqued my interest. Is there some handshake /
aneg or such?

> +	nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE);
> +	if (!nest_table)
> +		return -EMSGSIZE;

Don't warp tables in nests, let the elements repeat in the parent.
Vladimir Oltean Aug. 17, 2022, 11:41 a.m. UTC | #2
On Tue, Aug 16, 2022 at 08:22:09PM -0700, Jakub Kicinski wrote:
> On Wed, 17 Aug 2022 01:29:15 +0300 Vladimir Oltean wrote:
> > +/**
> > + * struct ethtool_mm_state - 802.3 MAC merge layer state
> > + */
> > +struct ethtool_mm_state {
> > +	u32 verify_time;
> > +	enum ethtool_mm_verify_status verify_status;
> > +	bool supported;
> > +	bool enabled;
> > +	bool active;
> 
> The enabled vs active piqued my interest. Is there some handshake /
> aneg or such?

This is where writing the kdocs, as mentioned in the cover letter, would
have helped. Yes, the handshake is described in 802.3 clause "99.4.3
Verifying preemption operation". It says:

| Verification (see Figure 99–3) checks that the link can support the
| preemption capability.
|
| If verification is enabled, the preemption capability shall be active
| only after verification has completed
| successfully.
|
| If the preemption capability is enabled but has not been verified yet,
| the MAC Merge sublayer initiates verification. Verification relies on
| the transmission of a verify mPacket and receipt of a respond mPacket to
| confirm that the remote station supports the preemption capability.

In fact, the verify_time and verify_status fields right above
enabled/active are related exactly to this handshake process.

> > +	nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE);
> > +	if (!nest_table)
> > +		return -EMSGSIZE;
> 
> Don't warp tables in nests, let the elements repeat in the parent.

Ok, can do. I did this because 802.1Q actually specifies in the
IEEE8021-Preemption-MIB that there is a ieee8021PreemptionParameterTable
structure containing pairs of ieee8021PreemptionPriority and
ieee8021FramePreemptionAdminStatus.

Do you have actual issues with the structuring of the FP parameters
though? They look like this currently:

ethtool --json --show-fp eno0
[ {
        "ifname": "eno0",
        "parameter-table": [ {
                "prio": 0,
                "admin-status": "preemptable"
            },{
                "prio": 1,
                "admin-status": "preemptable"
            },{
                "prio": 2,
                "admin-status": "preemptable"
            },{
                "prio": 3,
                "admin-status": "preemptable"
            },{
                "prio": 4,
                "admin-status": "preemptable"
            },{
                "prio": 5,
                "admin-status": "preemptable"
            },{
                "prio": 6,
                "admin-status": "preemptable"
            },{
                "prio": 7,
                "admin-status": "preemptable"
            } ],
        "hold-advance": 0,
        "release-advance": 0
    } ]
Jakub Kicinski Aug. 17, 2022, 6:35 p.m. UTC | #3
On Wed, 17 Aug 2022 11:41:01 +0000 Vladimir Oltean wrote:
> > > +	nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE);
> > > +	if (!nest_table)
> > > +		return -EMSGSIZE;  
> > 
> > Don't warp tables in nests, let the elements repeat in the parent.  
> 
> Ok, can do. I did this because 802.1Q actually specifies in the
> IEEE8021-Preemption-MIB that there is a ieee8021PreemptionParameterTable
> structure containing pairs of ieee8021PreemptionPriority and
> ieee8021FramePreemptionAdminStatus.

Yeah, netlink is a bit special in array definition. I need to document
it well in my YAML netlink patches..

> Do you have actual issues with the structuring of the FP parameters
> though? They look like this currently

IDK how to put it best, I shared my largely uninformed thoughts, you
know much more about the spec and HW so whatever you think is most
appropriate is fine by me.
Vinicius Costa Gomes Aug. 17, 2022, 11:15 p.m. UTC | #4
Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> Frame preemption (IEEE 802.1Q-2018 clause 6.7.2) and the MAC merge
> sublayer (IEEE 802.3-2018 clause 99) are 2 specifications, part of the
> Time Sensitive Networking enhancements, which work together to minimize
> latency caused by frame interference. The overall goal of TSN is for
> normal traffic and traffic belonging to real-time control processes to
> be able to cohabitate on the same L2 network and not bother each other
> too much.
>
> They achieve this (partly) by introducing the concept of preemptable
> traffic, i.e. Ethernet frames that have an altered Start-Of-Frame
> delimiter, which can be fragmented and reassembled at L2 on a link-local
> basis. The non-preemptable frames are called express traffic, and they
> can preempt preemptable frames, therefore having lower latency, which
> can matter at lower (100 Mbps) link speeds, or at high MTUs (jumbo
> frames around 9K). Preemption is not recursive, i.e. a PT frame cannot
> preempt another PT frame. Preemption also does not depend upon priority,
> or otherwise said, an ET frame with prio 0 will still preempt a PT frame
> with prio 7.

I liked that in the API sense, using this "prio" concept we gain more
flexibility, and we can express better what the hardware you work with
can do, i.e. priority (for frame preemption purposes?) and queues are
orthogonal.

The problem I have is that the hardware I work with is more limited (as
are some stmmac-based NICs that I am aware of) frame preemption
capability and "priority" are attached to queues.

From the API perspective, it seems that I could say that "fp-prio" 0 is
associated with queue 0, fp-prio 1 to queue 1, and so on, and everything
will work.

The only thing that I am not happy at all is that there are exactly 8
fp-prios.

The Linux network stack is more flexible than what 802.1Q defines, think
skb->priority, number of TCs, as you said earlier, I would hate to
impose some almost artificial limits here. And in future we are going to
see TSN enabled devices with lots more queues.

In short:
 - Comment: this section of the RFC is hardware independent, this
 behavior of queues and priorities being orthogonal is only valid for
 some implementations;
 - Question: would it be against the intention of the API to have a 1:1
 map of priorities to queues?
 - Deal breaker: fixed 8 prios;

>
> In terms of implementation, the specs talk about the fact that the
> express traffic is handled by an express MAC (eMAC) and the preemptable
> traffic by a preemptable MAC (pMAC), and these MACs are multiplexed on
> the same MII by a MAC merge layer.
>
> On RX, packets go to the eMAC or to the pMAC based on their SFD (the
> definition of which was generalized to SMD, Start-of-mPacket-Delimiter,
> where mPacket is essentially an Ethernet frame fragment, or a complete
> frame). On TX, the eMAC/pMAC classification decision is taken by the
> 802.1Q spec, based on packet priority (each of the 8 priority values may
> have an admin-status of preemptable or express).
>
> I have modeled both the Ethernet part of the spec and the queuing part
> as separate netlink messages, with separate ethtool command sets
> intended for them. I am slightly flexible as to where to place the FP
> settings; there were previous discussions about placing them in tc.
> At the moment I haven't received a good enough argument to move them out
> of ethtool, so here they are.
> https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  include/linux/ethtool.h              |  59 ++++++
>  include/uapi/linux/ethtool.h         |  15 ++
>  include/uapi/linux/ethtool_netlink.h |  82 ++++++++
>  net/ethtool/Makefile                 |   3 +-
>  net/ethtool/fp.c                     | 295 +++++++++++++++++++++++++++
>  net/ethtool/mm.c                     | 228 +++++++++++++++++++++
>  net/ethtool/netlink.c                |  38 ++++
>  net/ethtool/netlink.h                |   8 +
>  8 files changed, 727 insertions(+), 1 deletion(-)
>  create mode 100644 net/ethtool/fp.c
>  create mode 100644 net/ethtool/mm.c
>
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index 99dc7bfbcd3c..fa504dd22bf6 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -453,6 +453,49 @@ struct ethtool_module_power_mode_params {
>  	enum ethtool_module_power_mode mode;
>  };
>  
> +/**
> + * struct ethtool_fp_param - 802.1Q Frame Preemption parameters
> + */
> +struct ethtool_fp_param {
> +	u8 preemptable_prios;
> +	u32 hold_advance;
> +	u32 release_advance;
> +};
> +
> +/**
> + * struct ethtool_mm_state - 802.3 MAC merge layer state
> + */
> +struct ethtool_mm_state {
> +	u32 verify_time;
> +	enum ethtool_mm_verify_status verify_status;
> +	bool supported;
> +	bool enabled;
> +	bool active;
> +	u8 add_frag_size;
> +};
> +
> +/**
> + * struct ethtool_mm_cfg - 802.3 MAC merge layer configuration
> + */
> +struct ethtool_mm_cfg {
> +	u32 verify_time;
> +	bool verify_disable;
> +	bool enabled;
> +	u8 add_frag_size;
> +};
> +
> +/* struct ethtool_mm_stats - 802.3 MAC merge layer statistics, as defined in
> + * IEEE 802.3-2018 subclause 30.14.1 oMACMergeEntity managed object class
> + */
> +struct ethtool_mm_stats {
> +	u64 MACMergeFrameAssErrorCount;
> +	u64 MACMergeFrameSmdErrorCount;
> +	u64 MACMergeFrameAssOkCount;
> +	u64 MACMergeFragCountRx;
> +	u64 MACMergeFragCountTx;
> +	u64 MACMergeHoldCount;
> +};
> +
>  /**
>   * struct ethtool_ops - optional netdev operations
>   * @cap_link_lanes_supported: indicates if the driver supports lanes
> @@ -624,6 +667,11 @@ struct ethtool_module_power_mode_params {
>   *	plugged-in.
>   * @set_module_power_mode: Set the power mode policy for the plug-in module
>   *	used by the network device.
> + * @get_fp_param: Query the 802.1Q Frame Preemption parameters.
> + * @set_fp_param: Set the 802.1Q Frame Preemption parameters.
> + * @get_mm_state: Query the 802.3 MAC Merge layer state.
> + * @set_mm_cfg: Set the 802.3 MAC Merge layer parameters.
> + * @get_mm_stats: Query the 802.3 MAC Merge layer statistics.
>   *
>   * All operations are optional (i.e. the function pointer may be set
>   * to %NULL) and callers must take this into account.  Callers must
> @@ -760,6 +808,17 @@ struct ethtool_ops {
>  	int	(*set_module_power_mode)(struct net_device *dev,
>  					 const struct ethtool_module_power_mode_params *params,
>  					 struct netlink_ext_ack *extack);
> +	void	(*get_fp_param)(struct net_device *dev,
> +				struct ethtool_fp_param *params);
> +	int	(*set_fp_param)(struct net_device *dev,
> +				const struct ethtool_fp_param *params,
> +				struct netlink_ext_ack *extack);
> +	void	(*get_mm_state)(struct net_device *dev,
> +				struct ethtool_mm_state *state);
> +	int	(*set_mm_cfg)(struct net_device *dev, struct ethtool_mm_cfg *cfg,
> +			      struct netlink_ext_ack *extack);
> +	void	(*get_mm_stats)(struct net_device *dev,
> +				struct ethtool_mm_stats *stats);
>  };
>  
>  int ethtool_check_ops(const struct ethtool_ops *ops);
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index 2d5741fd44bb..7a21fcb26a43 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -736,6 +736,21 @@ enum ethtool_module_power_mode {
>  	ETHTOOL_MODULE_POWER_MODE_HIGH,
>  };
>  
> +/* Values from ieee8021FramePreemptionAdminStatus */
> +enum ethtool_fp_admin_status {
> +	ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS = 1,
> +	ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE = 2,
> +};
> +
> +enum ethtool_mm_verify_status {
> +	ETHTOOL_MM_VERIFY_STATUS_UNKNOWN,
> +	ETHTOOL_MM_VERIFY_STATUS_INITIAL,
> +	ETHTOOL_MM_VERIFY_STATUS_VERIFYING,
> +	ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED,
> +	ETHTOOL_MM_VERIFY_STATUS_FAILED,
> +	ETHTOOL_MM_VERIFY_STATUS_DISABLED,
> +};
> +
>  /**
>   * struct ethtool_gstrings - string set for data tagging
>   * @cmd: Command number = %ETHTOOL_GSTRINGS
> diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
> index d2fb4f7be61b..658810274c49 100644
> --- a/include/uapi/linux/ethtool_netlink.h
> +++ b/include/uapi/linux/ethtool_netlink.h
> @@ -49,6 +49,10 @@ enum {
>  	ETHTOOL_MSG_PHC_VCLOCKS_GET,
>  	ETHTOOL_MSG_MODULE_GET,
>  	ETHTOOL_MSG_MODULE_SET,
> +	ETHTOOL_MSG_FP_GET,
> +	ETHTOOL_MSG_FP_SET,
> +	ETHTOOL_MSG_MM_GET,
> +	ETHTOOL_MSG_MM_SET,
>  
>  	/* add new constants above here */
>  	__ETHTOOL_MSG_USER_CNT,
> @@ -94,6 +98,10 @@ enum {
>  	ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY,
>  	ETHTOOL_MSG_MODULE_GET_REPLY,
>  	ETHTOOL_MSG_MODULE_NTF,
> +	ETHTOOL_MSG_FP_GET_REPLY,
> +	ETHTOOL_MSG_FP_NTF,
> +	ETHTOOL_MSG_MM_GET_REPLY,
> +	ETHTOOL_MSG_MM_NTF,
>  
>  	/* add new constants above here */
>  	__ETHTOOL_MSG_KERNEL_CNT,
> @@ -862,6 +870,80 @@ enum {
>  	ETHTOOL_A_MODULE_MAX = (__ETHTOOL_A_MODULE_CNT - 1)
>  };
>  
> +/* FRAME PREEMPTION (802.1Q) */
> +
> +enum {
> +	ETHTOOL_A_FP_PARAM_ENTRY_UNSPEC,
> +	ETHTOOL_A_FP_PARAM_ENTRY_PRIO,		/* u8 */
> +	ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS,	/* u8 */
> +
> +	/* add new constants above here */
> +	__ETHTOOL_A_FP_PARAM_ENTRY_CNT,
> +	ETHTOOL_A_FP_PARAM_ENTRY_MAX = (__ETHTOOL_A_FP_PARAM_ENTRY_CNT - 1)
> +};
> +
> +enum {
> +	ETHTOOL_A_FP_PARAM_UNSPEC,
> +	ETHTOOL_A_FP_PARAM_ENTRY,		/* nest */
> +
> +	/* add new constants above here */
> +	__ETHTOOL_A_FP_PARAM_CNT,
> +	ETHTOOL_A_FP_PARAM_MAX = (__ETHTOOL_A_FP_PARAM_CNT - 1)
> +};
> +
> +enum {
> +	ETHTOOL_A_FP_UNSPEC,
> +	ETHTOOL_A_FP_HEADER,			/* nest - _A_HEADER_* */
> +	ETHTOOL_A_FP_PARAM_TABLE,		/* nest */
> +	ETHTOOL_A_FP_HOLD_ADVANCE,		/* u32 */
> +	ETHTOOL_A_FP_RELEASE_ADVANCE,		/* u32 */
> +
> +	/* add new constants above here */
> +	__ETHTOOL_A_FP_CNT,
> +	ETHTOOL_A_FP_MAX = (__ETHTOOL_A_FP_CNT - 1)
> +};
> +
> +/* MAC MERGE (802.3) */
> +
> +enum {
> +	ETHTOOL_A_MM_STAT_UNSPEC,
> +	ETHTOOL_A_MM_STAT_PAD,
> +
> +	/* aMACMergeFrameAssErrorCount */
> +	ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS,	/* u64 */
> +	/* aMACMergeFrameSmdErrorCount */
> +	ETHTOOL_A_MM_STAT_SMD_ERRORS,		/* u64 */
> +	/* aMACMergeFrameAssOkCount */
> +	ETHTOOL_A_MM_STAT_REASSEMBLY_OK,	/* u64 */
> +	/* aMACMergeFragCountRx */
> +	ETHTOOL_A_MM_STAT_RX_FRAG_COUNT,	/* u64 */
> +	/* aMACMergeFragCountTx */
> +	ETHTOOL_A_MM_STAT_TX_FRAG_COUNT,	/* u64 */
> +	/* aMACMergeHoldCount */
> +	ETHTOOL_A_MM_STAT_HOLD_COUNT,		/* u64 */
> +
> +	/* add new constants above here */
> +	__ETHTOOL_A_MM_STAT_CNT,
> +	ETHTOOL_A_MM_STAT_MAX = (__ETHTOOL_A_MM_STAT_CNT - 1)
> +};
> +
> +enum {
> +	ETHTOOL_A_MM_UNSPEC,
> +	ETHTOOL_A_MM_HEADER,			/* nest - _A_HEADER_* */
> +	ETHTOOL_A_MM_VERIFY_STATUS,		/* u8 */
> +	ETHTOOL_A_MM_VERIFY_DISABLE,		/* u8 */
> +	ETHTOOL_A_MM_VERIFY_TIME,		/* u32 */
> +	ETHTOOL_A_MM_SUPPORTED,			/* u8 */
> +	ETHTOOL_A_MM_ENABLED,			/* u8 */
> +	ETHTOOL_A_MM_ACTIVE,			/* u8 */
> +	ETHTOOL_A_MM_ADD_FRAG_SIZE,		/* u8 */
> +	ETHTOOL_A_MM_STATS,			/* nest - _A_MM_STAT_* */
> +
> +	/* add new constants above here */
> +	__ETHTOOL_A_MM_CNT,
> +	ETHTOOL_A_MM_MAX = (__ETHTOOL_A_MM_CNT - 1)
> +};
> +
>  /* generic netlink info */
>  #define ETHTOOL_GENL_NAME "ethtool"
>  #define ETHTOOL_GENL_VERSION 1
> diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile
> index b76432e70e6b..31e056f17856 100644
> --- a/net/ethtool/Makefile
> +++ b/net/ethtool/Makefile
> @@ -7,4 +7,5 @@ obj-$(CONFIG_ETHTOOL_NETLINK)	+= ethtool_nl.o
>  ethtool_nl-y	:= netlink.o bitset.o strset.o linkinfo.o linkmodes.o \
>  		   linkstate.o debug.o wol.o features.o privflags.o rings.o \
>  		   channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \
> -		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o
> +		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o \
> +		   fp.o mm.o
> diff --git a/net/ethtool/fp.c b/net/ethtool/fp.c
> new file mode 100644
> index 000000000000..20f19d8c1461
> --- /dev/null
> +++ b/net/ethtool/fp.c
> @@ -0,0 +1,295 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright 2022 NXP
> + */
> +#include "common.h"
> +#include "netlink.h"
> +
> +struct fp_req_info {
> +	struct ethnl_req_info		base;
> +};
> +
> +struct fp_reply_data {
> +	struct ethnl_reply_data		base;
> +	struct ethtool_fp_param		params;
> +};
> +
> +#define FP_REPDATA(__reply_base) \
> +	container_of(__reply_base, struct fp_reply_data, base)
> +
> +const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1] = {
> +	[ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
> +};
> +
> +static int fp_prepare_data(const struct ethnl_req_info *req_base,
> +			   struct ethnl_reply_data *reply_base,
> +			   struct genl_info *info)
> +{
> +	struct fp_reply_data *data = FP_REPDATA(reply_base);
> +	struct net_device *dev = reply_base->dev;
> +	int ret;
> +
> +	if (!dev->ethtool_ops->get_fp_param)
> +		return -EOPNOTSUPP;
> +
> +	ret = ethnl_ops_begin(dev);
> +	if (ret < 0)
> +		return ret;
> +
> +	dev->ethtool_ops->get_fp_param(dev, &data->params);
> +	ethnl_ops_complete(dev);
> +
> +	return 0;
> +}
> +
> +static int fp_reply_size(const struct ethnl_req_info *req_base,
> +			 const struct ethnl_reply_data *reply_base)
> +{
> +	int len = 0;
> +
> +	len += nla_total_size(0); /* _FP_PARAM_ENTRY */
> +	len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_PRIO */
> +	len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_ADMIN_STATUS */
> +	len *= 8; /* 8 prios */
> +	len += nla_total_size(0); /* _FP_PARAM_TABLE */
> +	len += nla_total_size(sizeof(u32)); /* _FP_HOLD_ADVANCE */
> +	len += nla_total_size(sizeof(u32)); /* _FP_RELEASE_ADVANCE */
> +
> +	return len;
> +}
> +
> +static int fp_fill_reply(struct sk_buff *skb,
> +			 const struct ethnl_req_info *req_base,
> +			 const struct ethnl_reply_data *reply_base)
> +{
> +	const struct fp_reply_data *data = FP_REPDATA(reply_base);
> +	const struct ethtool_fp_param *params = &data->params;
> +	enum ethtool_fp_admin_status admin_status;
> +	struct nlattr *nest_table, *nest_entry;
> +	int prio;
> +	int ret;
> +
> +	if (nla_put_u32(skb, ETHTOOL_A_FP_HOLD_ADVANCE, params->hold_advance) ||
> +	    nla_put_u32(skb, ETHTOOL_A_FP_RELEASE_ADVANCE, params->release_advance))
> +		return -EMSGSIZE;
> +
> +	nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE);
> +	if (!nest_table)
> +		return -EMSGSIZE;
> +
> +	for (prio = 0; prio < 8; prio++) {
> +		admin_status = (params->preemptable_prios & BIT(prio)) ?
> +			ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE :
> +			ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS;
> +
> +		nest_entry = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_ENTRY);
> +		if (!nest_entry)
> +			goto nla_nest_cancel_table;
> +
> +		if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_PRIO, prio))
> +			goto nla_nest_cancel_entry;
> +
> +		if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS,
> +			       admin_status))
> +			goto nla_nest_cancel_entry;
> +
> +		nla_nest_end(skb, nest_entry);
> +	}
> +
> +	nla_nest_end(skb, nest_table);
> +
> +	return 0;
> +
> +nla_nest_cancel_entry:
> +	nla_nest_cancel(skb, nest_entry);
> +nla_nest_cancel_table:
> +	nla_nest_cancel(skb, nest_table);
> +	return ret;
> +}
> +
> +const struct ethnl_request_ops ethnl_fp_request_ops = {
> +	.request_cmd		= ETHTOOL_MSG_FP_GET,
> +	.reply_cmd		= ETHTOOL_MSG_FP_GET_REPLY,
> +	.hdr_attr		= ETHTOOL_A_FP_HEADER,
> +	.req_info_size		= sizeof(struct fp_req_info),
> +	.reply_data_size	= sizeof(struct fp_reply_data),
> +
> +	.prepare_data		= fp_prepare_data,
> +	.reply_size		= fp_reply_size,
> +	.fill_reply		= fp_fill_reply,
> +};
> +
> +static const struct nla_policy
> +ethnl_fp_set_param_entry_policy[ETHTOOL_A_FP_PARAM_ENTRY_MAX  + 1] = {
> +	[ETHTOOL_A_FP_PARAM_ENTRY_PRIO] = { .type = NLA_U8 },
> +	[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS] = { .type = NLA_U8 },
> +};
> +
> +static const struct nla_policy
> +ethnl_fp_set_param_table_policy[ETHTOOL_A_FP_PARAM_MAX + 1] = {
> +	[ETHTOOL_A_FP_PARAM_ENTRY] = NLA_POLICY_NESTED(ethnl_fp_set_param_entry_policy),
> +};
> +
> +const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1] = {
> +	[ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
> +	[ETHTOOL_A_FP_PARAM_TABLE] = NLA_POLICY_NESTED(ethnl_fp_set_param_table_policy),
> +};
> +
> +static int fp_parse_param_entry(const struct nlattr *nest,
> +				struct ethtool_fp_param *params,
> +				u8 *seen_prios, bool *mod,
> +				struct netlink_ext_ack *extack)
> +{
> +	struct nlattr *tb[ARRAY_SIZE(ethnl_fp_set_param_entry_policy)];
> +	u8 preemptable_prios = params->preemptable_prios;
> +	u8 prio, admin_status;
> +	int ret;
> +
> +	if (!nest)
> +		return 0;
> +
> +	ret = nla_parse_nested(tb,
> +			       ARRAY_SIZE(ethnl_fp_set_param_entry_policy) - 1,
> +			       nest, ethnl_fp_set_param_entry_policy,
> +			       extack);
> +	if (ret)
> +		return ret;
> +
> +	if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]) {
> +		NL_SET_ERR_MSG_ATTR(extack, nest,
> +				    "Missing 'prio' attribute");
> +		return -EINVAL;
> +	}
> +
> +	if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]) {
> +		NL_SET_ERR_MSG_ATTR(extack, nest,
> +				    "Missing 'admin_status' attribute");
> +		return -EINVAL;
> +	}
> +
> +	prio = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]);
> +	if (prio >= 8) {
> +		NL_SET_ERR_MSG_ATTR(extack, nest,
> +				    "Range exceeded for 'prio' attribute");
> +		return -ERANGE;
> +	}
> +
> +	if (*seen_prios & BIT(prio)) {
> +		NL_SET_ERR_MSG_ATTR(extack, nest,
> +				    "Duplicate 'prio' attribute");
> +		return -EINVAL;
> +	}
> +
> +	*seen_prios |= BIT(prio);
> +
> +	admin_status = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]);
> +	switch (admin_status) {
> +	case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE:
> +		preemptable_prios |= BIT(prio);
> +		break;
> +	case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS:
> +		preemptable_prios &= ~BIT(prio);
> +		break;
> +	default:
> +		NL_SET_ERR_MSG_ATTR(extack, nest,
> +				    "Unexpected value for 'admin_status' attribute");
> +		return -EINVAL;
> +	}
> +
> +	if (preemptable_prios != params->preemptable_prios) {
> +		params->preemptable_prios = preemptable_prios;
> +		*mod = true;
> +	}
> +
> +	return 0;
> +}
> +
> +static int fp_parse_param_table(const struct nlattr *nest,
> +				struct ethtool_fp_param *params, bool *mod,
> +				struct netlink_ext_ack *extack)
> +{
> +	u8 seen_prios = 0;
> +	struct nlattr *n;
> +	int rem, ret;
> +
> +	if (!nest)
> +		return 0;
> +
> +	nla_for_each_nested(n, nest, rem) {
> +		struct sched_entry *entry;
> +
> +		if (nla_type(n) != ETHTOOL_A_FP_PARAM_ENTRY) {
> +			NL_SET_ERR_MSG_ATTR(extack, n,
> +					    "Attribute is not of type 'entry'");
> +			continue;
> +		}
> +
> +		ret = fp_parse_param_entry(n, params, &seen_prios, mod, extack);
> +		if (ret < 0) {
> +			kfree(entry);
> +			return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info)
> +{
> +	struct netlink_ext_ack *extack = info->extack;
> +	struct ethnl_req_info req_info = {};
> +	struct ethtool_fp_param params = {};
> +	struct nlattr **tb = info->attrs;
> +	const struct ethtool_ops *ops;
> +	struct net_device *dev;
> +	bool mod = false;
> +	int ret;
> +
> +	ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_FP_HEADER],
> +					 genl_info_net(info), extack, true);
> +	if (ret)
> +		return ret;
> +
> +	dev = req_info.dev;
> +	ops = dev->ethtool_ops;
> +
> +	if (!ops->get_fp_param || !ops->set_fp_param) {
> +		ret = -EOPNOTSUPP;
> +		goto out_dev;
> +	}
> +
> +	rtnl_lock();
> +	ret = ethnl_ops_begin(dev);
> +	if (ret)
> +		goto out_rtnl;
> +
> +	dev->ethtool_ops->get_fp_param(dev, &params);
> +
> +	ethnl_update_u32(&params.hold_advance, tb[ETHTOOL_A_FP_HOLD_ADVANCE],
> +			 &mod);
> +	ethnl_update_u32(&params.release_advance,
> +			 tb[ETHTOOL_A_FP_RELEASE_ADVANCE], &mod);
> +
> +	ret = fp_parse_param_table(tb[ETHTOOL_A_FP_PARAM_TABLE], &params, &mod,
> +				   extack);
> +	if (ret || !mod)
> +		goto out_ops;
> +
> +	ret = dev->ethtool_ops->set_fp_param(dev, &params, extack);
> +	if (ret) {
> +		if (!extack->_msg)
> +			NL_SET_ERR_MSG(extack,
> +				       "Failed to update frame preemption parameters");
> +		goto out_ops;
> +	}
> +
> +	ethtool_notify(dev, ETHTOOL_MSG_FP_NTF, NULL);
> +
> +out_ops:
> +	ethnl_ops_complete(dev);
> +out_rtnl:
> +	rtnl_unlock();
> +out_dev:
> +	dev_put(dev);
> +	return ret;
> +}
> diff --git a/net/ethtool/mm.c b/net/ethtool/mm.c
> new file mode 100644
> index 000000000000..d4731fb7aee4
> --- /dev/null
> +++ b/net/ethtool/mm.c
> @@ -0,0 +1,228 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright 2022 NXP
> + */
> +#include "common.h"
> +#include "netlink.h"
> +
> +struct mm_req_info {
> +	struct ethnl_req_info		base;
> +};
> +
> +struct mm_reply_data {
> +	struct ethnl_reply_data		base;
> +	struct ethtool_mm_state		state;
> +	struct ethtool_mm_stats		stats;
> +};
> +
> +#define MM_REPDATA(__reply_base) \
> +	container_of(__reply_base, struct mm_reply_data, base)
> +
> +#define ETHTOOL_MM_STAT_CNT \
> +	(__ETHTOOL_A_MM_STAT_CNT - (ETHTOOL_A_MM_STAT_PAD + 1))
> +
> +const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1] = {
> +	[ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy_stats),
> +};
> +
> +static int mm_prepare_data(const struct ethnl_req_info *req_base,
> +			   struct ethnl_reply_data *reply_base,
> +			   struct genl_info *info)
> +{
> +	struct mm_reply_data *data = MM_REPDATA(reply_base);
> +	struct net_device *dev = reply_base->dev;
> +	const struct ethtool_ops *ops;
> +	int ret;
> +
> +	ops = dev->ethtool_ops;
> +
> +	if (!ops->get_mm_state)
> +		return -EOPNOTSUPP;
> +
> +	ethtool_stats_init((u64 *)&data->stats,
> +			   sizeof(data->stats) / sizeof(u64));
> +
> +	ret = ethnl_ops_begin(dev);
> +	if (ret < 0)
> +		return ret;
> +
> +	ops->get_mm_state(dev, &data->state);
> +
> +	if (ops->get_mm_stats && (req_base->flags & ETHTOOL_FLAG_STATS))
> +		ops->get_mm_stats(dev, &data->stats);
> +
> +	ethnl_ops_complete(dev);
> +
> +	return 0;
> +}
> +
> +static int mm_reply_size(const struct ethnl_req_info *req_base,
> +			 const struct ethnl_reply_data *reply_base)
> +{
> +	int len = 0;
> +
> +	len += nla_total_size(sizeof(u8)); /* _MM_VERIFY_STATUS */
> +	len += nla_total_size(sizeof(u32)); /* _MM_VERIFY_TIME */
> +	len += nla_total_size(sizeof(u8)); /* _MM_SUPPORTED */
> +	len += nla_total_size(sizeof(u8)); /* _MM_ENABLED */
> +	len += nla_total_size(sizeof(u8)); /* _MM_ACTIVE */
> +	len += nla_total_size(sizeof(u8)); /* _MM_ADD_FRAG_SIZE */
> +
> +	if (req_base->flags & ETHTOOL_FLAG_STATS)
> +		len += nla_total_size(0) + /* _MM_STATS */
> +		       nla_total_size_64bit(sizeof(u64)) * ETHTOOL_MM_STAT_CNT;
> +
> +	return len;
> +}
> +
> +static int mm_put_stat(struct sk_buff *skb, u64 val, u16 attrtype)
> +{
> +	if (val == ETHTOOL_STAT_NOT_SET)
> +		return 0;
> +	if (nla_put_u64_64bit(skb, attrtype, val, ETHTOOL_A_MM_STAT_PAD))
> +		return -EMSGSIZE;
> +	return 0;
> +}
> +
> +static int mm_put_stats(struct sk_buff *skb,
> +			const struct ethtool_mm_stats *stats)
> +{
> +	struct nlattr *nest;
> +
> +	nest = nla_nest_start(skb, ETHTOOL_A_MM_STATS);
> +	if (!nest)
> +		return -EMSGSIZE;
> +
> +	if (mm_put_stat(skb, stats->MACMergeFrameAssErrorCount,
> +			ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS) ||
> +	    mm_put_stat(skb, stats->MACMergeFrameSmdErrorCount,
> +			ETHTOOL_A_MM_STAT_SMD_ERRORS) ||
> +	    mm_put_stat(skb, stats->MACMergeFrameAssOkCount,
> +			ETHTOOL_A_MM_STAT_REASSEMBLY_OK) ||
> +	    mm_put_stat(skb, stats->MACMergeFragCountRx,
> +			ETHTOOL_A_MM_STAT_RX_FRAG_COUNT) ||
> +	    mm_put_stat(skb, stats->MACMergeFragCountTx,
> +			ETHTOOL_A_MM_STAT_TX_FRAG_COUNT) ||
> +	    mm_put_stat(skb, stats->MACMergeHoldCount,
> +			ETHTOOL_A_MM_STAT_HOLD_COUNT))
> +		goto err_cancel;
> +
> +	nla_nest_end(skb, nest);
> +	return 0;
> +
> +err_cancel:
> +	nla_nest_cancel(skb, nest);
> +	return -EMSGSIZE;
> +}
> +
> +static int mm_fill_reply(struct sk_buff *skb,
> +			 const struct ethnl_req_info *req_base,
> +			 const struct ethnl_reply_data *reply_base)
> +{
> +	const struct mm_reply_data *data = MM_REPDATA(reply_base);
> +	const struct ethtool_mm_state *state = &data->state;
> +
> +	if (nla_put_u8(skb, ETHTOOL_A_MM_VERIFY_STATUS, state->verify_status) ||
> +	    nla_put_u32(skb, ETHTOOL_A_MM_VERIFY_TIME, state->verify_time) ||
> +	    nla_put_u8(skb, ETHTOOL_A_MM_SUPPORTED, state->supported) ||
> +	    nla_put_u8(skb, ETHTOOL_A_MM_ENABLED, state->enabled) ||
> +	    nla_put_u8(skb, ETHTOOL_A_MM_ACTIVE, state->active) ||
> +	    nla_put_u8(skb, ETHTOOL_A_MM_ADD_FRAG_SIZE, state->add_frag_size))
> +		return -EMSGSIZE;
> +
> +	if (req_base->flags & ETHTOOL_FLAG_STATS &&
> +	    mm_put_stats(skb, &data->stats))
> +		return -EMSGSIZE;
> +
> +	return 0;
> +}
> +
> +const struct ethnl_request_ops ethnl_mm_request_ops = {
> +	.request_cmd		= ETHTOOL_MSG_MM_GET,
> +	.reply_cmd		= ETHTOOL_MSG_MM_GET_REPLY,
> +	.hdr_attr		= ETHTOOL_A_MM_HEADER,
> +	.req_info_size		= sizeof(struct mm_req_info),
> +	.reply_data_size	= sizeof(struct mm_reply_data),
> +
> +	.prepare_data		= mm_prepare_data,
> +	.reply_size		= mm_reply_size,
> +	.fill_reply		= mm_fill_reply,
> +};
> +
> +const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1] = {
> +	[ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
> +	[ETHTOOL_A_MM_VERIFY_DISABLE] = { .type = NLA_U8 },
> +	[ETHTOOL_A_MM_VERIFY_TIME] = { .type = NLA_U32 },
> +	[ETHTOOL_A_MM_ENABLED] = { .type = NLA_U8 },
> +	[ETHTOOL_A_MM_ADD_FRAG_SIZE] = { .type = NLA_U8 },
> +};
> +
> +static void mm_state_to_cfg(const struct ethtool_mm_state *state,
> +			    struct ethtool_mm_cfg *cfg)
> +{
> +	cfg->verify_disable =
> +		state->verify_status == ETHTOOL_MM_VERIFY_STATUS_DISABLED;
> +	cfg->verify_time = state->verify_time;
> +	cfg->enabled = state->enabled;
> +	cfg->add_frag_size = state->add_frag_size;
> +}
> +
> +int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info)
> +{
> +	struct netlink_ext_ack *extack = info->extack;
> +	struct ethnl_req_info req_info = {};
> +	struct ethtool_mm_state state = {};
> +	struct nlattr **tb = info->attrs;
> +	struct ethtool_mm_cfg cfg = {};
> +	const struct ethtool_ops *ops;
> +	struct net_device *dev;
> +	bool mod = false;
> +	int ret;
> +
> +	ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_MM_HEADER],
> +					 genl_info_net(info), extack, true);
> +	if (ret)
> +		return ret;
> +
> +	dev = req_info.dev;
> +	ops = dev->ethtool_ops;
> +
> +	if (!ops->get_mm_state || !ops->set_mm_cfg) {
> +		ret = -EOPNOTSUPP;
> +		goto out_dev;
> +	}
> +
> +	rtnl_lock();
> +	ret = ethnl_ops_begin(dev);
> +	if (ret)
> +		goto out_rtnl;
> +
> +	ops->get_mm_state(dev, &state);
> +
> +	mm_state_to_cfg(&state, &cfg);
> +
> +	ethnl_update_bool(&cfg.verify_disable, tb[ETHTOOL_A_MM_VERIFY_DISABLE],
> +			  &mod);
> +	ethnl_update_u32(&cfg.verify_time, tb[ETHTOOL_A_MM_VERIFY_TIME], &mod);
> +	ethnl_update_bool(&cfg.enabled, tb[ETHTOOL_A_MM_ENABLED], &mod);
> +	ethnl_update_u8(&cfg.add_frag_size, tb[ETHTOOL_A_MM_ADD_FRAG_SIZE],
> +			&mod);
> +
> +	ret = ops->set_mm_cfg(dev, &cfg, extack);
> +	if (ret) {
> +		if (!extack->_msg)
> +			NL_SET_ERR_MSG(extack,
> +				       "Failed to update MAC merge configuration");
> +		goto out_ops;
> +	}
> +
> +	ethtool_notify(dev, ETHTOOL_MSG_MM_NTF, NULL);
> +
> +out_ops:
> +	ethnl_ops_complete(dev);
> +out_rtnl:
> +	rtnl_unlock();
> +out_dev:
> +	dev_put(dev);
> +	return ret;
> +}
> diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
> index e26079e11835..82ad2bd92d70 100644
> --- a/net/ethtool/netlink.c
> +++ b/net/ethtool/netlink.c
> @@ -286,6 +286,8 @@ ethnl_default_requests[__ETHTOOL_MSG_USER_CNT] = {
>  	[ETHTOOL_MSG_STATS_GET]		= &ethnl_stats_request_ops,
>  	[ETHTOOL_MSG_PHC_VCLOCKS_GET]	= &ethnl_phc_vclocks_request_ops,
>  	[ETHTOOL_MSG_MODULE_GET]	= &ethnl_module_request_ops,
> +	[ETHTOOL_MSG_FP_GET]		= &ethnl_fp_request_ops,
> +	[ETHTOOL_MSG_MM_GET]		= &ethnl_mm_request_ops,
>  };
>  
>  static struct ethnl_dump_ctx *ethnl_dump_context(struct netlink_callback *cb)
> @@ -598,6 +600,8 @@ ethnl_default_notify_ops[ETHTOOL_MSG_KERNEL_MAX + 1] = {
>  	[ETHTOOL_MSG_EEE_NTF]		= &ethnl_eee_request_ops,
>  	[ETHTOOL_MSG_FEC_NTF]		= &ethnl_fec_request_ops,
>  	[ETHTOOL_MSG_MODULE_NTF]	= &ethnl_module_request_ops,
> +	[ETHTOOL_MSG_FP_NTF]		= &ethnl_fp_request_ops,
> +	[ETHTOOL_MSG_MM_NTF]		= &ethnl_mm_request_ops,
>  };
>  
>  /* default notification handler */
> @@ -691,6 +695,8 @@ static const ethnl_notify_handler_t ethnl_notify_handlers[] = {
>  	[ETHTOOL_MSG_EEE_NTF]		= ethnl_default_notify,
>  	[ETHTOOL_MSG_FEC_NTF]		= ethnl_default_notify,
>  	[ETHTOOL_MSG_MODULE_NTF]	= ethnl_default_notify,
> +	[ETHTOOL_MSG_FP_NTF]		= ethnl_default_notify,
> +	[ETHTOOL_MSG_MM_NTF]		= ethnl_default_notify,
>  };
>  
>  void ethtool_notify(struct net_device *dev, unsigned int cmd, const void *data)
> @@ -1020,6 +1026,38 @@ static const struct genl_ops ethtool_genl_ops[] = {
>  		.policy = ethnl_module_set_policy,
>  		.maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1,
>  	},
> +	{
> +		.cmd	= ETHTOOL_MSG_FP_GET,
> +		.doit	= ethnl_default_doit,
> +		.start	= ethnl_default_start,
> +		.dumpit	= ethnl_default_dumpit,
> +		.done	= ethnl_default_done,
> +		.policy = ethnl_fp_get_policy,
> +		.maxattr = ARRAY_SIZE(ethnl_fp_get_policy) - 1,
> +	},
> +	{
> +		.cmd	= ETHTOOL_MSG_FP_SET,
> +		.flags	= GENL_UNS_ADMIN_PERM,
> +		.doit	= ethnl_set_fp_param,
> +		.policy = ethnl_fp_set_policy,
> +		.maxattr = ARRAY_SIZE(ethnl_fp_set_policy) - 1,
> +	},
> +	{
> +		.cmd	= ETHTOOL_MSG_MM_GET,
> +		.doit	= ethnl_default_doit,
> +		.start	= ethnl_default_start,
> +		.dumpit	= ethnl_default_dumpit,
> +		.done	= ethnl_default_done,
> +		.policy = ethnl_mm_get_policy,
> +		.maxattr = ARRAY_SIZE(ethnl_mm_get_policy) - 1,
> +	},
> +	{
> +		.cmd	= ETHTOOL_MSG_MM_SET,
> +		.flags	= GENL_UNS_ADMIN_PERM,
> +		.doit	= ethnl_set_mm_cfg,
> +		.policy = ethnl_mm_set_policy,
> +		.maxattr = ARRAY_SIZE(ethnl_mm_set_policy) - 1,
> +	},
>  };
>  
>  static const struct genl_multicast_group ethtool_nl_mcgrps[] = {
> diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h
> index 1653fd2cf0cf..a2e56df74c85 100644
> --- a/net/ethtool/netlink.h
> +++ b/net/ethtool/netlink.h
> @@ -371,6 +371,8 @@ extern const struct ethnl_request_ops ethnl_module_eeprom_request_ops;
>  extern const struct ethnl_request_ops ethnl_stats_request_ops;
>  extern const struct ethnl_request_ops ethnl_phc_vclocks_request_ops;
>  extern const struct ethnl_request_ops ethnl_module_request_ops;
> +extern const struct ethnl_request_ops ethnl_fp_request_ops;
> +extern const struct ethnl_request_ops ethnl_mm_request_ops;
>  
>  extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1];
>  extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1];
> @@ -409,6 +411,10 @@ extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1
>  extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1];
>  extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1];
>  extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_POWER_MODE_POLICY + 1];
> +extern const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1];
> +extern const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1];
> +extern const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1];
> +extern const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1];
>  
>  int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info);
>  int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info);
> @@ -428,6 +434,8 @@ int ethnl_tunnel_info_start(struct netlink_callback *cb);
>  int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
>  int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info);
>  int ethnl_set_module(struct sk_buff *skb, struct genl_info *info);
> +int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info);
> +int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info);
>  
>  extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN];
>  extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN];
> -- 
> 2.34.1
>
Vladimir Oltean Aug. 19, 2022, 4:12 p.m. UTC | #5
On Wed, Aug 17, 2022 at 04:15:11PM -0700, Vinicius Costa Gomes wrote:
> I liked that in the API sense, using this "prio" concept we gain more
> flexibility, and we can express better what the hardware you work with
> can do, i.e. priority (for frame preemption purposes?) and queues are
> orthogonal.
> 
> The problem I have is that the hardware I work with is more limited (as
> are some stmmac-based NICs that I am aware of) frame preemption
> capability and "priority" are attached to queues.

Don't get me wrong, enetc also has "priority" attached to TX rings
(enetc_set_bdr_prio). Then there is another register which maps a TX
priority to a traffic class. This could be altered by the "map" property
of tc-mqprio, but in practice it isn't. The only supported prio->tc map
is "map 0 1 2 3 4 5 6 7".

This is to say, enetc does not have a per-packet priority that gets
passed via BD metadata, but rather, packets inherit the configured
priority of the ring (Linux TX queue).

> From the API perspective, it seems that I could say that "fp-prio" 0 is
> associated with queue 0, fp-prio 1 to queue 1, and so on, and everything
> will work.

You have the tc-mqprio "queues" and "map" to juggle with, to end up
configuring FP per queue based on the provided per-priority settings.

> The only thing that I am not happy at all is that there are exactly 8
> fp-prios.
> 
> The Linux network stack is more flexible than what 802.1Q defines, think
> skb->priority, number of TCs, as you said earlier, I would hate to
> impose some almost artificial limits here. And in future we are going to
> see TSN enabled devices with lots more queues.

~artificial limits~

IEEE 802.1Q says:

| 12.30.1.1 framePreemptionStatusTable structure and data types
| 
| The framePreemptionStatusTable (6.7.2) consists of 8 framePreemptionAdminStatus
| values (12.30.1.1.1), one per priority.

I'm more concerned about setting things straight than about the limits
right now. I don't yet think everything is quite ok in that regard.
The netlink format proposed here is in principle extensible for
priorities > 7, if that will ever make sense.

> 
> In short:
>  - Comment: this section of the RFC is hardware independent, this
>  behavior of queues and priorities being orthogonal is only valid for
>  some implementations;

Yes, and IMO it can only be that way, if I were to be truthful to my
interpretation of the intention 802.1Q spec (please contradict me if you
have a different interpretation of it!).

Note that I do see contradictions in 802.1Q, and I don't know how to
reconcile them. I've suppressed some of them for lack of a logical
explanation. I'm mentioning this for transparency; I don't know
everything either, but I need to make something out of what I do know.

So 802.1Q says this:

| 12.30.1.1.1 framePreemptionAdminStatus
|
| This parameter is the administrative value of the preemption status for
| the priority. It takes value express if frames queued for the priority
| are to be transmitted using the express service for the Port, or
| preemptible if frames queued for the priority are to be transmitted
| using the preemptible service for the Port and preemption is enabled for
| the Port.

So far so good. In 802.1Q definitions, priority is attached to a packet
rather than to a traffic class / queue. But then the same clause continues:

| Priorities that all map to the same traffic class should be constrained
| to use the same value of preemption status.

Which seems to throw a wrench into everything.

It raises two questions:

(A) why is AdminStatus not per traffic class then?
(B) why is the constraint there, what's it trying to protect against?

I've asked around, and I got unsatisfactory answers to both questions.

A seemingly competent answer given to (A) by Rui (CCed) is that the
eMAC/pMAC selection on TX actually takes place in the MAC layer, in what
is called by 802.1AC-2016 "MA_UNITDATA.request" (and what everybody else
calls "MAC client xmit"). The parameters of this MAC service primitive
are:

MA_UNITDATA.request(destination_address,
		    source_address,
		    mac_service_data_unit,
		    priority)

So since only "priority" is passed to the MAC service (and traffic class isn't),
this means that the MAC service can only steer packets to eMAC/pMAC
based on what it's given (i.e. priority).

But upon closer investigation, this explanation doesn't appear to hold
water very well. This is because clause 6.7.1 Support of the ISS by IEEE
Std 802.3 (Ethernet) from 802.1Q says:

| If frame preemption (6.7.2) is supported on a Port, then the IEEE 802.3
| MAC provides the following two MAC service interfaces (99.4 of IEEE Std
| 802.3br™-2016 [B21]):
|
| a) A preemptible MAC (pMAC) service interface
| b) An express MAC (eMAC) service interface
|
| For priority values that are identified in the frame preemption status
| table (6.7.2) as preemptible, frames that are selected for transmission
| shall be transmitted using the pMAC service instance, and for priority
| values that are identified in the frame preemption status table as
| express, frames that are selected for transmission shall be transmitted
| using the eMAC service instance.
| In all other respects, the Port behaves as if it is supported by a
| single MAC service interface. In particular, all frames received by the
| Port are treated as if they were received on a single MAC service
| interface regardless of whether they were received on the eMAC service
| interface or the pMAC service interface, except with respect to frame
| preemption.

So there you go, the MAC has to provide *two* service interfaces, so the
802.1Q upper layer (client of both) can just decide based on an internal
criterion (like, say traffic class) into which service it sends a frame.
So this can't be the reason.

As for (B), it was suggested to me that 802.1Q that doesn't allow out of
order transmission within the same queue/traffic class. After all,
what's at play here is whether a single TX queue device can support FP
or not. Supporting FP would mean reordering PT frames relative to ET
frames.

I think this explanation is unsatisfactory too. Here's the only
reference I saw in 802.1Q to ordering guarantees. Basically those
guarantees are all *per priority*, and since PT/ET is also per priority,
I don't see why reordering there would violate this:

| 6.5.3 Frame misordering
| The MAC Service (IEEE Std 802.1AC) permits a negligible rate of
| reordering of frames with a given priority for a given combination of
| destination address, source address, and flow hash, if present,
| transmitted on a given VLAN.
| MA_UNITDATA.indication service primitives corresponding to
| MA_UNITDATA.request primitives, with the same requested priority and for
| the same combination of VLAN classification, destination address, source
| address, and flow hash, if present, are received in the same order as
| the request primitives were processed.

So for anything to make sense for me at all, I simply had to block out
that phrase from my mind. I'm posting the concern here, publicly, in
case someone can enlighten me.

>  - Question: would it be against the intention of the API to have a 1:1
>  map of priorities to queues?

No; as mentioned, going through mqprio's "map" and "queues" to resolve
the priority to a queue is something that I intended to be possible.
Sure, it's not handy either. It would have been handier if the
"admin-status" array was part of the tc-mqprio config, like you did in
your RFC.

But right now I'm trying to not close the possibility for single queue
devices (which won't implement mqprio) to support FP, since I haven't
seen anything convincing that would disprove such a hw design as
infeasible. If someone could share some compelling insight into this it
would be really appreciated.

>  - Deal breaker: fixed 8 prios;

idk, I don't think that's our biggest problem right now honestly.
Vinicius Costa Gomes Aug. 24, 2022, 12:35 a.m. UTC | #6
Hi Vladimir,

Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> On Wed, Aug 17, 2022 at 04:15:11PM -0700, Vinicius Costa Gomes wrote:
>> I liked that in the API sense, using this "prio" concept we gain more
>> flexibility, and we can express better what the hardware you work with
>> can do, i.e. priority (for frame preemption purposes?) and queues are
>> orthogonal.
>> 
>> The problem I have is that the hardware I work with is more limited (as
>> are some stmmac-based NICs that I am aware of) frame preemption
>> capability and "priority" are attached to queues.
>
> Don't get me wrong, enetc also has "priority" attached to TX rings
> (enetc_set_bdr_prio). Then there is another register which maps a TX
> priority to a traffic class. This could be altered by the "map" property
> of tc-mqprio, but in practice it isn't. The only supported prio->tc map
> is "map 0 1 2 3 4 5 6 7".
>
> This is to say, enetc does not have a per-packet priority that gets
> passed via BD metadata, but rather, packets inherit the configured
> priority of the ring (Linux TX queue).
>

I see. Thanks for the explanation. Intel hardware (client/"not data
center", that is) is simpler, we only have one level of "ordering", and
we usually don't change the default: lowest queue index (0) has highest
priority, (1) has lower, and so on.

>> From the API perspective, it seems that I could say that "fp-prio" 0 is
>> associated with queue 0, fp-prio 1 to queue 1, and so on, and everything
>> will work.
>
> You have the tc-mqprio "queues" and "map" to juggle with, to end up
> configuring FP per queue based on the provided per-priority settings.
>
>> The only thing that I am not happy at all is that there are exactly 8
>> fp-prios.
>> 
>> The Linux network stack is more flexible than what 802.1Q defines, think
>> skb->priority, number of TCs, as you said earlier, I would hate to
>> impose some almost artificial limits here. And in future we are going to
>> see TSN enabled devices with lots more queues.
>
> ~artificial limits~
>
> IEEE 802.1Q says:
>
> | 12.30.1.1 framePreemptionStatusTable structure and data types
> | 
> | The framePreemptionStatusTable (6.7.2) consists of 8 framePreemptionAdminStatus
> | values (12.30.1.1.1), one per priority.
>
> I'm more concerned about setting things straight than about the limits
> right now. I don't yet think everything is quite ok in that regard.
> The netlink format proposed here is in principle extensible for
> priorities > 7, if that will ever make sense.

Yes, as the limits are not in the UAPI this is a minor point, agreed.

What I am concerned is about use cases not handled by IEEE 802.1Q, for
example, thinking about that other thread about PCP/DSCP: how about a
network that wants to have as many priorities as the DSCP field (6 bits)
allows, together with frame preemption (in the future we may have
hardware that supports that). I am only thinking that we should not
close that door.

Again, minor point.

>
>> 
>> In short:
>>  - Comment: this section of the RFC is hardware independent, this
>>  behavior of queues and priorities being orthogonal is only valid for
>>  some implementations;
>
> Yes, and IMO it can only be that way, if I were to be truthful to my
> interpretation of the intention 802.1Q spec (please contradict me if you
> have a different interpretation of it!).
>

Sorry if I made you go that deep in the spec.

I was only trying to say that for some implementations it's
difficult/impossible to have totally orthogonal priorities vs. queue
assignments, and that the commit message needs to take that into account
(because the API introduced in this commit is implementation
independent).

> Note that I do see contradictions in 802.1Q, and I don't know how to
> reconcile them. I've suppressed some of them for lack of a logical
> explanation. I'm mentioning this for transparency; I don't know
> everything either, but I need to make something out of what I do know.
>
> So 802.1Q says this:
>
> | 12.30.1.1.1 framePreemptionAdminStatus
> |
> | This parameter is the administrative value of the preemption status for
> | the priority. It takes value express if frames queued for the priority
> | are to be transmitted using the express service for the Port, or
> | preemptible if frames queued for the priority are to be transmitted
> | using the preemptible service for the Port and preemption is enabled for
> | the Port.
>
> So far so good. In 802.1Q definitions, priority is attached to a packet
> rather than to a traffic class / queue. But then the same clause continues:
>
> | Priorities that all map to the same traffic class should be constrained
> | to use the same value of preemption status.
>
> Which seems to throw a wrench into everything.
>
> It raises two questions:
>
> (A) why is AdminStatus not per traffic class then?
> (B) why is the constraint there, what's it trying to protect against?
>
> I've asked around, and I got unsatisfactory answers to both questions.
>
> A seemingly competent answer given to (A) by Rui (CCed) is that the
> eMAC/pMAC selection on TX actually takes place in the MAC layer, in what
> is called by 802.1AC-2016 "MA_UNITDATA.request" (and what everybody else
> calls "MAC client xmit"). The parameters of this MAC service primitive
> are:
>
> MA_UNITDATA.request(destination_address,
> 		    source_address,
> 		    mac_service_data_unit,
> 		    priority)
>
> So since only "priority" is passed to the MAC service (and traffic class isn't),
> this means that the MAC service can only steer packets to eMAC/pMAC
> based on what it's given (i.e. priority).
>
> But upon closer investigation, this explanation doesn't appear to hold
> water very well. This is because clause 6.7.1 Support of the ISS by IEEE
> Std 802.3 (Ethernet) from 802.1Q says:
>
> | If frame preemption (6.7.2) is supported on a Port, then the IEEE 802.3
> | MAC provides the following two MAC service interfaces (99.4 of IEEE Std
> | 802.3br™-2016 [B21]):
> |
> | a) A preemptible MAC (pMAC) service interface
> | b) An express MAC (eMAC) service interface
> |
> | For priority values that are identified in the frame preemption status
> | table (6.7.2) as preemptible, frames that are selected for transmission
> | shall be transmitted using the pMAC service instance, and for priority
> | values that are identified in the frame preemption status table as
> | express, frames that are selected for transmission shall be transmitted
> | using the eMAC service instance.
> | In all other respects, the Port behaves as if it is supported by a
> | single MAC service interface. In particular, all frames received by the
> | Port are treated as if they were received on a single MAC service
> | interface regardless of whether they were received on the eMAC service
> | interface or the pMAC service interface, except with respect to frame
> | preemption.
>
> So there you go, the MAC has to provide *two* service interfaces, so the
> 802.1Q upper layer (client of both) can just decide based on an internal
> criterion (like, say traffic class) into which service it sends a frame.
> So this can't be the reason.
>
> As for (B), it was suggested to me that 802.1Q that doesn't allow out of
> order transmission within the same queue/traffic class. After all,
> what's at play here is whether a single TX queue device can support FP
> or not. Supporting FP would mean reordering PT frames relative to ET
> frames.
>
> I think this explanation is unsatisfactory too. Here's the only
> reference I saw in 802.1Q to ordering guarantees. Basically those
> guarantees are all *per priority*, and since PT/ET is also per priority,
> I don't see why reordering there would violate this:
>
> | 6.5.3 Frame misordering
> | The MAC Service (IEEE Std 802.1AC) permits a negligible rate of
> | reordering of frames with a given priority for a given combination of
> | destination address, source address, and flow hash, if present,
> | transmitted on a given VLAN.
> | MA_UNITDATA.indication service primitives corresponding to
> | MA_UNITDATA.request primitives, with the same requested priority and for
> | the same combination of VLAN classification, destination address, source
> | address, and flow hash, if present, are received in the same order as
> | the request primitives were processed.
>
> So for anything to make sense for me at all, I simply had to block out
> that phrase from my mind. I'm posting the concern here, publicly, in
> case someone can enlighten me.
>

I, truly, admire your effort (and the time you took) and your
explanation. I think I understand where you come from better. Thank you.

Taking a few steps back (I don't like this expression, but anyway), I
think that strict conformance is not an objective of the Linux network
stack as a whole. What I mean by "not strict": that it doesn't try to
follow the letter of the "baggy pants" model that IEEE 802.* and friends
define. Being efficient/useful has a higher importance.

But being certifiable is a target/ideal, i.e. the "on the wire" and user
visible (statistics, etc) behavior can/"should be able" to be configured
to pass conformance tests or even better, interoperability tests.

Now back on track, in my mental model, this simplifies things to a
couple of questions that I ask myself about this RFC, in particular
about this priority idea:
 - Is this idea useful? My answer is yes.
 - Can this idea be used to configure a certifiable system? Also, yes.

So, I have no problems with it.

>>  - Question: would it be against the intention of the API to have a 1:1
>>  map of priorities to queues?
>
> No; as mentioned, going through mqprio's "map" and "queues" to resolve
> the priority to a queue is something that I intended to be possible.
> Sure, it's not handy either. It would have been handier if the
> "admin-status" array was part of the tc-mqprio config, like you did in
> your RFC.

Just to be sure, say that my plan is that I document that for the igc
driver, the mapping of priorities for the "FP" command is prio (0) ->
queue (0), prio (1) -> queue (1), and so on. Note that the mqprio
mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to
queue 0. Would this be ok?

>
> But right now I'm trying to not close the possibility for single queue
> devices (which won't implement mqprio) to support FP, since I haven't
> seen anything convincing that would disprove such a hw design as
> infeasible. If someone could share some compelling insight into this it
> would be really appreciated.
>

The important part is that you didn't close that door. If someone is
brave enough to design that hw, it seems that it can be made to work,
and that's good enough.

>>  - Deal breaker: fixed 8 prios;
>
> idk, I don't think that's our biggest problem right now honestly.

Perhaps I am being too optimistic, but the way I see, your proposal is
already expressible enough to be used in a "certifiable" system. I was
too harsh with "deal breaker.

So, what is the biggest problem that you see?

Overall, no issues from my side.

Cheers,
Vladimir Oltean Sept. 7, 2022, 8:57 p.m. UTC | #7
On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote:
> Yes, as the limits are not in the UAPI this is a minor point, agreed.
> 
> What I am concerned is about use cases not handled by IEEE 802.1Q, for
> example, thinking about that other thread about PCP/DSCP: how about a
> network that wants to have as many priorities as the DSCP field (6 bits)
> allows, together with frame preemption (in the future we may have
> hardware that supports that). I am only thinking that we should not
> close that door.
> 
> Again, minor point.

Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE
constants of 16 rather than the 8 you'd expect, I guess I don't have
that big of a problem to also expose 16 admin-status values for FP.
I can limit the drivers I have to only look at the first 8 entries.
But what do 16 priorities mean to ethtool? 16 priorities are a tc thing.
Does ethtool even have a history of messing with packet priorities?
What am I even doing? lol.

> Sorry if I made you go that deep in the spec.
> 
> I was only trying to say that for some implementations it's
> difficult/impossible to have totally orthogonal priorities vs. queue
> assignments, and that the commit message needs to take that into account
> (because the API introduced in this commit is implementation
> independent).

Yes, the comment is totally valid. However, between "difficult" and
"impossible" there is a big leap, and if it is really impossible, I'd
like to know precisely why.

> I, truly, admire your effort (and the time you took) and your
> explanation. I think I understand where you come from better. Thank you.
> 
> Taking a few steps back (I don't like this expression, but anyway), I
> think that strict conformance is not an objective of the Linux network
> stack as a whole. What I mean by "not strict": that it doesn't try to
> follow the letter of the "baggy pants" model that IEEE 802.* and friends
> define. Being efficient/useful has a higher importance.
> 
> But being certifiable is a target/ideal, i.e. the "on the wire" and user
> visible (statistics, etc) behavior can/"should be able" to be configured
> to pass conformance tests or even better, interoperability tests.
> 
> Now back on track, in my mental model, this simplifies things to a
> couple of questions that I ask myself about this RFC, in particular
> about this priority idea:
>  - Is this idea useful? My answer is yes.
>  - Can this idea be used to configure a certifiable system? Also, yes.
> 
> So, I have no problems with it.
> 

I've taken quite a few steps back now, unfortunately I'm still back to
where I was :)

I've talked to more people within NXP, explained to them that the
standard technically does not disallow the existence of FP for single
queue devices, for this and that reason. The only responses I got varied
from the equivalent of a frightened "brrr", to a resounding "no, that
isn't what was intended". Of course I tried to dig deeper, and I was
told that the configuration is per (PCP) priority because this is the
externally visible label carried by packets, so an external management
system that has to coordinate FP across a LAN does not need to know how
many traffic classes each node has, or how the priorities map to the
traffic classes. Only the externally visible behavior is important,
which is that packets with priority (PCP) X are sent on the wire using a
preemptable SFD or a normal SFD, because the configuration also affects
what the other nodes in the network expect. You might contrast this with
tc-taprio, where the open/closed gates really are defined per traffic
class, and you might wonder why it wasn't important there for the
configuration to also be handled using the externally-visible priority
(PCP) as input. Yes, but with tc-taprio, you can schedule within your
traffic classes all you want, but this does not change the externally
visible appearance of a frame, so it doesn't matter. The scheduling is
orthogonal to whether a packet will be sent as preemptable or not.

Is this explanation satisfactory, and where does it leave us? For me it
isn't, and it leaves me nowhere new, but it's the best I got.

So ok, single TX queue devices with FP probably are possibly a fun
physics class experiment, but practically minded vendors won't implement
them. But does the alternate justification given change my design decision
in any way (i.e. expose "preemptable priorities" in ethtool as opposed
to "preemptable queues" in tc-mqprio and tc-taprio as you did)? No.
It just becomes a matter of enforcing the recommended restriction that
preemptable and express priorities shouldn't be mixed within the same
traffic class, something which my current patch set does not do.

[ yes, "should" is a technical term in IEEE standards which means "not
  mandatory", and that's precisely how the constraint from 12.30.1.1.1
  was formulated ]

I guess when push comes to shove, somebody will have to answer the
question of why was FP exposed in Linux by something other than what the
standard defined it for (prio), and I wouldn't know how to answer that.

> >>  - Question: would it be against the intention of the API to have a 1:1
> >>  map of priorities to queues?
> >
> > No; as mentioned, going through mqprio's "map" and "queues" to resolve
> > the priority to a queue is something that I intended to be possible.
> > Sure, it's not handy either. It would have been handier if the
> > "admin-status" array was part of the tc-mqprio config, like you did in
> > your RFC.
> 
> Just to be sure, say that my plan is that I document that for the igc
> driver, the mapping of priorities for the "FP" command is prio (0) ->
> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio
> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to
> queue 0. Would this be ok?

If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands
in queue 0 where your actual FP knob is, then I'd expect the driver to
follow the reverse path between queue -> tc -> prios mapped to that tc.
And I would not state that prio 0 is queue 0 and skb->priority values
from the whole spectrum may land there, kthxbye. Also, didn't you say
earlier that "lowest queue index (0) has highest priority"? In Linux,
skb->priority 0 has lower priority that skb->priority 1. How does that
work out?

In fact, for both enetc and felix, I have to do this exact thing as you,
except that my hardware knobs are halfway in the path (per tc, not per
queue). The reason why I'm not doing it is because we only consider the
1:1 prio_tc_map, so we use "prio" and "tc" interchangeably.

It can't even be any other way given the current code, since the struct
tc_taprio_qopt_offload passed via ndo_setup_tc does not even contain the
tc_mqprio_qopt sub-structure with the prio_tc_map. Drivers are probably
expected to look, from the ndo_setup_tc hook, at their dev->prio_tc_map
as changed by netdev_set_prio_tc_map() in taprio_change(). Why this API
is different compared to struct tc_mqprio_qopt_offload, I don't know.

What prevents you from doing this? I'd like to understand as precisely
as you can explain, to see how what I'm proposing really sounds when
applied to Intel hardware.

> >>  - Deal breaker: fixed 8 prios;
> >
> > idk, I don't think that's our biggest problem right now honestly.
> 
> Perhaps I am being too optimistic, but the way I see, your proposal is
> already expressible enough to be used in a "certifiable" system. I was
> too harsh with "deal breaker.
> 
> So, what is the biggest problem that you see?
> 
> Overall, no issues from my side.

Let me recap my action items out loud for v2 (some of them weren't
discussed publicly per se).

* Enforce the constraint recommended by 12.30.1.1.1, somewhere as a
  static inline helper function that can be called by drivers, from the
  tc-taprio/tc-mqprio and ethtool-fp code paths. Note that I could very
  quickly run into a very deep rabbit hole here, if I actually bother to
  offload to hardware (enetc) the prio_tc_map communicated from tc-mqprio.
  I'll try to avoid that. I noticed that the mqprio_parse_opt() function
  doesn't validate the provided queue offsets and count if hw offload is
  requested (therefore allowing overlapping queue ranges). I really
  dislike that, because it was probably done for a reason.

* Port the isochron script I have for enetc (endpoint) FP latency
  measurements to kselftest format.

* Do something such that eMAC/pMAC counters are also summed up somewhere
  centrally by ethtool, and reported to the user either individually or
  aggregated.

* Reorganize the netlink UAPI such as to remove ETHTOOL_A_FP_PARAM_TABLE
  and just have multiple ETHTOOL_A_FP_PARAM_ENTRY nests in the parent.

* The enetc verification state machine is off by default. The felix
  verification state machine is on by default. I guess we should figure
  out a reasonable default for all drivers out there?

* For some reason, I notice that the verification state machine remains
  FAILED on felix when it links to enetc and that has verification
  enabled. I need to disable it on enetc, enable it again, and only then
  will the MAC merge interrupt on felix say that verification completed
  successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up()
  to toggle verification off and on, if it was requested, in order for
  this to be seamless to the user.

If it's only these, nothing seems exactly major. However, I assume this
is only the beginning. I fully expect there to be one more round of
"why not dcbnl? why not tc?" nitpicks, for which I'm not exactly
mentally ready (if credible arguments are to be put forward, I'd rather
have them now while I haven't yet put in the extra work in the direction
I'm about to).
Vinicius Costa Gomes Sept. 10, 2022, 12:19 a.m. UTC | #8
Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote:
>> Yes, as the limits are not in the UAPI this is a minor point, agreed.
>> 
>> What I am concerned is about use cases not handled by IEEE 802.1Q, for
>> example, thinking about that other thread about PCP/DSCP: how about a
>> network that wants to have as many priorities as the DSCP field (6 bits)
>> allows, together with frame preemption (in the future we may have
>> hardware that supports that). I am only thinking that we should not
>> close that door.
>> 
>> Again, minor point.
>
> Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE
> constants of 16 rather than the 8 you'd expect, I guess I don't have
> that big of a problem to also expose 16 admin-status values for FP.
> I can limit the drivers I have to only look at the first 8 entries.
> But what do 16 priorities mean to ethtool? 16 priorities are a tc thing.
> Does ethtool even have a history of messing with packet priorities?
> What am I even doing? lol.
>

That's a good point. ethtool doesn't has a history of knowing about
priorities (and what they even mean). But I like the flexibility that
this approach provides.

>> Sorry if I made you go that deep in the spec.
>> 
>> I was only trying to say that for some implementations it's
>> difficult/impossible to have totally orthogonal priorities vs. queue
>> assignments, and that the commit message needs to take that into account
>> (because the API introduced in this commit is implementation
>> independent).
>
> Yes, the comment is totally valid. However, between "difficult" and
> "impossible" there is a big leap, and if it is really impossible, I'd
> like to know precisely why.
>

Sometimes the enemy of good is perfect (in this case precisely). Perhaps
aiming for "good enough" understanding?

>> I, truly, admire your effort (and the time you took) and your
>> explanation. I think I understand where you come from better. Thank you.
>> 
>> Taking a few steps back (I don't like this expression, but anyway), I
>> think that strict conformance is not an objective of the Linux network
>> stack as a whole. What I mean by "not strict": that it doesn't try to
>> follow the letter of the "baggy pants" model that IEEE 802.* and friends
>> define. Being efficient/useful has a higher importance.
>> 
>> But being certifiable is a target/ideal, i.e. the "on the wire" and user
>> visible (statistics, etc) behavior can/"should be able" to be configured
>> to pass conformance tests or even better, interoperability tests.
>> 
>> Now back on track, in my mental model, this simplifies things to a
>> couple of questions that I ask myself about this RFC, in particular
>> about this priority idea:
>>  - Is this idea useful? My answer is yes.
>>  - Can this idea be used to configure a certifiable system? Also, yes.
>> 
>> So, I have no problems with it.
>> 
>
> I've taken quite a few steps back now, unfortunately I'm still back to
> where I was :)
>
> I've talked to more people within NXP, explained to them that the
> standard technically does not disallow the existence of FP for single
> queue devices, for this and that reason. The only responses I got varied
> from the equivalent of a frightened "brrr", to a resounding "no, that
> isn't what was intended". Of course I tried to dig deeper, and I was
> told that the configuration is per (PCP) priority because this is the
> externally visible label carried by packets, so an external management
> system that has to coordinate FP across a LAN does not need to know how
> many traffic classes each node has, or how the priorities map to the
> traffic classes. Only the externally visible behavior is important,
> which is that packets with priority (PCP) X are sent on the wire using a
> preemptable SFD or a normal SFD, because the configuration also affects
> what the other nodes in the network expect. You might contrast this with
> tc-taprio, where the open/closed gates really are defined per traffic
> class, and you might wonder why it wasn't important there for the
> configuration to also be handled using the externally-visible priority
> (PCP) as input. Yes, but with tc-taprio, you can schedule within your
> traffic classes all you want, but this does not change the externally
> visible appearance of a frame, so it doesn't matter. The scheduling is
> orthogonal to whether a packet will be sent as preemptable or not.
>
> Is this explanation satisfactory, and where does it leave us? For me it
> isn't, and it leaves me nowhere new, but it's the best I got.
>
> So ok, single TX queue devices with FP probably are possibly a fun
> physics class experiment, but practically minded vendors won't implement
> them. But does the alternate justification given change my design decision
> in any way (i.e. expose "preemptable priorities" in ethtool as opposed
> to "preemptable queues" in tc-mqprio and tc-taprio as you did)? No.
> It just becomes a matter of enforcing the recommended restriction that
> preemptable and express priorities shouldn't be mixed within the same
> traffic class, something which my current patch set does not do.
>
> [ yes, "should" is a technical term in IEEE standards which means "not
>   mandatory", and that's precisely how the constraint from 12.30.1.1.1
>   was formulated ]
>
> I guess when push comes to shove, somebody will have to answer the
> question of why was FP exposed in Linux by something other than what the
> standard defined it for (prio), and I wouldn't know how to answer that.
>

In my mind the answer is: because it's cool to support more use-cases
than the standard defines, Linux is still a vehicle for experimentation,
more space for our users to play.

>> >>  - Question: would it be against the intention of the API to have a 1:1
>> >>  map of priorities to queues?
>> >
>> > No; as mentioned, going through mqprio's "map" and "queues" to resolve
>> > the priority to a queue is something that I intended to be possible.
>> > Sure, it's not handy either. It would have been handier if the
>> > "admin-status" array was part of the tc-mqprio config, like you did in
>> > your RFC.
>> 
>> Just to be sure, say that my plan is that I document that for the igc
>> driver, the mapping of priorities for the "FP" command is prio (0) ->
>> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio
>> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to
>> queue 0. Would this be ok?
>
> If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands
> in queue 0 where your actual FP knob is, then I'd expect the driver to
> follow the reverse path between queue -> tc -> prios mapped to that tc.
> And I would not state that prio 0 is queue 0 and skb->priority values
> from the whole spectrum may land there, kthxbye. Also, didn't you say
> earlier that "lowest queue index (0) has highest priority"? In Linux,
> skb->priority 0 has lower priority that skb->priority 1. How does that
> work out?
>

Ah, sorry for the confusion, when I said "lowest queue index (0) has
highest priority" it was more in the sense that queue 0 has higher
precedence, packets from it are fetched first, that kind of thing.

> In fact, for both enetc and felix, I have to do this exact thing as you,
> except that my hardware knobs are halfway in the path (per tc, not per
> queue). The reason why I'm not doing it is because we only consider the
> 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably.
>

Hm, that's interesting. I think that's a difference on how we are
modeling our traffic, here's one very common mqprio config for AVB use
cases:

$ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
      num_tc 3 \
      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
      queues 1@0 1@1 2@2 \
      hw 0

One priority for Class A traffic, one priority for Class B traffic and
multiple priorities mapped to a single TC, the rest is Best Effort and
it's mapped to multiple queues. The way we model traffic, TCs and
priorities are different concepts.

One important detail is that in TSN mode, we use the "priority" (meaning
precedence in this context) fetch order for the queues: queue 0 first,
then queue 1, and so on.

> It can't even be any other way given the current code, since the struct
> tc_taprio_qopt_offload passed via ndo_setup_tc does not even contain the
> tc_mqprio_qopt sub-structure with the prio_tc_map. Drivers are probably
> expected to look, from the ndo_setup_tc hook, at their dev->prio_tc_map
> as changed by netdev_set_prio_tc_map() in taprio_change(). Why this API
> is different compared to struct tc_mqprio_qopt_offload, I don't know.

At least for igb and igc I never felt it was necessary to look at the
prio_tc_map. The core netstack will already enqueue each packet to the
right queue. All I need to know is how to configure each queue.

Also note that for igb and igc I didn't ndo_setup_tc for mqprio, because
the default behavior is fine, that's also why that taprio doesn't send
that information (but this is easily fixed).

This could be a consequence that prio_tc_map (and friends) was
introduced by Intel folks (long before my time) and this doesn't work
that well for other vendors.

>
> What prevents you from doing this? I'd like to understand as precisely
> as you can explain, to see how what I'm proposing really sounds when
> applied to Intel hardware.
>

If I am understanding you correctly, nothing prevents me from doing
that. It's just that everything worked out that from the driver side I
don't need to know about priorities, traffic classes, I only need to
care about queues.

That's why I for my "ethtool" proposal I focused on setting preemption
per queue, because the drivers I work with don't need to know about
anything else. But that can change if necessary. The cost doesn't seem
too big.

>> >>  - Deal breaker: fixed 8 prios;
>> >
>> > idk, I don't think that's our biggest problem right now honestly.
>> 
>> Perhaps I am being too optimistic, but the way I see, your proposal is
>> already expressible enough to be used in a "certifiable" system. I was
>> too harsh with "deal breaker.
>> 
>> So, what is the biggest problem that you see?
>> 
>> Overall, no issues from my side.
>
> Let me recap my action items out loud for v2 (some of them weren't
> discussed publicly per se).
>
> * Enforce the constraint recommended by 12.30.1.1.1, somewhere as a
>   static inline helper function that can be called by drivers, from the
>   tc-taprio/tc-mqprio and ethtool-fp code paths. Note that I could very
>   quickly run into a very deep rabbit hole here, if I actually bother to
>   offload to hardware (enetc) the prio_tc_map communicated from tc-mqprio.
>   I'll try to avoid that. I noticed that the mqprio_parse_opt() function
>   doesn't validate the provided queue offsets and count if hw offload is
>   requested (therefore allowing overlapping queue ranges). I really
>   dislike that, because it was probably done for a reason.
>

I agree with that mqprio offload lack of validation being weird.

+1 for the helper, those kind of functions help document a lot the
expectations.

> * Port the isochron script I have for enetc (endpoint) FP latency
>   measurements to kselftest format.
>
> * Do something such that eMAC/pMAC counters are also summed up somewhere
>   centrally by ethtool, and reported to the user either individually or
>   aggregated.
>
> * Reorganize the netlink UAPI such as to remove ETHTOOL_A_FP_PARAM_TABLE
>   and just have multiple ETHTOOL_A_FP_PARAM_ENTRY nests in the parent.
>

Another +1

> * The enetc verification state machine is off by default. The felix
>   verification state machine is on by default. I guess we should figure
>   out a reasonable default for all drivers out there?
>

From the language of IEEE 802.3 ("disableVerify" and friends), it seems
that the expectation is that verification is enabled by default. *But*
from a interoperability point of view, I would only enable verification
by default after some testing in the wild. Or only send the verify
packet if frame preemption is also enabled. 

> * For some reason, I notice that the verification state machine remains
>   FAILED on felix when it links to enetc and that has verification
>   enabled. I need to disable it on enetc, enable it again, and only then
>   will the MAC merge interrupt on felix say that verification completed
>   successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up()
>   to toggle verification off and on, if it was requested, in order for
>   this to be seamless to the user.
>
> If it's only these, nothing seems exactly major. However, I assume this
> is only the beginning. I fully expect there to be one more round of
> "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly
> mentally ready (if credible arguments are to be put forward, I'd rather
> have them now while I haven't yet put in the extra work in the direction
> I'm about to).

I really know the feeling "not mentally ready".

I am still in the "tc" camp. But I cannot deny that this interface seems
to work for the uses that we have in mind. That's (at least) good enough
for me ;-)


Cheers,
Vladimir Oltean Sept. 10, 2022, 4:36 p.m. UTC | #9
On Fri, Sep 09, 2022 at 05:19:35PM -0700, Vinicius Costa Gomes wrote:
> Vladimir Oltean <vladimir.oltean@nxp.com> writes:
> 
> > On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote:
> >> Yes, as the limits are not in the UAPI this is a minor point, agreed.
> >> 
> >> What I am concerned is about use cases not handled by IEEE 802.1Q, for
> >> example, thinking about that other thread about PCP/DSCP: how about a
> >> network that wants to have as many priorities as the DSCP field (6 bits)
> >> allows, together with frame preemption (in the future we may have
> >> hardware that supports that). I am only thinking that we should not
> >> close that door.
> >> 
> >> Again, minor point.
> >
> > Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE
> > constants of 16 rather than the 8 you'd expect, I guess I don't have
> > that big of a problem to also expose 16 admin-status values for FP.
> > I can limit the drivers I have to only look at the first 8 entries.
> > But what do 16 priorities mean to ethtool? 16 priorities are a tc thing.
> > Does ethtool even have a history of messing with packet priorities?
> > What am I even doing? lol.
> 
> That's a good point. ethtool doesn't has a history of knowing about
> priorities (and what they even mean). But I like the flexibility that
> this approach provides.

Fair. On the other hand, we need to weigh the pros and cons of moving
the adminStatus to other places.

dcbnl

- pro: is a hardware-only interface, just like ethtool, which is
  suitable for a hardware-only feature like FP

- pro: knows about priorities (app table, PFC). I keep coming back to
  PFC as an example because it has the absolute exact same problems as
  FP, which is that it is defined per priority, but there is a "NOTE 2 -
  Mixing PFC and non-PFC priorities in the same queue results in non-PFC
  traffic being paused causing congestion spreading, and therefore is
  not recommended."

- con: I think the dcbnl app table has the same limitation that
  priorities go only up to 8 (the comment above struct dcb_app says
  "@priority: 3-bit unsigned integer indicating priority for IEEE")

- con: where do dcbnl's responsibilities begin, and where do they end,
  exactly? My understanding is that DCB and TSN are exactly the same
  kind of thing, i.e. extensions to 802.1Q (and 802.3, in case of TSN)
  which were made in order to align Ethernet (and bridges) with a set of
  vertical use cases which weren't quite possible before. All is well,
  except TSN was integrated quite differently into the Linux kernel,
  i.o.w. we don't have a "tsnnl" like there is a "dcbnl", but things are
  much more fine-grained, and integrated with the subsystem that they
  extend, rather than creating a subsystem aligned to a use case. So the
  question becomes, is anything we're doing here, for TSN, extending
  DCB? With Microchip's desire to add non-standard APP table selectors
  for VLAN PCP/DEI prioritization, I think the general movement is
  towards more reuse of dcbnl constructs, and towards forgetting that
  dcbnl is for DCB. But there are many things I don't understand about
  dcbnl, for example why is it possible to set an interface's prio-tc
  map using dcb, and what does it even mean in relationship to
  tc-mqprio, tc-taprio etc (or even with "priomap" from tc-ets).

  $ dcb ets set dev eth0 prio-tc 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7

  Jakub had an answer which explained this in terms of Conway's law:
  https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/#24870325
  but as beautiful that answer is, as many questions it still leaves for
  me. Like if we were to accept dcbnl as part of the UAPI for TSN, how
  does this fact fit within our story that the already use the tc-taprio
  map to map priorities to traffic classes?

ethtool:

- pro: is in the same subsystem as MAC merge layer configuration (for
  which I have no doubts that ethtool is the right place)

- pro: is not aligned to any specific vertical use case, like dcbnl is

- con: knows nothing about priorities

- con: for hardware where FP adminStatus is per traffic class rather
  than per priority, we'll have to change the adminStatus not only from
  changes made via ethtool to the FP adminStatus, but also from changes
  made to the prio_tc_map (tc I guess?). But this may also be a subtle
  pro for dcbnl, since that has its own prio_tc_map, so we'd remain
  within the same subsystem?

tc (part of tc-mqprio and tc-taprio):

- open question: does the FP adminStatus influence scheduling or not?
  Normally I'd say no, adminStatus only decides whether the eMAC or pMAC
  will be used to send a frame, and this is orthogonal to scheduling.
  But 802.1Q says that "It should be noted that, other things being
  equal, designating a priority as “express” effectively increases its
  priority above that of any priority designated as “preemptible”."
  Is this phrase enough to justify a modification of the qdisc dequeue
  method in, say, taprio, which would prioritize a lower priority qdisc
  queue A over a higher one B, if A is express but B is preemptable?

- pro: assuming that in lack of tc-mqprio or tc-taprio, an interface's
  number of traffic classes will collapse to 1, then putting FP
  adminStatus here makes it trivial to support hardware with FP per
  traffic class

- con: more difficult for user space to change FP adminStatus
  independently of the current qdisc?

> >> > No; as mentioned, going through mqprio's "map" and "queues" to resolve
> >> > the priority to a queue is something that I intended to be possible.
> >> > Sure, it's not handy either. It would have been handier if the
> >> > "admin-status" array was part of the tc-mqprio config, like you did in
> >> > your RFC.
> >> 
> >> Just to be sure, say that my plan is that I document that for the igc
> >> driver, the mapping of priorities for the "FP" command is prio (0) ->
> >> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio
> >> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to
> >> queue 0. Would this be ok?
> >
> > If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands
> > in queue 0 where your actual FP knob is, then I'd expect the driver to
> > follow the reverse path between queue -> tc -> prios mapped to that tc.
> > And I would not state that prio 0 is queue 0 and skb->priority values
> > from the whole spectrum may land there, kthxbye. Also, didn't you say
> > earlier that "lowest queue index (0) has highest priority"? In Linux,
> > skb->priority 0 has lower priority that skb->priority 1. How does that
> > work out?
> >
> 
> Ah, sorry for the confusion, when I said "lowest queue index (0) has
> highest priority" it was more in the sense that queue 0 has higher
> precedence, packets from it are fetched first, that kind of thing.

Fetched first by whom?
In ENETC, the transmission selection chooses between TX BD rings whose
configured priority maps to the same traffic class based on a weighted
round robin dequeuing policy. Groups of TX BD rings of different
priorities are in strict priority dequeuing relative to each other.
And the number of the TX BD ring has nothing to do with its priority.

> > In fact, for both enetc and felix, I have to do this exact thing as you,
> > except that my hardware knobs are halfway in the path (per tc, not per
> > queue). The reason why I'm not doing it is because we only consider the
> > 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably.
> >
> 
> Hm, that's interesting. I think that's a difference on how we are
> modeling our traffic, here's one very common mqprio config for AVB use
> cases:
> 
> $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
>       num_tc 3 \
>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>       queues 1@0 1@1 2@2 \
>       hw 0
> 
> One priority for Class A traffic, one priority for Class B traffic and
> multiple priorities mapped to a single TC, the rest is Best Effort and
> it's mapped to multiple queues. The way we model traffic, TCs and
> priorities are different concepts.

So here's what I don't understand. Why do you need 16 priority values,
when in the end you'll map them to only 3 egress traffic classes anyway?

> One important detail is that in TSN mode, we use the "priority" (meaning
> precedence in this context) fetch order for the queues: queue 0 first,
> then queue 1, and so on.

So the number of the queue is what actually contributes to the egress
scheduling algorithm of your NIC? Strange. Tell me something so I'm sure
I'm understanding you properly. netdev_pick_tx() will pick a TX queue
using skb_tx_hash(), which hashes between the netdev queues mapped to
the same traffic class as the traffic class that skb->priority is mapped to.
In your case, you have 2@2 (2 TX queues for traffic class 2), call these
queues 2 and 3. You're saying that hardware prioritizes queue 2 over 3.
But the network stack doesn't know that. It hashes equally packets that
go to tc 2 between queue 2 and queue 3. In our case, this misconfiguration
would be immediately obvious, because we could have an offloaded tc-taprio
schedule in ENETC, and Linux could be enqueuing into a ring that ultimately
has a different schedule than software knows about. How does this work
out for you?

> > * The enetc verification state machine is off by default. The felix
> >   verification state machine is on by default. I guess we should figure
> >   out a reasonable default for all drivers out there?
> >
> 
> From the language of IEEE 802.3 ("disableVerify" and friends), it seems
> that the expectation is that verification is enabled by default. *But*
> from a interoperability point of view, I would only enable verification
> by default after some testing in the wild. Or only send the verify
> packet if frame preemption is also enabled. 

I wonder whether the answer to this one is actually "Additional Ethernet
Capabilities TLV" (802.3 clause 79.3.7). It would be good if we could
all start off with FP/MM disabled, verification state machine included,
until the LLDP daemon receives a TLV that says PreemptSupported and
PreemptEnabled are true. If both local and remote preemption are
supported and enabled, we tell the kernel to enable the MAC merge layer
and kick off verification. In turn this will alter PreemptActive that
gets set in future replies. I don't know, the flow may be different.

> > * For some reason, I notice that the verification state machine remains
> >   FAILED on felix when it links to enetc and that has verification
> >   enabled. I need to disable it on enetc, enable it again, and only then
> >   will the MAC merge interrupt on felix say that verification completed
> >   successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up()
> >   to toggle verification off and on, if it was requested, in order for
> >   this to be seamless to the user.
> >
> > If it's only these, nothing seems exactly major. However, I assume this
> > is only the beginning. I fully expect there to be one more round of
> > "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly
> > mentally ready (if credible arguments are to be put forward, I'd rather
> > have them now while I haven't yet put in the extra work in the direction
> > I'm about to).
> 
> I really know the feeling "not mentally ready".
> 
> I am still in the "tc" camp. But I cannot deny that this interface seems
> to work for the uses that we have in mind. That's (at least) good enough
> for me ;-)

To me right now, FP adminStatus is not exactly something that the
scheduler is concerned with, and this is why tc is kind of my last
choice. But I agree that it makes certain things simpler. Plus, it seems
to be the only subsystem aware of 16 priorities and 16 traffic classes,
which you seem to care about.
Vinicius Costa Gomes Sept. 14, 2022, 2:59 a.m. UTC | #10
Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> On Fri, Sep 09, 2022 at 05:19:35PM -0700, Vinicius Costa Gomes wrote:
>> Vladimir Oltean <vladimir.oltean@nxp.com> writes:
>> 
>> > On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote:
>> >> Yes, as the limits are not in the UAPI this is a minor point, agreed.
>> >> 
>> >> What I am concerned is about use cases not handled by IEEE 802.1Q, for
>> >> example, thinking about that other thread about PCP/DSCP: how about a
>> >> network that wants to have as many priorities as the DSCP field (6 bits)
>> >> allows, together with frame preemption (in the future we may have
>> >> hardware that supports that). I am only thinking that we should not
>> >> close that door.
>> >> 
>> >> Again, minor point.
>> >
>> > Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE
>> > constants of 16 rather than the 8 you'd expect, I guess I don't have
>> > that big of a problem to also expose 16 admin-status values for FP.
>> > I can limit the drivers I have to only look at the first 8 entries.
>> > But what do 16 priorities mean to ethtool? 16 priorities are a tc thing.
>> > Does ethtool even have a history of messing with packet priorities?
>> > What am I even doing? lol.
>> 
>> That's a good point. ethtool doesn't has a history of knowing about
>> priorities (and what they even mean). But I like the flexibility that
>> this approach provides.
>
> Fair. On the other hand, we need to weigh the pros and cons of moving
> the adminStatus to other places.
>
> dcbnl
>
> - pro: is a hardware-only interface, just like ethtool, which is
>   suitable for a hardware-only feature like FP
>
> - pro: knows about priorities (app table, PFC). I keep coming back to
>   PFC as an example because it has the absolute exact same problems as
>   FP, which is that it is defined per priority, but there is a "NOTE 2 -
>   Mixing PFC and non-PFC priorities in the same queue results in non-PFC
>   traffic being paused causing congestion spreading, and therefore is
>   not recommended."
>
> - con: I think the dcbnl app table has the same limitation that
>   priorities go only up to 8 (the comment above struct dcb_app says
>   "@priority: 3-bit unsigned integer indicating priority for IEEE")
>
> - con: where do dcbnl's responsibilities begin, and where do they end,
>   exactly? My understanding is that DCB and TSN are exactly the same
>   kind of thing, i.e. extensions to 802.1Q (and 802.3, in case of TSN)
>   which were made in order to align Ethernet (and bridges) with a set of
>   vertical use cases which weren't quite possible before. All is well,
>   except TSN was integrated quite differently into the Linux kernel,
>   i.o.w. we don't have a "tsnnl" like there is a "dcbnl", but things are
>   much more fine-grained, and integrated with the subsystem that they
>   extend, rather than creating a subsystem aligned to a use case. So the
>   question becomes, is anything we're doing here, for TSN, extending
>   DCB? With Microchip's desire to add non-standard APP table selectors
>   for VLAN PCP/DEI prioritization, I think the general movement is
>   towards more reuse of dcbnl constructs, and towards forgetting that
>   dcbnl is for DCB. But there are many things I don't understand about
>   dcbnl, for example why is it possible to set an interface's prio-tc
>   map using dcb, and what does it even mean in relationship to
>   tc-mqprio, tc-taprio etc (or even with "priomap" from tc-ets).
>

I don't know. I still have to learn more about DCB. I suppose HW that
has both TSN and DCB features is not too far in the future. So is it
possible that people are going to try to use them together? Does it make
sense? I don't know.

Here are some notes from a quick look at the code, anyone feel free to
correct me, or add anything.

Looking at two drivers code (mlx5 and ice) it seems that the dcbnl
prio-tc map has a similar objective to the netdev prio_tc_map (I am
looking at how that map feeds into each driver ndo_select_queue()), but
inside the driver and with more flexibility ("in the DSCP mode,
overwrite the skb->priority with this value and use this value for
netdev_tx_pick()", that kind of thing). So using tc-mqprio/tc-taprio
with 'dcb ets' 'prio-map' will send traffic to suprising traffic
classes/queues. 

tc-ets priomap feels like another layer, you use that to setup buckets
in which run what seems to be burst protection algorithms, and this
happens after ndo_select_queue() runs, so this map seems almost
independent of the dcbnl prio-tc map (?).

>   $ dcb ets set dev eth0 prio-tc 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
>
>   Jakub had an answer which explained this in terms of Conway's law:
>   https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/#24870325
>   but as beautiful that answer is, as many questions it still leaves for
>   me. Like if we were to accept dcbnl as part of the UAPI for TSN, how
>   does this fact fit within our story that the already use the tc-taprio
>   map to map priorities to traffic classes?

Setting the 'dcb ets' "prio-tc" map together with tc-taprio or
tc-mqprio, doesn't make much sense, true. But if I am understanding the
code right, you could use that 'dcb' command, and configure frame
preemption via ethtool (for example) and it could work, without
tc-taprio or tc-mqprio to setup the prio:tc mapping.

Perhaps, for now, the solution is to disallow this as invalid, that you
can only have one such priority map configured, either dcb prio-map or
taprio/mqprio. Not sure.

For the long term, perhaps the solution is to have a single place that
does this mapping (from skb to traffic classes and then to queues) more
centralized. 'dcb' would populate that as would 'mqprio/taprio'. This
seems to imply that ndo_select_queue() is a escape hatch, that in the
ideal world it would not exist. Would this create worse problems?
(Reality is pointing that it doesn't seem that we can avoid making
drivers aware of this map, so this idea doesn't seem to help the
situation much)

>
> ethtool:
>
> - pro: is in the same subsystem as MAC merge layer configuration (for
>   which I have no doubts that ethtool is the right place)
>
> - pro: is not aligned to any specific vertical use case, like dcbnl is
>
> - con: knows nothing about priorities
>
> - con: for hardware where FP adminStatus is per traffic class rather
>   than per priority, we'll have to change the adminStatus not only from
>   changes made via ethtool to the FP adminStatus, but also from changes
>   made to the prio_tc_map (tc I guess?). But this may also be a subtle
>   pro for dcbnl, since that has its own prio_tc_map, so we'd remain
>   within the same subsystem?

If the driver also supports DCB, it will have the offload functions
implemented, so it be aware of changes in that map. The problem is now
mqprio, that if it isn't offloaded, the driver is not aware of any
changes. But mqprio doesn't implement .change(), which makes things a
bit easier.

'ethtool' is gaining my vote.

>
> tc (part of tc-mqprio and tc-taprio):
>
> - open question: does the FP adminStatus influence scheduling or not?
>   Normally I'd say no, adminStatus only decides whether the eMAC or pMAC
>   will be used to send a frame, and this is orthogonal to scheduling.
>   But 802.1Q says that "It should be noted that, other things being
>   equal, designating a priority as “express” effectively increases its
>   priority above that of any priority designated as “preemptible”."
>   Is this phrase enough to justify a modification of the qdisc dequeue
>   method in, say, taprio, which would prioritize a lower priority qdisc
>   queue A over a higher one B, if A is express but B is preemptable?

I understood this as: if a preemptible packet (higher priority) and
expresss packet (lower priority) arrive at the MAC at the same time, the
express packet will finish transmitting before the preemptible, causing
the express packet to have an 'effective' higher priority.

I don't think we need to change anything on the dequeueing in the qdisc
side (and trying to solve it is going to be awkward, having to check the
neighbor qdiscs for packets being enqueued/peeked, ugh!).

(as a note: for i225, traffic in queues marked as express will only
preempt traffic from lower numbered queues, this behavior doesn't seem
to be forbidden by the standard)

>
> - pro: assuming that in lack of tc-mqprio or tc-taprio, an interface's
>   number of traffic classes will collapse to 1, then putting FP
>   adminStatus here makes it trivial to support hardware with FP per
>   traffic class
>
> - con: more difficult for user space to change FP adminStatus
>   independently of the current qdisc?
>
>> >> > No; as mentioned, going through mqprio's "map" and "queues" to resolve
>> >> > the priority to a queue is something that I intended to be possible.
>> >> > Sure, it's not handy either. It would have been handier if the
>> >> > "admin-status" array was part of the tc-mqprio config, like you did in
>> >> > your RFC.
>> >> 
>> >> Just to be sure, say that my plan is that I document that for the igc
>> >> driver, the mapping of priorities for the "FP" command is prio (0) ->
>> >> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio
>> >> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to
>> >> queue 0. Would this be ok?
>> >
>> > If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands
>> > in queue 0 where your actual FP knob is, then I'd expect the driver to
>> > follow the reverse path between queue -> tc -> prios mapped to that tc.
>> > And I would not state that prio 0 is queue 0 and skb->priority values
>> > from the whole spectrum may land there, kthxbye. Also, didn't you say
>> > earlier that "lowest queue index (0) has highest priority"? In Linux,
>> > skb->priority 0 has lower priority that skb->priority 1. How does that
>> > work out?
>> >
>> 
>> Ah, sorry for the confusion, when I said "lowest queue index (0) has
>> highest priority" it was more in the sense that queue 0 has higher
>> precedence, packets from it are fetched first, that kind of thing.
>
> Fetched first by whom?
> In ENETC, the transmission selection chooses between TX BD rings whose
> configured priority maps to the same traffic class based on a weighted
> round robin dequeuing policy. Groups of TX BD rings of different
> priorities are in strict priority dequeuing relative to each other.
> And the number of the TX BD ring has nothing to do with its priority.

For the hardware I work with:
 - i210: two modes, (1) round robin, by default, or in (2) Qav-mode
   ("TSN), strict priority, fixed by the queue index.
 - i225/i226: two modes, (1) round robin, by default, or in (2) TSN
    mode, strict priority, configurable queue priority.

(to give the full picture, we can also configure how the descriptors are
going to be fetched from host memory, but as host memory is not usually
where the congestion happens, it's less important for this)

>
>> > In fact, for both enetc and felix, I have to do this exact thing as you,
>> > except that my hardware knobs are halfway in the path (per tc, not per
>> > queue). The reason why I'm not doing it is because we only consider the
>> > 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably.
>> >
>> 
>> Hm, that's interesting. I think that's a difference on how we are
>> modeling our traffic, here's one very common mqprio config for AVB use
>> cases:
>> 
>> $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
>>       num_tc 3 \
>>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>>       queues 1@0 1@1 2@2 \
>>       hw 0
>> 
>> One priority for Class A traffic, one priority for Class B traffic and
>> multiple priorities mapped to a single TC, the rest is Best Effort and
>> it's mapped to multiple queues. The way we model traffic, TCs and
>> priorities are different concepts.
>
> So here's what I don't understand. Why do you need 16 priority values,
> when in the end you'll map them to only 3 egress traffic classes anyway?
>

Basically because of how prio_tc_map works and the fact that I want TC 0
to be mapped to queue 0. i.e. if an entry is not specified in the map,
it's going ot be mapped to TC 0. I want to avoid that, historically, in
our models, TC 0 is the "important" one.

>> One important detail is that in TSN mode, we use the "priority" (meaning
>> precedence in this context) fetch order for the queues: queue 0 first,
>> then queue 1, and so on.
>
> So the number of the queue is what actually contributes to the egress
> scheduling algorithm of your NIC? Strange. Tell me something so I'm sure
> I'm understanding you properly. netdev_pick_tx() will pick a TX queue
> using skb_tx_hash(), which hashes between the netdev queues mapped to
> the same traffic class as the traffic class that skb->priority is mapped to.
> In your case, you have 2@2 (2 TX queues for traffic class 2), call these
> queues 2 and 3. You're saying that hardware prioritizes queue 2 over 3.
> But the network stack doesn't know that. It hashes equally packets that
> go to tc 2 between queue 2 and queue 3. In our case, this misconfiguration
> would be immediately obvious, because we could have an offloaded tc-taprio
> schedule in ENETC, and Linux could be enqueuing into a ring that ultimately
> has a different schedule than software knows about. How does this work
> out for you?
>

More or less, repeating a bit what I said before, on i210 it was fixed,
on i225/i226 it can be changed, but we decided to keep the default, less
surprises for our users.

In (some of) our traffic models, the only kind of traffic class that we
allow to go through multiple queues is Best Effort.

And even that it works out because taprio "translates" from traffic
classes to queues when it sends the offload information to the driver,
i.e. the driver knows the schedule of queues, not traffic classes.

>> > * The enetc verification state machine is off by default. The felix
>> >   verification state machine is on by default. I guess we should figure
>> >   out a reasonable default for all drivers out there?
>> >
>> 
>> From the language of IEEE 802.3 ("disableVerify" and friends), it seems
>> that the expectation is that verification is enabled by default. *But*
>> from a interoperability point of view, I would only enable verification
>> by default after some testing in the wild. Or only send the verify
>> packet if frame preemption is also enabled. 
>
> I wonder whether the answer to this one is actually "Additional Ethernet
> Capabilities TLV" (802.3 clause 79.3.7). It would be good if we could
> all start off with FP/MM disabled, verification state machine included,
> until the LLDP daemon receives a TLV that says PreemptSupported and
> PreemptEnabled are true. If both local and remote preemption are
> supported and enabled, we tell the kernel to enable the MAC merge layer
> and kick off verification. In turn this will alter PreemptActive that
> gets set in future replies. I don't know, the flow may be different.
>

That's a good point (your understanding of the flow is similar to mine).
This seems a good idea. But we may have to wait for a bit until we have
a LLDP implementation that supports this.

>> > * For some reason, I notice that the verification state machine remains
>> >   FAILED on felix when it links to enetc and that has verification
>> >   enabled. I need to disable it on enetc, enable it again, and only then
>> >   will the MAC merge interrupt on felix say that verification completed
>> >   successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up()
>> >   to toggle verification off and on, if it was requested, in order for
>> >   this to be seamless to the user.
>> >
>> > If it's only these, nothing seems exactly major. However, I assume this
>> > is only the beginning. I fully expect there to be one more round of
>> > "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly
>> > mentally ready (if credible arguments are to be put forward, I'd rather
>> > have them now while I haven't yet put in the extra work in the direction
>> > I'm about to).
>> 
>> I really know the feeling "not mentally ready".
>> 
>> I am still in the "tc" camp. But I cannot deny that this interface seems
>> to work for the uses that we have in mind. That's (at least) good enough
>> for me ;-)
>
> To me right now, FP adminStatus is not exactly something that the
> scheduler is concerned with, and this is why tc is kind of my last
> choice. But I agree that it makes certain things simpler. Plus, it seems
> to be the only subsystem aware of 16 priorities and 16 traffic classes,
> which you seem to care about.

This is just food for thought, I am not expecting a reply, as it doesn't
help the review of the RFC.

Ideally, we wouldn't have that limit of 16 priorities or traffic
classes. I don't want to limit it even more. This "16" limit is already
going to require some creativity to handle some use cases in a medium
future, I am afraid.

I am already hearing coleagues talking about containers (think the
net_prio namespace) together with VMs, NICs with dozens of queues. All
of this with scheduled traffic. On the wire, the 8 PCP codepoints are
what we have, but internally/"inside the system", we might need more
granularity to organize all those applications.


Cheers,
Vladimir Oltean Sept. 15, 2022, 2:14 p.m. UTC | #11
I won't reply to every point, as that would just diverge more and more
from what I'm really trying to settle on.

On Tue, Sep 13, 2022 at 07:59:29PM -0700, Vinicius Costa Gomes wrote:
> For the long term, perhaps the solution is to have a single place that
> does this mapping (from skb to traffic classes and then to queues) more
> centralized. 'dcb' would populate that as would 'mqprio/taprio'. This
> seems to imply that ndo_select_queue() is a escape hatch, that in the
> ideal world it would not exist. Would this create worse problems?
> (Reality is pointing that it doesn't seem that we can avoid making
> drivers aware of this map, so this idea doesn't seem to help the
> situation much)

That central place is called netdev_set_prio_tc_map(). It is called from
qdiscs and drivers alike. W.r.t. your "ndo_select_queue() escape hatch",
the commit message of 4f57c087de9b ("net: implement mechanism for HW
based QOS") provides some nice insight into networking history, 11 years
later.

> > ethtool:
> >
> > - pro: is in the same subsystem as MAC merge layer configuration (for
> >   which I have no doubts that ethtool is the right place)
> >
> > - pro: is not aligned to any specific vertical use case, like dcbnl is
> >
> > - con: knows nothing about priorities
> >
> > - con: for hardware where FP adminStatus is per traffic class rather
> >   than per priority, we'll have to change the adminStatus not only from
> >   changes made via ethtool to the FP adminStatus, but also from changes
> >   made to the prio_tc_map (tc I guess?). But this may also be a subtle
> >   pro for dcbnl, since that has its own prio_tc_map, so we'd remain
> >   within the same subsystem?
> 
> If the driver also supports DCB, it will have the offload functions
> implemented, so it be aware of changes in that map.

Not so simple. DCB has many ops, you don't have to implement all of them.

> The problem is now mqprio, that if it isn't offloaded, the driver is
> not aware of any changes. But mqprio doesn't implement .change(),
> which makes things a bit easier.
> 
> 'ethtool' is gaining my vote.

Well, ethtool is certainly not aware of changes to the prio:tc map
either, very much less so than the driver :)

> > tc (part of tc-mqprio and tc-taprio):
> >
> > - open question: does the FP adminStatus influence scheduling or not?
> >   Normally I'd say no, adminStatus only decides whether the eMAC or pMAC
> >   will be used to send a frame, and this is orthogonal to scheduling.
> >   But 802.1Q says that "It should be noted that, other things being
> >   equal, designating a priority as “express” effectively increases its
> >   priority above that of any priority designated as “preemptible”."
> >   Is this phrase enough to justify a modification of the qdisc dequeue
> >   method in, say, taprio, which would prioritize a lower priority qdisc
> >   queue A over a higher one B, if A is express but B is preemptable?
> 
> I understood this as: if a preemptible packet (higher priority) and
> expresss packet (lower priority) arrive at the MAC at the same time, the
> express packet will finish transmitting before the preemptible, causing
> the express packet to have an 'effective' higher priority.
> 
> I don't think we need to change anything on the dequeueing in the qdisc
> side (and trying to solve it is going to be awkward, having to check the
> neighbor qdiscs for packets being enqueued/peeked, ugh!).

Agree. So bottom line, express/preemptable does not affect the
scheduling algorithm. This is one reservation I have against FP in tc.
Maybe if we restricted this new option to only the hardware offload
modes of the respective qdiscs.

> (as a note: for i225, traffic in queues marked as express will only
> preempt traffic from lower numbered queues, this behavior doesn't seem
> to be forbidden by the standard)

So I want the frame preemption adminStatus to work hand in hand with the
prio:tc map of the netdev, no matter where/how the adminStatus lands.

Whereas your hardware implementation does not work hand in hand with it.
You'll have to add some restrictions and do some clarification work, sorry.

> For the hardware I work with:
>  - i210: two modes, (1) round robin, by default, or in (2) Qav-mode
>    ("TSN), strict priority, fixed by the queue index.
>  - i225/i226: two modes, (1) round robin, by default, or in (2) TSN
>     mode, strict priority, configurable queue priority.
> 
> (to give the full picture, we can also configure how the descriptors are
> going to be fetched from host memory, but as host memory is not usually
> where the congestion happens, it's less important for this)

So this is to say that i210 has 1 traffic class in mode (1), and a
number of traffic classes equal to the number of queues, in mode (2)?
As for i225/i226, the number of traffic classes is given by the number
of uniqueuly configured queue priorities? Or what would you call a
traffic class, exactly?

> > So here's what I don't understand. Why do you need 16 priority values,
> > when in the end you'll map them to only 3 egress traffic classes anyway?
> >
> 
> Basically because of how prio_tc_map works and the fact that I want TC 0
> to be mapped to queue 0. i.e. if an entry is not specified in the map,
> it's going ot be mapped to TC 0. I want to avoid that, historically, in
> our models, TC 0 is the "important" one.

Strange. I wonder who else expects traffic class 0 to be the important one?
This is clearly a case of 2 vendors using the exact same tooling and
commands, but a configuration script working in the exact opposite way.

> More or less, repeating a bit what I said before, on i210 it was fixed,
> on i225/i226 it can be changed, but we decided to keep the default, less
> surprises for our users.

For yours, sure, maybe. For others.. :)

> In (some of) our traffic models, the only kind of traffic class that we
> allow to go through multiple queues is Best Effort.

With the mention that half (or more) of the best effort traffic will be
Better Effort, due to your secret prioritization scheme within the same
traffic class. And netdev_pick_tx() will essentially be a Russian
roulette, because it hashes packets with the same skb->priority towards
queues of different actual priorities.

> And even that it works out because taprio "translates" from traffic
> classes to queues when it sends the offload information to the driver,
> i.e. the driver knows the schedule of queues, not traffic classes.

Which is incredibly strange to me, since the standard clearly defines
Qbv gates to be per traffic class, and in ENETC, even if we have 2 TX
queues for the same traffic class (one per CPU), the hardware schedule
is still per traffic class and not per independent TX queue (BD ring).

How does this work for i225/i226, if 2 queues are configured for the
same dequeue priority? Do the taprio gates still take effect per queue?

> That's a good point (your understanding of the flow is similar to mine).
> This seems a good idea. But we may have to wait for a bit until we have
> a LLDP implementation that supports this.

This is circular; we have a downstream patch to openlldp that adds
support for this TLV, but it can't go anywhere until there is mainline
kernel support.

What about splitting out MAC merge support from FP support, and me
concentrating on the MAC merge layer for now? They're independent up to
a pretty high level. The MAC merge layer is supposed to be controlled by
the LLDP daemon and be pretty much plug-and-play, while the FP adminStatus
is supposed to be set by some high level administrator, like a NETCONF
client.

We can argue some more about how to best expose the FP adminStatus.
Right now, I'm thinking that for all intents and purposes, the
adminStatus really only makes practical sense when we have at least 2
traffic classes: one onto which we can map the preemptable priorities,
and one onto which we can map the express ones. In turn, it means that
the 'priorities collapsing into the same traffic class' problem that we
have when we delete the taprio/mqprio qdisc can be solved if we put the
FP adminStatus into the same netlink interface that also configures the
prio:tc map and the number of traffic classes (i.e. yes, in tc).

I'll let someone more knowledgeable of our sysrepo-tsn implementation of
a NETCONF server (like Xiaoliang, maybe even Rui or Richie?)
https://github.com/real-time-edge-sw/real-time-edge-sysrepo
comment on whether this would create any undesirable side effect.
The implication is that user space, when asked to change the adminStatus,
will have to figure out whether we are using a taprio, or an mqprio
qdisc, and create different netlink attributes to talk to the kernel to
request that configuration. User space will also have to send an error
towards the NETCONF client if no such qdisc exists (meaning that the
user wants to combine priorities of different express/preemptable values
in the same traffic class, TC0). The qdiscs will also centrally validate
whether such invalid mixed configurations are requested.

I'm mostly OK with this: an UAPI where FP adminStatus is per priority,
but it gets translated into per-tc by the qdisc, and passed to the
driver through ndo_setup_tc(). I am certainly still very much against
the "preemptable queue mask" that you proposed, and IMO you'll have to
do something and clarify what a traffic class even means for Intel
hardware, especially in relationship to multiple queues per tc
(mqprio queues 2@1).

Anyone else who has an opinion of any sort, please feel free.
Xiaoliang Yang Nov. 10, 2022, 10:33 a.m. UTC | #12
On Sep 15, 2022, Vladimir wrote:
> > That's a good point (your understanding of the flow is similar to mine).
> > This seems a good idea. But we may have to wait for a bit until we
> > have a LLDP implementation that supports this.
> 
> This is circular; we have a downstream patch to openlldp that adds support for
> this TLV, but it can't go anywhere until there is mainline kernel support.
> 
> What about splitting out MAC merge support from FP support, and me
> concentrating on the MAC merge layer for now? They're independent up to a
> pretty high level. The MAC merge layer is supposed to be controlled by the
> LLDP daemon and be pretty much plug-and-play, while the FP adminStatus is
> supposed to be set by some high level administrator, like a NETCONF client.
I agree that splitting out MAC merge support from FP support, they are defined
by different spec. I'm trying to add LLDP exchange verification support. The
procedure is like following:
 1. enable preemption on local port by using "ethtool", do not active preemption.
 2. Decode the LLDP TLV of remote port and ensure the remote port supports
and enables preemption.
 3. Run SMD-v/r verify process on local port or active the preemption directly.
 4. Send updated LLDP TLV to remote port.
 5. Disable preemption on local port and repeat step 1-4 when link down/up.
The struct "ethtool_mm_cfg" seems not fit this procedure. I update it:
struct ethtool_mm_cfg {
	u32 verify_time;
	bool verify_disable;
	bool enabled;
+	bool active; // Used to active or start verify preemption by LLDP daemon.
	u8 add_frag_size;
};
If we want to enable/disable the LLDP exchange verification, maybe we need
to add more parameters like "bool lldp_verify_enable"

> 
> We can argue some more about how to best expose the FP adminStatus.
> Right now, I'm thinking that for all intents and purposes, the adminStatus
> really only makes practical sense when we have at least 2 traffic classes: one
> onto which we can map the preemptable priorities, and one onto which we
> can map the express ones. In turn, it means that the 'priorities collapsing into
> the same traffic class' problem that we have when we delete the
> taprio/mqprio qdisc can be solved if we put the FP adminStatus into the same
> netlink interface that also configures the prio:tc map and the number of traffic
> classes (i.e. yes, in tc).
> 
> I'll let someone more knowledgeable of our sysrepo-tsn implementation of a
> NETCONF server (like Xiaoliang, maybe even Rui or Richie?)
> https://github.com/real-time-edge-sw/real-time-edge-sysrepo
> comment on whether this would create any undesirable side effect.
> The implication is that user space, when asked to change the adminStatus, will
> have to figure out whether we are using a taprio, or an mqprio qdisc, and
> create different netlink attributes to talk to the kernel to request that
> configuration. User space will also have to send an error towards the NETCONF
> client if no such qdisc exists (meaning that the user wants to combine priorities
> of different express/preemptable values in the same traffic class, TC0). The
> qdiscs will also centrally validate whether such invalid mixed configurations
> are requested.
The Netconf yang model haven't defined the mac preemption feature of 802.3br,
but it defines the 802.1Qbu preemption. It uses priority to configure the
adminStatus of frame-preemption-status-table. The priority map traffic class is
independent in Netconf, so there is no effect whether we combine FP status
configuration with tc-mqprio or not, just translated it in Kernel is OK.

> 
> I'm mostly OK with this: an UAPI where FP adminStatus is per priority, but it
> gets translated into per-tc by the qdisc, and passed to the driver through
> ndo_setup_tc(). I am certainly still very much against the "preemptable queue
> mask" that you proposed, and IMO you'll have to do something and clarify
> what a traffic class even means for Intel hardware, especially in relationship to
> multiple queues per tc (mqprio queues 2@1).
> 
> Anyone else who has an opinion of any sort, please feel free.
I didn't think about it too much, but I think combined with tc-mqprio command
is more easy to implement, this eliminates the priority of translating to TC.
Example:
tc qdisc add dev eth0 root handle 1: mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
	fp-prios 0xff
Vladimir Oltean Nov. 14, 2022, 3:42 p.m. UTC | #13
Hi Xiaoliang,

On Thu, Nov 10, 2022 at 10:33:50AM +0000, Xiaoliang Yang wrote:
> On Sep 15, 2022, Vladimir wrote:
> > > That's a good point (your understanding of the flow is similar to mine).
> > > This seems a good idea. But we may have to wait for a bit until we
> > > have a LLDP implementation that supports this.
> > 
> > This is circular; we have a downstream patch to openlldp that adds support for
> > this TLV, but it can't go anywhere until there is mainline kernel support.
> > 
> > What about splitting out MAC merge support from FP support, and me
> > concentrating on the MAC merge layer for now? They're independent up to a
> > pretty high level. The MAC merge layer is supposed to be controlled by the
> > LLDP daemon and be pretty much plug-and-play, while the FP adminStatus is
> > supposed to be set by some high level administrator, like a NETCONF client.
> 
> I agree that splitting out MAC merge support from FP support, they are defined
> by different spec. I'm trying to add LLDP exchange verification support. The
> procedure is like following:
>  1. enable preemption on local port by using "ethtool", do not active preemption.

"Enable" and "activate" are different things only if verification is used.
Otherwise, they mean the same thing.

In turn, LLDP does not control whether verification is used or not. So,
from the perspective of the LLDP daemon, you can't "enable" but "not activate"
preemption.

And I wouldn't expect the LLDP flow to require use of ethtool.

And I think that enabling preemption at step 1 is too early. Why enable
it, if we don't know if the link partner supports it?

>  2. Decode the LLDP TLV of remote port and ensure the remote port supports
> and enables preemption.
>  3. Run SMD-v/r verify process on local port or active the preemption directly.

Verification as defined by 802.3 is, as mentioned, independent of the
Additional Ethernet Capabilities TLV.
Which seems to be a good thing, considering that 2 link partners A and B
don't need to have coordinated settings for verification (verify_disable
doesn't need to be in sync). Any device with the MAC Merge layer turned on is
required to send SMD-R (response) frames in response to SMD-V (verification),
regardless of whether it sends SMD-V frames of its own.

So those who enable verification should get a response from those who
don't enable it, and those who don't enable verification don't cause the
link partners who do enable it to fail their own verification.

The only thing to agree on is this: if the link partner supports
preemption, and you support preemption, then how does the LLDP daemon
decide whether to actually *enable* preemption? My assumption is that
it's fine to unconditionally enable it as long as it's supported.
No configuration or user override is needed.

FWIW, I have my own patch on the openlldp project (yes, not lldpd) which
adds support for the MAC Merge layer. I hope I can wrap everything up
some time next week and resend everything in a state that reflects what
I've been cooking in the meantime. I don't have a solution for the
adminStatus yet, but I guess I'll try to add it via the netlink interface
of tc-taprio and tc-mqprio.

>  4. Send updated LLDP TLV to remote port.
>  5. Disable preemption on local port and repeat step 1-4 when link down/up.
> 
> The struct "ethtool_mm_cfg" seems not fit this procedure. I update it:
> struct ethtool_mm_cfg {
> 	u32 verify_time;
> 	bool verify_disable;
> 	bool enabled;
> +	bool active; // Used to active or start verify preemption by LLDP daemon.

What sense does it make to "force" preemption as active? What would this
do in terms of driving the internal state machines of the hardware, i.e.
force FP active in which way? It's active when enabled and verification
is either disabled, or enabled and complete.

> 	u8 add_frag_size;
> };
> If we want to enable/disable the LLDP exchange verification, maybe we need
> to add more parameters like "bool lldp_verify_enable"

Control the LLDP exchange via kernel UAPI? But why? What difference does
it make to the kernel whether there is an LLDP daemon or not? LLDP frames
are normal packets sent via PF_PACKET sockets, no special driver involvement
needed. Not the same thing as SMD-V/SMD-R hardware verification.
Xiaoliang Yang Nov. 15, 2022, 3:01 a.m. UTC | #14
Hi Vladimir,

Thanks for your reply, I comment below and make a summarize at the end.

On Mon, Nov 14, 2022 Vladimir wrote:
> On Thu, Nov 10, 2022 at 10:33:50AM +0000, Xiaoliang Yang wrote:
> > On Sep 15, 2022, Vladimir wrote:
> > > > That's a good point (your understanding of the flow is similar to mine).
> > > > This seems a good idea. But we may have to wait for a bit until we
> > > > have a LLDP implementation that supports this.
> > >
> > > This is circular; we have a downstream patch to openlldp that adds
> > > support for this TLV, but it can't go anywhere until there is mainline kernel
> support.
> > >
> > > What about splitting out MAC merge support from FP support, and me
> > > concentrating on the MAC merge layer for now? They're independent up
> > > to a pretty high level. The MAC merge layer is supposed to be
> > > controlled by the LLDP daemon and be pretty much plug-and-play,
> > > while the FP adminStatus is supposed to be set by some high level
> administrator, like a NETCONF client.
> >
> > I agree that splitting out MAC merge support from FP support, they are
> > defined by different spec. I'm trying to add LLDP exchange
> > verification support. The procedure is like following:
> >  1. enable preemption on local port by using "ethtool", do not active
> preemption.
> 
> "Enable" and "activate" are different things only if verification is used.
> Otherwise, they mean the same thing.
Yes, I know that "Enable" and "activate" are different things. I mean that if we
want to enable the next lldp exchange verification process, we need to enable
it but not let the MAC merge layer working until lldp verify successfully.
There are two verification procedure, one is LLDP verification, based on chapter
"99.4.2 Determining that the link partner supports preemption" of 802.3 spec.
This need lldp exchange and ensure preemption is enabled on remote port, and
this verification doesn’t have parameter to enable/disable it. I just want to add
this feature. Another is SMD-r/v verification, this is hardware verification and
already has "verify_disable" to enable/disable it.

Or we can enable by default through LLDP. That don’t need a new "active"
parameter. Once the LLDP application has verified that remote port supports
preemption, it will enable preemption on local port, and tell remote port to
enable it. If we use ethtool to enable preemption, the preemption will be force
enabled without LLDP verification.


> 
> In turn, LLDP does not control whether verification is used or not. So, from the
> perspective of the LLDP daemon, you can't "enable" but "not activate"
> preemption.
Yes, LLDP does not control whether hardware verification is used or not. Maybe I
shouldn't use the word "active", I mean preemption can't be working before LLDP
verification, even though it's enabled by ethtool.

> 
> And I wouldn't expect the LLDP flow to require use of ethtool.
If LLDP application can’t require to use ethtool driver, how to enable the preemption
After LLDP verification?

> 
> And I think that enabling preemption at step 1 is too early. Why enable it, if we
> don't know if the link partner supports it?
Same as mentioned above. Enabling at step 1 does not actually enable preemption,
it means enable the LLDP verification.

> 
> >  2. Decode the LLDP TLV of remote port and ensure the remote port
> > supports and enables preemption.
> >  3. Run SMD-v/r verify process on local port or active the preemption
> directly.
> 
> Verification as defined by 802.3 is, as mentioned, independent of the
> Additional Ethernet Capabilities TLV.
> Which seems to be a good thing, considering that 2 link partners A and B don't
> need to have coordinated settings for verification (verify_disable doesn't need
> to be in sync). Any device with the MAC Merge layer turned on is required to
> send SMD-R (response) frames in response to SMD-V (verification), regardless
> of whether it sends SMD-V frames of its own.
> 
> So those who enable verification should get a response from those who don't
> enable it, and those who don't enable verification don't cause the link partners
> who do enable it to fail their own verification.
The hardware may not obey this rule. I have tested on LS1028a and i.mx8mp boards,
The verification needs to be enabled at the same time for link partners. So it needs
LLDP verification, and let LLDP application to enable both partner ports at the same
time(The hardware verify three times, and each verify time is 10ms in default).

> 
> The only thing to agree on is this: if the link partner supports preemption, and
> you support preemption, then how does the LLDP daemon decide whether to
> actually *enable* preemption? My assumption is that it's fine to
> unconditionally enable it as long as it's supported.
> No configuration or user override is needed.
Yes, I also consider this way. Enable the preemption once LLDP exchange ensure
that both partner support it. But how to enable preemption via LLDP daemon?
I tried to use the ioctl API of ethtool to enable preemption. Do you have any
suggestion?

> 
> FWIW, I have my own patch on the openlldp project (yes, not lldpd) which adds
> support for the MAC Merge layer. I hope I can wrap everything up some time
> next week and resend everything in a state that reflects what I've been cooking
> in the meantime. I don't have a solution for the adminStatus yet, but I guess I'll
> try to add it via the netlink interface of tc-taprio and tc-mqprio.
It's a good news, I can keep up with upstream patches after you upstream the patches.

> 
> >  4. Send updated LLDP TLV to remote port.
> >  5. Disable preemption on local port and repeat step 1-4 when link
> down/up.
> >
> > The struct "ethtool_mm_cfg" seems not fit this procedure. I update it:
> > struct ethtool_mm_cfg {
> > 	u32 verify_time;
> > 	bool verify_disable;
> > 	bool enabled;
> > +	bool active; // Used to active or start verify preemption by LLDP daemon.
> 
> What sense does it make to "force" preemption as active? What would this do
> in terms of driving the internal state machines of the hardware, i.e.
> force FP active in which way? It's active when enabled and verification is either
> disabled, or enabled and complete.
This is not mean active status of SMD-r/v verification. It means let preemption
Working.

> 
> > 	u8 add_frag_size;
> > };
> > If we want to enable/disable the LLDP exchange verification, maybe we
> > need to add more parameters like "bool lldp_verify_enable"
> 
> Control the LLDP exchange via kernel UAPI? But why? What difference does it
> make to the kernel whether there is an LLDP daemon or not? LLDP frames are
> normal packets sent via PF_PACKET sockets, no special driver involvement
> needed. Not the same thing as SMD-V/SMD-R hardware verification.
I mean that if user want to enable the preemption without LLDP. It's not needed
If we use the second way I mentioned above.

I will summarize my opinion. Enable the preemption by default through LLDP.
Once the LLDP application has verified that remote port supports preemption,
it will enable preemption on local port, and tell remote port to enable it.
If user want to force enable preemption, use ethtool to enable it directly.
The lldp daemon use the ioctl API of ethtool to enable preemption.

Thanks,
Xiaoliang
Vladimir Oltean Nov. 16, 2022, 12:36 p.m. UTC | #15
Hi Xiaoliang,

On Tue, Nov 15, 2022 at 05:01:24AM +0200, Xiaoliang Yang wrote:
> Yes, I know that "Enable" and "activate" are different things. I mean that if we
> want to enable the next lldp exchange verification process, we need to enable
> it but not let the MAC merge layer working until lldp verify successfully.
> There are two verification procedure, one is LLDP verification, based on chapter
> "99.4.2 Determining that the link partner supports preemption" of 802.3 spec.
> This need lldp exchange and ensure preemption is enabled on remote port, and
> this verification doesn’t have parameter to enable/disable it. I just want to add
> this feature.

Why? The mechanism to enable/disable the MAC Merge layer is the ETHTOOL_A_MM_ENABLED
netlink attribute which is settable.

To the kernel, it doesn't matter if ethtool (the program) or openlldp
sets ETHTOOL_A_MM_ENABLED.

> Another is SMD-r/v verification, this is hardware verification and
> already has "verify_disable" to enable/disable it.
> 
> Or we can enable by default through LLDP. That don’t need a new "active"
> parameter. Once the LLDP application has verified that remote port supports
> preemption, it will enable preemption on local port, and tell remote port to
> enable it.

IIUC, it cannot "tell" the link partner to enable preemption. It can
just advertise the following in a subsequent TLV:

Table 79–8—Additional Ethernet capabilities

Bit 0 (preemption capability support) - supported/not supported
Bit 1 (preemption capability status) - enabled/not enabled
Bit 2 (preemption capability active) - active/not active
Bits 4:3 (additional fragment size) - A 2-bit integer value indicating,
in units of 64 octets, the minimum number of octets over 64 octets
required in non-final fragments by the receiver

The only thing we can do, is if the link partner sets Bit 0 to true, to
generate a ETHTOOL_MSG_MM_SET netlink message with ETHTOOL_A_MM_ENABLED
set to true.

We _cannot_ wait for the link partner to set Bit 1 to true, because, if
it's the same implementation as us, nothing will work, we will both wait
for each other...

So either the link partner enables preemption too, as a result of us
advertising a TLV with Bit 0 set, and things eventually work, or the
link partner doesn't enable preemption even though both ends support it.
In that case, preemption doesn't work (and this is also visible in the
TLVs sent by the link partner - Bit 1 and Bit 2 remain zero).

If the SMD-V/SMD-R handshake is enabled locally (ETHTOOL_A_MM_VERIFY_DISABLE
is false), we can also detect that by the fact that the link partner
will not respond with a SMD-R to our SMD-V. So, even if we will set Bit 1
(corresponds to ETHTOOL_A_MM_ENABLED in the ETHTOOL_MSG_MM_GET message)
to true, Bit 2 (corresponding to ETHTOOL_A_MM_ACTIVE in the GET message)
will still remain unset. Otherwise, the condition will remain undetected,
and we will report Bit 2 (active) as true (which it is - verification is
disabled after all).

> If we use ethtool to enable preemption, the preemption will be force
> enabled without LLDP verification.

Yup. I consider that within the expected parameters of the game, no?

> > In turn, LLDP does not control whether verification is used or not. So, from the
> > perspective of the LLDP daemon, you can't "enable" but "not activate"
> > preemption.
> Yes, LLDP does not control whether hardware verification is used or not. Maybe I
> shouldn't use the word "active", I mean preemption can't be working before LLDP
> verification, even though it's enabled by ethtool.

I don't understand this. Why can't preemption work without LLDP
verification? My understanding is that LLDP is just an automated way of
maximizing the enabled feature set depending on what the link partner
advertises as supported. But the same feature set can be force-enabled
manually, if there is prior knowledge that the link partner can do it.
I think of it as a software autoneg, I might be wrong, I don't have too
much XP with LLDP. That's exactly what ethtool (the program, not the
kernel interface) is for - to be able to force settings.

> > And I wouldn't expect the LLDP flow to require use of ethtool.
> If LLDP application can’t require to use ethtool driver, how to enable the preemption
> after LLDP verification?

No, I mean the LLDP flow doesn't require use of ethtool the program, not
ethtool the kernel netlink interface. Of course it requires the latter.

> > And I think that enabling preemption at step 1 is too early. Why enable it, if we
> > don't know if the link partner supports it?
> Same as mentioned above. Enabling at step 1 does not actually enable preemption,
> it means enable the LLDP verification.

Nope. LLDP verification is no more than a user space program that
automates something that can also be done manually, using ethtool the
program. No special netlink messages for LLDP.

> > So those who enable verification should get a response from those who don't
> > enable it, and those who don't enable verification don't cause the link partners
> > who do enable it to fail their own verification.
> The hardware may not obey this rule. I have tested on LS1028a and i.mx8mp boards,
> The verification needs to be enabled at the same time for link partners. So it needs
> LLDP verification, and let LLDP application to enable both partner ports at the same
> time(The hardware verify three times, and each verify time is 10ms in default).

I think the imx8mp has the same EQOS block as the S32G GMAC, right?
If so, I might try experimenting a bit on the S32G linked to LS1028A,
to see exactly what you're talking about.

Keep in mind that hardware quirks don't mean much to what the user space
flow should be. The kernel should deal with quirks. For example, on enetc,
the MAC merge layer must be manually disabled and re-enabled on each
link up/down event by the kernel driver. Otherwise the verification
state machine remains FAILED, as you seem to say. But that changes
exactly nothing in terms of the LLDP flow - because as you say, the
hardware SMD-R/SMD-V verification state machine which you talk about is
independent of LLDP.

> > The only thing to agree on is this: if the link partner supports preemption, and
> > you support preemption, then how does the LLDP daemon decide whether to
> > actually *enable* preemption? My assumption is that it's fine to
> > unconditionally enable it as long as it's supported.
> > No configuration or user override is needed.
> Yes, I also consider this way. Enable the preemption once LLDP exchange ensure
> that both partner support it. But how to enable preemption via LLDP daemon?
> I tried to use the ioctl API of ethtool to enable preemption. Do you have any
> suggestion?

Yeah, see above. The ethtool ioctl is dead and it's not coming back, see ETHTOOL_A_MM_ENABLED.
I wish I could come back to this patch set sooner, but I'm really deadlocked
with some other stuff I need to do this week. Sorry.
diff mbox series

Patch

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 99dc7bfbcd3c..fa504dd22bf6 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -453,6 +453,49 @@  struct ethtool_module_power_mode_params {
 	enum ethtool_module_power_mode mode;
 };
 
+/**
+ * struct ethtool_fp_param - 802.1Q Frame Preemption parameters
+ */
+struct ethtool_fp_param {
+	u8 preemptable_prios;
+	u32 hold_advance;
+	u32 release_advance;
+};
+
+/**
+ * struct ethtool_mm_state - 802.3 MAC merge layer state
+ */
+struct ethtool_mm_state {
+	u32 verify_time;
+	enum ethtool_mm_verify_status verify_status;
+	bool supported;
+	bool enabled;
+	bool active;
+	u8 add_frag_size;
+};
+
+/**
+ * struct ethtool_mm_cfg - 802.3 MAC merge layer configuration
+ */
+struct ethtool_mm_cfg {
+	u32 verify_time;
+	bool verify_disable;
+	bool enabled;
+	u8 add_frag_size;
+};
+
+/* struct ethtool_mm_stats - 802.3 MAC merge layer statistics, as defined in
+ * IEEE 802.3-2018 subclause 30.14.1 oMACMergeEntity managed object class
+ */
+struct ethtool_mm_stats {
+	u64 MACMergeFrameAssErrorCount;
+	u64 MACMergeFrameSmdErrorCount;
+	u64 MACMergeFrameAssOkCount;
+	u64 MACMergeFragCountRx;
+	u64 MACMergeFragCountTx;
+	u64 MACMergeHoldCount;
+};
+
 /**
  * struct ethtool_ops - optional netdev operations
  * @cap_link_lanes_supported: indicates if the driver supports lanes
@@ -624,6 +667,11 @@  struct ethtool_module_power_mode_params {
  *	plugged-in.
  * @set_module_power_mode: Set the power mode policy for the plug-in module
  *	used by the network device.
+ * @get_fp_param: Query the 802.1Q Frame Preemption parameters.
+ * @set_fp_param: Set the 802.1Q Frame Preemption parameters.
+ * @get_mm_state: Query the 802.3 MAC Merge layer state.
+ * @set_mm_cfg: Set the 802.3 MAC Merge layer parameters.
+ * @get_mm_stats: Query the 802.3 MAC Merge layer statistics.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -760,6 +808,17 @@  struct ethtool_ops {
 	int	(*set_module_power_mode)(struct net_device *dev,
 					 const struct ethtool_module_power_mode_params *params,
 					 struct netlink_ext_ack *extack);
+	void	(*get_fp_param)(struct net_device *dev,
+				struct ethtool_fp_param *params);
+	int	(*set_fp_param)(struct net_device *dev,
+				const struct ethtool_fp_param *params,
+				struct netlink_ext_ack *extack);
+	void	(*get_mm_state)(struct net_device *dev,
+				struct ethtool_mm_state *state);
+	int	(*set_mm_cfg)(struct net_device *dev, struct ethtool_mm_cfg *cfg,
+			      struct netlink_ext_ack *extack);
+	void	(*get_mm_stats)(struct net_device *dev,
+				struct ethtool_mm_stats *stats);
 };
 
 int ethtool_check_ops(const struct ethtool_ops *ops);
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 2d5741fd44bb..7a21fcb26a43 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -736,6 +736,21 @@  enum ethtool_module_power_mode {
 	ETHTOOL_MODULE_POWER_MODE_HIGH,
 };
 
+/* Values from ieee8021FramePreemptionAdminStatus */
+enum ethtool_fp_admin_status {
+	ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS = 1,
+	ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE = 2,
+};
+
+enum ethtool_mm_verify_status {
+	ETHTOOL_MM_VERIFY_STATUS_UNKNOWN,
+	ETHTOOL_MM_VERIFY_STATUS_INITIAL,
+	ETHTOOL_MM_VERIFY_STATUS_VERIFYING,
+	ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED,
+	ETHTOOL_MM_VERIFY_STATUS_FAILED,
+	ETHTOOL_MM_VERIFY_STATUS_DISABLED,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index d2fb4f7be61b..658810274c49 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -49,6 +49,10 @@  enum {
 	ETHTOOL_MSG_PHC_VCLOCKS_GET,
 	ETHTOOL_MSG_MODULE_GET,
 	ETHTOOL_MSG_MODULE_SET,
+	ETHTOOL_MSG_FP_GET,
+	ETHTOOL_MSG_FP_SET,
+	ETHTOOL_MSG_MM_GET,
+	ETHTOOL_MSG_MM_SET,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_USER_CNT,
@@ -94,6 +98,10 @@  enum {
 	ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY,
 	ETHTOOL_MSG_MODULE_GET_REPLY,
 	ETHTOOL_MSG_MODULE_NTF,
+	ETHTOOL_MSG_FP_GET_REPLY,
+	ETHTOOL_MSG_FP_NTF,
+	ETHTOOL_MSG_MM_GET_REPLY,
+	ETHTOOL_MSG_MM_NTF,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_KERNEL_CNT,
@@ -862,6 +870,80 @@  enum {
 	ETHTOOL_A_MODULE_MAX = (__ETHTOOL_A_MODULE_CNT - 1)
 };
 
+/* FRAME PREEMPTION (802.1Q) */
+
+enum {
+	ETHTOOL_A_FP_PARAM_ENTRY_UNSPEC,
+	ETHTOOL_A_FP_PARAM_ENTRY_PRIO,		/* u8 */
+	ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS,	/* u8 */
+
+	/* add new constants above here */
+	__ETHTOOL_A_FP_PARAM_ENTRY_CNT,
+	ETHTOOL_A_FP_PARAM_ENTRY_MAX = (__ETHTOOL_A_FP_PARAM_ENTRY_CNT - 1)
+};
+
+enum {
+	ETHTOOL_A_FP_PARAM_UNSPEC,
+	ETHTOOL_A_FP_PARAM_ENTRY,		/* nest */
+
+	/* add new constants above here */
+	__ETHTOOL_A_FP_PARAM_CNT,
+	ETHTOOL_A_FP_PARAM_MAX = (__ETHTOOL_A_FP_PARAM_CNT - 1)
+};
+
+enum {
+	ETHTOOL_A_FP_UNSPEC,
+	ETHTOOL_A_FP_HEADER,			/* nest - _A_HEADER_* */
+	ETHTOOL_A_FP_PARAM_TABLE,		/* nest */
+	ETHTOOL_A_FP_HOLD_ADVANCE,		/* u32 */
+	ETHTOOL_A_FP_RELEASE_ADVANCE,		/* u32 */
+
+	/* add new constants above here */
+	__ETHTOOL_A_FP_CNT,
+	ETHTOOL_A_FP_MAX = (__ETHTOOL_A_FP_CNT - 1)
+};
+
+/* MAC MERGE (802.3) */
+
+enum {
+	ETHTOOL_A_MM_STAT_UNSPEC,
+	ETHTOOL_A_MM_STAT_PAD,
+
+	/* aMACMergeFrameAssErrorCount */
+	ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS,	/* u64 */
+	/* aMACMergeFrameSmdErrorCount */
+	ETHTOOL_A_MM_STAT_SMD_ERRORS,		/* u64 */
+	/* aMACMergeFrameAssOkCount */
+	ETHTOOL_A_MM_STAT_REASSEMBLY_OK,	/* u64 */
+	/* aMACMergeFragCountRx */
+	ETHTOOL_A_MM_STAT_RX_FRAG_COUNT,	/* u64 */
+	/* aMACMergeFragCountTx */
+	ETHTOOL_A_MM_STAT_TX_FRAG_COUNT,	/* u64 */
+	/* aMACMergeHoldCount */
+	ETHTOOL_A_MM_STAT_HOLD_COUNT,		/* u64 */
+
+	/* add new constants above here */
+	__ETHTOOL_A_MM_STAT_CNT,
+	ETHTOOL_A_MM_STAT_MAX = (__ETHTOOL_A_MM_STAT_CNT - 1)
+};
+
+enum {
+	ETHTOOL_A_MM_UNSPEC,
+	ETHTOOL_A_MM_HEADER,			/* nest - _A_HEADER_* */
+	ETHTOOL_A_MM_VERIFY_STATUS,		/* u8 */
+	ETHTOOL_A_MM_VERIFY_DISABLE,		/* u8 */
+	ETHTOOL_A_MM_VERIFY_TIME,		/* u32 */
+	ETHTOOL_A_MM_SUPPORTED,			/* u8 */
+	ETHTOOL_A_MM_ENABLED,			/* u8 */
+	ETHTOOL_A_MM_ACTIVE,			/* u8 */
+	ETHTOOL_A_MM_ADD_FRAG_SIZE,		/* u8 */
+	ETHTOOL_A_MM_STATS,			/* nest - _A_MM_STAT_* */
+
+	/* add new constants above here */
+	__ETHTOOL_A_MM_CNT,
+	ETHTOOL_A_MM_MAX = (__ETHTOOL_A_MM_CNT - 1)
+};
+
 /* generic netlink info */
 #define ETHTOOL_GENL_NAME "ethtool"
 #define ETHTOOL_GENL_VERSION 1
diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile
index b76432e70e6b..31e056f17856 100644
--- a/net/ethtool/Makefile
+++ b/net/ethtool/Makefile
@@ -7,4 +7,5 @@  obj-$(CONFIG_ETHTOOL_NETLINK)	+= ethtool_nl.o
 ethtool_nl-y	:= netlink.o bitset.o strset.o linkinfo.o linkmodes.o \
 		   linkstate.o debug.o wol.o features.o privflags.o rings.o \
 		   channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \
-		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o
+		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o \
+		   fp.o mm.o
diff --git a/net/ethtool/fp.c b/net/ethtool/fp.c
new file mode 100644
index 000000000000..20f19d8c1461
--- /dev/null
+++ b/net/ethtool/fp.c
@@ -0,0 +1,295 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2022 NXP
+ */
+#include "common.h"
+#include "netlink.h"
+
+struct fp_req_info {
+	struct ethnl_req_info		base;
+};
+
+struct fp_reply_data {
+	struct ethnl_reply_data		base;
+	struct ethtool_fp_param		params;
+};
+
+#define FP_REPDATA(__reply_base) \
+	container_of(__reply_base, struct fp_reply_data, base)
+
+const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1] = {
+	[ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+};
+
+static int fp_prepare_data(const struct ethnl_req_info *req_base,
+			   struct ethnl_reply_data *reply_base,
+			   struct genl_info *info)
+{
+	struct fp_reply_data *data = FP_REPDATA(reply_base);
+	struct net_device *dev = reply_base->dev;
+	int ret;
+
+	if (!dev->ethtool_ops->get_fp_param)
+		return -EOPNOTSUPP;
+
+	ret = ethnl_ops_begin(dev);
+	if (ret < 0)
+		return ret;
+
+	dev->ethtool_ops->get_fp_param(dev, &data->params);
+	ethnl_ops_complete(dev);
+
+	return 0;
+}
+
+static int fp_reply_size(const struct ethnl_req_info *req_base,
+			 const struct ethnl_reply_data *reply_base)
+{
+	int len = 0;
+
+	len += nla_total_size(0); /* _FP_PARAM_ENTRY */
+	len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_PRIO */
+	len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_ADMIN_STATUS */
+	len *= 8; /* 8 prios */
+	len += nla_total_size(0); /* _FP_PARAM_TABLE */
+	len += nla_total_size(sizeof(u32)); /* _FP_HOLD_ADVANCE */
+	len += nla_total_size(sizeof(u32)); /* _FP_RELEASE_ADVANCE */
+
+	return len;
+}
+
+static int fp_fill_reply(struct sk_buff *skb,
+			 const struct ethnl_req_info *req_base,
+			 const struct ethnl_reply_data *reply_base)
+{
+	const struct fp_reply_data *data = FP_REPDATA(reply_base);
+	const struct ethtool_fp_param *params = &data->params;
+	enum ethtool_fp_admin_status admin_status;
+	struct nlattr *nest_table, *nest_entry;
+	int prio;
+	int ret;
+
+	if (nla_put_u32(skb, ETHTOOL_A_FP_HOLD_ADVANCE, params->hold_advance) ||
+	    nla_put_u32(skb, ETHTOOL_A_FP_RELEASE_ADVANCE, params->release_advance))
+		return -EMSGSIZE;
+
+	nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE);
+	if (!nest_table)
+		return -EMSGSIZE;
+
+	for (prio = 0; prio < 8; prio++) {
+		admin_status = (params->preemptable_prios & BIT(prio)) ?
+			ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE :
+			ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS;
+
+		nest_entry = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_ENTRY);
+		if (!nest_entry)
+			goto nla_nest_cancel_table;
+
+		if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_PRIO, prio))
+			goto nla_nest_cancel_entry;
+
+		if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS,
+			       admin_status))
+			goto nla_nest_cancel_entry;
+
+		nla_nest_end(skb, nest_entry);
+	}
+
+	nla_nest_end(skb, nest_table);
+
+	return 0;
+
+nla_nest_cancel_entry:
+	nla_nest_cancel(skb, nest_entry);
+nla_nest_cancel_table:
+	nla_nest_cancel(skb, nest_table);
+	return ret;
+}
+
+const struct ethnl_request_ops ethnl_fp_request_ops = {
+	.request_cmd		= ETHTOOL_MSG_FP_GET,
+	.reply_cmd		= ETHTOOL_MSG_FP_GET_REPLY,
+	.hdr_attr		= ETHTOOL_A_FP_HEADER,
+	.req_info_size		= sizeof(struct fp_req_info),
+	.reply_data_size	= sizeof(struct fp_reply_data),
+
+	.prepare_data		= fp_prepare_data,
+	.reply_size		= fp_reply_size,
+	.fill_reply		= fp_fill_reply,
+};
+
+static const struct nla_policy
+ethnl_fp_set_param_entry_policy[ETHTOOL_A_FP_PARAM_ENTRY_MAX  + 1] = {
+	[ETHTOOL_A_FP_PARAM_ENTRY_PRIO] = { .type = NLA_U8 },
+	[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS] = { .type = NLA_U8 },
+};
+
+static const struct nla_policy
+ethnl_fp_set_param_table_policy[ETHTOOL_A_FP_PARAM_MAX + 1] = {
+	[ETHTOOL_A_FP_PARAM_ENTRY] = NLA_POLICY_NESTED(ethnl_fp_set_param_entry_policy),
+};
+
+const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1] = {
+	[ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+	[ETHTOOL_A_FP_PARAM_TABLE] = NLA_POLICY_NESTED(ethnl_fp_set_param_table_policy),
+};
+
+static int fp_parse_param_entry(const struct nlattr *nest,
+				struct ethtool_fp_param *params,
+				u8 *seen_prios, bool *mod,
+				struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[ARRAY_SIZE(ethnl_fp_set_param_entry_policy)];
+	u8 preemptable_prios = params->preemptable_prios;
+	u8 prio, admin_status;
+	int ret;
+
+	if (!nest)
+		return 0;
+
+	ret = nla_parse_nested(tb,
+			       ARRAY_SIZE(ethnl_fp_set_param_entry_policy) - 1,
+			       nest, ethnl_fp_set_param_entry_policy,
+			       extack);
+	if (ret)
+		return ret;
+
+	if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]) {
+		NL_SET_ERR_MSG_ATTR(extack, nest,
+				    "Missing 'prio' attribute");
+		return -EINVAL;
+	}
+
+	if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]) {
+		NL_SET_ERR_MSG_ATTR(extack, nest,
+				    "Missing 'admin_status' attribute");
+		return -EINVAL;
+	}
+
+	prio = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]);
+	if (prio >= 8) {
+		NL_SET_ERR_MSG_ATTR(extack, nest,
+				    "Range exceeded for 'prio' attribute");
+		return -ERANGE;
+	}
+
+	if (*seen_prios & BIT(prio)) {
+		NL_SET_ERR_MSG_ATTR(extack, nest,
+				    "Duplicate 'prio' attribute");
+		return -EINVAL;
+	}
+
+	*seen_prios |= BIT(prio);
+
+	admin_status = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]);
+	switch (admin_status) {
+	case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE:
+		preemptable_prios |= BIT(prio);
+		break;
+	case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS:
+		preemptable_prios &= ~BIT(prio);
+		break;
+	default:
+		NL_SET_ERR_MSG_ATTR(extack, nest,
+				    "Unexpected value for 'admin_status' attribute");
+		return -EINVAL;
+	}
+
+	if (preemptable_prios != params->preemptable_prios) {
+		params->preemptable_prios = preemptable_prios;
+		*mod = true;
+	}
+
+	return 0;
+}
+
+static int fp_parse_param_table(const struct nlattr *nest,
+				struct ethtool_fp_param *params, bool *mod,
+				struct netlink_ext_ack *extack)
+{
+	u8 seen_prios = 0;
+	struct nlattr *n;
+	int rem, ret;
+
+	if (!nest)
+		return 0;
+
+	nla_for_each_nested(n, nest, rem) {
+		struct sched_entry *entry;
+
+		if (nla_type(n) != ETHTOOL_A_FP_PARAM_ENTRY) {
+			NL_SET_ERR_MSG_ATTR(extack, n,
+					    "Attribute is not of type 'entry'");
+			continue;
+		}
+
+		ret = fp_parse_param_entry(n, params, &seen_prios, mod, extack);
+		if (ret < 0) {
+			kfree(entry);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct ethnl_req_info req_info = {};
+	struct ethtool_fp_param params = {};
+	struct nlattr **tb = info->attrs;
+	const struct ethtool_ops *ops;
+	struct net_device *dev;
+	bool mod = false;
+	int ret;
+
+	ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_FP_HEADER],
+					 genl_info_net(info), extack, true);
+	if (ret)
+		return ret;
+
+	dev = req_info.dev;
+	ops = dev->ethtool_ops;
+
+	if (!ops->get_fp_param || !ops->set_fp_param) {
+		ret = -EOPNOTSUPP;
+		goto out_dev;
+	}
+
+	rtnl_lock();
+	ret = ethnl_ops_begin(dev);
+	if (ret)
+		goto out_rtnl;
+
+	dev->ethtool_ops->get_fp_param(dev, &params);
+
+	ethnl_update_u32(&params.hold_advance, tb[ETHTOOL_A_FP_HOLD_ADVANCE],
+			 &mod);
+	ethnl_update_u32(&params.release_advance,
+			 tb[ETHTOOL_A_FP_RELEASE_ADVANCE], &mod);
+
+	ret = fp_parse_param_table(tb[ETHTOOL_A_FP_PARAM_TABLE], &params, &mod,
+				   extack);
+	if (ret || !mod)
+		goto out_ops;
+
+	ret = dev->ethtool_ops->set_fp_param(dev, &params, extack);
+	if (ret) {
+		if (!extack->_msg)
+			NL_SET_ERR_MSG(extack,
+				       "Failed to update frame preemption parameters");
+		goto out_ops;
+	}
+
+	ethtool_notify(dev, ETHTOOL_MSG_FP_NTF, NULL);
+
+out_ops:
+	ethnl_ops_complete(dev);
+out_rtnl:
+	rtnl_unlock();
+out_dev:
+	dev_put(dev);
+	return ret;
+}
diff --git a/net/ethtool/mm.c b/net/ethtool/mm.c
new file mode 100644
index 000000000000..d4731fb7aee4
--- /dev/null
+++ b/net/ethtool/mm.c
@@ -0,0 +1,228 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2022 NXP
+ */
+#include "common.h"
+#include "netlink.h"
+
+struct mm_req_info {
+	struct ethnl_req_info		base;
+};
+
+struct mm_reply_data {
+	struct ethnl_reply_data		base;
+	struct ethtool_mm_state		state;
+	struct ethtool_mm_stats		stats;
+};
+
+#define MM_REPDATA(__reply_base) \
+	container_of(__reply_base, struct mm_reply_data, base)
+
+#define ETHTOOL_MM_STAT_CNT \
+	(__ETHTOOL_A_MM_STAT_CNT - (ETHTOOL_A_MM_STAT_PAD + 1))
+
+const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1] = {
+	[ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy_stats),
+};
+
+static int mm_prepare_data(const struct ethnl_req_info *req_base,
+			   struct ethnl_reply_data *reply_base,
+			   struct genl_info *info)
+{
+	struct mm_reply_data *data = MM_REPDATA(reply_base);
+	struct net_device *dev = reply_base->dev;
+	const struct ethtool_ops *ops;
+	int ret;
+
+	ops = dev->ethtool_ops;
+
+	if (!ops->get_mm_state)
+		return -EOPNOTSUPP;
+
+	ethtool_stats_init((u64 *)&data->stats,
+			   sizeof(data->stats) / sizeof(u64));
+
+	ret = ethnl_ops_begin(dev);
+	if (ret < 0)
+		return ret;
+
+	ops->get_mm_state(dev, &data->state);
+
+	if (ops->get_mm_stats && (req_base->flags & ETHTOOL_FLAG_STATS))
+		ops->get_mm_stats(dev, &data->stats);
+
+	ethnl_ops_complete(dev);
+
+	return 0;
+}
+
+static int mm_reply_size(const struct ethnl_req_info *req_base,
+			 const struct ethnl_reply_data *reply_base)
+{
+	int len = 0;
+
+	len += nla_total_size(sizeof(u8)); /* _MM_VERIFY_STATUS */
+	len += nla_total_size(sizeof(u32)); /* _MM_VERIFY_TIME */
+	len += nla_total_size(sizeof(u8)); /* _MM_SUPPORTED */
+	len += nla_total_size(sizeof(u8)); /* _MM_ENABLED */
+	len += nla_total_size(sizeof(u8)); /* _MM_ACTIVE */
+	len += nla_total_size(sizeof(u8)); /* _MM_ADD_FRAG_SIZE */
+
+	if (req_base->flags & ETHTOOL_FLAG_STATS)
+		len += nla_total_size(0) + /* _MM_STATS */
+		       nla_total_size_64bit(sizeof(u64)) * ETHTOOL_MM_STAT_CNT;
+
+	return len;
+}
+
+static int mm_put_stat(struct sk_buff *skb, u64 val, u16 attrtype)
+{
+	if (val == ETHTOOL_STAT_NOT_SET)
+		return 0;
+	if (nla_put_u64_64bit(skb, attrtype, val, ETHTOOL_A_MM_STAT_PAD))
+		return -EMSGSIZE;
+	return 0;
+}
+
+static int mm_put_stats(struct sk_buff *skb,
+			const struct ethtool_mm_stats *stats)
+{
+	struct nlattr *nest;
+
+	nest = nla_nest_start(skb, ETHTOOL_A_MM_STATS);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (mm_put_stat(skb, stats->MACMergeFrameAssErrorCount,
+			ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS) ||
+	    mm_put_stat(skb, stats->MACMergeFrameSmdErrorCount,
+			ETHTOOL_A_MM_STAT_SMD_ERRORS) ||
+	    mm_put_stat(skb, stats->MACMergeFrameAssOkCount,
+			ETHTOOL_A_MM_STAT_REASSEMBLY_OK) ||
+	    mm_put_stat(skb, stats->MACMergeFragCountRx,
+			ETHTOOL_A_MM_STAT_RX_FRAG_COUNT) ||
+	    mm_put_stat(skb, stats->MACMergeFragCountTx,
+			ETHTOOL_A_MM_STAT_TX_FRAG_COUNT) ||
+	    mm_put_stat(skb, stats->MACMergeHoldCount,
+			ETHTOOL_A_MM_STAT_HOLD_COUNT))
+		goto err_cancel;
+
+	nla_nest_end(skb, nest);
+	return 0;
+
+err_cancel:
+	nla_nest_cancel(skb, nest);
+	return -EMSGSIZE;
+}
+
+static int mm_fill_reply(struct sk_buff *skb,
+			 const struct ethnl_req_info *req_base,
+			 const struct ethnl_reply_data *reply_base)
+{
+	const struct mm_reply_data *data = MM_REPDATA(reply_base);
+	const struct ethtool_mm_state *state = &data->state;
+
+	if (nla_put_u8(skb, ETHTOOL_A_MM_VERIFY_STATUS, state->verify_status) ||
+	    nla_put_u32(skb, ETHTOOL_A_MM_VERIFY_TIME, state->verify_time) ||
+	    nla_put_u8(skb, ETHTOOL_A_MM_SUPPORTED, state->supported) ||
+	    nla_put_u8(skb, ETHTOOL_A_MM_ENABLED, state->enabled) ||
+	    nla_put_u8(skb, ETHTOOL_A_MM_ACTIVE, state->active) ||
+	    nla_put_u8(skb, ETHTOOL_A_MM_ADD_FRAG_SIZE, state->add_frag_size))
+		return -EMSGSIZE;
+
+	if (req_base->flags & ETHTOOL_FLAG_STATS &&
+	    mm_put_stats(skb, &data->stats))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+const struct ethnl_request_ops ethnl_mm_request_ops = {
+	.request_cmd		= ETHTOOL_MSG_MM_GET,
+	.reply_cmd		= ETHTOOL_MSG_MM_GET_REPLY,
+	.hdr_attr		= ETHTOOL_A_MM_HEADER,
+	.req_info_size		= sizeof(struct mm_req_info),
+	.reply_data_size	= sizeof(struct mm_reply_data),
+
+	.prepare_data		= mm_prepare_data,
+	.reply_size		= mm_reply_size,
+	.fill_reply		= mm_fill_reply,
+};
+
+const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1] = {
+	[ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+	[ETHTOOL_A_MM_VERIFY_DISABLE] = { .type = NLA_U8 },
+	[ETHTOOL_A_MM_VERIFY_TIME] = { .type = NLA_U32 },
+	[ETHTOOL_A_MM_ENABLED] = { .type = NLA_U8 },
+	[ETHTOOL_A_MM_ADD_FRAG_SIZE] = { .type = NLA_U8 },
+};
+
+static void mm_state_to_cfg(const struct ethtool_mm_state *state,
+			    struct ethtool_mm_cfg *cfg)
+{
+	cfg->verify_disable =
+		state->verify_status == ETHTOOL_MM_VERIFY_STATUS_DISABLED;
+	cfg->verify_time = state->verify_time;
+	cfg->enabled = state->enabled;
+	cfg->add_frag_size = state->add_frag_size;
+}
+
+int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct ethnl_req_info req_info = {};
+	struct ethtool_mm_state state = {};
+	struct nlattr **tb = info->attrs;
+	struct ethtool_mm_cfg cfg = {};
+	const struct ethtool_ops *ops;
+	struct net_device *dev;
+	bool mod = false;
+	int ret;
+
+	ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_MM_HEADER],
+					 genl_info_net(info), extack, true);
+	if (ret)
+		return ret;
+
+	dev = req_info.dev;
+	ops = dev->ethtool_ops;
+
+	if (!ops->get_mm_state || !ops->set_mm_cfg) {
+		ret = -EOPNOTSUPP;
+		goto out_dev;
+	}
+
+	rtnl_lock();
+	ret = ethnl_ops_begin(dev);
+	if (ret)
+		goto out_rtnl;
+
+	ops->get_mm_state(dev, &state);
+
+	mm_state_to_cfg(&state, &cfg);
+
+	ethnl_update_bool(&cfg.verify_disable, tb[ETHTOOL_A_MM_VERIFY_DISABLE],
+			  &mod);
+	ethnl_update_u32(&cfg.verify_time, tb[ETHTOOL_A_MM_VERIFY_TIME], &mod);
+	ethnl_update_bool(&cfg.enabled, tb[ETHTOOL_A_MM_ENABLED], &mod);
+	ethnl_update_u8(&cfg.add_frag_size, tb[ETHTOOL_A_MM_ADD_FRAG_SIZE],
+			&mod);
+
+	ret = ops->set_mm_cfg(dev, &cfg, extack);
+	if (ret) {
+		if (!extack->_msg)
+			NL_SET_ERR_MSG(extack,
+				       "Failed to update MAC merge configuration");
+		goto out_ops;
+	}
+
+	ethtool_notify(dev, ETHTOOL_MSG_MM_NTF, NULL);
+
+out_ops:
+	ethnl_ops_complete(dev);
+out_rtnl:
+	rtnl_unlock();
+out_dev:
+	dev_put(dev);
+	return ret;
+}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index e26079e11835..82ad2bd92d70 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -286,6 +286,8 @@  ethnl_default_requests[__ETHTOOL_MSG_USER_CNT] = {
 	[ETHTOOL_MSG_STATS_GET]		= &ethnl_stats_request_ops,
 	[ETHTOOL_MSG_PHC_VCLOCKS_GET]	= &ethnl_phc_vclocks_request_ops,
 	[ETHTOOL_MSG_MODULE_GET]	= &ethnl_module_request_ops,
+	[ETHTOOL_MSG_FP_GET]		= &ethnl_fp_request_ops,
+	[ETHTOOL_MSG_MM_GET]		= &ethnl_mm_request_ops,
 };
 
 static struct ethnl_dump_ctx *ethnl_dump_context(struct netlink_callback *cb)
@@ -598,6 +600,8 @@  ethnl_default_notify_ops[ETHTOOL_MSG_KERNEL_MAX + 1] = {
 	[ETHTOOL_MSG_EEE_NTF]		= &ethnl_eee_request_ops,
 	[ETHTOOL_MSG_FEC_NTF]		= &ethnl_fec_request_ops,
 	[ETHTOOL_MSG_MODULE_NTF]	= &ethnl_module_request_ops,
+	[ETHTOOL_MSG_FP_NTF]		= &ethnl_fp_request_ops,
+	[ETHTOOL_MSG_MM_NTF]		= &ethnl_mm_request_ops,
 };
 
 /* default notification handler */
@@ -691,6 +695,8 @@  static const ethnl_notify_handler_t ethnl_notify_handlers[] = {
 	[ETHTOOL_MSG_EEE_NTF]		= ethnl_default_notify,
 	[ETHTOOL_MSG_FEC_NTF]		= ethnl_default_notify,
 	[ETHTOOL_MSG_MODULE_NTF]	= ethnl_default_notify,
+	[ETHTOOL_MSG_FP_NTF]		= ethnl_default_notify,
+	[ETHTOOL_MSG_MM_NTF]		= ethnl_default_notify,
 };
 
 void ethtool_notify(struct net_device *dev, unsigned int cmd, const void *data)
@@ -1020,6 +1026,38 @@  static const struct genl_ops ethtool_genl_ops[] = {
 		.policy = ethnl_module_set_policy,
 		.maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1,
 	},
+	{
+		.cmd	= ETHTOOL_MSG_FP_GET,
+		.doit	= ethnl_default_doit,
+		.start	= ethnl_default_start,
+		.dumpit	= ethnl_default_dumpit,
+		.done	= ethnl_default_done,
+		.policy = ethnl_fp_get_policy,
+		.maxattr = ARRAY_SIZE(ethnl_fp_get_policy) - 1,
+	},
+	{
+		.cmd	= ETHTOOL_MSG_FP_SET,
+		.flags	= GENL_UNS_ADMIN_PERM,
+		.doit	= ethnl_set_fp_param,
+		.policy = ethnl_fp_set_policy,
+		.maxattr = ARRAY_SIZE(ethnl_fp_set_policy) - 1,
+	},
+	{
+		.cmd	= ETHTOOL_MSG_MM_GET,
+		.doit	= ethnl_default_doit,
+		.start	= ethnl_default_start,
+		.dumpit	= ethnl_default_dumpit,
+		.done	= ethnl_default_done,
+		.policy = ethnl_mm_get_policy,
+		.maxattr = ARRAY_SIZE(ethnl_mm_get_policy) - 1,
+	},
+	{
+		.cmd	= ETHTOOL_MSG_MM_SET,
+		.flags	= GENL_UNS_ADMIN_PERM,
+		.doit	= ethnl_set_mm_cfg,
+		.policy = ethnl_mm_set_policy,
+		.maxattr = ARRAY_SIZE(ethnl_mm_set_policy) - 1,
+	},
 };
 
 static const struct genl_multicast_group ethtool_nl_mcgrps[] = {
diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h
index 1653fd2cf0cf..a2e56df74c85 100644
--- a/net/ethtool/netlink.h
+++ b/net/ethtool/netlink.h
@@ -371,6 +371,8 @@  extern const struct ethnl_request_ops ethnl_module_eeprom_request_ops;
 extern const struct ethnl_request_ops ethnl_stats_request_ops;
 extern const struct ethnl_request_ops ethnl_phc_vclocks_request_ops;
 extern const struct ethnl_request_ops ethnl_module_request_ops;
+extern const struct ethnl_request_ops ethnl_fp_request_ops;
+extern const struct ethnl_request_ops ethnl_mm_request_ops;
 
 extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1];
 extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1];
@@ -409,6 +411,10 @@  extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1
 extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1];
 extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1];
 extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_POWER_MODE_POLICY + 1];
+extern const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1];
+extern const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1];
+extern const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1];
+extern const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1];
 
 int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info);
 int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info);
@@ -428,6 +434,8 @@  int ethnl_tunnel_info_start(struct netlink_callback *cb);
 int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
 int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info);
 int ethnl_set_module(struct sk_buff *skb, struct genl_info *info);
+int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info);
+int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info);
 
 extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN];
 extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN];