Message ID | 20220816222920.1952936-3-vladimir.oltean@nxp.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | 802.1Q Frame Preemption and 802.3 MAC Merge support via ethtool | expand |
On Wed, 17 Aug 2022 01:29:15 +0300 Vladimir Oltean wrote: > +/** > + * struct ethtool_mm_state - 802.3 MAC merge layer state > + */ > +struct ethtool_mm_state { > + u32 verify_time; > + enum ethtool_mm_verify_status verify_status; > + bool supported; > + bool enabled; > + bool active; The enabled vs active piqued my interest. Is there some handshake / aneg or such? > + nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE); > + if (!nest_table) > + return -EMSGSIZE; Don't warp tables in nests, let the elements repeat in the parent.
On Tue, Aug 16, 2022 at 08:22:09PM -0700, Jakub Kicinski wrote: > On Wed, 17 Aug 2022 01:29:15 +0300 Vladimir Oltean wrote: > > +/** > > + * struct ethtool_mm_state - 802.3 MAC merge layer state > > + */ > > +struct ethtool_mm_state { > > + u32 verify_time; > > + enum ethtool_mm_verify_status verify_status; > > + bool supported; > > + bool enabled; > > + bool active; > > The enabled vs active piqued my interest. Is there some handshake / > aneg or such? This is where writing the kdocs, as mentioned in the cover letter, would have helped. Yes, the handshake is described in 802.3 clause "99.4.3 Verifying preemption operation". It says: | Verification (see Figure 99–3) checks that the link can support the | preemption capability. | | If verification is enabled, the preemption capability shall be active | only after verification has completed | successfully. | | If the preemption capability is enabled but has not been verified yet, | the MAC Merge sublayer initiates verification. Verification relies on | the transmission of a verify mPacket and receipt of a respond mPacket to | confirm that the remote station supports the preemption capability. In fact, the verify_time and verify_status fields right above enabled/active are related exactly to this handshake process. > > + nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE); > > + if (!nest_table) > > + return -EMSGSIZE; > > Don't warp tables in nests, let the elements repeat in the parent. Ok, can do. I did this because 802.1Q actually specifies in the IEEE8021-Preemption-MIB that there is a ieee8021PreemptionParameterTable structure containing pairs of ieee8021PreemptionPriority and ieee8021FramePreemptionAdminStatus. Do you have actual issues with the structuring of the FP parameters though? They look like this currently: ethtool --json --show-fp eno0 [ { "ifname": "eno0", "parameter-table": [ { "prio": 0, "admin-status": "preemptable" },{ "prio": 1, "admin-status": "preemptable" },{ "prio": 2, "admin-status": "preemptable" },{ "prio": 3, "admin-status": "preemptable" },{ "prio": 4, "admin-status": "preemptable" },{ "prio": 5, "admin-status": "preemptable" },{ "prio": 6, "admin-status": "preemptable" },{ "prio": 7, "admin-status": "preemptable" } ], "hold-advance": 0, "release-advance": 0 } ]
On Wed, 17 Aug 2022 11:41:01 +0000 Vladimir Oltean wrote: > > > + nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE); > > > + if (!nest_table) > > > + return -EMSGSIZE; > > > > Don't warp tables in nests, let the elements repeat in the parent. > > Ok, can do. I did this because 802.1Q actually specifies in the > IEEE8021-Preemption-MIB that there is a ieee8021PreemptionParameterTable > structure containing pairs of ieee8021PreemptionPriority and > ieee8021FramePreemptionAdminStatus. Yeah, netlink is a bit special in array definition. I need to document it well in my YAML netlink patches.. > Do you have actual issues with the structuring of the FP parameters > though? They look like this currently IDK how to put it best, I shared my largely uninformed thoughts, you know much more about the spec and HW so whatever you think is most appropriate is fine by me.
Vladimir Oltean <vladimir.oltean@nxp.com> writes: > Frame preemption (IEEE 802.1Q-2018 clause 6.7.2) and the MAC merge > sublayer (IEEE 802.3-2018 clause 99) are 2 specifications, part of the > Time Sensitive Networking enhancements, which work together to minimize > latency caused by frame interference. The overall goal of TSN is for > normal traffic and traffic belonging to real-time control processes to > be able to cohabitate on the same L2 network and not bother each other > too much. > > They achieve this (partly) by introducing the concept of preemptable > traffic, i.e. Ethernet frames that have an altered Start-Of-Frame > delimiter, which can be fragmented and reassembled at L2 on a link-local > basis. The non-preemptable frames are called express traffic, and they > can preempt preemptable frames, therefore having lower latency, which > can matter at lower (100 Mbps) link speeds, or at high MTUs (jumbo > frames around 9K). Preemption is not recursive, i.e. a PT frame cannot > preempt another PT frame. Preemption also does not depend upon priority, > or otherwise said, an ET frame with prio 0 will still preempt a PT frame > with prio 7. I liked that in the API sense, using this "prio" concept we gain more flexibility, and we can express better what the hardware you work with can do, i.e. priority (for frame preemption purposes?) and queues are orthogonal. The problem I have is that the hardware I work with is more limited (as are some stmmac-based NICs that I am aware of) frame preemption capability and "priority" are attached to queues. From the API perspective, it seems that I could say that "fp-prio" 0 is associated with queue 0, fp-prio 1 to queue 1, and so on, and everything will work. The only thing that I am not happy at all is that there are exactly 8 fp-prios. The Linux network stack is more flexible than what 802.1Q defines, think skb->priority, number of TCs, as you said earlier, I would hate to impose some almost artificial limits here. And in future we are going to see TSN enabled devices with lots more queues. In short: - Comment: this section of the RFC is hardware independent, this behavior of queues and priorities being orthogonal is only valid for some implementations; - Question: would it be against the intention of the API to have a 1:1 map of priorities to queues? - Deal breaker: fixed 8 prios; > > In terms of implementation, the specs talk about the fact that the > express traffic is handled by an express MAC (eMAC) and the preemptable > traffic by a preemptable MAC (pMAC), and these MACs are multiplexed on > the same MII by a MAC merge layer. > > On RX, packets go to the eMAC or to the pMAC based on their SFD (the > definition of which was generalized to SMD, Start-of-mPacket-Delimiter, > where mPacket is essentially an Ethernet frame fragment, or a complete > frame). On TX, the eMAC/pMAC classification decision is taken by the > 802.1Q spec, based on packet priority (each of the 8 priority values may > have an admin-status of preemptable or express). > > I have modeled both the Ethernet part of the spec and the queuing part > as separate netlink messages, with separate ethtool command sets > intended for them. I am slightly flexible as to where to place the FP > settings; there were previous discussions about placing them in tc. > At the moment I haven't received a good enough argument to move them out > of ethtool, so here they are. > https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/ > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > --- > include/linux/ethtool.h | 59 ++++++ > include/uapi/linux/ethtool.h | 15 ++ > include/uapi/linux/ethtool_netlink.h | 82 ++++++++ > net/ethtool/Makefile | 3 +- > net/ethtool/fp.c | 295 +++++++++++++++++++++++++++ > net/ethtool/mm.c | 228 +++++++++++++++++++++ > net/ethtool/netlink.c | 38 ++++ > net/ethtool/netlink.h | 8 + > 8 files changed, 727 insertions(+), 1 deletion(-) > create mode 100644 net/ethtool/fp.c > create mode 100644 net/ethtool/mm.c > > diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h > index 99dc7bfbcd3c..fa504dd22bf6 100644 > --- a/include/linux/ethtool.h > +++ b/include/linux/ethtool.h > @@ -453,6 +453,49 @@ struct ethtool_module_power_mode_params { > enum ethtool_module_power_mode mode; > }; > > +/** > + * struct ethtool_fp_param - 802.1Q Frame Preemption parameters > + */ > +struct ethtool_fp_param { > + u8 preemptable_prios; > + u32 hold_advance; > + u32 release_advance; > +}; > + > +/** > + * struct ethtool_mm_state - 802.3 MAC merge layer state > + */ > +struct ethtool_mm_state { > + u32 verify_time; > + enum ethtool_mm_verify_status verify_status; > + bool supported; > + bool enabled; > + bool active; > + u8 add_frag_size; > +}; > + > +/** > + * struct ethtool_mm_cfg - 802.3 MAC merge layer configuration > + */ > +struct ethtool_mm_cfg { > + u32 verify_time; > + bool verify_disable; > + bool enabled; > + u8 add_frag_size; > +}; > + > +/* struct ethtool_mm_stats - 802.3 MAC merge layer statistics, as defined in > + * IEEE 802.3-2018 subclause 30.14.1 oMACMergeEntity managed object class > + */ > +struct ethtool_mm_stats { > + u64 MACMergeFrameAssErrorCount; > + u64 MACMergeFrameSmdErrorCount; > + u64 MACMergeFrameAssOkCount; > + u64 MACMergeFragCountRx; > + u64 MACMergeFragCountTx; > + u64 MACMergeHoldCount; > +}; > + > /** > * struct ethtool_ops - optional netdev operations > * @cap_link_lanes_supported: indicates if the driver supports lanes > @@ -624,6 +667,11 @@ struct ethtool_module_power_mode_params { > * plugged-in. > * @set_module_power_mode: Set the power mode policy for the plug-in module > * used by the network device. > + * @get_fp_param: Query the 802.1Q Frame Preemption parameters. > + * @set_fp_param: Set the 802.1Q Frame Preemption parameters. > + * @get_mm_state: Query the 802.3 MAC Merge layer state. > + * @set_mm_cfg: Set the 802.3 MAC Merge layer parameters. > + * @get_mm_stats: Query the 802.3 MAC Merge layer statistics. > * > * All operations are optional (i.e. the function pointer may be set > * to %NULL) and callers must take this into account. Callers must > @@ -760,6 +808,17 @@ struct ethtool_ops { > int (*set_module_power_mode)(struct net_device *dev, > const struct ethtool_module_power_mode_params *params, > struct netlink_ext_ack *extack); > + void (*get_fp_param)(struct net_device *dev, > + struct ethtool_fp_param *params); > + int (*set_fp_param)(struct net_device *dev, > + const struct ethtool_fp_param *params, > + struct netlink_ext_ack *extack); > + void (*get_mm_state)(struct net_device *dev, > + struct ethtool_mm_state *state); > + int (*set_mm_cfg)(struct net_device *dev, struct ethtool_mm_cfg *cfg, > + struct netlink_ext_ack *extack); > + void (*get_mm_stats)(struct net_device *dev, > + struct ethtool_mm_stats *stats); > }; > > int ethtool_check_ops(const struct ethtool_ops *ops); > diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h > index 2d5741fd44bb..7a21fcb26a43 100644 > --- a/include/uapi/linux/ethtool.h > +++ b/include/uapi/linux/ethtool.h > @@ -736,6 +736,21 @@ enum ethtool_module_power_mode { > ETHTOOL_MODULE_POWER_MODE_HIGH, > }; > > +/* Values from ieee8021FramePreemptionAdminStatus */ > +enum ethtool_fp_admin_status { > + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS = 1, > + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE = 2, > +}; > + > +enum ethtool_mm_verify_status { > + ETHTOOL_MM_VERIFY_STATUS_UNKNOWN, > + ETHTOOL_MM_VERIFY_STATUS_INITIAL, > + ETHTOOL_MM_VERIFY_STATUS_VERIFYING, > + ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED, > + ETHTOOL_MM_VERIFY_STATUS_FAILED, > + ETHTOOL_MM_VERIFY_STATUS_DISABLED, > +}; > + > /** > * struct ethtool_gstrings - string set for data tagging > * @cmd: Command number = %ETHTOOL_GSTRINGS > diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h > index d2fb4f7be61b..658810274c49 100644 > --- a/include/uapi/linux/ethtool_netlink.h > +++ b/include/uapi/linux/ethtool_netlink.h > @@ -49,6 +49,10 @@ enum { > ETHTOOL_MSG_PHC_VCLOCKS_GET, > ETHTOOL_MSG_MODULE_GET, > ETHTOOL_MSG_MODULE_SET, > + ETHTOOL_MSG_FP_GET, > + ETHTOOL_MSG_FP_SET, > + ETHTOOL_MSG_MM_GET, > + ETHTOOL_MSG_MM_SET, > > /* add new constants above here */ > __ETHTOOL_MSG_USER_CNT, > @@ -94,6 +98,10 @@ enum { > ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY, > ETHTOOL_MSG_MODULE_GET_REPLY, > ETHTOOL_MSG_MODULE_NTF, > + ETHTOOL_MSG_FP_GET_REPLY, > + ETHTOOL_MSG_FP_NTF, > + ETHTOOL_MSG_MM_GET_REPLY, > + ETHTOOL_MSG_MM_NTF, > > /* add new constants above here */ > __ETHTOOL_MSG_KERNEL_CNT, > @@ -862,6 +870,80 @@ enum { > ETHTOOL_A_MODULE_MAX = (__ETHTOOL_A_MODULE_CNT - 1) > }; > > +/* FRAME PREEMPTION (802.1Q) */ > + > +enum { > + ETHTOOL_A_FP_PARAM_ENTRY_UNSPEC, > + ETHTOOL_A_FP_PARAM_ENTRY_PRIO, /* u8 */ > + ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS, /* u8 */ > + > + /* add new constants above here */ > + __ETHTOOL_A_FP_PARAM_ENTRY_CNT, > + ETHTOOL_A_FP_PARAM_ENTRY_MAX = (__ETHTOOL_A_FP_PARAM_ENTRY_CNT - 1) > +}; > + > +enum { > + ETHTOOL_A_FP_PARAM_UNSPEC, > + ETHTOOL_A_FP_PARAM_ENTRY, /* nest */ > + > + /* add new constants above here */ > + __ETHTOOL_A_FP_PARAM_CNT, > + ETHTOOL_A_FP_PARAM_MAX = (__ETHTOOL_A_FP_PARAM_CNT - 1) > +}; > + > +enum { > + ETHTOOL_A_FP_UNSPEC, > + ETHTOOL_A_FP_HEADER, /* nest - _A_HEADER_* */ > + ETHTOOL_A_FP_PARAM_TABLE, /* nest */ > + ETHTOOL_A_FP_HOLD_ADVANCE, /* u32 */ > + ETHTOOL_A_FP_RELEASE_ADVANCE, /* u32 */ > + > + /* add new constants above here */ > + __ETHTOOL_A_FP_CNT, > + ETHTOOL_A_FP_MAX = (__ETHTOOL_A_FP_CNT - 1) > +}; > + > +/* MAC MERGE (802.3) */ > + > +enum { > + ETHTOOL_A_MM_STAT_UNSPEC, > + ETHTOOL_A_MM_STAT_PAD, > + > + /* aMACMergeFrameAssErrorCount */ > + ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS, /* u64 */ > + /* aMACMergeFrameSmdErrorCount */ > + ETHTOOL_A_MM_STAT_SMD_ERRORS, /* u64 */ > + /* aMACMergeFrameAssOkCount */ > + ETHTOOL_A_MM_STAT_REASSEMBLY_OK, /* u64 */ > + /* aMACMergeFragCountRx */ > + ETHTOOL_A_MM_STAT_RX_FRAG_COUNT, /* u64 */ > + /* aMACMergeFragCountTx */ > + ETHTOOL_A_MM_STAT_TX_FRAG_COUNT, /* u64 */ > + /* aMACMergeHoldCount */ > + ETHTOOL_A_MM_STAT_HOLD_COUNT, /* u64 */ > + > + /* add new constants above here */ > + __ETHTOOL_A_MM_STAT_CNT, > + ETHTOOL_A_MM_STAT_MAX = (__ETHTOOL_A_MM_STAT_CNT - 1) > +}; > + > +enum { > + ETHTOOL_A_MM_UNSPEC, > + ETHTOOL_A_MM_HEADER, /* nest - _A_HEADER_* */ > + ETHTOOL_A_MM_VERIFY_STATUS, /* u8 */ > + ETHTOOL_A_MM_VERIFY_DISABLE, /* u8 */ > + ETHTOOL_A_MM_VERIFY_TIME, /* u32 */ > + ETHTOOL_A_MM_SUPPORTED, /* u8 */ > + ETHTOOL_A_MM_ENABLED, /* u8 */ > + ETHTOOL_A_MM_ACTIVE, /* u8 */ > + ETHTOOL_A_MM_ADD_FRAG_SIZE, /* u8 */ > + ETHTOOL_A_MM_STATS, /* nest - _A_MM_STAT_* */ > + > + /* add new constants above here */ > + __ETHTOOL_A_MM_CNT, > + ETHTOOL_A_MM_MAX = (__ETHTOOL_A_MM_CNT - 1) > +}; > + > /* generic netlink info */ > #define ETHTOOL_GENL_NAME "ethtool" > #define ETHTOOL_GENL_VERSION 1 > diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile > index b76432e70e6b..31e056f17856 100644 > --- a/net/ethtool/Makefile > +++ b/net/ethtool/Makefile > @@ -7,4 +7,5 @@ obj-$(CONFIG_ETHTOOL_NETLINK) += ethtool_nl.o > ethtool_nl-y := netlink.o bitset.o strset.o linkinfo.o linkmodes.o \ > linkstate.o debug.o wol.o features.o privflags.o rings.o \ > channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \ > - tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o > + tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o \ > + fp.o mm.o > diff --git a/net/ethtool/fp.c b/net/ethtool/fp.c > new file mode 100644 > index 000000000000..20f19d8c1461 > --- /dev/null > +++ b/net/ethtool/fp.c > @@ -0,0 +1,295 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright 2022 NXP > + */ > +#include "common.h" > +#include "netlink.h" > + > +struct fp_req_info { > + struct ethnl_req_info base; > +}; > + > +struct fp_reply_data { > + struct ethnl_reply_data base; > + struct ethtool_fp_param params; > +}; > + > +#define FP_REPDATA(__reply_base) \ > + container_of(__reply_base, struct fp_reply_data, base) > + > +const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1] = { > + [ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), > +}; > + > +static int fp_prepare_data(const struct ethnl_req_info *req_base, > + struct ethnl_reply_data *reply_base, > + struct genl_info *info) > +{ > + struct fp_reply_data *data = FP_REPDATA(reply_base); > + struct net_device *dev = reply_base->dev; > + int ret; > + > + if (!dev->ethtool_ops->get_fp_param) > + return -EOPNOTSUPP; > + > + ret = ethnl_ops_begin(dev); > + if (ret < 0) > + return ret; > + > + dev->ethtool_ops->get_fp_param(dev, &data->params); > + ethnl_ops_complete(dev); > + > + return 0; > +} > + > +static int fp_reply_size(const struct ethnl_req_info *req_base, > + const struct ethnl_reply_data *reply_base) > +{ > + int len = 0; > + > + len += nla_total_size(0); /* _FP_PARAM_ENTRY */ > + len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_PRIO */ > + len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_ADMIN_STATUS */ > + len *= 8; /* 8 prios */ > + len += nla_total_size(0); /* _FP_PARAM_TABLE */ > + len += nla_total_size(sizeof(u32)); /* _FP_HOLD_ADVANCE */ > + len += nla_total_size(sizeof(u32)); /* _FP_RELEASE_ADVANCE */ > + > + return len; > +} > + > +static int fp_fill_reply(struct sk_buff *skb, > + const struct ethnl_req_info *req_base, > + const struct ethnl_reply_data *reply_base) > +{ > + const struct fp_reply_data *data = FP_REPDATA(reply_base); > + const struct ethtool_fp_param *params = &data->params; > + enum ethtool_fp_admin_status admin_status; > + struct nlattr *nest_table, *nest_entry; > + int prio; > + int ret; > + > + if (nla_put_u32(skb, ETHTOOL_A_FP_HOLD_ADVANCE, params->hold_advance) || > + nla_put_u32(skb, ETHTOOL_A_FP_RELEASE_ADVANCE, params->release_advance)) > + return -EMSGSIZE; > + > + nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE); > + if (!nest_table) > + return -EMSGSIZE; > + > + for (prio = 0; prio < 8; prio++) { > + admin_status = (params->preemptable_prios & BIT(prio)) ? > + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE : > + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS; > + > + nest_entry = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_ENTRY); > + if (!nest_entry) > + goto nla_nest_cancel_table; > + > + if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_PRIO, prio)) > + goto nla_nest_cancel_entry; > + > + if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS, > + admin_status)) > + goto nla_nest_cancel_entry; > + > + nla_nest_end(skb, nest_entry); > + } > + > + nla_nest_end(skb, nest_table); > + > + return 0; > + > +nla_nest_cancel_entry: > + nla_nest_cancel(skb, nest_entry); > +nla_nest_cancel_table: > + nla_nest_cancel(skb, nest_table); > + return ret; > +} > + > +const struct ethnl_request_ops ethnl_fp_request_ops = { > + .request_cmd = ETHTOOL_MSG_FP_GET, > + .reply_cmd = ETHTOOL_MSG_FP_GET_REPLY, > + .hdr_attr = ETHTOOL_A_FP_HEADER, > + .req_info_size = sizeof(struct fp_req_info), > + .reply_data_size = sizeof(struct fp_reply_data), > + > + .prepare_data = fp_prepare_data, > + .reply_size = fp_reply_size, > + .fill_reply = fp_fill_reply, > +}; > + > +static const struct nla_policy > +ethnl_fp_set_param_entry_policy[ETHTOOL_A_FP_PARAM_ENTRY_MAX + 1] = { > + [ETHTOOL_A_FP_PARAM_ENTRY_PRIO] = { .type = NLA_U8 }, > + [ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS] = { .type = NLA_U8 }, > +}; > + > +static const struct nla_policy > +ethnl_fp_set_param_table_policy[ETHTOOL_A_FP_PARAM_MAX + 1] = { > + [ETHTOOL_A_FP_PARAM_ENTRY] = NLA_POLICY_NESTED(ethnl_fp_set_param_entry_policy), > +}; > + > +const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1] = { > + [ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), > + [ETHTOOL_A_FP_PARAM_TABLE] = NLA_POLICY_NESTED(ethnl_fp_set_param_table_policy), > +}; > + > +static int fp_parse_param_entry(const struct nlattr *nest, > + struct ethtool_fp_param *params, > + u8 *seen_prios, bool *mod, > + struct netlink_ext_ack *extack) > +{ > + struct nlattr *tb[ARRAY_SIZE(ethnl_fp_set_param_entry_policy)]; > + u8 preemptable_prios = params->preemptable_prios; > + u8 prio, admin_status; > + int ret; > + > + if (!nest) > + return 0; > + > + ret = nla_parse_nested(tb, > + ARRAY_SIZE(ethnl_fp_set_param_entry_policy) - 1, > + nest, ethnl_fp_set_param_entry_policy, > + extack); > + if (ret) > + return ret; > + > + if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]) { > + NL_SET_ERR_MSG_ATTR(extack, nest, > + "Missing 'prio' attribute"); > + return -EINVAL; > + } > + > + if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]) { > + NL_SET_ERR_MSG_ATTR(extack, nest, > + "Missing 'admin_status' attribute"); > + return -EINVAL; > + } > + > + prio = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]); > + if (prio >= 8) { > + NL_SET_ERR_MSG_ATTR(extack, nest, > + "Range exceeded for 'prio' attribute"); > + return -ERANGE; > + } > + > + if (*seen_prios & BIT(prio)) { > + NL_SET_ERR_MSG_ATTR(extack, nest, > + "Duplicate 'prio' attribute"); > + return -EINVAL; > + } > + > + *seen_prios |= BIT(prio); > + > + admin_status = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]); > + switch (admin_status) { > + case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE: > + preemptable_prios |= BIT(prio); > + break; > + case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS: > + preemptable_prios &= ~BIT(prio); > + break; > + default: > + NL_SET_ERR_MSG_ATTR(extack, nest, > + "Unexpected value for 'admin_status' attribute"); > + return -EINVAL; > + } > + > + if (preemptable_prios != params->preemptable_prios) { > + params->preemptable_prios = preemptable_prios; > + *mod = true; > + } > + > + return 0; > +} > + > +static int fp_parse_param_table(const struct nlattr *nest, > + struct ethtool_fp_param *params, bool *mod, > + struct netlink_ext_ack *extack) > +{ > + u8 seen_prios = 0; > + struct nlattr *n; > + int rem, ret; > + > + if (!nest) > + return 0; > + > + nla_for_each_nested(n, nest, rem) { > + struct sched_entry *entry; > + > + if (nla_type(n) != ETHTOOL_A_FP_PARAM_ENTRY) { > + NL_SET_ERR_MSG_ATTR(extack, n, > + "Attribute is not of type 'entry'"); > + continue; > + } > + > + ret = fp_parse_param_entry(n, params, &seen_prios, mod, extack); > + if (ret < 0) { > + kfree(entry); > + return ret; > + } > + } > + > + return 0; > +} > + > +int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info) > +{ > + struct netlink_ext_ack *extack = info->extack; > + struct ethnl_req_info req_info = {}; > + struct ethtool_fp_param params = {}; > + struct nlattr **tb = info->attrs; > + const struct ethtool_ops *ops; > + struct net_device *dev; > + bool mod = false; > + int ret; > + > + ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_FP_HEADER], > + genl_info_net(info), extack, true); > + if (ret) > + return ret; > + > + dev = req_info.dev; > + ops = dev->ethtool_ops; > + > + if (!ops->get_fp_param || !ops->set_fp_param) { > + ret = -EOPNOTSUPP; > + goto out_dev; > + } > + > + rtnl_lock(); > + ret = ethnl_ops_begin(dev); > + if (ret) > + goto out_rtnl; > + > + dev->ethtool_ops->get_fp_param(dev, ¶ms); > + > + ethnl_update_u32(¶ms.hold_advance, tb[ETHTOOL_A_FP_HOLD_ADVANCE], > + &mod); > + ethnl_update_u32(¶ms.release_advance, > + tb[ETHTOOL_A_FP_RELEASE_ADVANCE], &mod); > + > + ret = fp_parse_param_table(tb[ETHTOOL_A_FP_PARAM_TABLE], ¶ms, &mod, > + extack); > + if (ret || !mod) > + goto out_ops; > + > + ret = dev->ethtool_ops->set_fp_param(dev, ¶ms, extack); > + if (ret) { > + if (!extack->_msg) > + NL_SET_ERR_MSG(extack, > + "Failed to update frame preemption parameters"); > + goto out_ops; > + } > + > + ethtool_notify(dev, ETHTOOL_MSG_FP_NTF, NULL); > + > +out_ops: > + ethnl_ops_complete(dev); > +out_rtnl: > + rtnl_unlock(); > +out_dev: > + dev_put(dev); > + return ret; > +} > diff --git a/net/ethtool/mm.c b/net/ethtool/mm.c > new file mode 100644 > index 000000000000..d4731fb7aee4 > --- /dev/null > +++ b/net/ethtool/mm.c > @@ -0,0 +1,228 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright 2022 NXP > + */ > +#include "common.h" > +#include "netlink.h" > + > +struct mm_req_info { > + struct ethnl_req_info base; > +}; > + > +struct mm_reply_data { > + struct ethnl_reply_data base; > + struct ethtool_mm_state state; > + struct ethtool_mm_stats stats; > +}; > + > +#define MM_REPDATA(__reply_base) \ > + container_of(__reply_base, struct mm_reply_data, base) > + > +#define ETHTOOL_MM_STAT_CNT \ > + (__ETHTOOL_A_MM_STAT_CNT - (ETHTOOL_A_MM_STAT_PAD + 1)) > + > +const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1] = { > + [ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy_stats), > +}; > + > +static int mm_prepare_data(const struct ethnl_req_info *req_base, > + struct ethnl_reply_data *reply_base, > + struct genl_info *info) > +{ > + struct mm_reply_data *data = MM_REPDATA(reply_base); > + struct net_device *dev = reply_base->dev; > + const struct ethtool_ops *ops; > + int ret; > + > + ops = dev->ethtool_ops; > + > + if (!ops->get_mm_state) > + return -EOPNOTSUPP; > + > + ethtool_stats_init((u64 *)&data->stats, > + sizeof(data->stats) / sizeof(u64)); > + > + ret = ethnl_ops_begin(dev); > + if (ret < 0) > + return ret; > + > + ops->get_mm_state(dev, &data->state); > + > + if (ops->get_mm_stats && (req_base->flags & ETHTOOL_FLAG_STATS)) > + ops->get_mm_stats(dev, &data->stats); > + > + ethnl_ops_complete(dev); > + > + return 0; > +} > + > +static int mm_reply_size(const struct ethnl_req_info *req_base, > + const struct ethnl_reply_data *reply_base) > +{ > + int len = 0; > + > + len += nla_total_size(sizeof(u8)); /* _MM_VERIFY_STATUS */ > + len += nla_total_size(sizeof(u32)); /* _MM_VERIFY_TIME */ > + len += nla_total_size(sizeof(u8)); /* _MM_SUPPORTED */ > + len += nla_total_size(sizeof(u8)); /* _MM_ENABLED */ > + len += nla_total_size(sizeof(u8)); /* _MM_ACTIVE */ > + len += nla_total_size(sizeof(u8)); /* _MM_ADD_FRAG_SIZE */ > + > + if (req_base->flags & ETHTOOL_FLAG_STATS) > + len += nla_total_size(0) + /* _MM_STATS */ > + nla_total_size_64bit(sizeof(u64)) * ETHTOOL_MM_STAT_CNT; > + > + return len; > +} > + > +static int mm_put_stat(struct sk_buff *skb, u64 val, u16 attrtype) > +{ > + if (val == ETHTOOL_STAT_NOT_SET) > + return 0; > + if (nla_put_u64_64bit(skb, attrtype, val, ETHTOOL_A_MM_STAT_PAD)) > + return -EMSGSIZE; > + return 0; > +} > + > +static int mm_put_stats(struct sk_buff *skb, > + const struct ethtool_mm_stats *stats) > +{ > + struct nlattr *nest; > + > + nest = nla_nest_start(skb, ETHTOOL_A_MM_STATS); > + if (!nest) > + return -EMSGSIZE; > + > + if (mm_put_stat(skb, stats->MACMergeFrameAssErrorCount, > + ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS) || > + mm_put_stat(skb, stats->MACMergeFrameSmdErrorCount, > + ETHTOOL_A_MM_STAT_SMD_ERRORS) || > + mm_put_stat(skb, stats->MACMergeFrameAssOkCount, > + ETHTOOL_A_MM_STAT_REASSEMBLY_OK) || > + mm_put_stat(skb, stats->MACMergeFragCountRx, > + ETHTOOL_A_MM_STAT_RX_FRAG_COUNT) || > + mm_put_stat(skb, stats->MACMergeFragCountTx, > + ETHTOOL_A_MM_STAT_TX_FRAG_COUNT) || > + mm_put_stat(skb, stats->MACMergeHoldCount, > + ETHTOOL_A_MM_STAT_HOLD_COUNT)) > + goto err_cancel; > + > + nla_nest_end(skb, nest); > + return 0; > + > +err_cancel: > + nla_nest_cancel(skb, nest); > + return -EMSGSIZE; > +} > + > +static int mm_fill_reply(struct sk_buff *skb, > + const struct ethnl_req_info *req_base, > + const struct ethnl_reply_data *reply_base) > +{ > + const struct mm_reply_data *data = MM_REPDATA(reply_base); > + const struct ethtool_mm_state *state = &data->state; > + > + if (nla_put_u8(skb, ETHTOOL_A_MM_VERIFY_STATUS, state->verify_status) || > + nla_put_u32(skb, ETHTOOL_A_MM_VERIFY_TIME, state->verify_time) || > + nla_put_u8(skb, ETHTOOL_A_MM_SUPPORTED, state->supported) || > + nla_put_u8(skb, ETHTOOL_A_MM_ENABLED, state->enabled) || > + nla_put_u8(skb, ETHTOOL_A_MM_ACTIVE, state->active) || > + nla_put_u8(skb, ETHTOOL_A_MM_ADD_FRAG_SIZE, state->add_frag_size)) > + return -EMSGSIZE; > + > + if (req_base->flags & ETHTOOL_FLAG_STATS && > + mm_put_stats(skb, &data->stats)) > + return -EMSGSIZE; > + > + return 0; > +} > + > +const struct ethnl_request_ops ethnl_mm_request_ops = { > + .request_cmd = ETHTOOL_MSG_MM_GET, > + .reply_cmd = ETHTOOL_MSG_MM_GET_REPLY, > + .hdr_attr = ETHTOOL_A_MM_HEADER, > + .req_info_size = sizeof(struct mm_req_info), > + .reply_data_size = sizeof(struct mm_reply_data), > + > + .prepare_data = mm_prepare_data, > + .reply_size = mm_reply_size, > + .fill_reply = mm_fill_reply, > +}; > + > +const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1] = { > + [ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), > + [ETHTOOL_A_MM_VERIFY_DISABLE] = { .type = NLA_U8 }, > + [ETHTOOL_A_MM_VERIFY_TIME] = { .type = NLA_U32 }, > + [ETHTOOL_A_MM_ENABLED] = { .type = NLA_U8 }, > + [ETHTOOL_A_MM_ADD_FRAG_SIZE] = { .type = NLA_U8 }, > +}; > + > +static void mm_state_to_cfg(const struct ethtool_mm_state *state, > + struct ethtool_mm_cfg *cfg) > +{ > + cfg->verify_disable = > + state->verify_status == ETHTOOL_MM_VERIFY_STATUS_DISABLED; > + cfg->verify_time = state->verify_time; > + cfg->enabled = state->enabled; > + cfg->add_frag_size = state->add_frag_size; > +} > + > +int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info) > +{ > + struct netlink_ext_ack *extack = info->extack; > + struct ethnl_req_info req_info = {}; > + struct ethtool_mm_state state = {}; > + struct nlattr **tb = info->attrs; > + struct ethtool_mm_cfg cfg = {}; > + const struct ethtool_ops *ops; > + struct net_device *dev; > + bool mod = false; > + int ret; > + > + ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_MM_HEADER], > + genl_info_net(info), extack, true); > + if (ret) > + return ret; > + > + dev = req_info.dev; > + ops = dev->ethtool_ops; > + > + if (!ops->get_mm_state || !ops->set_mm_cfg) { > + ret = -EOPNOTSUPP; > + goto out_dev; > + } > + > + rtnl_lock(); > + ret = ethnl_ops_begin(dev); > + if (ret) > + goto out_rtnl; > + > + ops->get_mm_state(dev, &state); > + > + mm_state_to_cfg(&state, &cfg); > + > + ethnl_update_bool(&cfg.verify_disable, tb[ETHTOOL_A_MM_VERIFY_DISABLE], > + &mod); > + ethnl_update_u32(&cfg.verify_time, tb[ETHTOOL_A_MM_VERIFY_TIME], &mod); > + ethnl_update_bool(&cfg.enabled, tb[ETHTOOL_A_MM_ENABLED], &mod); > + ethnl_update_u8(&cfg.add_frag_size, tb[ETHTOOL_A_MM_ADD_FRAG_SIZE], > + &mod); > + > + ret = ops->set_mm_cfg(dev, &cfg, extack); > + if (ret) { > + if (!extack->_msg) > + NL_SET_ERR_MSG(extack, > + "Failed to update MAC merge configuration"); > + goto out_ops; > + } > + > + ethtool_notify(dev, ETHTOOL_MSG_MM_NTF, NULL); > + > +out_ops: > + ethnl_ops_complete(dev); > +out_rtnl: > + rtnl_unlock(); > +out_dev: > + dev_put(dev); > + return ret; > +} > diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c > index e26079e11835..82ad2bd92d70 100644 > --- a/net/ethtool/netlink.c > +++ b/net/ethtool/netlink.c > @@ -286,6 +286,8 @@ ethnl_default_requests[__ETHTOOL_MSG_USER_CNT] = { > [ETHTOOL_MSG_STATS_GET] = ðnl_stats_request_ops, > [ETHTOOL_MSG_PHC_VCLOCKS_GET] = ðnl_phc_vclocks_request_ops, > [ETHTOOL_MSG_MODULE_GET] = ðnl_module_request_ops, > + [ETHTOOL_MSG_FP_GET] = ðnl_fp_request_ops, > + [ETHTOOL_MSG_MM_GET] = ðnl_mm_request_ops, > }; > > static struct ethnl_dump_ctx *ethnl_dump_context(struct netlink_callback *cb) > @@ -598,6 +600,8 @@ ethnl_default_notify_ops[ETHTOOL_MSG_KERNEL_MAX + 1] = { > [ETHTOOL_MSG_EEE_NTF] = ðnl_eee_request_ops, > [ETHTOOL_MSG_FEC_NTF] = ðnl_fec_request_ops, > [ETHTOOL_MSG_MODULE_NTF] = ðnl_module_request_ops, > + [ETHTOOL_MSG_FP_NTF] = ðnl_fp_request_ops, > + [ETHTOOL_MSG_MM_NTF] = ðnl_mm_request_ops, > }; > > /* default notification handler */ > @@ -691,6 +695,8 @@ static const ethnl_notify_handler_t ethnl_notify_handlers[] = { > [ETHTOOL_MSG_EEE_NTF] = ethnl_default_notify, > [ETHTOOL_MSG_FEC_NTF] = ethnl_default_notify, > [ETHTOOL_MSG_MODULE_NTF] = ethnl_default_notify, > + [ETHTOOL_MSG_FP_NTF] = ethnl_default_notify, > + [ETHTOOL_MSG_MM_NTF] = ethnl_default_notify, > }; > > void ethtool_notify(struct net_device *dev, unsigned int cmd, const void *data) > @@ -1020,6 +1026,38 @@ static const struct genl_ops ethtool_genl_ops[] = { > .policy = ethnl_module_set_policy, > .maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1, > }, > + { > + .cmd = ETHTOOL_MSG_FP_GET, > + .doit = ethnl_default_doit, > + .start = ethnl_default_start, > + .dumpit = ethnl_default_dumpit, > + .done = ethnl_default_done, > + .policy = ethnl_fp_get_policy, > + .maxattr = ARRAY_SIZE(ethnl_fp_get_policy) - 1, > + }, > + { > + .cmd = ETHTOOL_MSG_FP_SET, > + .flags = GENL_UNS_ADMIN_PERM, > + .doit = ethnl_set_fp_param, > + .policy = ethnl_fp_set_policy, > + .maxattr = ARRAY_SIZE(ethnl_fp_set_policy) - 1, > + }, > + { > + .cmd = ETHTOOL_MSG_MM_GET, > + .doit = ethnl_default_doit, > + .start = ethnl_default_start, > + .dumpit = ethnl_default_dumpit, > + .done = ethnl_default_done, > + .policy = ethnl_mm_get_policy, > + .maxattr = ARRAY_SIZE(ethnl_mm_get_policy) - 1, > + }, > + { > + .cmd = ETHTOOL_MSG_MM_SET, > + .flags = GENL_UNS_ADMIN_PERM, > + .doit = ethnl_set_mm_cfg, > + .policy = ethnl_mm_set_policy, > + .maxattr = ARRAY_SIZE(ethnl_mm_set_policy) - 1, > + }, > }; > > static const struct genl_multicast_group ethtool_nl_mcgrps[] = { > diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h > index 1653fd2cf0cf..a2e56df74c85 100644 > --- a/net/ethtool/netlink.h > +++ b/net/ethtool/netlink.h > @@ -371,6 +371,8 @@ extern const struct ethnl_request_ops ethnl_module_eeprom_request_ops; > extern const struct ethnl_request_ops ethnl_stats_request_ops; > extern const struct ethnl_request_ops ethnl_phc_vclocks_request_ops; > extern const struct ethnl_request_ops ethnl_module_request_ops; > +extern const struct ethnl_request_ops ethnl_fp_request_ops; > +extern const struct ethnl_request_ops ethnl_mm_request_ops; > > extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1]; > extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1]; > @@ -409,6 +411,10 @@ extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1 > extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1]; > extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1]; > extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_POWER_MODE_POLICY + 1]; > +extern const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1]; > +extern const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1]; > +extern const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1]; > +extern const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1]; > > int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info); > int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info); > @@ -428,6 +434,8 @@ int ethnl_tunnel_info_start(struct netlink_callback *cb); > int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb); > int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info); > int ethnl_set_module(struct sk_buff *skb, struct genl_info *info); > +int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info); > +int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info); > > extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN]; > extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN]; > -- > 2.34.1 >
On Wed, Aug 17, 2022 at 04:15:11PM -0700, Vinicius Costa Gomes wrote: > I liked that in the API sense, using this "prio" concept we gain more > flexibility, and we can express better what the hardware you work with > can do, i.e. priority (for frame preemption purposes?) and queues are > orthogonal. > > The problem I have is that the hardware I work with is more limited (as > are some stmmac-based NICs that I am aware of) frame preemption > capability and "priority" are attached to queues. Don't get me wrong, enetc also has "priority" attached to TX rings (enetc_set_bdr_prio). Then there is another register which maps a TX priority to a traffic class. This could be altered by the "map" property of tc-mqprio, but in practice it isn't. The only supported prio->tc map is "map 0 1 2 3 4 5 6 7". This is to say, enetc does not have a per-packet priority that gets passed via BD metadata, but rather, packets inherit the configured priority of the ring (Linux TX queue). > From the API perspective, it seems that I could say that "fp-prio" 0 is > associated with queue 0, fp-prio 1 to queue 1, and so on, and everything > will work. You have the tc-mqprio "queues" and "map" to juggle with, to end up configuring FP per queue based on the provided per-priority settings. > The only thing that I am not happy at all is that there are exactly 8 > fp-prios. > > The Linux network stack is more flexible than what 802.1Q defines, think > skb->priority, number of TCs, as you said earlier, I would hate to > impose some almost artificial limits here. And in future we are going to > see TSN enabled devices with lots more queues. ~artificial limits~ IEEE 802.1Q says: | 12.30.1.1 framePreemptionStatusTable structure and data types | | The framePreemptionStatusTable (6.7.2) consists of 8 framePreemptionAdminStatus | values (12.30.1.1.1), one per priority. I'm more concerned about setting things straight than about the limits right now. I don't yet think everything is quite ok in that regard. The netlink format proposed here is in principle extensible for priorities > 7, if that will ever make sense. > > In short: > - Comment: this section of the RFC is hardware independent, this > behavior of queues and priorities being orthogonal is only valid for > some implementations; Yes, and IMO it can only be that way, if I were to be truthful to my interpretation of the intention 802.1Q spec (please contradict me if you have a different interpretation of it!). Note that I do see contradictions in 802.1Q, and I don't know how to reconcile them. I've suppressed some of them for lack of a logical explanation. I'm mentioning this for transparency; I don't know everything either, but I need to make something out of what I do know. So 802.1Q says this: | 12.30.1.1.1 framePreemptionAdminStatus | | This parameter is the administrative value of the preemption status for | the priority. It takes value express if frames queued for the priority | are to be transmitted using the express service for the Port, or | preemptible if frames queued for the priority are to be transmitted | using the preemptible service for the Port and preemption is enabled for | the Port. So far so good. In 802.1Q definitions, priority is attached to a packet rather than to a traffic class / queue. But then the same clause continues: | Priorities that all map to the same traffic class should be constrained | to use the same value of preemption status. Which seems to throw a wrench into everything. It raises two questions: (A) why is AdminStatus not per traffic class then? (B) why is the constraint there, what's it trying to protect against? I've asked around, and I got unsatisfactory answers to both questions. A seemingly competent answer given to (A) by Rui (CCed) is that the eMAC/pMAC selection on TX actually takes place in the MAC layer, in what is called by 802.1AC-2016 "MA_UNITDATA.request" (and what everybody else calls "MAC client xmit"). The parameters of this MAC service primitive are: MA_UNITDATA.request(destination_address, source_address, mac_service_data_unit, priority) So since only "priority" is passed to the MAC service (and traffic class isn't), this means that the MAC service can only steer packets to eMAC/pMAC based on what it's given (i.e. priority). But upon closer investigation, this explanation doesn't appear to hold water very well. This is because clause 6.7.1 Support of the ISS by IEEE Std 802.3 (Ethernet) from 802.1Q says: | If frame preemption (6.7.2) is supported on a Port, then the IEEE 802.3 | MAC provides the following two MAC service interfaces (99.4 of IEEE Std | 802.3br™-2016 [B21]): | | a) A preemptible MAC (pMAC) service interface | b) An express MAC (eMAC) service interface | | For priority values that are identified in the frame preemption status | table (6.7.2) as preemptible, frames that are selected for transmission | shall be transmitted using the pMAC service instance, and for priority | values that are identified in the frame preemption status table as | express, frames that are selected for transmission shall be transmitted | using the eMAC service instance. | In all other respects, the Port behaves as if it is supported by a | single MAC service interface. In particular, all frames received by the | Port are treated as if they were received on a single MAC service | interface regardless of whether they were received on the eMAC service | interface or the pMAC service interface, except with respect to frame | preemption. So there you go, the MAC has to provide *two* service interfaces, so the 802.1Q upper layer (client of both) can just decide based on an internal criterion (like, say traffic class) into which service it sends a frame. So this can't be the reason. As for (B), it was suggested to me that 802.1Q that doesn't allow out of order transmission within the same queue/traffic class. After all, what's at play here is whether a single TX queue device can support FP or not. Supporting FP would mean reordering PT frames relative to ET frames. I think this explanation is unsatisfactory too. Here's the only reference I saw in 802.1Q to ordering guarantees. Basically those guarantees are all *per priority*, and since PT/ET is also per priority, I don't see why reordering there would violate this: | 6.5.3 Frame misordering | The MAC Service (IEEE Std 802.1AC) permits a negligible rate of | reordering of frames with a given priority for a given combination of | destination address, source address, and flow hash, if present, | transmitted on a given VLAN. | MA_UNITDATA.indication service primitives corresponding to | MA_UNITDATA.request primitives, with the same requested priority and for | the same combination of VLAN classification, destination address, source | address, and flow hash, if present, are received in the same order as | the request primitives were processed. So for anything to make sense for me at all, I simply had to block out that phrase from my mind. I'm posting the concern here, publicly, in case someone can enlighten me. > - Question: would it be against the intention of the API to have a 1:1 > map of priorities to queues? No; as mentioned, going through mqprio's "map" and "queues" to resolve the priority to a queue is something that I intended to be possible. Sure, it's not handy either. It would have been handier if the "admin-status" array was part of the tc-mqprio config, like you did in your RFC. But right now I'm trying to not close the possibility for single queue devices (which won't implement mqprio) to support FP, since I haven't seen anything convincing that would disprove such a hw design as infeasible. If someone could share some compelling insight into this it would be really appreciated. > - Deal breaker: fixed 8 prios; idk, I don't think that's our biggest problem right now honestly.
Hi Vladimir, Vladimir Oltean <vladimir.oltean@nxp.com> writes: > On Wed, Aug 17, 2022 at 04:15:11PM -0700, Vinicius Costa Gomes wrote: >> I liked that in the API sense, using this "prio" concept we gain more >> flexibility, and we can express better what the hardware you work with >> can do, i.e. priority (for frame preemption purposes?) and queues are >> orthogonal. >> >> The problem I have is that the hardware I work with is more limited (as >> are some stmmac-based NICs that I am aware of) frame preemption >> capability and "priority" are attached to queues. > > Don't get me wrong, enetc also has "priority" attached to TX rings > (enetc_set_bdr_prio). Then there is another register which maps a TX > priority to a traffic class. This could be altered by the "map" property > of tc-mqprio, but in practice it isn't. The only supported prio->tc map > is "map 0 1 2 3 4 5 6 7". > > This is to say, enetc does not have a per-packet priority that gets > passed via BD metadata, but rather, packets inherit the configured > priority of the ring (Linux TX queue). > I see. Thanks for the explanation. Intel hardware (client/"not data center", that is) is simpler, we only have one level of "ordering", and we usually don't change the default: lowest queue index (0) has highest priority, (1) has lower, and so on. >> From the API perspective, it seems that I could say that "fp-prio" 0 is >> associated with queue 0, fp-prio 1 to queue 1, and so on, and everything >> will work. > > You have the tc-mqprio "queues" and "map" to juggle with, to end up > configuring FP per queue based on the provided per-priority settings. > >> The only thing that I am not happy at all is that there are exactly 8 >> fp-prios. >> >> The Linux network stack is more flexible than what 802.1Q defines, think >> skb->priority, number of TCs, as you said earlier, I would hate to >> impose some almost artificial limits here. And in future we are going to >> see TSN enabled devices with lots more queues. > > ~artificial limits~ > > IEEE 802.1Q says: > > | 12.30.1.1 framePreemptionStatusTable structure and data types > | > | The framePreemptionStatusTable (6.7.2) consists of 8 framePreemptionAdminStatus > | values (12.30.1.1.1), one per priority. > > I'm more concerned about setting things straight than about the limits > right now. I don't yet think everything is quite ok in that regard. > The netlink format proposed here is in principle extensible for > priorities > 7, if that will ever make sense. Yes, as the limits are not in the UAPI this is a minor point, agreed. What I am concerned is about use cases not handled by IEEE 802.1Q, for example, thinking about that other thread about PCP/DSCP: how about a network that wants to have as many priorities as the DSCP field (6 bits) allows, together with frame preemption (in the future we may have hardware that supports that). I am only thinking that we should not close that door. Again, minor point. > >> >> In short: >> - Comment: this section of the RFC is hardware independent, this >> behavior of queues and priorities being orthogonal is only valid for >> some implementations; > > Yes, and IMO it can only be that way, if I were to be truthful to my > interpretation of the intention 802.1Q spec (please contradict me if you > have a different interpretation of it!). > Sorry if I made you go that deep in the spec. I was only trying to say that for some implementations it's difficult/impossible to have totally orthogonal priorities vs. queue assignments, and that the commit message needs to take that into account (because the API introduced in this commit is implementation independent). > Note that I do see contradictions in 802.1Q, and I don't know how to > reconcile them. I've suppressed some of them for lack of a logical > explanation. I'm mentioning this for transparency; I don't know > everything either, but I need to make something out of what I do know. > > So 802.1Q says this: > > | 12.30.1.1.1 framePreemptionAdminStatus > | > | This parameter is the administrative value of the preemption status for > | the priority. It takes value express if frames queued for the priority > | are to be transmitted using the express service for the Port, or > | preemptible if frames queued for the priority are to be transmitted > | using the preemptible service for the Port and preemption is enabled for > | the Port. > > So far so good. In 802.1Q definitions, priority is attached to a packet > rather than to a traffic class / queue. But then the same clause continues: > > | Priorities that all map to the same traffic class should be constrained > | to use the same value of preemption status. > > Which seems to throw a wrench into everything. > > It raises two questions: > > (A) why is AdminStatus not per traffic class then? > (B) why is the constraint there, what's it trying to protect against? > > I've asked around, and I got unsatisfactory answers to both questions. > > A seemingly competent answer given to (A) by Rui (CCed) is that the > eMAC/pMAC selection on TX actually takes place in the MAC layer, in what > is called by 802.1AC-2016 "MA_UNITDATA.request" (and what everybody else > calls "MAC client xmit"). The parameters of this MAC service primitive > are: > > MA_UNITDATA.request(destination_address, > source_address, > mac_service_data_unit, > priority) > > So since only "priority" is passed to the MAC service (and traffic class isn't), > this means that the MAC service can only steer packets to eMAC/pMAC > based on what it's given (i.e. priority). > > But upon closer investigation, this explanation doesn't appear to hold > water very well. This is because clause 6.7.1 Support of the ISS by IEEE > Std 802.3 (Ethernet) from 802.1Q says: > > | If frame preemption (6.7.2) is supported on a Port, then the IEEE 802.3 > | MAC provides the following two MAC service interfaces (99.4 of IEEE Std > | 802.3br™-2016 [B21]): > | > | a) A preemptible MAC (pMAC) service interface > | b) An express MAC (eMAC) service interface > | > | For priority values that are identified in the frame preemption status > | table (6.7.2) as preemptible, frames that are selected for transmission > | shall be transmitted using the pMAC service instance, and for priority > | values that are identified in the frame preemption status table as > | express, frames that are selected for transmission shall be transmitted > | using the eMAC service instance. > | In all other respects, the Port behaves as if it is supported by a > | single MAC service interface. In particular, all frames received by the > | Port are treated as if they were received on a single MAC service > | interface regardless of whether they were received on the eMAC service > | interface or the pMAC service interface, except with respect to frame > | preemption. > > So there you go, the MAC has to provide *two* service interfaces, so the > 802.1Q upper layer (client of both) can just decide based on an internal > criterion (like, say traffic class) into which service it sends a frame. > So this can't be the reason. > > As for (B), it was suggested to me that 802.1Q that doesn't allow out of > order transmission within the same queue/traffic class. After all, > what's at play here is whether a single TX queue device can support FP > or not. Supporting FP would mean reordering PT frames relative to ET > frames. > > I think this explanation is unsatisfactory too. Here's the only > reference I saw in 802.1Q to ordering guarantees. Basically those > guarantees are all *per priority*, and since PT/ET is also per priority, > I don't see why reordering there would violate this: > > | 6.5.3 Frame misordering > | The MAC Service (IEEE Std 802.1AC) permits a negligible rate of > | reordering of frames with a given priority for a given combination of > | destination address, source address, and flow hash, if present, > | transmitted on a given VLAN. > | MA_UNITDATA.indication service primitives corresponding to > | MA_UNITDATA.request primitives, with the same requested priority and for > | the same combination of VLAN classification, destination address, source > | address, and flow hash, if present, are received in the same order as > | the request primitives were processed. > > So for anything to make sense for me at all, I simply had to block out > that phrase from my mind. I'm posting the concern here, publicly, in > case someone can enlighten me. > I, truly, admire your effort (and the time you took) and your explanation. I think I understand where you come from better. Thank you. Taking a few steps back (I don't like this expression, but anyway), I think that strict conformance is not an objective of the Linux network stack as a whole. What I mean by "not strict": that it doesn't try to follow the letter of the "baggy pants" model that IEEE 802.* and friends define. Being efficient/useful has a higher importance. But being certifiable is a target/ideal, i.e. the "on the wire" and user visible (statistics, etc) behavior can/"should be able" to be configured to pass conformance tests or even better, interoperability tests. Now back on track, in my mental model, this simplifies things to a couple of questions that I ask myself about this RFC, in particular about this priority idea: - Is this idea useful? My answer is yes. - Can this idea be used to configure a certifiable system? Also, yes. So, I have no problems with it. >> - Question: would it be against the intention of the API to have a 1:1 >> map of priorities to queues? > > No; as mentioned, going through mqprio's "map" and "queues" to resolve > the priority to a queue is something that I intended to be possible. > Sure, it's not handy either. It would have been handier if the > "admin-status" array was part of the tc-mqprio config, like you did in > your RFC. Just to be sure, say that my plan is that I document that for the igc driver, the mapping of priorities for the "FP" command is prio (0) -> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to queue 0. Would this be ok? > > But right now I'm trying to not close the possibility for single queue > devices (which won't implement mqprio) to support FP, since I haven't > seen anything convincing that would disprove such a hw design as > infeasible. If someone could share some compelling insight into this it > would be really appreciated. > The important part is that you didn't close that door. If someone is brave enough to design that hw, it seems that it can be made to work, and that's good enough. >> - Deal breaker: fixed 8 prios; > > idk, I don't think that's our biggest problem right now honestly. Perhaps I am being too optimistic, but the way I see, your proposal is already expressible enough to be used in a "certifiable" system. I was too harsh with "deal breaker. So, what is the biggest problem that you see? Overall, no issues from my side. Cheers,
On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote: > Yes, as the limits are not in the UAPI this is a minor point, agreed. > > What I am concerned is about use cases not handled by IEEE 802.1Q, for > example, thinking about that other thread about PCP/DSCP: how about a > network that wants to have as many priorities as the DSCP field (6 bits) > allows, together with frame preemption (in the future we may have > hardware that supports that). I am only thinking that we should not > close that door. > > Again, minor point. Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE constants of 16 rather than the 8 you'd expect, I guess I don't have that big of a problem to also expose 16 admin-status values for FP. I can limit the drivers I have to only look at the first 8 entries. But what do 16 priorities mean to ethtool? 16 priorities are a tc thing. Does ethtool even have a history of messing with packet priorities? What am I even doing? lol. > Sorry if I made you go that deep in the spec. > > I was only trying to say that for some implementations it's > difficult/impossible to have totally orthogonal priorities vs. queue > assignments, and that the commit message needs to take that into account > (because the API introduced in this commit is implementation > independent). Yes, the comment is totally valid. However, between "difficult" and "impossible" there is a big leap, and if it is really impossible, I'd like to know precisely why. > I, truly, admire your effort (and the time you took) and your > explanation. I think I understand where you come from better. Thank you. > > Taking a few steps back (I don't like this expression, but anyway), I > think that strict conformance is not an objective of the Linux network > stack as a whole. What I mean by "not strict": that it doesn't try to > follow the letter of the "baggy pants" model that IEEE 802.* and friends > define. Being efficient/useful has a higher importance. > > But being certifiable is a target/ideal, i.e. the "on the wire" and user > visible (statistics, etc) behavior can/"should be able" to be configured > to pass conformance tests or even better, interoperability tests. > > Now back on track, in my mental model, this simplifies things to a > couple of questions that I ask myself about this RFC, in particular > about this priority idea: > - Is this idea useful? My answer is yes. > - Can this idea be used to configure a certifiable system? Also, yes. > > So, I have no problems with it. > I've taken quite a few steps back now, unfortunately I'm still back to where I was :) I've talked to more people within NXP, explained to them that the standard technically does not disallow the existence of FP for single queue devices, for this and that reason. The only responses I got varied from the equivalent of a frightened "brrr", to a resounding "no, that isn't what was intended". Of course I tried to dig deeper, and I was told that the configuration is per (PCP) priority because this is the externally visible label carried by packets, so an external management system that has to coordinate FP across a LAN does not need to know how many traffic classes each node has, or how the priorities map to the traffic classes. Only the externally visible behavior is important, which is that packets with priority (PCP) X are sent on the wire using a preemptable SFD or a normal SFD, because the configuration also affects what the other nodes in the network expect. You might contrast this with tc-taprio, where the open/closed gates really are defined per traffic class, and you might wonder why it wasn't important there for the configuration to also be handled using the externally-visible priority (PCP) as input. Yes, but with tc-taprio, you can schedule within your traffic classes all you want, but this does not change the externally visible appearance of a frame, so it doesn't matter. The scheduling is orthogonal to whether a packet will be sent as preemptable or not. Is this explanation satisfactory, and where does it leave us? For me it isn't, and it leaves me nowhere new, but it's the best I got. So ok, single TX queue devices with FP probably are possibly a fun physics class experiment, but practically minded vendors won't implement them. But does the alternate justification given change my design decision in any way (i.e. expose "preemptable priorities" in ethtool as opposed to "preemptable queues" in tc-mqprio and tc-taprio as you did)? No. It just becomes a matter of enforcing the recommended restriction that preemptable and express priorities shouldn't be mixed within the same traffic class, something which my current patch set does not do. [ yes, "should" is a technical term in IEEE standards which means "not mandatory", and that's precisely how the constraint from 12.30.1.1.1 was formulated ] I guess when push comes to shove, somebody will have to answer the question of why was FP exposed in Linux by something other than what the standard defined it for (prio), and I wouldn't know how to answer that. > >> - Question: would it be against the intention of the API to have a 1:1 > >> map of priorities to queues? > > > > No; as mentioned, going through mqprio's "map" and "queues" to resolve > > the priority to a queue is something that I intended to be possible. > > Sure, it's not handy either. It would have been handier if the > > "admin-status" array was part of the tc-mqprio config, like you did in > > your RFC. > > Just to be sure, say that my plan is that I document that for the igc > driver, the mapping of priorities for the "FP" command is prio (0) -> > queue (0), prio (1) -> queue (1), and so on. Note that the mqprio > mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to > queue 0. Would this be ok? If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands in queue 0 where your actual FP knob is, then I'd expect the driver to follow the reverse path between queue -> tc -> prios mapped to that tc. And I would not state that prio 0 is queue 0 and skb->priority values from the whole spectrum may land there, kthxbye. Also, didn't you say earlier that "lowest queue index (0) has highest priority"? In Linux, skb->priority 0 has lower priority that skb->priority 1. How does that work out? In fact, for both enetc and felix, I have to do this exact thing as you, except that my hardware knobs are halfway in the path (per tc, not per queue). The reason why I'm not doing it is because we only consider the 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably. It can't even be any other way given the current code, since the struct tc_taprio_qopt_offload passed via ndo_setup_tc does not even contain the tc_mqprio_qopt sub-structure with the prio_tc_map. Drivers are probably expected to look, from the ndo_setup_tc hook, at their dev->prio_tc_map as changed by netdev_set_prio_tc_map() in taprio_change(). Why this API is different compared to struct tc_mqprio_qopt_offload, I don't know. What prevents you from doing this? I'd like to understand as precisely as you can explain, to see how what I'm proposing really sounds when applied to Intel hardware. > >> - Deal breaker: fixed 8 prios; > > > > idk, I don't think that's our biggest problem right now honestly. > > Perhaps I am being too optimistic, but the way I see, your proposal is > already expressible enough to be used in a "certifiable" system. I was > too harsh with "deal breaker. > > So, what is the biggest problem that you see? > > Overall, no issues from my side. Let me recap my action items out loud for v2 (some of them weren't discussed publicly per se). * Enforce the constraint recommended by 12.30.1.1.1, somewhere as a static inline helper function that can be called by drivers, from the tc-taprio/tc-mqprio and ethtool-fp code paths. Note that I could very quickly run into a very deep rabbit hole here, if I actually bother to offload to hardware (enetc) the prio_tc_map communicated from tc-mqprio. I'll try to avoid that. I noticed that the mqprio_parse_opt() function doesn't validate the provided queue offsets and count if hw offload is requested (therefore allowing overlapping queue ranges). I really dislike that, because it was probably done for a reason. * Port the isochron script I have for enetc (endpoint) FP latency measurements to kselftest format. * Do something such that eMAC/pMAC counters are also summed up somewhere centrally by ethtool, and reported to the user either individually or aggregated. * Reorganize the netlink UAPI such as to remove ETHTOOL_A_FP_PARAM_TABLE and just have multiple ETHTOOL_A_FP_PARAM_ENTRY nests in the parent. * The enetc verification state machine is off by default. The felix verification state machine is on by default. I guess we should figure out a reasonable default for all drivers out there? * For some reason, I notice that the verification state machine remains FAILED on felix when it links to enetc and that has verification enabled. I need to disable it on enetc, enable it again, and only then will the MAC merge interrupt on felix say that verification completed successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up() to toggle verification off and on, if it was requested, in order for this to be seamless to the user. If it's only these, nothing seems exactly major. However, I assume this is only the beginning. I fully expect there to be one more round of "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly mentally ready (if credible arguments are to be put forward, I'd rather have them now while I haven't yet put in the extra work in the direction I'm about to).
Vladimir Oltean <vladimir.oltean@nxp.com> writes: > On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote: >> Yes, as the limits are not in the UAPI this is a minor point, agreed. >> >> What I am concerned is about use cases not handled by IEEE 802.1Q, for >> example, thinking about that other thread about PCP/DSCP: how about a >> network that wants to have as many priorities as the DSCP field (6 bits) >> allows, together with frame preemption (in the future we may have >> hardware that supports that). I am only thinking that we should not >> close that door. >> >> Again, minor point. > > Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE > constants of 16 rather than the 8 you'd expect, I guess I don't have > that big of a problem to also expose 16 admin-status values for FP. > I can limit the drivers I have to only look at the first 8 entries. > But what do 16 priorities mean to ethtool? 16 priorities are a tc thing. > Does ethtool even have a history of messing with packet priorities? > What am I even doing? lol. > That's a good point. ethtool doesn't has a history of knowing about priorities (and what they even mean). But I like the flexibility that this approach provides. >> Sorry if I made you go that deep in the spec. >> >> I was only trying to say that for some implementations it's >> difficult/impossible to have totally orthogonal priorities vs. queue >> assignments, and that the commit message needs to take that into account >> (because the API introduced in this commit is implementation >> independent). > > Yes, the comment is totally valid. However, between "difficult" and > "impossible" there is a big leap, and if it is really impossible, I'd > like to know precisely why. > Sometimes the enemy of good is perfect (in this case precisely). Perhaps aiming for "good enough" understanding? >> I, truly, admire your effort (and the time you took) and your >> explanation. I think I understand where you come from better. Thank you. >> >> Taking a few steps back (I don't like this expression, but anyway), I >> think that strict conformance is not an objective of the Linux network >> stack as a whole. What I mean by "not strict": that it doesn't try to >> follow the letter of the "baggy pants" model that IEEE 802.* and friends >> define. Being efficient/useful has a higher importance. >> >> But being certifiable is a target/ideal, i.e. the "on the wire" and user >> visible (statistics, etc) behavior can/"should be able" to be configured >> to pass conformance tests or even better, interoperability tests. >> >> Now back on track, in my mental model, this simplifies things to a >> couple of questions that I ask myself about this RFC, in particular >> about this priority idea: >> - Is this idea useful? My answer is yes. >> - Can this idea be used to configure a certifiable system? Also, yes. >> >> So, I have no problems with it. >> > > I've taken quite a few steps back now, unfortunately I'm still back to > where I was :) > > I've talked to more people within NXP, explained to them that the > standard technically does not disallow the existence of FP for single > queue devices, for this and that reason. The only responses I got varied > from the equivalent of a frightened "brrr", to a resounding "no, that > isn't what was intended". Of course I tried to dig deeper, and I was > told that the configuration is per (PCP) priority because this is the > externally visible label carried by packets, so an external management > system that has to coordinate FP across a LAN does not need to know how > many traffic classes each node has, or how the priorities map to the > traffic classes. Only the externally visible behavior is important, > which is that packets with priority (PCP) X are sent on the wire using a > preemptable SFD or a normal SFD, because the configuration also affects > what the other nodes in the network expect. You might contrast this with > tc-taprio, where the open/closed gates really are defined per traffic > class, and you might wonder why it wasn't important there for the > configuration to also be handled using the externally-visible priority > (PCP) as input. Yes, but with tc-taprio, you can schedule within your > traffic classes all you want, but this does not change the externally > visible appearance of a frame, so it doesn't matter. The scheduling is > orthogonal to whether a packet will be sent as preemptable or not. > > Is this explanation satisfactory, and where does it leave us? For me it > isn't, and it leaves me nowhere new, but it's the best I got. > > So ok, single TX queue devices with FP probably are possibly a fun > physics class experiment, but practically minded vendors won't implement > them. But does the alternate justification given change my design decision > in any way (i.e. expose "preemptable priorities" in ethtool as opposed > to "preemptable queues" in tc-mqprio and tc-taprio as you did)? No. > It just becomes a matter of enforcing the recommended restriction that > preemptable and express priorities shouldn't be mixed within the same > traffic class, something which my current patch set does not do. > > [ yes, "should" is a technical term in IEEE standards which means "not > mandatory", and that's precisely how the constraint from 12.30.1.1.1 > was formulated ] > > I guess when push comes to shove, somebody will have to answer the > question of why was FP exposed in Linux by something other than what the > standard defined it for (prio), and I wouldn't know how to answer that. > In my mind the answer is: because it's cool to support more use-cases than the standard defines, Linux is still a vehicle for experimentation, more space for our users to play. >> >> - Question: would it be against the intention of the API to have a 1:1 >> >> map of priorities to queues? >> > >> > No; as mentioned, going through mqprio's "map" and "queues" to resolve >> > the priority to a queue is something that I intended to be possible. >> > Sure, it's not handy either. It would have been handier if the >> > "admin-status" array was part of the tc-mqprio config, like you did in >> > your RFC. >> >> Just to be sure, say that my plan is that I document that for the igc >> driver, the mapping of priorities for the "FP" command is prio (0) -> >> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio >> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to >> queue 0. Would this be ok? > > If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands > in queue 0 where your actual FP knob is, then I'd expect the driver to > follow the reverse path between queue -> tc -> prios mapped to that tc. > And I would not state that prio 0 is queue 0 and skb->priority values > from the whole spectrum may land there, kthxbye. Also, didn't you say > earlier that "lowest queue index (0) has highest priority"? In Linux, > skb->priority 0 has lower priority that skb->priority 1. How does that > work out? > Ah, sorry for the confusion, when I said "lowest queue index (0) has highest priority" it was more in the sense that queue 0 has higher precedence, packets from it are fetched first, that kind of thing. > In fact, for both enetc and felix, I have to do this exact thing as you, > except that my hardware knobs are halfway in the path (per tc, not per > queue). The reason why I'm not doing it is because we only consider the > 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably. > Hm, that's interesting. I think that's a difference on how we are modeling our traffic, here's one very common mqprio config for AVB use cases: $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 2@2 \ hw 0 One priority for Class A traffic, one priority for Class B traffic and multiple priorities mapped to a single TC, the rest is Best Effort and it's mapped to multiple queues. The way we model traffic, TCs and priorities are different concepts. One important detail is that in TSN mode, we use the "priority" (meaning precedence in this context) fetch order for the queues: queue 0 first, then queue 1, and so on. > It can't even be any other way given the current code, since the struct > tc_taprio_qopt_offload passed via ndo_setup_tc does not even contain the > tc_mqprio_qopt sub-structure with the prio_tc_map. Drivers are probably > expected to look, from the ndo_setup_tc hook, at their dev->prio_tc_map > as changed by netdev_set_prio_tc_map() in taprio_change(). Why this API > is different compared to struct tc_mqprio_qopt_offload, I don't know. At least for igb and igc I never felt it was necessary to look at the prio_tc_map. The core netstack will already enqueue each packet to the right queue. All I need to know is how to configure each queue. Also note that for igb and igc I didn't ndo_setup_tc for mqprio, because the default behavior is fine, that's also why that taprio doesn't send that information (but this is easily fixed). This could be a consequence that prio_tc_map (and friends) was introduced by Intel folks (long before my time) and this doesn't work that well for other vendors. > > What prevents you from doing this? I'd like to understand as precisely > as you can explain, to see how what I'm proposing really sounds when > applied to Intel hardware. > If I am understanding you correctly, nothing prevents me from doing that. It's just that everything worked out that from the driver side I don't need to know about priorities, traffic classes, I only need to care about queues. That's why I for my "ethtool" proposal I focused on setting preemption per queue, because the drivers I work with don't need to know about anything else. But that can change if necessary. The cost doesn't seem too big. >> >> - Deal breaker: fixed 8 prios; >> > >> > idk, I don't think that's our biggest problem right now honestly. >> >> Perhaps I am being too optimistic, but the way I see, your proposal is >> already expressible enough to be used in a "certifiable" system. I was >> too harsh with "deal breaker. >> >> So, what is the biggest problem that you see? >> >> Overall, no issues from my side. > > Let me recap my action items out loud for v2 (some of them weren't > discussed publicly per se). > > * Enforce the constraint recommended by 12.30.1.1.1, somewhere as a > static inline helper function that can be called by drivers, from the > tc-taprio/tc-mqprio and ethtool-fp code paths. Note that I could very > quickly run into a very deep rabbit hole here, if I actually bother to > offload to hardware (enetc) the prio_tc_map communicated from tc-mqprio. > I'll try to avoid that. I noticed that the mqprio_parse_opt() function > doesn't validate the provided queue offsets and count if hw offload is > requested (therefore allowing overlapping queue ranges). I really > dislike that, because it was probably done for a reason. > I agree with that mqprio offload lack of validation being weird. +1 for the helper, those kind of functions help document a lot the expectations. > * Port the isochron script I have for enetc (endpoint) FP latency > measurements to kselftest format. > > * Do something such that eMAC/pMAC counters are also summed up somewhere > centrally by ethtool, and reported to the user either individually or > aggregated. > > * Reorganize the netlink UAPI such as to remove ETHTOOL_A_FP_PARAM_TABLE > and just have multiple ETHTOOL_A_FP_PARAM_ENTRY nests in the parent. > Another +1 > * The enetc verification state machine is off by default. The felix > verification state machine is on by default. I guess we should figure > out a reasonable default for all drivers out there? > From the language of IEEE 802.3 ("disableVerify" and friends), it seems that the expectation is that verification is enabled by default. *But* from a interoperability point of view, I would only enable verification by default after some testing in the wild. Or only send the verify packet if frame preemption is also enabled. > * For some reason, I notice that the verification state machine remains > FAILED on felix when it links to enetc and that has verification > enabled. I need to disable it on enetc, enable it again, and only then > will the MAC merge interrupt on felix say that verification completed > successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up() > to toggle verification off and on, if it was requested, in order for > this to be seamless to the user. > > If it's only these, nothing seems exactly major. However, I assume this > is only the beginning. I fully expect there to be one more round of > "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly > mentally ready (if credible arguments are to be put forward, I'd rather > have them now while I haven't yet put in the extra work in the direction > I'm about to). I really know the feeling "not mentally ready". I am still in the "tc" camp. But I cannot deny that this interface seems to work for the uses that we have in mind. That's (at least) good enough for me ;-) Cheers,
On Fri, Sep 09, 2022 at 05:19:35PM -0700, Vinicius Costa Gomes wrote: > Vladimir Oltean <vladimir.oltean@nxp.com> writes: > > > On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote: > >> Yes, as the limits are not in the UAPI this is a minor point, agreed. > >> > >> What I am concerned is about use cases not handled by IEEE 802.1Q, for > >> example, thinking about that other thread about PCP/DSCP: how about a > >> network that wants to have as many priorities as the DSCP field (6 bits) > >> allows, together with frame preemption (in the future we may have > >> hardware that supports that). I am only thinking that we should not > >> close that door. > >> > >> Again, minor point. > > > > Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE > > constants of 16 rather than the 8 you'd expect, I guess I don't have > > that big of a problem to also expose 16 admin-status values for FP. > > I can limit the drivers I have to only look at the first 8 entries. > > But what do 16 priorities mean to ethtool? 16 priorities are a tc thing. > > Does ethtool even have a history of messing with packet priorities? > > What am I even doing? lol. > > That's a good point. ethtool doesn't has a history of knowing about > priorities (and what they even mean). But I like the flexibility that > this approach provides. Fair. On the other hand, we need to weigh the pros and cons of moving the adminStatus to other places. dcbnl - pro: is a hardware-only interface, just like ethtool, which is suitable for a hardware-only feature like FP - pro: knows about priorities (app table, PFC). I keep coming back to PFC as an example because it has the absolute exact same problems as FP, which is that it is defined per priority, but there is a "NOTE 2 - Mixing PFC and non-PFC priorities in the same queue results in non-PFC traffic being paused causing congestion spreading, and therefore is not recommended." - con: I think the dcbnl app table has the same limitation that priorities go only up to 8 (the comment above struct dcb_app says "@priority: 3-bit unsigned integer indicating priority for IEEE") - con: where do dcbnl's responsibilities begin, and where do they end, exactly? My understanding is that DCB and TSN are exactly the same kind of thing, i.e. extensions to 802.1Q (and 802.3, in case of TSN) which were made in order to align Ethernet (and bridges) with a set of vertical use cases which weren't quite possible before. All is well, except TSN was integrated quite differently into the Linux kernel, i.o.w. we don't have a "tsnnl" like there is a "dcbnl", but things are much more fine-grained, and integrated with the subsystem that they extend, rather than creating a subsystem aligned to a use case. So the question becomes, is anything we're doing here, for TSN, extending DCB? With Microchip's desire to add non-standard APP table selectors for VLAN PCP/DEI prioritization, I think the general movement is towards more reuse of dcbnl constructs, and towards forgetting that dcbnl is for DCB. But there are many things I don't understand about dcbnl, for example why is it possible to set an interface's prio-tc map using dcb, and what does it even mean in relationship to tc-mqprio, tc-taprio etc (or even with "priomap" from tc-ets). $ dcb ets set dev eth0 prio-tc 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 Jakub had an answer which explained this in terms of Conway's law: https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/#24870325 but as beautiful that answer is, as many questions it still leaves for me. Like if we were to accept dcbnl as part of the UAPI for TSN, how does this fact fit within our story that the already use the tc-taprio map to map priorities to traffic classes? ethtool: - pro: is in the same subsystem as MAC merge layer configuration (for which I have no doubts that ethtool is the right place) - pro: is not aligned to any specific vertical use case, like dcbnl is - con: knows nothing about priorities - con: for hardware where FP adminStatus is per traffic class rather than per priority, we'll have to change the adminStatus not only from changes made via ethtool to the FP adminStatus, but also from changes made to the prio_tc_map (tc I guess?). But this may also be a subtle pro for dcbnl, since that has its own prio_tc_map, so we'd remain within the same subsystem? tc (part of tc-mqprio and tc-taprio): - open question: does the FP adminStatus influence scheduling or not? Normally I'd say no, adminStatus only decides whether the eMAC or pMAC will be used to send a frame, and this is orthogonal to scheduling. But 802.1Q says that "It should be noted that, other things being equal, designating a priority as “express” effectively increases its priority above that of any priority designated as “preemptible”." Is this phrase enough to justify a modification of the qdisc dequeue method in, say, taprio, which would prioritize a lower priority qdisc queue A over a higher one B, if A is express but B is preemptable? - pro: assuming that in lack of tc-mqprio or tc-taprio, an interface's number of traffic classes will collapse to 1, then putting FP adminStatus here makes it trivial to support hardware with FP per traffic class - con: more difficult for user space to change FP adminStatus independently of the current qdisc? > >> > No; as mentioned, going through mqprio's "map" and "queues" to resolve > >> > the priority to a queue is something that I intended to be possible. > >> > Sure, it's not handy either. It would have been handier if the > >> > "admin-status" array was part of the tc-mqprio config, like you did in > >> > your RFC. > >> > >> Just to be sure, say that my plan is that I document that for the igc > >> driver, the mapping of priorities for the "FP" command is prio (0) -> > >> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio > >> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to > >> queue 0. Would this be ok? > > > > If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands > > in queue 0 where your actual FP knob is, then I'd expect the driver to > > follow the reverse path between queue -> tc -> prios mapped to that tc. > > And I would not state that prio 0 is queue 0 and skb->priority values > > from the whole spectrum may land there, kthxbye. Also, didn't you say > > earlier that "lowest queue index (0) has highest priority"? In Linux, > > skb->priority 0 has lower priority that skb->priority 1. How does that > > work out? > > > > Ah, sorry for the confusion, when I said "lowest queue index (0) has > highest priority" it was more in the sense that queue 0 has higher > precedence, packets from it are fetched first, that kind of thing. Fetched first by whom? In ENETC, the transmission selection chooses between TX BD rings whose configured priority maps to the same traffic class based on a weighted round robin dequeuing policy. Groups of TX BD rings of different priorities are in strict priority dequeuing relative to each other. And the number of the TX BD ring has nothing to do with its priority. > > In fact, for both enetc and felix, I have to do this exact thing as you, > > except that my hardware knobs are halfway in the path (per tc, not per > > queue). The reason why I'm not doing it is because we only consider the > > 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably. > > > > Hm, that's interesting. I think that's a difference on how we are > modeling our traffic, here's one very common mqprio config for AVB use > cases: > > $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \ > num_tc 3 \ > map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ > queues 1@0 1@1 2@2 \ > hw 0 > > One priority for Class A traffic, one priority for Class B traffic and > multiple priorities mapped to a single TC, the rest is Best Effort and > it's mapped to multiple queues. The way we model traffic, TCs and > priorities are different concepts. So here's what I don't understand. Why do you need 16 priority values, when in the end you'll map them to only 3 egress traffic classes anyway? > One important detail is that in TSN mode, we use the "priority" (meaning > precedence in this context) fetch order for the queues: queue 0 first, > then queue 1, and so on. So the number of the queue is what actually contributes to the egress scheduling algorithm of your NIC? Strange. Tell me something so I'm sure I'm understanding you properly. netdev_pick_tx() will pick a TX queue using skb_tx_hash(), which hashes between the netdev queues mapped to the same traffic class as the traffic class that skb->priority is mapped to. In your case, you have 2@2 (2 TX queues for traffic class 2), call these queues 2 and 3. You're saying that hardware prioritizes queue 2 over 3. But the network stack doesn't know that. It hashes equally packets that go to tc 2 between queue 2 and queue 3. In our case, this misconfiguration would be immediately obvious, because we could have an offloaded tc-taprio schedule in ENETC, and Linux could be enqueuing into a ring that ultimately has a different schedule than software knows about. How does this work out for you? > > * The enetc verification state machine is off by default. The felix > > verification state machine is on by default. I guess we should figure > > out a reasonable default for all drivers out there? > > > > From the language of IEEE 802.3 ("disableVerify" and friends), it seems > that the expectation is that verification is enabled by default. *But* > from a interoperability point of view, I would only enable verification > by default after some testing in the wild. Or only send the verify > packet if frame preemption is also enabled. I wonder whether the answer to this one is actually "Additional Ethernet Capabilities TLV" (802.3 clause 79.3.7). It would be good if we could all start off with FP/MM disabled, verification state machine included, until the LLDP daemon receives a TLV that says PreemptSupported and PreemptEnabled are true. If both local and remote preemption are supported and enabled, we tell the kernel to enable the MAC merge layer and kick off verification. In turn this will alter PreemptActive that gets set in future replies. I don't know, the flow may be different. > > * For some reason, I notice that the verification state machine remains > > FAILED on felix when it links to enetc and that has verification > > enabled. I need to disable it on enetc, enable it again, and only then > > will the MAC merge interrupt on felix say that verification completed > > successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up() > > to toggle verification off and on, if it was requested, in order for > > this to be seamless to the user. > > > > If it's only these, nothing seems exactly major. However, I assume this > > is only the beginning. I fully expect there to be one more round of > > "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly > > mentally ready (if credible arguments are to be put forward, I'd rather > > have them now while I haven't yet put in the extra work in the direction > > I'm about to). > > I really know the feeling "not mentally ready". > > I am still in the "tc" camp. But I cannot deny that this interface seems > to work for the uses that we have in mind. That's (at least) good enough > for me ;-) To me right now, FP adminStatus is not exactly something that the scheduler is concerned with, and this is why tc is kind of my last choice. But I agree that it makes certain things simpler. Plus, it seems to be the only subsystem aware of 16 priorities and 16 traffic classes, which you seem to care about.
Vladimir Oltean <vladimir.oltean@nxp.com> writes: > On Fri, Sep 09, 2022 at 05:19:35PM -0700, Vinicius Costa Gomes wrote: >> Vladimir Oltean <vladimir.oltean@nxp.com> writes: >> >> > On Tue, Aug 23, 2022 at 05:35:11PM -0700, Vinicius Costa Gomes wrote: >> >> Yes, as the limits are not in the UAPI this is a minor point, agreed. >> >> >> >> What I am concerned is about use cases not handled by IEEE 802.1Q, for >> >> example, thinking about that other thread about PCP/DSCP: how about a >> >> network that wants to have as many priorities as the DSCP field (6 bits) >> >> allows, together with frame preemption (in the future we may have >> >> hardware that supports that). I am only thinking that we should not >> >> close that door. >> >> >> >> Again, minor point. >> > >> > Since Linux seems to have settled on TC_PRIO_MAX, TC_QOPT_MAX_QUEUE >> > constants of 16 rather than the 8 you'd expect, I guess I don't have >> > that big of a problem to also expose 16 admin-status values for FP. >> > I can limit the drivers I have to only look at the first 8 entries. >> > But what do 16 priorities mean to ethtool? 16 priorities are a tc thing. >> > Does ethtool even have a history of messing with packet priorities? >> > What am I even doing? lol. >> >> That's a good point. ethtool doesn't has a history of knowing about >> priorities (and what they even mean). But I like the flexibility that >> this approach provides. > > Fair. On the other hand, we need to weigh the pros and cons of moving > the adminStatus to other places. > > dcbnl > > - pro: is a hardware-only interface, just like ethtool, which is > suitable for a hardware-only feature like FP > > - pro: knows about priorities (app table, PFC). I keep coming back to > PFC as an example because it has the absolute exact same problems as > FP, which is that it is defined per priority, but there is a "NOTE 2 - > Mixing PFC and non-PFC priorities in the same queue results in non-PFC > traffic being paused causing congestion spreading, and therefore is > not recommended." > > - con: I think the dcbnl app table has the same limitation that > priorities go only up to 8 (the comment above struct dcb_app says > "@priority: 3-bit unsigned integer indicating priority for IEEE") > > - con: where do dcbnl's responsibilities begin, and where do they end, > exactly? My understanding is that DCB and TSN are exactly the same > kind of thing, i.e. extensions to 802.1Q (and 802.3, in case of TSN) > which were made in order to align Ethernet (and bridges) with a set of > vertical use cases which weren't quite possible before. All is well, > except TSN was integrated quite differently into the Linux kernel, > i.o.w. we don't have a "tsnnl" like there is a "dcbnl", but things are > much more fine-grained, and integrated with the subsystem that they > extend, rather than creating a subsystem aligned to a use case. So the > question becomes, is anything we're doing here, for TSN, extending > DCB? With Microchip's desire to add non-standard APP table selectors > for VLAN PCP/DEI prioritization, I think the general movement is > towards more reuse of dcbnl constructs, and towards forgetting that > dcbnl is for DCB. But there are many things I don't understand about > dcbnl, for example why is it possible to set an interface's prio-tc > map using dcb, and what does it even mean in relationship to > tc-mqprio, tc-taprio etc (or even with "priomap" from tc-ets). > I don't know. I still have to learn more about DCB. I suppose HW that has both TSN and DCB features is not too far in the future. So is it possible that people are going to try to use them together? Does it make sense? I don't know. Here are some notes from a quick look at the code, anyone feel free to correct me, or add anything. Looking at two drivers code (mlx5 and ice) it seems that the dcbnl prio-tc map has a similar objective to the netdev prio_tc_map (I am looking at how that map feeds into each driver ndo_select_queue()), but inside the driver and with more flexibility ("in the DSCP mode, overwrite the skb->priority with this value and use this value for netdev_tx_pick()", that kind of thing). So using tc-mqprio/tc-taprio with 'dcb ets' 'prio-map' will send traffic to suprising traffic classes/queues. tc-ets priomap feels like another layer, you use that to setup buckets in which run what seems to be burst protection algorithms, and this happens after ndo_select_queue() runs, so this map seems almost independent of the dcbnl prio-tc map (?). > $ dcb ets set dev eth0 prio-tc 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 > > Jakub had an answer which explained this in terms of Conway's law: > https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/#24870325 > but as beautiful that answer is, as many questions it still leaves for > me. Like if we were to accept dcbnl as part of the UAPI for TSN, how > does this fact fit within our story that the already use the tc-taprio > map to map priorities to traffic classes? Setting the 'dcb ets' "prio-tc" map together with tc-taprio or tc-mqprio, doesn't make much sense, true. But if I am understanding the code right, you could use that 'dcb' command, and configure frame preemption via ethtool (for example) and it could work, without tc-taprio or tc-mqprio to setup the prio:tc mapping. Perhaps, for now, the solution is to disallow this as invalid, that you can only have one such priority map configured, either dcb prio-map or taprio/mqprio. Not sure. For the long term, perhaps the solution is to have a single place that does this mapping (from skb to traffic classes and then to queues) more centralized. 'dcb' would populate that as would 'mqprio/taprio'. This seems to imply that ndo_select_queue() is a escape hatch, that in the ideal world it would not exist. Would this create worse problems? (Reality is pointing that it doesn't seem that we can avoid making drivers aware of this map, so this idea doesn't seem to help the situation much) > > ethtool: > > - pro: is in the same subsystem as MAC merge layer configuration (for > which I have no doubts that ethtool is the right place) > > - pro: is not aligned to any specific vertical use case, like dcbnl is > > - con: knows nothing about priorities > > - con: for hardware where FP adminStatus is per traffic class rather > than per priority, we'll have to change the adminStatus not only from > changes made via ethtool to the FP adminStatus, but also from changes > made to the prio_tc_map (tc I guess?). But this may also be a subtle > pro for dcbnl, since that has its own prio_tc_map, so we'd remain > within the same subsystem? If the driver also supports DCB, it will have the offload functions implemented, so it be aware of changes in that map. The problem is now mqprio, that if it isn't offloaded, the driver is not aware of any changes. But mqprio doesn't implement .change(), which makes things a bit easier. 'ethtool' is gaining my vote. > > tc (part of tc-mqprio and tc-taprio): > > - open question: does the FP adminStatus influence scheduling or not? > Normally I'd say no, adminStatus only decides whether the eMAC or pMAC > will be used to send a frame, and this is orthogonal to scheduling. > But 802.1Q says that "It should be noted that, other things being > equal, designating a priority as “express” effectively increases its > priority above that of any priority designated as “preemptible”." > Is this phrase enough to justify a modification of the qdisc dequeue > method in, say, taprio, which would prioritize a lower priority qdisc > queue A over a higher one B, if A is express but B is preemptable? I understood this as: if a preemptible packet (higher priority) and expresss packet (lower priority) arrive at the MAC at the same time, the express packet will finish transmitting before the preemptible, causing the express packet to have an 'effective' higher priority. I don't think we need to change anything on the dequeueing in the qdisc side (and trying to solve it is going to be awkward, having to check the neighbor qdiscs for packets being enqueued/peeked, ugh!). (as a note: for i225, traffic in queues marked as express will only preempt traffic from lower numbered queues, this behavior doesn't seem to be forbidden by the standard) > > - pro: assuming that in lack of tc-mqprio or tc-taprio, an interface's > number of traffic classes will collapse to 1, then putting FP > adminStatus here makes it trivial to support hardware with FP per > traffic class > > - con: more difficult for user space to change FP adminStatus > independently of the current qdisc? > >> >> > No; as mentioned, going through mqprio's "map" and "queues" to resolve >> >> > the priority to a queue is something that I intended to be possible. >> >> > Sure, it's not handy either. It would have been handier if the >> >> > "admin-status" array was part of the tc-mqprio config, like you did in >> >> > your RFC. >> >> >> >> Just to be sure, say that my plan is that I document that for the igc >> >> driver, the mapping of priorities for the "FP" command is prio (0) -> >> >> queue (0), prio (1) -> queue (1), and so on. Note that the mqprio >> >> mapping could be that "skb->priority" 10 goes to TC 1 which is mapped to >> >> queue 0. Would this be ok? >> > >> > If skb->priority 10 goes to TC 1 via the prio_tc_map, and finally lands >> > in queue 0 where your actual FP knob is, then I'd expect the driver to >> > follow the reverse path between queue -> tc -> prios mapped to that tc. >> > And I would not state that prio 0 is queue 0 and skb->priority values >> > from the whole spectrum may land there, kthxbye. Also, didn't you say >> > earlier that "lowest queue index (0) has highest priority"? In Linux, >> > skb->priority 0 has lower priority that skb->priority 1. How does that >> > work out? >> > >> >> Ah, sorry for the confusion, when I said "lowest queue index (0) has >> highest priority" it was more in the sense that queue 0 has higher >> precedence, packets from it are fetched first, that kind of thing. > > Fetched first by whom? > In ENETC, the transmission selection chooses between TX BD rings whose > configured priority maps to the same traffic class based on a weighted > round robin dequeuing policy. Groups of TX BD rings of different > priorities are in strict priority dequeuing relative to each other. > And the number of the TX BD ring has nothing to do with its priority. For the hardware I work with: - i210: two modes, (1) round robin, by default, or in (2) Qav-mode ("TSN), strict priority, fixed by the queue index. - i225/i226: two modes, (1) round robin, by default, or in (2) TSN mode, strict priority, configurable queue priority. (to give the full picture, we can also configure how the descriptors are going to be fetched from host memory, but as host memory is not usually where the congestion happens, it's less important for this) > >> > In fact, for both enetc and felix, I have to do this exact thing as you, >> > except that my hardware knobs are halfway in the path (per tc, not per >> > queue). The reason why I'm not doing it is because we only consider the >> > 1:1 prio_tc_map, so we use "prio" and "tc" interchangeably. >> > >> >> Hm, that's interesting. I think that's a difference on how we are >> modeling our traffic, here's one very common mqprio config for AVB use >> cases: >> >> $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \ >> num_tc 3 \ >> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ >> queues 1@0 1@1 2@2 \ >> hw 0 >> >> One priority for Class A traffic, one priority for Class B traffic and >> multiple priorities mapped to a single TC, the rest is Best Effort and >> it's mapped to multiple queues. The way we model traffic, TCs and >> priorities are different concepts. > > So here's what I don't understand. Why do you need 16 priority values, > when in the end you'll map them to only 3 egress traffic classes anyway? > Basically because of how prio_tc_map works and the fact that I want TC 0 to be mapped to queue 0. i.e. if an entry is not specified in the map, it's going ot be mapped to TC 0. I want to avoid that, historically, in our models, TC 0 is the "important" one. >> One important detail is that in TSN mode, we use the "priority" (meaning >> precedence in this context) fetch order for the queues: queue 0 first, >> then queue 1, and so on. > > So the number of the queue is what actually contributes to the egress > scheduling algorithm of your NIC? Strange. Tell me something so I'm sure > I'm understanding you properly. netdev_pick_tx() will pick a TX queue > using skb_tx_hash(), which hashes between the netdev queues mapped to > the same traffic class as the traffic class that skb->priority is mapped to. > In your case, you have 2@2 (2 TX queues for traffic class 2), call these > queues 2 and 3. You're saying that hardware prioritizes queue 2 over 3. > But the network stack doesn't know that. It hashes equally packets that > go to tc 2 between queue 2 and queue 3. In our case, this misconfiguration > would be immediately obvious, because we could have an offloaded tc-taprio > schedule in ENETC, and Linux could be enqueuing into a ring that ultimately > has a different schedule than software knows about. How does this work > out for you? > More or less, repeating a bit what I said before, on i210 it was fixed, on i225/i226 it can be changed, but we decided to keep the default, less surprises for our users. In (some of) our traffic models, the only kind of traffic class that we allow to go through multiple queues is Best Effort. And even that it works out because taprio "translates" from traffic classes to queues when it sends the offload information to the driver, i.e. the driver knows the schedule of queues, not traffic classes. >> > * The enetc verification state machine is off by default. The felix >> > verification state machine is on by default. I guess we should figure >> > out a reasonable default for all drivers out there? >> > >> >> From the language of IEEE 802.3 ("disableVerify" and friends), it seems >> that the expectation is that verification is enabled by default. *But* >> from a interoperability point of view, I would only enable verification >> by default after some testing in the wild. Or only send the verify >> packet if frame preemption is also enabled. > > I wonder whether the answer to this one is actually "Additional Ethernet > Capabilities TLV" (802.3 clause 79.3.7). It would be good if we could > all start off with FP/MM disabled, verification state machine included, > until the LLDP daemon receives a TLV that says PreemptSupported and > PreemptEnabled are true. If both local and remote preemption are > supported and enabled, we tell the kernel to enable the MAC merge layer > and kick off verification. In turn this will alter PreemptActive that > gets set in future replies. I don't know, the flow may be different. > That's a good point (your understanding of the flow is similar to mine). This seems a good idea. But we may have to wait for a bit until we have a LLDP implementation that supports this. >> > * For some reason, I notice that the verification state machine remains >> > FAILED on felix when it links to enetc and that has verification >> > enabled. I need to disable it on enetc, enable it again, and only then >> > will the MAC merge interrupt on felix say that verification completed >> > successfully. I think I'll need to put a quirk in enetc_pl_mac_link_up() >> > to toggle verification off and on, if it was requested, in order for >> > this to be seamless to the user. >> > >> > If it's only these, nothing seems exactly major. However, I assume this >> > is only the beginning. I fully expect there to be one more round of >> > "why not dcbnl? why not tc?" nitpicks, for which I'm not exactly >> > mentally ready (if credible arguments are to be put forward, I'd rather >> > have them now while I haven't yet put in the extra work in the direction >> > I'm about to). >> >> I really know the feeling "not mentally ready". >> >> I am still in the "tc" camp. But I cannot deny that this interface seems >> to work for the uses that we have in mind. That's (at least) good enough >> for me ;-) > > To me right now, FP adminStatus is not exactly something that the > scheduler is concerned with, and this is why tc is kind of my last > choice. But I agree that it makes certain things simpler. Plus, it seems > to be the only subsystem aware of 16 priorities and 16 traffic classes, > which you seem to care about. This is just food for thought, I am not expecting a reply, as it doesn't help the review of the RFC. Ideally, we wouldn't have that limit of 16 priorities or traffic classes. I don't want to limit it even more. This "16" limit is already going to require some creativity to handle some use cases in a medium future, I am afraid. I am already hearing coleagues talking about containers (think the net_prio namespace) together with VMs, NICs with dozens of queues. All of this with scheduled traffic. On the wire, the 8 PCP codepoints are what we have, but internally/"inside the system", we might need more granularity to organize all those applications. Cheers,
I won't reply to every point, as that would just diverge more and more from what I'm really trying to settle on. On Tue, Sep 13, 2022 at 07:59:29PM -0700, Vinicius Costa Gomes wrote: > For the long term, perhaps the solution is to have a single place that > does this mapping (from skb to traffic classes and then to queues) more > centralized. 'dcb' would populate that as would 'mqprio/taprio'. This > seems to imply that ndo_select_queue() is a escape hatch, that in the > ideal world it would not exist. Would this create worse problems? > (Reality is pointing that it doesn't seem that we can avoid making > drivers aware of this map, so this idea doesn't seem to help the > situation much) That central place is called netdev_set_prio_tc_map(). It is called from qdiscs and drivers alike. W.r.t. your "ndo_select_queue() escape hatch", the commit message of 4f57c087de9b ("net: implement mechanism for HW based QOS") provides some nice insight into networking history, 11 years later. > > ethtool: > > > > - pro: is in the same subsystem as MAC merge layer configuration (for > > which I have no doubts that ethtool is the right place) > > > > - pro: is not aligned to any specific vertical use case, like dcbnl is > > > > - con: knows nothing about priorities > > > > - con: for hardware where FP adminStatus is per traffic class rather > > than per priority, we'll have to change the adminStatus not only from > > changes made via ethtool to the FP adminStatus, but also from changes > > made to the prio_tc_map (tc I guess?). But this may also be a subtle > > pro for dcbnl, since that has its own prio_tc_map, so we'd remain > > within the same subsystem? > > If the driver also supports DCB, it will have the offload functions > implemented, so it be aware of changes in that map. Not so simple. DCB has many ops, you don't have to implement all of them. > The problem is now mqprio, that if it isn't offloaded, the driver is > not aware of any changes. But mqprio doesn't implement .change(), > which makes things a bit easier. > > 'ethtool' is gaining my vote. Well, ethtool is certainly not aware of changes to the prio:tc map either, very much less so than the driver :) > > tc (part of tc-mqprio and tc-taprio): > > > > - open question: does the FP adminStatus influence scheduling or not? > > Normally I'd say no, adminStatus only decides whether the eMAC or pMAC > > will be used to send a frame, and this is orthogonal to scheduling. > > But 802.1Q says that "It should be noted that, other things being > > equal, designating a priority as “express” effectively increases its > > priority above that of any priority designated as “preemptible”." > > Is this phrase enough to justify a modification of the qdisc dequeue > > method in, say, taprio, which would prioritize a lower priority qdisc > > queue A over a higher one B, if A is express but B is preemptable? > > I understood this as: if a preemptible packet (higher priority) and > expresss packet (lower priority) arrive at the MAC at the same time, the > express packet will finish transmitting before the preemptible, causing > the express packet to have an 'effective' higher priority. > > I don't think we need to change anything on the dequeueing in the qdisc > side (and trying to solve it is going to be awkward, having to check the > neighbor qdiscs for packets being enqueued/peeked, ugh!). Agree. So bottom line, express/preemptable does not affect the scheduling algorithm. This is one reservation I have against FP in tc. Maybe if we restricted this new option to only the hardware offload modes of the respective qdiscs. > (as a note: for i225, traffic in queues marked as express will only > preempt traffic from lower numbered queues, this behavior doesn't seem > to be forbidden by the standard) So I want the frame preemption adminStatus to work hand in hand with the prio:tc map of the netdev, no matter where/how the adminStatus lands. Whereas your hardware implementation does not work hand in hand with it. You'll have to add some restrictions and do some clarification work, sorry. > For the hardware I work with: > - i210: two modes, (1) round robin, by default, or in (2) Qav-mode > ("TSN), strict priority, fixed by the queue index. > - i225/i226: two modes, (1) round robin, by default, or in (2) TSN > mode, strict priority, configurable queue priority. > > (to give the full picture, we can also configure how the descriptors are > going to be fetched from host memory, but as host memory is not usually > where the congestion happens, it's less important for this) So this is to say that i210 has 1 traffic class in mode (1), and a number of traffic classes equal to the number of queues, in mode (2)? As for i225/i226, the number of traffic classes is given by the number of uniqueuly configured queue priorities? Or what would you call a traffic class, exactly? > > So here's what I don't understand. Why do you need 16 priority values, > > when in the end you'll map them to only 3 egress traffic classes anyway? > > > > Basically because of how prio_tc_map works and the fact that I want TC 0 > to be mapped to queue 0. i.e. if an entry is not specified in the map, > it's going ot be mapped to TC 0. I want to avoid that, historically, in > our models, TC 0 is the "important" one. Strange. I wonder who else expects traffic class 0 to be the important one? This is clearly a case of 2 vendors using the exact same tooling and commands, but a configuration script working in the exact opposite way. > More or less, repeating a bit what I said before, on i210 it was fixed, > on i225/i226 it can be changed, but we decided to keep the default, less > surprises for our users. For yours, sure, maybe. For others.. :) > In (some of) our traffic models, the only kind of traffic class that we > allow to go through multiple queues is Best Effort. With the mention that half (or more) of the best effort traffic will be Better Effort, due to your secret prioritization scheme within the same traffic class. And netdev_pick_tx() will essentially be a Russian roulette, because it hashes packets with the same skb->priority towards queues of different actual priorities. > And even that it works out because taprio "translates" from traffic > classes to queues when it sends the offload information to the driver, > i.e. the driver knows the schedule of queues, not traffic classes. Which is incredibly strange to me, since the standard clearly defines Qbv gates to be per traffic class, and in ENETC, even if we have 2 TX queues for the same traffic class (one per CPU), the hardware schedule is still per traffic class and not per independent TX queue (BD ring). How does this work for i225/i226, if 2 queues are configured for the same dequeue priority? Do the taprio gates still take effect per queue? > That's a good point (your understanding of the flow is similar to mine). > This seems a good idea. But we may have to wait for a bit until we have > a LLDP implementation that supports this. This is circular; we have a downstream patch to openlldp that adds support for this TLV, but it can't go anywhere until there is mainline kernel support. What about splitting out MAC merge support from FP support, and me concentrating on the MAC merge layer for now? They're independent up to a pretty high level. The MAC merge layer is supposed to be controlled by the LLDP daemon and be pretty much plug-and-play, while the FP adminStatus is supposed to be set by some high level administrator, like a NETCONF client. We can argue some more about how to best expose the FP adminStatus. Right now, I'm thinking that for all intents and purposes, the adminStatus really only makes practical sense when we have at least 2 traffic classes: one onto which we can map the preemptable priorities, and one onto which we can map the express ones. In turn, it means that the 'priorities collapsing into the same traffic class' problem that we have when we delete the taprio/mqprio qdisc can be solved if we put the FP adminStatus into the same netlink interface that also configures the prio:tc map and the number of traffic classes (i.e. yes, in tc). I'll let someone more knowledgeable of our sysrepo-tsn implementation of a NETCONF server (like Xiaoliang, maybe even Rui or Richie?) https://github.com/real-time-edge-sw/real-time-edge-sysrepo comment on whether this would create any undesirable side effect. The implication is that user space, when asked to change the adminStatus, will have to figure out whether we are using a taprio, or an mqprio qdisc, and create different netlink attributes to talk to the kernel to request that configuration. User space will also have to send an error towards the NETCONF client if no such qdisc exists (meaning that the user wants to combine priorities of different express/preemptable values in the same traffic class, TC0). The qdiscs will also centrally validate whether such invalid mixed configurations are requested. I'm mostly OK with this: an UAPI where FP adminStatus is per priority, but it gets translated into per-tc by the qdisc, and passed to the driver through ndo_setup_tc(). I am certainly still very much against the "preemptable queue mask" that you proposed, and IMO you'll have to do something and clarify what a traffic class even means for Intel hardware, especially in relationship to multiple queues per tc (mqprio queues 2@1). Anyone else who has an opinion of any sort, please feel free.
On Sep 15, 2022, Vladimir wrote: > > That's a good point (your understanding of the flow is similar to mine). > > This seems a good idea. But we may have to wait for a bit until we > > have a LLDP implementation that supports this. > > This is circular; we have a downstream patch to openlldp that adds support for > this TLV, but it can't go anywhere until there is mainline kernel support. > > What about splitting out MAC merge support from FP support, and me > concentrating on the MAC merge layer for now? They're independent up to a > pretty high level. The MAC merge layer is supposed to be controlled by the > LLDP daemon and be pretty much plug-and-play, while the FP adminStatus is > supposed to be set by some high level administrator, like a NETCONF client. I agree that splitting out MAC merge support from FP support, they are defined by different spec. I'm trying to add LLDP exchange verification support. The procedure is like following: 1. enable preemption on local port by using "ethtool", do not active preemption. 2. Decode the LLDP TLV of remote port and ensure the remote port supports and enables preemption. 3. Run SMD-v/r verify process on local port or active the preemption directly. 4. Send updated LLDP TLV to remote port. 5. Disable preemption on local port and repeat step 1-4 when link down/up. The struct "ethtool_mm_cfg" seems not fit this procedure. I update it: struct ethtool_mm_cfg { u32 verify_time; bool verify_disable; bool enabled; + bool active; // Used to active or start verify preemption by LLDP daemon. u8 add_frag_size; }; If we want to enable/disable the LLDP exchange verification, maybe we need to add more parameters like "bool lldp_verify_enable" > > We can argue some more about how to best expose the FP adminStatus. > Right now, I'm thinking that for all intents and purposes, the adminStatus > really only makes practical sense when we have at least 2 traffic classes: one > onto which we can map the preemptable priorities, and one onto which we > can map the express ones. In turn, it means that the 'priorities collapsing into > the same traffic class' problem that we have when we delete the > taprio/mqprio qdisc can be solved if we put the FP adminStatus into the same > netlink interface that also configures the prio:tc map and the number of traffic > classes (i.e. yes, in tc). > > I'll let someone more knowledgeable of our sysrepo-tsn implementation of a > NETCONF server (like Xiaoliang, maybe even Rui or Richie?) > https://github.com/real-time-edge-sw/real-time-edge-sysrepo > comment on whether this would create any undesirable side effect. > The implication is that user space, when asked to change the adminStatus, will > have to figure out whether we are using a taprio, or an mqprio qdisc, and > create different netlink attributes to talk to the kernel to request that > configuration. User space will also have to send an error towards the NETCONF > client if no such qdisc exists (meaning that the user wants to combine priorities > of different express/preemptable values in the same traffic class, TC0). The > qdiscs will also centrally validate whether such invalid mixed configurations > are requested. The Netconf yang model haven't defined the mac preemption feature of 802.3br, but it defines the 802.1Qbu preemption. It uses priority to configure the adminStatus of frame-preemption-status-table. The priority map traffic class is independent in Netconf, so there is no effect whether we combine FP status configuration with tc-mqprio or not, just translated it in Kernel is OK. > > I'm mostly OK with this: an UAPI where FP adminStatus is per priority, but it > gets translated into per-tc by the qdisc, and passed to the driver through > ndo_setup_tc(). I am certainly still very much against the "preemptable queue > mask" that you proposed, and IMO you'll have to do something and clarify > what a traffic class even means for Intel hardware, especially in relationship to > multiple queues per tc (mqprio queues 2@1). > > Anyone else who has an opinion of any sort, please feel free. I didn't think about it too much, but I think combined with tc-mqprio command is more easy to implement, this eliminates the priority of translating to TC. Example: tc qdisc add dev eth0 root handle 1: mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \ fp-prios 0xff
Hi Xiaoliang, On Thu, Nov 10, 2022 at 10:33:50AM +0000, Xiaoliang Yang wrote: > On Sep 15, 2022, Vladimir wrote: > > > That's a good point (your understanding of the flow is similar to mine). > > > This seems a good idea. But we may have to wait for a bit until we > > > have a LLDP implementation that supports this. > > > > This is circular; we have a downstream patch to openlldp that adds support for > > this TLV, but it can't go anywhere until there is mainline kernel support. > > > > What about splitting out MAC merge support from FP support, and me > > concentrating on the MAC merge layer for now? They're independent up to a > > pretty high level. The MAC merge layer is supposed to be controlled by the > > LLDP daemon and be pretty much plug-and-play, while the FP adminStatus is > > supposed to be set by some high level administrator, like a NETCONF client. > > I agree that splitting out MAC merge support from FP support, they are defined > by different spec. I'm trying to add LLDP exchange verification support. The > procedure is like following: > 1. enable preemption on local port by using "ethtool", do not active preemption. "Enable" and "activate" are different things only if verification is used. Otherwise, they mean the same thing. In turn, LLDP does not control whether verification is used or not. So, from the perspective of the LLDP daemon, you can't "enable" but "not activate" preemption. And I wouldn't expect the LLDP flow to require use of ethtool. And I think that enabling preemption at step 1 is too early. Why enable it, if we don't know if the link partner supports it? > 2. Decode the LLDP TLV of remote port and ensure the remote port supports > and enables preemption. > 3. Run SMD-v/r verify process on local port or active the preemption directly. Verification as defined by 802.3 is, as mentioned, independent of the Additional Ethernet Capabilities TLV. Which seems to be a good thing, considering that 2 link partners A and B don't need to have coordinated settings for verification (verify_disable doesn't need to be in sync). Any device with the MAC Merge layer turned on is required to send SMD-R (response) frames in response to SMD-V (verification), regardless of whether it sends SMD-V frames of its own. So those who enable verification should get a response from those who don't enable it, and those who don't enable verification don't cause the link partners who do enable it to fail their own verification. The only thing to agree on is this: if the link partner supports preemption, and you support preemption, then how does the LLDP daemon decide whether to actually *enable* preemption? My assumption is that it's fine to unconditionally enable it as long as it's supported. No configuration or user override is needed. FWIW, I have my own patch on the openlldp project (yes, not lldpd) which adds support for the MAC Merge layer. I hope I can wrap everything up some time next week and resend everything in a state that reflects what I've been cooking in the meantime. I don't have a solution for the adminStatus yet, but I guess I'll try to add it via the netlink interface of tc-taprio and tc-mqprio. > 4. Send updated LLDP TLV to remote port. > 5. Disable preemption on local port and repeat step 1-4 when link down/up. > > The struct "ethtool_mm_cfg" seems not fit this procedure. I update it: > struct ethtool_mm_cfg { > u32 verify_time; > bool verify_disable; > bool enabled; > + bool active; // Used to active or start verify preemption by LLDP daemon. What sense does it make to "force" preemption as active? What would this do in terms of driving the internal state machines of the hardware, i.e. force FP active in which way? It's active when enabled and verification is either disabled, or enabled and complete. > u8 add_frag_size; > }; > If we want to enable/disable the LLDP exchange verification, maybe we need > to add more parameters like "bool lldp_verify_enable" Control the LLDP exchange via kernel UAPI? But why? What difference does it make to the kernel whether there is an LLDP daemon or not? LLDP frames are normal packets sent via PF_PACKET sockets, no special driver involvement needed. Not the same thing as SMD-V/SMD-R hardware verification.
Hi Vladimir, Thanks for your reply, I comment below and make a summarize at the end. On Mon, Nov 14, 2022 Vladimir wrote: > On Thu, Nov 10, 2022 at 10:33:50AM +0000, Xiaoliang Yang wrote: > > On Sep 15, 2022, Vladimir wrote: > > > > That's a good point (your understanding of the flow is similar to mine). > > > > This seems a good idea. But we may have to wait for a bit until we > > > > have a LLDP implementation that supports this. > > > > > > This is circular; we have a downstream patch to openlldp that adds > > > support for this TLV, but it can't go anywhere until there is mainline kernel > support. > > > > > > What about splitting out MAC merge support from FP support, and me > > > concentrating on the MAC merge layer for now? They're independent up > > > to a pretty high level. The MAC merge layer is supposed to be > > > controlled by the LLDP daemon and be pretty much plug-and-play, > > > while the FP adminStatus is supposed to be set by some high level > administrator, like a NETCONF client. > > > > I agree that splitting out MAC merge support from FP support, they are > > defined by different spec. I'm trying to add LLDP exchange > > verification support. The procedure is like following: > > 1. enable preemption on local port by using "ethtool", do not active > preemption. > > "Enable" and "activate" are different things only if verification is used. > Otherwise, they mean the same thing. Yes, I know that "Enable" and "activate" are different things. I mean that if we want to enable the next lldp exchange verification process, we need to enable it but not let the MAC merge layer working until lldp verify successfully. There are two verification procedure, one is LLDP verification, based on chapter "99.4.2 Determining that the link partner supports preemption" of 802.3 spec. This need lldp exchange and ensure preemption is enabled on remote port, and this verification doesn’t have parameter to enable/disable it. I just want to add this feature. Another is SMD-r/v verification, this is hardware verification and already has "verify_disable" to enable/disable it. Or we can enable by default through LLDP. That don’t need a new "active" parameter. Once the LLDP application has verified that remote port supports preemption, it will enable preemption on local port, and tell remote port to enable it. If we use ethtool to enable preemption, the preemption will be force enabled without LLDP verification. > > In turn, LLDP does not control whether verification is used or not. So, from the > perspective of the LLDP daemon, you can't "enable" but "not activate" > preemption. Yes, LLDP does not control whether hardware verification is used or not. Maybe I shouldn't use the word "active", I mean preemption can't be working before LLDP verification, even though it's enabled by ethtool. > > And I wouldn't expect the LLDP flow to require use of ethtool. If LLDP application can’t require to use ethtool driver, how to enable the preemption After LLDP verification? > > And I think that enabling preemption at step 1 is too early. Why enable it, if we > don't know if the link partner supports it? Same as mentioned above. Enabling at step 1 does not actually enable preemption, it means enable the LLDP verification. > > > 2. Decode the LLDP TLV of remote port and ensure the remote port > > supports and enables preemption. > > 3. Run SMD-v/r verify process on local port or active the preemption > directly. > > Verification as defined by 802.3 is, as mentioned, independent of the > Additional Ethernet Capabilities TLV. > Which seems to be a good thing, considering that 2 link partners A and B don't > need to have coordinated settings for verification (verify_disable doesn't need > to be in sync). Any device with the MAC Merge layer turned on is required to > send SMD-R (response) frames in response to SMD-V (verification), regardless > of whether it sends SMD-V frames of its own. > > So those who enable verification should get a response from those who don't > enable it, and those who don't enable verification don't cause the link partners > who do enable it to fail their own verification. The hardware may not obey this rule. I have tested on LS1028a and i.mx8mp boards, The verification needs to be enabled at the same time for link partners. So it needs LLDP verification, and let LLDP application to enable both partner ports at the same time(The hardware verify three times, and each verify time is 10ms in default). > > The only thing to agree on is this: if the link partner supports preemption, and > you support preemption, then how does the LLDP daemon decide whether to > actually *enable* preemption? My assumption is that it's fine to > unconditionally enable it as long as it's supported. > No configuration or user override is needed. Yes, I also consider this way. Enable the preemption once LLDP exchange ensure that both partner support it. But how to enable preemption via LLDP daemon? I tried to use the ioctl API of ethtool to enable preemption. Do you have any suggestion? > > FWIW, I have my own patch on the openlldp project (yes, not lldpd) which adds > support for the MAC Merge layer. I hope I can wrap everything up some time > next week and resend everything in a state that reflects what I've been cooking > in the meantime. I don't have a solution for the adminStatus yet, but I guess I'll > try to add it via the netlink interface of tc-taprio and tc-mqprio. It's a good news, I can keep up with upstream patches after you upstream the patches. > > > 4. Send updated LLDP TLV to remote port. > > 5. Disable preemption on local port and repeat step 1-4 when link > down/up. > > > > The struct "ethtool_mm_cfg" seems not fit this procedure. I update it: > > struct ethtool_mm_cfg { > > u32 verify_time; > > bool verify_disable; > > bool enabled; > > + bool active; // Used to active or start verify preemption by LLDP daemon. > > What sense does it make to "force" preemption as active? What would this do > in terms of driving the internal state machines of the hardware, i.e. > force FP active in which way? It's active when enabled and verification is either > disabled, or enabled and complete. This is not mean active status of SMD-r/v verification. It means let preemption Working. > > > u8 add_frag_size; > > }; > > If we want to enable/disable the LLDP exchange verification, maybe we > > need to add more parameters like "bool lldp_verify_enable" > > Control the LLDP exchange via kernel UAPI? But why? What difference does it > make to the kernel whether there is an LLDP daemon or not? LLDP frames are > normal packets sent via PF_PACKET sockets, no special driver involvement > needed. Not the same thing as SMD-V/SMD-R hardware verification. I mean that if user want to enable the preemption without LLDP. It's not needed If we use the second way I mentioned above. I will summarize my opinion. Enable the preemption by default through LLDP. Once the LLDP application has verified that remote port supports preemption, it will enable preemption on local port, and tell remote port to enable it. If user want to force enable preemption, use ethtool to enable it directly. The lldp daemon use the ioctl API of ethtool to enable preemption. Thanks, Xiaoliang
Hi Xiaoliang, On Tue, Nov 15, 2022 at 05:01:24AM +0200, Xiaoliang Yang wrote: > Yes, I know that "Enable" and "activate" are different things. I mean that if we > want to enable the next lldp exchange verification process, we need to enable > it but not let the MAC merge layer working until lldp verify successfully. > There are two verification procedure, one is LLDP verification, based on chapter > "99.4.2 Determining that the link partner supports preemption" of 802.3 spec. > This need lldp exchange and ensure preemption is enabled on remote port, and > this verification doesn’t have parameter to enable/disable it. I just want to add > this feature. Why? The mechanism to enable/disable the MAC Merge layer is the ETHTOOL_A_MM_ENABLED netlink attribute which is settable. To the kernel, it doesn't matter if ethtool (the program) or openlldp sets ETHTOOL_A_MM_ENABLED. > Another is SMD-r/v verification, this is hardware verification and > already has "verify_disable" to enable/disable it. > > Or we can enable by default through LLDP. That don’t need a new "active" > parameter. Once the LLDP application has verified that remote port supports > preemption, it will enable preemption on local port, and tell remote port to > enable it. IIUC, it cannot "tell" the link partner to enable preemption. It can just advertise the following in a subsequent TLV: Table 79–8—Additional Ethernet capabilities Bit 0 (preemption capability support) - supported/not supported Bit 1 (preemption capability status) - enabled/not enabled Bit 2 (preemption capability active) - active/not active Bits 4:3 (additional fragment size) - A 2-bit integer value indicating, in units of 64 octets, the minimum number of octets over 64 octets required in non-final fragments by the receiver The only thing we can do, is if the link partner sets Bit 0 to true, to generate a ETHTOOL_MSG_MM_SET netlink message with ETHTOOL_A_MM_ENABLED set to true. We _cannot_ wait for the link partner to set Bit 1 to true, because, if it's the same implementation as us, nothing will work, we will both wait for each other... So either the link partner enables preemption too, as a result of us advertising a TLV with Bit 0 set, and things eventually work, or the link partner doesn't enable preemption even though both ends support it. In that case, preemption doesn't work (and this is also visible in the TLVs sent by the link partner - Bit 1 and Bit 2 remain zero). If the SMD-V/SMD-R handshake is enabled locally (ETHTOOL_A_MM_VERIFY_DISABLE is false), we can also detect that by the fact that the link partner will not respond with a SMD-R to our SMD-V. So, even if we will set Bit 1 (corresponds to ETHTOOL_A_MM_ENABLED in the ETHTOOL_MSG_MM_GET message) to true, Bit 2 (corresponding to ETHTOOL_A_MM_ACTIVE in the GET message) will still remain unset. Otherwise, the condition will remain undetected, and we will report Bit 2 (active) as true (which it is - verification is disabled after all). > If we use ethtool to enable preemption, the preemption will be force > enabled without LLDP verification. Yup. I consider that within the expected parameters of the game, no? > > In turn, LLDP does not control whether verification is used or not. So, from the > > perspective of the LLDP daemon, you can't "enable" but "not activate" > > preemption. > Yes, LLDP does not control whether hardware verification is used or not. Maybe I > shouldn't use the word "active", I mean preemption can't be working before LLDP > verification, even though it's enabled by ethtool. I don't understand this. Why can't preemption work without LLDP verification? My understanding is that LLDP is just an automated way of maximizing the enabled feature set depending on what the link partner advertises as supported. But the same feature set can be force-enabled manually, if there is prior knowledge that the link partner can do it. I think of it as a software autoneg, I might be wrong, I don't have too much XP with LLDP. That's exactly what ethtool (the program, not the kernel interface) is for - to be able to force settings. > > And I wouldn't expect the LLDP flow to require use of ethtool. > If LLDP application can’t require to use ethtool driver, how to enable the preemption > after LLDP verification? No, I mean the LLDP flow doesn't require use of ethtool the program, not ethtool the kernel netlink interface. Of course it requires the latter. > > And I think that enabling preemption at step 1 is too early. Why enable it, if we > > don't know if the link partner supports it? > Same as mentioned above. Enabling at step 1 does not actually enable preemption, > it means enable the LLDP verification. Nope. LLDP verification is no more than a user space program that automates something that can also be done manually, using ethtool the program. No special netlink messages for LLDP. > > So those who enable verification should get a response from those who don't > > enable it, and those who don't enable verification don't cause the link partners > > who do enable it to fail their own verification. > The hardware may not obey this rule. I have tested on LS1028a and i.mx8mp boards, > The verification needs to be enabled at the same time for link partners. So it needs > LLDP verification, and let LLDP application to enable both partner ports at the same > time(The hardware verify three times, and each verify time is 10ms in default). I think the imx8mp has the same EQOS block as the S32G GMAC, right? If so, I might try experimenting a bit on the S32G linked to LS1028A, to see exactly what you're talking about. Keep in mind that hardware quirks don't mean much to what the user space flow should be. The kernel should deal with quirks. For example, on enetc, the MAC merge layer must be manually disabled and re-enabled on each link up/down event by the kernel driver. Otherwise the verification state machine remains FAILED, as you seem to say. But that changes exactly nothing in terms of the LLDP flow - because as you say, the hardware SMD-R/SMD-V verification state machine which you talk about is independent of LLDP. > > The only thing to agree on is this: if the link partner supports preemption, and > > you support preemption, then how does the LLDP daemon decide whether to > > actually *enable* preemption? My assumption is that it's fine to > > unconditionally enable it as long as it's supported. > > No configuration or user override is needed. > Yes, I also consider this way. Enable the preemption once LLDP exchange ensure > that both partner support it. But how to enable preemption via LLDP daemon? > I tried to use the ioctl API of ethtool to enable preemption. Do you have any > suggestion? Yeah, see above. The ethtool ioctl is dead and it's not coming back, see ETHTOOL_A_MM_ENABLED. I wish I could come back to this patch set sooner, but I'm really deadlocked with some other stuff I need to do this week. Sorry.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h index 99dc7bfbcd3c..fa504dd22bf6 100644 --- a/include/linux/ethtool.h +++ b/include/linux/ethtool.h @@ -453,6 +453,49 @@ struct ethtool_module_power_mode_params { enum ethtool_module_power_mode mode; }; +/** + * struct ethtool_fp_param - 802.1Q Frame Preemption parameters + */ +struct ethtool_fp_param { + u8 preemptable_prios; + u32 hold_advance; + u32 release_advance; +}; + +/** + * struct ethtool_mm_state - 802.3 MAC merge layer state + */ +struct ethtool_mm_state { + u32 verify_time; + enum ethtool_mm_verify_status verify_status; + bool supported; + bool enabled; + bool active; + u8 add_frag_size; +}; + +/** + * struct ethtool_mm_cfg - 802.3 MAC merge layer configuration + */ +struct ethtool_mm_cfg { + u32 verify_time; + bool verify_disable; + bool enabled; + u8 add_frag_size; +}; + +/* struct ethtool_mm_stats - 802.3 MAC merge layer statistics, as defined in + * IEEE 802.3-2018 subclause 30.14.1 oMACMergeEntity managed object class + */ +struct ethtool_mm_stats { + u64 MACMergeFrameAssErrorCount; + u64 MACMergeFrameSmdErrorCount; + u64 MACMergeFrameAssOkCount; + u64 MACMergeFragCountRx; + u64 MACMergeFragCountTx; + u64 MACMergeHoldCount; +}; + /** * struct ethtool_ops - optional netdev operations * @cap_link_lanes_supported: indicates if the driver supports lanes @@ -624,6 +667,11 @@ struct ethtool_module_power_mode_params { * plugged-in. * @set_module_power_mode: Set the power mode policy for the plug-in module * used by the network device. + * @get_fp_param: Query the 802.1Q Frame Preemption parameters. + * @set_fp_param: Set the 802.1Q Frame Preemption parameters. + * @get_mm_state: Query the 802.3 MAC Merge layer state. + * @set_mm_cfg: Set the 802.3 MAC Merge layer parameters. + * @get_mm_stats: Query the 802.3 MAC Merge layer statistics. * * All operations are optional (i.e. the function pointer may be set * to %NULL) and callers must take this into account. Callers must @@ -760,6 +808,17 @@ struct ethtool_ops { int (*set_module_power_mode)(struct net_device *dev, const struct ethtool_module_power_mode_params *params, struct netlink_ext_ack *extack); + void (*get_fp_param)(struct net_device *dev, + struct ethtool_fp_param *params); + int (*set_fp_param)(struct net_device *dev, + const struct ethtool_fp_param *params, + struct netlink_ext_ack *extack); + void (*get_mm_state)(struct net_device *dev, + struct ethtool_mm_state *state); + int (*set_mm_cfg)(struct net_device *dev, struct ethtool_mm_cfg *cfg, + struct netlink_ext_ack *extack); + void (*get_mm_stats)(struct net_device *dev, + struct ethtool_mm_stats *stats); }; int ethtool_check_ops(const struct ethtool_ops *ops); diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h index 2d5741fd44bb..7a21fcb26a43 100644 --- a/include/uapi/linux/ethtool.h +++ b/include/uapi/linux/ethtool.h @@ -736,6 +736,21 @@ enum ethtool_module_power_mode { ETHTOOL_MODULE_POWER_MODE_HIGH, }; +/* Values from ieee8021FramePreemptionAdminStatus */ +enum ethtool_fp_admin_status { + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS = 1, + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE = 2, +}; + +enum ethtool_mm_verify_status { + ETHTOOL_MM_VERIFY_STATUS_UNKNOWN, + ETHTOOL_MM_VERIFY_STATUS_INITIAL, + ETHTOOL_MM_VERIFY_STATUS_VERIFYING, + ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED, + ETHTOOL_MM_VERIFY_STATUS_FAILED, + ETHTOOL_MM_VERIFY_STATUS_DISABLED, +}; + /** * struct ethtool_gstrings - string set for data tagging * @cmd: Command number = %ETHTOOL_GSTRINGS diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h index d2fb4f7be61b..658810274c49 100644 --- a/include/uapi/linux/ethtool_netlink.h +++ b/include/uapi/linux/ethtool_netlink.h @@ -49,6 +49,10 @@ enum { ETHTOOL_MSG_PHC_VCLOCKS_GET, ETHTOOL_MSG_MODULE_GET, ETHTOOL_MSG_MODULE_SET, + ETHTOOL_MSG_FP_GET, + ETHTOOL_MSG_FP_SET, + ETHTOOL_MSG_MM_GET, + ETHTOOL_MSG_MM_SET, /* add new constants above here */ __ETHTOOL_MSG_USER_CNT, @@ -94,6 +98,10 @@ enum { ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY, ETHTOOL_MSG_MODULE_GET_REPLY, ETHTOOL_MSG_MODULE_NTF, + ETHTOOL_MSG_FP_GET_REPLY, + ETHTOOL_MSG_FP_NTF, + ETHTOOL_MSG_MM_GET_REPLY, + ETHTOOL_MSG_MM_NTF, /* add new constants above here */ __ETHTOOL_MSG_KERNEL_CNT, @@ -862,6 +870,80 @@ enum { ETHTOOL_A_MODULE_MAX = (__ETHTOOL_A_MODULE_CNT - 1) }; +/* FRAME PREEMPTION (802.1Q) */ + +enum { + ETHTOOL_A_FP_PARAM_ENTRY_UNSPEC, + ETHTOOL_A_FP_PARAM_ENTRY_PRIO, /* u8 */ + ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS, /* u8 */ + + /* add new constants above here */ + __ETHTOOL_A_FP_PARAM_ENTRY_CNT, + ETHTOOL_A_FP_PARAM_ENTRY_MAX = (__ETHTOOL_A_FP_PARAM_ENTRY_CNT - 1) +}; + +enum { + ETHTOOL_A_FP_PARAM_UNSPEC, + ETHTOOL_A_FP_PARAM_ENTRY, /* nest */ + + /* add new constants above here */ + __ETHTOOL_A_FP_PARAM_CNT, + ETHTOOL_A_FP_PARAM_MAX = (__ETHTOOL_A_FP_PARAM_CNT - 1) +}; + +enum { + ETHTOOL_A_FP_UNSPEC, + ETHTOOL_A_FP_HEADER, /* nest - _A_HEADER_* */ + ETHTOOL_A_FP_PARAM_TABLE, /* nest */ + ETHTOOL_A_FP_HOLD_ADVANCE, /* u32 */ + ETHTOOL_A_FP_RELEASE_ADVANCE, /* u32 */ + + /* add new constants above here */ + __ETHTOOL_A_FP_CNT, + ETHTOOL_A_FP_MAX = (__ETHTOOL_A_FP_CNT - 1) +}; + +/* MAC MERGE (802.3) */ + +enum { + ETHTOOL_A_MM_STAT_UNSPEC, + ETHTOOL_A_MM_STAT_PAD, + + /* aMACMergeFrameAssErrorCount */ + ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS, /* u64 */ + /* aMACMergeFrameSmdErrorCount */ + ETHTOOL_A_MM_STAT_SMD_ERRORS, /* u64 */ + /* aMACMergeFrameAssOkCount */ + ETHTOOL_A_MM_STAT_REASSEMBLY_OK, /* u64 */ + /* aMACMergeFragCountRx */ + ETHTOOL_A_MM_STAT_RX_FRAG_COUNT, /* u64 */ + /* aMACMergeFragCountTx */ + ETHTOOL_A_MM_STAT_TX_FRAG_COUNT, /* u64 */ + /* aMACMergeHoldCount */ + ETHTOOL_A_MM_STAT_HOLD_COUNT, /* u64 */ + + /* add new constants above here */ + __ETHTOOL_A_MM_STAT_CNT, + ETHTOOL_A_MM_STAT_MAX = (__ETHTOOL_A_MM_STAT_CNT - 1) +}; + +enum { + ETHTOOL_A_MM_UNSPEC, + ETHTOOL_A_MM_HEADER, /* nest - _A_HEADER_* */ + ETHTOOL_A_MM_VERIFY_STATUS, /* u8 */ + ETHTOOL_A_MM_VERIFY_DISABLE, /* u8 */ + ETHTOOL_A_MM_VERIFY_TIME, /* u32 */ + ETHTOOL_A_MM_SUPPORTED, /* u8 */ + ETHTOOL_A_MM_ENABLED, /* u8 */ + ETHTOOL_A_MM_ACTIVE, /* u8 */ + ETHTOOL_A_MM_ADD_FRAG_SIZE, /* u8 */ + ETHTOOL_A_MM_STATS, /* nest - _A_MM_STAT_* */ + + /* add new constants above here */ + __ETHTOOL_A_MM_CNT, + ETHTOOL_A_MM_MAX = (__ETHTOOL_A_MM_CNT - 1) +}; + /* generic netlink info */ #define ETHTOOL_GENL_NAME "ethtool" #define ETHTOOL_GENL_VERSION 1 diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile index b76432e70e6b..31e056f17856 100644 --- a/net/ethtool/Makefile +++ b/net/ethtool/Makefile @@ -7,4 +7,5 @@ obj-$(CONFIG_ETHTOOL_NETLINK) += ethtool_nl.o ethtool_nl-y := netlink.o bitset.o strset.o linkinfo.o linkmodes.o \ linkstate.o debug.o wol.o features.o privflags.o rings.o \ channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \ - tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o + tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o \ + fp.o mm.o diff --git a/net/ethtool/fp.c b/net/ethtool/fp.c new file mode 100644 index 000000000000..20f19d8c1461 --- /dev/null +++ b/net/ethtool/fp.c @@ -0,0 +1,295 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2022 NXP + */ +#include "common.h" +#include "netlink.h" + +struct fp_req_info { + struct ethnl_req_info base; +}; + +struct fp_reply_data { + struct ethnl_reply_data base; + struct ethtool_fp_param params; +}; + +#define FP_REPDATA(__reply_base) \ + container_of(__reply_base, struct fp_reply_data, base) + +const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1] = { + [ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), +}; + +static int fp_prepare_data(const struct ethnl_req_info *req_base, + struct ethnl_reply_data *reply_base, + struct genl_info *info) +{ + struct fp_reply_data *data = FP_REPDATA(reply_base); + struct net_device *dev = reply_base->dev; + int ret; + + if (!dev->ethtool_ops->get_fp_param) + return -EOPNOTSUPP; + + ret = ethnl_ops_begin(dev); + if (ret < 0) + return ret; + + dev->ethtool_ops->get_fp_param(dev, &data->params); + ethnl_ops_complete(dev); + + return 0; +} + +static int fp_reply_size(const struct ethnl_req_info *req_base, + const struct ethnl_reply_data *reply_base) +{ + int len = 0; + + len += nla_total_size(0); /* _FP_PARAM_ENTRY */ + len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_PRIO */ + len += nla_total_size(sizeof(u8)); /* _FP_PARAM_ENTRY_ADMIN_STATUS */ + len *= 8; /* 8 prios */ + len += nla_total_size(0); /* _FP_PARAM_TABLE */ + len += nla_total_size(sizeof(u32)); /* _FP_HOLD_ADVANCE */ + len += nla_total_size(sizeof(u32)); /* _FP_RELEASE_ADVANCE */ + + return len; +} + +static int fp_fill_reply(struct sk_buff *skb, + const struct ethnl_req_info *req_base, + const struct ethnl_reply_data *reply_base) +{ + const struct fp_reply_data *data = FP_REPDATA(reply_base); + const struct ethtool_fp_param *params = &data->params; + enum ethtool_fp_admin_status admin_status; + struct nlattr *nest_table, *nest_entry; + int prio; + int ret; + + if (nla_put_u32(skb, ETHTOOL_A_FP_HOLD_ADVANCE, params->hold_advance) || + nla_put_u32(skb, ETHTOOL_A_FP_RELEASE_ADVANCE, params->release_advance)) + return -EMSGSIZE; + + nest_table = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_TABLE); + if (!nest_table) + return -EMSGSIZE; + + for (prio = 0; prio < 8; prio++) { + admin_status = (params->preemptable_prios & BIT(prio)) ? + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE : + ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS; + + nest_entry = nla_nest_start(skb, ETHTOOL_A_FP_PARAM_ENTRY); + if (!nest_entry) + goto nla_nest_cancel_table; + + if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_PRIO, prio)) + goto nla_nest_cancel_entry; + + if (nla_put_u8(skb, ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS, + admin_status)) + goto nla_nest_cancel_entry; + + nla_nest_end(skb, nest_entry); + } + + nla_nest_end(skb, nest_table); + + return 0; + +nla_nest_cancel_entry: + nla_nest_cancel(skb, nest_entry); +nla_nest_cancel_table: + nla_nest_cancel(skb, nest_table); + return ret; +} + +const struct ethnl_request_ops ethnl_fp_request_ops = { + .request_cmd = ETHTOOL_MSG_FP_GET, + .reply_cmd = ETHTOOL_MSG_FP_GET_REPLY, + .hdr_attr = ETHTOOL_A_FP_HEADER, + .req_info_size = sizeof(struct fp_req_info), + .reply_data_size = sizeof(struct fp_reply_data), + + .prepare_data = fp_prepare_data, + .reply_size = fp_reply_size, + .fill_reply = fp_fill_reply, +}; + +static const struct nla_policy +ethnl_fp_set_param_entry_policy[ETHTOOL_A_FP_PARAM_ENTRY_MAX + 1] = { + [ETHTOOL_A_FP_PARAM_ENTRY_PRIO] = { .type = NLA_U8 }, + [ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS] = { .type = NLA_U8 }, +}; + +static const struct nla_policy +ethnl_fp_set_param_table_policy[ETHTOOL_A_FP_PARAM_MAX + 1] = { + [ETHTOOL_A_FP_PARAM_ENTRY] = NLA_POLICY_NESTED(ethnl_fp_set_param_entry_policy), +}; + +const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1] = { + [ETHTOOL_A_FP_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), + [ETHTOOL_A_FP_PARAM_TABLE] = NLA_POLICY_NESTED(ethnl_fp_set_param_table_policy), +}; + +static int fp_parse_param_entry(const struct nlattr *nest, + struct ethtool_fp_param *params, + u8 *seen_prios, bool *mod, + struct netlink_ext_ack *extack) +{ + struct nlattr *tb[ARRAY_SIZE(ethnl_fp_set_param_entry_policy)]; + u8 preemptable_prios = params->preemptable_prios; + u8 prio, admin_status; + int ret; + + if (!nest) + return 0; + + ret = nla_parse_nested(tb, + ARRAY_SIZE(ethnl_fp_set_param_entry_policy) - 1, + nest, ethnl_fp_set_param_entry_policy, + extack); + if (ret) + return ret; + + if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]) { + NL_SET_ERR_MSG_ATTR(extack, nest, + "Missing 'prio' attribute"); + return -EINVAL; + } + + if (!tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]) { + NL_SET_ERR_MSG_ATTR(extack, nest, + "Missing 'admin_status' attribute"); + return -EINVAL; + } + + prio = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_PRIO]); + if (prio >= 8) { + NL_SET_ERR_MSG_ATTR(extack, nest, + "Range exceeded for 'prio' attribute"); + return -ERANGE; + } + + if (*seen_prios & BIT(prio)) { + NL_SET_ERR_MSG_ATTR(extack, nest, + "Duplicate 'prio' attribute"); + return -EINVAL; + } + + *seen_prios |= BIT(prio); + + admin_status = nla_get_u8(tb[ETHTOOL_A_FP_PARAM_ENTRY_ADMIN_STATUS]); + switch (admin_status) { + case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_PREEMPTABLE: + preemptable_prios |= BIT(prio); + break; + case ETHTOOL_FP_PARAM_ENTRY_ADMIN_STATUS_EXPRESS: + preemptable_prios &= ~BIT(prio); + break; + default: + NL_SET_ERR_MSG_ATTR(extack, nest, + "Unexpected value for 'admin_status' attribute"); + return -EINVAL; + } + + if (preemptable_prios != params->preemptable_prios) { + params->preemptable_prios = preemptable_prios; + *mod = true; + } + + return 0; +} + +static int fp_parse_param_table(const struct nlattr *nest, + struct ethtool_fp_param *params, bool *mod, + struct netlink_ext_ack *extack) +{ + u8 seen_prios = 0; + struct nlattr *n; + int rem, ret; + + if (!nest) + return 0; + + nla_for_each_nested(n, nest, rem) { + struct sched_entry *entry; + + if (nla_type(n) != ETHTOOL_A_FP_PARAM_ENTRY) { + NL_SET_ERR_MSG_ATTR(extack, n, + "Attribute is not of type 'entry'"); + continue; + } + + ret = fp_parse_param_entry(n, params, &seen_prios, mod, extack); + if (ret < 0) { + kfree(entry); + return ret; + } + } + + return 0; +} + +int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info) +{ + struct netlink_ext_ack *extack = info->extack; + struct ethnl_req_info req_info = {}; + struct ethtool_fp_param params = {}; + struct nlattr **tb = info->attrs; + const struct ethtool_ops *ops; + struct net_device *dev; + bool mod = false; + int ret; + + ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_FP_HEADER], + genl_info_net(info), extack, true); + if (ret) + return ret; + + dev = req_info.dev; + ops = dev->ethtool_ops; + + if (!ops->get_fp_param || !ops->set_fp_param) { + ret = -EOPNOTSUPP; + goto out_dev; + } + + rtnl_lock(); + ret = ethnl_ops_begin(dev); + if (ret) + goto out_rtnl; + + dev->ethtool_ops->get_fp_param(dev, ¶ms); + + ethnl_update_u32(¶ms.hold_advance, tb[ETHTOOL_A_FP_HOLD_ADVANCE], + &mod); + ethnl_update_u32(¶ms.release_advance, + tb[ETHTOOL_A_FP_RELEASE_ADVANCE], &mod); + + ret = fp_parse_param_table(tb[ETHTOOL_A_FP_PARAM_TABLE], ¶ms, &mod, + extack); + if (ret || !mod) + goto out_ops; + + ret = dev->ethtool_ops->set_fp_param(dev, ¶ms, extack); + if (ret) { + if (!extack->_msg) + NL_SET_ERR_MSG(extack, + "Failed to update frame preemption parameters"); + goto out_ops; + } + + ethtool_notify(dev, ETHTOOL_MSG_FP_NTF, NULL); + +out_ops: + ethnl_ops_complete(dev); +out_rtnl: + rtnl_unlock(); +out_dev: + dev_put(dev); + return ret; +} diff --git a/net/ethtool/mm.c b/net/ethtool/mm.c new file mode 100644 index 000000000000..d4731fb7aee4 --- /dev/null +++ b/net/ethtool/mm.c @@ -0,0 +1,228 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2022 NXP + */ +#include "common.h" +#include "netlink.h" + +struct mm_req_info { + struct ethnl_req_info base; +}; + +struct mm_reply_data { + struct ethnl_reply_data base; + struct ethtool_mm_state state; + struct ethtool_mm_stats stats; +}; + +#define MM_REPDATA(__reply_base) \ + container_of(__reply_base, struct mm_reply_data, base) + +#define ETHTOOL_MM_STAT_CNT \ + (__ETHTOOL_A_MM_STAT_CNT - (ETHTOOL_A_MM_STAT_PAD + 1)) + +const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1] = { + [ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy_stats), +}; + +static int mm_prepare_data(const struct ethnl_req_info *req_base, + struct ethnl_reply_data *reply_base, + struct genl_info *info) +{ + struct mm_reply_data *data = MM_REPDATA(reply_base); + struct net_device *dev = reply_base->dev; + const struct ethtool_ops *ops; + int ret; + + ops = dev->ethtool_ops; + + if (!ops->get_mm_state) + return -EOPNOTSUPP; + + ethtool_stats_init((u64 *)&data->stats, + sizeof(data->stats) / sizeof(u64)); + + ret = ethnl_ops_begin(dev); + if (ret < 0) + return ret; + + ops->get_mm_state(dev, &data->state); + + if (ops->get_mm_stats && (req_base->flags & ETHTOOL_FLAG_STATS)) + ops->get_mm_stats(dev, &data->stats); + + ethnl_ops_complete(dev); + + return 0; +} + +static int mm_reply_size(const struct ethnl_req_info *req_base, + const struct ethnl_reply_data *reply_base) +{ + int len = 0; + + len += nla_total_size(sizeof(u8)); /* _MM_VERIFY_STATUS */ + len += nla_total_size(sizeof(u32)); /* _MM_VERIFY_TIME */ + len += nla_total_size(sizeof(u8)); /* _MM_SUPPORTED */ + len += nla_total_size(sizeof(u8)); /* _MM_ENABLED */ + len += nla_total_size(sizeof(u8)); /* _MM_ACTIVE */ + len += nla_total_size(sizeof(u8)); /* _MM_ADD_FRAG_SIZE */ + + if (req_base->flags & ETHTOOL_FLAG_STATS) + len += nla_total_size(0) + /* _MM_STATS */ + nla_total_size_64bit(sizeof(u64)) * ETHTOOL_MM_STAT_CNT; + + return len; +} + +static int mm_put_stat(struct sk_buff *skb, u64 val, u16 attrtype) +{ + if (val == ETHTOOL_STAT_NOT_SET) + return 0; + if (nla_put_u64_64bit(skb, attrtype, val, ETHTOOL_A_MM_STAT_PAD)) + return -EMSGSIZE; + return 0; +} + +static int mm_put_stats(struct sk_buff *skb, + const struct ethtool_mm_stats *stats) +{ + struct nlattr *nest; + + nest = nla_nest_start(skb, ETHTOOL_A_MM_STATS); + if (!nest) + return -EMSGSIZE; + + if (mm_put_stat(skb, stats->MACMergeFrameAssErrorCount, + ETHTOOL_A_MM_STAT_REASSEMBLY_ERRORS) || + mm_put_stat(skb, stats->MACMergeFrameSmdErrorCount, + ETHTOOL_A_MM_STAT_SMD_ERRORS) || + mm_put_stat(skb, stats->MACMergeFrameAssOkCount, + ETHTOOL_A_MM_STAT_REASSEMBLY_OK) || + mm_put_stat(skb, stats->MACMergeFragCountRx, + ETHTOOL_A_MM_STAT_RX_FRAG_COUNT) || + mm_put_stat(skb, stats->MACMergeFragCountTx, + ETHTOOL_A_MM_STAT_TX_FRAG_COUNT) || + mm_put_stat(skb, stats->MACMergeHoldCount, + ETHTOOL_A_MM_STAT_HOLD_COUNT)) + goto err_cancel; + + nla_nest_end(skb, nest); + return 0; + +err_cancel: + nla_nest_cancel(skb, nest); + return -EMSGSIZE; +} + +static int mm_fill_reply(struct sk_buff *skb, + const struct ethnl_req_info *req_base, + const struct ethnl_reply_data *reply_base) +{ + const struct mm_reply_data *data = MM_REPDATA(reply_base); + const struct ethtool_mm_state *state = &data->state; + + if (nla_put_u8(skb, ETHTOOL_A_MM_VERIFY_STATUS, state->verify_status) || + nla_put_u32(skb, ETHTOOL_A_MM_VERIFY_TIME, state->verify_time) || + nla_put_u8(skb, ETHTOOL_A_MM_SUPPORTED, state->supported) || + nla_put_u8(skb, ETHTOOL_A_MM_ENABLED, state->enabled) || + nla_put_u8(skb, ETHTOOL_A_MM_ACTIVE, state->active) || + nla_put_u8(skb, ETHTOOL_A_MM_ADD_FRAG_SIZE, state->add_frag_size)) + return -EMSGSIZE; + + if (req_base->flags & ETHTOOL_FLAG_STATS && + mm_put_stats(skb, &data->stats)) + return -EMSGSIZE; + + return 0; +} + +const struct ethnl_request_ops ethnl_mm_request_ops = { + .request_cmd = ETHTOOL_MSG_MM_GET, + .reply_cmd = ETHTOOL_MSG_MM_GET_REPLY, + .hdr_attr = ETHTOOL_A_MM_HEADER, + .req_info_size = sizeof(struct mm_req_info), + .reply_data_size = sizeof(struct mm_reply_data), + + .prepare_data = mm_prepare_data, + .reply_size = mm_reply_size, + .fill_reply = mm_fill_reply, +}; + +const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1] = { + [ETHTOOL_A_MM_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy), + [ETHTOOL_A_MM_VERIFY_DISABLE] = { .type = NLA_U8 }, + [ETHTOOL_A_MM_VERIFY_TIME] = { .type = NLA_U32 }, + [ETHTOOL_A_MM_ENABLED] = { .type = NLA_U8 }, + [ETHTOOL_A_MM_ADD_FRAG_SIZE] = { .type = NLA_U8 }, +}; + +static void mm_state_to_cfg(const struct ethtool_mm_state *state, + struct ethtool_mm_cfg *cfg) +{ + cfg->verify_disable = + state->verify_status == ETHTOOL_MM_VERIFY_STATUS_DISABLED; + cfg->verify_time = state->verify_time; + cfg->enabled = state->enabled; + cfg->add_frag_size = state->add_frag_size; +} + +int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info) +{ + struct netlink_ext_ack *extack = info->extack; + struct ethnl_req_info req_info = {}; + struct ethtool_mm_state state = {}; + struct nlattr **tb = info->attrs; + struct ethtool_mm_cfg cfg = {}; + const struct ethtool_ops *ops; + struct net_device *dev; + bool mod = false; + int ret; + + ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_MM_HEADER], + genl_info_net(info), extack, true); + if (ret) + return ret; + + dev = req_info.dev; + ops = dev->ethtool_ops; + + if (!ops->get_mm_state || !ops->set_mm_cfg) { + ret = -EOPNOTSUPP; + goto out_dev; + } + + rtnl_lock(); + ret = ethnl_ops_begin(dev); + if (ret) + goto out_rtnl; + + ops->get_mm_state(dev, &state); + + mm_state_to_cfg(&state, &cfg); + + ethnl_update_bool(&cfg.verify_disable, tb[ETHTOOL_A_MM_VERIFY_DISABLE], + &mod); + ethnl_update_u32(&cfg.verify_time, tb[ETHTOOL_A_MM_VERIFY_TIME], &mod); + ethnl_update_bool(&cfg.enabled, tb[ETHTOOL_A_MM_ENABLED], &mod); + ethnl_update_u8(&cfg.add_frag_size, tb[ETHTOOL_A_MM_ADD_FRAG_SIZE], + &mod); + + ret = ops->set_mm_cfg(dev, &cfg, extack); + if (ret) { + if (!extack->_msg) + NL_SET_ERR_MSG(extack, + "Failed to update MAC merge configuration"); + goto out_ops; + } + + ethtool_notify(dev, ETHTOOL_MSG_MM_NTF, NULL); + +out_ops: + ethnl_ops_complete(dev); +out_rtnl: + rtnl_unlock(); +out_dev: + dev_put(dev); + return ret; +} diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c index e26079e11835..82ad2bd92d70 100644 --- a/net/ethtool/netlink.c +++ b/net/ethtool/netlink.c @@ -286,6 +286,8 @@ ethnl_default_requests[__ETHTOOL_MSG_USER_CNT] = { [ETHTOOL_MSG_STATS_GET] = ðnl_stats_request_ops, [ETHTOOL_MSG_PHC_VCLOCKS_GET] = ðnl_phc_vclocks_request_ops, [ETHTOOL_MSG_MODULE_GET] = ðnl_module_request_ops, + [ETHTOOL_MSG_FP_GET] = ðnl_fp_request_ops, + [ETHTOOL_MSG_MM_GET] = ðnl_mm_request_ops, }; static struct ethnl_dump_ctx *ethnl_dump_context(struct netlink_callback *cb) @@ -598,6 +600,8 @@ ethnl_default_notify_ops[ETHTOOL_MSG_KERNEL_MAX + 1] = { [ETHTOOL_MSG_EEE_NTF] = ðnl_eee_request_ops, [ETHTOOL_MSG_FEC_NTF] = ðnl_fec_request_ops, [ETHTOOL_MSG_MODULE_NTF] = ðnl_module_request_ops, + [ETHTOOL_MSG_FP_NTF] = ðnl_fp_request_ops, + [ETHTOOL_MSG_MM_NTF] = ðnl_mm_request_ops, }; /* default notification handler */ @@ -691,6 +695,8 @@ static const ethnl_notify_handler_t ethnl_notify_handlers[] = { [ETHTOOL_MSG_EEE_NTF] = ethnl_default_notify, [ETHTOOL_MSG_FEC_NTF] = ethnl_default_notify, [ETHTOOL_MSG_MODULE_NTF] = ethnl_default_notify, + [ETHTOOL_MSG_FP_NTF] = ethnl_default_notify, + [ETHTOOL_MSG_MM_NTF] = ethnl_default_notify, }; void ethtool_notify(struct net_device *dev, unsigned int cmd, const void *data) @@ -1020,6 +1026,38 @@ static const struct genl_ops ethtool_genl_ops[] = { .policy = ethnl_module_set_policy, .maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1, }, + { + .cmd = ETHTOOL_MSG_FP_GET, + .doit = ethnl_default_doit, + .start = ethnl_default_start, + .dumpit = ethnl_default_dumpit, + .done = ethnl_default_done, + .policy = ethnl_fp_get_policy, + .maxattr = ARRAY_SIZE(ethnl_fp_get_policy) - 1, + }, + { + .cmd = ETHTOOL_MSG_FP_SET, + .flags = GENL_UNS_ADMIN_PERM, + .doit = ethnl_set_fp_param, + .policy = ethnl_fp_set_policy, + .maxattr = ARRAY_SIZE(ethnl_fp_set_policy) - 1, + }, + { + .cmd = ETHTOOL_MSG_MM_GET, + .doit = ethnl_default_doit, + .start = ethnl_default_start, + .dumpit = ethnl_default_dumpit, + .done = ethnl_default_done, + .policy = ethnl_mm_get_policy, + .maxattr = ARRAY_SIZE(ethnl_mm_get_policy) - 1, + }, + { + .cmd = ETHTOOL_MSG_MM_SET, + .flags = GENL_UNS_ADMIN_PERM, + .doit = ethnl_set_mm_cfg, + .policy = ethnl_mm_set_policy, + .maxattr = ARRAY_SIZE(ethnl_mm_set_policy) - 1, + }, }; static const struct genl_multicast_group ethtool_nl_mcgrps[] = { diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h index 1653fd2cf0cf..a2e56df74c85 100644 --- a/net/ethtool/netlink.h +++ b/net/ethtool/netlink.h @@ -371,6 +371,8 @@ extern const struct ethnl_request_ops ethnl_module_eeprom_request_ops; extern const struct ethnl_request_ops ethnl_stats_request_ops; extern const struct ethnl_request_ops ethnl_phc_vclocks_request_ops; extern const struct ethnl_request_ops ethnl_module_request_ops; +extern const struct ethnl_request_ops ethnl_fp_request_ops; +extern const struct ethnl_request_ops ethnl_mm_request_ops; extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1]; extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1]; @@ -409,6 +411,10 @@ extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1 extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1]; extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1]; extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_POWER_MODE_POLICY + 1]; +extern const struct nla_policy ethnl_fp_get_policy[ETHTOOL_A_FP_HEADER + 1]; +extern const struct nla_policy ethnl_fp_set_policy[ETHTOOL_A_FP_MAX + 1]; +extern const struct nla_policy ethnl_mm_get_policy[ETHTOOL_A_MM_HEADER + 1]; +extern const struct nla_policy ethnl_mm_set_policy[ETHTOOL_A_MM_MAX + 1]; int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info); int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info); @@ -428,6 +434,8 @@ int ethnl_tunnel_info_start(struct netlink_callback *cb); int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb); int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info); int ethnl_set_module(struct sk_buff *skb, struct genl_info *info); +int ethnl_set_fp_param(struct sk_buff *skb, struct genl_info *info); +int ethnl_set_mm_cfg(struct sk_buff *skb, struct genl_info *info); extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN]; extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN];
Frame preemption (IEEE 802.1Q-2018 clause 6.7.2) and the MAC merge sublayer (IEEE 802.3-2018 clause 99) are 2 specifications, part of the Time Sensitive Networking enhancements, which work together to minimize latency caused by frame interference. The overall goal of TSN is for normal traffic and traffic belonging to real-time control processes to be able to cohabitate on the same L2 network and not bother each other too much. They achieve this (partly) by introducing the concept of preemptable traffic, i.e. Ethernet frames that have an altered Start-Of-Frame delimiter, which can be fragmented and reassembled at L2 on a link-local basis. The non-preemptable frames are called express traffic, and they can preempt preemptable frames, therefore having lower latency, which can matter at lower (100 Mbps) link speeds, or at high MTUs (jumbo frames around 9K). Preemption is not recursive, i.e. a PT frame cannot preempt another PT frame. Preemption also does not depend upon priority, or otherwise said, an ET frame with prio 0 will still preempt a PT frame with prio 7. In terms of implementation, the specs talk about the fact that the express traffic is handled by an express MAC (eMAC) and the preemptable traffic by a preemptable MAC (pMAC), and these MACs are multiplexed on the same MII by a MAC merge layer. On RX, packets go to the eMAC or to the pMAC based on their SFD (the definition of which was generalized to SMD, Start-of-mPacket-Delimiter, where mPacket is essentially an Ethernet frame fragment, or a complete frame). On TX, the eMAC/pMAC classification decision is taken by the 802.1Q spec, based on packet priority (each of the 8 priority values may have an admin-status of preemptable or express). I have modeled both the Ethernet part of the spec and the queuing part as separate netlink messages, with separate ethtool command sets intended for them. I am slightly flexible as to where to place the FP settings; there were previous discussions about placing them in tc. At the moment I haven't received a good enough argument to move them out of ethtool, so here they are. https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> --- include/linux/ethtool.h | 59 ++++++ include/uapi/linux/ethtool.h | 15 ++ include/uapi/linux/ethtool_netlink.h | 82 ++++++++ net/ethtool/Makefile | 3 +- net/ethtool/fp.c | 295 +++++++++++++++++++++++++++ net/ethtool/mm.c | 228 +++++++++++++++++++++ net/ethtool/netlink.c | 38 ++++ net/ethtool/netlink.h | 8 + 8 files changed, 727 insertions(+), 1 deletion(-) create mode 100644 net/ethtool/fp.c create mode 100644 net/ethtool/mm.c