diff mbox series

[net-next,v9,2/4] ethtool: provide customized dim profile management

Message ID 20240417155546.25691-3-hengqi@linux.alibaba.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series ethtool: provide the dim profile fine-tuning channel | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; GEN HAS DIFF 2 files changed, 250 insertions(+);
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4839 this patch: 4839
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 1041 this patch: 1041
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5115 this patch: 5115
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 106 this patch: 106
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-18--00-00 (tests: 961)

Commit Message

Heng Qi April 17, 2024, 3:55 p.m. UTC
The NetDIM library, currently leveraged by an array of NICs, delivers
excellent acceleration benefits. Nevertheless, NICs vary significantly
in their dim profile list prerequisites.

Specifically, virtio-net backends may present diverse sw or hw device
implementation, making a one-size-fits-all parameter list impractical.
On Alibaba Cloud, the virtio DPU's performance under the default DIM
profile falls short of expectations, partly due to a mismatch in
parameter configuration.

I also noticed that ice/idpf/ena and other NICs have customized
profilelist or placed some restrictions on dim capabilities.

Motivated by this, I tried adding new params for "ethtool -C" that provides
a per-device control to modify and access a device's interrupt parameters.

Usage
========
The target NIC is named ethx.

Assume that ethx only declares support for ETHTOOL_COALESCE_RX_EQE_PROFILE
in ethtool_ops->supported_coalesce_params.

1. Query the currently customized list of the device

$ ethtool -c ethx
...
rx-eqe-profile:
{.usec =   1, .pkts = 256, .comps =   0,},
{.usec =   8, .pkts = 256, .comps =   0,},
{.usec =  64, .pkts = 256, .comps =   0,},
{.usec = 128, .pkts = 256, .comps =   0,},
{.usec = 256, .pkts = 256, .comps =   0,}
rx-cqe-profile:   n/a
tx-eqe-profile:   n/a
tx-cqe-profile:   n/a

2. Tune
$ ethtool -C ethx rx-eqe-profile 1,1,0_2,2,0_3,3,0_4,4,0_5,5,0
$ ethtool -c ethx
...
rx-eqe-profile:
{.usec =   1, .pkts =   1, .comps =   0,},
{.usec =   2, .pkts =   2, .comps =   0,},
{.usec =   3, .pkts =   3, .comps =   0,},
{.usec =   4, .pkts =   4, .comps =   0,},
{.usec =   5, .pkts =   5, .comps =   0,}
rx-cqe-profile:   n/a
tx-eqe-profile:   n/a
tx-cqe-profile:   n/a

3. Hint
If the device does not support some type of customized dim
profiles, the corresponding "n/a" will display.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
---
 Documentation/netlink/specs/ethtool.yaml     |  33 +++
 Documentation/networking/ethtool-netlink.rst |   8 +
 include/linux/ethtool.h                      |  11 +-
 include/linux/netdevice.h                    |  24 +++
 include/uapi/linux/ethtool_netlink.h         |  24 +++
 net/core/dev.c                               |  79 ++++++++
 net/ethtool/coalesce.c                       | 201 ++++++++++++++++++-
 7 files changed, 378 insertions(+), 2 deletions(-)

Comments

Jakub Kicinski April 19, 2024, 12:48 a.m. UTC | #1
On Wed, 17 Apr 2024 23:55:44 +0800 Heng Qi wrote:
> $ ethtool -c ethx
> ...
> rx-eqe-profile:
> {.usec =   1, .pkts = 256, .comps =   0,},
> {.usec =   8, .pkts = 256, .comps =   0,},
> {.usec =  64, .pkts = 256, .comps =   0,},
> {.usec = 128, .pkts = 256, .comps =   0,},
> {.usec = 256, .pkts = 256, .comps =   0,}
> rx-cqe-profile:   n/a
> tx-eqe-profile:   n/a
> tx-cqe-profile:   n/a

I don't think that exposing CQE vs EQE mode makes much sense here.
We enable the right mode via:

struct kernel_ethtool_coalesce {
	u8 use_cqe_mode_tx;
	u8 use_cqe_mode_rx;

the user needs to set the packets and usecs in an educated way 
(matching the mode). The kernel never changes the mode.

Am I missing something?

> +  -
> +    name: moderation

irq-moderation ?

> +    attributes:
> +      -
> +        name: usec
> +        type: u32
> +      -
> +        name: pkts
> +        type: u32
> +      -
> +        name: comps
> +        type: u32

> +++ b/include/linux/ethtool.h
> @@ -284,7 +284,11 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
>  #define ETHTOOL_COALESCE_TX_AGGR_MAX_BYTES	BIT(24)
>  #define ETHTOOL_COALESCE_TX_AGGR_MAX_FRAMES	BIT(25)
>  #define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS	BIT(26)
> -#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(26, 0)
> +#define ETHTOOL_COALESCE_RX_EQE_PROFILE         BIT(27)
> +#define ETHTOOL_COALESCE_RX_CQE_PROFILE         BIT(28)
> +#define ETHTOOL_COALESCE_TX_EQE_PROFILE         BIT(29)
> +#define ETHTOOL_COALESCE_TX_CQE_PROFILE         BIT(30)
> +#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(30, 0)

I think it's better to add these to netdev_ops, see below.

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d45f330d083d..a1c7e9c2be86 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -80,6 +80,25 @@ struct xdp_frame;
>  struct xdp_metadata_ops;
>  struct xdp_md;
>  
> +#if IS_ENABLED(CONFIG_DIMLIB)

Don't wrap type definitions in an ifdef.

> +struct dim_cq_moder;

Unnecessary.

> +#define NETDEV_PROFILE_USEC	BIT(0)	/* device supports usec field modification */
> +#define NETDEV_PROFILE_PKTS	BIT(1)	/* device supports pkts field modification */
> +#define NETDEV_PROFILE_COMPS	BIT(2)	/* device supports comps field modification */
> +
> +struct netdev_profile_moder {
> +	/* See NETDEV_PROFILE_* */
> +	unsigned int flags;
> +
> +	/* DIM profile lists for different dim cq modes */
> +	struct dim_cq_moder *rx_eqe_profile;
> +	struct dim_cq_moder *rx_cqe_profile;
> +	struct dim_cq_moder *tx_eqe_profile;
> +	struct dim_cq_moder *tx_cqe_profile;
> +};
> +#endif

All of this can move to the dim header. No need to grow netdevice.h

>  typedef u32 xdp_features_t;
>  
>  void synchronize_net(void);


> +static int dev_dim_profile_init(struct net_device *dev)
> +{
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +	u32 supported = dev->ethtool_ops->supported_coalesce_params;
> +
> +	struct netdev_profile_moder *moder;
> +	int length;
> +
> +	dev->moderation = kzalloc(sizeof(*dev->moderation), GFP_KERNEL);
> +	if (!dev->moderation)
> +		goto err_moder;

if the device has no DIM we should allocate nothing

> +	moder = dev->moderation;
> +	length = NET_DIM_PARAMS_NUM_PROFILES * sizeof(*moder->rx_eqe_profile);
> +
> +	if (supported & ETHTOOL_COALESCE_RX_EQE_PROFILE) {
> +		moder->rx_eqe_profile = kmemdup(dim_rx_profile[0], length, GFP_KERNEL);
> +		if (!moder->rx_eqe_profile)
> +			goto err_rx_eqe;
> +	}
> +	if (supported & ETHTOOL_COALESCE_RX_CQE_PROFILE) {
> +		moder->rx_cqe_profile = kmemdup(dim_rx_profile[1], length, GFP_KERNEL);
> +		if (!moder->rx_cqe_profile)
> +			goto err_rx_cqe;
> +	}
> +	if (supported & ETHTOOL_COALESCE_TX_EQE_PROFILE) {
> +		moder->tx_eqe_profile = kmemdup(dim_tx_profile[0], length, GFP_KERNEL);
> +		if (!moder->tx_eqe_profile)
> +			goto err_tx_eqe;
> +	}
> +	if (supported & ETHTOOL_COALESCE_TX_CQE_PROFILE) {
> +		moder->tx_cqe_profile = kmemdup(dim_tx_profile[1], length, GFP_KERNEL);
> +		if (!moder->tx_cqe_profile)
> +			goto err_tx_cqe;
> +	}

This code should also live in dim rather than dev.c.

> +#endif
> +	return 0;
> +
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +err_tx_cqe:
> +	kfree(moder->tx_eqe_profile);
> +err_tx_eqe:
> +	kfree(moder->rx_cqe_profile);
> +err_rx_cqe:
> +	kfree(moder->rx_eqe_profile);
> +err_rx_eqe:
> +	kfree(moder);
> +err_moder:
> +	return -ENOMEM;
> +#endif
> +}
> +
>  /**
>   * register_netdevice() - register a network device
>   * @dev: device to register
> @@ -10258,6 +10310,10 @@ int register_netdevice(struct net_device *dev)
>  	if (ret)
>  		return ret;
>  
> +	ret = dev_dim_profile_init(dev);
> +	if (ret)
> +		return ret;

This is fine but the driver still has to manually do bunch of init:

		INIT_WORK(&vi->rq[i].dim.work, virtnet_rx_dim_work);
		vi->rq[i].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;

It'd be better to collect all this static info (flags, mode, work func)
in one place / struct, attached to netdev or netdev_ops. Then the
driver can call a helper like which only needs to take netdev and dim
as arguments.

>  	spin_lock_init(&dev->addr_list_lock);
>  	netdev_set_addr_lockdep_class(dev);
>  
> @@ -11011,6 +11067,27 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  }
>  EXPORT_SYMBOL(alloc_netdev_mqs);
>  
> +static void netif_free_profile(struct net_device *dev)
> +{
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +	u32 supported = dev->ethtool_ops->supported_coalesce_params;
> +
> +	if (supported & ETHTOOL_COALESCE_RX_EQE_PROFILE)
> +		kfree(dev->moderation->rx_eqe_profile);

kfree(NULL) is valid, you don't have to check the flags

> +	if (supported & ETHTOOL_COALESCE_RX_CQE_PROFILE)
> +		kfree(dev->moderation->rx_cqe_profile);
> +
> +	if (supported & ETHTOOL_COALESCE_TX_EQE_PROFILE)
> +		kfree(dev->moderation->tx_eqe_profile);
> +
> +	if (supported & ETHTOOL_COALESCE_TX_CQE_PROFILE)
> +		kfree(dev->moderation->tx_cqe_profile);
> +
> +	kfree(dev->moderation);
> +#endif
> +}
> +
>  /**
>   * free_netdev - free network device
>   * @dev: device
> @@ -11036,6 +11113,8 @@ void free_netdev(struct net_device *dev)
>  		return;
>  	}
>  
> +	netif_free_profile(dev);
> +
>  	netif_free_tx_queues(dev);
>  	netif_free_rx_queues(dev);
>  
> diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
> index 83112c1a71ae..3a41840fbcc7 100644
> --- a/net/ethtool/coalesce.c
> +++ b/net/ethtool/coalesce.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  
> +#include <linux/dim.h>
>  #include "netlink.h"
>  #include "common.h"
>  
> @@ -51,6 +52,10 @@ __CHECK_SUPPORTED_OFFSET(COALESCE_RX_MAX_FRAMES_HIGH);
>  __CHECK_SUPPORTED_OFFSET(COALESCE_TX_USECS_HIGH);
>  __CHECK_SUPPORTED_OFFSET(COALESCE_TX_MAX_FRAMES_HIGH);
>  __CHECK_SUPPORTED_OFFSET(COALESCE_RATE_SAMPLE_INTERVAL);
> +__CHECK_SUPPORTED_OFFSET(COALESCE_RX_EQE_PROFILE);
> +__CHECK_SUPPORTED_OFFSET(COALESCE_RX_CQE_PROFILE);
> +__CHECK_SUPPORTED_OFFSET(COALESCE_TX_EQE_PROFILE);
> +__CHECK_SUPPORTED_OFFSET(COALESCE_TX_CQE_PROFILE);
>  
>  const struct nla_policy ethnl_coalesce_get_policy[] = {
>  	[ETHTOOL_A_COALESCE_HEADER]		=
> @@ -82,6 +87,14 @@ static int coalesce_prepare_data(const struct ethnl_req_info *req_base,
>  static int coalesce_reply_size(const struct ethnl_req_info *req_base,
>  			       const struct ethnl_reply_data *reply_base)
>  {
> +	int modersz = nla_total_size(0) + /* _MODERATIONS_MODERATION, nest */
> +		      nla_total_size(sizeof(u32)) + /* _MODERATION_USEC */
> +		      nla_total_size(sizeof(u32)) + /* _MODERATION_PKTS */
> +		      nla_total_size(sizeof(u32));  /* _MODERATION_COMPS */
> +
> +	int total_modersz = nla_total_size(0) +  /* _{R,T}X_{E,C}QE_PROFILE, nest */
> +			modersz * NET_DIM_PARAMS_NUM_PROFILES;
> +
>  	return nla_total_size(sizeof(u32)) +	/* _RX_USECS */
>  	       nla_total_size(sizeof(u32)) +	/* _RX_MAX_FRAMES */
>  	       nla_total_size(sizeof(u32)) +	/* _RX_USECS_IRQ */
> @@ -108,7 +121,8 @@ static int coalesce_reply_size(const struct ethnl_req_info *req_base,
>  	       nla_total_size(sizeof(u8)) +	/* _USE_CQE_MODE_RX */
>  	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_BYTES */
>  	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_FRAMES */
> -	       nla_total_size(sizeof(u32));	/* _TX_AGGR_TIME_USECS */
> +	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_TIME_USECS */
> +	       total_modersz * 4;		/* _{R,T}X_{E,C}QE_PROFILE */
>  }
>  
>  static bool coalesce_put_u32(struct sk_buff *skb, u16 attr_type, u32 val,
> @@ -127,6 +141,62 @@ static bool coalesce_put_bool(struct sk_buff *skb, u16 attr_type, u32 val,
>  	return nla_put_u8(skb, attr_type, !!val);
>  }
>  
> +#if IS_ENABLED(CONFIG_DIMLIB)

prefer using

	if (IS_ENABLED(CONFIG_DIMLIB))
		return -EOPNOTSUPP;

rather than hiding code from the compiler, where possible 

> +/**
> + * coalesce_put_profile - fill reply with a nla nest with four child nla nests.
> + * @skb: socket buffer the message is stored in
> + * @attr_type: nest attr type ETHTOOL_A_COALESCE_*X_*QE_PROFILE
> + * @profile: data passed to userspace
> + * @supported_params: modifiable parameters supported by the driver
> + *
> + * Put a dim profile nest attribute. Refer to ETHTOOL_A_MODERATIONS_MODERATION.
> + *
> + * Return: false to indicate successful placement or no placement, and
> + * true to pass the -EMSGSIZE error to the wrapper.

Why the bool? Doesn't most of the similar code return the error?

> + */
> +static bool coalesce_put_profile(struct sk_buff *skb, u16 attr_type,
> +				 const struct dim_cq_moder *profile,
> +				 u32 supported_params)
> +{
> +	struct nlattr *profile_attr, *moder_attr;
> +	bool emsg = !!-EMSGSIZE;
> +	int i;
> +
> +	if (!profile)
> +		return false;
> +
> +	if (!(supported_params & attr_to_mask(attr_type)))
> +		return false;
> +
> +	profile_attr = nla_nest_start(skb, attr_type);
> +	if (!profile_attr)
> +		return emsg;
> +
> +	for (i = 0; i < NET_DIM_PARAMS_NUM_PROFILES; i++) {
> +		moder_attr = nla_nest_start(skb, ETHTOOL_A_MODERATIONS_MODERATION);
> +		if (!moder_attr)
> +			goto nla_cancel_profile;
> +
> +		if (nla_put_u32(skb, ETHTOOL_A_MODERATION_USEC, profile[i].usec) ||
> +		    nla_put_u32(skb, ETHTOOL_A_MODERATION_PKTS, profile[i].pkts) ||
> +		    nla_put_u32(skb, ETHTOOL_A_MODERATION_COMPS, profile[i].comps))

Only put attrs for supported fields

> +			goto nla_cancel_moder;
> +
> +		nla_nest_end(skb, moder_attr);
> +	}
> +
> +	nla_nest_end(skb, profile_attr);
> +
> +	return 0;
> +
> +nla_cancel_moder:
> +	nla_nest_cancel(skb, moder_attr);
> +nla_cancel_profile:
> +	nla_nest_cancel(skb, profile_attr);
> +	return emsg;
> +}
> +#endif
> +
>  static int coalesce_fill_reply(struct sk_buff *skb,
>  			       const struct ethnl_req_info *req_base,
>  			       const struct ethnl_reply_data *reply_base)
> @@ -134,6 +204,9 @@ static int coalesce_fill_reply(struct sk_buff *skb,
>  	const struct coalesce_reply_data *data = COALESCE_REPDATA(reply_base);
>  	const struct kernel_ethtool_coalesce *kcoal = &data->kernel_coalesce;
>  	const struct ethtool_coalesce *coal = &data->coalesce;
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +	struct net_device *dev = req_base->dev;
> +#endif
>  	u32 supported = data->supported_params;
>  
>  	if (coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_USECS,
> @@ -192,6 +265,21 @@ static int coalesce_fill_reply(struct sk_buff *skb,
>  			     kcoal->tx_aggr_time_usecs, supported))
>  		return -EMSGSIZE;
>  
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +	if (!(dev->moderation->flags & (NETDEV_PROFILE_USEC | NETDEV_PROFILE_PKTS |
> +					NETDEV_PROFILE_COMPS)))
> +		return 0;
> +
> +	if (coalesce_put_profile(skb, ETHTOOL_A_COALESCE_RX_EQE_PROFILE,
> +				 dev->moderation->rx_eqe_profile, supported) ||
> +	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_RX_CQE_PROFILE,
> +				 dev->moderation->rx_cqe_profile, supported) ||
> +	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_TX_EQE_PROFILE,
> +				 dev->moderation->tx_eqe_profile, supported) ||
> +	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_TX_CQE_PROFILE,
> +				 dev->moderation->tx_cqe_profile, supported))
> +		return -EMSGSIZE;
> +#endif
>  	return 0;
>  }
>  
> @@ -227,7 +315,19 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
>  	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
>  	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
>  	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
> +	[ETHTOOL_A_COALESCE_RX_EQE_PROFILE]     = { .type = NLA_NESTED },
> +	[ETHTOOL_A_COALESCE_RX_CQE_PROFILE]     = { .type = NLA_NESTED },
> +	[ETHTOOL_A_COALESCE_TX_EQE_PROFILE]     = { .type = NLA_NESTED },
> +	[ETHTOOL_A_COALESCE_TX_CQE_PROFILE]     = { .type = NLA_NESTED },

NLA_POLICY_NESTED(coalesce_set_profile_policy)

> +};
> +
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +static const struct nla_policy coalesce_set_profile_policy[] = {
> +	[ETHTOOL_A_MODERATION_USEC]	= {.type = NLA_U32},
> +	[ETHTOOL_A_MODERATION_PKTS]	= {.type = NLA_U32},
> +	[ETHTOOL_A_MODERATION_COMPS]	= {.type = NLA_U32},
>  };
> +#endif
>  
>  static int
>  ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
> @@ -253,6 +353,76 @@ ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
>  	return 1;
>  }
>  
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +/**
> + * ethnl_update_profile - get a nla nest with four child nla nests from userspace.
> + * @dev: netdevice to update the profile
> + * @dst: data get from the driver and modified by ethnl_update_profile.
> + * @nests: nest attr ETHTOOL_A_COALESCE_*X_*QE_PROFILE to set driver's profile.
> + * @extack: Netlink extended ack
> + *
> + * Layout of nests:
> + *   Nested ETHTOOL_A_COALESCE_*X_*QE_PROFILE attr
> + *     Nested ETHTOOL_A_MODERATIONS_MODERATION attr
> + *       ETHTOOL_A_MODERATION_USEC attr
> + *       ETHTOOL_A_MODERATION_PKTS attr
> + *       ETHTOOL_A_MODERATION_COMPS attr
> + *     ...
> + *     Nested ETHTOOL_A_MODERATIONS_MODERATION attr
> + *       ETHTOOL_A_MODERATION_USEC attr
> + *       ETHTOOL_A_MODERATION_PKTS attr
> + *       ETHTOOL_A_MODERATION_COMPS attr
> + *
> + * Return: 0 on success or a negative error code.
> + */
> +static int ethnl_update_profile(struct net_device *dev,
> +				struct dim_cq_moder *dst,
> +				const struct nlattr *nests,
> +				struct netlink_ext_ack *extack)
> +{
> +	struct nlattr *tb_moder[ARRAY_SIZE(coalesce_set_profile_policy)];
> +	struct dim_cq_moder profile[NET_DIM_PARAMS_NUM_PROFILES];
> +	struct netdev_profile_moder *moder = dev->moderation;
> +	struct nlattr *nest;
> +	int ret, rem, i = 0;
> +
> +	if (!nests)
> +		return 0;
> +
> +	if (!dst)
> +		return -EOPNOTSUPP;
> +
> +	nla_for_each_nested_type(nest, ETHTOOL_A_MODERATIONS_MODERATION, nests, rem) {
> +		ret = nla_parse_nested(tb_moder,
> +				       ARRAY_SIZE(coalesce_set_profile_policy) - 1,
> +				       nest, coalesce_set_profile_policy,
> +				       extack);
> +		if (ret)
> +			return ret;
> +
> +		if (NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_USEC) ||
> +		    NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_PKTS) ||
> +		    NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_COMPS))

Only require what the device supports, reject what it doesn't

> +			return -EINVAL;
> +
> +		profile[i].usec = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_USEC]);
> +		profile[i].pkts = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_PKTS]);
> +		profile[i].comps = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_COMPS]);
> +
> +		if ((dst[i].usec != profile[i].usec && !(moder->flags & NETDEV_PROFILE_USEC)) ||
> +		    (dst[i].pkts != profile[i].pkts && !(moder->flags & NETDEV_PROFILE_PKTS)) ||
> +		    (dst[i].comps != profile[i].comps && !(moder->flags & NETDEV_PROFILE_COMPS)))
> +			return -EOPNOTSUPP;
> +
> +		i++;
> +	}
> +
> +	memcpy(dst, profile, sizeof(profile));

Is this safe? I think you need to use some synchronization when
swapping profiles, maybe RCU?..

> +	return 0;
> +}
> +#endif
> +
>  static int
>  __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
>  		     bool *dual_change)
> @@ -317,6 +487,35 @@ __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
>  	ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
>  			 tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
>  
> +#if IS_ENABLED(CONFIG_DIMLIB)
> +	ret = ethnl_update_profile(dev, dev->moderation->rx_eqe_profile,
> +				   tb[ETHTOOL_A_COALESCE_RX_EQE_PROFILE],
> +				   info->extack);
> +	if (ret < 0)
> +		return ret;
> +	ret = ethnl_update_profile(dev, dev->moderation->rx_cqe_profile,
> +				   tb[ETHTOOL_A_COALESCE_RX_CQE_PROFILE],
> +				   info->extack);
> +	if (ret < 0)
> +		return ret;
> +	ret = ethnl_update_profile(dev, dev->moderation->tx_eqe_profile,
> +				   tb[ETHTOOL_A_COALESCE_TX_EQE_PROFILE],
> +				   info->extack);
> +	if (ret < 0)
> +		return ret;
> +	ret = ethnl_update_profile(dev, dev->moderation->tx_cqe_profile,
> +				   tb[ETHTOOL_A_COALESCE_TX_CQE_PROFILE],
> +				   info->extack);
> +	if (ret < 0)
> +		return ret;
> +#else
> +	if (tb[ETHTOOL_A_COALESCE_RX_EQE_PROFILE] ||
> +	    tb[ETHTOOL_A_COALESCE_RX_CQE_PROFILE] ||
> +	    tb[ETHTOOL_A_COALESCE_TX_EQE_PROFILE] ||
> +	    tb[ETHTOOL_A_COALESCE_TX_CQE_PROFILE])
> +		return -EOPNOTSUPP;
> +
> +#endif
>  	/* Update operation modes */
>  	ethnl_update_bool32(&coalesce.use_adaptive_rx_coalesce,
>  			    tb[ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX], &mod_mode);
Brett Creeley April 21, 2024, 3:53 p.m. UTC | #2
On 4/17/2024 8:55 AM, Heng Qi wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
> The NetDIM library, currently leveraged by an array of NICs, delivers
> excellent acceleration benefits. Nevertheless, NICs vary significantly
> in their dim profile list prerequisites.
> 
> Specifically, virtio-net backends may present diverse sw or hw device
> implementation, making a one-size-fits-all parameter list impractical.
> On Alibaba Cloud, the virtio DPU's performance under the default DIM
> profile falls short of expectations, partly due to a mismatch in
> parameter configuration.
> 
> I also noticed that ice/idpf/ena and other NICs have customized
> profilelist or placed some restrictions on dim capabilities.
> 
> Motivated by this, I tried adding new params for "ethtool -C" that provides
> a per-device control to modify and access a device's interrupt parameters.
> 
> Usage
> ========
> The target NIC is named ethx.
> 
> Assume that ethx only declares support for ETHTOOL_COALESCE_RX_EQE_PROFILE
> in ethtool_ops->supported_coalesce_params.
> 
> 1. Query the currently customized list of the device
> 
> $ ethtool -c ethx
> ...
> rx-eqe-profile:
> {.usec =   1, .pkts = 256, .comps =   0,},
> {.usec =   8, .pkts = 256, .comps =   0,},
> {.usec =  64, .pkts = 256, .comps =   0,},
> {.usec = 128, .pkts = 256, .comps =   0,},
> {.usec = 256, .pkts = 256, .comps =   0,}
> rx-cqe-profile:   n/a
> tx-eqe-profile:   n/a
> tx-cqe-profile:   n/a
> 
> 2. Tune
> $ ethtool -C ethx rx-eqe-profile 1,1,0_2,2,0_3,3,0_4,4,0_5,5,0

With all of this work to support custom dim profiles (which I think is a 
great idea FWIW), I wonder if it would be worth supporting a dynamic 
number of profile entries instead of being hard-coded to 5?

Thanks,

Brett

<snip>
Heng Qi April 22, 2024, 9 a.m. UTC | #3
在 2024/4/19 上午8:48, Jakub Kicinski 写道:
> On Wed, 17 Apr 2024 23:55:44 +0800 Heng Qi wrote:
>> $ ethtool -c ethx
>> ...
>> rx-eqe-profile:
>> {.usec =   1, .pkts = 256, .comps =   0,},
>> {.usec =   8, .pkts = 256, .comps =   0,},
>> {.usec =  64, .pkts = 256, .comps =   0,},
>> {.usec = 128, .pkts = 256, .comps =   0,},
>> {.usec = 256, .pkts = 256, .comps =   0,}
>> rx-cqe-profile:   n/a
>> tx-eqe-profile:   n/a
>> tx-cqe-profile:   n/a
> I don't think that exposing CQE vs EQE mode makes much sense here.
> We enable the right mode via:
>
> struct kernel_ethtool_coalesce {
> 	u8 use_cqe_mode_tx;
> 	u8 use_cqe_mode_rx;

I think it is reasonable to use 4 types:

dim lib itself is designed with 4 independent arrays, which are queried 
by dim->mode and
dim->profile_ix. This way allows users to manually modify the cqe mode 
of tx or rx, and the
dim interface can automatically match the profile of the corresponding 
mode when obtaining
the results.

Merely exposing rx_profile and tx_profile does not seem to make the 
interface and code clearer.

>
> the user needs to set the packets and usecs in an educated way
> (matching the mode).

I think this is acceptable. In addition to the above reasons, users can 
already set the cqe
mode of tx and rx in user mode, which means that they have been educated.

> The kernel never changes the mode.

Sorry, I don't seem to understand what this means.

>
> Am I missing something?
>
>> +  -
>> +    name: moderation
> irq-moderation ?

OK.

>
>> +    attributes:
>> +      -
>> +        name: usec
>> +        type: u32
>> +      -
>> +        name: pkts
>> +        type: u32
>> +      -
>> +        name: comps
>> +        type: u32
>> +++ b/include/linux/ethtool.h
>> @@ -284,7 +284,11 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
>>   #define ETHTOOL_COALESCE_TX_AGGR_MAX_BYTES	BIT(24)
>>   #define ETHTOOL_COALESCE_TX_AGGR_MAX_FRAMES	BIT(25)
>>   #define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS	BIT(26)
>> -#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(26, 0)
>> +#define ETHTOOL_COALESCE_RX_EQE_PROFILE         BIT(27)
>> +#define ETHTOOL_COALESCE_RX_CQE_PROFILE         BIT(28)
>> +#define ETHTOOL_COALESCE_TX_EQE_PROFILE         BIT(29)
>> +#define ETHTOOL_COALESCE_TX_CQE_PROFILE         BIT(30)
>> +#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(30, 0)
> I think it's better to add these to netdev_ops, see below.
>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index d45f330d083d..a1c7e9c2be86 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -80,6 +80,25 @@ struct xdp_frame;
>>   struct xdp_metadata_ops;
>>   struct xdp_md;
>>   
>> +#if IS_ENABLED(CONFIG_DIMLIB)
> Don't wrap type definitions in an ifdef.
>
>> +struct dim_cq_moder;
> Unnecessary.
>
>> +#define NETDEV_PROFILE_USEC	BIT(0)	/* device supports usec field modification */
>> +#define NETDEV_PROFILE_PKTS	BIT(1)	/* device supports pkts field modification */
>> +#define NETDEV_PROFILE_COMPS	BIT(2)	/* device supports comps field modification */
>> +
>> +struct netdev_profile_moder {
>> +	/* See NETDEV_PROFILE_* */
>> +	unsigned int flags;
>> +
>> +	/* DIM profile lists for different dim cq modes */
>> +	struct dim_cq_moder *rx_eqe_profile;
>> +	struct dim_cq_moder *rx_cqe_profile;
>> +	struct dim_cq_moder *tx_eqe_profile;
>> +	struct dim_cq_moder *tx_cqe_profile;
>> +};
>> +#endif
> All of this can move to the dim header. No need to grow netdevice.h

OK.

>
>>   typedef u32 xdp_features_t;
>>   
>>   void synchronize_net(void);
>
>> +static int dev_dim_profile_init(struct net_device *dev)
>> +{
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +	u32 supported = dev->ethtool_ops->supported_coalesce_params;
>> +
>> +	struct netdev_profile_moder *moder;
>> +	int length;
>> +
>> +	dev->moderation = kzalloc(sizeof(*dev->moderation), GFP_KERNEL);
>> +	if (!dev->moderation)
>> +		goto err_moder;
> if the device has no DIM we should allocate nothing

This can be done by netdev_ops.

>
>> +	moder = dev->moderation;
>> +	length = NET_DIM_PARAMS_NUM_PROFILES * sizeof(*moder->rx_eqe_profile);
>> +
>> +	if (supported & ETHTOOL_COALESCE_RX_EQE_PROFILE) {
>> +		moder->rx_eqe_profile = kmemdup(dim_rx_profile[0], length, GFP_KERNEL);
>> +		if (!moder->rx_eqe_profile)
>> +			goto err_rx_eqe;
>> +	}
>> +	if (supported & ETHTOOL_COALESCE_RX_CQE_PROFILE) {
>> +		moder->rx_cqe_profile = kmemdup(dim_rx_profile[1], length, GFP_KERNEL);
>> +		if (!moder->rx_cqe_profile)
>> +			goto err_rx_cqe;
>> +	}
>> +	if (supported & ETHTOOL_COALESCE_TX_EQE_PROFILE) {
>> +		moder->tx_eqe_profile = kmemdup(dim_tx_profile[0], length, GFP_KERNEL);
>> +		if (!moder->tx_eqe_profile)
>> +			goto err_tx_eqe;
>> +	}
>> +	if (supported & ETHTOOL_COALESCE_TX_CQE_PROFILE) {
>> +		moder->tx_cqe_profile = kmemdup(dim_tx_profile[1], length, GFP_KERNEL);
>> +		if (!moder->tx_cqe_profile)
>> +			goto err_tx_cqe;
>> +	}
> This code should also live in dim rather than dev.c.
>
>> +#endif
>> +	return 0;
>> +
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +err_tx_cqe:
>> +	kfree(moder->tx_eqe_profile);
>> +err_tx_eqe:
>> +	kfree(moder->rx_cqe_profile);
>> +err_rx_cqe:
>> +	kfree(moder->rx_eqe_profile);
>> +err_rx_eqe:
>> +	kfree(moder);
>> +err_moder:
>> +	return -ENOMEM;
>> +#endif
>> +}
>> +
>>   /**
>>    * register_netdevice() - register a network device
>>    * @dev: device to register
>> @@ -10258,6 +10310,10 @@ int register_netdevice(struct net_device *dev)
>>   	if (ret)
>>   		return ret;
>>   
>> +	ret = dev_dim_profile_init(dev);
>> +	if (ret)
>> +		return ret;
> This is fine but the driver still has to manually do bunch of init:
>
> 		INIT_WORK(&vi->rq[i].dim.work, virtnet_rx_dim_work);
> 		vi->rq[i].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
>
> It'd be better to collect all this static info (flags, mode, work func)
> in one place / struct, attached to netdev or netdev_ops. Then the
> driver can call a helper like which only needs to take netdev and dim
> as arguments.

If mode is put into netdev, then use_cqe_mode_rx and use_cqe_mode_tx 
need to be moved into netdev too.
Then dim->mode is no longer required when dim obtain its results.
We need to transform the way all drivers with dim currently behave.

But why should work be put into netdev?
The driver still requires a for loop to initialize dim work for a 
driver-specific queues.

>
>>   	spin_lock_init(&dev->addr_list_lock);
>>   	netdev_set_addr_lockdep_class(dev);
>>   
>> @@ -11011,6 +11067,27 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>>   }
>>   EXPORT_SYMBOL(alloc_netdev_mqs);
>>   
>> +static void netif_free_profile(struct net_device *dev)
>> +{
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +	u32 supported = dev->ethtool_ops->supported_coalesce_params;
>> +
>> +	if (supported & ETHTOOL_COALESCE_RX_EQE_PROFILE)
>> +		kfree(dev->moderation->rx_eqe_profile);
> kfree(NULL) is valid, you don't have to check the flags

Ok.

>
>> +	if (supported & ETHTOOL_COALESCE_RX_CQE_PROFILE)
>> +		kfree(dev->moderation->rx_cqe_profile);
>> +
>> +	if (supported & ETHTOOL_COALESCE_TX_EQE_PROFILE)
>> +		kfree(dev->moderation->tx_eqe_profile);
>> +
>> +	if (supported & ETHTOOL_COALESCE_TX_CQE_PROFILE)
>> +		kfree(dev->moderation->tx_cqe_profile);
>> +
>> +	kfree(dev->moderation);
>> +#endif
>> +}
>> +
>>   /**
>>    * free_netdev - free network device
>>    * @dev: device
>> @@ -11036,6 +11113,8 @@ void free_netdev(struct net_device *dev)
>>   		return;
>>   	}
>>   
>> +	netif_free_profile(dev);
>> +
>>   	netif_free_tx_queues(dev);
>>   	netif_free_rx_queues(dev);
>>   
>> diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
>> index 83112c1a71ae..3a41840fbcc7 100644
>> --- a/net/ethtool/coalesce.c
>> +++ b/net/ethtool/coalesce.c
>> @@ -1,5 +1,6 @@
>>   // SPDX-License-Identifier: GPL-2.0-only
>>   
>> +#include <linux/dim.h>
>>   #include "netlink.h"
>>   #include "common.h"
>>   
>> @@ -51,6 +52,10 @@ __CHECK_SUPPORTED_OFFSET(COALESCE_RX_MAX_FRAMES_HIGH);
>>   __CHECK_SUPPORTED_OFFSET(COALESCE_TX_USECS_HIGH);
>>   __CHECK_SUPPORTED_OFFSET(COALESCE_TX_MAX_FRAMES_HIGH);
>>   __CHECK_SUPPORTED_OFFSET(COALESCE_RATE_SAMPLE_INTERVAL);
>> +__CHECK_SUPPORTED_OFFSET(COALESCE_RX_EQE_PROFILE);
>> +__CHECK_SUPPORTED_OFFSET(COALESCE_RX_CQE_PROFILE);
>> +__CHECK_SUPPORTED_OFFSET(COALESCE_TX_EQE_PROFILE);
>> +__CHECK_SUPPORTED_OFFSET(COALESCE_TX_CQE_PROFILE);
>>   
>>   const struct nla_policy ethnl_coalesce_get_policy[] = {
>>   	[ETHTOOL_A_COALESCE_HEADER]		=
>> @@ -82,6 +87,14 @@ static int coalesce_prepare_data(const struct ethnl_req_info *req_base,
>>   static int coalesce_reply_size(const struct ethnl_req_info *req_base,
>>   			       const struct ethnl_reply_data *reply_base)
>>   {
>> +	int modersz = nla_total_size(0) + /* _MODERATIONS_MODERATION, nest */
>> +		      nla_total_size(sizeof(u32)) + /* _MODERATION_USEC */
>> +		      nla_total_size(sizeof(u32)) + /* _MODERATION_PKTS */
>> +		      nla_total_size(sizeof(u32));  /* _MODERATION_COMPS */
>> +
>> +	int total_modersz = nla_total_size(0) +  /* _{R,T}X_{E,C}QE_PROFILE, nest */
>> +			modersz * NET_DIM_PARAMS_NUM_PROFILES;
>> +
>>   	return nla_total_size(sizeof(u32)) +	/* _RX_USECS */
>>   	       nla_total_size(sizeof(u32)) +	/* _RX_MAX_FRAMES */
>>   	       nla_total_size(sizeof(u32)) +	/* _RX_USECS_IRQ */
>> @@ -108,7 +121,8 @@ static int coalesce_reply_size(const struct ethnl_req_info *req_base,
>>   	       nla_total_size(sizeof(u8)) +	/* _USE_CQE_MODE_RX */
>>   	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_BYTES */
>>   	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_FRAMES */
>> -	       nla_total_size(sizeof(u32));	/* _TX_AGGR_TIME_USECS */
>> +	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_TIME_USECS */
>> +	       total_modersz * 4;		/* _{R,T}X_{E,C}QE_PROFILE */
>>   }
>>   
>>   static bool coalesce_put_u32(struct sk_buff *skb, u16 attr_type, u32 val,
>> @@ -127,6 +141,62 @@ static bool coalesce_put_bool(struct sk_buff *skb, u16 attr_type, u32 val,
>>   	return nla_put_u8(skb, attr_type, !!val);
>>   }
>>   
>> +#if IS_ENABLED(CONFIG_DIMLIB)
> prefer using
>
> 	if (IS_ENABLED(CONFIG_DIMLIB))
> 		return -EOPNOTSUPP;

Hiding code is to eliminate nipa's warning about "defined but not used".

>
> rather than hiding code from the compiler, where possible
>
>> +/**
>> + * coalesce_put_profile - fill reply with a nla nest with four child nla nests.
>> + * @skb: socket buffer the message is stored in
>> + * @attr_type: nest attr type ETHTOOL_A_COALESCE_*X_*QE_PROFILE
>> + * @profile: data passed to userspace
>> + * @supported_params: modifiable parameters supported by the driver
>> + *
>> + * Put a dim profile nest attribute. Refer to ETHTOOL_A_MODERATIONS_MODERATION.
>> + *
>> + * Return: false to indicate successful placement or no placement, and
>> + * true to pass the -EMSGSIZE error to the wrapper.
> Why the bool? Doesn't most of the similar code return the error?

Its wrapper function ethnl_default_doit only recognizes the EMSGSIZE code.

>
>> + */
>> +static bool coalesce_put_profile(struct sk_buff *skb, u16 attr_type,
>> +				 const struct dim_cq_moder *profile,
>> +				 u32 supported_params)
>> +{
>> +	struct nlattr *profile_attr, *moder_attr;
>> +	bool emsg = !!-EMSGSIZE;
>> +	int i;
>> +
>> +	if (!profile)
>> +		return false;
>> +
>> +	if (!(supported_params & attr_to_mask(attr_type)))
>> +		return false;
>> +
>> +	profile_attr = nla_nest_start(skb, attr_type);
>> +	if (!profile_attr)
>> +		return emsg;
>> +
>> +	for (i = 0; i < NET_DIM_PARAMS_NUM_PROFILES; i++) {
>> +		moder_attr = nla_nest_start(skb, ETHTOOL_A_MODERATIONS_MODERATION);
>> +		if (!moder_attr)
>> +			goto nla_cancel_profile;
>> +
>> +		if (nla_put_u32(skb, ETHTOOL_A_MODERATION_USEC, profile[i].usec) ||
>> +		    nla_put_u32(skb, ETHTOOL_A_MODERATION_PKTS, profile[i].pkts) ||
>> +		    nla_put_u32(skb, ETHTOOL_A_MODERATION_COMPS, profile[i].comps))
> Only put attrs for supported fields

Ok.

>
>> +			goto nla_cancel_moder;
>> +
>> +		nla_nest_end(skb, moder_attr);
>> +	}
>> +
>> +	nla_nest_end(skb, profile_attr);
>> +
>> +	return 0;
>> +
>> +nla_cancel_moder:
>> +	nla_nest_cancel(skb, moder_attr);
>> +nla_cancel_profile:
>> +	nla_nest_cancel(skb, profile_attr);
>> +	return emsg;
>> +}
>> +#endif
>> +
>>   static int coalesce_fill_reply(struct sk_buff *skb,
>>   			       const struct ethnl_req_info *req_base,
>>   			       const struct ethnl_reply_data *reply_base)
>> @@ -134,6 +204,9 @@ static int coalesce_fill_reply(struct sk_buff *skb,
>>   	const struct coalesce_reply_data *data = COALESCE_REPDATA(reply_base);
>>   	const struct kernel_ethtool_coalesce *kcoal = &data->kernel_coalesce;
>>   	const struct ethtool_coalesce *coal = &data->coalesce;
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +	struct net_device *dev = req_base->dev;
>> +#endif
>>   	u32 supported = data->supported_params;
>>   
>>   	if (coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_USECS,
>> @@ -192,6 +265,21 @@ static int coalesce_fill_reply(struct sk_buff *skb,
>>   			     kcoal->tx_aggr_time_usecs, supported))
>>   		return -EMSGSIZE;
>>   
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +	if (!(dev->moderation->flags & (NETDEV_PROFILE_USEC | NETDEV_PROFILE_PKTS |
>> +					NETDEV_PROFILE_COMPS)))
>> +		return 0;
>> +
>> +	if (coalesce_put_profile(skb, ETHTOOL_A_COALESCE_RX_EQE_PROFILE,
>> +				 dev->moderation->rx_eqe_profile, supported) ||
>> +	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_RX_CQE_PROFILE,
>> +				 dev->moderation->rx_cqe_profile, supported) ||
>> +	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_TX_EQE_PROFILE,
>> +				 dev->moderation->tx_eqe_profile, supported) ||
>> +	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_TX_CQE_PROFILE,
>> +				 dev->moderation->tx_cqe_profile, supported))
>> +		return -EMSGSIZE;
>> +#endif
>>   	return 0;
>>   }
>>   
>> @@ -227,7 +315,19 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
>>   	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
>>   	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
>>   	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
>> +	[ETHTOOL_A_COALESCE_RX_EQE_PROFILE]     = { .type = NLA_NESTED },
>> +	[ETHTOOL_A_COALESCE_RX_CQE_PROFILE]     = { .type = NLA_NESTED },
>> +	[ETHTOOL_A_COALESCE_TX_EQE_PROFILE]     = { .type = NLA_NESTED },
>> +	[ETHTOOL_A_COALESCE_TX_CQE_PROFILE]     = { .type = NLA_NESTED },
> NLA_POLICY_NESTED(coalesce_set_profile_policy)

This doesn't work because the first level of sub-nesting of
ETHTOOL_A_COALESCE_RX_CQE_PROFILE is ETHTOOL_A_PROFILE_IRQ_MODERATION.

>
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +static const struct nla_policy coalesce_set_profile_policy[] = {
>> +	[ETHTOOL_A_MODERATION_USEC]	= {.type = NLA_U32},
>> +	[ETHTOOL_A_MODERATION_PKTS]	= {.type = NLA_U32},
>> +	[ETHTOOL_A_MODERATION_COMPS]	= {.type = NLA_U32},
>>   };
>> +#endif
>>   
>>   static int
>>   ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
>> @@ -253,6 +353,76 @@ ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
>>   	return 1;
>>   }
>>   
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +/**
>> + * ethnl_update_profile - get a nla nest with four child nla nests from userspace.
>> + * @dev: netdevice to update the profile
>> + * @dst: data get from the driver and modified by ethnl_update_profile.
>> + * @nests: nest attr ETHTOOL_A_COALESCE_*X_*QE_PROFILE to set driver's profile.
>> + * @extack: Netlink extended ack
>> + *
>> + * Layout of nests:
>> + *   Nested ETHTOOL_A_COALESCE_*X_*QE_PROFILE attr
>> + *     Nested ETHTOOL_A_MODERATIONS_MODERATION attr
>> + *       ETHTOOL_A_MODERATION_USEC attr
>> + *       ETHTOOL_A_MODERATION_PKTS attr
>> + *       ETHTOOL_A_MODERATION_COMPS attr
>> + *     ...
>> + *     Nested ETHTOOL_A_MODERATIONS_MODERATION attr
>> + *       ETHTOOL_A_MODERATION_USEC attr
>> + *       ETHTOOL_A_MODERATION_PKTS attr
>> + *       ETHTOOL_A_MODERATION_COMPS attr
>> + *
>> + * Return: 0 on success or a negative error code.
>> + */
>> +static int ethnl_update_profile(struct net_device *dev,
>> +				struct dim_cq_moder *dst,
>> +				const struct nlattr *nests,
>> +				struct netlink_ext_ack *extack)
>> +{
>> +	struct nlattr *tb_moder[ARRAY_SIZE(coalesce_set_profile_policy)];
>> +	struct dim_cq_moder profile[NET_DIM_PARAMS_NUM_PROFILES];
>> +	struct netdev_profile_moder *moder = dev->moderation;
>> +	struct nlattr *nest;
>> +	int ret, rem, i = 0;
>> +
>> +	if (!nests)
>> +		return 0;
>> +
>> +	if (!dst)
>> +		return -EOPNOTSUPP;
>> +
>> +	nla_for_each_nested_type(nest, ETHTOOL_A_MODERATIONS_MODERATION, nests, rem) {
>> +		ret = nla_parse_nested(tb_moder,
>> +				       ARRAY_SIZE(coalesce_set_profile_policy) - 1,
>> +				       nest, coalesce_set_profile_policy,
>> +				       extack);
>> +		if (ret)
>> +			return ret;
>> +
>> +		if (NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_USEC) ||
>> +		    NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_PKTS) ||
>> +		    NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_COMPS))
> Only require what the device supports, reject what it doesn't

Ok.

>
>> +			return -EINVAL;
>> +
>> +		profile[i].usec = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_USEC]);
>> +		profile[i].pkts = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_PKTS]);
>> +		profile[i].comps = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_COMPS]);
>> +
>> +		if ((dst[i].usec != profile[i].usec && !(moder->flags & NETDEV_PROFILE_USEC)) ||
>> +		    (dst[i].pkts != profile[i].pkts && !(moder->flags & NETDEV_PROFILE_PKTS)) ||
>> +		    (dst[i].comps != profile[i].comps && !(moder->flags & NETDEV_PROFILE_COMPS)))
>> +			return -EOPNOTSUPP;
>> +
>> +		i++;
>> +	}
>> +
>> +	memcpy(dst, profile, sizeof(profile));
> Is this safe? I think you need to use some synchronization when
> swapping profiles, maybe RCU?..

RCU +1.

Thanks for reminding.

>
>> +	return 0;
>> +}
>> +#endif
>> +
>>   static int
>>   __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
>>   		     bool *dual_change)
>> @@ -317,6 +487,35 @@ __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
>>   	ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
>>   			 tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
>>   
>> +#if IS_ENABLED(CONFIG_DIMLIB)
>> +	ret = ethnl_update_profile(dev, dev->moderation->rx_eqe_profile,
>> +				   tb[ETHTOOL_A_COALESCE_RX_EQE_PROFILE],
>> +				   info->extack);
>> +	if (ret < 0)
>> +		return ret;
>> +	ret = ethnl_update_profile(dev, dev->moderation->rx_cqe_profile,
>> +				   tb[ETHTOOL_A_COALESCE_RX_CQE_PROFILE],
>> +				   info->extack);
>> +	if (ret < 0)
>> +		return ret;
>> +	ret = ethnl_update_profile(dev, dev->moderation->tx_eqe_profile,
>> +				   tb[ETHTOOL_A_COALESCE_TX_EQE_PROFILE],
>> +				   info->extack);
>> +	if (ret < 0)
>> +		return ret;
>> +	ret = ethnl_update_profile(dev, dev->moderation->tx_cqe_profile,
>> +				   tb[ETHTOOL_A_COALESCE_TX_CQE_PROFILE],
>> +				   info->extack);
>> +	if (ret < 0)
>> +		return ret;
>> +#else
>> +	if (tb[ETHTOOL_A_COALESCE_RX_EQE_PROFILE] ||
>> +	    tb[ETHTOOL_A_COALESCE_RX_CQE_PROFILE] ||
>> +	    tb[ETHTOOL_A_COALESCE_TX_EQE_PROFILE] ||
>> +	    tb[ETHTOOL_A_COALESCE_TX_CQE_PROFILE])
>> +		return -EOPNOTSUPP;
>> +
>> +#endif
>>   	/* Update operation modes */
>>   	ethnl_update_bool32(&coalesce.use_adaptive_rx_coalesce,
>>   			    tb[ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX], &mod_mode);
Heng Qi April 22, 2024, 9:04 a.m. UTC | #4
在 2024/4/21 下午11:53, Brett Creeley 写道:
> On 4/17/2024 8:55 AM, Heng Qi wrote:
>> Caution: This message originated from an External Source. Use proper 
>> caution when opening attachments, clicking links, or responding.
>>
>>
>> The NetDIM library, currently leveraged by an array of NICs, delivers
>> excellent acceleration benefits. Nevertheless, NICs vary significantly
>> in their dim profile list prerequisites.
>>
>> Specifically, virtio-net backends may present diverse sw or hw device
>> implementation, making a one-size-fits-all parameter list impractical.
>> On Alibaba Cloud, the virtio DPU's performance under the default DIM
>> profile falls short of expectations, partly due to a mismatch in
>> parameter configuration.
>>
>> I also noticed that ice/idpf/ena and other NICs have customized
>> profilelist or placed some restrictions on dim capabilities.
>>
>> Motivated by this, I tried adding new params for "ethtool -C" that 
>> provides
>> a per-device control to modify and access a device's interrupt 
>> parameters.
>>
>> Usage
>> ========
>> The target NIC is named ethx.
>>
>> Assume that ethx only declares support for 
>> ETHTOOL_COALESCE_RX_EQE_PROFILE
>> in ethtool_ops->supported_coalesce_params.
>>
>> 1. Query the currently customized list of the device
>>
>> $ ethtool -c ethx
>> ...
>> rx-eqe-profile:
>> {.usec =   1, .pkts = 256, .comps =   0,},
>> {.usec =   8, .pkts = 256, .comps =   0,},
>> {.usec =  64, .pkts = 256, .comps =   0,},
>> {.usec = 128, .pkts = 256, .comps =   0,},
>> {.usec = 256, .pkts = 256, .comps =   0,}
>> rx-cqe-profile:   n/a
>> tx-eqe-profile:   n/a
>> tx-cqe-profile:   n/a
>>
>> 2. Tune
>> $ ethtool -C ethx rx-eqe-profile 1,1,0_2,2,0_3,3,0_4,4,0_5,5,0
>
> With all of this work to support custom dim profiles (which I think is 
> a great idea FWIW), I wonder if it would be worth supporting a dynamic 
> number of profile entries instead of being hard-coded to 5?
>

In my practice, I have not found that fewer or more profile entries lead 
to better performance.

Thanks.

> Thanks,
>
> Brett
>
> <snip>
Jakub Kicinski April 22, 2024, 6:39 p.m. UTC | #5
On Mon, 22 Apr 2024 17:00:25 +0800 Heng Qi wrote:
> 在 2024/4/19 上午8:48, Jakub Kicinski 写道:
> > On Wed, 17 Apr 2024 23:55:44 +0800 Heng Qi wrote:  
> >> $ ethtool -c ethx
> >> ...
> >> rx-eqe-profile:
> >> {.usec =   1, .pkts = 256, .comps =   0,},
> >> {.usec =   8, .pkts = 256, .comps =   0,},
> >> {.usec =  64, .pkts = 256, .comps =   0,},
> >> {.usec = 128, .pkts = 256, .comps =   0,},
> >> {.usec = 256, .pkts = 256, .comps =   0,}
> >> rx-cqe-profile:   n/a
> >> tx-eqe-profile:   n/a
> >> tx-cqe-profile:   n/a  
> > I don't think that exposing CQE vs EQE mode makes much sense here.
> > We enable the right mode via:
> >
> > struct kernel_ethtool_coalesce {
> > 	u8 use_cqe_mode_tx;
> > 	u8 use_cqe_mode_rx;  
> 
> I think it is reasonable to use 4 types:
> 
> dim lib itself is designed with 4 independent arrays, which are queried 
> by dim->mode and
> dim->profile_ix. This way allows users to manually modify the cqe mode 
> of tx or rx, and the
> dim interface can automatically match the profile of the corresponding 
> mode when obtaining
> the results.

I'm guessing that DIM has 4 profiles because it was easier to implement
for profiles hardcoded by DIM itself(!) Even for driver-declared
profiles the distinction is a hack.

> Merely exposing rx_profile and tx_profile does not seem to make the 
> interface and code clearer.
> 
> > the user needs to set the packets and usecs in an educated way
> > (matching the mode).  
> 
> I think this is acceptable. In addition to the above reasons, users can 
> already set the cqe
> mode of tx and rx in user mode, which means that they have been educated.
> 
> > The kernel never changes the mode.  
> 
> Sorry, I don't seem to understand what this means.

Kernel never changes the mode for its own reasons.
Only user can change the mode.
So we don't have to always have both CQE and EQE mode tables ready.
If the kernel changed the mode for some reason we'd have to get both
from the user, in case kernel wanted to switch.

> >>   /**
> >>    * register_netdevice() - register a network device
> >>    * @dev: device to register
> >> @@ -10258,6 +10310,10 @@ int register_netdevice(struct net_device *dev)
> >>   	if (ret)
> >>   		return ret;
> >>   
> >> +	ret = dev_dim_profile_init(dev);
> >> +	if (ret)
> >> +		return ret;  
> > This is fine but the driver still has to manually do bunch of init:
> >
> > 		INIT_WORK(&vi->rq[i].dim.work, virtnet_rx_dim_work);
> > 		vi->rq[i].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> >
> > It'd be better to collect all this static info (flags, mode, work func)
> > in one place / struct, attached to netdev or netdev_ops. Then the
> > driver can call a helper like which only needs to take netdev and dim
> > as arguments.  
> 
> If mode is put into netdev, then use_cqe_mode_rx and use_cqe_mode_tx 
> need to be moved into netdev too.
> Then dim->mode is no longer required when dim obtain its results.
> We need to transform the way all drivers with dim currently behave.

Hopefully that won't be too hard.

> But why should work be put into netdev?
> The driver still requires a for loop to initialize dim work for a 
> driver-specific queues.

It's easier to refactor and extend the API when there's an explicit
call done to the core by the driver, rather than driver poking values
into random fields of the core structure.

> >> + * coalesce_put_profile - fill reply with a nla nest with four child nla nests.
> >> + * @skb: socket buffer the message is stored in
> >> + * @attr_type: nest attr type ETHTOOL_A_COALESCE_*X_*QE_PROFILE
> >> + * @profile: data passed to userspace
> >> + * @supported_params: modifiable parameters supported by the driver
> >> + *
> >> + * Put a dim profile nest attribute. Refer to ETHTOOL_A_MODERATIONS_MODERATION.
> >> + *
> >> + * Return: false to indicate successful placement or no placement, and
> >> + * true to pass the -EMSGSIZE error to the wrapper.  
> > Why the bool? Doesn't most of the similar code return the error?  
> 
> Its wrapper function ethnl_default_doit only recognizes the EMSGSIZE code.

Not sure what you mean.

> >> @@ -227,7 +315,19 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
> >>   	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
> >>   	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
> >>   	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
> >> +	[ETHTOOL_A_COALESCE_RX_EQE_PROFILE]     = { .type = NLA_NESTED },
> >> +	[ETHTOOL_A_COALESCE_RX_CQE_PROFILE]     = { .type = NLA_NESTED },
> >> +	[ETHTOOL_A_COALESCE_TX_EQE_PROFILE]     = { .type = NLA_NESTED },
> >> +	[ETHTOOL_A_COALESCE_TX_CQE_PROFILE]     = { .type = NLA_NESTED },  
> > NLA_POLICY_NESTED(coalesce_set_profile_policy)  
> 
> This doesn't work because the first level of sub-nesting of
> ETHTOOL_A_COALESCE_RX_CQE_PROFILE is ETHTOOL_A_PROFILE_IRQ_MODERATION.

So declare a policy for IRQ_MODERATION which has one entry -> nested
profile policy?
Heng Qi April 24, 2024, 1:10 p.m. UTC | #6
@@ -227,7 +315,19 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
>>>>    	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
>>>>    	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
>>>>    	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
>>>> +	[ETHTOOL_A_COALESCE_RX_EQE_PROFILE]     = { .type = NLA_NESTED },
>>>> +	[ETHTOOL_A_COALESCE_RX_CQE_PROFILE]     = { .type = NLA_NESTED },
>>>> +	[ETHTOOL_A_COALESCE_TX_EQE_PROFILE]     = { .type = NLA_NESTED },
>>>> +	[ETHTOOL_A_COALESCE_TX_CQE_PROFILE]     = { .type = NLA_NESTED },
>>> NLA_POLICY_NESTED(coalesce_set_profile_policy)
>> This doesn't work because the first level of sub-nesting of
>> ETHTOOL_A_COALESCE_RX_CQE_PROFILE is ETHTOOL_A_PROFILE_IRQ_MODERATION.
> So declare a policy for IRQ_MODERATION which has one entry -> nested
> profile policy?

Still doesn't work because one profile corresponds to 5 IRQ_MODERATION sub-nests.

In the same example, strset also uses NLA_NESTED.

Thanks.
Heng Qi April 24, 2024, 1:41 p.m. UTC | #7
>>> +++ b/include/linux/ethtool.h
>>> @@ -284,7 +284,11 @@ bool 
>>> ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
>>>   #define ETHTOOL_COALESCE_TX_AGGR_MAX_BYTES    BIT(24)
>>>   #define ETHTOOL_COALESCE_TX_AGGR_MAX_FRAMES    BIT(25)
>>>   #define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS    BIT(26)
>>> -#define ETHTOOL_COALESCE_ALL_PARAMS        GENMASK(26, 0)
>>> +#define ETHTOOL_COALESCE_RX_EQE_PROFILE         BIT(27)
>>> +#define ETHTOOL_COALESCE_RX_CQE_PROFILE         BIT(28)
>>> +#define ETHTOOL_COALESCE_TX_EQE_PROFILE         BIT(29)
>>> +#define ETHTOOL_COALESCE_TX_CQE_PROFILE         BIT(30)
>>> +#define ETHTOOL_COALESCE_ALL_PARAMS        GENMASK(30, 0)
>> I think it's better to add these to netdev_ops, see below.


When modified according to v9's advice, I have the following structure, 
functions and ops:

+#define DIM_PROFILE_RX         BIT(0)  /* support rx dim profile 
modification */
+#define DIM_PROFILE_TX         BIT(1)  /* support tx dim profile 
modification */
+
+#define DIM_COALESCE_USEC      BIT(0)  /* support usec field 
modification */
+#define DIM_COALESCE_PKTS      BIT(1)  /* support pkts field 
modification */
+#define DIM_COALESCE_COMPS     BIT(2)  /* support comps field 
modification */
+
+struct dim_irq_moder {
+       /* See DIM_PROFILE_* */
+       u8 profile_flags;
+
+       /* See DIM_COALESCE_* for Rx and Tx */
+       u8 coal_flags;
+
+       /* Rx DIM period count mode: CQE or EQE */
+       u8 dim_rx_mode;
+
+       /* Tx DIM period count mode: CQE or EQE */
+       u8 dim_tx_mode;
+
+       /* DIM profile list for Rx */
+       struct dim_cq_moder *rx_profile;
+
+       /* DIM profile list for Tx */
+       struct dim_cq_moder *tx_profile;
+
+       /* Rx DIM worker function scheduled by net_dim() */
+       void (*rx_dim_work)(struct work_struct *work);
+
+       /* Tx DIM worker function scheduled by net_dim() */
+       void (*tx_dim_work)(struct work_struct *work);
+};
+

.....


+       .ndo_init_irq_moder     = virtnet_init_irq_moder,

....


+static int virtnet_init_irq_moder(struct net_device *dev)
+{
+        struct virtnet_info *vi = netdev_priv(dev);
+        struct dim_irq_moder *moder;
+        int len;
+
+        if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
+                return 0;
+
+        dev->irq_moder = kzalloc(sizeof(*dev->irq_moder), GFP_KERNEL);
+        if (!dev->irq_moder)
+                goto err_moder;
+
+        moder = dev->irq_moder;
+        len = NET_DIM_PARAMS_NUM_PROFILES * sizeof(*moder->rx_profile);
+
+        moder->profile_flags |= DIM_PROFILE_RX;
+        moder->coal_flags |= DIM_COALESCE_USEC | DIM_COALESCE_PKTS;
+        moder->dim_rx_mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
+        moder->rx_dim_work = virtnet_rx_dim_work;
+
+        moder->rx_profile = kmemdup(dim_rx_profile[moder->dim_rx_mode],
+                                   len, GFP_KERNEL);
+        if (!moder->rx_profile)
+                goto err_profile;
+
+        return 0;
+
+err_profile:
+        kfree(moder);
+err_moder:
+        return -ENOMEM;
+}
+

......

+void net_dim_setting(struct net_device *dev, struct dim *dim, bool is_tx)
+{
+       struct dim_irq_moder *irq_moder = dev->irq_moder;
+
+       if (!irq_moder)
+               return;
+
+       if (is_tx) {
+               INIT_WORK(&dim->work, irq_moder->tx_dim_work);
+               dim->mode = irq_moder->dim_tx_mode;
+               return;
+       }
+
+       INIT_WORK(&dim->work, irq_moder->rx_dim_work);
+       dim->mode = irq_moder->dim_rx_mode;
+}

.....

+       for (i = 0; i < vi->max_queue_pairs; i++)
+               net_dim_setting(vi->dev, &vi->rq[i].dim, false);

.....

     ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,          /* u32 */

     ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,           /* u32 */

+ /* nest - _A_PROFILE_IRQ_MODERATION */

+  ETHTOOL_A_COALESCE_RX_PROFILE,

+   /* nest - _A_PROFILE_IRQ_MODERATION */
+  ETHTOOL_A_COALESCE_TX_PROFILE,

......


Almost clear implementation, but the only problem is when I want to

reuse "ethtool -C" and append ETHTOOL_A_COALESCE_RX_PROFILE

and ETHTOOL_A_COALESCE_TX_PROFILE, *ethnl_set_coalesce_validate()

will report an error because there are no ETHTOOL_COALESCE_RX_PROFILE

and ETHTOOL_COALESCE_TX_PROFILE, because they are replaced by

DIM_PROFILE_RX and DIM_PROFILE_TX in the field profile_flags.*


Should I reuse ETHTOOL_COALESCE_RX_PROFILE and

ETHTOOL_A_COALESCE_TX_PROFILE in ethtool_ops->.supported_coalesce_params

and remove the field profile_flags from struct dim_irq_moder?

Or let ethnl_set_coalesce_validate not verify these two flags?

Is there a better solution?


Thanks!
Jakub Kicinski April 24, 2024, 3:52 p.m. UTC | #8
On Wed, 24 Apr 2024 21:10:55 +0800 Heng Qi wrote:
> >> This doesn't work because the first level of sub-nesting of
> >> ETHTOOL_A_COALESCE_RX_CQE_PROFILE is ETHTOOL_A_PROFILE_IRQ_MODERATION.  
> > So declare a policy for IRQ_MODERATION which has one entry -> nested
> > profile policy?  
> 
> Still doesn't work because one profile corresponds to 5 IRQ_MODERATION sub-nests.

I don't get it. Can you show the code you used and the failure / error
you get?

> In the same example, strset also uses NLA_NESTED.

Likely just because it's older code.
Jakub Kicinski April 24, 2024, 4:18 p.m. UTC | #9
On Wed, 24 Apr 2024 21:41:55 +0800 Heng Qi wrote:
> +struct dim_irq_moder {
> +       /* See DIM_PROFILE_* */
> +       u8 profile_flags;
> +
> +       /* See DIM_COALESCE_* for Rx and Tx */
> +       u8 coal_flags;
> +
> +       /* Rx DIM period count mode: CQE or EQE */
> +       u8 dim_rx_mode;
> +
> +       /* Tx DIM period count mode: CQE or EQE */
> +       u8 dim_tx_mode;
> +
> +       /* DIM profile list for Rx */
> +       struct dim_cq_moder *rx_profile;
> +
> +       /* DIM profile list for Tx */
> +       struct dim_cq_moder *tx_profile;
> +
> +       /* Rx DIM worker function scheduled by net_dim() */
> +       void (*rx_dim_work)(struct work_struct *work);
> +
> +       /* Tx DIM worker function scheduled by net_dim() */
> +       void (*tx_dim_work)(struct work_struct *work);
> +};
> +
> 
> .....
> 
> 
> +       .ndo_init_irq_moder     = virtnet_init_irq_moder,

The init callback mostly fills in static data, can we not
declare the driver information statically and move the init
code into the core?

> ....
> 
> 
> +static int virtnet_init_irq_moder(struct net_device *dev)
> +{
> +        struct virtnet_info *vi = netdev_priv(dev);
> +        struct dim_irq_moder *moder;
> +        int len;
> +
> +        if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> +                return 0;
> +
> +        dev->irq_moder = kzalloc(sizeof(*dev->irq_moder), GFP_KERNEL);
> +        if (!dev->irq_moder)
> +                goto err_moder;
> +
> +        moder = dev->irq_moder;
> +        len = NET_DIM_PARAMS_NUM_PROFILES * sizeof(*moder->rx_profile);
> +
> +        moder->profile_flags |= DIM_PROFILE_RX;
> +        moder->coal_flags |= DIM_COALESCE_USEC | DIM_COALESCE_PKTS;
> +        moder->dim_rx_mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> +
> +        moder->rx_dim_work = virtnet_rx_dim_work;
> +
> +        moder->rx_profile = kmemdup(dim_rx_profile[moder->dim_rx_mode],
> +                                   len, GFP_KERNEL);
> +        if (!moder->rx_profile)
> +                goto err_profile;
> +
> +        return 0;
> +
> +err_profile:
> +        kfree(moder);
> +err_moder:
> +        return -ENOMEM;
> +}
> +
> 
> ......
> 
> +void net_dim_setting(struct net_device *dev, struct dim *dim, bool is_tx)
> +{
> +       struct dim_irq_moder *irq_moder = dev->irq_moder;
> +
> +       if (!irq_moder)
> +               return;
> +
> +       if (is_tx) {
> +               INIT_WORK(&dim->work, irq_moder->tx_dim_work);
> +               dim->mode = irq_moder->dim_tx_mode;
> +               return;
> +       }
> +
> +       INIT_WORK(&dim->work, irq_moder->rx_dim_work);
> +       dim->mode = irq_moder->dim_rx_mode;
> +}
> 
> .....
> 
> +       for (i = 0; i < vi->max_queue_pairs; i++)
> +               net_dim_setting(vi->dev, &vi->rq[i].dim, false);
> 
> .....
> 
>      ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,          /* u32 */
> 
>      ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,           /* u32 */
> 
> + /* nest - _A_PROFILE_IRQ_MODERATION */
> 
> +  ETHTOOL_A_COALESCE_RX_PROFILE,
> 
> +   /* nest - _A_PROFILE_IRQ_MODERATION */
> +  ETHTOOL_A_COALESCE_TX_PROFILE,
> 
> ......
> 
> 
> Almost clear implementation, but the only problem is when I want to
> reuse "ethtool -C" and append ETHTOOL_A_COALESCE_RX_PROFILE
> and ETHTOOL_A_COALESCE_TX_PROFILE, *ethnl_set_coalesce_validate()
> will report an error because there are no ETHTOOL_COALESCE_RX_PROFILE
> and ETHTOOL_COALESCE_TX_PROFILE, because they are replaced by
> DIM_PROFILE_RX and DIM_PROFILE_TX in the field profile_flags.*

I see.

> Should I reuse ETHTOOL_COALESCE_RX_PROFILE and
> ETHTOOL_A_COALESCE_TX_PROFILE in ethtool_ops->.supported_coalesce_params
> and remove the field profile_flags from struct dim_irq_moder?
> Or let ethnl_set_coalesce_validate not verify these two flags?
> Is there a better solution?

Maybe create the bits but automatically add them for the driver?

diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 83112c1a71ae..56777d36f7f1 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -243,6 +243,8 @@ ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
 
 	/* make sure that only supported parameters are present */
 	supported_params = ops->supported_coalesce_params;
+	if (dev->moder->coal_flags ...)
+		supported_params |= ETHTOOL_COALESCE_...;
 	for (a = ETHTOOL_A_COALESCE_RX_USECS; a < __ETHTOOL_A_COALESCE_CNT; a++)
 		if (tb[a] && !(supported_params & attr_to_mask(a))) {
 			NL_SET_ERR_MSG_ATTR(info->extack, tb[a],
Heng Qi April 24, 2024, 4:49 p.m. UTC | #10
在 2024/4/25 上午12:18, Jakub Kicinski 写道:
> On Wed, 24 Apr 2024 21:41:55 +0800 Heng Qi wrote:
>> +struct dim_irq_moder {
>> +       /* See DIM_PROFILE_* */
>> +       u8 profile_flags;
>> +
>> +       /* See DIM_COALESCE_* for Rx and Tx */
>> +       u8 coal_flags;
>> +
>> +       /* Rx DIM period count mode: CQE or EQE */
>> +       u8 dim_rx_mode;
>> +
>> +       /* Tx DIM period count mode: CQE or EQE */
>> +       u8 dim_tx_mode;
>> +
>> +       /* DIM profile list for Rx */
>> +       struct dim_cq_moder *rx_profile;
>> +
>> +       /* DIM profile list for Tx */
>> +       struct dim_cq_moder *tx_profile;
>> +
>> +       /* Rx DIM worker function scheduled by net_dim() */
>> +       void (*rx_dim_work)(struct work_struct *work);
>> +
>> +       /* Tx DIM worker function scheduled by net_dim() */
>> +       void (*tx_dim_work)(struct work_struct *work);
>> +};
>> +
>>
>> .....
>>
>>
>> +       .ndo_init_irq_moder     = virtnet_init_irq_moder,
> The init callback mostly fills in static data, can we not
> declare the driver information statically and move the init
> code into the core?

Now the init callback is used as following

In dim.c:

+int net_dim_init_irq_moder(struct net_device *dev)
+{
+       if (dev->netdev_ops && dev->netdev_ops->ndo_init_irq_moder)
+               return dev->netdev_ops->ndo_init_irq_moder(dev);
+
+       return 0;
+}
+EXPORT_SYMBOL(net_dim_init_irq_moder);


In dev.c

@@ -10258,6 +10259,10 @@ int register_netdevice(struct net_device *dev)
         if (ret)
                 return ret;

+       ret = net_dim_init_irq_moder(dev);
+       if (ret)
+               return ret;
+
         spin_lock_init(&dev->addr_list_lock);
         netdev_set_addr_lockdep_class(dev);


The collected flags, mode, and work must obtain driver-specific

values from the driver. If I'm not wrong, you don't want an interface

like .ndo_init_irq_moder, but instead provide a generic interface in

dim.c (e.g. net_dim_init_irq_moder() with parameters dev and struct

dim_irq_moder or separate flags,mode,work). Then this func is called

by the driver in the probe phase?


>> ....
>>
>>
>> +static int virtnet_init_irq_moder(struct net_device *dev)
>> +{
>> +        struct virtnet_info *vi = netdev_priv(dev);
>> +        struct dim_irq_moder *moder;
>> +        int len;
>> +
>> +        if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
>> +                return 0;
>> +
>> +        dev->irq_moder = kzalloc(sizeof(*dev->irq_moder), GFP_KERNEL);
>> +        if (!dev->irq_moder)
>> +                goto err_moder;
>> +
>> +        moder = dev->irq_moder;
>> +        len = NET_DIM_PARAMS_NUM_PROFILES * sizeof(*moder->rx_profile);
>> +
>> +        moder->profile_flags |= DIM_PROFILE_RX;
>> +        moder->coal_flags |= DIM_COALESCE_USEC | DIM_COALESCE_PKTS;
>> +        moder->dim_rx_mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
>> +
>> +        moder->rx_dim_work = virtnet_rx_dim_work;
>> +
>> +        moder->rx_profile = kmemdup(dim_rx_profile[moder->dim_rx_mode],
>> +                                   len, GFP_KERNEL);
>> +        if (!moder->rx_profile)
>> +                goto err_profile;
>> +
>> +        return 0;
>> +
>> +err_profile:
>> +        kfree(moder);
>> +err_moder:
>> +        return -ENOMEM;
>> +}
>> +
>>
>> ......
>>
>> +void net_dim_setting(struct net_device *dev, struct dim *dim, bool is_tx)
>> +{
>> +       struct dim_irq_moder *irq_moder = dev->irq_moder;
>> +
>> +       if (!irq_moder)
>> +               return;
>> +
>> +       if (is_tx) {
>> +               INIT_WORK(&dim->work, irq_moder->tx_dim_work);
>> +               dim->mode = irq_moder->dim_tx_mode;
>> +               return;
>> +       }
>> +
>> +       INIT_WORK(&dim->work, irq_moder->rx_dim_work);
>> +       dim->mode = irq_moder->dim_rx_mode;
>> +}
>>
>> .....
>>
>> +       for (i = 0; i < vi->max_queue_pairs; i++)
>> +               net_dim_setting(vi->dev, &vi->rq[i].dim, false);
>>
>> .....
>>
>>       ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,          /* u32 */
>>
>>       ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,           /* u32 */
>>
>> + /* nest - _A_PROFILE_IRQ_MODERATION */
>>
>> +  ETHTOOL_A_COALESCE_RX_PROFILE,
>>
>> +   /* nest - _A_PROFILE_IRQ_MODERATION */
>> +  ETHTOOL_A_COALESCE_TX_PROFILE,
>>
>> ......
>>
>>
>> Almost clear implementation, but the only problem is when I want to
>> reuse "ethtool -C" and append ETHTOOL_A_COALESCE_RX_PROFILE
>> and ETHTOOL_A_COALESCE_TX_PROFILE, *ethnl_set_coalesce_validate()
>> will report an error because there are no ETHTOOL_COALESCE_RX_PROFILE
>> and ETHTOOL_COALESCE_TX_PROFILE, because they are replaced by
>> DIM_PROFILE_RX and DIM_PROFILE_TX in the field profile_flags.*
> I see.
>
>> Should I reuse ETHTOOL_COALESCE_RX_PROFILE and
>> ETHTOOL_A_COALESCE_TX_PROFILE in ethtool_ops->.supported_coalesce_params
>> and remove the field profile_flags from struct dim_irq_moder?
>> Or let ethnl_set_coalesce_validate not verify these two flags?
>> Is there a better solution?
> Maybe create the bits but automatically add them for the driver?


Ok. I think it works.


Thanks!


> diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
> index 83112c1a71ae..56777d36f7f1 100644
> --- a/net/ethtool/coalesce.c
> +++ b/net/ethtool/coalesce.c
> @@ -243,6 +243,8 @@ ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
>   
>   	/* make sure that only supported parameters are present */
>   	supported_params = ops->supported_coalesce_params;
> +	if (dev->moder->coal_flags ...)
> +		supported_params |= ETHTOOL_COALESCE_...;
>   	for (a = ETHTOOL_A_COALESCE_RX_USECS; a < __ETHTOOL_A_COALESCE_CNT; a++)
>   		if (tb[a] && !(supported_params & attr_to_mask(a))) {
>   			NL_SET_ERR_MSG_ATTR(info->extack, tb[a],
Jakub Kicinski April 24, 2024, 11:59 p.m. UTC | #11
On Thu, 25 Apr 2024 00:49:26 +0800 Heng Qi wrote:
> If I'm not wrong, you don't want an interface
> like .ndo_init_irq_moder, but instead provide a generic interface in
> dim.c (e.g. net_dim_init_irq_moder() with parameters dev and struct
> dim_irq_moder or separate flags,mode,work). Then this func is called
> by the driver in the probe phase?

That's right.
diff mbox series

Patch

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 87ae7b397984..8165b598dab7 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -413,6 +413,18 @@  attribute-sets:
       -
         name: combined-count
         type: u32
+  -
+    name: moderation
+    attributes:
+      -
+        name: usec
+        type: u32
+      -
+        name: pkts
+        type: u32
+      -
+        name: comps
+        type: u32
 
   -
     name: coalesce
@@ -502,6 +514,23 @@  attribute-sets:
       -
         name: tx-aggr-time-usecs
         type: u32
+      -
+        name: rx-eqe-profile
+        type: nest
+        nested-attributes: moderation
+      -
+        name: rx-cqe-profile
+        type: nest
+        nested-attributes: moderation
+      -
+        name: tx-eqe-profile
+        type: nest
+        nested-attributes: moderation
+      -
+        name: tx-cqe-profile
+        type: nest
+        nested-attributes: moderation
+
   -
     name: pause-stat
     attributes:
@@ -1313,6 +1342,10 @@  operations:
             - tx-aggr-max-bytes
             - tx-aggr-max-frames
             - tx-aggr-time-usecs
+            - rx-eqe-profile
+            - rx-cqe-profile
+            - tx-eqe-profile
+            - tx-cqe-profile
       dump: *coalesce-get-op
     -
       name: coalesce-set
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 4e63d3708ed9..98a619198465 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1040,6 +1040,10 @@  Kernel response contents:
   ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``     u32     max aggr size, Tx
   ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES``    u32     max aggr packets, Tx
   ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
+  ``ETHTOOL_A_COALESCE_RX_EQE_PROFILE``        nested  profile of DIM EQE, Rx
+  ``ETHTOOL_A_COALESCE_RX_CQE_PROFILE``        nested  profile of DIM CQE, Rx
+  ``ETHTOOL_A_COALESCE_TX_EQE_PROFILE``        nested  profile of DIM EQE, Tx
+  ``ETHTOOL_A_COALESCE_TX_CQE_PROFILE``        nested  profile of DIM CQE, Tx
   ===========================================  ======  =======================
 
 Attributes are only included in reply if their value is not zero or the
@@ -1105,6 +1109,10 @@  Request contents:
   ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``     u32     max aggr size, Tx
   ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES``    u32     max aggr packets, Tx
   ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
+  ``ETHTOOL_A_COALESCE_RX_EQE_PROFILE``        nested  profile of DIM EQE, Rx
+  ``ETHTOOL_A_COALESCE_RX_CQE_PROFILE``        nested  profile of DIM CQE, Rx
+  ``ETHTOOL_A_COALESCE_TX_EQE_PROFILE``        nested  profile of DIM EQE, Tx
+  ``ETHTOOL_A_COALESCE_TX_CQE_PROFILE``        nested  profile of DIM CQE, Tx
   ===========================================  ======  =======================
 
 Request is rejected if it attributes declared as unsupported by driver (i.e.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 6fd9107d3cc0..614a113eda29 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -284,7 +284,11 @@  bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 #define ETHTOOL_COALESCE_TX_AGGR_MAX_BYTES	BIT(24)
 #define ETHTOOL_COALESCE_TX_AGGR_MAX_FRAMES	BIT(25)
 #define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS	BIT(26)
-#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(26, 0)
+#define ETHTOOL_COALESCE_RX_EQE_PROFILE         BIT(27)
+#define ETHTOOL_COALESCE_RX_CQE_PROFILE         BIT(28)
+#define ETHTOOL_COALESCE_TX_EQE_PROFILE         BIT(29)
+#define ETHTOOL_COALESCE_TX_CQE_PROFILE         BIT(30)
+#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(30, 0)
 
 #define ETHTOOL_COALESCE_USECS						\
 	(ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_TX_USECS)
@@ -316,6 +320,11 @@  bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 	(ETHTOOL_COALESCE_TX_AGGR_MAX_BYTES |	\
 	 ETHTOOL_COALESCE_TX_AGGR_MAX_FRAMES |	\
 	 ETHTOOL_COALESCE_TX_AGGR_TIME_USECS)
+#define ETHTOOL_COALESCE_PROFILE		\
+	(ETHTOOL_COALESCE_RX_EQE_PROFILE |	\
+	 ETHTOOL_COALESCE_RX_CQE_PROFILE |	\
+	 ETHTOOL_COALESCE_TX_EQE_PROFILE |	\
+	 ETHTOOL_COALESCE_TX_CQE_PROFILE)
 
 #define ETHTOOL_STAT_NOT_SET	(~0ULL)
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d45f330d083d..a1c7e9c2be86 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -80,6 +80,25 @@  struct xdp_frame;
 struct xdp_metadata_ops;
 struct xdp_md;
 
+#if IS_ENABLED(CONFIG_DIMLIB)
+struct dim_cq_moder;
+
+#define NETDEV_PROFILE_USEC	BIT(0)	/* device supports usec field modification */
+#define NETDEV_PROFILE_PKTS	BIT(1)	/* device supports pkts field modification */
+#define NETDEV_PROFILE_COMPS	BIT(2)	/* device supports comps field modification */
+
+struct netdev_profile_moder {
+	/* See NETDEV_PROFILE_* */
+	unsigned int flags;
+
+	/* DIM profile lists for different dim cq modes */
+	struct dim_cq_moder *rx_eqe_profile;
+	struct dim_cq_moder *rx_cqe_profile;
+	struct dim_cq_moder *tx_eqe_profile;
+	struct dim_cq_moder *tx_cqe_profile;
+};
+#endif
+
 typedef u32 xdp_features_t;
 
 void synchronize_net(void);
@@ -2400,6 +2419,11 @@  struct net_device {
 	/** @page_pools: page pools created for this netdevice */
 	struct hlist_head	page_pools;
 #endif
+
+#if IS_ENABLED(CONFIG_DIMLIB)
+	/** @moderation: dim tunable parameters for this netdevice */
+	struct netdev_profile_moder	*moderation;
+#endif
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index b4f0d233d048..d4c6e30a55cb 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -416,12 +416,36 @@  enum {
 	ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES,		/* u32 */
 	ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,		/* u32 */
 	ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,		/* u32 */
+	ETHTOOL_A_COALESCE_RX_EQE_PROFILE,              /* nest - _A_MODERATIONS_MODERATION */
+	ETHTOOL_A_COALESCE_RX_CQE_PROFILE,              /* nest - _A_MODERATIONS_MODERATION */
+	ETHTOOL_A_COALESCE_TX_EQE_PROFILE,              /* nest - _A_MODERATIONS_MODERATION */
+	ETHTOOL_A_COALESCE_TX_CQE_PROFILE,              /* nest - _A_MODERATIONS_MODERATION */
 
 	/* add new constants above here */
 	__ETHTOOL_A_COALESCE_CNT,
 	ETHTOOL_A_COALESCE_MAX = (__ETHTOOL_A_COALESCE_CNT - 1)
 };
 
+enum {
+	ETHTOOL_A_MODERATIONS_UNSPEC,
+	ETHTOOL_A_MODERATIONS_MODERATION,		/* nest, _A_MODERATION_* */
+
+	/* add new constants above here */
+	__ETHTOOL_A_MODERATIONS_CNT,
+	ETHTOOL_A_MODERATIONS_MAX = (__ETHTOOL_A_MODERATIONS_CNT - 1)
+};
+
+enum {
+	ETHTOOL_A_MODERATION_UNSPEC,
+	ETHTOOL_A_MODERATION_USEC,			/* u32 */
+	ETHTOOL_A_MODERATION_PKTS,			/* u32 */
+	ETHTOOL_A_MODERATION_COMPS,			/* u32 */
+
+	/* add new constants above here */
+	__ETHTOOL_A_MODERATION_CNT,
+	ETHTOOL_A_MODERATION_MAX = (__ETHTOOL_A_MODERATION_CNT - 1)
+};
+
 /* PAUSE */
 
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index 854a3a28a8d8..a30287279fcc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -96,6 +96,7 @@ 
 #include <linux/kthread.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
+#include <linux/dim.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/busy_poll.h>
@@ -10229,6 +10230,57 @@  static void netdev_do_free_pcpu_stats(struct net_device *dev)
 	}
 }
 
+static int dev_dim_profile_init(struct net_device *dev)
+{
+#if IS_ENABLED(CONFIG_DIMLIB)
+	u32 supported = dev->ethtool_ops->supported_coalesce_params;
+	struct netdev_profile_moder *moder;
+	int length;
+
+	dev->moderation = kzalloc(sizeof(*dev->moderation), GFP_KERNEL);
+	if (!dev->moderation)
+		goto err_moder;
+
+	moder = dev->moderation;
+	length = NET_DIM_PARAMS_NUM_PROFILES * sizeof(*moder->rx_eqe_profile);
+
+	if (supported & ETHTOOL_COALESCE_RX_EQE_PROFILE) {
+		moder->rx_eqe_profile = kmemdup(dim_rx_profile[0], length, GFP_KERNEL);
+		if (!moder->rx_eqe_profile)
+			goto err_rx_eqe;
+	}
+	if (supported & ETHTOOL_COALESCE_RX_CQE_PROFILE) {
+		moder->rx_cqe_profile = kmemdup(dim_rx_profile[1], length, GFP_KERNEL);
+		if (!moder->rx_cqe_profile)
+			goto err_rx_cqe;
+	}
+	if (supported & ETHTOOL_COALESCE_TX_EQE_PROFILE) {
+		moder->tx_eqe_profile = kmemdup(dim_tx_profile[0], length, GFP_KERNEL);
+		if (!moder->tx_eqe_profile)
+			goto err_tx_eqe;
+	}
+	if (supported & ETHTOOL_COALESCE_TX_CQE_PROFILE) {
+		moder->tx_cqe_profile = kmemdup(dim_tx_profile[1], length, GFP_KERNEL);
+		if (!moder->tx_cqe_profile)
+			goto err_tx_cqe;
+	}
+#endif
+	return 0;
+
+#if IS_ENABLED(CONFIG_DIMLIB)
+err_tx_cqe:
+	kfree(moder->tx_eqe_profile);
+err_tx_eqe:
+	kfree(moder->rx_cqe_profile);
+err_rx_cqe:
+	kfree(moder->rx_eqe_profile);
+err_rx_eqe:
+	kfree(moder);
+err_moder:
+	return -ENOMEM;
+#endif
+}
+
 /**
  * register_netdevice() - register a network device
  * @dev: device to register
@@ -10258,6 +10310,10 @@  int register_netdevice(struct net_device *dev)
 	if (ret)
 		return ret;
 
+	ret = dev_dim_profile_init(dev);
+	if (ret)
+		return ret;
+
 	spin_lock_init(&dev->addr_list_lock);
 	netdev_set_addr_lockdep_class(dev);
 
@@ -11011,6 +11067,27 @@  struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 }
 EXPORT_SYMBOL(alloc_netdev_mqs);
 
+static void netif_free_profile(struct net_device *dev)
+{
+#if IS_ENABLED(CONFIG_DIMLIB)
+	u32 supported = dev->ethtool_ops->supported_coalesce_params;
+
+	if (supported & ETHTOOL_COALESCE_RX_EQE_PROFILE)
+		kfree(dev->moderation->rx_eqe_profile);
+
+	if (supported & ETHTOOL_COALESCE_RX_CQE_PROFILE)
+		kfree(dev->moderation->rx_cqe_profile);
+
+	if (supported & ETHTOOL_COALESCE_TX_EQE_PROFILE)
+		kfree(dev->moderation->tx_eqe_profile);
+
+	if (supported & ETHTOOL_COALESCE_TX_CQE_PROFILE)
+		kfree(dev->moderation->tx_cqe_profile);
+
+	kfree(dev->moderation);
+#endif
+}
+
 /**
  * free_netdev - free network device
  * @dev: device
@@ -11036,6 +11113,8 @@  void free_netdev(struct net_device *dev)
 		return;
 	}
 
+	netif_free_profile(dev);
+
 	netif_free_tx_queues(dev);
 	netif_free_rx_queues(dev);
 
diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 83112c1a71ae..3a41840fbcc7 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -1,5 +1,6 @@ 
 // SPDX-License-Identifier: GPL-2.0-only
 
+#include <linux/dim.h>
 #include "netlink.h"
 #include "common.h"
 
@@ -51,6 +52,10 @@  __CHECK_SUPPORTED_OFFSET(COALESCE_RX_MAX_FRAMES_HIGH);
 __CHECK_SUPPORTED_OFFSET(COALESCE_TX_USECS_HIGH);
 __CHECK_SUPPORTED_OFFSET(COALESCE_TX_MAX_FRAMES_HIGH);
 __CHECK_SUPPORTED_OFFSET(COALESCE_RATE_SAMPLE_INTERVAL);
+__CHECK_SUPPORTED_OFFSET(COALESCE_RX_EQE_PROFILE);
+__CHECK_SUPPORTED_OFFSET(COALESCE_RX_CQE_PROFILE);
+__CHECK_SUPPORTED_OFFSET(COALESCE_TX_EQE_PROFILE);
+__CHECK_SUPPORTED_OFFSET(COALESCE_TX_CQE_PROFILE);
 
 const struct nla_policy ethnl_coalesce_get_policy[] = {
 	[ETHTOOL_A_COALESCE_HEADER]		=
@@ -82,6 +87,14 @@  static int coalesce_prepare_data(const struct ethnl_req_info *req_base,
 static int coalesce_reply_size(const struct ethnl_req_info *req_base,
 			       const struct ethnl_reply_data *reply_base)
 {
+	int modersz = nla_total_size(0) + /* _MODERATIONS_MODERATION, nest */
+		      nla_total_size(sizeof(u32)) + /* _MODERATION_USEC */
+		      nla_total_size(sizeof(u32)) + /* _MODERATION_PKTS */
+		      nla_total_size(sizeof(u32));  /* _MODERATION_COMPS */
+
+	int total_modersz = nla_total_size(0) +  /* _{R,T}X_{E,C}QE_PROFILE, nest */
+			modersz * NET_DIM_PARAMS_NUM_PROFILES;
+
 	return nla_total_size(sizeof(u32)) +	/* _RX_USECS */
 	       nla_total_size(sizeof(u32)) +	/* _RX_MAX_FRAMES */
 	       nla_total_size(sizeof(u32)) +	/* _RX_USECS_IRQ */
@@ -108,7 +121,8 @@  static int coalesce_reply_size(const struct ethnl_req_info *req_base,
 	       nla_total_size(sizeof(u8)) +	/* _USE_CQE_MODE_RX */
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_BYTES */
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_FRAMES */
-	       nla_total_size(sizeof(u32));	/* _TX_AGGR_TIME_USECS */
+	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_TIME_USECS */
+	       total_modersz * 4;		/* _{R,T}X_{E,C}QE_PROFILE */
 }
 
 static bool coalesce_put_u32(struct sk_buff *skb, u16 attr_type, u32 val,
@@ -127,6 +141,62 @@  static bool coalesce_put_bool(struct sk_buff *skb, u16 attr_type, u32 val,
 	return nla_put_u8(skb, attr_type, !!val);
 }
 
+#if IS_ENABLED(CONFIG_DIMLIB)
+/**
+ * coalesce_put_profile - fill reply with a nla nest with four child nla nests.
+ * @skb: socket buffer the message is stored in
+ * @attr_type: nest attr type ETHTOOL_A_COALESCE_*X_*QE_PROFILE
+ * @profile: data passed to userspace
+ * @supported_params: modifiable parameters supported by the driver
+ *
+ * Put a dim profile nest attribute. Refer to ETHTOOL_A_MODERATIONS_MODERATION.
+ *
+ * Return: false to indicate successful placement or no placement, and
+ * true to pass the -EMSGSIZE error to the wrapper.
+ */
+static bool coalesce_put_profile(struct sk_buff *skb, u16 attr_type,
+				 const struct dim_cq_moder *profile,
+				 u32 supported_params)
+{
+	struct nlattr *profile_attr, *moder_attr;
+	bool emsg = !!-EMSGSIZE;
+	int i;
+
+	if (!profile)
+		return false;
+
+	if (!(supported_params & attr_to_mask(attr_type)))
+		return false;
+
+	profile_attr = nla_nest_start(skb, attr_type);
+	if (!profile_attr)
+		return emsg;
+
+	for (i = 0; i < NET_DIM_PARAMS_NUM_PROFILES; i++) {
+		moder_attr = nla_nest_start(skb, ETHTOOL_A_MODERATIONS_MODERATION);
+		if (!moder_attr)
+			goto nla_cancel_profile;
+
+		if (nla_put_u32(skb, ETHTOOL_A_MODERATION_USEC, profile[i].usec) ||
+		    nla_put_u32(skb, ETHTOOL_A_MODERATION_PKTS, profile[i].pkts) ||
+		    nla_put_u32(skb, ETHTOOL_A_MODERATION_COMPS, profile[i].comps))
+			goto nla_cancel_moder;
+
+		nla_nest_end(skb, moder_attr);
+	}
+
+	nla_nest_end(skb, profile_attr);
+
+	return 0;
+
+nla_cancel_moder:
+	nla_nest_cancel(skb, moder_attr);
+nla_cancel_profile:
+	nla_nest_cancel(skb, profile_attr);
+	return emsg;
+}
+#endif
+
 static int coalesce_fill_reply(struct sk_buff *skb,
 			       const struct ethnl_req_info *req_base,
 			       const struct ethnl_reply_data *reply_base)
@@ -134,6 +204,9 @@  static int coalesce_fill_reply(struct sk_buff *skb,
 	const struct coalesce_reply_data *data = COALESCE_REPDATA(reply_base);
 	const struct kernel_ethtool_coalesce *kcoal = &data->kernel_coalesce;
 	const struct ethtool_coalesce *coal = &data->coalesce;
+#if IS_ENABLED(CONFIG_DIMLIB)
+	struct net_device *dev = req_base->dev;
+#endif
 	u32 supported = data->supported_params;
 
 	if (coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_USECS,
@@ -192,6 +265,21 @@  static int coalesce_fill_reply(struct sk_buff *skb,
 			     kcoal->tx_aggr_time_usecs, supported))
 		return -EMSGSIZE;
 
+#if IS_ENABLED(CONFIG_DIMLIB)
+	if (!(dev->moderation->flags & (NETDEV_PROFILE_USEC | NETDEV_PROFILE_PKTS |
+					NETDEV_PROFILE_COMPS)))
+		return 0;
+
+	if (coalesce_put_profile(skb, ETHTOOL_A_COALESCE_RX_EQE_PROFILE,
+				 dev->moderation->rx_eqe_profile, supported) ||
+	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_RX_CQE_PROFILE,
+				 dev->moderation->rx_cqe_profile, supported) ||
+	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_TX_EQE_PROFILE,
+				 dev->moderation->tx_eqe_profile, supported) ||
+	    coalesce_put_profile(skb, ETHTOOL_A_COALESCE_TX_CQE_PROFILE,
+				 dev->moderation->tx_cqe_profile, supported))
+		return -EMSGSIZE;
+#endif
 	return 0;
 }
 
@@ -227,7 +315,19 @@  const struct nla_policy ethnl_coalesce_set_policy[] = {
 	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_RX_EQE_PROFILE]     = { .type = NLA_NESTED },
+	[ETHTOOL_A_COALESCE_RX_CQE_PROFILE]     = { .type = NLA_NESTED },
+	[ETHTOOL_A_COALESCE_TX_EQE_PROFILE]     = { .type = NLA_NESTED },
+	[ETHTOOL_A_COALESCE_TX_CQE_PROFILE]     = { .type = NLA_NESTED },
+};
+
+#if IS_ENABLED(CONFIG_DIMLIB)
+static const struct nla_policy coalesce_set_profile_policy[] = {
+	[ETHTOOL_A_MODERATION_USEC]	= {.type = NLA_U32},
+	[ETHTOOL_A_MODERATION_PKTS]	= {.type = NLA_U32},
+	[ETHTOOL_A_MODERATION_COMPS]	= {.type = NLA_U32},
 };
+#endif
 
 static int
 ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
@@ -253,6 +353,76 @@  ethnl_set_coalesce_validate(struct ethnl_req_info *req_info,
 	return 1;
 }
 
+#if IS_ENABLED(CONFIG_DIMLIB)
+/**
+ * ethnl_update_profile - get a nla nest with four child nla nests from userspace.
+ * @dev: netdevice to update the profile
+ * @dst: data get from the driver and modified by ethnl_update_profile.
+ * @nests: nest attr ETHTOOL_A_COALESCE_*X_*QE_PROFILE to set driver's profile.
+ * @extack: Netlink extended ack
+ *
+ * Layout of nests:
+ *   Nested ETHTOOL_A_COALESCE_*X_*QE_PROFILE attr
+ *     Nested ETHTOOL_A_MODERATIONS_MODERATION attr
+ *       ETHTOOL_A_MODERATION_USEC attr
+ *       ETHTOOL_A_MODERATION_PKTS attr
+ *       ETHTOOL_A_MODERATION_COMPS attr
+ *     ...
+ *     Nested ETHTOOL_A_MODERATIONS_MODERATION attr
+ *       ETHTOOL_A_MODERATION_USEC attr
+ *       ETHTOOL_A_MODERATION_PKTS attr
+ *       ETHTOOL_A_MODERATION_COMPS attr
+ *
+ * Return: 0 on success or a negative error code.
+ */
+static int ethnl_update_profile(struct net_device *dev,
+				struct dim_cq_moder *dst,
+				const struct nlattr *nests,
+				struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb_moder[ARRAY_SIZE(coalesce_set_profile_policy)];
+	struct dim_cq_moder profile[NET_DIM_PARAMS_NUM_PROFILES];
+	struct netdev_profile_moder *moder = dev->moderation;
+	struct nlattr *nest;
+	int ret, rem, i = 0;
+
+	if (!nests)
+		return 0;
+
+	if (!dst)
+		return -EOPNOTSUPP;
+
+	nla_for_each_nested_type(nest, ETHTOOL_A_MODERATIONS_MODERATION, nests, rem) {
+		ret = nla_parse_nested(tb_moder,
+				       ARRAY_SIZE(coalesce_set_profile_policy) - 1,
+				       nest, coalesce_set_profile_policy,
+				       extack);
+		if (ret)
+			return ret;
+
+		if (NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_USEC) ||
+		    NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_PKTS) ||
+		    NL_REQ_ATTR_CHECK(extack, nest, tb_moder, ETHTOOL_A_MODERATION_COMPS))
+			return -EINVAL;
+
+		profile[i].usec = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_USEC]);
+		profile[i].pkts = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_PKTS]);
+		profile[i].comps = nla_get_u32(tb_moder[ETHTOOL_A_MODERATION_COMPS]);
+
+		if ((dst[i].usec != profile[i].usec && !(moder->flags & NETDEV_PROFILE_USEC)) ||
+		    (dst[i].pkts != profile[i].pkts && !(moder->flags & NETDEV_PROFILE_PKTS)) ||
+		    (dst[i].comps != profile[i].comps && !(moder->flags & NETDEV_PROFILE_COMPS)))
+			return -EOPNOTSUPP;
+
+		i++;
+	}
+
+	memcpy(dst, profile, sizeof(profile));
+
+	return 0;
+}
+#endif
+
 static int
 __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
 		     bool *dual_change)
@@ -317,6 +487,35 @@  __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
 	ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
 			 tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
 
+#if IS_ENABLED(CONFIG_DIMLIB)
+	ret = ethnl_update_profile(dev, dev->moderation->rx_eqe_profile,
+				   tb[ETHTOOL_A_COALESCE_RX_EQE_PROFILE],
+				   info->extack);
+	if (ret < 0)
+		return ret;
+	ret = ethnl_update_profile(dev, dev->moderation->rx_cqe_profile,
+				   tb[ETHTOOL_A_COALESCE_RX_CQE_PROFILE],
+				   info->extack);
+	if (ret < 0)
+		return ret;
+	ret = ethnl_update_profile(dev, dev->moderation->tx_eqe_profile,
+				   tb[ETHTOOL_A_COALESCE_TX_EQE_PROFILE],
+				   info->extack);
+	if (ret < 0)
+		return ret;
+	ret = ethnl_update_profile(dev, dev->moderation->tx_cqe_profile,
+				   tb[ETHTOOL_A_COALESCE_TX_CQE_PROFILE],
+				   info->extack);
+	if (ret < 0)
+		return ret;
+#else
+	if (tb[ETHTOOL_A_COALESCE_RX_EQE_PROFILE] ||
+	    tb[ETHTOOL_A_COALESCE_RX_CQE_PROFILE] ||
+	    tb[ETHTOOL_A_COALESCE_TX_EQE_PROFILE] ||
+	    tb[ETHTOOL_A_COALESCE_TX_CQE_PROFILE])
+		return -EOPNOTSUPP;
+
+#endif
 	/* Update operation modes */
 	ethnl_update_bool32(&coalesce.use_adaptive_rx_coalesce,
 			    tb[ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX], &mod_mode);