[iwl-next,v4,0/5] iavf: Add devlink and devlink rate support

Message ID 20230822034003.31628-1-wenjun1.wu@intel.com (mailing list archive)

Message

Wenjun Wu Aug. 22, 2023, 3:39 a.m. UTC
To allow users to configure queue bandwidth, devlink port support
is added to enable the devlink port rate API. [1]

Add devlink framework registration/unregistration on iavf driver
initialization and removal, and create a devlink port of
DEVLINK_PORT_FLAVOUR_VIRTUAL associated with the iavf netdevice.

An iavf rate tree with a root node, queue nodes, and a leaf node is created
and registered with devlink rate when the iavf adapter is configured and
the PF indicates support of VIRTCHNL_VF_OFFLOAD_QOS through the VF Resource /
Capability Exchange.

[root@localhost ~]# devlink port function rate show
pci/0000:af:01.0/txq_15: type node parent iavf_root
pci/0000:af:01.0/txq_14: type node parent iavf_root
pci/0000:af:01.0/txq_13: type node parent iavf_root
pci/0000:af:01.0/txq_12: type node parent iavf_root
pci/0000:af:01.0/txq_11: type node parent iavf_root
pci/0000:af:01.0/txq_10: type node parent iavf_root
pci/0000:af:01.0/txq_9: type node parent iavf_root
pci/0000:af:01.0/txq_8: type node parent iavf_root
pci/0000:af:01.0/txq_7: type node parent iavf_root
pci/0000:af:01.0/txq_6: type node parent iavf_root
pci/0000:af:01.0/txq_5: type node parent iavf_root
pci/0000:af:01.0/txq_4: type node parent iavf_root
pci/0000:af:01.0/txq_3: type node parent iavf_root
pci/0000:af:01.0/txq_2: type node parent iavf_root
pci/0000:af:01.0/txq_1: type node parent iavf_root
pci/0000:af:01.0/txq_0: type node parent iavf_root
pci/0000:af:01.0/iavf_root: type node


                         +---------+
                         |   root  |
                         +----+----+
                              |
            |-----------------|-----------------|
       +----v----+       +----v----+       +----v----+
       |  txq_0  |       |  txq_1  |       |  txq_x  |
       +----+----+       +----+----+       +----+----+

Users can configure the tx_max and tx_share of each queue. Once any one of the
queues is fully configured, the VIRTCHNL opcodes VIRTCHNL_OP_CONFIG_QUEUE_BW
and VIRTCHNL_OP_CONFIG_QUANTA will be sent to the PF to configure the queues
allocated to the VF.
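
For reference, a rough sketch of the kind of flex-array message the VF could
send for VIRTCHNL_OP_CONFIG_QUEUE_BW (a sketch only; the struct and field
names below are illustrative assumptions, not the exact definitions this
series adds to include/linux/avf/virtchnl.h):

#include <linux/types.h>

/* Illustrative sketch only: names and layout are assumptions. */
struct sketch_queue_bw {
        u32 queue_id;
        u32 tx_share_kbps;      /* committed rate, set via devlink tx_share */
        u32 tx_max_kbps;        /* peak rate, set via devlink tx_max */
};

struct sketch_queues_bw_cfg {
        u16 vsi_id;
        u16 num_queues;
        struct sketch_queue_bw cfg[];   /* one entry per configured queue */
};

/* The VF would size such a message with struct_size(msg, cfg, num_queues)
 * and send it to the PF once a queue's tx_share and tx_max are set.
 */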

Example:

1. To set the queue tx_share:
devlink port function rate set pci/0000:af:01.0 txq_0 tx_share 100 MBps

2. To set the queue tx_max:
devlink port function rate set pci/0000:af:01.0 txq_0 tx_max 200 MBps

3. To show the current devlink port rate info:
devlink port function rate show
[root@localhost ~]# devlink port function rate show
pci/0000:af:01.0/txq_15: type node parent iavf_root
pci/0000:af:01.0/txq_14: type node parent iavf_root
pci/0000:af:01.0/txq_13: type node parent iavf_root
pci/0000:af:01.0/txq_12: type node parent iavf_root
pci/0000:af:01.0/txq_11: type node parent iavf_root
pci/0000:af:01.0/txq_10: type node parent iavf_root
pci/0000:af:01.0/txq_9: type node parent iavf_root
pci/0000:af:01.0/txq_8: type node parent iavf_root
pci/0000:af:01.0/txq_7: type node parent iavf_root
pci/0000:af:01.0/txq_6: type node parent iavf_root
pci/0000:af:01.0/txq_5: type node parent iavf_root
pci/0000:af:01.0/txq_4: type node parent iavf_root
pci/0000:af:01.0/txq_3: type node parent iavf_root
pci/0000:af:01.0/txq_2: type node parent iavf_root
pci/0000:af:01.0/txq_1: type node parent iavf_root
pci/0000:af:01.0/txq_0: type node tx_share 800Mbit tx_max 1600Mbit parent iavf_root
pci/0000:af:01.0/iavf_root: type node


[1] https://lore.kernel.org/netdev/20221115104825.172668-1-michal.wilczynski@intel.com/

Change log:

v4:
- Rearrange the ice_vf_qs_bw structure, put the largest number first
- Minimize the scope of values
- Remove the unnecessary brackets
- Remove the unnecessary memory allocation.
- Added Error Code and moved devlink registration before aq lock initialization
- Changed devlink registration for error handling in case of allocation failure
- Used kcalloc for object array memory allocation and initialization
- Changed functions & comments for readability

v3:
- Rebase the code
- Changed rate node max/share set function description
- Put variable in local scope

v2:
- Change static array to flex array
- Use struct_size helper
- Align all the error code types in the function
- Move the register field definitions to the right place in the file
- Fix coding style
- Adapted to queue bw cfg and qos cap list virtchnl message with flex array fields
---

Jun Zhang (3):
  iavf: Add devlink and devlink port support
  iavf: Add devlink port function rate API support
  iavf: Add VIRTCHNL Opcodes Support for Queue bw Setting

Wenjun Wu (2):
  virtchnl: support queue rate limit and quanta size configuration
  ice: Support VF queue rate limit and quanta size configuration

 drivers/net/ethernet/intel/Kconfig            |   1 +
 drivers/net/ethernet/intel/iavf/Makefile      |   2 +-
 drivers/net/ethernet/intel/iavf/iavf.h        |  19 +
 .../net/ethernet/intel/iavf/iavf_devlink.c    | 377 ++++++++++++++++++
 .../net/ethernet/intel/iavf/iavf_devlink.h    |  38 ++
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  64 ++-
 .../net/ethernet/intel/iavf/iavf_virtchnl.c   | 231 ++++++++++-
 drivers/net/ethernet/intel/ice/ice.h          |   2 +
 drivers/net/ethernet/intel/ice/ice_base.c     |   2 +
 drivers/net/ethernet/intel/ice/ice_common.c   |  19 +
 .../net/ethernet/intel/ice/ice_hw_autogen.h   |   8 +
 drivers/net/ethernet/intel/ice/ice_txrx.h     |   2 +
 drivers/net/ethernet/intel/ice/ice_type.h     |   1 +
 drivers/net/ethernet/intel/ice/ice_vf_lib.h   |   9 +
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 310 ++++++++++++++
 drivers/net/ethernet/intel/ice/ice_virtchnl.h |  11 +
 .../intel/ice/ice_virtchnl_allowlist.c        |   6 +
 include/linux/avf/virtchnl.h                  | 119 ++++++
 18 files changed, 1218 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/iavf/iavf_devlink.c
 create mode 100644 drivers/net/ethernet/intel/iavf/iavf_devlink.h

Comments

Jiri Pirko Aug. 22, 2023, 6:12 a.m. UTC | #1
Tue, Aug 22, 2023 at 05:39:58AM CEST, wenjun1.wu@intel.com wrote:
>To allow user to configure queue bandwidth, devlink port support
>is added to support devlink port rate API. [1]
>
>Add devlink framework registration/unregistration on iavf driver
>initialization and remove, and devlink port of DEVLINK_PORT_FLAVOUR_VIRTUAL
>is created to be associated iavf netdevice.
>
>iavf rate tree with root node, queue nodes, and leaf node is created
>and registered with devlink rate when iavf adapter is configured, and
>if PF indicates support of VIRTCHNL_VF_OFFLOAD_QOS through VF Resource /
>Capability Exchange.

NACK! Port function is there to configure the VF/SF from the eswitch
side. Yet you use it for the configuration of the actual VF, which is
clear misuse. Please don't.


Jakub Kicinski Aug. 22, 2023, 3:12 p.m. UTC | #2
On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
> NACK! Port function is there to configure the VF/SF from the eswitch
> side. Yet you use it for the configureation of the actual VF, which is
> clear misuse. Please don't

Stating where they are supposed to configure the rate would be helpful.
Jiri Pirko Aug. 22, 2023, 3:34 p.m. UTC | #3
Tue, Aug 22, 2023 at 05:12:55PM CEST, kuba@kernel.org wrote:
>On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>> NACK! Port function is there to configure the VF/SF from the eswitch
>> side. Yet you use it for the configureation of the actual VF, which is
>> clear misuse. Please don't
>
>Stating where they are supposed to configure the rate would be helpful.

TC?
Zhang, Xuejun Aug. 23, 2023, 9:39 p.m. UTC | #4
On 8/22/2023 8:34 AM, Jiri Pirko wrote:
> Tue, Aug 22, 2023 at 05:12:55PM CEST, kuba@kernel.org wrote:
>> On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>>> NACK! Port function is there to configure the VF/SF from the eswitch
>>> side. Yet you use it for the configureation of the actual VF, which is
>>> clear misuse. Please don't
>> Stating where they are supposed to configure the rate would be helpful.
> TC?

Our implementation is an extension of commit 42c2eb6b1f43 ("ice: Implement
devlink-rate API").

We are setting the Tx max & share rates of individual queues in a VF 
using the devlink rate API.

Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the
port to distinguish it from being an eswitch port.
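
As a minimal sketch of how a driver might create such a
DEVLINK_PORT_FLAVOUR_VIRTUAL port (error handling, port index allocation and
the iavf specifics are omitted; this is not the code from the series):

#include <net/devlink.h>

/* Sketch only: register a VIRTUAL-flavour devlink port. */
static int sketch_register_virtual_port(struct devlink *devlink,
                                        struct devlink_port *dl_port,
                                        unsigned int port_index)
{
        struct devlink_port_attrs attrs = {
                .flavour = DEVLINK_PORT_FLAVOUR_VIRTUAL,
        };

        devlink_port_attrs_set(dl_port, &attrs);
        return devlink_port_register(devlink, dl_port, port_index);
}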

[resend in plain text only]
Jiri Pirko Aug. 24, 2023, 7:04 a.m. UTC | #5
Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote:
>
>On 8/22/2023 8:34 AM, Jiri Pirko wrote:
>> Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org  wrote:
>> > On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>> > > NACK! Port function is there to configure the VF/SF from the eswitch
>> > > side. Yet you use it for the configureation of the actual VF, which is
>> > > clear misuse. Please don't
>> > Stating where they are supposed to configure the rate would be helpful.
>> TC?
>
>Our implementation is an extension to this commit 42c2eb6b1f43 ice: Implement
>devlink-rate API).
>
>We are setting the Tx max & share rates of individual queues in a VF using
>the devlink rate API.
>
>Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the port
>to distinguish it from being eswitch.

I understand, that is a wrong object. So again, you should use
"function" subobject of devlink port to configure "the other side of the
wire", that means the function related to a eswitch port. Here, you are
doing it for the VF directly, which is wrong. If you need some rate
limiting to be configured on an actual VF, use what you use for any
other nic. Offload TC.
Zhang, Xuejun Aug. 28, 2023, 10:46 p.m. UTC | #6
On 8/24/2023 12:04 AM, Jiri Pirko wrote:
> Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote:
>> On 8/22/2023 8:34 AM, Jiri Pirko wrote:
>>> Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org  wrote:
>>>> On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>>>>> NACK! Port function is there to configure the VF/SF from the eswitch
>>>>> side. Yet you use it for the configureation of the actual VF, which is
>>>>> clear misuse. Please don't
>>>> Stating where they are supposed to configure the rate would be helpful.
>>> TC?
>> Our implementation is an extension to this commit 42c2eb6b1f43 ice: Implement
>> devlink-rate API).
>>
>> We are setting the Tx max & share rates of individual queues in a VF using
>> the devlink rate API.
>>
>> Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the port
>> to distinguish it from being eswitch.
> I understand, that is a wrong object. So again, you should use
> "function" subobject of devlink port to configure "the other side of the
> wire", that means the function related to a eswitch port. Here, you are
> doing it for the VF directly, which is wrong. If you need some rate
> limiting to be configured on an actual VF, use what you use for any
> other nic. Offload TC.
Thanks for the detailed explanation and suggestions. Sorry for the late reply;
it took a bit longer to understand the options.

As sysfs has a similar rate configuration on a per-queue basis with
tx_maxrate, is it a viable option for our use case (i.e. allowing the user to
configure the tx rate for each allocated queue in a VF)?

Please also see if adding tx_minrate to the sysfs tx queue entry is feasible
in the current framework.
Paolo Abeni Oct. 18, 2023, 9:05 a.m. UTC | #7
Hi,

please allow me to revive this old thread...

On Thu, 2023-08-24 at 09:04 +0200, Jiri Pirko wrote:
> > Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote:
> > > > 
> > > > On 8/22/2023 8:34 AM, Jiri Pirko wrote:
> > > > > > Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org  wrote:
> > > > > > > > On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
> > > > > > > > > > NACK! Port function is there to configure the VF/SF from the eswitch
> > > > > > > > > > side. Yet you use it for the configureation of the actual VF, which is
> > > > > > > > > > clear misuse. Please don't
> > > > > > > > Stating where they are supposed to configure the rate would be helpful.
> > > > > > TC?
> > > > 
> > > > Our implementation is an extension to this commit 42c2eb6b1f43 ice: Implement
> > > > devlink-rate API).
> > > > 
> > > > We are setting the Tx max & share rates of individual queues in a VF using
> > > > the devlink rate API.
> > > > 
> > > > Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the port
> > > > to distinguish it from being eswitch.
> > 
> > I understand, that is a wrong object. So again, you should use
> > "function" subobject of devlink port to configure "the other side of the
> > wire", that means the function related to a eswitch port. Here, you are
> > doing it for the VF directly, which is wrong. If you need some rate
> > limiting to be configured on an actual VF, use what you use for any
> > other nic. Offload TC.

I have a doubt WRT the above. Don't we need something more/different
here? I mean: a possible intent is limiting the amount of resources (BW
in the VF -> esw direction) that the application owning the VF could
use.

If that is enforced via TC on the VF side (say, a different namespace
or VM), the VF user could circumvent such a limit - changing the tc
configuration - either by mistake or malicious action.

Looking at the thing from a different perspective, the TX B/W on the VF
side is the RX B/W on the eswitch side, so the same effect could be
obtained with a (new/different) API formally touching only an eswitch-side
object. WDYT?

Thanks,

Paolo
Zhang, Xuejun Nov. 17, 2023, 5:52 a.m. UTC | #8
Hello Jiri & Jakub,

Thanks for looking into our last patch with devlink API. Really 
appreciate your candid review.

Following your suggestion, we have looked into 3 tc offload options to 
support queue rate limiting

#1 mq + matchall + police

#2 mq + tbf

#3 htb

All 3 tc offload options require some level of tc extensions to support
VF tx queue rate limiting (tx_maxrate & tx_minrate).
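
For illustration, a rough sketch of the driver-facing side of option #1: with
mq + matchall + police, the rate arrives as a FLOW_ACTION_POLICE entry in the
offloaded rule, which the driver would translate into a per-queue limit (how
the rule is associated with a particular tx queue is left out here; the
helper name is made up for this sketch):

#include <linux/errno.h>
#include <net/flow_offload.h>

/* Sketch: pull the policer parameters out of an offloaded matchall+police
 * rule so the driver can program them on a tx queue.
 */
static int sketch_police_to_queue_limit(const struct flow_action_entry *act,
                                        u64 *rate_bytes_ps, u32 *burst)
{
        if (act->id != FLOW_ACTION_POLICE)
                return -EOPNOTSUPP;

        /* from "action police rate <X> burst <Y>" on the matchall filter */
        *rate_bytes_ps = act->police.rate_bytes_ps;
        *burst = act->police.burst;
        return 0;
}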

htb offload requires minimal or no tc changes, with a similar change done
in the driver (we can share a patch for review).

After discussing with Maxim Mikityanskiy
(https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/),
it looks like the sysfs interface with a tx_minrate extension could be the
option we can take.

Looking forward to your opinion & guidance. Thanks for your time!

Regards,

Jun

Jiri Pirko Nov. 17, 2023, 11:21 a.m. UTC | #9
Fri, Nov 17, 2023 at 06:52:49AM CET, xuejun.zhang@intel.com wrote:
>Hello Jiri & Jakub,
>
>Thanks for looking into our last patch with devlink API. Really appreciate
>your candid review.
>
>Following your suggestion, we have looked into 3 tc offload options to
>support queue rate limiting
>
>#1 mq + matchall + police

This looks most suitable. Why would it not work?

>
>#2 mq + tbf
>
>#3 htb
>
>all 3 tc offload options require some level of tc extensions to support VF tx
>queue rate limiting (tx_maxrate & tx_minrate)
>
>htb offload requires minimal tc changes or no change with similar change done
>@ driver (we can share patch for review).
>
>After discussing with Maxim Mikityanskiy( https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/
>), looks like sysfs interface with tx_minrate extension could be the option

I don't understand how any sysfs knob is related to any of the three tc
solutions above.


Jakub Kicinski Nov. 18, 2023, 4:48 p.m. UTC | #10
On Thu, 16 Nov 2023 21:52:49 -0800 Zhang, Xuejun wrote:
> Thanks for looking into our last patch with devlink API. Really 
> appreciate your candid review.
> 
> Following your suggestion, we have looked into 3 tc offload options to 
> support queue rate limiting
> 
> #1 mq + matchall + police
> 
> #2 mq + tbf

You can extend mqprio, too, if you wanted.

> #3 htb
> 
> all 3 tc offload options require some level of tc extensions to support 
> VF tx queue rate limiting (tx_maxrate & tx_minrate)
> 
> htb offload requires minimal tc changes or no change with similar change 
> done @ driver (we can share patch for review).
> 
> After discussing with Maxim Mikityanskiy( 
> https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/ 
> ), looks like sysfs interface with tx_minrate extension could be the 
> option we can take.
> 
> Look forward your opinion & guidance. Thanks for your time!

My least favorite thing to do is to configure the same piece of silicon
with 4 different SW interfaces. It's okay if we have 4 different uAPIs
(user level APIs) but the driver should not be exposed to all these
options.

I'm saying 4 but really I can think of 6 ways of setting maxrate :(

IMHO we need to be a bit more realistic about the notion of "offloading
the SW thing" for qdiscs specifically. Normally we offload SW constructs
to have a fallback and have a clear definition of functionality.
I bet most data-centers will use BPF+FQ these days, so the "fallback"
argument does not apply. And the "clear definition" when it comes to
basic rate limiting is.. moot.

Besides we already have mqprio, sysfs maxrate, sriov ndo, devlink rate,
none of which have SW fallback.

So since you asked for my opinion - my opinion is that step 1 is to
create a common representation of what we already have and feed it
to the drivers via a single interface. I could just be taking sysfs
maxrate and feeding it to the driver via the devlink rate interface.
If we have the right internals I give 0 cares about what uAPI you pick.
Paolo Abeni Nov. 21, 2023, 9:04 a.m. UTC | #11
On Fri, 2023-11-17 at 12:21 +0100, Jiri Pirko wrote:
> Fri, Nov 17, 2023 at 06:52:49AM CET, xuejun.zhang@intel.com wrote:
> > Hello Jiri & Jakub,
> > 
> > Thanks for looking into our last patch with devlink API. Really appreciate
> > your candid review.
> > 
> > Following your suggestion, we have looked into 3 tc offload options to
> > support queue rate limiting
> > 
> > #1 mq + matchall + police
> 
> This looks most suitable. Why it would not work?

AFAICS, it should work, but it does not look the most suitable to me:
beyond splitting a "simple" task into separate entities, it poses a
constraint on the classification performed on the egress device.

Suppose the admin wants to limit the egress bandwidth on a given tx
queue _and_ do some application-specific packet classification and
actions. That would not be possible, right?

Thanks!

Paolo
Zhang, Xuejun Nov. 22, 2023, 10:19 p.m. UTC | #12
On 11/18/2023 8:48 AM, Jakub Kicinski wrote:
> On Thu, 16 Nov 2023 21:52:49 -0800 Zhang, Xuejun wrote:
>> Thanks for looking into our last patch with devlink API. Really
>> appreciate your candid review.
>>
>> Following your suggestion, we have looked into 3 tc offload options to
>> support queue rate limiting
>>
>> #1 mq + matchall + police
>>
>> #2 mq + tbf
> You can extend mqprio, too, if you wanted.
>
>> #3 htb
>>
>> all 3 tc offload options require some level of tc extensions to support
>> VF tx queue rate limiting (tx_maxrate & tx_minrate)
>>
>> htb offload requires minimal tc changes or no change with similar change
>> done @ driver (we can share patch for review).
>>
>> After discussing with Maxim Mikityanskiy(
>> https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/
>> ), looks like sysfs interface with tx_minrate extension could be the
>> option we can take.
>>
>> Look forward your opinion & guidance. Thanks for your time!
> My least favorite thing to do is to configure the same piece of silicon
> with 4 different SW interfaces. It's okay if we have 4 different uAPIs
> (user level APIs) but the driver should not be exposed to all these
> options.
>
> I'm saying 4 but really I can think of 6 ways of setting maxrate :(
>
> IMHO we need to be a bit more realistic about the notion of "offloading
> the SW thing" for qdiscs specifically. Normally we offload SW constructs
> to have a fallback and have a clear definition of functionality.
> I bet most data-centers will use BPF+FQ these days, so the "fallback"
> argument does not apply. And the "clear definition" when it comes to
> basic rate limiting is.. moot.
>
> Besides we already have mqprio, sysfs maxrate, sriov ndo, devlink rate,
> none of which have SW fallback.
>
> So since you asked for my opinion - my opinion is that step 1 is to
> create a common representation of what we already have and feed it
> to the drivers via a single interface. I could just be taking sysfs
> maxrate and feeding it to the driver via the devlink rate interface.
> If we have the right internals I give 0 cares about what uAPI you pick.

There is an existing ndo API for setting the per-queue tx_maxrate:

int (*ndo_set_tx_maxrate)(struct net_device *dev,
                          int queue_index,
                          u32 maxrate);

We could expand and modify the above API with tx_minrate and burst
parameters as below:

int (*ndo_set_tx_rate)(struct net_device *dev,
                       int queue_index,
                       int maxrate, int minrate, int burst);

queue_index: tx queue index of net device
maxrate: tx maxrate in mbps
minrate: tx min rate in mbps
burst: data burst in bytes


The proposed API would incur net/core and driver changes as follows
a) existing drivers with ndo_set_tx_maxrate support upgraded to use new 
ndo_set_tx_rate
b) net sysfs (replacing ndo_set_maxrate with ndo_set_tx_rate with 
minrate and burst set to 0, -1 means ignore)
c) Keep the existing /sys/class/net/ethx/queues/tx_nn/tx_maxrate as it 
is currently
d) Add sysfs entry as /sys/class/net/ethx/queues/tx_nn/tx_minrate & 
/sys/class/net/ethx/queues/tx_nn/burst
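
To make item b) concrete, a minimal sketch of how the legacy per-queue
maxrate write could be funneled into the proposed op (ndo_set_tx_rate is the
hypothetical extension above, not an existing kernel interface, and the
wrapper/ops names are made up for illustration):

#include <linux/errno.h>
#include <linux/types.h>

struct net_device;

/* Sketch only: stands in for struct net_device_ops with the proposed op. */
struct sketch_netdev_ops {
        int (*ndo_set_tx_rate)(struct net_device *dev, int queue_index,
                               int maxrate, int minrate, int burst);
};

/* The existing sysfs tx_maxrate write would then reduce to calling the new
 * op with minrate/burst passed as -1, i.e. "ignore" per the proposal.
 */
static int sketch_set_tx_maxrate(const struct sketch_netdev_ops *ops,
                                 struct net_device *dev, int queue_index,
                                 u32 maxrate)
{
        if (!ops->ndo_set_tx_rate)
                return -EOPNOTSUPP;

        return ops->ndo_set_tx_rate(dev, queue_index, maxrate, -1, -1);
}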

Let us know your thoughts.
Jakub Kicinski Nov. 23, 2023, 3:22 a.m. UTC | #13
On Wed, 22 Nov 2023 14:19:14 -0800 Zhang, Xuejun wrote:
> The proposed API would incur net/core and driver changes as follows
> a) existing drivers with ndo_set_tx_maxrate support upgraded to use new 
> ndo_set_tx_rate
> b) net sysfs (replacing ndo_set_maxrate with ndo_set_tx_rate with 
> minrate and burst set to 0, -1 means ignore)
> c) Keep the existing /sys/class/net/ethx/queues/tx_nn/tx_maxrate as it 
> is currently
> d) Add sysfs entry as /sys/class/net/ethx/queues/tx_nn/tx_minrate & 
> /sys/class/net/ethx/queues/tx_nn/burst

You described extending the sysfs API (which the ndo you mention 
is for) and nothing about converging the other existing APIs.
Zhang, Xuejun Nov. 28, 2023, 12:15 a.m. UTC | #14
On 11/22/2023 7:22 PM, Jakub Kicinski wrote:
> On Wed, 22 Nov 2023 14:19:14 -0800 Zhang, Xuejun wrote:
>> The proposed API would incur net/core and driver changes as follows
>> a) existing drivers with ndo_set_tx_maxrate support upgraded to use new
>> ndo_set_tx_rate
>> b) net sysfs (replacing ndo_set_maxrate with ndo_set_tx_rate with
>> minrate and burst set to 0, -1 means ignore)
>> c) Keep the existing /sys/class/net/ethx/queues/tx_nn/tx_maxrate as it
>> is currently
>> d) Add sysfs entry as /sys/class/net/ethx/queues/tx_nn/tx_minrate &
>> /sys/class/net/ethx/queues/tx_nn/burst
> You described extending the sysfs API (which the ndo you mention
> is for) and nothing about converging the other existing APIs.
This is an extension of ndo_set_tx_maxrate to include the per-queue
parameters tx_minrate and burst.

The devlink rate API includes tx_maxrate and tx_minrate; it is intended for
port rate configuration.

With regard to tc mqprio, it is being used to configure queue groups
per tc.

For the SR-IOV ndo ndo_set_vf_rate, that has been used for overall VF rate
configuration, not for queue-based rate configuration.

It seems there are differences in the intent of the aforementioned APIs.

Our use case here is to allow the user (i.e. via uAPI) to configure the tx
rates (max rate & min rate) per VF queue. Hence we are inclined toward an
ndo_set_tx_maxrate extension.
Jakub Kicinski Nov. 28, 2023, 1:43 a.m. UTC | #15
On Mon, 27 Nov 2023 16:15:47 -0800 Zhang, Xuejun wrote:
> This is extension of ndo_set_tx_maxrate to include per queue parameters 
> of tx_minrate and burst.
> 
> devlink rate api includes tx_maxrate and tx_minrate, it is intended for 
> port rate configurations.
> 
> With regarding to tc mqprio, it is being used to configure queue group 
> per tc.
> 
> For sriov ndo ndo_set_vf_rate, that has been used for overall VF rate 
> configuration, not for queue based rate configuration.
> 
> It seems there are differences on intent of the aforementioned APIs.
> 
> Our use case here is to allow user (i.e @ uAPI) to configure tx rates of 
> max rate & min rate per VF queue.Hence we are inclined to 
> ndo_set_tx_maxrate extension.

I said:

  So since you asked for my opinion - my opinion is that step 1 is to
  create a common representation of what we already have and feed it
  to the drivers via a single interface. I could just be taking sysfs
  maxrate and feeding it to the driver via the devlink rate interface.
  If we have the right internals I give 0 cares about what uAPI you pick.

https://lore.kernel.org/all/20231118084843.70c344d9@kernel.org/

Again, the first step is creating a common kernel <> driver interface
which can be used to send to the driver the configuration from the
existing 4 interfaces.
Paolo Abeni Dec. 14, 2023, 8:29 p.m. UTC | #16
On Mon, 2023-11-27 at 17:43 -0800, Jakub Kicinski wrote:
> On Mon, 27 Nov 2023 16:15:47 -0800 Zhang, Xuejun wrote:
> > This is extension of ndo_set_tx_maxrate to include per queue parameters 
> > of tx_minrate and burst.
> > 
> > devlink rate api includes tx_maxrate and tx_minrate, it is intended for 
> > port rate configurations.
> > 
> > With regarding to tc mqprio, it is being used to configure queue group 
> > per tc.
> > 
> > For sriov ndo ndo_set_vf_rate, that has been used for overall VF rate 
> > configuration, not for queue based rate configuration.
> > 
> > It seems there are differences on intent of the aforementioned APIs.
> > 
> > Our use case here is to allow user (i.e @ uAPI) to configure tx rates of 
> > max rate & min rate per VF queue.Hence we are inclined to 
> > ndo_set_tx_maxrate extension.
> 
> I said:
> 
>   So since you asked for my opinion - my opinion is that step 1 is to
>   create a common representation of what we already have and feed it
>   to the drivers via a single interface. I could just be taking sysfs
>   maxrate and feeding it to the driver via the devlink rate interface.
>   If we have the right internals I give 0 cares about what uAPI you pick.
> 
> https://lore.kernel.org/all/20231118084843.70c344d9@kernel.org/
> 
> Again, the first step is creating a common kernel <> driver interface
> which can be used to send to the driver the configuration from the
> existing 4 interfaces.

Together with Simon, I spent some time on the above. We think the
ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as a common basis for
these offloads, with some small extensions (adding a 'max_rate' param,
too).

The idea would be:
- 'fixing' sch_tbf so that the s/w path becomes a no-op when h/w offload
is enabled
- extend sch_tbf to support max rate
- do the relevant ice implementation
- ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the
latter interface is a strict super-set of the former)
- ndo_set_vf_rate could also be replaced with the mentioned ndo call
(with another small extension to the offload data)
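
As a rough sketch of the shape such an extension could take on the offload
side (field names are illustrative assumptions, not the in-tree
tc_tbf_qopt_offload definitions, and how the two rates map onto
tx_share/tx_max is left open here):

#include <linux/types.h>

/* Illustrative only: a hypothetical extension of the TBF offload parameters
 * with a second rate, so one ndo_setup_tc() hook could carry both a
 * committed and a peak rate per queue.
 */
struct sketch_tbf_offload_params {
        u64 rate_bytes_ps;      /* existing TBF rate */
        u64 max_rate_bytes_ps;  /* the proposed additional 'max_rate' */
        u32 burst_bytes;
};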

I think mqprio deserves its own separate offload interface, as it
covers multiple tasks other than shaping (grouping queues and mapping
priority to classes).

In the long run we could have a generic implementation of the
ndo_setup_tc(TC_SETUP_QDISC_TBF) in terms of devlink rate, adding a
generic way to fetch the devlink_port instance corresponding to the
given netdev and mapping the TBF features to the devlink_rate API.

Not starting this due to what Jiri mentioned [1].

WDYT?

Thanks,

Paolo and Simon

[1] https://lore.kernel.org/netdev/ZORRzEBcUDEjMniz@nanopsycho/
Jakub Kicinski Dec. 15, 2023, 1:46 a.m. UTC | #17
On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote:
> Together with Simon, I spent some time on the above. We think the
> ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for
> this offloads, with some small extensions (adding a 'max_rate' param,
> too).

uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF)
to implement common basis?

Is it not cleaner to have a separate driver API, with its ops
and capabilities?

> The idea would be:
> - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload
> is enabled
> - extend sch_btf to support max rate
> - do the relevant ice implementation
> - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the
> latter interface is a strict super-set of former)
> - ndo_set_vf_rate could also be replaced with the mentioned ndo call
> (with another small extension to the offload data)
> 
> I think mqprio deserves it's own separate offload interface, as it
> covers multiple tasks other than shaping (grouping queues and mapping
> priority to classes)
> 
> In the long run we could have a generic implementation of the
> ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a
> generic way to fetch the devlink_port instance corresponding to the
> given netdev and mapping the TBF features to the devlink_rate API.
> 
> Not starting this due to what Jiri mentioned [1].

Jiri, AFAIU, is against using devlink rate *uAPI* to configure network
rate limiting. That's separate from the internal representation.
Paolo Abeni Dec. 15, 2023, 11:06 a.m. UTC | #18
On Thu, 2023-12-14 at 17:46 -0800, Jakub Kicinski wrote:
> On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote:
> > Together with Simon, I spent some time on the above. We think the
> > ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for
> > this offloads, with some small extensions (adding a 'max_rate' param,
> > too).
> 
> uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF)
> to implement common basis?
> 
> Is it not cleaner to have a separate driver API, with its ops
> and capabilities?

We understand one of the end goals is consolidating the existing rate-
related in-kernel interfaces.  Adding a new one does not feel like a good
starting point to reach that goal, see [1] & [2] ;). ndo_setup_tc() feels
like the natural choice for H/W offload and TBF is the existing
interface IMHO nearest to the requirements here.

The devlink rate API could be a possible alternative...

> > The idea would be:
> > - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload
> > is enabled
> > - extend sch_btf to support max rate
> > - do the relevant ice implementation
> > - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the
> > latter interface is a strict super-set of former)
> > - ndo_set_vf_rate could also be replaced with the mentioned ndo call
> > (with another small extension to the offload data)
> > 
> > I think mqprio deserves it's own separate offload interface, as it
> > covers multiple tasks other than shaping (grouping queues and mapping
> > priority to classes)
> > 
> > In the long run we could have a generic implementation of the
> > ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a
> > generic way to fetch the devlink_port instance corresponding to the
> > given netdev and mapping the TBF features to the devlink_rate API.
> > 
> > Not starting this due to what Jiri mentioned [1].
> 
> Jiri, AFAIU, is against using devlink rate *uAPI* to configure network
> rate limiting. That's separate from the internal representation.

... with a couple of caveats:

1) AFAICS devlink (and/or devlink_port) does not have a fine-grained,
per-queue representation, and Intel wants to be able to configure shaping
on a per-queue basis. I think/hope we don't want to bring the discussion
to extending the devlink interface with queue support; I fear that will
block us for a long time. Perhaps I'm missing or misunderstanding
something here. Otherwise, in retrospect, this looks like a reasonable
point to completely avoid devlink here.

2) My understanding of Jiri's statement was more restrictive. @Jiri it
would be great if you could share your genuine interpretation: are you OK
with using the devlink_port rate API as a basis to replace
ndo_set_tx_maxrate() (via dev->devlink_port->devlink->) and possibly
ndo_set_vf_rate()? Note that, given the previous point, this option would
still feel problematic.

Cheers,

Paolo

[1] https://xkcd.com/927/
[2] https://www.youtube.com/watch?v=f8kO_L-pDwo
Paolo Abeni Dec. 15, 2023, 11:47 a.m. UTC | #19
On Fri, 2023-12-15 at 12:06 +0100, Paolo Abeni wrote:
> 1) AFAICS devlink (and/or devlink_port) does not have fine grained, per
> queue representation and intel want to be able to configure shaping on
> per queue basis. I think/hope we don't want to bring the discussion to
> extending the devlink interface with queue support, I fear that will
> block us for a long time. Perhaps I’m missing or misunderstanding
> something here. Otherwise in retrospect this looks like a reasonable
> point to completely avoid devlink here.

Note to self: never send a message to the ML before my 3rd morning
coffee.

This thread started with Intel trying to use devlink rate for their
use case, apparently slamming my doubt above.

My understanding is that in the patches the devlink <> queue
relationship was kept inside the driver and not exposed at the devlink
level.

If we want to use the devlink rate API to replace e.g.
ndo_set_tx_maxrate, we would need a devlink queue(id) or the like,
hence this point.

Cheers,

Paolo
Jiri Pirko Dec. 15, 2023, 12:22 p.m. UTC | #20
Fri, Dec 15, 2023 at 02:46:04AM CET, kuba@kernel.org wrote:
>On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote:
>> Together with Simon, I spent some time on the above. We think the
>> ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for
>> this offloads, with some small extensions (adding a 'max_rate' param,
>> too).
>
>uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF)
>to implement common basis?
>
>Is it not cleaner to have a separate driver API, with its ops
>and capabilities?
>
>> The idea would be:
>> - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload
>> is enabled
>> - extend sch_btf to support max rate
>> - do the relevant ice implementation
>> - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the
>> latter interface is a strict super-set of former)
>> - ndo_set_vf_rate could also be replaced with the mentioned ndo call
>> (with another small extension to the offload data)
>> 
>> I think mqprio deserves it's own separate offload interface, as it
>> covers multiple tasks other than shaping (grouping queues and mapping
>> priority to classes)
>> 
>> In the long run we could have a generic implementation of the
>> ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a
>> generic way to fetch the devlink_port instance corresponding to the
>> given netdev and mapping the TBF features to the devlink_rate API.
>> 
>> Not starting this due to what Jiri mentioned [1].
>
>Jiri, AFAIU, is against using devlink rate *uAPI* to configure network
>rate limiting. That's separate from the internal representation.

Devlink rate was introduced for configuring port functions that are
connected to an eswitch port. I don't see any reason to extend it for
configuration of a netdev on the host. We have the netdev instance and
other means to do it.
Jiri Pirko Dec. 15, 2023, 12:30 p.m. UTC | #21
Fri, Dec 15, 2023 at 12:06:52PM CET, pabeni@redhat.com wrote:
>On Thu, 2023-12-14 at 17:46 -0800, Jakub Kicinski wrote:
>> On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote:
>> > Together with Simon, I spent some time on the above. We think the
>> > ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for
>> > this offloads, with some small extensions (adding a 'max_rate' param,
>> > too).
>> 
>> uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF)
>> to implement common basis?
>> 
>> Is it not cleaner to have a separate driver API, with its ops
>> and capabilities?
>
>We understand one of the end goal is consolidating the existing rate-
>related in kernel interfaces.  Adding a new one does not feel a good
>starting to reach that goal, see [1] & [2] ;). ndo_setup_tc() feels
>like the natural choice for H/W offload and TBF is the existing
>interface IMHO nearest to the requirements here.
>
>The devlink rate API could be a possible alternative...

Again, devlink rate was introduced for the rate configuration of an
entity that is not represented (by a netdev, for example) on the host.
If we have a netdev, let's use it.


>
>> > The idea would be:
>> > - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload
>> > is enabled
>> > - extend sch_btf to support max rate
>> > - do the relevant ice implementation
>> > - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the
>> > latter interface is a strict super-set of former)
>> > - ndo_set_vf_rate could also be replaced with the mentioned ndo call
>> > (with another small extension to the offload data)
>> > 
>> > I think mqprio deserves it's own separate offload interface, as it
>> > covers multiple tasks other than shaping (grouping queues and mapping
>> > priority to classes)
>> > 
>> > In the long run we could have a generic implementation of the
>> > ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a
>> > generic way to fetch the devlink_port instance corresponding to the
>> > given netdev and mapping the TBF features to the devlink_rate API.
>> > 
>> > Not starting this due to what Jiri mentioned [1].
>> 
>> Jiri, AFAIU, is against using devlink rate *uAPI* to configure network
>> rate limiting. That's separate from the internal representation.
>
>... with a couples of caveats:
>
>1) AFAICS devlink (and/or devlink_port) does not have fine grained, per
>queue representation and intel want to be able to configure shaping on
>per queue basis. I think/hope we don't want to bring the discussion to
>extending the devlink interface with queue support, I fear that will
>block us for a long time. Perhaps I’m missing or misunderstanding
>something here. Otherwise in retrospect this looks like a reasonable
>point to completely avoid devlink here.
>
>2) My understanding of Jiri statement was more restrictive. @Jiri it
>would great if could share your genuine interpretation: are you ok with
>using the devlink_port rate API as a basis to replace
>ndo_set_tx_maxrate() (via dev->devlink_port->devlink->) and possibly

Does not make any sense to me.


>ndo_set_vf_rate(). Note the given the previous point, this option would

ndo_set_vf_rate() (and the rest of the ndo_[gs]et_vf_*() ndos) is the
legacy way. Devlink rate replaced that when switchdev eswitch mode is
configured by:
$ sudo devlink dev eswitch set pci/0000:08:00.1 mode switchdev

In drivers, ndo_set_vf_rate() and devlink rate are implemented in the
same way. See mlx5 for example:
mlx5_esw_qos_set_vport_rate()
mlx5_esw_devlink_rate_leaf_tx_share_set()



>still feel problematic.
>
>Cheers,
>
>Paolo
>
>[1] https://xkcd.com/927/
>[2] https://www.youtube.com/watch?v=f8kO_L-pDwo
>
Jakub Kicinski Dec. 15, 2023, 10:41 p.m. UTC | #22
On Fri, 15 Dec 2023 12:06:52 +0100 Paolo Abeni wrote:
> > uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF)
> > to implement common basis?
> > 
> > Is it not cleaner to have a separate driver API, with its ops
> > and capabilities?  
> 
> We understand one of the end goal is consolidating the existing rate-
> related in kernel interfaces.  Adding a new one does not feel a good
> starting to reach that goal, see [1] & [2] ;)

ndo_setup_tc(TC_SETUP_QDISC_TBF) is a new API, too, very much so.
These attempts to build on top of existing interfaces with small
tweaks are leading us to a fragmented and incompatible driver landscape.

I explained before (perhaps on the netdev call) - Qdiscs have two
different offload models. "local" and "switchdev", here we want "local"
AFAIU and TBF only has "switchdev" offload (take a look at the enqueue
method and which drivers support it today).

"We'll extend TBF" is very much adding a new API. You'll have to add
"local offload" support in TBF and no NIC driver today supports it.
I'm not saying TBF is bad, but I disagree that it's any different
than a new NDO for all practical purposes.

> ndo_setup_tc() feels like the natural choice for H/W offload and TBF
> is the existing interface IMHO nearest to the requirements here.

I question whether something as basic as scheduling and ACLs should
follow the "offload SW constructs" mantra. You are exposed to more
diverse users so please don't hesitate to disagree, but AFAICT
the transparent offload (user installs SW constructs and if offload
is available - offload, otherwise use SW is good enough) has not
played out like we have hoped.

Let's figure out what is the abstract model of scheduling / shaping
within a NIC that we want to target. And then come up with a way of
representing it in SW. Not which uAPI we can shoehorn into the use
case.
Paolo Abeni Dec. 18, 2023, 8:12 p.m. UTC | #23
On Fri, 2023-12-15 at 14:41 -0800, Jakub Kicinski wrote:
> I explained before (perhaps on the netdev call) - Qdiscs have two
> different offload models. "local" and "switchdev", here we want "local"
> AFAIU and TBF only has "switchdev" offload (take a look at the enqueue
> method and which drivers support it today).

I must admit the above is not yet clear to me.

I initially thought you meant that "local" offloads properly
reconfigure the S/W datapath so that locally generated traffic would go
through the expected processing (e.g. shaping) just once, while with
"switchdev" offload locally generated traffic will see shaping done
both by the S/W and the H/W[1].

Reading the above I now think you mean that local offloads only have an
effect on locally generated traffic but not on traffic forwarded via the
eswitch, and vice versa [2].

The drivers I looked at did not show any clue (to me).

FTR, I think that [1] is a bug worth fixing and [2] is evil ;)

Could you please clarify which is the difference exactly between them?

> "We'll extend TBF" is very much adding a new API. You'll have to add
> "local offload" support in TBF and no NIC driver today supports it.
> I'm not saying TBF is bad, but I disagree that it's any different
> than a new NDO for all practical purposes.
> 
> > ndo_setup_tc() feels like the natural choice for H/W offload and TBF
> > is the existing interface IMHO nearest to the requirements here.
> 
> I question whether something as basic as scheduling and ACLs should
> follow the "offload SW constructs" mantra. You are exposed to more
> diverse users so please don't hesitate to disagree, but AFAICT
> the transparent offload (user installs SW constructs and if offload
> is available - offload, otherwise use SW is good enough) has not
> played out like we have hoped.
> 
> Let's figure out what is the abstract model of scheduling / shaping
> within a NIC that we want to target. And then come up with a way of
> representing it in SW. Not which uAPI we can shoehorn into the use
> case.

I thought the model was quite well defined since the initial submission
from Intel, and is quite simple: expose TX shaping on a per-tx-queue
basis, with min rate, max rate (in bps) and burst (in bytes).

I think that by making it more complex (e.g. with nesting, pkt overhead,
etc.) we will still not cover every possible use case and will add
considerable complexity.
Cheers,

Paolo
Jakub Kicinski Dec. 18, 2023, 9:33 p.m. UTC | #24
On Mon, 18 Dec 2023 21:12:35 +0100 Paolo Abeni wrote:
> On Fri, 2023-12-15 at 14:41 -0800, Jakub Kicinski wrote:
> > I explained before (perhaps on the netdev call) - Qdiscs have two
> > different offload models. "local" and "switchdev", here we want "local"
> > AFAIU and TBF only has "switchdev" offload (take a look at the enqueue
> > method and which drivers support it today).  
> 
> I must admit the above is not yet clear to me.
> 
> I initially thought you meant that "local" offloads properly
> reconfigure the S/W datapath so that locally generated traffic would go
> through the expected processing (e.g. shaping) just once, while with
> "switchdev" offload locally generated traffic will see shaping done
> both by the S/W and the H/W[1].
> 
> Reading the above I now think you mean that local offloads has only
> effect for locally generated traffic but not on traffic forwarded via
> eswitch, and vice versa[2]. 
> 
> The drivers I looked at did not show any clue (to me).
> 
> FTR, I think that [1] is a bug worth fixing and [2] is evil ;)
> 
> Could you please clarify which is the difference exactly between them?

The practical difference which you can see in the code is that
"locally offloaded" qdiscs will act like a FIFO in the SW path (at least
to some extent), while "switchdev" offload qdiscs act exactly the same
regardless of the offload.

Neither is wrong, they are offloading different things. Qdisc offload
on a representor (switchdev) offloads from the switch perspective, i.e.
"ingress to host". Only fallback goes thru SW path, and should be
negligible.

"Local" offload can be implemented as admission control (and is
sometimes work conserving), it's on the "real" interface, it's egress,
and doesn't take part in forwarding.

> > I question whether something as basic as scheduling and ACLs should
> > follow the "offload SW constructs" mantra. You are exposed to more
> > diverse users so please don't hesitate to disagree, but AFAICT
> > the transparent offload (user installs SW constructs and if offload
> > is available - offload, otherwise use SW is good enough) has not
> > played out like we have hoped.
> > 
> > Let's figure out what is the abstract model of scheduling / shaping
> > within a NIC that we want to target. And then come up with a way of
> > representing it in SW. Not which uAPI we can shoehorn into the use
> > case.  
> 
> I thought the model was quite well defined since the initial submission
> from Intel, and is quite simple: expose TX shaping on per tx queue
> basis, with min rate, max rate (in bps) and burst (in bytes).

For some definition of a model, I guess. Given the confusion about
switchdev vs local (ingress vs egress) - I can't agree that the model
is well defined :(

What I mean is - given a piece of functionality like "Tx queue shaping"
you can come up with a reasonable uAPI that you can hijack and it makes
sense to you. But someone else (switchdev ingress) can choose the same
API to implement a different offload. Not to mention that yet another
person will choose a different API to implement the same things as you :(

Off the top of my head we have at least:

 - Tx DMA admission control / scheduling (which Tx DMA queue will NIC 
   pull from)
 - Rx DMA scheduling (which Rx queue will NIC push to)

 - buffer/queue configuration (how to deal with buildup of packets in
   NIC SRAM, usually mostly for ingress)
 - NIC buffer configuration (how the SRAM is allocated to queues)

 - policers in the NIC forwarding logic


Let's extend this list so that it covers all reasonable NIC designs,
and then work on mapping how each of them is configured?

> I think that making it more complex (e.g. with nesting, pkt overhead,
> etc) we will still not cover every possible use case and will add
> considerable complexity.