[net-next,V5,00/11] net/mlx5: ConnectX-8 SW Steering + Rate management on traffic classes

Message ID: 20241204220931.254964-1-tariqt@nvidia.com

Message

Tariq Toukan Dec. 4, 2024, 10:09 p.m. UTC
Hi,

This patchset starts with 4 patches that modify the IFC, targeted at
mlx5-next so that they can be taken into the rdma-next branch sooner
than the next merge window.

This patchset consists of two features:
1. In patches 5-6, Itamar adds SW Steering support for ConnectX-8.
2. In the remaining patches, Carolina adds rate management support on
traffic classes in devlink and mlx5; more details below [1].

Series generated against:
commit bb18265c3aba ("r8169: remove support for chip version 11")

Regards,
Tariq

V5:
- Fixed a warning in devlink_nl_rate_tc_bw_set().
- Fixed the target branch of patch #4.

V4:
- Renamed the nested attribute for traffic class bandwidth to
  DEVLINK_ATTR_RATE_TC_BWS.
- Changed the order of the attributes in `devlink.h`.
- Refactored the initialization of the tc-bw array in
  devlink_nl_rate_tc_bw_set().
- Added extack messages to provide clear feedback on issues with tc-bw
  arguments.
- Updated `rate-tc-bws` to support a multi-attr set, where each
  attribute includes an index and the corresponding bandwidth for that
  traffic class.
- Handled the case where the user provides DEVLINK_ATTR_RATE_TC_BWS
  with duplicate indices (see the sketch after this list).
- Provided ynl examples in the devlink patch commit message.
- Moved the IFC patches to the beginning of the series, targeted at
  mlx5-next.
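
For illustration, a ynl invocation in the spirit of those examples
might look like the sketch below. The rate-tc-bws attribute and its
index/bw members follow the naming in this changelog; the bus/dev
values, the node name, and the exact spec member names are assumptions,
not verified against the final spec:

  # Hypothetical invocation that V4+ should reject: TC index 0 appears
  # twice, so the kernel reports the problem via extack instead of
  # silently applying one of the two values.
  ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/devlink.yaml \
      --do rate-set --json '{
          "bus-name": "pci",
          "dev-name": "0000:08:00.0",
          "rate-node-name": "vfs_group",
          "rate-tc-bws": [{"index": 0, "bw": 30},
                          {"index": 0, "bw": 70}]
      }'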

V3:
- Dropped rate-tc-index, using tc-bw array index instead.
- Renamed rate-bw to rate-tc-bw.
- Documented what the rate-tc-bw attribute represents and added a range
  check for validation.
- Introduced devlink_nl_rate_tc_bw_set() to parse and set the TC
  bandwidth values.
- Updated the user API in the commit message of patch 1/6 to ensure
  the bandwidth values sum to 100.
- Fixed missing filling of rate-parent in devlink_nl_rate_fill().

V2:
- Included <linux/dcbnl.h> in devlink.h to resolve missing
  IEEE_8021QAZ_MAX_TCS definition.
- Refactored the rate-tc-bw attribute structure to use a separate
  rate-tc-index.
- Updated patch 2/6 title.


[1]
This patch series extends the devlink-rate API to support traffic class
(TC) bandwidth management, enabling more granular control over traffic
shaping and rate limiting across multiple TCs. The API now allows users
to specify bandwidth proportions for different traffic classes in a
single command. This is particularly useful for managing Enhanced
Transmission Selection (ETS) for groups of Virtual Functions (VFs),
allowing precise bandwidth allocation across traffic classes.
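
As a concrete sketch of the intended usage, the hypothetical ynl
invocation below assigns 20%/80% of a rate node's bandwidth to traffic
classes 0 and 5, applying to all VFs grouped under that node. Attribute
names follow the changelog above; the bus/dev values, node name, and
exact spec member names are assumptions:

  # Split the node's bandwidth between TC0 (20%) and TC5 (80%).
  # Per the V3 notes above, the provided shares must sum to 100.
  ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/devlink.yaml \
      --do rate-set --json '{
          "bus-name": "pci",
          "dev-name": "0000:08:00.0",
          "rate-node-name": "vfs_group",
          "rate-tc-bws": [{"index": 0, "bw": 20},
                          {"index": 5, "bw": 80}]
      }'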

Additionally the series refines the QoS handling in net/mlx5 to support
TC arbitration and bandwidth management on vports and rate nodes.

Extend devlink-rate API to support rate management on TCs:
- devlink: Extend the devlink rate API to support traffic class
  bandwidth management

Introduce a no-op implementation:
- net/mlx5: Add no-op implementation for setting tc-bw on rate objects

Add support for enabling and disabling TC QoS on vports and nodes:
- net/mlx5: Add support for setting tc-bw on nodes
- net/mlx5: Add traffic class scheduling support for vport QoS

Support for setting tc-bw on rate objects:
- net/mlx5: Manage TC arbiter nodes and implement full support for
  tc-bw

Carolina Jubran (6):
  net/mlx5: Add support for new scheduling elements
  devlink: Extend devlink rate API with traffic classes bandwidth
    management
  net/mlx5: Add no-op implementation for setting tc-bw on rate objects
  net/mlx5: Add support for setting tc-bw on nodes
  net/mlx5: Add traffic class scheduling support for vport QoS
  net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw

Cosmin Ratiu (2):
  net/mlx5: ifc: Reorganize mlx5_ifc_flow_table_context_bits
  net/mlx5: qos: Add ifc support for cross-esw scheduling

Itamar Gozlan (2):
  net/mlx5: DR, Expand SWS STE callbacks and consolidate common structs
  net/mlx5: DR, Add support for ConnectX-8 steering

Yevgeny Kliteynik (1):
  net/mlx5: Add ConnectX-8 device to ifc

 Documentation/netlink/specs/devlink.yaml      |  28 +-
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   1 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   2 +
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 795 +++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |   4 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  13 +-
 drivers/net/ethernet/mellanox/mlx5/core/rl.c  |   4 +
 .../mlx5/core/steering/sws/dr_domain.c        |   2 +-
 .../mellanox/mlx5/core/steering/sws/dr_ste.c  |   6 +-
 .../mellanox/mlx5/core/steering/sws/dr_ste.h  |  19 +-
 .../mlx5/core/steering/sws/dr_ste_v0.c        |   6 +-
 .../mlx5/core/steering/sws/dr_ste_v1.c        | 207 +----
 .../mlx5/core/steering/sws/dr_ste_v1.h        | 147 +++-
 .../mlx5/core/steering/sws/dr_ste_v2.c        | 169 +---
 .../mlx5/core/steering/sws/dr_ste_v2.h        | 168 ++++
 .../mlx5/core/steering/sws/dr_ste_v3.c        | 221 +++++
 .../mlx5/core/steering/sws/mlx5_ifc_dr.h      |  40 +
 .../mellanox/mlx5/core/steering/sws/mlx5dr.h  |   2 +-
 include/linux/mlx5/mlx5_ifc.h                 |  56 +-
 include/net/devlink.h                         |   7 +
 include/uapi/linux/devlink.h                  |   4 +
 net/devlink/netlink_gen.c                     |  15 +-
 net/devlink/netlink_gen.h                     |   1 +
 net/devlink/rate.c                            | 124 +++
 24 files changed, 1645 insertions(+), 396 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/sws/dr_ste_v2.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/sws/dr_ste_v3.c

Comments

Leon Romanovsky Dec. 5, 2024, 9:23 a.m. UTC | #1
On Thu, Dec 05, 2024 at 12:09:20AM +0200, Tariq Toukan wrote:
> Hi,
> 
> This patchset starts with 4 patches that modify the IFC, targeted at
> mlx5-next so that they can be taken into the rdma-next branch sooner
> than the next merge window.
> 
> This patchset consists of two features:
> 1. In patches 5-6, Itamar adds SW Steering support for ConnectX-8.
> 2. In the remaining patches, Carolina adds rate management support on
> traffic classes in devlink and mlx5; more details below [1].
> 
> Series generated against:
> commit bb18265c3aba ("r8169: remove support for chip version 11")
> 
> Regards,
> Tariq

<...>

> Carolina Jubran (6):
>   net/mlx5: Add support for new scheduling elements
> 
> Cosmin Ratiu (2):
>   net/mlx5: ifc: Reorganize mlx5_ifc_flow_table_context_bits
>   net/mlx5: qos: Add ifc support for cross-esw scheduling
> 
> Yevgeny Kliteynik (1):
>   net/mlx5: Add ConnectX-8 device to ifc

I applied these IFC patches to our mlx5-next shared branch.
https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next

Thanks
Jakub Kicinski Dec. 7, 2024, 2:13 a.m. UTC | #2
On Thu, 5 Dec 2024 00:09:20 +0200 Tariq Toukan wrote:
> This patch series extends the devlink-rate API to support traffic class
> (TC) bandwidth management, enabling more granular control over traffic
> shaping and rate limiting across multiple TCs. The API now allows users
> to specify bandwidth proportions for different traffic classes in a
> single command. This is particularly useful for managing Enhanced
> Transmission Selection (ETS) for groups of Virtual Functions (VFs),
> allowing precise bandwidth allocation across traffic classes.
> 
> Additionally the series refines the QoS handling in net/mlx5 to support
> TC arbitration and bandwidth management on vports and rate nodes.
> 
> Extend devlink-rate API to support rate management on TCs:
> - devlink: Extend the devlink rate API to support traffic class
>   bandwidth management
> 
> Introduce a no-op implementation:
> - net/mlx5: Add no-op implementation for setting tc-bw on rate objects
> 
> Add support for enabling and disabling TC QoS on vports and nodes:
> - net/mlx5: Add support for setting tc-bw on nodes
> - net/mlx5: Add traffic class scheduling support for vport QoS
> 
> Support for setting tc-bw on rate objects:
> - net/mlx5: Manage TC arbiter nodes and implement full support for
>   tc-bw

Do you expect TC bw allocation to work on non-leaf nodes?

How does this relate to the rate API which Paolo added? He was asked
to build it in a way that integrates with devlink; now devlink is
growing extra features again, which presumably the other API will also
need. And the integration may turn out to be challenging.
Tariq Toukan Dec. 9, 2024, 7:32 p.m. UTC | #3
On 07/12/2024 4:13, Jakub Kicinski wrote:
> On Thu, 5 Dec 2024 00:09:20 +0200 Tariq Toukan wrote:
>> This patch series extends the devlink-rate API to support traffic class
>> (TC) bandwidth management, enabling more granular control over traffic
>> shaping and rate limiting across multiple TCs. The API now allows users
>> to specify bandwidth proportions for different traffic classes in a
>> single command. This is particularly useful for managing Enhanced
>> Transmission Selection (ETS) for groups of Virtual Functions (VFs),
>> allowing precise bandwidth allocation across traffic classes.
>>
>> Additionally the series refines the QoS handling in net/mlx5 to support
>> TC arbitration and bandwidth management on vports and rate nodes.
>>
>> Extend devlink-rate API to support rate management on TCs:
>> - devlink: Extend the devlink rate API to support traffic class
>>    bandwidth management
>>
>> Introduce a no-op implementation:
>> - net/mlx5: Add no-op implementation for setting tc-bw on rate objects
>>
>> Add support for enabling and disabling TC QoS on vports and nodes:
>> - net/mlx5: Add support for setting tc-bw on nodes
>> - net/mlx5: Add traffic class scheduling support for vport QoS
>>
>> Support for setting tc-bw on rate objects:
>> - net/mlx5: Manage TC arbiter nodes and implement full support for
>>    tc-bw
> 
> Do you expect TC bw allocation to work on non-leaf nodes?
> 

Yes. That's the point. It works.

> How does this relate to the rate API which Paolo added? He was asked
> to build it in a way that integrates with devlink; now devlink is
> growing extra features again, which presumably the other API will also
> need. And the integration may turn out to be challenging.
> 

AFAIU Paolo's work is not for shapers 'above' the network device level, 
i.e. groups.
Jakub Kicinski Dec. 9, 2024, 9:41 p.m. UTC | #4
On Mon, 9 Dec 2024 21:32:11 +0200 Tariq Toukan wrote:
> > Do you expect TC bw allocation to work on non-leaf nodes?
>
> Yes. That's the point. It works.

Let's level -- I'm not trying to be difficult, but you're defining
uAPI with little to no documentation. "It works" is not going to cut it.

> > How does this relate to the rate API which Paolo added? He was asked
> > to build it in a way that integrates with devlink; now devlink is
> > growing extra features again, which presumably the other API will also
> > need. And the integration may turn out to be challenging.
> 
> AFAIU Paolo's work is not for shapers 'above' the network device level, 
> i.e. groups.

What's the difference between a queue group and a VF?
Cosmin Ratiu Dec. 11, 2024, 9:49 a.m. UTC | #5
On Mon, 2024-12-09 at 13:41 -0800, Jakub Kicinski wrote:
> On Mon, 9 Dec 2024 21:32:11 +0200 Tariq Toukan wrote:
> > > Do you expect TC bw allocation to work on non-leaf nodes?
> > 
> > Yes. That's the point. It works.
> 
> Let's level -- I'm not trying to be difficult, but you're defining
> uAPI with little to no documentation. "It works" is not going to cut it.

The original intent was to document this in the devlink man page. But
we will add something in the kernel documentation as well in the next
submission.

> 
> > > How does this relate to the rate API which Paolo added? He was
> > > asked to build it in a way that integrates with devlink; now
> > > devlink is growing extra features again, which presumably the
> > > other API will also need. And the integration may turn out to be
> > > challenging.
> > 
> > AFAIU Paolo's work is not for shapers 'above' the network device
> > level, i.e. groups.
> 
> What's the difference between a queue group and a VF?
> 

I've looked over the latest version of the net-shapers API.
There is some conceptual overlap between this patchset and net-shapers'
ability to define a group of device queues and manipulate its tx
limits. But as far as I am aware ([1]), the net-shapers API doesn't
intend to shape entities above netdev level.

So there are two things to discuss here:
1. Integrating device-level TC shaping into net-shapers. The net-shapers
model would need to be extended with the ability to define TC
queues. At the moment I see it's concerned with device tx queues which
don't necessarily map 1:1 to traffic classes.

Then, it would need to have the ability to group TC queues into a node.

Then the integration should be easy. Either API can call the device
driver implementation or one API can call the other's function to do
so.

Paolo, what are your thoughts on tc shaping in the net-shapers API?

2. VF-group TC shaping. The current patchset offers the ability to
split TC bandwidth on a devlink rate node, applying to all VFs in the
node. As far as I am aware, net-shapers doesn't intend to address this
use case. Do we want to have two completely different APIs to
manipulate tc bandwidth?

Cosmin.

[1]
https://lore.kernel.org/netdev/7195630a-1021-4e1e-b48b-a07945477863@redhat.com/
Jakub Kicinski Dec. 12, 2024, 1:49 a.m. UTC | #6
On Wed, 11 Dec 2024 09:49:28 +0000 Cosmin Ratiu wrote:
> I've looked over the latest version of the net-shapers API.
> There is some conceptual overlap between this patchset and net-shapers'
> ability to define a group of device queues and manipulate its tx
> limits. But as far as I am aware ([1]), the net-shapers API doesn't
> intend to shape entities above netdev level.

It's not about the uAPI but about having a uniform way of representing
the shaping hierarchy.

> So there are two things to discuss here:
> 1. Integrating device-level TC shaping into net-shapers. The net-shapers
> model would need to be extended with the ability to define TC
> queues. At the moment I see it's concerned with device tx queues which
> don't necessarily map 1:1 to traffic classes.

What are "TC queues"? NIC queues with assigned TC? Your patches shape
on a group of VFs, so the equivalent would be a group of queues 
(e.g. group of queues assigned to a container).

> Then, it would need to have the ability to group TC queues into a node.


Cosmin Ratiu Dec. 13, 2024, 1:42 p.m. UTC | #7
On Wed, 2024-12-11 at 17:49 -0800, Jakub Kicinski wrote:
> On Wed, 11 Dec 2024 09:49:28 +0000 Cosmin Ratiu wrote:
> > I've looked over the latest version of the net-shapers API.
> > There is some conceptual overlap between this patchset and net-shapers'
> > ability to define a group of device queues and manipulate its tx
> > limits. But as far as I am aware ([1]), the net-shapers API doesn't
> > intend to shape entities above netdev level.
> 
> It's not about the uAPI but about having a uniform way of
> representing the shaping hierarchy.

I understand your point now.

> > So there are two things to discuss here:
> > 1. Integrating device-level TC shaping into net-shapers. The net-shapers
> > model would need to be extended with the ability to define TC
> > queues. At the moment I see it's concerned with device tx queues
> > which don't necessarily map 1:1 to traffic classes.
> 
> What are "TC queues"? NIC queues with assigned TC? Your patches shape
> on a group of VFs, so the equivalent would be a group of queues 
> (e.g. group of queues assigned to a container).

My terminology was slightly off. "TC queues" are a logical construct,
not necessarily corresponding to device queues. As far as I know,
packet traffic classes are determined with a variety of methods, and
can be encoded in the IP header (ToS) or as metadata in the tx
descriptor somewhere. I am not sure there's any correspondence with
device queues, although one could define specific queues for specific
traffic classes, I guess. The "TC queues" I was mentioning are a
logical representation of the packet flow and refer to the hardware's
ability to treat different TCs differently with HW scheduling elements.

> > Then, it would need to have the ability to group TC queues into a
> > node.
> 
>