[net-next,0/5] devlink rate police limiter

Message ID	20220620152647.2498927-1-dchumak@nvidia.com (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 12.22.5.234 as permitted sender) receiver=protection.outlook.com; client-ip=12.22.5.234; helo=mail.nvidia.com; pr=C From: Dima Chumak <dchumak@nvidia.com> To: Jakub Kicinski <kuba@kernel.org> CC: Jiri Pirko <jiri@nvidia.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Paolo Abeni <pabeni@redhat.com>, <netdev@vger.kernel.org>, Dima Chumak <dchumak@nvidia.com> Subject: [PATCH net-next 0/5] devlink rate police limiter Date: Mon, 20 Jun 2022 18:26:42 +0300 Message-ID: <20220620152647.2498927-1-dchumak@nvidia.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk
Series	devlink rate police limiter \| expand [net-next,0/5] devlink rate police limiter [net-next,1/5] devlink: Introduce limit_type attr for rate objects [net-next,2/5] devlink: Introduce police rate limit type [net-next,3/5] netdevsim: Support devlink rate limit_type police [net-next,4/5] selftest: netdevsim: Add devlink rate police sub-test [net-next,5/5] Documentation: devlink rate objects limit_type

Dima Chumak June 20, 2022, 3:26 p.m. UTC

Currently, kernel provides a way to limit tx rate of a VF via devlink
rate function of a port. The underlying mechanism is a shaper applied to
all traffic passing through the target VF or a group of VFs. By its
essence, a shaper naturally works with outbound traffic, and in
practice, it's rarely seen to be implemented for inbound traffic.
Nevertheless, there is a user request to have a mechanism for limiting
inbound traffic as well. It is usually done by using some form of
traffic policing, dropping excess packets over the configured limit that
set by a user. Thus, introducing another limiting mechanism to the port
function can help close this gap.

This series introduces devlink attrs, along with their ops, to manage
rate policing of a single port as well as a port group. It is based on
the existing notion of leaf and node rate objects, and extends their
attributes to support both RX and TX limiting, for a number of packets
per second and/or a number of bytes per second. Additionally, there is a
second set of parameters for specifying the size of buffering performed,
called "burst", that controls the allowed level of spikes in traffic
before it starts getting dropped.

A new sub-type of a devlink_rate object is introduced, called
"limit_type". It can be either "shaping", the default, or "police".
A single leaf or a node object can be switched from one limit type to
another, but it cannot do both types of rate limiting simultaneously.
A node and a leaf object that have parent-child relationship must have
the same limit type. In other words, it's only possible to group rate
objects of the same limit type as their group's limit_type.

devlink_ops extended with following callbacks:
- rate_{leaf|node}_tx_{burst|pkts|pkts_burst}_set
- rate_{leaf|node}_rx_{max|burst|pkts|pkts_burst}_set

UAPI provides:
- setting tx_{burst|pkts|pkts_burst} and rx_{max|burst|pkts|pkts_burst}
  of a rate object

Added devlink_rate police attrs support for netdevsim driver.

Issues/open questions:
- Current implementation requires a user to set both "rate" and "burst"
  parameters explicitly, in order to activate police rate limiting. For
  example, "rx_max 200Mbit rx_burst 16mb". Is it necessary to
  automagically deduce "burst" value when it's omitted by the user?
  For example when user only sets "rx_max 200Mbit".
- If answer is positive to the first question, at which level it's
  better to be done, at user-space iproute2, at kernel devlink core or
  at vendor driver that implements devlink_ops for police attrs?

CLI examples:

  $ devlink port function rate show
  netdevsim/netdevsim10/128: type leaf limit_type unset
  netdevsim/netdevsim10/129: type leaf limit_type unset
  netdevsim/netdevsim10/130: type leaf limit_type unset

  # Set police rate limiting of inbound traffic
  $ devlink port function rate set netdevsim/netdevsim10/128 \
            limit_type police rx_max 100mbit rx_burst 10mbit
  $ devlink port function rate show
  netdevsim/netdevsim10/128: type leaf limit_type police rx_max 100Mbit rx_burst 10485Kbit

  # Set shaping rate limiting of outbound traffic (default limit_type)
  $ devlink port function rate set netdevsim/netdevsim10/129 tx_max 200mbit
  $ devlink port function rate show
  netdevsim/netdevsim10/129: type leaf limit_type shaping tx_max 200Mbit

  # Attempt to set police attr with the default shaping limit_type
  $ devlink port function rate set netdevsim/netdevsim10/129 rx_max 400mbit
  Unsupported option "rx_max" for limit_type "shaping"

  # Set police rate attr for a port that already has active shaping
  $ devlink port function rate set netdevsim/netdevsim10/129 limit_type police rx_max 400mbit
  Error: devlink: Cannot change limit_type of the rate leaf object, reset current rate attributes first.
  kernel answers: Device or resource busy

  # Create a rate group
  $ devlink port function rate add netdevsim/netdevsim10/g1 \
            limit_type police rx_max 1Gbit
  $ devlink port function rate show
  netdevsim/netdevsim10/g1: type node limit_type police rx_max 1Gbit

  # Add port to the group
  $ devlink port function rate set netdevsim/netdevsim10/128 parent g1
  $ devlink port function rate show
  netdevsim/netdevsim10/g1: type node limit_type police rx_max 1Gbit
  netdevsim/netdevsim10/128: type leaf limit_type police rx_max 100Mbit rx_burst 10485Kbit parent g1
  netdevsim/netdevsim10/129: type leaf limit_type shaping tx_max 200Mbit
  netdevsim/netdevsim10/130: type leaf limit_type unset

  # Try to add a port with a non-matching limit_type to the group
  $ devlink port function rate set netdevsim/netdevsim10/129 parent g1
  Error: devlink: Parent and object should be of the same limit_type.
  kernel answers: Invalid argument

  # Adding a port with "unset" limit_type to a group inherits the
  # group's limit_type
  $ devlink port function rate set netdevsim/netdevsim10/130 parent g1
  $ devlink port function rate show
  netdevsim/netdevsim10/130: type leaf limit_type police parent g1

  # Set all police parameters
  $ devlink port func rate set netdevsim/netdevsim10/130 \
            limit_type police tx_max 10GBps tx_burst 1gb \
                              rx_max 25GBps rx_burst 2gb \
                              tx_pkts 10000 tx_pkts_burst 1gb \
                              rx_pkts 20000 rx_pkts_burst 2gb

Dima Chumak (5):
  devlink: Introduce limit_type attr for rate objects
  devlink: Introduce police rate limit type
  netdevsim: Support devlink rate limit_type police
  selftest: netdevsim: Add devlink rate police sub-test
  Documentation: devlink rate objects limit_type

 .../networking/devlink/devlink-port.rst       |  44 ++-
 .../networking/devlink/netdevsim.rst          |   3 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c |  28 +-
 drivers/net/netdevsim/dev.c                   | 211 ++++++++++-
 drivers/net/netdevsim/netdevsim.h             |  11 +-
 include/net/devlink.h                         |  52 ++-
 include/uapi/linux/devlink.h                  |  15 +
 net/core/devlink.c                            | 336 ++++++++++++++++--
 .../drivers/net/netdevsim/devlink.sh          | 215 ++++++++++-
 9 files changed, 853 insertions(+), 62 deletions(-)

Jakub Kicinski June 20, 2022, 8:04 p.m. UTC | #1

On Mon, 20 Jun 2022 18:26:42 +0300 Dima Chumak wrote:
> Currently, kernel provides a way to limit tx rate of a VF via devlink
> rate function of a port. The underlying mechanism is a shaper applied to
> all traffic passing through the target VF or a group of VFs. By its
> essence, a shaper naturally works with outbound traffic, and in
> practice, it's rarely seen to be implemented for inbound traffic.
> Nevertheless, there is a user request to have a mechanism for limiting
> inbound traffic as well. It is usually done by using some form of
> traffic policing, dropping excess packets over the configured limit that
> set by a user. Thus, introducing another limiting mechanism to the port
> function can help close this gap.
> 
> This series introduces devlink attrs, along with their ops, to manage
> rate policing of a single port as well as a port group. It is based on
> the existing notion of leaf and node rate objects, and extends their
> attributes to support both RX and TX limiting, for a number of packets
> per second and/or a number of bytes per second. Additionally, there is a
> second set of parameters for specifying the size of buffering performed,
> called "burst", that controls the allowed level of spikes in traffic
> before it starts getting dropped.
> 
> A new sub-type of a devlink_rate object is introduced, called
> "limit_type". It can be either "shaping", the default, or "police".
> A single leaf or a node object can be switched from one limit type to
> another, but it cannot do both types of rate limiting simultaneously.
> A node and a leaf object that have parent-child relationship must have
> the same limit type. In other words, it's only possible to group rate
> objects of the same limit type as their group's limit_type.

TC already has the police action. Your previous patches were accepted
because there was no exact match for shaping / admission. Now you're 
"extending" that API to duplicate existing TC APIs. Infuriating.

Dima Chumak June 30, 2022, 3:27 p.m. UTC | #2

On 6/20/22 10:04 PM, Jakub Kicinski wrote:
> 
> On Mon, 20 Jun 2022 18:26:42 +0300 Dima Chumak wrote:
>> Currently, kernel provides a way to limit tx rate of a VF via devlink
>> rate function of a port. The underlying mechanism is a shaper applied to
>> all traffic passing through the target VF or a group of VFs. By its
>> essence, a shaper naturally works with outbound traffic, and in
>> practice, it's rarely seen to be implemented for inbound traffic.
>> Nevertheless, there is a user request to have a mechanism for limiting
>> inbound traffic as well. It is usually done by using some form of
>> traffic policing, dropping excess packets over the configured limit that
>> set by a user. Thus, introducing another limiting mechanism to the port
>> function can help close this gap.
>>
>> This series introduces devlink attrs, along with their ops, to manage
>> rate policing of a single port as well as a port group. It is based on
>> the existing notion of leaf and node rate objects, and extends their
>> attributes to support both RX and TX limiting, for a number of packets
>> per second and/or a number of bytes per second. Additionally, there is a
>> second set of parameters for specifying the size of buffering performed,
>> called "burst", that controls the allowed level of spikes in traffic
>> before it starts getting dropped.
>>
>> A new sub-type of a devlink_rate object is introduced, called
>> "limit_type". It can be either "shaping", the default, or "police".
>> A single leaf or a node object can be switched from one limit type to
>> another, but it cannot do both types of rate limiting simultaneously.
>> A node and a leaf object that have parent-child relationship must have
>> the same limit type. In other words, it's only possible to group rate
>> objects of the same limit type as their group's limit_type.
> 
> TC already has the police action. Your previous patches were accepted
> because there was no exact match for shaping / admission. Now you're
> "extending" that API to duplicate existing TC APIs. Infuriating.

I'm sorry for not being able to reply promptly.

I've re-read more carefully the cover letter of the original 'devlink:
rate objects API' series by Dmytro Linkin, off of which I based my
patches, though my understanding still might be incomplete/incorrect
here.

It seems that TC, being ingress only, doesn't cover the full spectrum of
rate-limiting that's possible to achieve with devlink. TC works only
with representors and doesn't allow to configure "the other side of the
wire", where devlink port function seems to be a better match as it
connects directly to a VF.

Also, for the existing devlink-rate mechanism of VF grouping, it would be
challenging to achieve similar functionality with TC flows, as groups don't
have a net device instance where flows can be attached.

I want to apologize in case my proposed changes have come across as
being bluntly ignoring some of the pre-established agreements and
understandings of TC / devlink responsibility separation, it wasn't
intentional.

Jakub Kicinski June 30, 2022, 6:13 p.m. UTC | #3

On Thu, 30 Jun 2022 17:27:08 +0200 Dima Chumak wrote:
> I've re-read more carefully the cover letter of the original 'devlink:
> rate objects API' series by Dmytro Linkin, off of which I based my
> patches, though my understanding still might be incomplete/incorrect
> here.
> 
> It seems that TC, being ingress only, doesn't cover the full spectrum of
> rate-limiting that's possible to achieve with devlink. TC works only
> with representors and doesn't allow to configure "the other side of the
> wire", where devlink port function seems to be a better match as it
> connects directly to a VF.

Right, but you are adding Rx and Tx now, IIUC, so you're venturing into
the same "side of the wire" where tc lives.

> Also, for the existing devlink-rate mechanism of VF grouping, it would be
> challenging to achieve similar functionality with TC flows, as groups don't
> have a net device instance where flows can be attached.

You can share actions in TC. The hierarchical aspects may be more
limited, not sure.

> I want to apologize in case my proposed changes have come across as
> being bluntly ignoring some of the pre-established agreements and
> understandings of TC / devlink responsibility separation, it wasn't
> intentional.

Apologies, TBH I thought you're the same person I was arguing with last
time.

My objective is to avoid having multiple user space interfaces which 
drivers have to (a) support and (b) reconcile. We already have the VF 
rate limits in ip link, and in TC (which I believe is used by OvS
offload). 

I presume you have a mlx5 implementation ready, so how do you reconcile
those 3 APIs?

Jiri Pirko July 7, 2022, 11:20 a.m. UTC | #4

Thu, Jun 30, 2022 at 08:13:27PM CEST, kuba@kernel.org wrote:
>On Thu, 30 Jun 2022 17:27:08 +0200 Dima Chumak wrote:
>> I've re-read more carefully the cover letter of the original 'devlink:
>> rate objects API' series by Dmytro Linkin, off of which I based my
>> patches, though my understanding still might be incomplete/incorrect
>> here.
>> 
>> It seems that TC, being ingress only, doesn't cover the full spectrum of
>> rate-limiting that's possible to achieve with devlink. TC works only
>> with representors and doesn't allow to configure "the other side of the
>> wire", where devlink port function seems to be a better match as it
>> connects directly to a VF.
>
>Right, but you are adding Rx and Tx now, IIUC, so you're venturing into
>the same "side of the wire" where tc lives.

Wait. Lets draw the basic picture of "the wire":

--------------------------+                +--------------------------
eswitch representor netdev|=====thewire====|function (vf/sf/whatever
--------------------------+                +-------------------------

Now the rate setting Dima is talking about, it is the configuration of
the "function" side. Setting the rate is limitting the "function" TX/RX
Note that this function could be of any type - netdev, rdma, vdpa, nvme.
Configuring the TX/RX rate (including groupping) applies to all of
these.

Putting the configuration on the eswitch representor does not fit:
1) it is configuring the other side of the wire, the configuration
   should be of the eswitch port. Configuring the other side is
   confusing and misleading. For the purpose of configuring the
   "function" side, we introduced "port function" object in devlink.
2) it is confuguring netdev/ethernet however the confuguration applies
   to all queues of the function.


>
>> Also, for the existing devlink-rate mechanism of VF grouping, it would be
>> challenging to achieve similar functionality with TC flows, as groups don't
>> have a net device instance where flows can be attached.
>
>You can share actions in TC. The hierarchical aspects may be more
>limited, not sure.
>
>> I want to apologize in case my proposed changes have come across as
>> being bluntly ignoring some of the pre-established agreements and
>> understandings of TC / devlink responsibility separation, it wasn't
>> intentional.
>
>Apologies, TBH I thought you're the same person I was arguing with last
>time.
>
>My objective is to avoid having multiple user space interfaces which 
>drivers have to (a) support and (b) reconcile. We already have the VF 
>rate limits in ip link, and in TC (which I believe is used by OvS
>offload). 
>
>I presume you have a mlx5 implementation ready, so how do you reconcile
>those 3 APIs?

Jakub Kicinski July 7, 2022, 8:16 p.m. UTC | #5

On Thu, 7 Jul 2022 13:20:12 +0200 Jiri Pirko wrote:
> Wait. Lets draw the basic picture of "the wire":
> 
> --------------------------+                +--------------------------
> eswitch representor netdev|=====thewire====|function (vf/sf/whatever
> --------------------------+                +-------------------------
> 
> Now the rate setting Dima is talking about, it is the configuration of
> the "function" side. Setting the rate is limitting the "function" TX/RX
> Note that this function could be of any type - netdev, rdma, vdpa, nvme.

The patches add policing, are you saying we're gonna drop RDMA or NVMe
I/O?

> Configuring the TX/RX rate (including groupping) applies to all of
> these.

I don't understand why the "side of the wire" matters when the patches
target both Rx and Tx. Surely that covers both directions.

> Putting the configuration on the eswitch representor does not fit:
> 1) it is configuring the other side of the wire, the configuration
>    should be of the eswitch port. Configuring the other side is
>    confusing and misleading. For the purpose of configuring the
>    "function" side, we introduced "port function" object in devlink.
> 2) it is confuguring netdev/ethernet however the confuguration applies
>    to all queues of the function.

If you think it's technically superior to put it in devlink that's fine.
I'll repeat myself - what I'm asking for is convergence so that drivers
don't have  to implement 3 different ways of configuring this. We have
devlink rate for from-VF direction shaping, tc police for bi-dir
policing and obviously legacy NDOs. None of them translate between each
other so drivers and user space have to juggle interfaces.

Jiri Pirko July 8, 2022, 7:27 a.m. UTC | #6

Thu, Jul 07, 2022 at 10:16:49PM CEST, kuba@kernel.org wrote:
>On Thu, 7 Jul 2022 13:20:12 +0200 Jiri Pirko wrote:
>> Wait. Lets draw the basic picture of "the wire":
>> 
>> --------------------------+                +--------------------------
>> eswitch representor netdev|=====thewire====|function (vf/sf/whatever
>> --------------------------+                +-------------------------
>> 
>> Now the rate setting Dima is talking about, it is the configuration of
>> the "function" side. Setting the rate is limitting the "function" TX/RX
>> Note that this function could be of any type - netdev, rdma, vdpa, nvme.
>
>The patches add policing, are you saying we're gonna drop RDMA or NVMe
>I/O?

Well, there is some limit to the rate of VF anyway, so at some point,
the packets need to be dropped, with or without policing.
Not really sure how that is handled in rdma and nvme.

>
>> Configuring the TX/RX rate (including groupping) applies to all of
>> these.
>
>I don't understand why the "side of the wire" matters when the patches
>target both Rx and Tx. Surely that covers both directions.

Hmm, I believe it really does. We have objects which we configure. There
is a function object, which has some configuration (including this).
Making user to configure function object via another object (eswitch
port netdevice on the other side of the wire), is quite confusing and I
feel it is wrong. The only reason is to somehow fit TC interface for
which we don't have an anchor for port function.

What about another configuration? would it be ok to use eswitch port
netdev to configure port function too, if there is an interface for it?
I believe not, that is why we introduced port function.

>
>> Putting the configuration on the eswitch representor does not fit:
>> 1) it is configuring the other side of the wire, the configuration
>>    should be of the eswitch port. Configuring the other side is
>>    confusing and misleading. For the purpose of configuring the
>>    "function" side, we introduced "port function" object in devlink.
>> 2) it is confuguring netdev/ethernet however the confuguration applies
>>    to all queues of the function.
>
>If you think it's technically superior to put it in devlink that's fine.
>I'll repeat myself - what I'm asking for is convergence so that drivers
>don't have  to implement 3 different ways of configuring this. We have
>devlink rate for from-VF direction shaping, tc police for bi-dir
>policing and obviously legacy NDOs. None of them translate between each
>other so drivers and user space have to juggle interfaces.

The legacy ndo is legacy. Drivers that implement switchdev mode do
not implement those, and should not.

Jakub Kicinski July 8, 2022, 6:05 p.m. UTC | #7

Adding Michal

On Fri, 8 Jul 2022 09:27:14 +0200 Jiri Pirko wrote:
> >> Configuring the TX/RX rate (including groupping) applies to all of
> >> these.  
> >
> >I don't understand why the "side of the wire" matters when the patches
> >target both Rx and Tx. Surely that covers both directions.  
> 
> Hmm, I believe it really does. We have objects which we configure. There
> is a function object, which has some configuration (including this).
> Making user to configure function object via another object (eswitch
> port netdevice on the other side of the wire), is quite confusing and I
> feel it is wrong. The only reason is to somehow fit TC interface for
> which we don't have an anchor for port function.
> 
> What about another configuration? would it be ok to use eswitch port
> netdev to configure port function too, if there is an interface for it?
> I believe not, that is why we introduced port function.

I resisted the port function aberration as long as I could. It's 
a limitation of your design as far as I'm concerned.

Switches use TC to configure egress queuing, that's our Linux model.
Representor is the switch side, TC qdisc on it maps to the egress
of the switch.

I don't understand where the disconnect between us is, you know that's
what mlxsw does..

> >> Putting the configuration on the eswitch representor does not fit:
> >> 1) it is configuring the other side of the wire, the configuration
> >>    should be of the eswitch port. Configuring the other side is
> >>    confusing and misleading. For the purpose of configuring the
> >>    "function" side, we introduced "port function" object in devlink.
> >> 2) it is confuguring netdev/ethernet however the confuguration applies
> >>    to all queues of the function.  
> >
> >If you think it's technically superior to put it in devlink that's fine.
> >I'll repeat myself - what I'm asking for is convergence so that drivers
> >don't have  to implement 3 different ways of configuring this. We have
> >devlink rate for from-VF direction shaping, tc police for bi-dir
> >policing and obviously legacy NDOs. None of them translate between each
> >other so drivers and user space have to juggle interfaces.  
> 
> The legacy ndo is legacy. Drivers that implement switchdev mode do
> not implement those, and should not.

That's irrelevant - what I'm saying is that in practice drivers have to
implement _all_ of these interfaces today. Just because they are not
needed in eswitch mode doesn't mean the sales department won't find a
customer who's happy with the non-switchdev mode and doesn't want to
move.

Jiri Pirko July 9, 2022, 5:14 a.m. UTC | #8

Fri, Jul 08, 2022 at 08:05:35PM CEST, kuba@kernel.org wrote:
>Adding Michal
>
>On Fri, 8 Jul 2022 09:27:14 +0200 Jiri Pirko wrote:
>> >> Configuring the TX/RX rate (including groupping) applies to all of
>> >> these.  
>> >
>> >I don't understand why the "side of the wire" matters when the patches
>> >target both Rx and Tx. Surely that covers both directions.  
>> 
>> Hmm, I believe it really does. We have objects which we configure. There
>> is a function object, which has some configuration (including this).
>> Making user to configure function object via another object (eswitch
>> port netdevice on the other side of the wire), is quite confusing and I
>> feel it is wrong. The only reason is to somehow fit TC interface for
>> which we don't have an anchor for port function.
>> 
>> What about another configuration? would it be ok to use eswitch port
>> netdev to configure port function too, if there is an interface for it?
>> I believe not, that is why we introduced port function.
>
>I resisted the port function aberration as long as I could. It's 

Why do you say "aberration"? It is a legitimate feature that is allowing
to solve legitimate issues. Maybe I'm missing something.


>a limitation of your design as far as I'm concerned.

What do you mean? This is not related to us only. The need to work with
port function (the other side of the wire) is definitelly nothing
specific to mlx5 driver.


>
>Switches use TC to configure egress queuing, that's our Linux model.
>Representor is the switch side, TC qdisc on it maps to the egress
>of the switch.

Sure.

>
>I don't understand where the disconnect between us is, you know that's
>what mlxsw does..

No disconnect. mlxsw works like that. However, there is no VF/SF in
mlxsw world. The other side of the wire is a different host.

However in case of VF/SF, we also need to configure the other side of
the wire, which we are orchestrating. That is the sole purpose of why we
have devlink port function. And once we have such object, why is it
incorrect to use it for the needed configuration?

Okay, if you really feel that we need to reuse TC interface for this
feature (however mismathing it might be), lets create a netdev for the
port function to hook this to. But do we want such a beast? But to hook
this to eswitch port representor seems to me plain wrong.


>
>> >> Putting the configuration on the eswitch representor does not fit:
>> >> 1) it is configuring the other side of the wire, the configuration
>> >>    should be of the eswitch port. Configuring the other side is
>> >>    confusing and misleading. For the purpose of configuring the
>> >>    "function" side, we introduced "port function" object in devlink.
>> >> 2) it is confuguring netdev/ethernet however the confuguration applies
>> >>    to all queues of the function.  
>> >
>> >If you think it's technically superior to put it in devlink that's fine.
>> >I'll repeat myself - what I'm asking for is convergence so that drivers
>> >don't have  to implement 3 different ways of configuring this. We have
>> >devlink rate for from-VF direction shaping, tc police for bi-dir
>> >policing and obviously legacy NDOs. None of them translate between each
>> >other so drivers and user space have to juggle interfaces.  
>> 
>> The legacy ndo is legacy. Drivers that implement switchdev mode do
>> not implement those, and should not.
>
>That's irrelevant - what I'm saying is that in practice drivers have to
>implement _all_ of these interfaces today. Just because they are not
>needed in eswitch mode doesn't mean the sales department won't find a
>customer who's happy with the non-switchdev mode and doesn't want to
>move.

Jakub Kicinski July 11, 2022, 5:29 p.m. UTC | #9

On Sat, 9 Jul 2022 07:14:31 +0200 Jiri Pirko wrote:
> >I resisted the port function aberration as long as I could. It's   
> 
> Why do you say "aberration"? It is a legitimate feature that is allowing
> to solve legitimate issues. Maybe I'm missing something.

From netdev perspective it's an implementation detail irrelevant 
to the user. The netdev model is complete without it.

> >a limitation of your design as far as I'm concerned.  
> 
> What do you mean? This is not related to us only. The need to work with
> port function (the other side of the wire) is definitelly nothing
> specific to mlx5 driver.
>
> >Switches use TC to configure egress queuing, that's our Linux model.
> >Representor is the switch side, TC qdisc on it maps to the egress
> >of the switch.  
> 
> Sure.
>
> >I don't understand where the disconnect between us is, you know that's
> >what mlxsw does..  
> 
> No disconnect. mlxsw works like that. However, there is no VF/SF in
> mlxsw world. The other side of the wire is a different host.
> 
> However in case of VF/SF, we also need to configure the other side of
> the wire, which we are orchestrating. That is the sole purpose of why we
> have devlink port function. And once we have such object, why is it
> incorrect to use it for the needed configuration?

So the function conversation _is_ relevant here, eh? Sad but it is what
it is.

> Okay, if you really feel that we need to reuse TC interface for this
> feature (however mismathing it might be),

Not what I said, I'm not gonna say it the fourth time.

> lets create a netdev for the port function to hook this to. But do we
> want such a beast? But to hook this to eswitch port representor seems
> to me plain wrong.

I presume you're being facetious. Extra netdev is gonna help nothing. 

AFAIU the problem is that you want to control endpoints which are not
ndevs with this API. Is that the main or only reason? Can we agree that
it's legitimate but will result in muddying the netdev model (which in
itself is good and complete)?

Jiri Pirko July 12, 2022, 6:03 a.m. UTC | #10

Mon, Jul 11, 2022 at 07:29:57PM CEST, kuba@kernel.org wrote:
>On Sat, 9 Jul 2022 07:14:31 +0200 Jiri Pirko wrote:
>> >I resisted the port function aberration as long as I could. It's   
>> 
>> Why do you say "aberration"? It is a legitimate feature that is allowing
>> to solve legitimate issues. Maybe I'm missing something.
>
>From netdev perspective it's an implementation detail irrelevant 
>to the user. The netdev model is complete without it.

Well it is a configuration of a device part out of the scope of netdev.
So yes, netdev model is complete without it. But does does not mean we
don't need such configuration. I may be missing your point.


>
>> >a limitation of your design as far as I'm concerned.  
>> 
>> What do you mean? This is not related to us only. The need to work with
>> port function (the other side of the wire) is definitelly nothing
>> specific to mlx5 driver.
>>
>> >Switches use TC to configure egress queuing, that's our Linux model.
>> >Representor is the switch side, TC qdisc on it maps to the egress
>> >of the switch.  
>> 
>> Sure.
>>
>> >I don't understand where the disconnect between us is, you know that's
>> >what mlxsw does..  
>> 
>> No disconnect. mlxsw works like that. However, there is no VF/SF in
>> mlxsw world. The other side of the wire is a different host.
>> 
>> However in case of VF/SF, we also need to configure the other side of
>> the wire, which we are orchestrating. That is the sole purpose of why we
>> have devlink port function. And once we have such object, why is it
>> incorrect to use it for the needed configuration?
>
>So the function conversation _is_ relevant here, eh? Sad but it is what
>it is.

I'm not sure I follow what "function conversation" you mean. :/


>
>> Okay, if you really feel that we need to reuse TC interface for this
>> feature (however mismathing it might be),
>
>Not what I said, I'm not gonna say it the fourth time.

Okay, sorry for being slow, but I still don't understand your point :/


>
>> lets create a netdev for the port function to hook this to. But do we
>> want such a beast? But to hook this to eswitch port representor seems
>> to me plain wrong.
>
>I presume you're being facetious. Extra netdev is gonna help nothing. 

I'm somewhat am, yes.


>
>AFAIU the problem is that you want to control endpoints which are not
>ndevs with this API. Is that the main or only reason? Can we agree that
>it's legitimate but will result in muddying the netdev model (which in
>itself is good and complete)?

I don't think this has anything to do with netdev model. It is actually
out of the scope of it, therefore there cannot be any mudding of it.

Jakub Kicinski July 13, 2022, 12:13 a.m. UTC | #11

On Tue, 12 Jul 2022 08:03:40 +0200 Jiri Pirko wrote:
> >AFAIU the problem is that you want to control endpoints which are not
> >ndevs with this API. Is that the main or only reason? Can we agree that
> >it's legitimate but will result in muddying the netdev model (which in
> >itself is good and complete)?  
> 
> I don't think this has anything to do with netdev model. 
> It is actually out of the scope of it, therefore there cannot be any mudding of it.

You should have decided that rate limiting was out of scope for netdev
before we added tc qdisc and tc police support. Now those offloads are
there, used by people and it's too late.

If you want to create a common way to rate limit functions you must
provide plumbing for the existing methods (at least tc police,
preferably legacy NDO as well) to automatically populate the new API.

Jiri Pirko July 13, 2022, 5:04 a.m. UTC | #12

Wed, Jul 13, 2022 at 02:13:41AM CEST, kuba@kernel.org wrote:
>On Tue, 12 Jul 2022 08:03:40 +0200 Jiri Pirko wrote:
>> >AFAIU the problem is that you want to control endpoints which are not
>> >ndevs with this API. Is that the main or only reason? Can we agree that
>> >it's legitimate but will result in muddying the netdev model (which in
>> >itself is good and complete)?  
>> 
>> I don't think this has anything to do with netdev model. 
>> It is actually out of the scope of it, therefore there cannot be any mudding of it.
>
>You should have decided that rate limiting was out of scope for netdev
>before we added tc qdisc and tc police support. Now those offloads are
>there, used by people and it's too late.
>
>If you want to create a common way to rate limit functions you must
>provide plumbing for the existing methods (at least tc police,
>preferably legacy NDO as well) to automatically populate the new API.

Even if there is no netdevice to hook it to, because it does not exist?
I have to be missing something, sorry :/

Jakub Kicinski July 13, 2022, 5:52 p.m. UTC | #13

On Wed, 13 Jul 2022 07:04:04 +0200 Jiri Pirko wrote:
> Wed, Jul 13, 2022 at 02:13:41AM CEST, kuba@kernel.org wrote:
> >> I don't think this has anything to do with netdev model. 
> >> It is actually out of the scope of it, therefore there cannot be any mudding of it.  
> >
> >You should have decided that rate limiting was out of scope for netdev
> >before we added tc qdisc and tc police support. Now those offloads are
> >there, used by people and it's too late.
> >
> >If you want to create a common way to rate limit functions you must
> >provide plumbing for the existing methods (at least tc police,
> >preferably legacy NDO as well) to automatically populate the new API.  
> 
> Even if there is no netdevice to hook it to, because it does not exist?
> I have to be missing something, sorry :/

What I'm saying is that we can treat the devlink rate API as a "lower
layer interface". A layer under the netdevs. That seems sensible and
removes the API duplication which otherwise annoys me.

We want drivers to only have to implement one API.

So when user calls the legacy NDO API it should check if the device has
devlink rate support, first, and try to translate the legacy request
into devlink rate.

Same for TC police as installed by the OvS offload feature that Simon
knows far more about than I do. IIRC we use a combination of matchall
and police to do shaping.

That way drivers don't have to implement all three APIs, only devlink
rate (four APIs if we count TC qdisc but I think only NFP uses that
one and it has RED etc so that's too much).

Does this help or am I still not making sense?

Jiri Pirko July 14, 2022, 4:55 a.m. UTC | #14

Wed, Jul 13, 2022 at 07:52:55PM CEST, kuba@kernel.org wrote:
>On Wed, 13 Jul 2022 07:04:04 +0200 Jiri Pirko wrote:
>> Wed, Jul 13, 2022 at 02:13:41AM CEST, kuba@kernel.org wrote:
>> >> I don't think this has anything to do with netdev model. 
>> >> It is actually out of the scope of it, therefore there cannot be any mudding of it.  
>> >
>> >You should have decided that rate limiting was out of scope for netdev
>> >before we added tc qdisc and tc police support. Now those offloads are
>> >there, used by people and it's too late.
>> >
>> >If you want to create a common way to rate limit functions you must
>> >provide plumbing for the existing methods (at least tc police,
>> >preferably legacy NDO as well) to automatically populate the new API.  
>> 
>> Even if there is no netdevice to hook it to, because it does not exist?
>> I have to be missing something, sorry :/
>
>What I'm saying is that we can treat the devlink rate API as a "lower
>layer interface". A layer under the netdevs. That seems sensible and
>removes the API duplication which otherwise annoys me.
>
>We want drivers to only have to implement one API.
>
>So when user calls the legacy NDO API it should check if the device has
>devlink rate support, first, and try to translate the legacy request
>into devlink rate.
>
>Same for TC police as installed by the OvS offload feature that Simon
>knows far more about than I do. IIRC we use a combination of matchall
>and police to do shaping.
>
>That way drivers don't have to implement all three APIs, only devlink
>rate (four APIs if we count TC qdisc but I think only NFP uses that
>one and it has RED etc so that's too much).
>
>Does this help or am I still not making sense?

I think I got it now. But in our case, there is no change for the user,
as the netdev does not exist. So user still uses devlink rate uapi as
proposed by this patchset. Only internal kernel changes requested.
Correct?

Jakub Kicinski July 14, 2022, 4:07 p.m. UTC | #15

On Thu, 14 Jul 2022 06:55:05 +0200 Jiri Pirko wrote:
> Wed, Jul 13, 2022 at 07:52:55PM CEST, kuba@kernel.org wrote:
> >On Wed, 13 Jul 2022 07:04:04 +0200 Jiri Pirko wrote:  
> >> Even if there is no netdevice to hook it to, because it does not exist?
> >> I have to be missing something, sorry :/  
> >
> >What I'm saying is that we can treat the devlink rate API as a "lower
> >layer interface". A layer under the netdevs. That seems sensible and
> >removes the API duplication which otherwise annoys me.
> >
> >We want drivers to only have to implement one API.
> >
> >So when user calls the legacy NDO API it should check if the device has
> >devlink rate support, first, and try to translate the legacy request
> >into devlink rate.
> >
> >Same for TC police as installed by the OvS offload feature that Simon
> >knows far more about than I do. IIRC we use a combination of matchall
> >and police to do shaping.
> >
> >That way drivers don't have to implement all three APIs, only devlink
> >rate (four APIs if we count TC qdisc but I think only NFP uses that
> >one and it has RED etc so that's too much).
> >
> >Does this help or am I still not making sense?  
> 
> I think I got it now. But in our case, there is no change for the user,
> as the netdev does not exist. So user still uses devlink rate uapi as
> proposed by this patchset. Only internal kernel changes requested.
> Correct?

Right. If the user wants to use devlink-rate directly they can.
For legacy users the kernel will translate to devlink-rate.
The point is for drivers to only have to implement devlink-rate.

[net-next,0/5] devlink rate police limiter

Message

Comments