Message ID | 20220620152647.2498927-1-dchumak@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | devlink rate police limiter | expand |
On Mon, 20 Jun 2022 18:26:42 +0300 Dima Chumak wrote: > Currently, kernel provides a way to limit tx rate of a VF via devlink > rate function of a port. The underlying mechanism is a shaper applied to > all traffic passing through the target VF or a group of VFs. By its > essence, a shaper naturally works with outbound traffic, and in > practice, it's rarely seen to be implemented for inbound traffic. > Nevertheless, there is a user request to have a mechanism for limiting > inbound traffic as well. It is usually done by using some form of > traffic policing, dropping excess packets over the configured limit that > set by a user. Thus, introducing another limiting mechanism to the port > function can help close this gap. > > This series introduces devlink attrs, along with their ops, to manage > rate policing of a single port as well as a port group. It is based on > the existing notion of leaf and node rate objects, and extends their > attributes to support both RX and TX limiting, for a number of packets > per second and/or a number of bytes per second. Additionally, there is a > second set of parameters for specifying the size of buffering performed, > called "burst", that controls the allowed level of spikes in traffic > before it starts getting dropped. > > A new sub-type of a devlink_rate object is introduced, called > "limit_type". It can be either "shaping", the default, or "police". > A single leaf or a node object can be switched from one limit type to > another, but it cannot do both types of rate limiting simultaneously. > A node and a leaf object that have parent-child relationship must have > the same limit type. In other words, it's only possible to group rate > objects of the same limit type as their group's limit_type. TC already has the police action. Your previous patches were accepted because there was no exact match for shaping / admission. Now you're "extending" that API to duplicate existing TC APIs. Infuriating.
On 6/20/22 10:04 PM, Jakub Kicinski wrote: > > On Mon, 20 Jun 2022 18:26:42 +0300 Dima Chumak wrote: >> Currently, kernel provides a way to limit tx rate of a VF via devlink >> rate function of a port. The underlying mechanism is a shaper applied to >> all traffic passing through the target VF or a group of VFs. By its >> essence, a shaper naturally works with outbound traffic, and in >> practice, it's rarely seen to be implemented for inbound traffic. >> Nevertheless, there is a user request to have a mechanism for limiting >> inbound traffic as well. It is usually done by using some form of >> traffic policing, dropping excess packets over the configured limit that >> set by a user. Thus, introducing another limiting mechanism to the port >> function can help close this gap. >> >> This series introduces devlink attrs, along with their ops, to manage >> rate policing of a single port as well as a port group. It is based on >> the existing notion of leaf and node rate objects, and extends their >> attributes to support both RX and TX limiting, for a number of packets >> per second and/or a number of bytes per second. Additionally, there is a >> second set of parameters for specifying the size of buffering performed, >> called "burst", that controls the allowed level of spikes in traffic >> before it starts getting dropped. >> >> A new sub-type of a devlink_rate object is introduced, called >> "limit_type". It can be either "shaping", the default, or "police". >> A single leaf or a node object can be switched from one limit type to >> another, but it cannot do both types of rate limiting simultaneously. >> A node and a leaf object that have parent-child relationship must have >> the same limit type. In other words, it's only possible to group rate >> objects of the same limit type as their group's limit_type. > > TC already has the police action. Your previous patches were accepted > because there was no exact match for shaping / admission. Now you're > "extending" that API to duplicate existing TC APIs. Infuriating. I'm sorry for not being able to reply promptly. I've re-read more carefully the cover letter of the original 'devlink: rate objects API' series by Dmytro Linkin, off of which I based my patches, though my understanding still might be incomplete/incorrect here. It seems that TC, being ingress only, doesn't cover the full spectrum of rate-limiting that's possible to achieve with devlink. TC works only with representors and doesn't allow to configure "the other side of the wire", where devlink port function seems to be a better match as it connects directly to a VF. Also, for the existing devlink-rate mechanism of VF grouping, it would be challenging to achieve similar functionality with TC flows, as groups don't have a net device instance where flows can be attached. I want to apologize in case my proposed changes have come across as being bluntly ignoring some of the pre-established agreements and understandings of TC / devlink responsibility separation, it wasn't intentional.
On Thu, 30 Jun 2022 17:27:08 +0200 Dima Chumak wrote: > I've re-read more carefully the cover letter of the original 'devlink: > rate objects API' series by Dmytro Linkin, off of which I based my > patches, though my understanding still might be incomplete/incorrect > here. > > It seems that TC, being ingress only, doesn't cover the full spectrum of > rate-limiting that's possible to achieve with devlink. TC works only > with representors and doesn't allow to configure "the other side of the > wire", where devlink port function seems to be a better match as it > connects directly to a VF. Right, but you are adding Rx and Tx now, IIUC, so you're venturing into the same "side of the wire" where tc lives. > Also, for the existing devlink-rate mechanism of VF grouping, it would be > challenging to achieve similar functionality with TC flows, as groups don't > have a net device instance where flows can be attached. You can share actions in TC. The hierarchical aspects may be more limited, not sure. > I want to apologize in case my proposed changes have come across as > being bluntly ignoring some of the pre-established agreements and > understandings of TC / devlink responsibility separation, it wasn't > intentional. Apologies, TBH I thought you're the same person I was arguing with last time. My objective is to avoid having multiple user space interfaces which drivers have to (a) support and (b) reconcile. We already have the VF rate limits in ip link, and in TC (which I believe is used by OvS offload). I presume you have a mlx5 implementation ready, so how do you reconcile those 3 APIs?
Thu, Jun 30, 2022 at 08:13:27PM CEST, kuba@kernel.org wrote: >On Thu, 30 Jun 2022 17:27:08 +0200 Dima Chumak wrote: >> I've re-read more carefully the cover letter of the original 'devlink: >> rate objects API' series by Dmytro Linkin, off of which I based my >> patches, though my understanding still might be incomplete/incorrect >> here. >> >> It seems that TC, being ingress only, doesn't cover the full spectrum of >> rate-limiting that's possible to achieve with devlink. TC works only >> with representors and doesn't allow to configure "the other side of the >> wire", where devlink port function seems to be a better match as it >> connects directly to a VF. > >Right, but you are adding Rx and Tx now, IIUC, so you're venturing into >the same "side of the wire" where tc lives. Wait. Lets draw the basic picture of "the wire": --------------------------+ +-------------------------- eswitch representor netdev|=====thewire====|function (vf/sf/whatever --------------------------+ +------------------------- Now the rate setting Dima is talking about, it is the configuration of the "function" side. Setting the rate is limitting the "function" TX/RX Note that this function could be of any type - netdev, rdma, vdpa, nvme. Configuring the TX/RX rate (including groupping) applies to all of these. Putting the configuration on the eswitch representor does not fit: 1) it is configuring the other side of the wire, the configuration should be of the eswitch port. Configuring the other side is confusing and misleading. For the purpose of configuring the "function" side, we introduced "port function" object in devlink. 2) it is confuguring netdev/ethernet however the confuguration applies to all queues of the function. > >> Also, for the existing devlink-rate mechanism of VF grouping, it would be >> challenging to achieve similar functionality with TC flows, as groups don't >> have a net device instance where flows can be attached. > >You can share actions in TC. The hierarchical aspects may be more >limited, not sure. > >> I want to apologize in case my proposed changes have come across as >> being bluntly ignoring some of the pre-established agreements and >> understandings of TC / devlink responsibility separation, it wasn't >> intentional. > >Apologies, TBH I thought you're the same person I was arguing with last >time. > >My objective is to avoid having multiple user space interfaces which >drivers have to (a) support and (b) reconcile. We already have the VF >rate limits in ip link, and in TC (which I believe is used by OvS >offload). > >I presume you have a mlx5 implementation ready, so how do you reconcile >those 3 APIs?
On Thu, 7 Jul 2022 13:20:12 +0200 Jiri Pirko wrote: > Wait. Lets draw the basic picture of "the wire": > > --------------------------+ +-------------------------- > eswitch representor netdev|=====thewire====|function (vf/sf/whatever > --------------------------+ +------------------------- > > Now the rate setting Dima is talking about, it is the configuration of > the "function" side. Setting the rate is limitting the "function" TX/RX > Note that this function could be of any type - netdev, rdma, vdpa, nvme. The patches add policing, are you saying we're gonna drop RDMA or NVMe I/O? > Configuring the TX/RX rate (including groupping) applies to all of > these. I don't understand why the "side of the wire" matters when the patches target both Rx and Tx. Surely that covers both directions. > Putting the configuration on the eswitch representor does not fit: > 1) it is configuring the other side of the wire, the configuration > should be of the eswitch port. Configuring the other side is > confusing and misleading. For the purpose of configuring the > "function" side, we introduced "port function" object in devlink. > 2) it is confuguring netdev/ethernet however the confuguration applies > to all queues of the function. If you think it's technically superior to put it in devlink that's fine. I'll repeat myself - what I'm asking for is convergence so that drivers don't have to implement 3 different ways of configuring this. We have devlink rate for from-VF direction shaping, tc police for bi-dir policing and obviously legacy NDOs. None of them translate between each other so drivers and user space have to juggle interfaces.
Thu, Jul 07, 2022 at 10:16:49PM CEST, kuba@kernel.org wrote: >On Thu, 7 Jul 2022 13:20:12 +0200 Jiri Pirko wrote: >> Wait. Lets draw the basic picture of "the wire": >> >> --------------------------+ +-------------------------- >> eswitch representor netdev|=====thewire====|function (vf/sf/whatever >> --------------------------+ +------------------------- >> >> Now the rate setting Dima is talking about, it is the configuration of >> the "function" side. Setting the rate is limitting the "function" TX/RX >> Note that this function could be of any type - netdev, rdma, vdpa, nvme. > >The patches add policing, are you saying we're gonna drop RDMA or NVMe >I/O? Well, there is some limit to the rate of VF anyway, so at some point, the packets need to be dropped, with or without policing. Not really sure how that is handled in rdma and nvme. > >> Configuring the TX/RX rate (including groupping) applies to all of >> these. > >I don't understand why the "side of the wire" matters when the patches >target both Rx and Tx. Surely that covers both directions. Hmm, I believe it really does. We have objects which we configure. There is a function object, which has some configuration (including this). Making user to configure function object via another object (eswitch port netdevice on the other side of the wire), is quite confusing and I feel it is wrong. The only reason is to somehow fit TC interface for which we don't have an anchor for port function. What about another configuration? would it be ok to use eswitch port netdev to configure port function too, if there is an interface for it? I believe not, that is why we introduced port function. > >> Putting the configuration on the eswitch representor does not fit: >> 1) it is configuring the other side of the wire, the configuration >> should be of the eswitch port. Configuring the other side is >> confusing and misleading. For the purpose of configuring the >> "function" side, we introduced "port function" object in devlink. >> 2) it is confuguring netdev/ethernet however the confuguration applies >> to all queues of the function. > >If you think it's technically superior to put it in devlink that's fine. >I'll repeat myself - what I'm asking for is convergence so that drivers >don't have to implement 3 different ways of configuring this. We have >devlink rate for from-VF direction shaping, tc police for bi-dir >policing and obviously legacy NDOs. None of them translate between each >other so drivers and user space have to juggle interfaces. The legacy ndo is legacy. Drivers that implement switchdev mode do not implement those, and should not.
Adding Michal On Fri, 8 Jul 2022 09:27:14 +0200 Jiri Pirko wrote: > >> Configuring the TX/RX rate (including groupping) applies to all of > >> these. > > > >I don't understand why the "side of the wire" matters when the patches > >target both Rx and Tx. Surely that covers both directions. > > Hmm, I believe it really does. We have objects which we configure. There > is a function object, which has some configuration (including this). > Making user to configure function object via another object (eswitch > port netdevice on the other side of the wire), is quite confusing and I > feel it is wrong. The only reason is to somehow fit TC interface for > which we don't have an anchor for port function. > > What about another configuration? would it be ok to use eswitch port > netdev to configure port function too, if there is an interface for it? > I believe not, that is why we introduced port function. I resisted the port function aberration as long as I could. It's a limitation of your design as far as I'm concerned. Switches use TC to configure egress queuing, that's our Linux model. Representor is the switch side, TC qdisc on it maps to the egress of the switch. I don't understand where the disconnect between us is, you know that's what mlxsw does.. > >> Putting the configuration on the eswitch representor does not fit: > >> 1) it is configuring the other side of the wire, the configuration > >> should be of the eswitch port. Configuring the other side is > >> confusing and misleading. For the purpose of configuring the > >> "function" side, we introduced "port function" object in devlink. > >> 2) it is confuguring netdev/ethernet however the confuguration applies > >> to all queues of the function. > > > >If you think it's technically superior to put it in devlink that's fine. > >I'll repeat myself - what I'm asking for is convergence so that drivers > >don't have to implement 3 different ways of configuring this. We have > >devlink rate for from-VF direction shaping, tc police for bi-dir > >policing and obviously legacy NDOs. None of them translate between each > >other so drivers and user space have to juggle interfaces. > > The legacy ndo is legacy. Drivers that implement switchdev mode do > not implement those, and should not. That's irrelevant - what I'm saying is that in practice drivers have to implement _all_ of these interfaces today. Just because they are not needed in eswitch mode doesn't mean the sales department won't find a customer who's happy with the non-switchdev mode and doesn't want to move.
Fri, Jul 08, 2022 at 08:05:35PM CEST, kuba@kernel.org wrote: >Adding Michal > >On Fri, 8 Jul 2022 09:27:14 +0200 Jiri Pirko wrote: >> >> Configuring the TX/RX rate (including groupping) applies to all of >> >> these. >> > >> >I don't understand why the "side of the wire" matters when the patches >> >target both Rx and Tx. Surely that covers both directions. >> >> Hmm, I believe it really does. We have objects which we configure. There >> is a function object, which has some configuration (including this). >> Making user to configure function object via another object (eswitch >> port netdevice on the other side of the wire), is quite confusing and I >> feel it is wrong. The only reason is to somehow fit TC interface for >> which we don't have an anchor for port function. >> >> What about another configuration? would it be ok to use eswitch port >> netdev to configure port function too, if there is an interface for it? >> I believe not, that is why we introduced port function. > >I resisted the port function aberration as long as I could. It's Why do you say "aberration"? It is a legitimate feature that is allowing to solve legitimate issues. Maybe I'm missing something. >a limitation of your design as far as I'm concerned. What do you mean? This is not related to us only. The need to work with port function (the other side of the wire) is definitelly nothing specific to mlx5 driver. > >Switches use TC to configure egress queuing, that's our Linux model. >Representor is the switch side, TC qdisc on it maps to the egress >of the switch. Sure. > >I don't understand where the disconnect between us is, you know that's >what mlxsw does.. No disconnect. mlxsw works like that. However, there is no VF/SF in mlxsw world. The other side of the wire is a different host. However in case of VF/SF, we also need to configure the other side of the wire, which we are orchestrating. That is the sole purpose of why we have devlink port function. And once we have such object, why is it incorrect to use it for the needed configuration? Okay, if you really feel that we need to reuse TC interface for this feature (however mismathing it might be), lets create a netdev for the port function to hook this to. But do we want such a beast? But to hook this to eswitch port representor seems to me plain wrong. > >> >> Putting the configuration on the eswitch representor does not fit: >> >> 1) it is configuring the other side of the wire, the configuration >> >> should be of the eswitch port. Configuring the other side is >> >> confusing and misleading. For the purpose of configuring the >> >> "function" side, we introduced "port function" object in devlink. >> >> 2) it is confuguring netdev/ethernet however the confuguration applies >> >> to all queues of the function. >> > >> >If you think it's technically superior to put it in devlink that's fine. >> >I'll repeat myself - what I'm asking for is convergence so that drivers >> >don't have to implement 3 different ways of configuring this. We have >> >devlink rate for from-VF direction shaping, tc police for bi-dir >> >policing and obviously legacy NDOs. None of them translate between each >> >other so drivers and user space have to juggle interfaces. >> >> The legacy ndo is legacy. Drivers that implement switchdev mode do >> not implement those, and should not. > >That's irrelevant - what I'm saying is that in practice drivers have to >implement _all_ of these interfaces today. Just because they are not >needed in eswitch mode doesn't mean the sales department won't find a >customer who's happy with the non-switchdev mode and doesn't want to >move.
On Sat, 9 Jul 2022 07:14:31 +0200 Jiri Pirko wrote: > >I resisted the port function aberration as long as I could. It's > > Why do you say "aberration"? It is a legitimate feature that is allowing > to solve legitimate issues. Maybe I'm missing something. From netdev perspective it's an implementation detail irrelevant to the user. The netdev model is complete without it. > >a limitation of your design as far as I'm concerned. > > What do you mean? This is not related to us only. The need to work with > port function (the other side of the wire) is definitelly nothing > specific to mlx5 driver. > > >Switches use TC to configure egress queuing, that's our Linux model. > >Representor is the switch side, TC qdisc on it maps to the egress > >of the switch. > > Sure. > > >I don't understand where the disconnect between us is, you know that's > >what mlxsw does.. > > No disconnect. mlxsw works like that. However, there is no VF/SF in > mlxsw world. The other side of the wire is a different host. > > However in case of VF/SF, we also need to configure the other side of > the wire, which we are orchestrating. That is the sole purpose of why we > have devlink port function. And once we have such object, why is it > incorrect to use it for the needed configuration? So the function conversation _is_ relevant here, eh? Sad but it is what it is. > Okay, if you really feel that we need to reuse TC interface for this > feature (however mismathing it might be), Not what I said, I'm not gonna say it the fourth time. > lets create a netdev for the port function to hook this to. But do we > want such a beast? But to hook this to eswitch port representor seems > to me plain wrong. I presume you're being facetious. Extra netdev is gonna help nothing. AFAIU the problem is that you want to control endpoints which are not ndevs with this API. Is that the main or only reason? Can we agree that it's legitimate but will result in muddying the netdev model (which in itself is good and complete)?
Mon, Jul 11, 2022 at 07:29:57PM CEST, kuba@kernel.org wrote: >On Sat, 9 Jul 2022 07:14:31 +0200 Jiri Pirko wrote: >> >I resisted the port function aberration as long as I could. It's >> >> Why do you say "aberration"? It is a legitimate feature that is allowing >> to solve legitimate issues. Maybe I'm missing something. > >From netdev perspective it's an implementation detail irrelevant >to the user. The netdev model is complete without it. Well it is a configuration of a device part out of the scope of netdev. So yes, netdev model is complete without it. But does does not mean we don't need such configuration. I may be missing your point. > >> >a limitation of your design as far as I'm concerned. >> >> What do you mean? This is not related to us only. The need to work with >> port function (the other side of the wire) is definitelly nothing >> specific to mlx5 driver. >> >> >Switches use TC to configure egress queuing, that's our Linux model. >> >Representor is the switch side, TC qdisc on it maps to the egress >> >of the switch. >> >> Sure. >> >> >I don't understand where the disconnect between us is, you know that's >> >what mlxsw does.. >> >> No disconnect. mlxsw works like that. However, there is no VF/SF in >> mlxsw world. The other side of the wire is a different host. >> >> However in case of VF/SF, we also need to configure the other side of >> the wire, which we are orchestrating. That is the sole purpose of why we >> have devlink port function. And once we have such object, why is it >> incorrect to use it for the needed configuration? > >So the function conversation _is_ relevant here, eh? Sad but it is what >it is. I'm not sure I follow what "function conversation" you mean. :/ > >> Okay, if you really feel that we need to reuse TC interface for this >> feature (however mismathing it might be), > >Not what I said, I'm not gonna say it the fourth time. Okay, sorry for being slow, but I still don't understand your point :/ > >> lets create a netdev for the port function to hook this to. But do we >> want such a beast? But to hook this to eswitch port representor seems >> to me plain wrong. > >I presume you're being facetious. Extra netdev is gonna help nothing. I'm somewhat am, yes. > >AFAIU the problem is that you want to control endpoints which are not >ndevs with this API. Is that the main or only reason? Can we agree that >it's legitimate but will result in muddying the netdev model (which in >itself is good and complete)? I don't think this has anything to do with netdev model. It is actually out of the scope of it, therefore there cannot be any mudding of it.
On Tue, 12 Jul 2022 08:03:40 +0200 Jiri Pirko wrote: > >AFAIU the problem is that you want to control endpoints which are not > >ndevs with this API. Is that the main or only reason? Can we agree that > >it's legitimate but will result in muddying the netdev model (which in > >itself is good and complete)? > > I don't think this has anything to do with netdev model. > It is actually out of the scope of it, therefore there cannot be any mudding of it. You should have decided that rate limiting was out of scope for netdev before we added tc qdisc and tc police support. Now those offloads are there, used by people and it's too late. If you want to create a common way to rate limit functions you must provide plumbing for the existing methods (at least tc police, preferably legacy NDO as well) to automatically populate the new API.
Wed, Jul 13, 2022 at 02:13:41AM CEST, kuba@kernel.org wrote: >On Tue, 12 Jul 2022 08:03:40 +0200 Jiri Pirko wrote: >> >AFAIU the problem is that you want to control endpoints which are not >> >ndevs with this API. Is that the main or only reason? Can we agree that >> >it's legitimate but will result in muddying the netdev model (which in >> >itself is good and complete)? >> >> I don't think this has anything to do with netdev model. >> It is actually out of the scope of it, therefore there cannot be any mudding of it. > >You should have decided that rate limiting was out of scope for netdev >before we added tc qdisc and tc police support. Now those offloads are >there, used by people and it's too late. > >If you want to create a common way to rate limit functions you must >provide plumbing for the existing methods (at least tc police, >preferably legacy NDO as well) to automatically populate the new API. Even if there is no netdevice to hook it to, because it does not exist? I have to be missing something, sorry :/
On Wed, 13 Jul 2022 07:04:04 +0200 Jiri Pirko wrote: > Wed, Jul 13, 2022 at 02:13:41AM CEST, kuba@kernel.org wrote: > >> I don't think this has anything to do with netdev model. > >> It is actually out of the scope of it, therefore there cannot be any mudding of it. > > > >You should have decided that rate limiting was out of scope for netdev > >before we added tc qdisc and tc police support. Now those offloads are > >there, used by people and it's too late. > > > >If you want to create a common way to rate limit functions you must > >provide plumbing for the existing methods (at least tc police, > >preferably legacy NDO as well) to automatically populate the new API. > > Even if there is no netdevice to hook it to, because it does not exist? > I have to be missing something, sorry :/ What I'm saying is that we can treat the devlink rate API as a "lower layer interface". A layer under the netdevs. That seems sensible and removes the API duplication which otherwise annoys me. We want drivers to only have to implement one API. So when user calls the legacy NDO API it should check if the device has devlink rate support, first, and try to translate the legacy request into devlink rate. Same for TC police as installed by the OvS offload feature that Simon knows far more about than I do. IIRC we use a combination of matchall and police to do shaping. That way drivers don't have to implement all three APIs, only devlink rate (four APIs if we count TC qdisc but I think only NFP uses that one and it has RED etc so that's too much). Does this help or am I still not making sense?
Wed, Jul 13, 2022 at 07:52:55PM CEST, kuba@kernel.org wrote: >On Wed, 13 Jul 2022 07:04:04 +0200 Jiri Pirko wrote: >> Wed, Jul 13, 2022 at 02:13:41AM CEST, kuba@kernel.org wrote: >> >> I don't think this has anything to do with netdev model. >> >> It is actually out of the scope of it, therefore there cannot be any mudding of it. >> > >> >You should have decided that rate limiting was out of scope for netdev >> >before we added tc qdisc and tc police support. Now those offloads are >> >there, used by people and it's too late. >> > >> >If you want to create a common way to rate limit functions you must >> >provide plumbing for the existing methods (at least tc police, >> >preferably legacy NDO as well) to automatically populate the new API. >> >> Even if there is no netdevice to hook it to, because it does not exist? >> I have to be missing something, sorry :/ > >What I'm saying is that we can treat the devlink rate API as a "lower >layer interface". A layer under the netdevs. That seems sensible and >removes the API duplication which otherwise annoys me. > >We want drivers to only have to implement one API. > >So when user calls the legacy NDO API it should check if the device has >devlink rate support, first, and try to translate the legacy request >into devlink rate. > >Same for TC police as installed by the OvS offload feature that Simon >knows far more about than I do. IIRC we use a combination of matchall >and police to do shaping. > >That way drivers don't have to implement all three APIs, only devlink >rate (four APIs if we count TC qdisc but I think only NFP uses that >one and it has RED etc so that's too much). > >Does this help or am I still not making sense? I think I got it now. But in our case, there is no change for the user, as the netdev does not exist. So user still uses devlink rate uapi as proposed by this patchset. Only internal kernel changes requested. Correct?
On Thu, 14 Jul 2022 06:55:05 +0200 Jiri Pirko wrote: > Wed, Jul 13, 2022 at 07:52:55PM CEST, kuba@kernel.org wrote: > >On Wed, 13 Jul 2022 07:04:04 +0200 Jiri Pirko wrote: > >> Even if there is no netdevice to hook it to, because it does not exist? > >> I have to be missing something, sorry :/ > > > >What I'm saying is that we can treat the devlink rate API as a "lower > >layer interface". A layer under the netdevs. That seems sensible and > >removes the API duplication which otherwise annoys me. > > > >We want drivers to only have to implement one API. > > > >So when user calls the legacy NDO API it should check if the device has > >devlink rate support, first, and try to translate the legacy request > >into devlink rate. > > > >Same for TC police as installed by the OvS offload feature that Simon > >knows far more about than I do. IIRC we use a combination of matchall > >and police to do shaping. > > > >That way drivers don't have to implement all three APIs, only devlink > >rate (four APIs if we count TC qdisc but I think only NFP uses that > >one and it has RED etc so that's too much). > > > >Does this help or am I still not making sense? > > I think I got it now. But in our case, there is no change for the user, > as the netdev does not exist. So user still uses devlink rate uapi as > proposed by this patchset. Only internal kernel changes requested. > Correct? Right. If the user wants to use devlink-rate directly they can. For legacy users the kernel will translate to devlink-rate. The point is for drivers to only have to implement devlink-rate.