Message ID: 20230822034003.31628-1-wenjun1.wu@intel.com
Series: iavf: Add devlink and devlink rate support
Tue, Aug 22, 2023 at 05:39:58AM CEST, wenjun1.wu@intel.com wrote:
>To allow user to configure queue bandwidth, devlink port support
>is added to support devlink port rate API. [1]
>
>Add devlink framework registration/unregistration on iavf driver
>initialization and remove, and devlink port of DEVLINK_PORT_FLAVOUR_VIRTUAL
>is created to be associated iavf netdevice.
>
>iavf rate tree with root node, queue nodes, and leaf node is created
>and registered with devlink rate when iavf adapter is configured, and
>if PF indicates support of VIRTCHNL_VF_OFFLOAD_QOS through VF Resource /
>Capability Exchange.

NACK! Port function is there to configure the VF/SF from the eswitch
side. Yet you use it for the configureation of the actual VF, which is
clear misuse. Please don't

>
>[root@localhost ~]# devlink port function rate show
>pci/0000:af:01.0/txq_15: type node parent iavf_root
>pci/0000:af:01.0/txq_14: type node parent iavf_root
>pci/0000:af:01.0/txq_13: type node parent iavf_root
>pci/0000:af:01.0/txq_12: type node parent iavf_root
>pci/0000:af:01.0/txq_11: type node parent iavf_root
>pci/0000:af:01.0/txq_10: type node parent iavf_root
>pci/0000:af:01.0/txq_9: type node parent iavf_root
>pci/0000:af:01.0/txq_8: type node parent iavf_root
>pci/0000:af:01.0/txq_7: type node parent iavf_root
>pci/0000:af:01.0/txq_6: type node parent iavf_root
>pci/0000:af:01.0/txq_5: type node parent iavf_root
>pci/0000:af:01.0/txq_4: type node parent iavf_root
>pci/0000:af:01.0/txq_3: type node parent iavf_root
>pci/0000:af:01.0/txq_2: type node parent iavf_root
>pci/0000:af:01.0/txq_1: type node parent iavf_root
>pci/0000:af:01.0/txq_0: type node parent iavf_root
>pci/0000:af:01.0/iavf_root: type node
>
>
>                      +---------+
>                      |  root   |
>                      +----+----+
>                           |
>        |-----------------|-----------------|
>   +----v----+       +----v----+       +----v----+
>   |  txq_0  |       |  txq_1  |       |  txq_x  |
>   +----+----+       +----+----+       +----+----+
>
>User can configure the tx_max and tx_share of each queue. Once any one of the
>queues are fully configured, VIRTCHNL opcodes of VIRTCHNL_OP_CONFIG_QUEUE_BW
>and VIRTCHNL_OP_CONFIG_QUANTA will be sent to PF to configure queues allocated
>to VF
>
>Example:
>
>1.To Set the queue tx_share:
>devlink port function rate set pci/0000:af:01.0 txq_0 tx_share 100 MBps
>
>2.To Set the queue tx_max:
>devlink port function rate set pci/0000:af:01.0 txq_0 tx_max 200 MBps
>
>3.To Show Current devlink port rate info:
>devlink port function rate show
>[root@localhost ~]# devlink port function rate show
>pci/0000:af:01.0/txq_15: type node parent iavf_root
>pci/0000:af:01.0/txq_14: type node parent iavf_root
>pci/0000:af:01.0/txq_13: type node parent iavf_root
>pci/0000:af:01.0/txq_12: type node parent iavf_root
>pci/0000:af:01.0/txq_11: type node parent iavf_root
>pci/0000:af:01.0/txq_10: type node parent iavf_root
>pci/0000:af:01.0/txq_9: type node parent iavf_root
>pci/0000:af:01.0/txq_8: type node parent iavf_root
>pci/0000:af:01.0/txq_7: type node parent iavf_root
>pci/0000:af:01.0/txq_6: type node parent iavf_root
>pci/0000:af:01.0/txq_5: type node parent iavf_root
>pci/0000:af:01.0/txq_4: type node parent iavf_root
>pci/0000:af:01.0/txq_3: type node parent iavf_root
>pci/0000:af:01.0/txq_2: type node parent iavf_root
>pci/0000:af:01.0/txq_1: type node parent iavf_root
>pci/0000:af:01.0/txq_0: type node tx_share 800Mbit tx_max 1600Mbit parent iavf_root
>pci/0000:af:01.0/iavf_root: type node
>
>
>[1]https://lore.kernel.org/netdev/20221115104825.172668-1-michal.wilczynski@intel.com/
>
>Change log:
>
>v4:
>- Rearrange the ice_vf_qs_bw structure, put the largest number first
>- Minimize the scope of values
>- Remove the unnecessary brackets
>- Remove the unnecessary memory allocation.
>- Added Error Code and moved devlink registration before aq lock initialization
>- Changed devlink registration for error handling in case of allocation failure
>- Used kcalloc for object array memory allocation and initialization
>- Changed functions & comments for readability
>
>v3:
>- Rebase the code
>- Changed rate node max/share set function description
>- Put variable in local scope
>
>v2:
>- Change static array to flex array
>- Use struct_size helper
>- Align all the error code types in the function
>- Move the register field definitions to the right place in the file
>- Fix coding style
>- Adapted to queue bw cfg and qos cap list virtchnl message with flex array fields
>---
>
>Jun Zhang (3):
>  iavf: Add devlink and devlink port support
>  iavf: Add devlink port function rate API support
>  iavf: Add VIRTCHNL Opcodes Support for Queue bw Setting
>
>Wenjun Wu (2):
>  virtchnl: support queue rate limit and quanta size configuration
>  ice: Support VF queue rate limit and quanta size configuration
>
> drivers/net/ethernet/intel/Kconfig            |   1 +
> drivers/net/ethernet/intel/iavf/Makefile      |   2 +-
> drivers/net/ethernet/intel/iavf/iavf.h        |  19 +
> .../net/ethernet/intel/iavf/iavf_devlink.c    | 377 ++++++++++++++++++
> .../net/ethernet/intel/iavf/iavf_devlink.h    |  38 ++
> drivers/net/ethernet/intel/iavf/iavf_main.c   |  64 ++-
> .../net/ethernet/intel/iavf/iavf_virtchnl.c   | 231 ++++++++++-
> drivers/net/ethernet/intel/ice/ice.h          |   2 +
> drivers/net/ethernet/intel/ice/ice_base.c     |   2 +
> drivers/net/ethernet/intel/ice/ice_common.c   |  19 +
> .../net/ethernet/intel/ice/ice_hw_autogen.h   |   8 +
> drivers/net/ethernet/intel/ice/ice_txrx.h     |   2 +
> drivers/net/ethernet/intel/ice/ice_type.h     |   1 +
> drivers/net/ethernet/intel/ice/ice_vf_lib.h   |   9 +
> drivers/net/ethernet/intel/ice/ice_virtchnl.c | 310 ++++++++++++++
> drivers/net/ethernet/intel/ice/ice_virtchnl.h |  11 +
> .../intel/ice/ice_virtchnl_allowlist.c        |   6 +
> include/linux/avf/virtchnl.h                  | 119 ++++++
> 18 files changed, 1218 insertions(+), 3 deletions(-)
> create mode 100644 drivers/net/ethernet/intel/iavf/iavf_devlink.c
> create mode 100644 drivers/net/ethernet/intel/iavf/iavf_devlink.h
>
>--
>2.34.1
>
>
On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
> NACK! Port function is there to configure the VF/SF from the eswitch
> side. Yet you use it for the configureation of the actual VF, which is
> clear misuse. Please don't

Stating where they are supposed to configure the rate would be helpful.
Tue, Aug 22, 2023 at 05:12:55PM CEST, kuba@kernel.org wrote:
>On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>> NACK! Port function is there to configure the VF/SF from the eswitch
>> side. Yet you use it for the configureation of the actual VF, which is
>> clear misuse. Please don't
>
>Stating where they are supposed to configure the rate would be helpful.

TC?
On 8/22/2023 8:34 AM, Jiri Pirko wrote:
> Tue, Aug 22, 2023 at 05:12:55PM CEST, kuba@kernel.org wrote:
>> On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>>> NACK! Port function is there to configure the VF/SF from the eswitch
>>> side. Yet you use it for the configureation of the actual VF, which is
>>> clear misuse. Please don't
>> Stating where they are supposed to configure the rate would be helpful.
> TC?

Our implementation is an extension of commit 42c2eb6b1f43 ("ice: Implement
devlink-rate API").

We are setting the Tx max & share rates of individual queues in a VF using
the devlink rate API.

Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the port
to distinguish it from being an eswitch port.

[resend in plain text only]
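For readers following along: DEVLINK_PORT_FLAVOUR_VIRTUAL is the flavour
devlink uses for a port facing the user on the virtual function itself, as
opposed to its eswitch-side representor. A minimal sketch of how a driver
registers such a port is below; it only illustrates the core devlink calls
and is not the actual code from this series (the function and variable
names are made up).

#include <net/devlink.h>

/* Illustrative only: register a devlink port with the VIRTUAL flavour so
 * userspace sees it as a VF-side port rather than an eswitch port.
 */
static int example_register_virtual_port(struct devlink *dl,
					 struct devlink_port *dl_port)
{
	struct devlink_port_attrs attrs = {};

	attrs.flavour = DEVLINK_PORT_FLAVOUR_VIRTUAL;
	devlink_port_attrs_set(dl_port, &attrs);

	/* a single port per VF in this sketch, so index 0 */
	return devlink_port_register(dl, dl_port, 0);
}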
Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote:
>
>On 8/22/2023 8:34 AM, Jiri Pirko wrote:
>> Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org wrote:
>> > On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>> > > NACK! Port function is there to configure the VF/SF from the eswitch
>> > > side. Yet you use it for the configureation of the actual VF, which is
>> > > clear misuse. Please don't
>> > Stating where they are supposed to configure the rate would be helpful.
>> TC?
>
>Our implementation is an extension of commit 42c2eb6b1f43 ("ice: Implement
>devlink-rate API").
>
>We are setting the Tx max & share rates of individual queues in a VF using
>the devlink rate API.
>
>Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the port
>to distinguish it from being an eswitch port.

I understand, that is a wrong object. So again, you should use
"function" subobject of devlink port to configure "the other side of the
wire", that means the function related to a eswitch port. Here, you are
doing it for the VF directly, which is wrong. If you need some rate
limiting to be configured on an actual VF, use what you use for any
other nic. Offload TC.
On 8/24/2023 12:04 AM, Jiri Pirko wrote:
> Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote:
>> On 8/22/2023 8:34 AM, Jiri Pirko wrote:
>>> Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org wrote:
>>>> On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote:
>>>>> NACK! Port function is there to configure the VF/SF from the eswitch
>>>>> side. Yet you use it for the configureation of the actual VF, which is
>>>>> clear misuse. Please don't
>>>> Stating where they are supposed to configure the rate would be helpful.
>>> TC?
>> Our implementation is an extension of commit 42c2eb6b1f43 ("ice: Implement
>> devlink-rate API").
>>
>> We are setting the Tx max & share rates of individual queues in a VF using
>> the devlink rate API.
>>
>> Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the port
>> to distinguish it from being an eswitch port.
> I understand, that is a wrong object. So again, you should use
> "function" subobject of devlink port to configure "the other side of the
> wire", that means the function related to a eswitch port. Here, you are
> doing it for the VF directly, which is wrong. If you need some rate
> limiting to be configured on an actual VF, use what you use for any
> other nic. Offload TC.

Thanks for the detailed explanation and suggestions. Sorry for the late
reply; it took a bit longer to understand the options.

As sysfs already has a similar per-queue rate configuration with tx_maxrate,
is it a viable option for our use case (i.e. allowing the user to configure
the tx rate for each queue allocated to a VF)?

Please also see if adding tx_minrate to the sysfs tx queue entry is feasible
in the current framework.
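For context on the sysfs interface mentioned here: tx_maxrate is exposed per
queue as /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate and is plumbed to the
driver through ndo_set_tx_maxrate. A simplified sketch of that path follows;
the real code lives in net/core/net-sysfs.c and additionally handles locking
and capability checks, so treat this only as an approximation.

#include <linux/kernel.h>
#include <linux/netdevice.h>

/* Simplified sketch: how a write to the tx_maxrate attribute reaches the
 * driver op.
 */
static ssize_t tx_maxrate_store_sketch(struct netdev_queue *queue,
				       const char *buf, size_t len)
{
	struct net_device *dev = queue->dev;
	unsigned int index = get_netdev_queue_index(queue);
	u32 rate = 0;
	int err;

	err = kstrtou32(buf, 10, &rate);	/* value is in Mbps */
	if (err < 0)
		return err;

	if (!dev->netdev_ops->ndo_set_tx_maxrate)
		return -EOPNOTSUPP;

	err = dev->netdev_ops->ndo_set_tx_maxrate(dev, index, rate);
	return err ? err : len;
}

Userspace would then do, for example,
"echo 100 > /sys/class/net/eth0/queues/tx-0/tx_maxrate" to cap that queue
at 100 Mbps.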
Hi, please allow me to revive this old thread... On Thu, 2023-08-24 at 09:04 +0200, Jiri Pirko wrote: > > Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote: > > > > > > > > On 8/22/2023 8:34 AM, Jiri Pirko wrote: > > > > > > Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org wrote: > > > > > > > > On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote: > > > > > > > > > > NACK! Port function is there to configure the VF/SF from the eswitch > > > > > > > > > > side. Yet you use it for the configureation of the actual VF, which is > > > > > > > > > > clear misuse. Please don't > > > > > > > > Stating where they are supposed to configure the rate would be helpful. > > > > > > TC? > > > > > > > > Our implementation is an extension to this commit 42c2eb6b1f43 ice: Implement > > > > devlink-rate API). > > > > > > > > We are setting the Tx max & share rates of individual queues in a VF using > > > > the devlink rate API. > > > > > > > > Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for the port > > > > to distinguish it from being eswitch. > > > > I understand, that is a wrong object. So again, you should use > > "function" subobject of devlink port to configure "the other side of the > > wire", that means the function related to a eswitch port. Here, you are > > doing it for the VF directly, which is wrong. If you need some rate > > limiting to be configured on an actual VF, use what you use for any > > other nic. Offload TC. I have a doubt WRT the above. Don't we need something more/different here? I mean: a possible intent is limiting the amount of resources (BW in the VF -> esw direction) that the application owing the VF could use. If that is enforced via TC on the VF side (say, a different namespace or VM), the VF user could circumvent such limit - changing the tc configuration - either by mistake or malicious action. Looking at the thing from a different perspective, the TX B/W on the VF side is the RX B/W on the eswitch side, so the same effect could be obtained with a (new/different) API formally touching only eswitch side object. WDYT? Thanks, Paolo
Hello Jiri & Jakub, Thanks for looking into our last patch with devlink API. Really appreciate your candid review. Following your suggestion, we have looked into 3 tc offload options to support queue rate limiting #1 mq + matchall + police #2 mq + tbf #3 htb all 3 tc offload options require some level of tc extensions to support VF tx queue rate limiting (tx_maxrate & tx_minrate) htb offload requires minimal tc changes or no change with similar change done @ driver (we can share patch for review). After discussing with Maxim Mikityanskiy( https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/ ), looks like sysfs interface with tx_minrate extension could be the option we can take. Look forward your opinion & guidance. Thanks for your time! Regards, Jun On 8/28/2023 3:46 PM, Zhang, Xuejun wrote: > > On 8/24/2023 12:04 AM, Jiri Pirko wrote: >> Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote: >>> On 8/22/2023 8:34 AM, Jiri Pirko wrote: >>>> Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org wrote: >>>>> On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote: >>>>>> NACK! Port function is there to configure the VF/SF from the eswitch >>>>>> side. Yet you use it for the configureation of the actual VF, >>>>>> which is >>>>>> clear misuse. Please don't >>>>> Stating where they are supposed to configure the rate would be >>>>> helpful. >>>> TC? >>> Our implementation is an extension to this commit 42c2eb6b1f43 ice: >>> Implement >>> devlink-rate API). >>> >>> We are setting the Tx max & share rates of individual queues in a VF >>> using >>> the devlink rate API. >>> >>> Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute for >>> the port >>> to distinguish it from being eswitch. >> I understand, that is a wrong object. So again, you should use >> "function" subobject of devlink port to configure "the other side of the >> wire", that means the function related to a eswitch port. Here, you are >> doing it for the VF directly, which is wrong. If you need some rate >> limiting to be configured on an actual VF, use what you use for any >> other nic. Offload TC. > Thanks for detailed explanation and suggestions. Sorry for late reply > as it took a bit longer to understand options. > > As sysfs has similar rate configuration on per queue basis with > tx_maxrate, is it a viable option for our use case (i.e allow user to > configure tx rate for each allocated queue in a VF). > > Pls aslo see If adding tx_minrate to sysfs tx queue entry is feasible > on the current framework. > _______________________________________________ > Intel-wired-lan mailing list > Intel-wired-lan@osuosl.org > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
Fri, Nov 17, 2023 at 06:52:49AM CET, xuejun.zhang@intel.com wrote: >Hello Jiri & Jakub, > >Thanks for looking into our last patch with devlink API. Really appreciate >your candid review. > >Following your suggestion, we have looked into 3 tc offload options to >support queue rate limiting > >#1 mq + matchall + police This looks most suitable. Why it would not work? > >#2 mq + tbf > >#3 htb > >all 3 tc offload options require some level of tc extensions to support VF tx >queue rate limiting (tx_maxrate & tx_minrate) > >htb offload requires minimal tc changes or no change with similar change done >@ driver (we can share patch for review). > >After discussing with Maxim Mikityanskiy( https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/ >), looks like sysfs interface with tx_minrate extension could be the option I don't undestand how any sysfs know is related to any of the tree tc solutions above. >we can take. > >Look forward your opinion & guidance. Thanks for your time! > >Regards, > >Jun > >On 8/28/2023 3:46 PM, Zhang, Xuejun wrote: >> >> On 8/24/2023 12:04 AM, Jiri Pirko wrote: >> > Wed, Aug 23, 2023 at 09:13:34PM CEST, xuejun.zhang@intel.com wrote: >> > > On 8/22/2023 8:34 AM, Jiri Pirko wrote: >> > > > Tue, Aug 22, 2023 at 05:12:55PM CEST,kuba@kernel.org wrote: >> > > > > On Tue, 22 Aug 2023 08:12:28 +0200 Jiri Pirko wrote: >> > > > > > NACK! Port function is there to configure the VF/SF from the eswitch >> > > > > > side. Yet you use it for the configureation of the >> > > > > > actual VF, which is >> > > > > > clear misuse. Please don't >> > > > > Stating where they are supposed to configure the rate >> > > > > would be helpful. >> > > > TC? >> > > Our implementation is an extension to this commit 42c2eb6b1f43 >> > > ice: Implement >> > > devlink-rate API). >> > > >> > > We are setting the Tx max & share rates of individual queues in a >> > > VF using >> > > the devlink rate API. >> > > >> > > Here we are using DEVLINK_PORT_FLAVOUR_VIRTUAL as the attribute >> > > for the port >> > > to distinguish it from being eswitch. >> > I understand, that is a wrong object. So again, you should use >> > "function" subobject of devlink port to configure "the other side of the >> > wire", that means the function related to a eswitch port. Here, you are >> > doing it for the VF directly, which is wrong. If you need some rate >> > limiting to be configured on an actual VF, use what you use for any >> > other nic. Offload TC. >> Thanks for detailed explanation and suggestions. Sorry for late reply as >> it took a bit longer to understand options. >> >> As sysfs has similar rate configuration on per queue basis with >> tx_maxrate, is it a viable option for our use case (i.e allow user to >> configure tx rate for each allocated queue in a VF). >> >> Pls aslo see If adding tx_minrate to sysfs tx queue entry is feasible on >> the current framework. >> _______________________________________________ >> Intel-wired-lan mailing list >> Intel-wired-lan@osuosl.org >> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
On Thu, 16 Nov 2023 21:52:49 -0800 Zhang, Xuejun wrote: > Thanks for looking into our last patch with devlink API. Really > appreciate your candid review. > > Following your suggestion, we have looked into 3 tc offload options to > support queue rate limiting > > #1 mq + matchall + police > > #2 mq + tbf You can extend mqprio, too, if you wanted. > #3 htb > > all 3 tc offload options require some level of tc extensions to support > VF tx queue rate limiting (tx_maxrate & tx_minrate) > > htb offload requires minimal tc changes or no change with similar change > done @ driver (we can share patch for review). > > After discussing with Maxim Mikityanskiy( > https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/ > ), looks like sysfs interface with tx_minrate extension could be the > option we can take. > > Look forward your opinion & guidance. Thanks for your time! My least favorite thing to do is to configure the same piece of silicon with 4 different SW interfaces. It's okay if we have 4 different uAPIs (user level APIs) but the driver should not be exposed to all these options. I'm saying 4 but really I can think of 6 ways of setting maxrate :( IMHO we need to be a bit more realistic about the notion of "offloading the SW thing" for qdiscs specifically. Normally we offload SW constructs to have a fallback and have a clear definition of functionality. I bet most data-centers will use BPF+FQ these days, so the "fallback" argument does not apply. And the "clear definition" when it comes to basic rate limiting is.. moot. Besides we already have mqprio, sysfs maxrate, sriov ndo, devlink rate, none of which have SW fallback. So since you asked for my opinion - my opinion is that step 1 is to create a common representation of what we already have and feed it to the drivers via a single interface. I could just be taking sysfs maxrate and feeding it to the driver via the devlink rate interface. If we have the right internals I give 0 cares about what uAPI you pick.
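Purely as a hypothetical illustration of the "common representation" Jakub
describes above, a single internal shaping interface that the existing uAPIs
could be translated into might look roughly like the sketch below. None of
these structures or names exist in the kernel; they are invented here only to
make the idea concrete.

#include <linux/netdevice.h>

/* Hypothetical sketch only: a single in-kernel "shaper" op that sysfs
 * tx_maxrate, mqprio, the sriov ndo and devlink rate could all be mapped
 * onto before reaching the driver.
 */
struct queue_shaper_cfg_example {
	u32 queue_id;		/* queue to shape */
	u64 bw_min_bps;		/* guaranteed rate, 0 = none */
	u64 bw_max_bps;		/* ceiling rate, 0 = unlimited */
	u32 burst_bytes;	/* burst allowance, 0 = driver default */
};

struct queue_shaper_ops_example {
	int (*set)(struct net_device *dev,
		   const struct queue_shaper_cfg_example *cfg);
	int (*get)(struct net_device *dev, u32 queue_id,
		   struct queue_shaper_cfg_example *cfg);
};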
On Fri, 2023-11-17 at 12:21 +0100, Jiri Pirko wrote: > Fri, Nov 17, 2023 at 06:52:49AM CET, xuejun.zhang@intel.com wrote: > > Hello Jiri & Jakub, > > > > Thanks for looking into our last patch with devlink API. Really appreciate > > your candid review. > > > > Following your suggestion, we have looked into 3 tc offload options to > > support queue rate limiting > > > > #1 mq + matchall + police > > This looks most suitable. Why it would not work? AFAICS, it should work, but it does not look the most suitable to me: beyond splitting a "simple" task in separate entities, it poses a constraint on the classification performed on the egress device. Suppose the admin wants to limit the egress bandwidth on the given tx queue _and_ do some application specific packet classification and actions. That would not be possible right? Thanks! Paolo
On 11/18/2023 8:48 AM, Jakub Kicinski wrote:
> On Thu, 16 Nov 2023 21:52:49 -0800 Zhang, Xuejun wrote:
>> Thanks for looking into our last patch with devlink API. Really
>> appreciate your candid review.
>>
>> Following your suggestion, we have looked into 3 tc offload options to
>> support queue rate limiting
>>
>> #1 mq + matchall + police
>>
>> #2 mq + tbf
> You can extend mqprio, too, if you wanted.
>
>> #3 htb
>>
>> all 3 tc offload options require some level of tc extensions to support
>> VF tx queue rate limiting (tx_maxrate & tx_minrate)
>>
>> htb offload requires minimal tc changes or no change with similar change
>> done @ driver (we can share patch for review).
>>
>> After discussing with Maxim Mikityanskiy (
>> https://lore.kernel.org/netdev/54a7dd27-a612-46f1-80dd-b43e28f8e4ce@intel.com/
>> ), looks like sysfs interface with tx_minrate extension could be the
>> option we can take.
>>
>> Look forward your opinion & guidance. Thanks for your time!
> My least favorite thing to do is to configure the same piece of silicon
> with 4 different SW interfaces. It's okay if we have 4 different uAPIs
> (user level APIs) but the driver should not be exposed to all these
> options.
>
> I'm saying 4 but really I can think of 6 ways of setting maxrate :(
>
> IMHO we need to be a bit more realistic about the notion of "offloading
> the SW thing" for qdiscs specifically. Normally we offload SW constructs
> to have a fallback and have a clear definition of functionality.
> I bet most data-centers will use BPF+FQ these days, so the "fallback"
> argument does not apply. And the "clear definition" when it comes to
> basic rate limiting is.. moot.
>
> Besides we already have mqprio, sysfs maxrate, sriov ndo, devlink rate,
> none of which have SW fallback.
>
> So since you asked for my opinion - my opinion is that step 1 is to
> create a common representation of what we already have and feed it
> to the drivers via a single interface. I could just be taking sysfs
> maxrate and feeding it to the driver via the devlink rate interface.
> If we have the right internals I give 0 cares about what uAPI you pick.

There is an existing ndo API for setting the per-queue tx_maxrate:

  int (*ndo_set_tx_maxrate)(struct net_device *dev, int queue_index,
                            u32 maxrate);

We could expand and modify the above API with tx_minrate and burst
parameters as below:

  int (*ndo_set_tx_rate)(struct net_device *dev, int queue_index,
                         int maxrate, int minrate, int burst);

  queue_index: tx queue index of the net device
  maxrate: tx max rate in Mbps
  minrate: tx min rate in Mbps
  burst: data burst in bytes

The proposed API would incur net/core and driver changes as follows:
a) existing drivers with ndo_set_tx_maxrate support upgraded to use the new
   ndo_set_tx_rate
b) net sysfs (replacing ndo_set_tx_maxrate with ndo_set_tx_rate, with
   minrate and burst set to 0; -1 means ignore)
c) keep the existing /sys/class/net/ethx/queues/tx_nn/tx_maxrate as it
   is currently
d) add sysfs entries /sys/class/net/ethx/queues/tx_nn/tx_minrate &
   /sys/class/net/ethx/queues/tx_nn/burst

Let us know your thoughts.
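To make the proposal concrete, here is a minimal sketch of how a driver
could implement the extended op being suggested. ndo_set_tx_rate is the
hypothetical op from the message above, and my_driver_cfg_queue_bw() is a
made-up helper; neither exists in the kernel.

#include <linux/netdevice.h>

/* Hypothetical firmware/hardware programming helper -- stubbed out. */
static int my_driver_cfg_queue_bw(void *priv, int queue, int maxrate,
				  int minrate, int burst)
{
	/* a real driver would send the per-queue limits to HW/FW here */
	return 0;
}

/* Possible driver implementation of the proposed op:
 *   int (*ndo_set_tx_rate)(struct net_device *dev, int queue_index,
 *                          int maxrate, int minrate, int burst);
 * Rates are in Mbps, burst in bytes; 0/-1 mean "ignore" per the proposal.
 */
static int example_set_tx_rate(struct net_device *dev, int queue_index,
			       int maxrate, int minrate, int burst)
{
	if (queue_index < 0 || queue_index >= (int)dev->real_num_tx_queues)
		return -EINVAL;

	return my_driver_cfg_queue_bw(netdev_priv(dev), queue_index,
				      maxrate, minrate, burst);
}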
On Wed, 22 Nov 2023 14:19:14 -0800 Zhang, Xuejun wrote:
> The proposed API would incur net/core and driver changes as follows:
> a) existing drivers with ndo_set_tx_maxrate support upgraded to use the new
>    ndo_set_tx_rate
> b) net sysfs (replacing ndo_set_tx_maxrate with ndo_set_tx_rate, with
>    minrate and burst set to 0; -1 means ignore)
> c) keep the existing /sys/class/net/ethx/queues/tx_nn/tx_maxrate as it
>    is currently
> d) add sysfs entries /sys/class/net/ethx/queues/tx_nn/tx_minrate &
>    /sys/class/net/ethx/queues/tx_nn/burst

You described extending the sysfs API (which the ndo you mention
is for) and nothing about converging the other existing APIs.
On 11/22/2023 7:22 PM, Jakub Kicinski wrote:
> On Wed, 22 Nov 2023 14:19:14 -0800 Zhang, Xuejun wrote:
>> The proposed API would incur net/core and driver changes as follows:
>> a) existing drivers with ndo_set_tx_maxrate support upgraded to use the new
>>    ndo_set_tx_rate
>> b) net sysfs (replacing ndo_set_tx_maxrate with ndo_set_tx_rate, with
>>    minrate and burst set to 0; -1 means ignore)
>> c) keep the existing /sys/class/net/ethx/queues/tx_nn/tx_maxrate as it
>>    is currently
>> d) add sysfs entries /sys/class/net/ethx/queues/tx_nn/tx_minrate &
>>    /sys/class/net/ethx/queues/tx_nn/burst
> You described extending the sysfs API (which the ndo you mention
> is for) and nothing about converging the other existing APIs.

This is an extension of ndo_set_tx_maxrate to include the per-queue
parameters tx_minrate and burst.

The devlink rate API includes tx_maxrate and tx_minrate, but it is intended
for port rate configuration.

With regard to tc mqprio, it is being used to configure queue groups per TC.

The sriov ndo ndo_set_vf_rate has been used for overall VF rate
configuration, not for queue-based rate configuration.

It seems the intents of the aforementioned APIs differ.

Our use case here is to allow the user (i.e. via uAPI) to configure the max
and min tx rates per VF queue. Hence we are inclined toward the
ndo_set_tx_maxrate extension.
On Mon, 27 Nov 2023 16:15:47 -0800 Zhang, Xuejun wrote: > This is extension of ndo_set_tx_maxrate to include per queue parameters > of tx_minrate and burst. > > devlink rate api includes tx_maxrate and tx_minrate, it is intended for > port rate configurations. > > With regarding to tc mqprio, it is being used to configure queue group > per tc. > > For sriov ndo ndo_set_vf_rate, that has been used for overall VF rate > configuration, not for queue based rate configuration. > > It seems there are differences on intent of the aforementioned APIs. > > Our use case here is to allow user (i.e @ uAPI) to configure tx rates of > max rate & min rate per VF queue.Hence we are inclined to > ndo_set_tx_maxrate extension. I said: So since you asked for my opinion - my opinion is that step 1 is to create a common representation of what we already have and feed it to the drivers via a single interface. I could just be taking sysfs maxrate and feeding it to the driver via the devlink rate interface. If we have the right internals I give 0 cares about what uAPI you pick. https://lore.kernel.org/all/20231118084843.70c344d9@kernel.org/ Again, the first step is creating a common kernel <> driver interface which can be used to send to the driver the configuration from the existing 4 interfaces.
On Mon, 2023-11-27 at 17:43 -0800, Jakub Kicinski wrote:
> On Mon, 27 Nov 2023 16:15:47 -0800 Zhang, Xuejun wrote:
> > This is extension of ndo_set_tx_maxrate to include per queue parameters
> > of tx_minrate and burst.
> >
> > devlink rate api includes tx_maxrate and tx_minrate, it is intended for
> > port rate configurations.
> >
> > With regarding to tc mqprio, it is being used to configure queue group
> > per tc.
> >
> > For sriov ndo ndo_set_vf_rate, that has been used for overall VF rate
> > configuration, not for queue based rate configuration.
> >
> > It seems there are differences on intent of the aforementioned APIs.
> >
> > Our use case here is to allow user (i.e @ uAPI) to configure tx rates of
> > max rate & min rate per VF queue.Hence we are inclined to
> > ndo_set_tx_maxrate extension.
>
> I said:
>
> So since you asked for my opinion - my opinion is that step 1 is to
> create a common representation of what we already have and feed it
> to the drivers via a single interface. I could just be taking sysfs
> maxrate and feeding it to the driver via the devlink rate interface.
> If we have the right internals I give 0 cares about what uAPI you pick.
>
> https://lore.kernel.org/all/20231118084843.70c344d9@kernel.org/
>
> Again, the first step is creating a common kernel <> driver interface
> which can be used to send to the driver the configuration from the
> existing 4 interfaces.

Together with Simon, I spent some time on the above. We think the
ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as a common basis for
these offloads, with some small extensions (adding a 'max_rate' param, too).

The idea would be:
- 'fixing' sch_tbf so that the s/w path becomes a no-op when h/w offload
  is enabled
- extend sch_tbf to support max rate
- do the relevant ice implementation
- ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the
  latter interface is a strict super-set of the former)
- ndo_set_vf_rate could also be replaced with the mentioned ndo call
  (with another small extension to the offload data)

I think mqprio deserves its own separate offload interface, as it
covers multiple tasks other than shaping (grouping queues and mapping
priorities to classes).

In the long run we could have a generic implementation of
ndo_setup_tc(TC_SETUP_QDISC_TBF) in terms of devlink rate, adding a
generic way to fetch the devlink_port instance corresponding to the
given netdev and mapping the TBF features to the devlink_rate API.

Not starting this due to what Jiri mentioned [1].

WDYT?

Thanks,

Paolo and Simon

[1] https://lore.kernel.org/netdev/ZORRzEBcUDEjMniz@nanopsycho/
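For reference, TBF offload already exists as a qdisc offload: sch_tbf calls
ndo_setup_tc with TC_SETUP_QDISC_TBF and a struct tc_tbf_qopt_offload
carrying the configured rate and max_size. A rough driver-side sketch is
below; the extra "max rate" field discussed in this sub-thread does not
exist today, and the example_* helpers are made up for illustration.

#include <linux/netdevice.h>
#include <net/pkt_cls.h>

/* Hypothetical per-queue programming helpers; a real driver would push
 * these values to firmware/hardware.
 */
static int example_queue_set_rate(struct net_device *dev, u32 parent,
				  u64 rate_bytes_ps, u32 max_size)
{
	return 0;
}

static int example_queue_clear_rate(struct net_device *dev, u32 parent)
{
	return 0;
}

/* Sketch of a TC_SETUP_QDISC_TBF handler. Today's offload carries a single
 * committed rate (replace_params.rate) plus max_size; the additional
 * max/peak rate discussed above would need a new field in
 * tc_tbf_qopt_offload_replace_params (not shown, hypothetical).
 */
static int example_setup_tc(struct net_device *dev, enum tc_setup_type type,
			    void *type_data)
{
	struct tc_tbf_qopt_offload *qopt = type_data;

	if (type != TC_SETUP_QDISC_TBF)
		return -EOPNOTSUPP;

	switch (qopt->command) {
	case TC_TBF_REPLACE:
		return example_queue_set_rate(dev, qopt->parent,
					      qopt->replace_params.rate.rate_bytes_ps,
					      qopt->replace_params.max_size);
	case TC_TBF_DESTROY:
		return example_queue_clear_rate(dev, qopt->parent);
	default:
		return -EOPNOTSUPP;
	}
}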
On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote: > Together with Simon, I spent some time on the above. We think the > ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for > this offloads, with some small extensions (adding a 'max_rate' param, > too). uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF) to implement common basis? Is it not cleaner to have a separate driver API, with its ops and capabilities? > The idea would be: > - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload > is enabled > - extend sch_btf to support max rate > - do the relevant ice implementation > - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the > latter interface is a strict super-set of former) > - ndo_set_vf_rate could also be replaced with the mentioned ndo call > (with another small extension to the offload data) > > I think mqprio deserves it's own separate offload interface, as it > covers multiple tasks other than shaping (grouping queues and mapping > priority to classes) > > In the long run we could have a generic implementation of the > ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a > generic way to fetch the devlink_port instance corresponding to the > given netdev and mapping the TBF features to the devlink_rate API. > > Not starting this due to what Jiri mentioned [1]. Jiri, AFAIU, is against using devlink rate *uAPI* to configure network rate limiting. That's separate from the internal representation.
On Thu, 2023-12-14 at 17:46 -0800, Jakub Kicinski wrote: > On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote: > > Together with Simon, I spent some time on the above. We think the > > ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for > > this offloads, with some small extensions (adding a 'max_rate' param, > > too). > > uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF) > to implement common basis? > > Is it not cleaner to have a separate driver API, with its ops > and capabilities? We understand one of the end goal is consolidating the existing rate- related in kernel interfaces. Adding a new one does not feel a good starting to reach that goal, see [1] & [2] ;). ndo_setup_tc() feels like the natural choice for H/W offload and TBF is the existing interface IMHO nearest to the requirements here. The devlink rate API could be a possible alternative... > > The idea would be: > > - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload > > is enabled > > - extend sch_btf to support max rate > > - do the relevant ice implementation > > - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the > > latter interface is a strict super-set of former) > > - ndo_set_vf_rate could also be replaced with the mentioned ndo call > > (with another small extension to the offload data) > > > > I think mqprio deserves it's own separate offload interface, as it > > covers multiple tasks other than shaping (grouping queues and mapping > > priority to classes) > > > > In the long run we could have a generic implementation of the > > ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a > > generic way to fetch the devlink_port instance corresponding to the > > given netdev and mapping the TBF features to the devlink_rate API. > > > > Not starting this due to what Jiri mentioned [1]. > > Jiri, AFAIU, is against using devlink rate *uAPI* to configure network > rate limiting. That's separate from the internal representation. ... with a couples of caveats: 1) AFAICS devlink (and/or devlink_port) does not have fine grained, per queue representation and intel want to be able to configure shaping on per queue basis. I think/hope we don't want to bring the discussion to extending the devlink interface with queue support, I fear that will block us for a long time. Perhaps I’m missing or misunderstanding something here. Otherwise in retrospect this looks like a reasonable point to completely avoid devlink here. 2) My understanding of Jiri statement was more restrictive. @Jiri it would great if could share your genuine interpretation: are you ok with using the devlink_port rate API as a basis to replace ndo_set_tx_maxrate() (via dev->devlink_port->devlink->) and possibly ndo_set_vf_rate(). Note the given the previous point, this option would still feel problematic. Cheers, Paolo [1] https://xkcd.com/927/ [2] https://www.youtube.com/watch?v=f8kO_L-pDwo
On Fri, 2023-12-15 at 12:06 +0100, Paolo Abeni wrote: > 1) AFAICS devlink (and/or devlink_port) does not have fine grained, per > queue representation and intel want to be able to configure shaping on > per queue basis. I think/hope we don't want to bring the discussion to > extending the devlink interface with queue support, I fear that will > block us for a long time. Perhaps I’m missing or misunderstanding > something here. Otherwise in retrospect this looks like a reasonable > point to completely avoid devlink here. Note to self: never send a message to the ML before my 3rd morning coffee. This thread started with Intel trying to using devlink rate for their use-case, apparently slamming my doubt above. My understanding is that in the patches the queue devlink <> queue relationship was kept inside the driver and not exposed to the devlink level. If we want to use the devlink rate api to replace e.g. ndo_set_tx_maxrate, we would need a devlink queue(id) or the like, hence this point. Cheer, Paolo
Fri, Dec 15, 2023 at 02:46:04AM CET, kuba@kernel.org wrote: >On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote: >> Together with Simon, I spent some time on the above. We think the >> ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for >> this offloads, with some small extensions (adding a 'max_rate' param, >> too). > >uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF) >to implement common basis? > >Is it not cleaner to have a separate driver API, with its ops >and capabilities? > >> The idea would be: >> - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload >> is enabled >> - extend sch_btf to support max rate >> - do the relevant ice implementation >> - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the >> latter interface is a strict super-set of former) >> - ndo_set_vf_rate could also be replaced with the mentioned ndo call >> (with another small extension to the offload data) >> >> I think mqprio deserves it's own separate offload interface, as it >> covers multiple tasks other than shaping (grouping queues and mapping >> priority to classes) >> >> In the long run we could have a generic implementation of the >> ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a >> generic way to fetch the devlink_port instance corresponding to the >> given netdev and mapping the TBF features to the devlink_rate API. >> >> Not starting this due to what Jiri mentioned [1]. > >Jiri, AFAIU, is against using devlink rate *uAPI* to configure network >rate limiting. That's separate from the internal representation. Devlink rate was introduced for configuring port functions that are connected to eswitch port. I don't see any reason to extend it for configuration of netdev on the host. We have netdev instance and other means to do it.
Fri, Dec 15, 2023 at 12:06:52PM CET, pabeni@redhat.com wrote: >On Thu, 2023-12-14 at 17:46 -0800, Jakub Kicinski wrote: >> On Thu, 14 Dec 2023 21:29:51 +0100 Paolo Abeni wrote: >> > Together with Simon, I spent some time on the above. We think the >> > ndo_setup_tc(TC_SETUP_QDISC_TBF) hook could be used as common basis for >> > this offloads, with some small extensions (adding a 'max_rate' param, >> > too). >> >> uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF) >> to implement common basis? >> >> Is it not cleaner to have a separate driver API, with its ops >> and capabilities? > >We understand one of the end goal is consolidating the existing rate- >related in kernel interfaces. Adding a new one does not feel a good >starting to reach that goal, see [1] & [2] ;). ndo_setup_tc() feels >like the natural choice for H/W offload and TBF is the existing >interface IMHO nearest to the requirements here. > >The devlink rate API could be a possible alternative... Again, devlink rate was introduced for the rate configuration of the entity that is not present (by netdev for example) on a host. If we have netdev, let's use it. > >> > The idea would be: >> > - 'fixing' sch_btf so that the s/w path became a no-op when h/w offload >> > is enabled >> > - extend sch_btf to support max rate >> > - do the relevant ice implementation >> > - ndo_set_tx_maxrate could be replaced with the mentioned ndo call (the >> > latter interface is a strict super-set of former) >> > - ndo_set_vf_rate could also be replaced with the mentioned ndo call >> > (with another small extension to the offload data) >> > >> > I think mqprio deserves it's own separate offload interface, as it >> > covers multiple tasks other than shaping (grouping queues and mapping >> > priority to classes) >> > >> > In the long run we could have a generic implementation of the >> > ndo_setup_tc(TC_SETUP_QDISC_TBF) in term of devlink rate adding a >> > generic way to fetch the devlink_port instance corresponding to the >> > given netdev and mapping the TBF features to the devlink_rate API. >> > >> > Not starting this due to what Jiri mentioned [1]. >> >> Jiri, AFAIU, is against using devlink rate *uAPI* to configure network >> rate limiting. That's separate from the internal representation. > >... with a couples of caveats: > >1) AFAICS devlink (and/or devlink_port) does not have fine grained, per >queue representation and intel want to be able to configure shaping on >per queue basis. I think/hope we don't want to bring the discussion to >extending the devlink interface with queue support, I fear that will >block us for a long time. Perhaps I’m missing or misunderstanding >something here. Otherwise in retrospect this looks like a reasonable >point to completely avoid devlink here. > >2) My understanding of Jiri statement was more restrictive. @Jiri it >would great if could share your genuine interpretation: are you ok with >using the devlink_port rate API as a basis to replace >ndo_set_tx_maxrate() (via dev->devlink_port->devlink->) and possibly Does not make any sense to me. >ndo_set_vf_rate(). Note the given the previous point, this option would ndo_set_vf_rate() (and the rest of ndo_[gs]et_vf_*() ndo) is the legacy way. Devlink rate replaced that when switchdev eswich mode is configured by: $ sudo devlink dev eswitch set pci/0000:08:00.1 mode switchdev In drivers, ndo_set_vf_rate() and devlink rate are implemented in the same way. 
See mlx5 for example: mlx5_esw_qos_set_vport_rate() mlx5_esw_devlink_rate_leaf_tx_share_set() >still feel problematic. > >Cheers, > >Paolo > >[1] https://xkcd.com/927/ >[2] https://www.youtube.com/watch?v=f8kO_L-pDwo >
On Fri, 15 Dec 2023 12:06:52 +0100 Paolo Abeni wrote: > > uAPI aside, why would we use ndo_setup_tc(TC_SETUP_QDISC_TBF) > > to implement common basis? > > > > Is it not cleaner to have a separate driver API, with its ops > > and capabilities? > > We understand one of the end goal is consolidating the existing rate- > related in kernel interfaces. Adding a new one does not feel a good > starting to reach that goal, see [1] & [2] ;) ndo_setup_tc(TC_SETUP_QDISC_TBF) is a new API, too, very much so. These attempts to build on top of existing interfaces with small tweaks is leading us to a fragmented and incompatible driver landscape. I explained before (perhaps on the netdev call) - Qdiscs have two different offload models. "local" and "switchdev", here we want "local" AFAIU and TBF only has "switchdev" offload (take a look at the enqueue method and which drivers support it today). "We'll extend TBF" is very much adding a new API. You'll have to add "local offload" support in TBF and no NIC driver today supports it. I'm not saying TBF is bad, but I disagree that it's any different than a new NDO for all practical purposes. > ndo_setup_tc() feels like the natural choice for H/W offload and TBF > is the existing interface IMHO nearest to the requirements here. I question whether something as basic as scheduling and ACLs should follow the "offload SW constructs" mantra. You are exposed to more diverse users so please don't hesitate to disagree, but AFAICT the transparent offload (user installs SW constructs and if offload is available - offload, otherwise use SW is good enough) has not played out like we have hoped. Let's figure out what is the abstract model of scheduling / shaping within a NIC that we want to target. And then come up with a way of representing it in SW. Not which uAPI we can shoehorn into the use case.
On Fri, 2023-12-15 at 14:41 -0800, Jakub Kicinski wrote: > I explained before (perhaps on the netdev call) - Qdiscs have two > different offload models. "local" and "switchdev", here we want "local" > AFAIU and TBF only has "switchdev" offload (take a look at the enqueue > method and which drivers support it today). I must admit the above is not yet clear to me. I initially thought you meant that "local" offloads properly reconfigure the S/W datapath so that locally generated traffic would go through the expected processing (e.g. shaping) just once, while with "switchdev" offload locally generated traffic will see shaping done both by the S/W and the H/W[1]. Reading the above I now think you mean that local offloads has only effect for locally generated traffic but not on traffic forwarded via eswitch, and vice versa[2]. The drivers I looked at did not show any clue (to me). FTR, I think that [1] is a bug worth fixing and [2] is evil ;) Could you please clarify which is the difference exactly between them? > "We'll extend TBF" is very much adding a new API. You'll have to add > "local offload" support in TBF and no NIC driver today supports it. > I'm not saying TBF is bad, but I disagree that it's any different > than a new NDO for all practical purposes. > > > ndo_setup_tc() feels like the natural choice for H/W offload and TBF > > is the existing interface IMHO nearest to the requirements here. > > I question whether something as basic as scheduling and ACLs should > follow the "offload SW constructs" mantra. You are exposed to more > diverse users so please don't hesitate to disagree, but AFAICT > the transparent offload (user installs SW constructs and if offload > is available - offload, otherwise use SW is good enough) has not > played out like we have hoped. > > Let's figure out what is the abstract model of scheduling / shaping > within a NIC that we want to target. And then come up with a way of > representing it in SW. Not which uAPI we can shoehorn into the use > case. I thought the model was quite well defined since the initial submission from Intel, and is quite simple: expose TX shaping on per tx queue basis, with min rate, max rate (in bps) and burst (in bytes). I think that making it more complex (e.g. with nesting, pkt overhead, etc) we will still not cover every possible use case and will add considerable complexity. > Cheers, Paolo
On Mon, 18 Dec 2023 21:12:35 +0100 Paolo Abeni wrote:
> On Fri, 2023-12-15 at 14:41 -0800, Jakub Kicinski wrote:
> > I explained before (perhaps on the netdev call) - Qdiscs have two
> > different offload models. "local" and "switchdev", here we want "local"
> > AFAIU and TBF only has "switchdev" offload (take a look at the enqueue
> > method and which drivers support it today).
>
> I must admit the above is not yet clear to me.
>
> I initially thought you meant that "local" offloads properly
> reconfigure the S/W datapath so that locally generated traffic would go
> through the expected processing (e.g. shaping) just once, while with
> "switchdev" offload locally generated traffic will see shaping done
> both by the S/W and the H/W[1].
>
> Reading the above I now think you mean that local offloads has only
> effect for locally generated traffic but not on traffic forwarded via
> eswitch, and vice versa[2].
>
> The drivers I looked at did not show any clue (to me).
>
> FTR, I think that [1] is a bug worth fixing and [2] is evil ;)
>
> Could you please clarify which is the difference exactly between them?

The practical difference which you can see in the code is that "locally
offloaded" qdiscs will act like a FIFO in the SW path (at least to some
extent). While "switchdev" offload qdiscs act exactly the same regardless
of the offload. Neither is wrong, they are offloading different things.

Qdisc offload on a representor (switchdev) offloads from the switch
perspective, i.e. "ingress to host". Only fallback goes thru SW path,
and should be negligible.

"Local" offload can be implemented as admission control (and is sometimes
work conserving), it's on the "real" interface, it's egress, and doesn't
take part in forwarding.

> > I question whether something as basic as scheduling and ACLs should
> > follow the "offload SW constructs" mantra. You are exposed to more
> > diverse users so please don't hesitate to disagree, but AFAICT
> > the transparent offload (user installs SW constructs and if offload
> > is available - offload, otherwise use SW is good enough) has not
> > played out like we have hoped.
> >
> > Let's figure out what is the abstract model of scheduling / shaping
> > within a NIC that we want to target. And then come up with a way of
> > representing it in SW. Not which uAPI we can shoehorn into the use
> > case.
>
> I thought the model was quite well defined since the initial submission
> from Intel, and is quite simple: expose TX shaping on per tx queue
> basis, with min rate, max rate (in bps) and burst (in bytes).

For some definition of a model, I guess. Given the confusion about
switchdev vs local (ingress vs egress) - I can't agree that the model
is well defined :(

What I mean is - given a piece of functionality like "Tx queue shaping"
you can come up with a reasonable uAPI that you can hijack and it makes
sense to you. But someone else (switchdev ingress) can choose the same
API to implement a different offload. Not to mention that yet another
person will choose a different API to implement the same things as you :(

Off the top of my head we have at least:
- Tx DMA admission control / scheduling (which Tx DMA queue will the NIC
  pull from)
- Rx DMA scheduling (which Rx queue will the NIC push to)
- buffer/queue configuration (how to deal with buildup of packets in NIC
  SRAM, usually mostly for ingress)
- NIC buffer configuration (how the SRAM is allocated to queues)
- policers in the NIC forwarding logic

Let's extend this list so that it covers all reasonable NIC designs, and
then work on mapping how each of them is configured?

> I think that making it more complex (e.g. with nesting, pkt overhead,
> etc) we will still not cover every possible use case and will add
> considerable complexity.